[GH-ISSUE #874] Bug: URL getting truncated at underscore or parens #3560

Closed
opened 2026-03-14 23:28:48 +03:00 by kerem · 1 comment
Owner

Originally created by @zblesk on GitHub (Oct 5, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/874

Describe the bug

After pasting a URL, it gets truncated at _ or ( and the wrong page is downloaded.

Steps to reproduce

I enter a URL, like this one: https://vocaloidlyrics.fandom.com/wiki/%E4%BA%BA%E6%9F%B1%E3%82%A2%E3%83%AA%E3%82%B9_(Hitobashira_Alice)
by going to the Web UI, clicking ADD, pasting this URL on a single line and confirming.

The archival process runs; however, it seems some of the archival methods truncate the URL to https://vocaloidlyrics.fandom.com/wiki/%E4%BA%BA%E6%9F%B1%E3%82%A2%E3%83%AA%E3%82%B9_
and others to https://vocaloidlyrics.fandom.com/wiki/%E4%BA%BA%E6%9F%B1%E3%82%A2%E3%83%AA%E3%82%B9

The "original URL" in the UI ends up being https://vocaloidlyrics.fandom.com/wiki/%E4%BA%BA%E6%9F%B1%E3%82%A2%E3%83%AA%E3%82%B9_
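One way a truncation like the one in the "original URL" can arise is a URL-extraction regex whose allowed character set leaves out parentheses. The sketch below is only an illustration under that assumption: `NAIVE_URL_RE` and `PERMISSIVE_URL_RE` are hypothetical patterns written for this example, not taken from ArchiveBox's source, and the sketch does not explain the second truncation (without the trailing underscore) seen in some extractors.

```python
import re

# Hypothetical pattern (an assumption, NOT ArchiveBox's actual regex):
# it accepts a fixed allow-list of characters, and "(" / ")" are not on it.
NAIVE_URL_RE = re.compile(r"https?://[A-Za-z0-9\-._~%/:?#@!$&'*+,;=]+")

# Hypothetical alternative for comparison: accept everything up to whitespace.
PERMISSIVE_URL_RE = re.compile(r"https?://\S+")

URL = (
    "https://vocaloidlyrics.fandom.com/wiki/"
    "%E4%BA%BA%E6%9F%B1%E3%82%A2%E3%83%AA%E3%82%B9_(Hitobashira_Alice)"
)


def extract_urls(text, pattern):
    """Return every substring of `text` that matches `pattern`."""
    return pattern.findall(text)


naive = extract_urls(URL, NAIVE_URL_RE)
permissive = extract_urls(URL, PERMISSIVE_URL_RE)

print(naive[0])       # stops right before "(", so it ends with "_"
print(permissive[0])  # the full URL, parentheses included
```

A whitespace-delimited pattern has its own trade-offs (for example, it would also swallow trailing punctuation in running text), so this is a comparison of behaviours rather than a proposed fix.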

Screenshots or log output

[+] [2021-10-05 20:48:46] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1633466926-import.txt
    > Parsed 1 URLs from input (Generic TXT)
    > Found 1 new URLs not already in index
[*] [2021-10-05 20:48:46] Writing 1 links to main index...
    √ ./index.sqlite3
[▶] [2021-10-05 20:48:46] Starting archiving of 1 snapshots in index...
[+] [2021-10-05 20:48:47] "vocaloidlyrics.fandom.com/wiki/%E4%BA%BA%E6%9F%B1%E3%82%A2%E3%83%AA%E3%82%B9_"
    https://vocaloidlyrics.fandom.com/wiki/%E4%BA%BA%E6%9F%B1%E3%82%A2%E3%83%AA%E3%82%B9_
    > ./archive/1633466926.649866
      > title
      > favicon
      > headers
      > screenshot
      > dom
      > wget
        Extractor failed:
            404 Not Found
            Got wget response code: 8.
            https://vocaloidlyrics.fandom.com/wiki/%E4%BA%BA%E6%9F%B1%E3%82%A2%E3%83%AA%E3%82%B9:
            2021-10-05 20:48:53 ERROR 404: Not Found.
        Run to see full output:
            cd /data/archive/1633466926.649866;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --timeout=60 --restrict-file-names=windows --warc-file=/data/archive/1633466926.649866/warc/1633466933 --page-requisites --user-agent=Mozilla/5.0 --compression=auto https://vocaloidlyrics.fandom.com/wiki/%E4%BA%BA%E6%9F%B1%E3%82%A2%E3%83%AA%E3%82%B9_
      > readability
      > mercury
        Extractor failed:
            Mercury was not able to get article text from the URL
        Run to see full output:
            cd /data/archive/1633466926.649866;
            /node/node_modules/@postlight/mercury-parser/cli.js https://vocaloidlyrics.fandom.com/wiki/%E4%BA%BA%E6%9F%B1%E3%82%A2%E3%83%AA%E3%82%B9_ --format=text
      > archive_org
        Extractor failed:
            Failed to find "content-location" URL header in Archive.org response.
        Run to see full output:
            cd /data/archive/1633466926.649866;
            curl --silent --location --compressed --head --max-time 60 --user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/0.6.2 (+https://github.com/ArchiveBox/ArchiveBox/) curl/curl 7.64.0 (x86_64-pc-linux-gnu)" https://web.archive.org/save/https://vocaloidlyrics.fandom.com/wiki/%E4%BA%BA%E6%9F%B1%E3%82%A2%E3%83%AA%E3%82%B9_
        10 files (993.5 KB) in 0:00:21s
[√] [2021-10-05 20:49:08] Update of 1 pages complete (21.64 sec)
    - 0 links skipped
    - 118 links updated
    - 32 links had errors

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.4.0-1058-azure-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                       
 √  PYTHON_BINARY         v3.9.4          valid     /usr/local/bin/python3.9                        
 √  DJANGO_BINARY         v3.1.8          valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                   
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                   
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                   
 -  SINGLEFILE_BINARY     -               disabled  /node/node_modules/single-file/cli/single-file  
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 -  GIT_BINARY            -               disabled  /usr/bin/git                                    
 -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/youtube-dl                       
 √  CHROME_BINARY         v89.0.4389.114  valid     /usr/bin/chromium                               
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                     

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox                                 
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                       
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                  

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                  
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            16 files        valid     /data                                           
 √  SOURCES_DIR           63 files        valid     ./sources                                       
 √  LOGS_DIR              1 files         valid     ./logs                                          
 √  ARCHIVE_DIR           21825 files     valid     ./archive                                       
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                               
 √  SQL_INDEX             251.9 MB        valid     ./index.sqlite3                                  
kerem closed this issue 2026-03-14 23:28:53 +03:00
Author
Owner

@pirate commented on GitHub (Oct 6, 2021):

Duplicate of https://github.com/ArchiveBox/ArchiveBox/issues/864 / #235 / #287
