[GH-ISSUE #968] Bug: Fails to parse list of URLs txt file #603

Closed
opened 2026-03-01 14:44:55 +03:00 by kerem · 6 comments

Originally created by @rossvor on GitHub (Apr 20, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/968

Describe the bug

I can't get archivebox to add any URLs from a simple txt file containing a newline-separated list of URLs.
Based on the error message, it fails to parse the file. I may be doing something wrong.

Steps to reproduce

  1. Create txt file with some URLs. Eg.
https://www.example.com/
https://example.com/
  2. Run archivebox add /tmp/urls.txt

Screenshots or log output

Here's the output I get:

ross@xx> archivebox add /tmp/urls.txt
[i] [2022-04-20 16:05:12] ArchiveBox v0.6.2: archivebox add /tmp/urls.txt
    > /tmp/archivebox

[!] Warning: Missing 3 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
            
    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
            
    ! MERCURY_BINARY: mercury-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
            

[+] [2022-04-20 16:05:13] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1650470713-import.txt
    0.0% (0/240sec)[X] Error while loading link! [1650470713.151664] /tmp/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index

[*] [2022-04-20 16:05:13] Writing 0 links to main index...
    √ ./index.sqlite3

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.17.1-arch1-1-x86_64-with-glibc2.35 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /home/ross/.local/bin/archivebox                                            
 √  PYTHON_BINARY         v3.10.4         valid     /usr/bin/python3.10                                                         
 √  DJANGO_BINARY         v3.1.14         valid     /home/ross/.local/lib/python3.10/site-packages/django/bin/django-admin.py   
 √  CURL_BINARY           v7.82.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.3         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node                                                               
 X  SINGLEFILE_BINARY     ?               invalid   single-file                                                                 
 X  READABILITY_BINARY    ?               invalid   readability-extractor                                                       
 X  MERCURY_BINARY        ?               invalid   mercury-parser                                                              
 √  GIT_BINARY            v2.35.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /home/ross/.local/bin/youtube-dl                                            
 √  CHROME_BINARY         v100.0.4896.88  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/ross/.local/lib/python3.10/site-packages/archivebox                   
 √  TEMPLATES_DIR         3 files         valid     /home/ross/.local/lib/python3.10/site-packages/archivebox/templates         
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /tmp/archivebox                                                             
 √  SOURCES_DIR           3 files         valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           0 files         valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3                                                             

[!] Warning: Missing 3 recommended dependencies
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
            
    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
            
    ! MERCURY_BINARY: mercury-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
  
kerem closed this issue 2026-03-01 14:44:55 +03:00

@pirate commented on GitHub (Apr 20, 2022):

Are you sure your URLs have schemes at the front? They have to be fully qualified URLs (e.g. google.com/example doesn't work but https://google.com/example does).

Can you post a redacted snippet of your actual urls.txt file?

You can also try and force a specific parser with archivebox add --parse=generic_txt /tmp/urls.txt or archivebox add --parse=url_list /tmp/urls.txt. Also check the contents of sources/1650470713-import.txt to see how ArchiveBox is interpreting the file.
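
The scheme requirement above can be checked before importing. The following is a minimal sketch, not ArchiveBox's own validation: the helper name `lines_missing_scheme` is hypothetical, and limiting the accepted schemes to `http` and `https` is an assumption made for illustration.

```python
from urllib.parse import urlsplit


def lines_missing_scheme(text: str) -> list[str]:
    """Return the non-empty lines that are not fully qualified http(s) URLs."""
    missing = []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # ignore blank lines between URLs
        parts = urlsplit(candidate)
        # Assumption: only http/https count as "fully qualified" here.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            missing.append(candidate)
    return missing


sample = "https://google.com/example\ngoogle.com/example\n"
print(lines_missing_scheme(sample))  # ['google.com/example']
```

An empty result means every line carries a scheme and a host, so a parse failure is probably caused by something else.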


@rossvor commented on GitHub (Apr 20, 2022):

> Are you sure your URLs have schemes at the front? They have to be fully qualified URLs (e.g. google.com/example doesn't work but https://google.com/example does).

Yep, can confirm that file has fully qualified URLs.

> Can you post a redacted snippet of your actual urls.txt file?

Sure.

https://www.example.com/
https://example.com/
https://github.com/ArchiveBox/ArchiveBox/
https://news.ycombinator.com/item?id=31083515
https://www.imdb.com/list/ls020840037/

I've tried setting parser explicitly as you suggested, none of them picked up the URLs, with slightly varying errors.

archivebox add --parse=generic_txt /tmp/urls.txt
Result:
archivebox add: error: argument --parser: invalid choice: 'generic_txt'

archivebox add --parse=url_list /tmp/urls.txt
Result:
[X] No links found using URL List parser
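
A side note on the first error: the command used `--parse`, yet the message names `--parser`. That is consistent with Python's `argparse`, which by default accepts an unambiguous prefix of a long option. A minimal sketch, assuming the CLI is built on `argparse` (the message format suggests so); the program name `demo-add` and the choices list are made up for illustration and are not taken from ArchiveBox:

```python
import argparse

# Hypothetical stand-in for the real CLI; the choices are illustrative only.
cli = argparse.ArgumentParser(prog="demo-add")
cli.add_argument("--parser", choices=["auto", "url_list"], default="auto")

# "--parse" is an unambiguous prefix of "--parser", so argparse accepts it
# and stores the value under the full option name.
args = cli.parse_args(["--parse=url_list"])
print(args.parser)  # url_list
```

So the flag spelling is probably not the problem here; the value `generic_txt` was rejected because it is not among the accepted parser names.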

> Also check the contents of sources/1650470713-import.txt to see how ArchiveBox is interpreting the file.

With every parser variation above, sources/1650479354-import.txt contains only the file path itself, so I guess ArchiveBox treats the argument as a URL to import rather than as a path to a file to read.
Contents of sources/1650479354-import.txt:
/tmp/urls.txt

I can confirm that using input redirection does work fine, so this works:
archivebox add < /tmp/urls.txt
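
The same stdin workaround can be driven from a script. This is a minimal sketch, assuming `archivebox` is on `PATH` and is run from inside the data directory; the helper name `add_urls_via_stdin` and its `command` parameter are hypothetical and exist only so the function can be tried with another program.

```python
import subprocess
from pathlib import Path


def add_urls_via_stdin(url_file, command=("archivebox", "add")):
    """Run `command` with the contents of url_file connected to its stdin.

    This mirrors the shell form `archivebox add < url_file`.
    """
    with Path(url_file).open("rb") as handle:
        # check=False: the caller inspects returncode and output itself.
        return subprocess.run(
            list(command), stdin=handle, capture_output=True, check=False
        )


# Example call (not executed here):
# result = add_urls_via_stdin("/tmp/urls.txt")
# print(result.returncode, result.stdout.decode())
```

Passing the open file as `stdin` avoids reading the whole list into memory and leaves the parsing entirely to ArchiveBox.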


@pirate commented on GitHub (Apr 20, 2022):

Try with --depth=1 and passing the file path as the first argument.


@rossvor commented on GitHub (Apr 20, 2022):

Doesn't seem to change the error

ross@xx> archivebox add --depth=1 /tmp/urls.txt
[i] [2022-04-20 19:00:23] ArchiveBox v0.6.2: archivebox add --depth=1 /tmp/urls.txt
    > /media/shared-ext/archivebox

[+] [2022-04-20 19:00:24] Adding 1 links to index (crawl depth=1)...
    > Saved verbatim input to sources/1650481224-import.txt
    0.0% (0/240sec)[X] Error while loading link! [1650481224.660288] /tmp/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index

[*] [2022-04-20 19:00:24] Writing 0 links to main index...
    √ ./index.sqlite3   

@rossvor commented on GitHub (Apr 20, 2022):

I've also tried this on a fresh Docker image-based installation, and it fails similarly:

sudo docker run -v $PWD:/data -v /tmp/ff:/ff -it archivebox/archivebox add /ff/urls.txt
[i] [2022-04-20 21:32:03] ArchiveBox v0.6.2: archivebox add /ff/urls.txt
    > /data

[+] [2022-04-20 21:32:03] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1650490323-import.txt
 0.0% (0/240sec)[X] Error while loading link! [1650490324.056402] /ff/urls.txt "None"
    > Parsed 0 URLs from input (Failed to parse)                                                                               
    > Found 0 new URLs not already in index

[*] [2022-04-20 21:32:04] Writing 0 links to main index...
    √ ./index.sqlite3      

/tmp/ff/urls.txt being the same simple file:

https://www.example.com/
https://example.com/
https://github.com/ArchiveBox/ArchiveBox/
https://news.ycombinator.com/item?id=31083515
https://www.imdb.com/list/ls020840037/

@pirate commented on GitHub (Apr 21, 2022):

Ah sorry I forgot I removed loading directly from a file path in a previous version because it conflicted with the new --depth=1 implementation!

I'll reopen and merge your original PR https://github.com/ArchiveBox/ArchiveBox/pull/967. For future reference stdin redirection is indeed necessary, or passing --depth=1 /path/to/file.txt also works.
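
To illustrate the ambiguity described above (a bare argument may be either a URL to archive or a local file listing URLs), here is a minimal sketch of one way to tell them apart. It is not ArchiveBox's implementation: the function name `classify_input`, the scheme list, and the rule order are all assumptions chosen for illustration.

```python
import os
from urllib.parse import urlsplit


def classify_input(arg: str) -> str:
    """Return 'url', 'file', or 'unknown' for one command-line argument."""
    # Assumption: anything with one of these schemes is meant as a URL.
    if urlsplit(arg).scheme in ("http", "https", "ftp"):
        return "url"
    # Otherwise, an existing regular file is treated as a list of URLs.
    if os.path.isfile(arg):
        return "file"
    return "unknown"


print(classify_input("https://example.com/"))  # url
```

Under these rules `/tmp/urls.txt` would be read as a file whenever it exists, instead of being saved verbatim as the input.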
