[GH-ISSUE #774] Question: FileNotFoundError - 'single-file' and others #492

Closed
opened 2026-03-01 14:44:05 +03:00 by kerem · 2 comments

Originally created by @francwalter on GitHub (Jun 26, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/774

Hello
I just installed ArchiveBox on a virtual Ubuntu 20.04 (with GUI) machine (VMware Workstation 15), with VMware running on a Windows 10 (64-bit) host.
I want to scrape a whole website with all internal links as a local backup. I don't need versions, just the current one, but with all sub-pages (of the same domain).

I ran
archivebox add 'https://efk-adoptionen.de'
and I got errors:

...
No such file or directory: 'single-file'
...
No such file or directory: 'readability-extractor'
...
No such file or directory: 'mercury-parser'
...

more Output here (collapsed):

archivebox output

franc@ubuntu20:~/data$ archivebox add 'https://efk-adoptionen.de'
[i] [2021-06-26 09:08:59] ArchiveBox v0.6.2: archivebox add https://efk-adoptionen.de
    > /home/franc/data

[!] Warning: Missing 5 recommended dependencies
    ! NODE_BINARY: node (unable to detect version)
    ! SINGLEFILE_BINARY: single-file (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_SINGLEFILE=False
            
    ! READABILITY_BINARY: readability-extractor (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_READABILITY=False
            
    ! MERCURY_BINARY: mercury-parser (unable to detect version)
      Hint: To install all packages automatically run: archivebox setup
            or to disable it and silence this warning: archivebox config --set SAVE_MERCURY=False
            
    ! RIPGREP_BINARY: rg (unable to detect version)

[+] [2021-06-26 09:09:00] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1624698540-import.txt
    > Parsed 1 URLs from input (Generic TXT)                                                           
    > Found 1 new URLs not already in index

[*] [2021-06-26 09:09:00] Writing 1 links to main index...
    √ ./index.sqlite3                                                                                  

[▶️] [2021-06-26 09:09:00] Starting archiving of 1 snapshots in index...

[+] [2021-06-26 09:09:00] "efk-adoptionen.de"
    https://efk-adoptionen.de
    > ./archive/1624698540.780421
      > title
      > favicon                                                                                        
      > headers                                                                                        
      > singlefile                                                                                     
        Extractor failed:                                                                              
            FileNotFoundError [Errno 2] No such file or directory: 'single-file'
        Run to see full output:
            cd /home/franc/data/archive/1624698540.780421;
            single-file --browser-executable-path=chromium-browser "--browser-args=[\"--headless\", \"--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)\", \"--window-size=1440,2000\"]" https://efk-adoptionen.de singlefile.html

      > pdf
      > screenshot                                                                                     
      > dom                                                                                            
      > wget                                                                                           
      > readability                                                                                    
        Extractor failed:                                                                              
            FileNotFoundError [Errno 2] No such file or directory: 'readability-extractor'
        Run to see full output:
            cd /home/franc/data/archive/1624698540.780421;
            readability-extractor ./{singlefile,dom}.html

      > mercury
        Extractor failed:                                                                              
            FileNotFoundError [Errno 2] No such file or directory: 'mercury-parser'
        Run to see full output:
            cd /home/franc/data/archive/1624698540.780421;
            mercury-parser https://efk-adoptionen.de --format=text

      > media
      > archive_org                                                                                    
        Extractor timed out after 60s.
        Run to see full output:
            cd /home/franc/data/archive/1624698540.780421;
            curl --silent --location --compressed --head --max-time 60 --user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/0.6.2 (+https://github.com/ArchiveBox/ArchiveBox/) curl/curl 7.68.0 (x86_64-pc-linux-gnu)" https://web.archive.org/save/https://efk-adoptionen.de

        79 files (7.0 MB) in 0:01:23s 

[√] [2021-06-26 09:10:24] Update of 1 pages complete (1.39 min)
    - 0 links skipped
    - 1 links updated
    - 1 links had errors

    Hint: To manage your archive in a Web UI, run:
        archivebox server 0.0.0.0:8000
franc@ubuntu20:~/data$

I didn't find these errors documented here; I installed as described in the wiki, so I don't understand them.

Later I checked the contents in the browser at localhost:8000, but I only find the home page of the added address; all menu items and links are unsaved. I don't know if that is related to the errors above.

Maybe I totally misunderstood this project?
Isn't it meant to scrape a website into a local archive?
I didn't find any option to set which link depth to scrape, or whether it is allowed to leave the domain, etc.

I was using HTTrack previously when I wanted to download a whole website locally, but that project is abandoned and doesn't fit current needs anymore (e.g. addresses with umlauts).

How can I do this with ArchiveBox? Is it possible?

Thanks for any hints, frank

kerem closed this issue 2026-03-01 14:44:05 +03:00

@pirate commented on GitHub (Jun 27, 2021):

ArchiveBox is not primarily designed for recursive archiving of a single site; you should use a different tool for that, see https://github.com/ArchiveBox/ArchiveBox/issues/191 and https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#other-archivebox-alternatives. ArchiveBox is meant to archive many different URLs, but only the URLs you give it (and up to one hop away with --depth=1). If you want to do n-depth recursive scraping, you should use a dedicated scraper and pipe the URLs into archivebox for saving. ArchiveBox is not a recursive scraper itself; that's a different space of problems and is not our main objective to solve.

The errors in your output mean you have not installed all the dependencies yet (you're missing 5 of them). I recommend using ArchiveBox with Docker to avoid having to install all the dependencies manually, or installing it with apt or with our bash helper script to finish the partial install you've started.
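To illustrate the "dedicated scraper piping into archivebox" workflow mentioned above, here is a rough sketch using wget in spider mode as the crawler. The wget flags, the log-parsing grep, and the example URL are assumptions for illustration, not an official ArchiveBox recipe; wget's log format varies between versions, so the extraction pattern may need adjusting.

```shell
# Sketch: crawl a site with wget in spider mode (nothing is saved), logging
# every URL it visits, then feed the de-duplicated list to ArchiveBox.
# Flags and the log pattern are assumptions; adjust for your wget version.
wget --spider --recursive --level=2 --no-verbose \
     --output-file=crawl.log 'https://example.com'

# Pull the visited URLs out of wget's log and de-duplicate them.
# (Requires GNU grep for -P; the 'URL:' prefix is wget's -nv log format.)
grep -oP 'URL:\s*\K\S+' crawl.log | sort -u > urls.txt

# archivebox add accepts a list of URLs on stdin, so the file pipes straight in.
archivebox add < urls.txt
```

This keeps the crawling and the archiving as separate concerns: the crawler decides depth and domain scope, and ArchiveBox only archives the exact URLs it is handed.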


@francwalter commented on GitHub (Jul 2, 2021):

Thanks a lot, Pirate!

First I checked httrack again, which seems abandoned since 2017 (and no fork did real work on it, as far as I found; see Find the newest fork of an old repo, Find the most popular fork on GitHub, or Active GitHub Forks).

Then I checked wget which can do more than I knew :)
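For reference, the kind of whole-site mirror wget can do looks roughly like this. These are standard GNU wget options, but the combination below is a sketch, not a command taken from this thread; depth and rate limits should be tuned for the target site.

```shell
# Recursive same-domain mirror with GNU wget (illustrative, not from this thread):
#   --mirror            recursion with timestamping, suitable for site copies
#   --convert-links     rewrite links so the local copy browses offline
#   --adjust-extension  append .html to pages served without an extension
#   --page-requisites   also fetch the CSS, images, and scripts each page needs
#   --no-parent         never ascend above the starting directory
wget --mirror --convert-links --adjust-extension \
     --page-requisites --no-parent 'https://efk-adoptionen.de'
```

By default wget stays on the starting host during recursion, which covers the "don't leave the domain" requirement from the original question.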

Lastly I repaired my ArchiveBox installation: I had to install nodejs and ripgrep and run archivebox init --setup again until there were no more errors :)

I asked the question about depth=1 in the other issue.

Thanks.
Frank
