mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-26 01:26:00 +03:00
[GH-ISSUE #774] Question: FileNotFoundError - 'single-file' and others #3510
Originally created by @francwalter on GitHub (Jun 26, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/774
Hello,
I just installed ArchiveBox on a virtual Ubuntu 20.04 machine (with GUI, VMware Workstation 15), with VMware running on a Windows 10 (64-bit) host.
I want to scrape a whole website, with all internal links, as a local backup. I don't need versions, just the current one, but with all sub-pages (of the same domain).
I ran
archivebox add 'https://efk-adoptionen.de' and got errors:
(full archivebox output collapsed in the original issue)
I didn't find these errors documented anywhere; I installed as described in the wiki, so I don't understand them.
Later I checked the contents in the browser at localhost:8000, but I found only the home page of the added address; all menu items and links are unsaved. I don't know if that is related to the errors above.
Maybe I totally misunderstood this project?
Isn't it for scraping a website to a local archive?
I didn't find any option to set which link depth to scrape, or whether leaving the domain is allowed, etc.
I previously used HTTrack when I wanted to download a whole website locally, but that project is abandoned and doesn't fit current needs anymore (e.g. addresses with umlauts).
How can I do this with ArchiveBox? Is it possible?
Thanks for hints, frank
@pirate commented on GitHub (Jun 27, 2021):
ArchiveBox is not primarily designed for recursive archiving of a single site; you should use a different tool for that, see https://github.com/ArchiveBox/ArchiveBox/issues/191 and https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community#other-archivebox-alternatives. ArchiveBox is meant to archive many different URLs, but only the URLs you give it (and up to one hop away with --depth=1). If you want n-depth recursive scraping, you should use a dedicated scraper and pipe the URLs into archivebox for saving. ArchiveBox is not a recursive scraper itself; that's a different space of problems and not our main objective to solve.

The errors in your output mean you have not installed all the dependencies yet (you're missing 5 of them). I recommend using ArchiveBox with Docker to avoid having to install all the dependencies manually, or install it with
apt or using our bash helper script to finish the partial install you've started.

@francwalter commented on GitHub (Jul 2, 2021):
Thanks a lot, Pirate!
First I checked httrack again, which seems abandoned since 2017 (and I found no fork that did real work on it; see "Find the newest fork of an old repo", "Find the most popular fork on GitHub", or "Active GitHub Forks").
Then I checked wget, which can do more than I knew :)
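The crawl-then-pipe workflow @pirate suggests (discover URLs with a dedicated tool, then feed them to archivebox) could be sketched like this with wget; the domain, depth, and log-parsing regex below are placeholder assumptions, not commands from the thread:

```shell
# Sketch: crawl a site with wget in spider mode (follows links but downloads
# nothing), collect the same-domain URLs it discovers, and hand the
# de-duplicated list to ArchiveBox for archiving.
wget --spider --recursive --level=3 --no-verbose \
     --output-file=crawl.log 'https://example.com/'

# Keep only URLs on the target domain, one per line, de-duplicated.
grep -oE 'https?://example\.com[^[:space:]]*' crawl.log | sort -u > urls.txt

# ArchiveBox accepts URLs on stdin, one per line.
archivebox add < urls.txt
```

Here --level bounds the recursion depth and spider mode avoids saving a second copy of the site; ArchiveBox then does the actual per-URL archiving.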
Lastly I repaired my ArchiveBox installation: I had to install nodejs and ripgrep and run archivebox init --setup again until there were no more errors :)
I asked the question about --depth=1 in the other issue.
Thanks.
Frank