mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #444] Feature Request: Ignore Errors Mode #296
Originally created by @jaw-sh on GitHub (Aug 15, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/444
I am importing just under 1 million links supplied by forum users over 7 years. Not all links work and I need the system to skip over links it cannot import.
What is the problem that your feature request solves?

I have a list of about 250,000 links that match a %archive.%/%format pattern. Not all links may be valid because users supply them. When I tried to import my links, it quickly bailed, citing an "Invalid IPv6 URL" error.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

archivebox add < /tmp/links.txt --ignore-errors

What hacks or alternative solutions have you tried to solve the problem?
I can't imagine any way of accomplishing what I need without putting each URL in a separate file and writing a bash script to run archivebox add on each one.
How badly do you want this new feature?
I don't see any IPv6 links on my list, by the way. I can send that over as well. It looks like it may be a broken IPv6 URL on the page itself.
@pirate commented on GitHub (Aug 17, 2020):
250,000 links! That's incredible. I think ArchiveBox will immediately crash if you ever try to pipe it that many URLs. Right now it tries to hold the whole index in memory on every run, which will either fill your RAM and make you start paging, or just outright crash. The only way you could get away with this is if you split it up into smaller indexes of 50k links maximum each. The good news is that v0.5 should be better for performance and large-index handling.
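The 50k-per-run split suggested above can be sketched roughly like this. This is a hypothetical helper, not part of ArchiveBox; the batch size and example URLs are illustrative.

```python
# Hypothetical sketch of the chunked-import workaround described above:
# break a huge URL list into batches of at most 50,000, so each
# `archivebox add` run only has to index one batch's worth of links.
from itertools import islice

def chunked(iterable, size=50_000):
    """Yield successive lists of at most `size` items each."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

# e.g. 120,000 URLs split into three batches of 50k, 50k, and 20k,
# each of which could be written to its own file and piped to
# `archivebox add` separately.
urls = [f'https://example.com/{i}' for i in range(120_000)]
print([len(batch) for batch in chunked(urls)])
# → [50000, 50000, 20000]
```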
As for ignoring the errors, I think we can add something like this to the URL parser before this step, to filter bad URLs:
Patch: 225b63
https://github.com/pirate/ArchiveBox/wiki/Roadmap#v05-remove-live-updated-json--html-index-in-favor-of-archivebox-export
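A minimal sketch of the kind of pre-filtering described above, assuming plain urllib parsing; the function name is illustrative and is not ArchiveBox's actual parser API.

```python
from urllib.parse import urlsplit

def filter_valid_urls(urls):
    """Keep only URLs that urllib can parse as http(s); skip the rest."""
    valid = []
    for url in urls:
        try:
            parts = urlsplit(url)  # raises ValueError on e.g. 'Invalid IPv6 URL'
        except ValueError:
            continue               # skip unparseable URLs instead of aborting the import
        if parts.scheme in ('http', 'https') and parts.netloc:
            valid.append(url)
    return valid

print(filter_valid_urls([
    'https://example.com/page',
    'http://[::1',        # unbalanced bracket -> ValueError: Invalid IPv6 URL
    'not a url',          # no scheme/netloc -> filtered out
]))
# → ['https://example.com/page']
```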
@pirate commented on GitHub (Aug 18, 2020):
The urlparse bug should be fixed in the latest version. Comment back here if you're still having any issues and I'll reopen the ticket.

As for large-archive support, you'll likely have to wait for v0.5 at the earliest (see the roadmap link above).
@jaw-sh commented on GitHub (Aug 18, 2020):
@pirate this patch makes readability-extractor a requirement, enabled by default, but I cannot find adequate documentation of how to install it, or what it is. I cannot find any information just googling with quotes: "readability-extractor".

@cdvv7788 commented on GitHub (Aug 18, 2020):
@jaw-sh See https://github.com/pirate/ArchiveBox/blob/master/archivebox/config/init.py#L112 — Readability now provides instructions on how to install itself. Please create a new issue if you are still experiencing errors related to that extractor after testing with the latest master version.
@pirate commented on GitHub (Aug 18, 2020):
Ah sorry, forgot to add the docs link for that. It should be fixed now on master.