[GH-ISSUE #444] Feature Request: Ignore Errors Mode #296

Closed
opened 2026-03-01 14:42:12 +03:00 by kerem · 5 comments
Owner

Originally created by @jaw-sh on GitHub (Aug 15, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/444

I am importing just under 1 million links supplied by forum users over 7 years. Not all links work and I need the system to skip over links it cannot import.

Type

  • [ ] General question or discussion
  • [ ] Propose a brand new feature
  • [x] Request modification of existing behavior or design

What is the problem that your feature request solves

I have a list of about 250,000 links that match an %archive.%/% pattern. Not all of the links are necessarily valid, since they are user-supplied. When I tried to import them, the command quickly bailed with an "Invalid IPv6 URL" error.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

archivebox add < /tmp/links.txt --ignore-errors

What hacks or alternative solutions have you tried to solve the problem?

I can't imagine any way of accomplishing this without putting each URL into a separate file and writing a bash script to run add on each one.
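For reference, a less extreme version of that workaround is to split the big link list into a handful of chunk files rather than one file per URL. A minimal sketch (file paths and chunk size are made up for illustration):

```python
def split_links(urls, chunk_size=50_000):
    """Yield successive chunks of at most chunk_size URLs."""
    for i in range(0, len(urls), chunk_size):
        yield urls[i:i + chunk_size]

# Usage sketch (paths are hypothetical):
# with open("/tmp/links.txt") as f:
#     urls = [line.strip() for line in f if line.strip()]
# for n, chunk in enumerate(split_links(urls)):
#     with open(f"/tmp/links.{n:03d}.txt", "w") as out:
#         out.write("\n".join(chunk) + "\n")
# then run `archivebox add < /tmp/links.000.txt`, once per chunk file
```

This keeps each archivebox add invocation to a bounded number of links, so one bad URL or an out-of-memory crash only loses a single chunk.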

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [x] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

  • [x] I'm willing to contribute dev time / money to fix this issue
  • [ ] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up
sudo -u archive archivebox add < /tmp/archives.txt 
[i] [2020-08-15 16:32:11] ArchiveBox v0.4.13: archivebox add < /dev/stdin
    > /opt/archive

[+] [2020-08-15 16:32:12] Adding 228732 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597509132-import.txt
Traceback (most recent call last):                                                           
  File "/usr/local/bin/archivebox", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/archivebox/cli/__init__.py", line 126, in main
    pwd=pwd or OUTPUT_DIR,
  File "/usr/local/lib/python3.7/dist-packages/archivebox/cli/__init__.py", line 62, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/usr/local/lib/python3.7/dist-packages/archivebox/cli/archivebox_add.py", line 72, in main
    out_dir=pwd or OUTPUT_DIR,
  File "/usr/local/lib/python3.7/dist-packages/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/archivebox/main.py", line 544, in add
    new_links += parse_links_from_source(write_ahead_log)
  File "/usr/local/lib/python3.7/dist-packages/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/archivebox/index/__init__.py", line 284, in parse_links_from_source
    new_links = validate_links(raw_links)
  File "/usr/local/lib/python3.7/dist-packages/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/archivebox/index/__init__.py", line 130, in validate_links
    links = sorted_links(links)      # deterministically sort the links based on timstamp, url
  File "/usr/local/lib/python3.7/dist-packages/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/archivebox/index/__init__.py", line 175, in sorted_links
    return sorted(links, key=sort_func, reverse=True)
  File "/usr/local/lib/python3.7/dist-packages/archivebox/index/__init__.py", line 142, in archivable_links
    scheme_is_valid = scheme(link.url) in ('http', 'https', 'ftp')
  File "/usr/local/lib/python3.7/dist-packages/archivebox/util.py", line 30, in <lambda>
    scheme = lambda url: urlparse(url).scheme.lower()
  File "/usr/lib/python3.7/urllib/parse.py", line 368, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python3.7/urllib/parse.py", line 435, in urlsplit
    raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL

I don't see any IPv6 links in my list, by the way. I can send it over as well. It looks like the culprit may be a malformed IPv6 URL in the input itself.
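For context, the crash is easy to reproduce with the standard library alone: any URL whose host contains an unmatched "[" makes urlsplit() raise the same ValueError seen in the traceback above. The URL below is made up purely to trigger the error:

```python
from urllib.parse import urlparse

# A single malformed URL is enough to kill the whole import,
# because nothing in the call chain catches the ValueError.
try:
    urlparse("http://[unclosed-bracket.example/")
except ValueError as exc:
    print(exc)  # Invalid IPv6 URL
```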


@pirate commented on GitHub (Aug 17, 2020):

250,000 links! That's incredible. I think ArchiveBox will immediately crash if you ever try to pipe it that many URLs. Right now it tries to hold the whole index in memory on every run, which will either fill up your RAM and make you start paging, or just outright crash. The only way you could get away with this is if you split it up into smaller imports of 50k links maximum each. The good news is v0.5 should be better for performance and large-index handling.

As for ignoring the errors, I think we can add something like this to the URL parser before this step, to filter bad URLs:

from urllib.parse import urlparse

try:
    # Drop attributes with uri values that have protocols that aren't
    # allowed
    parsed = urlparse(new_value)
except ValueError:
    # URI is impossible to parse, therefore it's not allowed
    return None

Patch: 225b63

https://github.com/pirate/ArchiveBox/wiki/Roadmap#v05-remove-live-updated-json--html-index-in-favor-of-archivebox-export
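Applied to a list of URLs, a filter along those lines would drop only the unparsable entries instead of aborting the whole import. A minimal sketch of the idea, not the actual ArchiveBox patch:

```python
from urllib.parse import urlparse

def is_parsable(url):
    """Return True if urlparse accepts the URL and its scheme is allowed."""
    try:
        return urlparse(url).scheme.lower() in ('http', 'https', 'ftp')
    except ValueError:
        # e.g. "Invalid IPv6 URL": skip the entry instead of crashing
        return False

urls = [
    "https://example.com/ok",
    "http://[broken-bracket",      # raises ValueError inside urlparse
    "ftp://example.com/file",
]
print([u for u in urls if is_parsable(u)])
# ['https://example.com/ok', 'ftp://example.com/file']
```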


@pirate commented on GitHub (Aug 18, 2020):

The urlparse bug should be fixed in the latest version. Comment back here if you're still having any issues and I'll reopen the ticket.

pip install --upgrade archivebox
# or if you use docker
docker pull nikisweeting/archivebox

As for large-archive support, you'll likely have to wait for v0.5 at the earliest (see the roadmap link above).


@jaw-sh commented on GitHub (Aug 18, 2020):

@pirate this patch makes readability-extractor a requirement, enabled by default, but I cannot find adequate documentation of how to install it, or what it even is. I can't find any information by googling "readability-extractor" in quotes.


@cdvv7788 commented on GitHub (Aug 18, 2020):

@jaw-sh https://github.com/pirate/ArchiveBox/blob/master/archivebox/config/__init__.py#L112 Readability now provides instructions on how to install itself. Please create a new issue if you are still experiencing errors related to that extractor after testing with the latest master version.


@pirate commented on GitHub (Aug 18, 2020):

Ah sorry, I forgot to add the docs link for that. It should be fixed now on master:

npm install -g 'git+https://github.com/gildas-lormeau/SingleFile.git'
npm install -g 'git+https://github.com/pirate/readability-extractor.git'