[GH-ISSUE #259] assert len(re.findall(URL_REGEX, link['url'])) == 1 AssertionError upon ./archive #181

Closed
opened 2026-03-01 14:41:20 +03:00 by kerem · 5 comments

Originally created by @chrisCasebeer on GitHub (Aug 28, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/259

chris@chris-VirtualBox:~/ArchiveBox$ ./archive /media/sf_share/bookmarks_8_28_2019.html
[*] [2019-08-28 10:48:36] Parsing new links from output/sources/bookmarks_8_28_2019.html...
Traceback (most recent call last):
  File "./archive", line 136, in <module>
    main(*sys.argv)
  File "./archive", line 98, in main
    update_archive_data(import_path=import_path, resume=resume)
  File "./archive", line 106, in update_archive_data
    all_links, new_links = load_links_index(out_dir=OUTPUT_DIR, import_path=import_path)
  File "/home/chris/ArchiveBox/archivebox/index.py", line 69, in load_links_index
    new_links = validate_links(raw_links)
  File "/home/chris/ArchiveBox/archivebox/links.py", line 37, in validate_links
    check_links_structure(links)
  File "/home/chris/ArchiveBox/archivebox/util.py", line 107, in check_links_structure
    check_link_structure(links[0])
  File "/home/chris/ArchiveBox/archivebox/util.py", line 96, in check_link_structure
    assert len(re.findall(URL_REGEX, link['url'])) == 1
AssertionError

Bookmarks used, either Firefox or Pinboard HTML export.
OS: Ubuntu 16.04 LTS OR Raspberry Pi Stretch.

Commit:
https://github.com/pirate/ArchiveBox/commit/e2b054ae7522ccb44d6af380d6400752a9a806ea

Install method:
Automated followed by installation of youtube-dl manually.

All dependencies installed except for python3-distutils which could not be found on either system, Ubuntu or Raspberry Pi Stretch.


@chrisCasebeer commented on GitHub (Aug 28, 2019):

tags/v0.2.3 seems to resolve this.

A simple checkout to that version and running again yields results.


@pirate commented on GitHub (Sep 6, 2019):

Thanks, I knew I wouldn't regret putting that sanity check in. It's a guard to prevent wildly inaccurate parsing from proceeding if there's any bug in the REGEX code / url parsing, so I'm glad it failed in your case despite it being a bad UX. I'll make sure this regression is fixed in v0.4.0


@FreneticScribbler commented on GitHub (Apr 22, 2020):

Also running into this issue - do we have any idea what kind of URLs are causing it so I can sanitize them from my list?

Alternatively, are we able to get a Docker image for the v0.2.3 version?


@pirate commented on GitHub (Apr 23, 2020):

It's likely URLs that contain other URLs inside of them that aren't properly urlencoded, e.g. `https://example.com/http://some.thing.here`, or URLs that don't start with `http://` or `https://`.
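
For anyone wanting to pre-screen a bookmark list as asked above, here is a minimal sketch of that kind of check. It assumes a simplified stand-in pattern, not ArchiveBox's actual `URL_REGEX`, and the names `URL_PATTERN`, `count_urls`, and `partition_links` are hypothetical. It mirrors the failing assertion by keeping only entries in which exactly one `http://` or `https://` URL is found.

```python
import re

# Hypothetical stand-in for ArchiveBox's URL_REGEX (NOT the project's real pattern).
# It matches an http(s) scheme, then consumes non-whitespace characters until
# another http(s) scheme begins.
URL_PATTERN = re.compile(r'https?://(?:(?!https?://)\S)+', re.IGNORECASE)


def count_urls(text):
    """Return how many separate http(s) URLs the stand-in pattern finds in text."""
    return len(URL_PATTERN.findall(text))


def partition_links(urls):
    """Split urls into (clean, suspect); a clean entry contains exactly one URL."""
    clean, suspect = [], []
    for url in urls:
        (clean if count_urls(url) == 1 else suspect).append(url)
    return clean, suspect


candidates = [
    "https://example.com/page?x=1",                      # ordinary URL
    "https://example.com/http://some.thing.here",        # nested URL, not urlencoded
    "https://example.com/http%3A%2F%2Fsome.thing.here",  # nested URL, urlencoded
    "ftp://example.com/file",                            # no http(s) scheme
]

clean, suspect = partition_links(candidates)
print("clean:", clean)
print("suspect:", suspect)
```

Entries that land in `suspect` are candidates to urlencode or drop before importing. Whether the real parser rejects exactly the same inputs depends on the actual regex in the checked-out version.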


@pirate commented on GitHub (Jul 24, 2020):

This should be fixed in the latest release. Can you try checking out the `django` branch and giving it a shot?

git checkout django
git pull
docker build . -t archivebox
docker run -v $PWD/output:/data archivebox init

If you still encounter any issues, comment back here and I'll reopen the ticket.
