mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #259] assert len(re.findall(URL_REGEX, link['url'])) == 1 AssertionError upon ./archive #1693
Originally created by @chrisCasebeer on GitHub (Aug 28, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/259
chris@chris-VirtualBox:~/ArchiveBox$ ./archive /media/sf_share/bookmarks_8_28_2019.html
[*] [2019-08-28 10:48:36] Parsing new links from output/sources/bookmarks_8_28_2019.html...
Traceback (most recent call last):
  File "./archive", line 136, in <module>
    main(*sys.argv)
  File "./archive", line 98, in main
    update_archive_data(import_path=import_path, resume=resume)
  File "./archive", line 106, in update_archive_data
    all_links, new_links = load_links_index(out_dir=OUTPUT_DIR, import_path=import_path)
  File "/home/chris/ArchiveBox/archivebox/index.py", line 69, in load_links_index
    new_links = validate_links(raw_links)
  File "/home/chris/ArchiveBox/archivebox/links.py", line 37, in validate_links
    check_links_structure(links)
  File "/home/chris/ArchiveBox/archivebox/util.py", line 107, in check_links_structure
    check_link_structure(links[0])
  File "/home/chris/ArchiveBox/archivebox/util.py", line 96, in check_link_structure
    assert len(re.findall(URL_REGEX, link['url'])) == 1
AssertionError
Bookmarks used, either Firefox or Pinboard HTML export.
OS: Ubuntu 16.04 LTS OR Raspberry Pi Stretch.
Commit: github.com/pirate/ArchiveBox@e2b054ae75
Install method: Automated followed by installation of youtube-dl manually.
All dependencies installed except for python3-distutils which could not be found on either system, Ubuntu or Raspberry Pi Stretch.
@chrisCasebeer commented on GitHub (Aug 28, 2019):
tags/v0.2.3 seems to resolve this.
A simple checkout to that version and running again yields results.
@pirate commented on GitHub (Sep 6, 2019):
Thanks, I knew I wouldn't regret putting that sanity check in. It's a guard to prevent wildly inaccurate parsing from proceeding if there's any bug in the REGEX code / url parsing, so I'm glad it failed in your case despite it being a bad UX. I'll make sure this regression is fixed in v0.4.0
@FreneticScribbler commented on GitHub (Apr 22, 2020):
Also running into this issue - do we have any idea what kind of URLs are causing it so I can sanitize them from my list?
Alternatively, are we able to get a Docker image for the v0.2.3 version?
@pirate commented on GitHub (Apr 23, 2020):
It's likely URLs that contain other URLs inside of them that aren't properly urlencoded, e.g. https://example.com/http://some.thing.here, or URLs that don't start with http:// or https://.
@pirate commented on GitHub (Jul 24, 2020):
This should be fixed in the latest release, can you try checking out the django branch and giving it a shot?
If you still encounter any issues, comment back here and I'll reopen the ticket.
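For anyone who wants to sanitize a bookmark list before importing it, here is a minimal sketch of a pre-filter for the two URL shapes described above: URLs with another unencoded URL nested inside them, and URLs that don't start with http:// or https://. It does not use ArchiveBox's actual URL_REGEX; `SCHEME_RE`, `scheme_count`, `split_links`, and `encode_nested` are hypothetical names made up for this illustration, and the check is only an approximation of what the assertion guards against.

```python
import re
from urllib.parse import quote

# Hypothetical stand-in for the sanity check; NOT ArchiveBox's URL_REGEX.
SCHEME_RE = re.compile(r'https?://', re.IGNORECASE)


def scheme_count(url):
    """Count how many http:// or https:// prefixes appear anywhere in the string."""
    return len(SCHEME_RE.findall(url))


def split_links(urls):
    """Split URLs into (ok, suspect).

    A URL is "ok" only if it starts with http:// or https:// and contains
    no second, unencoded scheme prefix further along.
    """
    ok, suspect = [], []
    for url in urls:
        url = url.strip()
        if SCHEME_RE.match(url) and scheme_count(url) == 1:
            ok.append(url)
        else:
            suspect.append(url)
    return ok, suspect


def encode_nested(url):
    """Percent-encode everything from the second scheme prefix onward, if any."""
    matches = list(SCHEME_RE.finditer(url))
    if len(matches) < 2:
        return url
    cut = matches[1].start()
    return url[:cut] + quote(url[cut:], safe='')


if __name__ == '__main__':
    ok, suspect = split_links([
        'https://example.com/page',
        'https://example.com/http://some.thing.here',
        'example.com/no-scheme',
    ])
    print('ok:', ok)
    print('suspect:', suspect)
    print('encoded:', encode_nested(suspect[0]))
```

Whether a site serves the same page for the percent-encoded form is site-specific, so it may be safer to review the suspect URLs by hand than to rewrite them automatically.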