[GH-ISSUE #821] Bug: Crash during Pinboard RSS import #511

Closed
opened 2026-03-01 14:44:14 +03:00 by kerem · 1 comment
Owner

Originally created by @overhacked on GitHub (Aug 4, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/821

Describe the bug

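Using the Pinboard RSS parser results in a crash due to an assertion in index/schema.py (https://github.com/ArchiveBox/ArchiveBox/blob/3d54b1321bf8c56627aaa50efcc809cd99caee52/archivebox/index/schema.py#L165). The assertion fails because parsers/pinboard_rss.py passes None as the url property of the Link object.

The url is set to None because the return value of item.find() is coerced to a bool in pinboard_rss.py (https://github.com/ArchiveBox/ArchiveBox/blob/3d54b1321bf8c56627aaa50efcc809cd99caee52/archivebox/parsers/pinboard_rss.py#L24). The result of the coercion is always False: ElementTree.Element defines __bool__() to return True only when the element has child elements (https://github.com/python/cpython/blob/3d8993a744813c5144851da5347d7b4b1885f234/Lib/xml/etree/ElementTree.py#L207), and the <link> element has none. In fact, the RSS 1.0 specification forbids any child elements of <link> (https://validator.w3.org/feed/docs/rss1.html#s5.5.2).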
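
The truthiness pitfall described above can be sketched in isolation. This is a minimal illustration, not the code in pinboard_rss.py: the sample feed, the helper names find_text_buggy and find_text_safe, and the example.com URL are all assumptions made up for the demonstration. Only xml.etree.ElementTree calls from the standard library are used.

```python
import warnings
import xml.etree.ElementTree as ET

# Namespace used by RSS 1.0 elements such as <item> and <link>.
RSS_NS = "{http://purl.org/rss/1.0/}"

# Hypothetical, minimal RSS 1.0 document (not a real Pinboard feed).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <item rdf:about="https://example.com/">
    <title>Example</title>
    <link>https://example.com/</link>
  </item>
</rdf:RDF>"""


def find_text_buggy(item, tag):
    """Pattern from the report: relies on the truth value of the found element.

    On the reporter's Python 3.8 a childless element is falsy, so this
    returns None even though <link> exists and has text.
    """
    with warnings.catch_warnings():
        # Newer Python versions emit a DeprecationWarning for this test.
        warnings.simplefilter("ignore", DeprecationWarning)
        found = item.find(tag)
        return found.text.strip() if found else None


def find_text_safe(item, tag):
    """Compare against None explicitly instead of testing truthiness."""
    found = item.find(tag)
    if found is None or found.text is None:
        return None
    return found.text.strip()


item = ET.fromstring(SAMPLE).find(RSS_NS + "item")
link = item.find(RSS_NS + "link")

print(link is not None)   # the element was found
print(len(link))          # number of child elements of <link>
print(find_text_safe(item, RSS_NS + "link"))
```

The explicit "is not None" comparison is the check the ElementTree documentation recommends, and it does not depend on how many children the element has.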

Steps to reproduce

  1. Run curl -sSL 'https://feeds.pinboard.in/rss/u:maciej/' | archivebox add --parser=pinboard_rss
  2. Python crashes with a backtrace, below.

Screenshots or log output

Backtrace
[X] Error while loading link! [1628081549.183792] None "None"
Traceback (most recent call last):
  File "./test_pinboard_rss.py", line 18, in <module>
    main()
  File "./test_pinboard_rss.py", line 13, in main
    items = [item for item in parse_pinboard_rss_export(args.rss_file)]
  File "./test_pinboard_rss.py", line 13, in <listcomp>
    items = [item for item in parse_pinboard_rss_export(args.rss_file)]
  File "/usr/local/lib/python3.8/site-packages/archivebox/parsers/pinboard_rss.py", line 41, in parse_pinboard_rss_export
    yield Link(
  File "<string>", line 11, in __init__
  File "/usr/local/lib/python3.8/site-packages/archivebox/index/schema.py", line 141, in __post_init__
    self.typecheck()
  File "/usr/local/lib/python3.8/site-packages/archivebox/index/schema.py", line 165, in typecheck
    assert isinstance(self.url, str) and '://' in self.url
AssertionError
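
For readers unfamiliar with the failing check, the last frames of the backtrace can be mimicked with a small dataclass. This is only a sketch: LinkSketch and try_link are hypothetical names, and the real Link class in index/schema.py has more fields and checks than the single condition copied from the traceback.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkSketch:
    """Hypothetical stand-in for ArchiveBox's Link; only the url check is mirrored."""
    url: str

    def __post_init__(self):
        # Dataclasses call __post_init__ right after the generated __init__.
        self.typecheck()

    def typecheck(self):
        # Same condition as the assertion shown in the traceback.
        assert isinstance(self.url, str) and '://' in self.url


def try_link(url):
    """Return a short label describing whether construction succeeded."""
    try:
        LinkSketch(url=url)
        return "ok"
    except AssertionError:
        return "AssertionError"


print(try_link("https://example.com/"))
print(try_link(None))
```

Because isinstance(None, str) is False, the condition fails before the '://' test is evaluated, which matches the bare AssertionError at the end of the backtrace.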

ArchiveBox version

Running dev branch. Tested on commit 3d54b13.

ArchiveBox v0.6.3
Cpython FreeBSD FreeBSD-12.2-RELEASE-p2-amd64-64bit-ELF amd64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.8.10         valid     /usr/local/bin/python3.8
 √  DJANGO_BINARY         v3.1.13         valid     /usr/local/lib/python3.8/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.77.0         valid     /usr/local/bin/curl
 √  WGET_BINARY           v1.21           valid     /usr/local/bin/wget
 √  NODE_BINARY           v16.2.0         valid     /usr/local/bin/node
 √  SINGLEFILE_BINARY     v0.3.26         valid     ./node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.3          valid     ./node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.32.0         valid     /usr/local/bin/git
 √  YOUTUBEDL_BINARY      v2021.06.06     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v91.0.4472.164  valid     /usr/local/bin/chrome
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/local/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /usr/local/lib/python3.8/site-packages/archivebox
 √  TEMPLATES_DIR         3 files         valid     /usr/local/lib/python3.8/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            11 files        valid     /usr/home/archivebox/data
 √  SOURCES_DIR           8 files         valid     ./sources
 √  LOGS_DIR              3 files         valid     ./logs
 √  ARCHIVE_DIR           877 files       valid     ./archive
 √  CONFIG_FILE           153.0 Bytes     valid     ./ArchiveBox.conf
 √  SQL_INDEX             6.8 MB          valid     ./index.sqlite3
kerem 2026-03-01 14:44:14 +03:00
Author
Owner

@pirate commented on GitHub (Aug 4, 2021):

nice find, thanks!
