[GH-ISSUE #971] Bug: Parsing Wallabag RSS feed fails #3623

Closed
opened 2026-03-14 23:46:18 +03:00 by kerem · 4 comments

Originally created by @peterrus on GitHub (May 1, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/971

Describe the bug

I have a setup where ArchiveBox (through a cronjob) fetches an RSS feed of my archived (aka read) articles from Wallabag.it and imports them. This way I have a redundant archive of everything I read in Wallabag. Overkill? Maybe.

Somewhere around 2022-03-31 the parsing of this RSS feed started to fail with the following error:

[+] [2022-05-01 10:03:42] Adding 5023 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1651399422-import.txt
[X] Error while loading link! [1642338726.0] <link rel="via" "The Tragic Optimism of Software Projects"
Traceback (most recent call last):
[ ... redacted ...]
  File "/app/archivebox/index/schema.py", line 165, in typecheck
    assert isinstance(self.url, str) and '://' in self.url
AssertionError
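
The check that fails at the bottom of the traceback can be reproduced in isolation. The following is a minimal sketch, not ArchiveBox code: `passes_url_check` is a hypothetical helper that copies only the condition shown in the traceback, and `bad_value` assumes that the text printed after the timestamp in the error line is what ended up in the link's `url` field.

```python
def passes_url_check(url):
    """Hypothetical helper mirroring the condition asserted in typecheck()."""
    return isinstance(url, str) and '://' in url


# Assumption: the parser handed over a fragment of the tag instead of a URL.
bad_value = '<link rel="via"'

# Illustrative URL only, not taken from the actual feed.
good_value = 'https://example.com/article'

print(passes_url_check(bad_value))   # False: no '://' in the tag fragment
print(passes_url_check(good_value))  # True
```

If that assumption holds, the assertion is doing its job: the problem sits upstream, in how the URL is extracted from the feed.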

Not long before 2022-03-31, Wallabag released a new version, https://github.com/wallabag/wallabag/releases/tag/2.4.3, which includes a PR that modifies the formatting of the RSS feed it provides: https://github.com/wallabag/wallabag/pull/5347. I suspect this to be the culprit.

I am not exactly sure where the responsibility for fixing this lies, but I want to at least document that I ran into this in case someone else experiences a similar issue.

Steps to reproduce

I have created a test account with one archived article on Wallabag.it. This account will expire in 14 days, but you can easily create a new one for testing purposes.

  1. curl https://app.wallabag.it/feed/dokafad/TDzxV9ejsZiWMq/archive | archivebox add --parser=wallabag_atom

Screenshots or log output

Full log

[+] [2022-05-01 10:03:42] Adding 5023 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1651399422-import.txt
[X] Error while loading link! [1642338726.0] <link rel="via" "The Tragic Optimism of Software Projects"
Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 33, in <module>
    sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
  File "/app/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/app/archivebox/cli/archivebox_add.py", line 103, in main
    add(
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/main.py", line 588, in add
    new_links += parse_links_from_source(write_ahead_log, root_url=None, parser=parser)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/index/__init__.py", line 275, in parse_links_from_source
    raw_links, parser_name = parse_links(source_path, root_url=root_url, parser=parser)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/parsers/__init__.py", line 101, in parse_links
    links, parser = run_parser_functions(file, timer, root_url=root_url, parser=parser)
  File "/app/archivebox/parsers/__init__.py", line 115, in run_parser_functions
    parsed_links = list(parser_func(to_parse, root_url=root_url))
  File "/app/archivebox/parsers/wallabag_atom.py", line 51, in parse_wallabag_atom_export
    yield Link(
  File "<string>", line 11, in __init__
  File "/app/archivebox/index/schema.py", line 141, in __post_init__
    self.typecheck()
  File "/app/archivebox/index/schema.py", line 165, in typecheck
    assert isinstance(self.url, str) and '://' in self.url
AssertionError

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.13.0-39-generic-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9                                                    
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.04.26     valid     /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            8 files         valid     /data                                                                       
 √  SOURCES_DIR           948 files       valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           168 files       valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             1.9 MB          valid     ./index.sqlite3    
kerem closed this issue 2026-03-14 23:46:24 +03:00

@peterrus commented on GitHub (May 3, 2022):

Also created a dedicated issue in the Wallabag project, see above.


@pirate commented on GitHub (May 10, 2022):

I fixed it in acd53c8. It's a mildly annoying change in their export format: they started inserting newline wrappings mid-XML tag, which broke my janky parser.

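Screenshot: https://user-images.githubusercontent.com/511499/167528115-87055aa0-89e4-4468-9582-469c331018ed.png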
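
To illustrate the failure mode described in this comment, here is a rough sketch of why a line-oriented parser is sensitive to newlines inside a tag while an XML-aware one is not. This is not the code changed in acd53c8: the feed fragments, the `href` attribute layout, and both function names are made up for illustration, and real Wallabag output may look different.

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

# Hypothetical feed fragments; only the wrapping inside <link> differs.
SINGLE_LINE = '''<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title>Example</title>
<link rel="via" href="https://example.com/a"/>
</entry>
</feed>'''

WRAPPED = '''<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title>Example</title>
<link rel="via"
      href="https://example.com/a"/>
</entry>
</feed>'''


def url_by_line_matching(feed):
    """Naive approach: assume the whole tag sits on a single line."""
    for line in feed.split('\n'):
        line = line.strip()
        if line.startswith('<link rel="via"'):
            # Take what follows 'href="' up to the next double quote.
            return line.split('href="')[-1].split('"')[0]
    return None


def url_by_xml_parsing(feed):
    """XML-aware approach: whitespace between attributes does not matter."""
    root = ET.fromstring(feed)
    for link in root.iter(ATOM + 'link'):
        if link.get('rel') == 'via':
            return link.get('href')
    return None


print(url_by_line_matching(SINGLE_LINE))  # https://example.com/a
print(url_by_line_matching(WRAPPED))      # <link rel=
print(url_by_xml_parsing(WRAPPED))        # https://example.com/a
```

In the wrapped case the line-based version returns a piece of the tag rather than a URL, which is the kind of value that would then fail a `'://' in url` check.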

@peterrus commented on GitHub (May 10, 2022):

hey @pirate, thanks for this! One problem though :p I am getting some output that worries me:

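Screenshot: https://user-images.githubusercontent.com/1117858/167656592-0fe4a853-3c16-4ece-a2cd-e80e8c0299be.png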

After streaming a whole bunch of these errors the actual archival process does start, and does seem to work.

I am running the docker image with tag sha-eb77908 btw.


@pirate commented on GitHub (Sep 5, 2023):

Whoops, reopened by accident, ignore that. Tracking the latest Wallabag issue over here: #1000
