[GH-ISSUE #1000] Bug: Parsing Wallabag Atom feed tries to open nonexisting files #2137

opened 2026-03-01 17:56:46 +03:00 by kerem · 1 comment

Originally created by @peterrus on GitHub (Jul 18, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1000

Describe the bug

After commit https://github.com/ArchiveBox/ArchiveBox/commit/a6767671fb68e25f67edcf16afafe5234d2826dd it seems that `add()` (https://github.com/ArchiveBox/ArchiveBox/blob/a6767671fb68e25f67edcf16afafe5234d2826dd/archivebox/main.py#L555) sometimes gets called with a `url` parameter containing the entire Atom feed instead of a list of actual URLs. I am not sure if this happens only with the Wallabag parser, but this input is not expected at https://github.com/ArchiveBox/ArchiveBox/blob/a6767671fb68e25f67edcf16afafe5234d2826dd/archivebox/parsers/__init__.py#L158 and a `No such file or directory` error is raised.
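The failure mode described above can be illustrated with a minimal, hypothetical sketch (this is not ArchiveBox's actual code; `load_source` is an invented stand-in for any helper that assumes its argument is a path on disk):

```python
# Hypothetical sketch of the failure mode: a helper that assumes its
# argument is a file path raises FileNotFoundError
# ("No such file or directory") when handed raw Atom feed text instead.
atom_feed = '<?xml version="1.0"?><feed xmlns="http://www.w3.org/2005/Atom"></feed>'

def load_source(path: str) -> str:
    # Buggy expectation: the input is always a readable path on disk.
    with open(path) as f:
        return f.read()

err = None
try:
    load_source(atom_feed)  # the whole feed is passed where a path was expected
except FileNotFoundError as e:
    err = e

print(f"reproduced: {err.strerror}")
```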

Steps to reproduce

  1. Run ArchiveBox from any commit on the `dev` branch after https://github.com/ArchiveBox/ArchiveBox/commit/a6767671fb68e25f67edcf16afafe5234d2826dd
  2. Import a Wallabag feed, for example this test feed I created: `curl https://app.wallabag.it/feed/dokafad/TDzxV9ejsZiWMq/archive | archivebox add --parser=wallabag_atom`
  3. The feed seems to get imported, but a lot of errors are raised and a number of arbitrary paths get read by the process.

Screenshots or log output

See: https://github.com/ArchiveBox/ArchiveBox/issues/971#issuecomment-1122499507

ArchiveBox version

0.6.3
ArchiveBox v0.6.3 03eb7e5 Cpython Linux Linux-5.15.0-40-generic-x86_64-with-glibc2.31 x86_64
DEBUG=False IN_DOCKER=True IS_TTY=True TZ=UTC FS_ATOMIC=True FS_REMOTE=True FS_PERMS=644 999:999 SEARCH_BACKEND=ripgrep

[i] Dependency versions:
 √  PYTHON_BINARY         v3.10.5         valid     /usr/local/bin/python3.10                                                   
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.10/sqlite3/dbapi2.py                                 
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.10/site-packages/django/__init__.py                  
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   

 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 √  GIT_BINARY            v2.30.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2022.06.29     valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v103.0.5060.114  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            7 files @       valid     /data                                                                       
 √  SOURCES_DIR           12 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           1 files         valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3     

But actually running https://github.com/ArchiveBox/ArchiveBox/commit/a6767671fb68e25f67edcf16afafe5234d2826dd


@peterrus commented on GitHub (Oct 7, 2024):

I have taken a look at the current code that parses the Wallabag feed and it seems to rely heavily on string parsing instead of something that 'understands' the XML document (so one could use XPath, for example). After some experimenting with Python's `lxml` I ran into issues where the document returned by Wallabag's RSS feed was simply too large for `lxml` to handle (at least on my machine/setup). I suspect this is because the Wallabag feed includes each saved page's entire content, and I have configured Wallabag to return a feed containing 2000 documents. My collection already contains 1600+ items, so I expect this to become a problem sooner or later.
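For what it's worth, there is a memory-friendlier middle ground between whole-document parsing and the API: incremental parsing. A toy sketch using the stdlib's `xml.etree.ElementTree.iterparse` (the two-entry feed below is invented, not real Wallabag output):

```python
# Toy sketch: stream entry links out of a large Atom feed with iterparse(),
# clearing each <entry> after use so memory usage stays flat.
import io
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

feed = io.BytesIO(b"""<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><link href="https://example.com/a"/></entry>
  <entry><link href="https://example.com/b"/></entry>
</feed>""")

urls = []
for _event, elem in ET.iterparse(feed, events=("end",)):
    if elem.tag == ATOM + "entry":
        link = elem.find(ATOM + "link")
        if link is not None:
            urls.append(link.get("href"))
        elem.clear()  # drop the entry's subtree once we've read the link

print(urls)  # ['https://example.com/a', 'https://example.com/b']
```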

I opted for an (imho) cleaner approach where the Wallabag API is called with pagination and Wallabag's metadata-only option (to prevent huge blobs of data being processed). Because I had trouble setting up a Docker dev environment on the `dev` branch (something related to `nodejs`), I opted to just put everything in a separate Python script whose output I pipe into the `url_list` parser, and use that as a workaround for now. Maybe someone else feels up to integrating this into ArchiveBox:

# Print a list of the original urls of all saved entries in Wallabag. Doesn't
# filter on status or tags but this should be trivial to implement
# ( See API docs: https://app.wallabag.it/api/doc/ ).
#
# Usage:
#
# Create a `wallabag_fetch.toml` file in the same directory as this script
# with the following structure:
#
# base_url = "https://app.wallabag.it/"
# client_id = "secret"
# client_secret = "secret"
# username = "secret"
# password = "secret"


from requests.exceptions import RequestException
import requests
import sys
import os


if sys.version_info >= (3, 11):
    import tomllib
else:
    import tomli as tomllib


# Load configuration

with open(os.path.dirname(os.path.realpath(__file__)) +
          "/wallabag_fetch.toml", "rb") as f:
    config = tomllib.load(f)


HEADERS = {
    "Content-Type": "application/x-www-form-urlencoded"
}


def get_token():
    url = config['base_url'] + "oauth/v2/token"
    payload = {
        "grant_type": "password",
        "client_id": config['client_id'],
        "client_secret": config['client_secret'],
        "username": config['username'],
        "password": config['password']
    }
    response = requests.post(url, headers=HEADERS, data=payload)
    response.raise_for_status()
    return response.json()["access_token"]


def fetch_wallabag_entries(page=1):
    # Authenticate once and reuse the bearer token for subsequent pages.
    if "Authorization" not in HEADERS:
        HEADERS["Authorization"] = f"Bearer {get_token()}"
    # base_url already ends in a slash, so don't insert a second one.
    url = f"{config['base_url']}api/entries?page={page}&detail=metadata"
    try:
        response = requests.get(url, headers=HEADERS)
        response.raise_for_status()
        return response.json()
    except RequestException as e:
        # Report on stderr so stdout stays a clean list of urls for piping.
        print(f"An error occurred while fetching entries: {e}", file=sys.stderr)
        return None


def main():
    # Fetch all pages
    all_entries = []
    page = 1
    while True:
        entries = fetch_wallabag_entries(page)
        if entries is None or not entries.get('_embedded', {}).get('items'):
            break

        all_entries.extend(entries['_embedded']['items'])

        total_pages = entries.get('pages')
        page += 1
        if page > total_pages:
            break

    # Print all entries
    for entry in all_entries:
        print(entry['url'])


if __name__ == "__main__":
    main()
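The pagination loop in `main()` can be exercised against canned responses to check its stopping conditions (a toy sketch; `fake_pages` is an invented stand-in for the real Wallabag API):

```python
# Toy sketch of the pagination loop above, run against canned responses
# instead of the real Wallabag API.
fake_pages = {
    1: {"pages": 2, "_embedded": {"items": [{"url": "https://example.com/a"}]}},
    2: {"pages": 2, "_embedded": {"items": [{"url": "https://example.com/b"}]}},
}

def fetch_fake(page):
    # Returns None past the last page, like a failed request would.
    return fake_pages.get(page)

all_entries = []
page = 1
while True:
    entries = fetch_fake(page)
    if entries is None or not entries.get("_embedded", {}).get("items"):
        break
    all_entries.extend(entries["_embedded"]["items"])
    if page >= entries["pages"]:
        break  # last page reached
    page += 1

print([e["url"] for e in all_entries])  # ['https://example.com/a', 'https://example.com/b']
```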

Edit: I noticed a more intelligent RSS feed parser was already added in https://github.com/ArchiveBox/ArchiveBox/issues/1000#issuecomment-2396652382. I still believe that, because Wallabag's RSS feed can get huge, an API-based solution is more elegant; the only 'downside' is that you would have to configure some credentials for the API somewhere.
