[GH-ISSUE #1000] Bug: Parsing Wallabag Atom feed tries to open nonexisting files #2137

opened 2026-03-01 17:56:46 +03:00 by kerem · 1 comment

Originally created by @peterrus on GitHub (Jul 18, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1000

Describe the bug

After commit https://github.com/ArchiveBox/ArchiveBox/commit/a6767671fb68e25f67edcf16afafe5234d2826dd it seems that `add()` (https://github.com/ArchiveBox/ArchiveBox/blob/a6767671fb68e25f67edcf16afafe5234d2826dd/archivebox/main.py#L555) sometimes gets called with a `url` parameter containing the entire Atom feed instead of a list of actual URLs. I am not sure if this happens only with the Wallabag parser, but this input is not expected at https://github.com/ArchiveBox/ArchiveBox/blob/a6767671fb68e25f67edcf16afafe5234d2826dd/archivebox/parsers/__init__.py#L158 and a `No such file or directory` error is raised.
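The failure mode described above can be illustrated with a minimal, hypothetical sketch (this is not ArchiveBox's actual code; `load_source` is an invented stand-in for any helper that assumes its argument is a path on disk):

```python
# Hypothetical sketch of the failure mode: a helper that assumes its
# argument is a file path raises FileNotFoundError
# ("No such file or directory") when handed raw Atom feed text instead.
atom_feed = '<?xml version="1.0"?><feed xmlns="http://www.w3.org/2005/Atom"></feed>'

def load_source(path: str) -> str:
    # Buggy expectation: the input is always a readable path on disk.
    with open(path) as f:
        return f.read()

err = None
try:
    load_source(atom_feed)  # the whole feed is passed where a path was expected
except FileNotFoundError as e:
    err = e

print(f"reproduced: {err.strerror}")
```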

Steps to reproduce

  1. Run ArchiveBox from any commit on the `dev` branch after https://github.com/ArchiveBox/ArchiveBox/commit/a6767671fb68e25f67edcf16afafe5234d2826dd
  2. Import a Wallabag feed, for example this test feed I created: `curl https://app.wallabag.it/feed/dokafad/TDzxV9ejsZiWMq/archive | archivebox add --parser=wallabag_atom`
  3. The feed seems to get imported, but a lot of errors are raised and a number of arbitrary paths get read by the process.

Screenshots or log output

See: https://github.com/ArchiveBox/ArchiveBox/issues/971#issuecomment-1122499507

ArchiveBox version

0.6.3
ArchiveBox v0.6.3 03eb7e5 Cpython Linux Linux-5.15.0-40-generic-x86_64-with-glibc2.31 x86_64
DEBUG=False IN_DOCKER=True IS_TTY=True TZ=UTC FS_ATOMIC=True FS_REMOTE=True FS_PERMS=644 999:999 SEARCH_BACKEND=ripgrep

[i] Dependency versions:
 √  PYTHON_BINARY         v3.10.5         valid     /usr/local/bin/python3.10                                                   
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.10/sqlite3/dbapi2.py                                 
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.10/site-packages/django/__init__.py                  
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   

 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 √  GIT_BINARY            v2.30.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2022.06.29     valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v103.0.5060.114  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            7 files @       valid     /data                                                                       
 √  SOURCES_DIR           12 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           1 files         valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3     

But actually running https://github.com/ArchiveBox/ArchiveBox/commit/a6767671fb68e25f67edcf16afafe5234d2826dd


@peterrus commented on GitHub (Oct 7, 2024):

I have taken a look at the current code that parses the Wallabag feed and it seems to rely heavily on string parsing instead of something that 'understands' the XML document (so one could use XPath, for example). After some experimenting with Python's `lxml` I ran into issues where the document returned by Wallabag's RSS feed was simply too large for `lxml` to handle (at least on my machine/setup). I suspect this is because the Wallabag feed includes each saved page's entire content, and I have configured Wallabag to return a feed containing 2000 documents. My collection already contains 1600+ items, so I expect this to become a problem sooner or later.
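For what it's worth, there is a memory-friendlier middle ground between whole-document parsing and the API: incremental parsing. A toy sketch using the stdlib's `xml.etree.ElementTree.iterparse` (the two-entry feed below is invented, not real Wallabag output):

```python
# Toy sketch: stream entry links out of a large Atom feed with iterparse(),
# clearing each <entry> after use so memory usage stays flat.
import io
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

feed = io.BytesIO(b"""<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><link href="https://example.com/a"/></entry>
  <entry><link href="https://example.com/b"/></entry>
</feed>""")

urls = []
for _event, elem in ET.iterparse(feed, events=("end",)):
    if elem.tag == ATOM + "entry":
        link = elem.find(ATOM + "link")
        if link is not None:
            urls.append(link.get("href"))
        elem.clear()  # drop the entry's subtree once we've read the link

print(urls)  # ['https://example.com/a', 'https://example.com/b']
```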

I opted for an (imho) cleaner approach where the Wallabag API is called with pagination and Wallabag's metadata-only option (to prevent huge blobs of data being processed). Because I had trouble setting up a Docker dev environment on the `dev` branch (something related to `nodejs`), I opted to just put everything in a separate Python script whose output I pipe into the `url_list` parser, and use that as a workaround for now. Maybe someone else feels up to integrating this into ArchiveBox:

# Print a list of the original urls of all saved entries in Wallabag. Doesn't
# filter on status or tags but this should be trivial to implement
# ( See API docs: https://app.wallabag.it/api/doc/ ).
#
# Usage:
#
# Create a `wallabag_fetch.toml` file in the same directory as this script
# with the following structure:
#
# base_url = "https://app.wallabag.it/"
# client_id = "secret"
# client_secret = "secret"
# username = "secret"
# password = "secret"


from requests.exceptions import RequestException
import requests
import sys
import os


if sys.version_info >= (3, 11):
    import tomllib
else:
    import tomli as tomllib


# Load configuration

with open(os.path.dirname(os.path.realpath(__file__)) +
          "/wallabag_fetch.toml", "rb") as f:
    config = tomllib.load(f)


HEADERS = {
    "Content-Type": "application/x-www-form-urlencoded"
}


def get_token():
    url = config['base_url'] + "oauth/v2/token"
    payload = {
        "grant_type": "password",
        "client_id": config['client_id'],
        "client_secret": config['client_secret'],
        "username": config['username'],
        "password": config['password']
    }
    response = requests.post(url, headers=HEADERS, data=payload)
    response.raise_for_status()
    return response.json()["access_token"]


def fetch_wallabag_entries(page=1):
    # Authenticate once and reuse the bearer token for subsequent pages.
    if "Authorization" not in HEADERS:
        HEADERS["Authorization"] = f"Bearer {get_token()}"
    # base_url already ends in a slash, so don't insert a second one.
    url = f"{config['base_url']}api/entries?page={page}&detail=metadata"
    try:
        response = requests.get(url, headers=HEADERS)
        response.raise_for_status()
        return response.json()
    except RequestException as e:
        # Report on stderr so stdout stays a clean list of urls for piping.
        print(f"An error occurred while fetching entries: {e}", file=sys.stderr)
        return None


def main():
    # Fetch all pages
    all_entries = []
    page = 1
    while True:
        entries = fetch_wallabag_entries(page)
        if entries is None or not entries.get('_embedded', {}).get('items'):
            break

        all_entries.extend(entries['_embedded']['items'])

        total_pages = entries.get('pages')
        page += 1
        if page > total_pages:
            break

    # Print all entries
    for entry in all_entries:
        print(entry['url'])


if __name__ == "__main__":
    main()
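The pagination loop in `main()` can be exercised against canned responses to check its stopping conditions (a toy sketch; `fake_pages` is an invented stand-in for the real Wallabag API):

```python
# Toy sketch of the pagination loop above, run against canned responses
# instead of the real Wallabag API.
fake_pages = {
    1: {"pages": 2, "_embedded": {"items": [{"url": "https://example.com/a"}]}},
    2: {"pages": 2, "_embedded": {"items": [{"url": "https://example.com/b"}]}},
}

def fetch_fake(page):
    # Returns None past the last page, like a failed request would.
    return fake_pages.get(page)

all_entries = []
page = 1
while True:
    entries = fetch_fake(page)
    if entries is None or not entries.get("_embedded", {}).get("items"):
        break
    all_entries.extend(entries["_embedded"]["items"])
    if page >= entries["pages"]:
        break  # last page reached
    page += 1

print([e["url"] for e in all_entries])  # ['https://example.com/a', 'https://example.com/b']
```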

Edit: I noticed a more intelligent RSS feed parser was already added in https://github.com/ArchiveBox/ArchiveBox/issues/1000#issuecomment-2396652382. I still believe that, because Wallabag's RSS feed can get huge, an API-based solution is more elegant; the only 'downside' is that you would have to configure some credentials for the API somewhere.
