[GH-ISSUE #1171] Feature Request: Parse Atom RSS feeds #3744

Open
opened 2026-03-15 00:17:16 +03:00 by kerem · 8 comments
Owner

Originally created by @sclu1034 on GitHub (Jul 6, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1171

Type

  • [ ] General question or discussion
  • [ ] Propose a brand new feature
  • [x] Request modification of existing behavior or design

What is the problem that your feature request solves

Generic Atom-based RSS feeds (see https://datatracker.ietf.org/doc/html/rfc4287) cannot be parsed by the current parsers.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

Ideally, a library like feedparser would be used, both as a more robust solution than the current hand-rolled regex parsers, and something that already supports a wide range of feed formats.

What hacks or alternative solutions have you tried to solve the problem?

I wrote a script to use that library myself, and turn the feed info into JSON that I can pipe into archivebox add --parser json.
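
For context, `archivebox add --parser json` accepts a flat list of link objects like the ones such a script emits. A minimal hand-written sketch of that payload (field values invented for illustration):

```json
[
  {
    "url": "https://example.org/post/1",
    "title": "First post",
    "tags": "atom,example",
    "description": "Summary text from the feed entry."
  }
]
```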

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [x] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

  • [ ] I'm willing to contribute dev time / money to fix this issue
  • [x] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up

@melyux commented on GitHub (Jul 12, 2023):

Please do this. I'm surprised this project isn't using a proper RSS parser. I just spent the entire day writing regex to pick out the random RSS and W3 links that ArchiveBox keeps pulling out of my RSS feeds somehow.


@sclu1034 commented on GitHub (Jul 12, 2023):

> the random RSS and W3 links that ArchiveBox keeps pulling out of my RSS feeds somehow

That's because the RSS parsers fail, hand over to the next parser, and eventually the txt parser gets a shot, which just takes everything that looks like a URL.
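
A quick sketch of why that fallback produces the stray "W3 links": a naive URL regex run over raw Atom XML also matches namespace URIs like the Atom xmlns. The pattern below is purely illustrative, not ArchiveBox's actual one:

```python
import re

# Crude URL pattern in the spirit of a plain-text link extractor
# (illustrative only, not ArchiveBox's real regex).
URL_RE = re.compile(r'https?://[^\s"<>]+')

atom_snippet = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><link href="https://example.org/post/1"/></entry>'
    '</feed>'
)

# Matches both the real entry link AND the w3.org namespace URI.
found = URL_RE.findall(atom_snippet)
print(found)
```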

Here's the script I use as a workaround to parse feeds into JSON first:

#!/usr/bin/env python3

import json
import sys

import feedparser

# Parse the feed URL (or local file path) given as the first argument.
dom = feedparser.parse(sys.argv[1])

links = []
for entry in dom.entries:
    # Not every entry carries tags, a title, or a summary, so fall back to defaults.
    tags = ",".join(tag.term for tag in entry.get('tags', []))
    links.append({
        'url': entry.link,
        'title': entry.get('title', ''),
        'tags': tags,
        'description': entry.get('summary', ''),
        # 'created': entry.get('published', ''),
    })

print(json.dumps(links))

Run it as: script.py <feed_url> | archivebox add --parser json

@melyux commented on GitHub (Jul 20, 2023):

Wonder if there's a way to use a script like this in the scheduler... I guess not officially, would be easier to just fix the parsers if that's the way... but maybe I can modify the crontab directly to use the script. Let's see


@melyux commented on GitHub (Jul 20, 2023):

Wow feedparser is incredible, takes anything I throw at it. Could be an easy drop-in @pirate?


@melyux commented on GitHub (Jul 21, 2023):

Modified the crontab manually, and it works. I put the feedparse.py script into the container's mounted /data directory. I wrote a new Dockerfile to add the feedparser Python package into the image (along with the newer version of SingleFile, see #883) and used the local image. Here's the crontab mounted into the scheduler container:

@daily cd /data && /usr/local/bin/archivebox add --parser json "$(/data/feedparse.py 'https://www.domain.com/feed.rss')" >> /data/logs/schedule.log 2>&1 # archivebox_schedule

The pipe format you suggested above works when run by hand, but ArchiveBox can't parse it from the crontab; it can parse my command-substitution version above, which also stays runnable via the archivebox schedule --run-all command.

The Dockerfile:

FROM archivebox/archivebox:dev
RUN npm install -g single-file-cli
RUN pip install feedparser

After that, edit the archivebox block in docker-compose.yml: remove the image: archivebox/archivebox line and add build: ./archivebox (or whatever path holds your new Dockerfile). Adding image: archivebox to the same block tags the custom build, so the scheduler container can reuse it by setting image: archivebox as well.

Also in this block, set the env variable SINGLEFILE_BINARY=/usr/bin/single-file to use the newer SingleFile version's path, since npm installs it in a different one than the default path.
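
Putting those edits together, the relevant docker-compose.yml pieces might look like this (service names and paths are assumptions based on the stock compose file, not a verified config):

```yaml
services:
  archivebox:
    build: ./archivebox          # directory containing the custom Dockerfile
    image: archivebox            # tag the build so other services can reuse it
    environment:
      - SINGLEFILE_BINARY=/usr/bin/single-file  # npm-installed single-file path
    volumes:
      - ./data:/data

  archivebox_scheduler:
    image: archivebox            # reuse the custom image built above
    volumes:
      - ./data:/data
```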


@pirate commented on GitHub (Aug 16, 2023):

Sorry for causing you so much extra overhead / debugging time to have to resort to this workaround @melyux, but thanks for documenting your process here for others!

All my dev focus is currently on a refactor I have in progress to add Huey support to ArchiveBox, which has left a few of these relatively big issues to languish. I appreciate everyone's patience while I give some much-needed attention to the internal architecture!


@pirate commented on GitHub (Mar 27, 2024):

This should work now that we switched to feedparser, let me know if y'all still have issues on the latest :dev build and I'll reopen.


@Ramblurr commented on GitHub (Apr 11, 2024):

@pirate I have several feeds I'd like to parse using this feature, but none of them are working. I'm using the :dev tag (image hash 5c0d2df58cf7ac9aa314129de16121c991172e34081ce132e2575fe7160d5b1b)

Samples are attached (.txt added to bypass github mimetype detector):

  • all.rss.txt (https://github.com/ArchiveBox/ArchiveBox/files/14948562/all.rss.txt)
  • user-favorites.xml.txt (https://github.com/ArchiveBox/ArchiveBox/files/14948560/user-favorites.xml.txt)

The sources are:

  • linkding (https://github.com/sissbruecker/linkding)
  • inoreader (https://www.inoreader.com/)