[PR #1362] Use feedparser for RSS parsing #2901

Closed
opened 2026-03-01 18:01:05 +03:00 by kerem · 0 comments
Owner

Original Pull Request: https://github.com/ArchiveBox/ArchiveBox/pull/1362

State: closed
Merged: Yes


Summary

The feedparser package has 20 years of history and is very good at parsing RSS and Atom, so use it instead of ad-hoc regex and XML parsing.

The medium_rss and shaarli_rss parsers weren't touched because they are probably unnecessary and should be removed. (The special parser for Pinboard is needed only because of how its tags work.)
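
As a rough illustration of the approach described above (not the code merged in this PR), a feedparser-based parser might look like the sketch below. The function names `entry_to_link` and `parse_feed` and the shape of the returned dict are hypothetical; only `feedparser.parse()`, the `entries` list, and the entry fields `link`, `title`, `tags`, `updated_parsed` and `published_parsed` come from feedparser itself.

```python
import calendar
from datetime import datetime, timezone


def entry_to_link(entry):
    """Convert one dict-like feed entry into a plain dict (hypothetical shape)."""
    # feedparser normalizes dates to UTC time.struct_time values.
    parsed_time = entry.get("updated_parsed") or entry.get("published_parsed")
    timestamp = None
    if parsed_time is not None:
        timestamp = datetime.fromtimestamp(calendar.timegm(parsed_time), tz=timezone.utc)

    # Each tag is a dict-like object; its "term" key holds the tag name.
    tags = ",".join(tag["term"] for tag in entry.get("tags", []) if tag.get("term"))

    return {
        "url": entry.get("link"),
        "title": entry.get("title"),
        "timestamp": timestamp,
        "tags": tags,
    }


def parse_feed(text):
    """Parse RSS or Atom text and return one dict per entry that has a link."""
    # Third-party dependency (pip install feedparser), imported lazily so that
    # entry_to_link() stays usable on plain dicts without it.
    import feedparser

    feed = feedparser.parse(text)
    return [entry_to_link(entry) for entry in feed.entries if entry.get("link")]
```

Because feedparser handles both RSS and Atom and normalizes dates and tags, the same few lines would cover inputs that previously needed separate regex and XML code paths.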

Related issues

Fixes #1171
Fixes #870 (probably; would need to test against a Wallabag Atom file to be certain) [1]
Fixes #135
Fixes #123
Fixes #106

Changes these areas

  • [x] Bugfixes
  • [ ] Feature behavior
  • [ ] Command line interface
  • [ ] Configuration options
  • [x] Internal architecture
  • [ ] Snapshot data layout on disk

  1. I tried divining the format of Wallabag's XML/Atom export by reading the code, but there are many layers of abstraction, and it would probably be faster to just install Wallabag and produce a sample, or faster yet for someone to provide a sample.