[PR #1362] [MERGED] Use feedparser for RSS parsing #1390

Closed
opened 2026-03-01 14:49:35 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1362
Author: @jimwins
Created: 2/25/2024
Status: Merged
Merged: 3/14/2024
Merged by: @pirate

Base: dev ← Head: issue-1171


📝 Commits (5)

  • 22f9a28 Use feedparser for RSS parsing in generic_rss and pinboard_rss parsers
  • 1f828d9 Add tests for generic_rss and pinboard_rss parsers
  • 9f462a8 Use feedparser for RSS parsing in generic_rss and pinboard_rss parsers
  • e7119ad Add tests for generic_rss and pinboard_rss parsers
  • 0f402df Merge with latest dev

📊 Changes

6 files changed (+161 additions, -53 deletions)

📝 archivebox/parsers/generic_rss.py (+20 -28)
📝 archivebox/parsers/pinboard_rss.py (+16 -25)
📝 pyproject.toml (+1 -0)
➕ tests/mock_server/templates/example.atom (+24 -0)
➕ tests/mock_server/templates/example.rss (+32 -0)
📝 tests/test_add.py (+68 -0)

📄 Description

Summary

The feedparser package has 20 years of history and is very good at parsing RSS and Atom, so use it instead of ad-hoc regex and XML parsing.

The medium_rss and shaarli_rss parsers weren't touched because they are probably unnecessary and should be removed. (The special parser for pinboard is needed only because of how tags work.)

Related issues

Fixes #1171
Fixes #870 (probably, would need to test against a Wallabag Atom file to be certain) [1]
Fixes #135
Fixes #123
Fixes #106

Changes these areas

  • [X] Bugfixes
  • [ ] Feature behavior
  • [ ] Command line interface
  • [ ] Configuration options
  • [X] Internal architecture
  • [ ] Snapshot data layout on disk

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


  1. I tried divining the format of Wallabag's XML/Atom export by reading the code, but there are a lot of abstractions and it would probably be faster to just install Wallabag and produce a sample, or faster yet for someone to provide a sample.
