mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[PR #1362] [MERGED] Use feedparser for RSS parsing #4403
📋 Pull Request Information
Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1362
Author: @jimwins
Created: 2/25/2024
Status: ✅ Merged
Merged: 3/14/2024
Merged by: @pirate
Base: dev ← Head: issue-1171
📝 Commits (5)
22f9a28 Use feedparser for RSS parsing in generic_rss and pinboard_rss parsers
1f828d9 Add tests for generic_rss and pinboard_rss parsers
9f462a8 Use feedparser for RSS parsing in generic_rss and pinboard_rss parsers
e7119ad Add tests for generic_rss and pinboard_rss parsers
0f402df Merge with latest dev
📊 Changes
6 files changed (+161 additions, -53 deletions)
📝 archivebox/parsers/generic_rss.py (+20 -28)
📝 archivebox/parsers/pinboard_rss.py (+16 -25)
📝 pyproject.toml (+1 -0)
➕ tests/mock_server/templates/example.atom (+24 -0)
➕ tests/mock_server/templates/example.rss (+32 -0)
📝 tests/test_add.py (+68 -0)
📄 Description
Summary
The feedparser package has 20 years of history and is very good at parsing RSS and Atom, so use it instead of ad-hoc regex and XML parsing.
The medium_rss and shaarli_rss parsers weren't touched because they are probably unnecessary and should be removed. (The special parser for Pinboard is needed only because of how its tags work.)
Related issues
Fixes #1171
Fixes #870 (probably, would need to test against a Wallabag Atom file to be certain) [1]
Fixes #135
Fixes #123
Fixes #106
Changes these areas
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
[1] I tried divining the format of Wallabag's XML/Atom export by reading the code, but there are a lot of abstractions and it would probably be faster to just install Wallabag and produce a sample, or faster yet for someone to provide a sample.