mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #1171] Feature Request: Parse Atom RSS feeds #2234
Originally created by @sclu1034 on GitHub (Jul 6, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1171
Type
What is the problem that your feature request solves
Generic Atom-based RSS feeds (see https://datatracker.ietf.org/doc/html/rfc4287) cannot be parsed by the current parsers.
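For context on why an RSS-oriented parser can miss Atom entries: in Atom (RFC 4287) the entry URL is carried in the href attribute of a namespaced link element, while RSS 2.0 puts it in the text content of link. The following is a minimal stdlib-only sketch of that difference. It is not ArchiveBox's parser; the sample feed and the function name atom_entry_links are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace (RFC 4287).
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample feed, invented for this example.
SAMPLE_ATOM = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <entry>
    <title>First post</title>
    <link rel="alternate" href="https://example.com/posts/1"/>
    <updated>2023-07-06T12:00:00Z</updated>
  </entry>
  <entry>
    <title>Second post</title>
    <link href="https://example.com/posts/2"/>
    <updated>2023-07-07T12:00:00Z</updated>
  </entry>
</feed>
"""


def atom_entry_links(xml_text):
    """Return (title, url) pairs, one per Atom entry that has a usable link."""
    root = ET.fromstring(xml_text)
    results = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        for link in entry.findall("atom:link", ATOM_NS):
            # A link with no rel attribute is treated as rel="alternate".
            if link.get("rel", "alternate") == "alternate" and link.get("href"):
                results.append((title, link.get("href")))
                break
    return results


if __name__ == "__main__":
    for title, url in atom_entry_links(SAMPLE_ATOM):
        print(title, url)
```

A parser that only looks for the text inside link finds nothing here, because both link elements are empty and carry the URL as an attribute.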
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
Ideally, a library like feedparser would be used, both as a more robust solution than the current hand-rolled regex parsers, and something that already supports a wide range of feed formats.
What hacks or alternative solutions have you tried to solve the problem?
I wrote a script to use that library myself, and turn the feed info into JSON that I can pipe into archivebox add --parser json.
How badly do you want this new feature?
@melyux commented on GitHub (Jul 12, 2023):
Please do this. I'm surprised this project isn't using a proper RSS parser. I just spent the entire day writing regex to pick out the random RSS and W3 links that ArchiveBox keeps pulling out of my RSS feeds somehow.
@sclu1034 commented on GitHub (Jul 12, 2023):
That's because the RSS parsers fail, hand over to the next parser, and eventually the txt parser gets a shot, which just takes everything that looks like a URL.
Here's the script I use as a workaround to parse feeds into JSON first:
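The script itself did not survive in this copy of the thread. As a rough idea of what such a workaround could look like, here is a minimal sketch, not the original script. It assumes the third-party feedparser package is installed, and it assumes that the JSON parser accepts a list of objects with url and title fields. The function names are hypothetical.

```python
import json
import sys


def entries_to_links(entries):
    """Convert dict-like feed entries (as feedparser returns) to plain dicts.

    The "url" and "title" keys are an assumption about what
    `archivebox add --parser json` accepts.
    """
    links = []
    for entry in entries:
        url = entry.get("link")
        if not url:
            continue  # skip entries without a usable link
        links.append({"url": url, "title": entry.get("title", "")})
    return links


def main(feed_location):
    # Imported here so the conversion helper above works without feedparser.
    import feedparser  # third-party: pip install feedparser

    parsed = feedparser.parse(feed_location)  # accepts a URL, file path, or string
    json.dump(entries_to_links(parsed.entries), sys.stdout, indent=2)


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Saved under a name such as feedparse.py, a sketch like this could be run as `python feedparse.py <feed-url> | archivebox add --parser json`.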
@melyux commented on GitHub (Jul 20, 2023):
Wonder if there's a way to use a script like this in the scheduler... I guess not officially, would be easier to just fix the parsers if that's the way... but maybe I can modify the crontab directly to use the script. Let's see
@melyux commented on GitHub (Jul 20, 2023):
Wow feedparser is incredible, takes anything I throw at it. Could be an easy drop-in @pirate?
@melyux commented on GitHub (Jul 21, 2023):
Modified the crontab manually, and it works. I put the feedparse.py script into the container's mounted /data directory. I wrote a new Dockerfile to add the feedparser Python package into the image (along with the newer version of SingleFile, see #883) and used the local image. Here's the crontab mounted into the scheduler container:
The format you suggested above with the | pipe works, but ArchiveBox can't parse it from the crontab. It is able to parse my version above in a way that's runnable by the archivebox schedule --run-all command.
The Dockerfile file:
after which we edit the docker-compose.yml block for archivebox to remove the image: archivebox/archivebox line and instead do build: ./archivebox (or whatever path you stored your new Dockerfile in). You can add image: archivebox to the block to make this custom image available for the scheduler container too, where you can also use image: archivebox to use this custom image.
Also in this block, set the env variable SINGLEFILE_BINARY=/usr/bin/single-file to use the newer SingleFile version's path, since npm installs it in a different one than the default path.
@pirate commented on GitHub (Aug 16, 2023):
Sorry for causing you so much extra overhead / debugging time to have to resort to this workaround @melyux, but thanks for documenting your process here for others!
All my dev focus is currently on a refactor I have in progress to add Huey support to ArchiveBox, which has left a few of these relatively big issues to languish. I appreciate everyone's patience while I give some much-needed attention to the internal architecture!
@pirate commented on GitHub (Mar 27, 2024):
This should work now that we switched to feedparser, let me know if y'all still have issues on the latest :dev build and I'll reopen.
@Ramblurr commented on GitHub (Apr 11, 2024):
@pirate I have several feeds I'd like to parse using this feature, but none of them are working. I'm using the :dev tag (image hash 5c0d2df58cf7ac9aa314129de16121c991172e34081ce132e2575fe7160d5b1b).
Samples are attached (.txt added to bypass github mimetype detector)
all.rss.txt
user-favorites.xml.txt
The sources are: