Mirror of https://github.com/ArchiveBox/ArchiveBox.git, synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #135] Shaarli RSS parsing falls back to full-text and imports unneeded URLs from metadata fields #3112
Originally created by @mawmawmawm on GitHub (Jan 30, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/135
It looks like Shaarli feeds are not being parsed correctly and markup is being included in the link structure (much like ticket #134 for Pocket). Also, Shaarli detail and tag pages are being parsed as source links, making the import much slower and cluttering the archive.
You can use the public Shaarli demo to reproduce this: there's a demo instance (user: demo / password: demo) running at https://demo.shaarli.org/.
The Atom feed then looks e.g. like this (with just one link; this is what's being parsed as the input file).
Note that ArchiveBox wants to include 8 links from this:
Most likely because 8 instances of `http://` were found (that's just my speculation). However, the expected behaviour should be that only the source link is parsed/added, not the Shaarli detail pages like `https://demo.shaarli.org/?cEV4vw` that contain nothing but the actual link to the source (again). IMO that doesn't make sense. It's even "worse" if a link has tags, because every tag then leads to a new link being crawled.

```
docker-compose exec archivebox /bin/archive https://demo.shaarli.org/?do=atom
```

(note the `</id>` at the end of the links)

@mawmawmawm commented on GitHub (Jan 30, 2019):
The same is true e.g. for wallabag feeds that contain links in the fulltext RSS feed with (maybe) broken HTML - they're
These are being parsed as (example number 4 from above)...
leading to invalid links / 404s.
@pirate commented on GitHub (Jan 30, 2019):
Thanks for reporting this.
I think the fixes will be simple:

- exclude `<` or `>` characters from parsed URLs, so we never include closing tags by accident

The intended behavior is to only take the actual page to archive from each RSS entry. I'm favoring usability over completeness here: I don't want archives filled with garbage URLs on every import, and if a user wants to add those URLs manually, they can force full-text parsing by passing the URLs individually via stdin.

It might be worth just doing this ticket first, it will solve both these problems: https://github.com/pirate/ArchiveBox/issues/123
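The closing-tag problem noted above (`</id>` stuck to the end of links) can be illustrated with two hypothetical regexes; these are for illustration only, not ArchiveBox's actual patterns:

```python
import re

text = '<id>https://demo.shaarli.org/?cEV4vw</id>'

# A greedy "anything non-whitespace" URL regex drags the closing tag along:
greedy = re.findall(r'https?://\S+', text)
print(greedy)  # ['https://demo.shaarli.org/?cEV4vw</id>']

# Excluding < and > from the character class stops at the tag boundary:
safe = re.findall(r'https?://[^\s<>]+', text)
print(safe)  # ['https://demo.shaarli.org/?cEV4vw']
```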
@mawmawmawm commented on GitHub (Jan 30, 2019):
Thanks for looking into it. I agree, switching to a different / more robust parser should solve all this. Not sure how you would exclude the full-text links, though; I guess you would need to limit the parsing to e.g. `<entry>` → `<link href="..."` to omit all the other links.

@pirate commented on GitHub (Jan 30, 2019):
Full-text parsing is only ever used as a fallback if all the other parsing methods fail, so once the RSS parser is working again it should automatically ignore those other links. (the RSS parser knows to only take the main ones)
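The `<entry>` → `<link href="...">` rule suggested above can be sketched with Python's stdlib XML parser. This is a hypothetical illustration of the intended behavior, not ArchiveBox's actual parser, and the feed snippet is made up:

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

feed = '''<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="https://demo.shaarli.org/?do=atom" rel="self"/>
  <entry>
    <id>https://demo.shaarli.org/?cEV4vw</id>
    <link href="https://example.com/article"/>
    <content type="html">&lt;a href="https://demo.shaarli.org/?searchtags=foo"&gt;foo&lt;/a&gt;</content>
  </entry>
</feed>'''

root = ET.fromstring(feed)
# Take only the href of each <entry>'s <link>, ignoring the feed-level
# <link>, the <id> permalink, and URLs embedded in the <content> HTML.
urls = [link.get('href')
        for entry in root.findall(ATOM + 'entry')
        for link in entry.findall(ATOM + 'link')]
print(urls)  # ['https://example.com/article']
```

A structured parse like this never sees the detail pages, tag pages, or namespace URLs, because it only reads one attribute per entry instead of scanning the raw text for anything URL-shaped.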
@pirate commented on GitHub (Feb 1, 2019):
I fixed the regex in c37941e; give the latest master commit a try.

@mawmawmawm commented on GitHub (Feb 1, 2019):
Thanks - should work according to the code change (I reviewed it). I will have to wait until the RSS parser is working again, though; otherwise my library will be flooded with additional links.
@pirate commented on GitHub (Feb 5, 2019):
OK, the RSS parser is fixed. There's now a dedicated parser for the Shaarli export format. Give it a try with the latest version of master.
@mawmawmawm commented on GitHub (Feb 6, 2019):
Hey there,
thanks for working on this, but it still seems to be broken / full-text parsing still being used.
I added a link to the shaarli demo instance and then added the atom feed:
```
docker-compose exec archivebox /bin/archive https://demo.shaarli.org/?do=atom
```

The result is this XML file in `sources`:

"Garbage" links like `https://demo.shaarli.org/?MdkVOw` or `https://demo.shaarli.org/?searchtags=` are still being pulled in; the parser doesn't seem to just look for `<entry>` → `<link href="..."`:

@pirate commented on GitHub (Feb 6, 2019):
Very strange. You can see in the output it now says `Adding 11 new links to index from /data/sources/demo.shaarli.org-1549427254.txt (Plain Text format)`; notice the `Plain Text format` at the end. It will say `Shaarli RSS format` if it gets it right.

I just ran it with exactly the XML you provided above, and it parsed it correctly...

I suspect there's some difference between our setups that's causing this. If you have a moment, do you mind uncommenting this line in `parse.py`:

```python
# print('[!] Parser {} failed: {} {}'.format(parser_name, err.__class__.__name__, err))
```

and running it again to see why the Shaarli parser fails? It should print something like `[!] Parser Shaarli RSS failed: ...some error details here...`; paste that error here. Thanks for helping debug this.
@mawmawmawm commented on GitHub (Feb 7, 2019):
Question - we're talking about the `parse.py` in the `archivebox` folder, correct? I didn't see any other one :) Line 72 looks different there, and I can't seem to find `print('[!] Parser {} failed: {} {}'.format(parser_name, err.__class__.__name__, err))` or fragments thereof anywhere else in that file?!

@pirate commented on GitHub (Feb 7, 2019):
Ah sorry I forgot to push it to master! It was just on my local branch. Try pulling master and uncommenting that line now.
@mawmawmawm commented on GitHub (Feb 9, 2019):
Took me a while, sorry. Here we go. I uncommented the line, found on line 75 in `parse.py`.
It looks like there's an extra colon in the shaarli timestamp that is not being parsed correctly.
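The extra colon is most likely the ISO 8601 timezone offset separator (e.g. `+01:00`), which `strptime`'s `%z` directive only accepts since Python 3.7; on older versions it raises a `ValueError`. A minimal sketch, assuming that's the cause (the timestamp value and helper name are made up for illustration):

```python
from datetime import datetime

def parse_shaarli_ts(ts):
    # Normalize "+01:00" to "+0100" so %z also works on Python < 3.7.
    if len(ts) >= 6 and ts[-3] == ':' and ts[-6] in '+-':
        ts = ts[:-3] + ts[-2:]
    return datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S%z')

print(parse_shaarli_ts('2019-02-09T15:00:00+01:00').isoformat())
# 2019-02-09T15:00:00+01:00
```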
full output:
@mawmawmawm commented on GitHub (Feb 9, 2019):
Same is true for wallabag feeds btw, maybe the same fix...

Input XML (snippet):
@pirate commented on GitHub (Feb 11, 2019):
At a conference right now and have a busy week ahead, so apologies if I don't get around to fixing this for a bit.
@mawmawmawm commented on GitHub (Feb 11, 2019):
No rush at all (at least for me). Thanks for doing all this btw. Let me know if you need any additional input.
@pirate commented on GitHub (Feb 11, 2019):
A redacted copy of your `/data/sources/demo.shaarli.org-1549685314.txt` would be helpful, thx.

@mawmawmawm commented on GitHub (Feb 12, 2019):
Sorry, I don't have that anymore. But it was basically just the shaarli demo with one link added to it.
@pirate commented on GitHub (Feb 19, 2019):
@mawmawmawm I think I fixed it (in eff0100); pull the latest master and give it a shot. Comment if it's still broken and I'll reopen the issue.

@mawmawmawm commented on GitHub (Feb 20, 2019):
Sorry, it's still happening for me after a full rebuild... it's importing Shaarli as plain text.
@pirate commented on GitHub (Feb 27, 2019):
I just ran the latest master on the sample Shaarli export you provided (https://github.com/pirate/ArchiveBox/issues/135#issuecomment-460898443) and it worked as expected (imported 4 links, parsed as Shaarli RSS format). If the latest master is still failing for you, post your export here; I need it to be able to debug the parsing.
@jeanregisser commented on GitHub (Mar 7, 2019):
I tried an RSS import from wallabag using the latest master (github.com/pirate/ArchiveBox@4a7f1d57d5) and it only found 1 link, though there are 50 of them in the feed.
app.wallabag.it-1551958696.txt
Let me know if you need more info.
@pirate commented on GitHub (Mar 25, 2019):
Sorry for the delay, just fixed this @jeanregisser in 58c9b47. Pull the latest master and give it a try. Comment back here if it doesn't work and I'll reopen the ticket.

The issue was that wallabag adds a bunch of newlines between the RSS items, which broke my crappy parsing code.
@mawmawmawm there have been lots of parser fixes since my last comment here, can you also give the latest master a shot and report back?
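The wallabag newline problem described above can be sketched as follows; this is a hypothetical illustration of why line-sensitive parsing breaks on such feeds, not the actual code changed in 58c9b47:

```python
import re

# A wallabag-style feed with extra blank lines scattered between items:
rss = ('<rss><channel>\n'
       '<item>\n\n\n<link>https://example.com/a</link>\n</item>\n'
       '\n\n<item>\n<link>https://example.com/b</link>\n</item>\n'
       '</channel></rss>')

# Collapsing runs of whitespace first makes a simple regex tolerant of
# however many newlines the generator inserts between items:
flat = re.sub(r'\s+', ' ', rss)
links = re.findall(r'<link>(.*?)</link>', flat)
print(links)  # ['https://example.com/a', 'https://example.com/b']
```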
@mawmawmawm commented on GitHub (Apr 1, 2019):
Sorry for the late reply - tried it 3 days ago and it was working fine, except for the `wget` issue mentioned in the other ticket.

@amette commented on GitHub (May 3, 2019):
This error still persists for me. I have Shaarli v0.10.4 (latest) and ArchiveBox master from git. Shaarli produces for example the following XML (original, but domain redacted):
ArchiveBox correctly imports the gcemetery.co link once, but also imports the shaarli.example.com link once.
@sebw commented on GitHub (Apr 12, 2022):
Hey @pirate I just discovered about archivebox, that's awesome and a great extension to Shaarli!
Unfortunately the problem persists.
Running dev branch of Shaarli.
Also, besides the Shaarli links, I would not expect w3.org and purl.org links to appear.
@pirate commented on GitHub (Apr 12, 2022):
w3.org and purl.org are expected in full-text parsing mode (which it's falling back to due to a bug) because they are linked to in the RSS even though the links aren't visible. They won't be archived multiple times, so I recommend leaving them for now and ignoring those entries.
I've re-opened the issue to track fixing it, PRs to fix are welcome.
@wokawoka commented on GitHub (Jun 16, 2023):
Is this issue still relevant?
@pirate commented on GitHub (Jun 20, 2023):
Yes, it hasn't been fixed yet. PRs are welcome. I haven't gotten to it as I don't use Shaarli myself.
@melyux commented on GitHub (Jul 11, 2023):
I'm also getting the unnecessary links (like http://www.w3.org/2005/Atom) with any kind of normal Atom RSS feed. Is that part of this bug?

@pirate commented on GitHub (Aug 16, 2023):
Yes, because it falls back to URL parsing in plain text mode, it'll archive every single string that looks like a URL. Using a proper RSS parser library to fix the RSS parser bugs should result in not importing those w3 Atom schema reference URLs.
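The fallback behavior described above can be demonstrated in a few lines: a naive URL regex run over the raw feed text matches the Atom namespace declaration just as readily as the real entry link (the regex here is illustrative, not ArchiveBox's exact pattern):

```python
import re

xml = ('<feed xmlns="http://www.w3.org/2005/Atom">'
       '<entry><link href="https://example.com/post"/></entry></feed>')

# Plain-text fallback: grab every string that looks like a URL.
url_re = re.compile(r'https?://[^\s"<>]+')
print(url_re.findall(xml))
# ['http://www.w3.org/2005/Atom', 'https://example.com/post']
```

A structured RSS/Atom parse never emits the namespace URL, because it reads the `href` attribute of each entry's `<link>` instead of scanning the raw bytes.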