mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #106] Link parsing: Pinboard private feeds don't seem to get parsed properly #1583
Originally created by @drpfenderson on GitHub (Oct 18, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/106
I would love to have the cron job that monitors my Pocket feed also monitor my private Pinboard feed. However, no matter which method from the instructions I use to pass the feed to bookmark-archiver, each one fails in its own way.
If I pass a public feed, like http://feeds.pinboard.in/rss/u:username/, it works fine. But if I pass a private feed, like https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/, it errors out. I have tried the RSS, JSON, and Text feeds, and none work.
Examples here: (I've simply replaced the actual feed I used to test, with the demo URL Pinboard provides)
./archive "https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/"
./archive "https://feeds.pinboard.in/json/secret:xxxx/u:username/private/"
./archive "https://feeds.pinboard.in/text/secret:xxxx/u:username/private/"
Even though the script says that links are not found, they are definitely there, and simply pasting the URL into a browser outputs the feed in the proper format. I used this script successfully with other methods, like the Pinboard manual export, Pocket manual export AND RSS feed, and browser export. Is this just not a supported method for importing/monitoring?
@pirate commented on GitHub (Oct 19, 2018):
Looks like there's some difference in the output JSON format for private feeds that's breaking the parser. Can you post a copy of output/sources/feeds.pinboard.in-1539897226.txt in a gist somewhere (redacted/edited to hide the links if you want)?
@drpfenderson commented on GitHub (Oct 19, 2018):
@pirate Here is a link to the output of that file.
https://gist.github.com/drpfenderson/245c99f148b30cbf83dd3588c2fb0885
@f0086 commented on GitHub (Oct 19, 2018):
I ran into the same problem. I solved it with a little Go program that logs in to Pinboard and clicks the actual "backup my bookmarks in legacy Netscape format" button, which works fine for me.
@drpfenderson commented on GitHub (Nov 8, 2018):
Do you still need my Gist up for this? Or can I make it private?
@pirate commented on GitHub (Nov 12, 2018):
I only need one or two links in the file to debug this, so if you can keep a version up with only one or two links (they can be example.com) in the same format, that would be helpful.
@f0086 commented on GitHub (Nov 19, 2018):
From the settings->backup page:
Legacy HTML (seems to be broken HTML/XML?)
XML
JSON
Private RSS feed:
@pirate commented on GitHub (Feb 4, 2019):
Can you try the latest master? It might work now... although it might try to import all the extra pinboard links that aren't articles too.
@f0086 commented on GitHub (Feb 4, 2019):
Sorry, it does not work (or am I missing something?)
It downloads the bookmarks, but then hangs forever. This is the stack trace after hitting CTRL+C:
@pirate commented on GitHub (Feb 4, 2019):
I'm assuming you're importing a lot of links, if so, that's normal. It can take up to 10s per link to fetch the title if it didn't find a title in the pinboard import.
@f0086 commented on GitHub (Feb 4, 2019):
You are right, I just need to wait. But it did not work. The archiver tried to download each tag(!) for each bookmark like "http://pinboard.in/u:yyy/t:lectures". Currently I do not have time to debug this further :(
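The difference between grabbing every URL that appears in the feed text and reading only each item's own link can be sketched as below. This is only an illustration, not ArchiveBox's actual parser: the sample feed is made up (it is not a real Pinboard response), and the helper names local_name, item_links, and all_urls are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample; a real Pinboard feed may differ.
SAMPLE_FEED = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item rdf:about="https://example.com/article">
    <title>Example article</title>
    <link>https://example.com/article</link>
    <dc:subject>lectures</dc:subject>
    <description>See also http://pinboard.in/u:yyy/t:lectures</description>
  </item>
</rdf:RDF>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def item_links(xml_text):
    """Return only the text of <link> children of each <item> element."""
    root = ET.fromstring(xml_text)
    links = []
    for elem in root.iter():
        if local_name(elem.tag) != "item":
            continue
        for child in elem:
            if local_name(child.tag) == "link" and child.text:
                links.append(child.text.strip())
    return links


def all_urls(text):
    """Naive approach: every http(s) URL anywhere in the raw text."""
    return re.findall(r'https?://[^\s<>"]+', text)


print(item_links(SAMPLE_FEED))
print(all_urls(SAMPLE_FEED))
```

On this sample, the naive scan also picks up the namespace URLs and the tag page URL, while item_links returns only the bookmarked article. Matching on the local tag name is one way around namespaced tags, which otherwise make a plain search for "item" or "link" come back empty.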
@pirate commented on GitHub (Feb 5, 2019):
Ok I just made a bunch of fixes, and tested it on all four of the snippets you posted above. All of them worked correctly and only extracted the article links, without all the other pinboard tag urls.
Give the latest version of master a try.
@f0086 commented on GitHub (Feb 5, 2019):
I am very sorry, but it does not work. You are using the wrong URLs. You need to use the URL in the <link></link> tag. I will have a look at this.
#123 seems related to this :)
EDIT: Ok, I had a quick look at the code, but did not find a proper solution. The xml.etree.ElementTree component is not working as expected I think, but I am not a Python guy, so not sure about that. My setup (see above) works great for me, so I have no interest in spending an evening debugging this for now, sorry :( Maybe it is not worth it anyway, because of #123 ?!?
@drpfenderson commented on GitHub (Feb 5, 2019):
Seems to work for me on the most recent master (ce257949b4). :) Thanks a ton.
My original issue doesn't seem to be the same problem that @f0086 is dealing with.
@pirate commented on GitHub (Feb 7, 2019):
@f0086 when you get a chance, do you mind pulling the latest master and trying it? I've made a bunch of fixes to the parsers in the last 3 days, and now it'll tell you exactly why the parser fails if you uncomment this line:
archivebox/parse.py:75
If it still doesn't work, after uncommenting that line you can copy/paste the error output here and I'll debug it for you :)
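For readers following along, the general idea of trying several parsers in turn and reporting why each one failed can be sketched roughly like this. It is not the code in archivebox/parse.py: every function name here is hypothetical, the "url" JSON key is an assumption rather than Pinboard's documented field name, and a verbose flag stands in for the commented-out line mentioned above.

```python
import json
import xml.etree.ElementTree as ET


def parse_json_feed(text):
    """Hypothetical JSON parser; assumes a list of objects with a 'url' key."""
    return [entry["url"] for entry in json.loads(text)]


def parse_rss_feed(text):
    """Hypothetical RSS parser; reads the <link> child of each <item>."""
    root = ET.fromstring(text)
    links = []
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] != "item":
            continue
        for child in elem:
            if child.tag.rsplit("}", 1)[-1] == "link" and child.text:
                links.append(child.text.strip())
    return links


PARSERS = [("json", parse_json_feed), ("rss", parse_rss_feed)]


def parse_links(text, verbose=False):
    """Try each parser in order; collect the reason whenever one fails."""
    errors = []
    for name, parser in PARSERS:
        try:
            links = parser(text)
        except Exception as err:  # deliberately broad: any failure means "try the next parser"
            reason = f"{type(err).__name__}: {err}"
        else:
            if links:
                return name, links, errors
            reason = "no links found"
        errors.append((name, reason))
        if verbose:
            print(f"[!] {name} parser failed: {reason}")
    return None, [], errors


name, links, errors = parse_links(
    "<rss><channel><item><link>https://example.com/b</link></item></channel></rss>",
    verbose=True,
)
print(name, links)
```

Keeping the per-parser failure reasons, instead of silently falling through to the next parser, is what makes a report like "links not found" debuggable from the output alone.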
@f0086 commented on GitHub (Feb 10, 2019):
Here we go:
@pirate commented on GitHub (Mar 1, 2019):
I think part of the issue was that I was fetching page titles without showing progress, so it looks like it was hanging forever / breaking when actually it was doing stuff.
That's all been changed significantly now, as I treat title fetching like any other archive method now instead of trying to do it during the parsing phase.
Try pulling the latest master and running it again. If you're still having issues, I'll need two things to debug it:
output/sources/feeds.pinboard.in-xxx.txt
print statement on parse.py:56 uncommented
@f0086 commented on GitHub (Mar 9, 2019):
@pirate commented on GitHub (Mar 19, 2019):
Fixed in f9a7c53, give the latest master a shot and let me know if it works.
@f0086 commented on GitHub (Mar 21, 2019):
Looking good.
This will finally fix this issue, thank you!