[GH-ISSUE #106] Link parsing: Pinboard private feeds don't seem to get parsed properly #1583

Closed
opened 2026-03-01 17:51:56 +03:00 by kerem · 19 comments
Owner

Originally created by @drpfenderson on GitHub (Oct 18, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/106

I would love to have the cron job that monitors my Pocket feed also monitor my private Pinboard feed. However, no matter which method from the instructions I use to pass the feed to bookmark-archiver, each one fails in its own way.

If I pass a public feed, like http://feeds.pinboard.in/rss/u:username/, it works fine. But if I pass a private feed, like https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/, it errors out. I have tried the RSS, JSON, and Text feeds, and none work.

Examples here (I've simply replaced the actual feed I used for testing with the demo URL Pinboard provides):
./archive "https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/"

[*] [2018-10-18 21:14:03] Downloading https://feeds.pinboard.in/rss/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897243.txt
[X] No links found :(

./archive "https://feeds.pinboard.in/json/secret:xxxx/u:username/private/"

[*] [2018-10-18 21:13:46] Downloading https://feeds.pinboard.in/json/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897226.txt
Traceback (most recent call last):
  File "./archive", line 161, in <module>
    links = merge_links(archive_path=out_dir, import_path=source)
  File "./archive", line 53, in merge_links
    raw_links = parse_links(import_path)
  File "/home/USERNAME/datahoarding/bookmark-archiver/archiver/parse.py", line 54, in parse_links
    links += list(parser_func(file))
  File "/home/USERNAME/bookmark-archiver/archiver/parse.py", line 108, in parse_json_export
    url = erg['url']
KeyError: 'url'

./archive "https://feeds.pinboard.in/text/secret:xxxx/u:username/private/"

[*] [2018-10-18 21:17:57] Downloading https://feeds.pinboard.in/text/secret:xxxx/u:username/private/ > output/sources/feeds.pinboard.in-1539897477.txt
[X] No links found :(

Even though the script says that links are not found, they are definitely there, and simply pasting the URL into a browser outputs the feed in the proper format. I used this script successfully with other methods, like the Pinboard manual export, Pocket manual export AND RSS feed, and browser export. Is this just not a supported method for importing/monitoring?


@pirate commented on GitHub (Oct 19, 2018):

Looks like there's some difference in the JSON output format for private feeds that's breaking the parser. Can you post a copy of output/sources/feeds.pinboard.in-1539897226.txt in a gist somewhere (redacted/edited to hide the links if you want)?


@drpfenderson commented on GitHub (Oct 19, 2018):

@pirate Here is a link to the output of that file.

https://gist.github.com/drpfenderson/245c99f148b30cbf83dd3588c2fb0885


@f0086 commented on GitHub (Oct 19, 2018):

I ran into the same problem. I solved it with a little Go program that logs in to Pinboard and clicks the actual "backup my bookmarks in legacy Netscape format" button -- which works fine for me.

package main

import (
  "gopkg.in/headzoo/surf.v1"
  "os"
  "flag"
)

var username = flag.String("username", "", "pinboard username")
var password = flag.String("password", "", "pinboard password")

func main() {
  flag.Parse()

  bow := surf.NewBrowser()
  err := bow.Open("https://pinboard.in/")
  if err != nil {
    panic(err)
  }

  form, formErr := bow.Form("form[name=login]")
  if formErr != nil {
    panic(formErr)
  }

  form.Input("username", *username)
  form.Input("password", *password)
  // Keep the error returned by Submit so the panic reports the real failure.
  err = form.Submit()
  if err != nil {
    panic(err)
  }

  err = bow.Open("https://pinboard.in/export/format:html/")
  if err != nil {
    panic(err)
  }

  bow.Download(os.Stdout)
}
$ export GOPATH="$PWD"
$ go get gopkg.in/headzoo/surf.v1
$ go build -o fuPin src/aaron-fischer.net/fupin/main.go
$ ./fuPin -username=[USERNAME] -password=[PASSWORD] > bookmarks.html

@drpfenderson commented on GitHub (Nov 8, 2018):

Do you still need my Gist up for this? Or can I make it private?


@pirate commented on GitHub (Nov 12, 2018):

I only need one or two links in the file to debug this, so if you can keep a version up with only 1 or two links (can be example.com) in the same format, that would be helpful.


@f0086 commented on GitHub (Nov 19, 2018):

From the settings->backup page:

Legacy HTML (seems to be broken HTML/XML?)

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Pinboard Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL>
<p>

<DT><A HREF="https://github.com/trailofbits/algo" ADD_DATE="1542616733" PRIVATE="1" TOREAD="1" TAGS="vpn,scripts,toread">Algo VPN scripts</A>
<DT><A HREF="http://www.ulisp.com/" ADD_DATE="1542374412" PRIVATE="1" TOREAD="1" TAGS="arduino,avr,embedded,lisp,toread">uLisp</A>

</DL>
</p>

XML

<?xml version="1.0" encoding="UTF-8"?>
	<posts user="aaronmueller">
<post href="https://github.com/trailofbits/algo" time="2018-11-19T08:38:53Z" description="Algo VPN scripts" extended="" tag="vpn scripts" hash="18d708f67bb26d843b1cac4530bb52aa"  shared="no" toread="yes" />
<post href="http://www.ulisp.com/" time="2018-11-16T13:20:12Z" description="uLisp" extended="" tag="arduino avr embedded lisp" hash="2a17ae95925a03a5b9bb38cf7f6c6f9b"  shared="no" toread="yes" />
</posts>

JSON

[{"href":"https:\/\/github.com\/trailofbits\/algo","description":"Algo VPN scripts","extended":"","meta":"62325ba3b577683aee854d7f191034dc","hash":"18d708f67bb26d843b1cac4530bb52aa","time":"2018-11-19T08:38:53Z","shared":"no","toread":"yes","tags":"vpn scripts"},
{"href":"http:\/\/www.ulisp.com\/","description":"uLisp","extended":"","meta":"7bd0c0ef31f69d1459e3d37366e742b3","hash":"2a17ae95925a03a5b9bb38cf7f6c6f9b","time":"2018-11-16T13:20:12Z","shared":"no","toread":"yes","tags":"arduino avr embedded lisp"}]
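
The `KeyError: 'url'` in the original report is consistent with this sample: each entry stores its address under `href`, not `url`. A minimal sketch of a parser that accepts either key might look like the following. The function name and the returned dict layout are assumptions made for illustration; this is not ArchiveBox's actual `parse_json_export`.

```python
import json
from datetime import datetime, timezone


def parse_pinboard_json(text):
    """Yield one dict per bookmark from a Pinboard-style JSON array.

    Hypothetical helper for illustration only.
    """
    for entry in json.loads(text):
        # The sample above uses "href"; fall back to "url" for other exports.
        url = entry.get("href") or entry.get("url")
        if not url:
            continue  # skip entries without a usable address
        raw_time = entry.get("time")
        timestamp = None
        if raw_time:
            # Sample timestamps look like 2018-11-19T08:38:53Z (UTC).
            timestamp = datetime.strptime(
                raw_time, "%Y-%m-%dT%H:%M:%SZ"
            ).replace(tzinfo=timezone.utc)
        yield {
            "url": url,
            "title": entry.get("description") or None,
            "timestamp": timestamp,
            "tags": entry.get("tags", "").split(),
        }


# One entry trimmed from the sample above; "\/" is a valid JSON escape for "/".
sample = r'[{"href":"https:\/\/github.com\/trailofbits\/algo","description":"Algo VPN scripts","time":"2018-11-19T08:38:53Z","tags":"vpn scripts"}]'
links = list(parse_pinboard_json(sample))
```

Using `.get()` instead of indexing means a missing key skips the entry rather than aborting the whole import.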

Private RSS feed:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://web.resource.org/cc/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel rdf:about="http://pinboard.in">
    <title>Pinboard (private aaronmueller)</title>
    <link>https://pinboard.in/u:aaronmueller/private/</link>
    <description></description>
    <items>
      <rdf:Seq>
        <rdf:li rdf:resource="https://mehkee.com/"/>
        <rdf:li rdf:resource="https://qmk.fm/"/>
      </rdf:Seq>
    </items>
  </channel>

  <item rdf:about="https://mehkee.com/">
    <title>Mehkee - Mechanical Keyboard Parts &amp; Accessories</title>
    <dc:date>2018-11-08T21:29:32+00:00</dc:date>
    <link>https://mehkee.com/</link>
    <dc:creator>aaronmueller</dc:creator>
    <dc:subject>keyboard gadget diy</dc:subject>
    <dc:source>http://pinboard.in/</dc:source>
    <dc:identifier>http://pinboard.in/u:aaronmueller/b:xxx/</dc:identifier>
    <taxo:topics>
      <rdf:Bag>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:keyboard"/>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:gadget"/>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:diy"/>
      </rdf:Bag>
    </taxo:topics>
  </item>
  <item rdf:about="https://qmk.fm/">
    <title>QMK Firmware - An open source firmware for AVR and ARM based keyboards</title>
    <dc:date>2018-11-06T22:36:21+00:00</dc:date>
    <link>https://qmk.fm/</link>
    <dc:creator>aaronmueller</dc:creator>
    <dc:subject>firmware keyboard</dc:subject>
    <dc:source>http://pinboard.in/</dc:source>
    <dc:identifier>http://pinboard.in/u:aaronmueller/b:xxx/</dc:identifier>
    <taxo:topics>
      <rdf:Bag>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:firmware"/>
        <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:keyboard"/>
      </rdf:Bag>
    </taxo:topics>
  </item>
</rdf:RDF>

@pirate commented on GitHub (Feb 4, 2019):

Can you try the latest master? It might work now... although it might try to import all the extra pinboard links that aren't articles too.


@f0086 commented on GitHub (Feb 4, 2019):

Sorry, it does not work (or am I missing something?)
It downloads the bookmarks, but then hangs forever. This is the traceback after hitting CTRL+C:

└─ $ ▶ FETCH_PDF=False TIMEOUT=20 ONLY_NEW=True SUBMIT_ARCHIVE_DOT_ORG=False CHECK_SSL_VALIDITY=False WGET_USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/2010" ./archive "https://feeds.pinboard.in/rss/secret:xxx/u:yyy/"
[*] [2019-02-04 20:23:46] Downloading https://feeds.pinboard.in/rss/secret:xxx/u:yyy/ > output/sources/feeds.pinboard.in-1549308226.txt
^CTraceback (most recent call last):                                                                                                                                     
  File "./archive", line 189, in <module>
    links = merge_links(archive_path=out_dir, import_path=source, only_new=False)
  File "./archive", line 62, in merge_links
    raw_links = parse_links(import_path)
  File "/tmp/ArchiveBox/archivebox/parse.py", line 59, in parse_links
    links += list(parser_func(file))
  File "/tmp/ArchiveBox/archivebox/parse.py", line 271, in parse_plain_text
    'title': fetch_page_title(url),
  File "/tmp/ArchiveBox/archivebox/util.py", line 236, in fetch_page_title
    html_content = urllib.request.urlopen(url, timeout=10).read().decode('utf-8')
  File "/usr/lib/python3.7/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.7/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/usr/lib/python3.7/urllib/request.py", line 543, in _open
    '_open', req)
  File "/usr/lib/python3.7/urllib/request.py", line 503, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.7/urllib/request.py", line 1345, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.7/urllib/request.py", line 1320, in do_open
    r = h.getresponse()
  File "/usr/lib/python3.7/http/client.py", line 1321, in getresponse
    response.begin()
  File "/usr/lib/python3.7/http/client.py", line 296, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.7/http/client.py", line 257, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
KeyboardInterrupt

@pirate commented on GitHub (Feb 4, 2019):

I'm assuming you're importing a lot of links; if so, that's normal. It can take up to 10s per link to fetch the title if no title was found in the Pinboard import.
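
For context, the traceback above shows `fetch_page_title` calling `urllib.request.urlopen(url, timeout=10)`. A rough sketch of that kind of helper follows. The body, the regex, and the error handling are assumptions for illustration, not the code in `archivebox/util.py`.

```python
import re
import urllib.request

# Hypothetical pattern: grab whatever sits between <title> and </title>.
TITLE_REGEX = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the stripped <title> text, or None if the page has none."""
    match = TITLE_REGEX.search(html)
    return match.group(1).strip() if match else None


def fetch_page_title(url, timeout=10):
    """Download a page and return its title, or None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            html = response.read().decode("utf-8", errors="replace")
    except Exception:
        # Unreachable host, timeout, HTTP error: treat all as "no title".
        return None
    return extract_title(html)
```

Note that the `timeout` argument bounds each blocking socket operation rather than the whole download, so a server that trickles data slowly can keep a single link busy for longer than 10 seconds.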


@f0086 commented on GitHub (Feb 4, 2019):

You are right, I just needed to wait. But it still did not work: the archiver tried to download each tag(!) for each bookmark, like "http://pinboard.in/u:yyy/t:lectures". Currently I do not have time to debug this further :(
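
The tag URLs are what one would expect if the feed ends up being scanned as plain text: a generic URL pattern matches every address in the file, including the `rdf:resource` tag links. The sketch below illustrates the effect. `URL_REGEX` is a hypothetical pattern written for this example, not the one ArchiveBox uses.

```python
import re

# Hypothetical "find anything that looks like a URL" pattern.
URL_REGEX = re.compile(r"https?://[^\s<>\"']+")

snippet = """<item rdf:about="https://qmk.fm/">
  <link>https://qmk.fm/</link>
  <dc:source>http://pinboard.in/</dc:source>
  <rdf:li rdf:resource="http://pinboard.in/u:yyy/t:lectures"/>
</item>"""

found = URL_REGEX.findall(snippet)
# Deduplicate while keeping first-seen order.
unique = list(dict.fromkeys(found))
```

Only one of the unique matches is the bookmarked article; the other two are Pinboard's own source and tag pages, which is why a format-aware parser is needed for this feed.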


@pirate commented on GitHub (Feb 5, 2019):

Ok I just made a bunch of fixes, and tested it on all four of the snippets you posted above. All of them worked correctly and only extracted the article links, without all the other pinboard tag urls.

Give the latest version of master a try.


@f0086 commented on GitHub (Feb 5, 2019):

I am very sorry, but it does not work. You are using the wrong URLs. You need to use the URL in the <link></link> tag. I will have a look at this.

#123 seems related to this :)

EDIT: Ok, I had a quick look at the code, but did not find a proper solution. The xml.etree.ElementTree component is not working as expected I think, but I am not a Python guy, so not sure about that. My setup (see above) works great for me, so I have no interest in spending an evening debugging this for now, sorry :( Maybe it is not worth it anyway, because of #123 ?!?
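
One common stumbling block with `xml.etree.ElementTree` on RSS 1.0 feeds like the one posted earlier is that every element lives in an XML namespace, so a bare `find('link')` returns `None`. Whether that is the actual cause here is an assumption. The sketch below shows how the `<link>` of each `<item>` could be read with an explicit namespace map; the function name and the output dict are hypothetical.

```python
import xml.etree.ElementTree as ET

# Prefixes are local to this script; the URIs come from the feed's root element.
NS = {
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def parse_pinboard_rss(text):
    """Yield one dict per <item>, using only its <link>, title, date and tags."""
    root = ET.fromstring(text)
    for item in root.findall("rss:item", NS):
        link = item.findtext("rss:link", default=None, namespaces=NS)
        if not link:
            continue  # an item without <link> has nothing to archive
        yield {
            "url": link.strip(),
            "title": item.findtext("rss:title", default=None, namespaces=NS),
            "date": item.findtext("dc:date", default=None, namespaces=NS),
            "tags": item.findtext("dc:subject", default="", namespaces=NS).split(),
        }


# Trimmed-down version of the private feed sample from this thread.
SAMPLE = """<rdf:RDF xmlns="http://purl.org/rss/1.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/">
  <item rdf:about="https://qmk.fm/">
    <title>QMK Firmware</title>
    <dc:date>2018-11-06T22:36:21+00:00</dc:date>
    <link>https://qmk.fm/</link>
    <dc:subject>firmware keyboard</dc:subject>
    <taxo:topics><rdf:Bag>
      <rdf:li rdf:resource="http://pinboard.in/u:aaronmueller/t:firmware"/>
    </rdf:Bag></taxo:topics>
  </item>
</rdf:RDF>"""

links = list(parse_pinboard_rss(SAMPLE))
```

Because only `<link>` is read, the `rdf:resource` tag URLs inside `<taxo:topics>` never reach the result.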


@drpfenderson commented on GitHub (Feb 5, 2019):

Seems to work for me on the most recent master (ce257949b4). :) Thanks a ton.

My original issue doesn't seem to be the same problem that @f0086 is dealing with.


@pirate commented on GitHub (Feb 7, 2019):

@f0086 when you get a chance, do you mind pulling the latest master and trying it? I've made a bunch of fixes to the parsers in the last 3 days, and now it'll tell you exactly why the parser fails if you uncomment this line:

archivebox/parse.py:75

# print('[!] Parser {} failed: {} {}'.format(parser_name, err.__class__.__name__, err))

If it still doesn't work, after uncommenting that line you can copy/paste the error output here and I'll debug it for you :)
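
To make the behaviour behind that line concrete, here is a simplified sketch of a "try each parser in order, report why it failed, fall back to plain text" loop. Only the format of the message is taken from the line quoted above. The parser list, the function names, and the return value are assumptions and do not mirror `archivebox/parse.py`.

```python
import json


def parse_json(text):
    """Hypothetical strict parser: expects a JSON array of {"href": ...}."""
    return [entry["href"] for entry in json.loads(text)]


def parse_plain_text(text):
    """Hypothetical last-resort parser: any whitespace-separated URL."""
    return [w for w in text.split() if w.startswith(("http://", "https://"))]


# Ordered from most specific to most permissive.
PARSERS = [
    ("Pinboard JSON", parse_json),
    ("Plain Text", parse_plain_text),
]


def parse_links(text, verbose=True):
    """Return (links, parser_name) from the first parser that yields links."""
    for parser_name, parser_func in PARSERS:
        try:
            links = list(parser_func(text))
            if links:
                return links, parser_name
        except Exception as err:
            if verbose:
                print('[!] Parser {} failed: {} {}'.format(
                    parser_name, err.__class__.__name__, err))
    return [], None


links, used = parse_links("see https://example.com for details", verbose=False)
```

Swallowing the exception keeps the import going, but printing it is what reveals which format-specific parser rejected the file and why.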


@f0086 commented on GitHub (Feb 10, 2019):

Here we go:

└─ $ ▶ FETCH_PDF=False TIMEOUT=20 ONLY_NEW=True SUBMIT_ARCHIVE_DOT_ORG=False CHECK_SSL_VALIDITY=False WGET_USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/2010" ./archive "https://feeds.pinboard.in/rss/secret:yyy/u:zzz/private"
[*] [2019-02-10 21:17:13] Downloading https://feeds.pinboard.in/rss/secret:yyy/u:zzz/private > output/sources/feeds.pinboard.in-xxx.txt
[*] [2019-02-10 21:17:14] Parsing new links from output/sources/feeds.pinboard.in-xxx.txt and fetching titles...
[!] Parser Pinboard JSON failed: JSONDecodeError Expecting value: line 1 column 1 (char 0)
[!] Parser RSS failed: IndexError list index out of range
[!] Parser Pinboard RSS failed: AttributeError 'NoneType' object has no attribute 'text'
[!] Parser Medium RSS failed: AttributeError 'NoneType' object has no attribute 'findall'

@pirate commented on GitHub (Mar 1, 2019):

I think part of the issue was that I was fetching page titles without showing progress, so it looked like it was hanging forever / breaking when it was actually doing work.

That's all been changed significantly now, as I treat title fetching like any other archive method now instead of trying to do it during the parsing phase.

Try pulling the latest master and running it again. If you're still having issues, I'll need two things to debug it:

  1. A redacted copy of the failing import file output/sources/feeds.pinboard.in-xxx.txt
  2. The terminal output with that print statement on parse.py:56 uncommented

@f0086 commented on GitHub (Mar 9, 2019):

└─ $ ▶ FETCH_PDF=False TIMEOUT=20 ONLY_NEW=True SUBMIT_ARCHIVE_DOT_ORG=False CHECK_SSL_VALIDITY=False WGET_USER_AGENT="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/2010" ./archive "https://feeds.pinboard.in/rss/secret:XXX/u:YYY/private"
[*] [2019-03-09 17:43:21] Downloading https://feeds.pinboard.in/rss/secret:XXX/u:YYY/private
    > output/sources/feeds.pinboard.in-xxx.txt                                                                                                                    
[*] [2019-03-09 17:43:23] Parsing new links from output/sources/feeds.pinboard.in-xxx.txt...
[!] Parser Pinboard JSON failed: JSONDecodeError Expecting value: line 1 column 1 (char 0)
[!] Parser RSS failed: IndexError list index out of range
[!] Parser Pinboard RSS failed: AttributeError 'NoneType' object has no attribute 'text'
[!] Parser Medium RSS failed: AttributeError 'NoneType' object has no attribute 'findall'
    > Adding 207 new links to index (parsed import as Plain Text)
[*] [2019-03-09 17:43:23] Updating main index files...
...

image

<?xml version="1.0" encoding="UTF-8"?>
 <rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://web.resource.org/cc/" xmlns:syn="http://purl.org/rss/1.0/modules/syndication/" xmlns:admin="http://webns.net/mvcb/">
  <channel rdf:about="http://pinboard.in">
    <title>Pinboard (private YYY)</title>
    <link>https://pinboard.in/u:YYY/private/</link>
    <description></description>
    <items>
      <rdf:Seq>
	<rdf:li rdf:resource="https://bugs.archlinux.org/task/56957"/>
	<rdf:li rdf:resource="https://www.oldunreal.com/wiki/index.php?title=Beginner%27s_Guide_to_Unrealscript"/>
      </rdf:Seq>
    </items>
    </channel>
<item rdf:about="https://bugs.archlinux.org/task/56957">
    <title>FS#56957 : [systemd] systemd-networkd crash after updating to linux 4.14.11</title>
    <dc:date>2019-02-10T19:46:52+00:00</dc:date>
    <link>https://bugs.archlinux.org/task/56957</link>
    <dc:creator>YYY</dc:creator><description><![CDATA[<blockquote>Flyspray, a Bug Tracking System written in PHP.</blockquote>]]></description>
<dc:identifier>http://pinboard.in/u:YYY/b:ZZZ/</dc:identifier>
</item>
<item rdf:about="https://www.oldunreal.com/wiki/index.php?title=Beginner%27s_Guide_to_Unrealscript">
    <title>UnrealScript Beginners Guide</title>
    <dc:date>2019-02-08T14:24:34+00:00</dc:date>
    <link>https://www.oldunreal.com/wiki/index.php?title=Beginner%27s_Guide_to_Unrealscript</link>
    <dc:creator>YYY</dc:creator><dc:subject>unreal</dc:subject>
<dc:source>http://pinboard.in/</dc:source>
<dc:identifier>http://pinboard.in/u:YYY/b:ZZZ/</dc:identifier>
<taxo:topics><rdf:Bag>	<rdf:li rdf:resource="http://pinboard.in/u:YYY/t:unreal"/>
</rdf:Bag></taxo:topics>
</item>
</rdf:RDF>

@pirate commented on GitHub (Mar 19, 2019):

Fixed in f9a7c53, give the latest master a shot and let me know if it works.


@f0086 commented on GitHub (Mar 21, 2019):

Looking good.
This will finally fix this issue, thank you!
