[GH-ISSUE #135] Shaarli RSS parsing falls back to full-text and imports unneeded URLs from metadata fields #3112

Open
opened 2026-03-14 21:05:55 +03:00 by kerem · 30 comments
Owner

Originally created by @mawmawmawm on GitHub (Jan 30, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/135

It looks like Shaarli feeds are not being parsed correctly, and markup is being included in the link structure (much like ticket #134 for Pocket). It also looks like Shaarli detail and tag pages are being parsed as source links, making the import much slower and cluttering the archive.

You can reproduce this with the public Shaarli demo (user: demo / password: demo) running at https://demo.shaarli.org/.

  1. Add any link to this instance

The Atom feed then looks e.g. like this (with just one link; this is what gets parsed as the input file):

<?xml  version="1.0" encoding="UTF-8" ?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Shaarli demo (master)</title>
  <subtitle>Shaared links</subtitle>
  
    <updated>2019-01-30T06:06:01+00:00</updated>
  
  <link rel="self" href="https://demo.shaarli.org/?do=atom" />
  
  <author>
    <name>https://demo.shaarli.org/</name>
    <uri>https://demo.shaarli.org/</uri>
  </author>
  <id>https://demo.shaarli.org/</id>
  <generator>Shaarli</generator>
  
    <entry>
      <title>Aktuelle Trojaner-Welle: Emotet lauert in gefälschten Rechnungsmails | heise online</title>
      
        <link href="https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html" />
      
      <id>https://demo.shaarli.org/?cEV4vw</id>
      
        <published>2019-01-30T06:06:01+00:00</published>
        <updated>2019-01-30T06:06:01+00:00</updated>
      
      <content type="html" xml:lang="en"><![CDATA[<div class="markdown"><p>&#8212; <a href="https://demo.shaarli.org/?cEV4vw">Permalink</a></p></div>]]></content>
      
      
    </entry>
  
</feed>

Note that ArchiveBox wants to include 8 links from this:

Adding 8 new links from /data/sources/demo.shaarli.org-1548828643.txt to /data/index.json

Most likely because 8 instances of http:// were found (that's just my speculation).
However, the expected behaviour is that only the source link gets parsed and added, not Shaarli detail pages like https://demo.shaarli.org/?cEV4vw that contain nothing but another copy of the actual source link. IMO that doesn't make sense. It's even "worse" if a link has tags, because every tag then leads to yet another link being crawled.
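The expected behaviour could be sketched as an Atom parser that takes only each entry's <link href="..."> and ignores URLs in <id>, <author>, or namespace attributes. This is a minimal illustration under that assumption, not ArchiveBox's actual parser; the function name is made up:

```python
import xml.etree.ElementTree as ET

ATOM_NS = '{http://www.w3.org/2005/Atom}'

def parse_shaarli_atom(xml_text):
    """Return only the external target URL of each feed entry,
    skipping <id>, <author>, and other metadata URLs."""
    root = ET.fromstring(xml_text)
    urls = []
    for entry in root.findall(ATOM_NS + 'entry'):
        link = entry.find(ATOM_NS + 'link')
        if link is not None and link.get('href'):
            urls.append(link.get('href'))
    return urls
```

Run against a feed like the one above, this would yield only the heise.de article URL rather than the 8 links the full-text fallback finds.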

  2. Grab the Atom feed https://demo.shaarli.org/?do=atom and import it into ArchiveBox: docker-compose exec archivebox /bin/archive https://demo.shaarli.org/?do=atom
  3. You will see that markup fragments end up in the parser:
root@NASi:/volume1/docker/ArchiveBox/ArchiveBox-master# docker-compose exec archivebox /bin/archive https://demo.shaarli.org/?do=atom
[*] [2019-01-30 06:10:43] Downloading https://demo.shaarli.org/?do=atom > /data/sources/demo.shaarli.org-1548828643.txt
[+] [2019-01-30 06:11:02] Adding 8 new links from /data/sources/demo.shaarli.org-1548828643.txt to /data/index.json
[√] [2019-01-30 06:11:18] Updated main index files:
    > /data/index.json
    > /data/index.html
[▶] [2019-01-30 06:11:18] Updating files for 8 links in archive...
[+] [2019-01-30 06:11:27] "Aktuelle Trojaner-Welle: Emotet lauert in gefälschten Rechnungsmails | heise online - Shaarli demo (master)"
    https://demo.shaarli.org/?cEV4vw
    > /data/archive/1548828660 (new)
      > favicon
      > wget
        Got wget response code 8:
          Total wall clock time: 5.1s
          Downloaded: 20 files, 1.1M in 0.7s (1.54 MB/s)
        Some resources were skipped: 404 Not Found
        Run to see full output:
            cd /data/archive/1548828660;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828689 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/?cEV4vw
      > pdf
      > screenshot
      > dom
      > archive_org
      > git
      > media
      √ index.json
      √ index.html
[+] [2019-01-30 06:11:50] "Aktuelle Trojaner-Welle: Emotet lauert in gefälschten Rechnungsmails | heise online - Shaarli demo (master)"
    https://demo.shaarli.org/?cEV4vw</id>
    > /data/archive/1548828659 (new)
      > favicon
      > wget
        Got wget response code 8:
          Total wall clock time: 5.1s
          Downloaded: 20 files, 1.1M in 0.7s (1.54 MB/s)
        Some resources were skipped: 404 Not Found
        Run to see full output:
            cd /data/archive/1548828659;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828710 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/?cEV4vw</id>
      > pdf
      > screenshot
      > dom
      > archive_org
        Failed: Exception BadQueryException: Illegal character in query at index 32: https://demo.shaarli.org/?cEV4vw</id>
        Run to see full output:
            curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/?cEV4vw</id>
      > git
      > media
      √ index.json
      √ index.html
[+] [2019-01-30 06:12:10] "comments_outline_white"
    https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html
    > /data/archive/1548828658 (new)
      > favicon
      > wget
        Got wget response code 4:
          Total wall clock time: 38s
          Downloaded: 128 files, 6.0M in 12s (502 KB/s)
        Some resources were skipped: Got an error from the server
        Run to see full output:
            cd /data/archive/1548828658;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828730 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html
      > pdf
      > screenshot
      > dom
      > archive_org
      > git
      > media
        got youtubedl response code 1:
b'ERROR: Unable to extract container ID; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n'
        Failed: Exception Failed to download media
        Run to see full output:
            cd /data/archive/1548828658;
            youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://www.heise.de/security/meldung/Aktuelle-Trojaner-Welle-Emotet-lauert-in-gefaelschten-Rechnungsmails-4291268.html
      √ index.json
      √ index.html
[+] [2019-01-30 06:13:06] "https://demo.shaarli.org/</id>"
    https://demo.shaarli.org/</id>
    > /data/archive/1548828657 (new)
      > favicon
      > wget
        Got wget response code 8:
          https://demo.shaarli.org/%3C/id%3E:
          2019-01-30 06:13:07 ERROR 404: Not Found.
        Some resources were skipped: 404 Not Found
        Run to see full output:
            cd /data/archive/1548828657;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828786 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/</id>
      > pdf
      > screenshot
      > dom
      > archive_org
        Failed: Exception BadQueryException: Illegal character in path at index 25: https://demo.shaarli.org/</id>
        Run to see full output:
            curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/</id>
      > git
      > media
        got youtubedl response code 1:
b"WARNING: Could not send HEAD request to https://demo.shaarli.org/</id>: HTTP Error 404: Not Found\nERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n"
        Failed: Exception Failed to download media
        Run to see full output:
            cd /data/archive/1548828657;
            youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://demo.shaarli.org/</id>
      √ index.json
      √ index.html
[+] [2019-01-30 06:13:16] "https://demo.shaarli.org/</uri>"
    https://demo.shaarli.org/</uri>
    > /data/archive/1548828656 (new)
      > favicon
      > wget
        Got wget response code 8:
          https://demo.shaarli.org/%3C/uri%3E:
          2019-01-30 06:13:17 ERROR 404: Not Found.
        Some resources were skipped: 404 Not Found
        Run to see full output:
            cd /data/archive/1548828656;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828796 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/</uri>
      > pdf
      > screenshot
      > dom
      > archive_org
        Failed: Exception BadQueryException: Illegal character in path at index 25: https://demo.shaarli.org/</uri>
        Run to see full output:
            curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/</uri>
      > git
      > media
        got youtubedl response code 1:
b"WARNING: Could not send HEAD request to https://demo.shaarli.org/</uri>: HTTP Error 404: Not Found\nERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n"
        Failed: Exception Failed to download media
        Run to see full output:
            cd /data/archive/1548828656;
            youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://demo.shaarli.org/</uri>
      √ index.json
      √ index.html
[+] [2019-01-30 06:13:25] "Shaarli demo (master)"
    https://demo.shaarli.org/?do=atom
    > /data/archive/1548828655 (new)
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > archive_org
      > git
      > media
      √ index.json
      √ index.html
[+] [2019-01-30 06:13:36] "https://demo.shaarli.org/</name>"
    https://demo.shaarli.org/</name>
    > /data/archive/1548828655.0 (new)
      > favicon
      > wget
        Got wget response code 8:
          https://demo.shaarli.org/%3C/name%3E:
          2019-01-30 06:13:37 ERROR 404: Not Found.
        Some resources were skipped: 404 Not Found
        Run to see full output:
            cd /data/archive/1548828655.0;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548828816 --page-requisites --user-agent="ArchiveBox/b53251fe4 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://demo.shaarli.org/</name>
      > pdf
      > screenshot
      > dom
      > archive_org
        Failed: Exception BadQueryException: Illegal character in path at index 25: https://demo.shaarli.org/</name>
        Run to see full output:
            curl --location --head --max-time 60 --get https://web.archive.org/save/https://demo.shaarli.org/</name>
      > git
      > media
        got youtubedl response code 1:
b"WARNING: Could not send HEAD request to https://demo.shaarli.org/</name>: HTTP Error 404: Not Found\nERROR: Unable to download webpage: HTTP Error 404: Not Found (caused by <HTTPError 404: 'Not Found'>); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.\n"
        Failed: Exception Failed to download media
        Run to see full output:
            cd /data/archive/1548828655.0;
            youtube-dl --write-description --write-info-json --write-annotations --yes-playlist --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs -x -k --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata https://demo.shaarli.org/</name>
      √ index.json
      √ index.html
[+] [2019-01-30 06:13:45] "http://www.w3.org/2005/Atom"
    http://www.w3.org/2005/Atom
    > /data/archive/1548828644 (new)
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > archive_org
        Failed: Exception LiveDocumentNotAvailableException: http://www.w3.org/2005/Atom: live document unavailable: java.net.SocketTimeoutException: Read timed out
        Run to see full output:
            curl --location --head --max-time 60 --get https://web.archive.org/save/http://www.w3.org/2005/Atom
      > git
      > media
      √ index.json
      √ index.html
[√] [2019-01-30 06:15:28] Update of 8 links complete (4.17 min)
    - 8 entries skipped
    - 41 entries updated
    - 15 errors

(note the </id> at the end of the links)


@mawmawmawm commented on GitHub (Jan 30, 2019):

The same is true e.g. for wallabag feeds whose full-text RSS entries contain links in (possibly) broken HTML:

  1. They all get imported (I'm not sure if that's a bug or a feature), and I don't have a definitive opinion on it. On one hand it's nice that everything linked from the original article gets fetched; on the other hand this creates an enormous influx of links and clutter. If I wanted to save all these links, I would probably add them myself.
  2. They're not imported correctly when they're nested inside HTML, e.g. this shows up in my input feed (source URL, links at the very bottom):
<strong>
Links
in
diesem
Artikel:</strong><br /><small>
<code>
<strong>
[1]</strong> https://twitter.com/certbund/status/1089903361816739843</code></small><br /><small>
<code>
<strong>
[2]</strong> https://www.heise.de/security/artikel/Dynamit-Phishing-mit-Emotet-So-schuetzen-Sie-sich-vor-der-Trojaner-Welle-4243695.html</code></small><br /><small>
<code>
<strong>
[3]</strong> https://www.heise.de/meldung/Achtung-Dynamit-Phishing-Gefaehrliche-Trojaner-Welle-legt-ganze-Firmen-lahm-4241424.html</code></small><br /><small>
<code>
<strong>
[4]</strong> https://blog.talosintelligence.com/2019/01/return-of-emotet.html</code></small><br /><small>
<code>
<strong>
[5]</strong> mailto:des@heise.de</code></small><br /></p>

These are then parsed as (example number 4 from above):

[▶] [2019-01-30 06:56:07] Updating files for 33 links in archive...
[+] [2019-01-30 06:56:15] "https://blog.talosintelligence.com/2019/01/return-of-emotet.html</code></small><br"

leading to invalid links / 404s.


@pirate commented on GitHub (Jan 30, 2019):

Thanks for reporting this.

I think the fixes will be simple:

  1. make the URL parser stop at < or > characters, so we never include closing tags by accident
  2. fix the RSS parser for Shaarli's & wallabag's formats so that it doesn't fall back to the full-text parser (the one that matches every string containing http(s)://)

The intended behavior is to take only the actual page to archive from each RSS entry. I'm favoring usability over completeness here: I don't want archives filled with garbage URLs on every import, and if a user wants to add those URLs manually, they can force full-text parsing by passing the URLs individually via stdin.
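Fix 1 can be sketched with a tightened regex: excluding angle brackets, quotes, and whitespace from the URL character class means a match can never swallow a trailing closing tag. These are hypothetical patterns for illustration, not the actual ArchiveBox regex:

```python
import re

# naive: matches everything up to the next whitespace, including markup
NAIVE_URL_RE = re.compile(r'https?://\S+')
# safer: also stops at angle brackets and quotes, so '</id>' is never included
SAFER_URL_RE = re.compile(r'https?://[^\s<>"\']+')

text = 'see <id>https://demo.shaarli.org/?cEV4vw</id> for details'
NAIVE_URL_RE.search(text).group()  # 'https://demo.shaarli.org/?cEV4vw</id>'
SAFER_URL_RE.search(text).group()  # 'https://demo.shaarli.org/?cEV4vw'
```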

It might be worth just doing this ticket first, it will solve both these problems: https://github.com/pirate/ArchiveBox/issues/123


@mawmawmawm commented on GitHub (Jan 30, 2019):

Thanks for looking into it. I agree, switching to a different / more robust parser should solve all this. Not sure how you would exclude the full text / links therein; I guess you would need to limit the parsing to e.g. `<entry>` → `<link href="...` to omit all the other links.
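That `<entry>` → `<link href="...` restriction can be sketched with the standard library (a hypothetical snippet with a made-up feed, not ArchiveBox's parser):

```python
import xml.etree.ElementTree as ET

ATOM = '{http://www.w3.org/2005/Atom}'

def entry_links(feed_xml):
    """Yield only each <entry>'s <link href="...">, skipping permalinks
    in <id>/<content> and <category scheme="..."> tag-page URLs."""
    root = ET.fromstring(feed_xml)
    for entry in root.findall(ATOM + 'entry'):
        link = entry.find(ATOM + 'link')
        if link is not None:
            yield link.get('href')

feed = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="https://demo.shaarli.org/?do=atom" />
  <entry>
    <title>Example</title>
    <link href="https://example.com/article" />
    <id>https://demo.shaarli.org/?cEV4vw</id>
  </entry>
</feed>'''

print(list(entry_links(feed)))
```

Because only links found inside an `<entry>` are yielded, the feed-level `rel="self"` link and the permalink in `<id>` never reach the archive queue.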

@pirate commented on GitHub (Jan 30, 2019):

Full-text parsing is only ever used as a fallback if all the other parsing methods fail, so once the RSS parser is working again it should automatically ignore those other links. (the RSS parser knows to only take the main ones)

@pirate commented on GitHub (Feb 1, 2019):

I fixed the regex c37941e, give the latest master commit a try.

@mawmawmawm commented on GitHub (Feb 1, 2019):

Thanks - should work according to the code change (reviewed that). I will have to wait until the RSS parser is working again - otherwise my library will be flooded with additional links.

@pirate commented on GitHub (Feb 5, 2019):

Ok RSS parser is fixed. There's now a dedicated RSS parser for the Shaarli export format. Give it a try with the latest version of master.

@mawmawmawm commented on GitHub (Feb 6, 2019):

Hey there,
thanks for working on this, but it still seems to be broken / full-text parsing being used.
I added a link to the shaarli demo instance and then added the atom feed:

docker-compose exec archivebox /bin/archive https://demo.shaarli.org/?do=atom

The result is this XML file in sources:

<?xml  version="1.0" encoding="UTF-8" ?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Shaarli demo (master)</title>
  <subtitle>Shaared links</subtitle>
  
    <updated>2019-02-06T03:48:27+00:00</updated>
  
  <link rel="self" href="https://demo.shaarli.org/?do=atom" />
  
  <author>
    <name>https://demo.shaarli.org/</name>
    <uri>https://demo.shaarli.org/</uri>
  </author>
  <id>https://demo.shaarli.org/</id>
  <generator>Shaarli</generator>
  
    <entry>
      <title>Pope Acknowledges Priests and Bishops Have Sexually Abused Nuns - The New York Times</title>
      
        <link href="https://www.nytimes.com/2019/02/05/world/europe/pope-nuns-sexual-abuse.html" />
      
      <id>https://demo.shaarli.org/?aDynig</id>
      
        <published>2019-02-06T03:48:27+00:00</published>
        <updated>2019-02-06T03:48:27+00:00</updated>
      
      <content type="html" xml:lang="en"><![CDATA[<div class="markdown"><p>&#8212; <a href="https://demo.shaarli.org/?aDynig">Permalink</a></p></div>]]></content>
      
      
    </entry>
  
    <entry>
      <title>GitHub - shaarli/Shaarli: The personal, minimalist, super-fast, database free, bookmarking service - community repo</title>
      
        <link href="https://github.com/shaarli/Shaarli" />
      
      <id>https://demo.shaarli.org/?YmapXQ</id>
      
        <published>2019-02-06T02:01:56+00:00</published>
        <updated>2019-02-06T02:01:56+00:00</updated>
      
      <content type="html" xml:lang="en"><![CDATA[<div class="markdown"><p>&#8212; <a href="https://demo.shaarli.org/?YmapXQ">Permalink</a></p></div>]]></content>
      
        <category scheme="https://demo.shaarli.org/?searchtags=" term="secretstuff" label="secretstuff" />
      
        <category scheme="https://demo.shaarli.org/?searchtags=" term="software" label="software" />
      
        <category scheme="https://demo.shaarli.org/?searchtags=" term="stuff" label="stuff" />
      
        <category scheme="https://demo.shaarli.org/?searchtags=" term="tags" label="tags" />
      
        <category scheme="https://demo.shaarli.org/?searchtags=" term="testing" label="testing" />
      
      
    </entry>
  
    <entry>
      <title>Note: testing notes</title>
      
        <link href="https://demo.shaarli.org/?aAHk4Q" />
      
      <id>https://demo.shaarli.org/?aAHk4Q</id>
      
        <published>2019-02-06T02:01:29+00:00</published>
        <updated>2019-02-06T02:01:29+00:00</updated>
      
      <content type="html" xml:lang="en"><![CDATA[<div class="markdown"><p>This is a note<br />
&#8212; <a href="https://demo.shaarli.org/?aAHk4Q">Permalink</a></p></div>]]></content>
      
      
    </entry>
  
    <entry>
      <title>The personal, minimalist, super-fast, database free, bookmarking service</title>
      
        <link href="https://shaarli.readthedocs.io" />
      
      <id>https://demo.shaarli.org/?MdkVOw</id>
      
        <published>2019-02-06T01:01:12+00:00</published>
        <updated>2019-02-06T01:01:12+00:00</updated>
      
      <content type="html" xml:lang="en"><![CDATA[<div class="markdown"><p>Welcome to Shaarli! This is your first public bookmark. To edit or delete me, you must first login.</p>
<p>To learn how to use Shaarli, consult the link &quot;Documentation&quot; at the bottom of this page.</p>
<p>You use the community supported version of the original Shaarli project, by Sebastien Sauvage.<br />
&#8212; <a href="https://demo.shaarli.org/?MdkVOw">Permalink</a></p></div>]]></content>
      
        <category scheme="https://demo.shaarli.org/?searchtags=" term="opensource" label="opensource" />
      
        <category scheme="https://demo.shaarli.org/?searchtags=" term="software" label="software" />
      
      
    </entry>
  
</feed>

"Garbage" links like https://demo.shaarli.org/?MdkVOw or https://demo.shaarli.org/?searchtags= are still being pulled in, the parser doesn't seem to just look for <entry> → <link href="... :

[*] [2019-02-06 04:27:34] Downloading https://demo.shaarli.org/?do=atom > /data/sources/demo.shaarli.org-1549427254.txt
[*] [2019-02-06 04:27:35] Parsing new links from output/sources/demo.shaarli.org-1549427254.txt and fetching titles...
........................
[+] [2019-02-06 04:28:07] Adding 11 new links to index from /data/sources/demo.shaarli.org-1549427254.txt (Plain Text format)
[√] [2019-02-06 04:28:07] Updated main index files:
    > /data/index.json
    > /data/index.html
[▶] [2019-02-06 04:28:07] Downloading content for 12 pages in archive...
[+] [2019-02-06 04:28:09] "The personal, minimalist, super-fast, database free, bookmarking service - Shaarli demo (master)"
    https://demo.shaarli.org/?MdkVOw
    > /data/archive/1549427283 (new)
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > archive_org
      > git
      > media
      √ index.json
      √ index.html
...
(screenshot: 2019-02-05 at 20 34)

@pirate commented on GitHub (Feb 6, 2019):

Very strange. You can see in the output it now says Adding 11 new links to index from /data/sources/demo.shaarli.org-1549427254.txt (Plain Text format); notice the Plain Text format at the end - it will say Shaarli RSS format when the parser gets it right.

I just ran it with exactly the XML you provided above, and it parsed it correctly...

[*] [2019-02-05 23:45:46] Parsing new links from output/sources/test.txt and fetching titles...

[+] [2019-02-05 23:45:46] Adding 4 new links to index from test.txt (Shaarli RSS format)
[√] [2019-02-05 23:45:46] Updated main index files:
    > output/index.json
    > output/index.html
[▶] [2019-02-05 23:45:46] Downloading content for 4 pages in archive...

I suspect there's some difference between our setups that's causing this, if you have a moment, do you mind uncommenting this line, and running it again to see why the Shaarli parser fails:

  1. pull master
  2. uncomment line 72 in parse.py:
    # print('[!] Parser {} failed: {} {}'.format(parser_name, err.__class__.__name__, err))
  3. run it with your Shaarli export
  4. look for [!] Parser Shaarli RSS failed: ...some error details here... and paste that error here

Thanks for helping debug this.

@mawmawmawm commented on GitHub (Feb 7, 2019):

Question - we're talking about the parse.py in the archivebox folder, correct? I didn't see any other one :)
Line 72 looks different there (https://github.com/pirate/ArchiveBox/blob/5441abdcc23e06a0102e526e7a20d8c13a5c3a7e/archivebox/parse.py#L72), and I can't find
print('[!] Parser {} failed: {} {}'.format(parser_name, err.__class__.__name__, err)) or fragments thereof anywhere else in that file?!

@pirate commented on GitHub (Feb 7, 2019):

Ah sorry I forgot to push it to master! It was just on my local branch. Try pulling master and uncommenting that line now.

@mawmawmawm commented on GitHub (Feb 9, 2019):

Took me a while, sorry. Here we go. I uncommented the line (now found on line 75 of parse.py).

[!] Parser Shaarli RSS failed: ValueError time data '2019-02-09T03:38:51+00:00' does not match format '%Y-%m-%dT%H:%M:%S%z'

It looks like the colon inside the Shaarli timestamp's timezone offset (+00:00) is what's not being parsed correctly.

full output:

root@NASi:/volume1/docker/ArchiveBox/ArchiveBox-master# docker-compose exec archivebox /bin/archive https://demo.shaarli.org/?do=atom
[*] [2019-02-09 04:08:34] Downloading https://demo.shaarli.org/?do=atom > /data/sources/demo.shaarli.org-1549685314.txt
[*] [2019-02-09 04:08:35] Parsing new links from output/sources/demo.shaarli.org-1549685314.txt and fetching titles...
[!] Parser Pinboard JSON failed: JSONDecodeError Expecting value: line 1 column 1 (char 0)
[!] Parser RSS failed: IndexError list index out of range
[!] Parser Shaarli RSS failed: ValueError time data '2019-02-09T03:38:51+00:00' does not match format '%Y-%m-%dT%H:%M:%S%z'
[!] Parser Medium RSS failed: AttributeError 'NoneType' object has no attribute 'findall'
    > Adding 8 new links to index from /data/sources/demo.shaarli.org-1549685314.txt (parsed as Plain Text format)
[*] [2019-02-09 04:08:55] Updating main index files...
    > /data/index.json
    > /data/index.html
[▶] [2019-02-09 04:08:55] Updating content for 8 pages in archive...
[+] [2019-02-09 04:08:58] "Shaarli demo (master)"
    https://demo.shaarli.org/?searchtags=
    > /data/archive/1549685333 (new)
      > favicon
      > wget
...
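That ValueError is a known Python < 3.7 limitation: datetime.strptime's `%z` only accepts offsets without a colon (`+0000`), while Shaarli emits `+00:00`. A minimal workaround sketch (not necessarily the fix ArchiveBox ended up with):

```python
from datetime import datetime

ts = '2019-02-09T03:38:51+00:00'

# Drop the colon inside the UTC offset so pre-3.7 %z can parse it:
# '+00:00' -> '+0000'
if ts[-3] == ':':
    ts = ts[:-3] + ts[-2:]

dt = datetime.strptime(ts, '%Y-%m-%dT%H:%M:%S%z')
print(dt.isoformat())  # 2019-02-09T03:38:51+00:00
```

On Python 3.7+ the colon form parses directly, so the stripping step simply becomes a no-op for older data.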

@mawmawmawm commented on GitHub (Feb 9, 2019):

Same is true for wallabag feeds btw, maybe the same fix...

[*] [2019-02-09 04:17:28] Downloading https://[redacted]/[redacted]/tags/autofocus.xml > /data/sources/[redacted]-1549685848.txt
[*] [2019-02-09 04:17:30] Parsing new links from output/sources/[redacted]-1549685848.txt and fetching titles...
[!] Parser Pinboard JSON failed: JSONDecodeError Expecting value: line 1 column 1 (char 0)
[!] Parser RSS failed: IndexError list index out of range
[!] Parser Medium RSS failed: ValueError time data 'Mon, 04 Feb 2019 00:43:03 +0000' does not match format '%a, %d %b %Y %H:%M:%S %Z'
    > Adding 72 new links to index from /data/sources/[redacted]-1549685848.txt (parsed as Plain Text format)
[*] [2019-02-09 04:18:39] Updating main index files...
    > /data/index.json
    > /data/index.html

Input XML (snippet):

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title>wallabag - tag (autofocus) feed</title>
        <link>[redacted]/tag/list/autofocus</link>
        <link rel="self" href="[redacted]/tags/autofocus.xml"/>
        <link rel="last" href="[redacted]/tags/autofocus.xml?page=1"/>
        <pubDate>Sat, 09 Feb 2019 04:17:30 +0000</pubDate>
        <generator>wallabag</generator>
        <description>wallabag tag (autofocus) elements</description>

        
            <item>
                <title><![CDATA[Autofokus im Test | c&#039;t Fotografie]]></title>
                <source url="[redacted]">wallabag</source>
                <link>https://www.heise.de/foto/artikel/Autofokus-im-Test-3196103.html?view=print</link>
                <guid>https://www.heise.de/foto/artikel/Autofokus-im-Test-3196103.html?view=print</guid>
                <pubDate>Mon, 04 Feb 2019 00:43:03 +0000</pubDate>
                <description>
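The Medium RSS failure in the log above points at the likely bug: `%Z` matches timezone names, while 'Mon, 04 Feb 2019 00:43:03 +0000' carries a numeric offset that needs `%z`. RFC 2822 dates like wallabag's <pubDate> can also be parsed with a stdlib helper instead of a hand-rolled format string (a sketch, not the project's eventual fix):

```python
from email.utils import parsedate_to_datetime

# parsedate_to_datetime handles RFC 2822 dates, including numeric
# offsets like '+0000', without a strptime format string.
dt = parsedate_to_datetime('Mon, 04 Feb 2019 00:43:03 +0000')
print(dt.isoformat())  # 2019-02-04T00:43:03+00:00
```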

@pirate commented on GitHub (Feb 11, 2019):

At a conference right now and have a busy week ahead, so apologies if I don't get around to fixing this for a bit.

@mawmawmawm commented on GitHub (Feb 11, 2019):

No rush at all (at least for me). Thanks for doing all this btw. Let me know if you need any additional input.

@pirate commented on GitHub (Feb 11, 2019):

A redacted copy of your /data/sources/demo.shaarli.org-1549685314.txt would be helpful, thx.

@mawmawmawm commented on GitHub (Feb 12, 2019):

Sorry, I don't have that anymore. But it was basically just the shaarli demo with one link added to it.

@pirate commented on GitHub (Feb 19, 2019):

@mawmawmawm I think I fixed it (in eff0100), pull the latest master and give it a shot. Comment if it's still broken and I'll reopen the issue.

@mawmawmawm commented on GitHub (Feb 20, 2019):

Sorry, it's still happening for me after a full rebuild... the Shaarli feed is still being imported as plain text.

@pirate commented on GitHub (Feb 27, 2019):

I just ran the latest master on the sample Shaarli export you provided (https://github.com/pirate/ArchiveBox/issues/135#issuecomment-460898443) and it worked as expected (imported 4 links, parsed as Shaarli RSS format). If the latest master is still failing for you, post your export here, as I need it to be able to debug the parsing.

@jeanregisser commented on GitHub (Mar 7, 2019):

I tried an RSS import from wallabag using the latest master (https://github.com/pirate/ArchiveBox/commit/4a7f1d57d5bf9d9493254b8fdaa9bf6fd0dc2c4c)
and it only found 1 link, though there are 50 of them in the feed.

docker-compose run archivebox /bin/archive https://app.wallabag.it/jeanregisser/Hb3M3PHiPfZYvra/all.xml
[*] [2019-03-07 11:38:16] Downloading https://app.wallabag.it/jeanregisser/Hb3M3PHiPfZYvra/all.xml
    > /data/sources/app.wallabag.it-1551958696.txt
[*] [2019-03-07 11:38:16] Parsing new links from output/sources/app.wallabag.it-1551958696.txt...
    > Adding 1 new links to index (parsed import as RSS)
[...]

Attached feed: app.wallabag.it-1551958696.txt (https://github.com/pirate/ArchiveBox/files/2941125/app.wallabag.it-1551958696.txt)

Let me know if you need more info.

@pirate commented on GitHub (Mar 25, 2019):

Sorry for the delay, just fixed this @jeanregisser in 58c9b47. Pull the latest master and give it a try. Comment back here if it doesn't work and I'll reopen the ticket.

The issue was that wallabag adds a bunch of newlines between the RSS items which broke my crappy parsing code.

@mawmawmawm there have been lots of parser fixes since my last comment here, can you also give the latest master a shot and report back?
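The newline issue could be sidestepped by splitting on the <item> tags themselves rather than on lines (a hypothetical sketch, not the actual 58c9b47 change):

```python
import re

# Match each <item>...</item> block regardless of the whitespace and
# blank lines wallabag inserts between items (DOTALL lets . span lines).
ITEM_RE = re.compile(r'<item>.*?</item>', re.DOTALL)

feed = '''<rss><channel>

    <item>
        <link>https://example.com/a</link>
    </item>


    <item>
        <link>https://example.com/b</link>
    </item>
</channel></rss>'''

items = ITEM_RE.findall(feed)
print(len(items))  # 2
```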

@mawmawmawm commented on GitHub (Apr 1, 2019):

Sorry for the late reply - tried it 3 days ago and was working fine except the wget issue mentioned in the other ticket.

@amette commented on GitHub (May 3, 2019):

This error still persists for me. I have Shaarli v0.10.4 (latest) and ArchiveBox master from git. Shaarli produces for example the following XML (original, but domain redacted):

<entry>
  <title>The Google Cemetery - Google Graveyard: Dead Google products</title>

    <link href="https://gcemetery.co/" />

  <id>https://shaarli.example.com/?Z485AQ</id>

    <published>2019-05-02T21:09:13+02:00</published>
    <updated>2019-05-02T21:09:13+02:00</updated>

  <content type="html" xml:lang="de"><![CDATA[
   <br>&#8212; <a href="https://shaarli.example.com/?Z485AQ" title="Permalink">Permalink</a>]]></content>
</entry>

ArchiveBox correctly imports the gcemetery.co link once, but also imports the shaarli.example.com link once.
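For reference, a namespace-aware parse would surface only each entry's `<link href>` and skip both the `<id>` permalink and any URLs inside the `<content>` CDATA. A minimal stdlib sketch (not ArchiveBox's actual parser; the entry shape follows the example above):

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def extract_entry_links(feed_xml: str) -> list[str]:
    """Return only the <link href> of each Atom <entry>, ignoring
    the Shaarli permalink in <id> and links inside <content> CDATA."""
    root = ET.fromstring(feed_xml)
    links = []
    for entry in root.iter(ATOM_NS + "entry"):
        link = entry.find(ATOM_NS + "link")
        if link is not None and link.get("href"):
            links.append(link.get("href"))
    return links

feed = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>The Google Cemetery</title>
    <link href="https://gcemetery.co/" />
    <id>https://shaarli.example.com/?Z485AQ</id>
    <content type="html"><![CDATA[<a href="https://shaarli.example.com/?Z485AQ">Permalink</a>]]></content>
  </entry>
</feed>"""

print(extract_entry_links(feed))  # only the real target URL
```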

@sebw commented on GitHub (Apr 12, 2022):

Hey @pirate, I just discovered ArchiveBox, that's awesome and a great extension to Shaarli!

Unfortunately the problem persists.

![image](https://user-images.githubusercontent.com/2285094/162987251-dc5e2694-0983-4ca4-a934-5e90e0aa034f.png)

Running the dev branch of Shaarli.

Besides the Shaarli links, I would not expect w3.org and purl.org to appear.

@pirate commented on GitHub (Apr 12, 2022):

w3.org and purl.org are expected in full-text parsing mode (which the importer is falling back to due to a bug) because they are linked in the RSS markup even though the links aren't visible. They won't be archived multiple times, so I recommend leaving them for now and ignoring those entries.

I've re-opened the issue to track fixing it, PRs to fix are welcome.
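The fallback behavior described here is easy to reproduce: plain-text mode scans the raw feed for anything URL-shaped, so the `xmlns` namespace declaration matches right alongside the real links. A minimal sketch (the regex is illustrative, not ArchiveBox's actual pattern):

```python
import re

# Illustrative URL pattern; ArchiveBox's actual fallback regex may differ
URL_RE = re.compile(r'https?://[^\s"<>\]]+')

raw_feed = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="https://gcemetery.co/" />
  <id>https://shaarli.example.com/?Z485AQ</id>
</feed>'''

# Full-text mode scans the raw markup, so the xmlns declaration
# matches just like the real links do:
found = URL_RE.findall(raw_feed)
print(found)
# ['http://www.w3.org/2005/Atom', 'https://gcemetery.co/',
#  'https://shaarli.example.com/?Z485AQ']
```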

@wokawoka commented on GitHub (Jun 16, 2023):

Is this issue still relevant?

@pirate commented on GitHub (Jun 20, 2023):

Yes, it hasn't been fixed yet. PRs are welcome. I haven't gotten to it as I don't use Shaarli myself.

@melyux commented on GitHub (Jul 11, 2023):

I'm also getting the unnecessary links (like http://www.w3.org/2005/Atom) with any kind of normal Atom RSS feed. Is that part of this bug?

@pirate commented on GitHub (Aug 16, 2023):

Yes. Because it falls back to URL parsing in plain-text mode, it archives every string that looks like a URL. Using a proper RSS parser library to fix these parser bugs should prevent those w3.org Atom schema reference URLs from being imported.
