[GH-ISSUE #726] Bug: Pocket `since` high-water-mark gets set even when indexing fails #1970

Open
opened 2026-03-01 17:55:27 +03:00 by kerem · 5 comments

Originally created by @cpmsmith on GitHub (Apr 28, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/726

Describe the bug

I'm setting up Pocket importing for the first time, meaning I'm importing a lot of old links, some of which are on now-defunct websites. When one of them fails, the entire import fails, but the `since` value in `pocket_api.db` is still set, meaning when I try to re-import my Pocket feed, it only retrieves new items, leaving me with no URLs archived.
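
The failure mode described here is the classic high-water-mark pitfall: Pocket's API takes a `since` timestamp so a client only fetches items added after the last sync, but if the cursor is advanced before processing succeeds, one crash silently skips everything in between. A minimal sketch of the pattern (the `FakeAPI` and `sync` names are hypothetical stand-ins, not ArchiveBox's actual code):

```python
import time

class FakeAPI:
    """Hypothetical stand-in for the Pocket API: returns items
    added strictly after the `since` cursor."""
    def __init__(self, items):
        self.items = items  # list of (added_timestamp, url) pairs

    def get(self, since=0):
        return [url for ts, url in self.items if ts > since]

def sync(api, state, process):
    """Fetch new items and advance the high-water mark ONLY after
    processing succeeds. If `process` raises, `state['since']` is
    left untouched, so the same items are retried on the next run."""
    items = api.get(since=state.get("since", 0))
    process(items)                     # may raise, e.g. on a dead domain
    state["since"] = int(time.time())  # reached only on success
    return items
```

The bug reported here behaves as if the `state["since"] = ...` line sat above the `process(items)` call: the first failed run burns the cursor, and every later run parses 0 URLs.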

Steps to reproduce

  1. Set up Pocket config, per #528
  2. Have a URL in Pocket on a domain that refuses connections or does not exist
  3. Import from Pocket:
    $ archivebox add --depth=1 pocket://myUserName
    [+] [2021-04-28 14:00:05] Adding 1 links to index (crawl depth=1)...
        > Saved verbatim input to sources/1619618411-import.txt
        > Parsed 169 URLs from input (Pocket API)
    
    [*] Starting crawl of 169 sites 1 hop out from starting point
        > Downloading http://my-working-url.com/ contents
        > Saved verbatim input to sources/1619618411-crawl-my-working-url.com.txt
        > Parsed 12 URLs from input (Generic TXT)
        > Downloading http://my-defunct-url.com/ contents
    [!] Failed to download http://my-defunct-url.com/
    
         HTTPConnectionPool(host='my-defunct-url.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0xb4dbf9b8>: Failed to establish a new connection: [Errno -2] Name or service not known'))
    
  4. Remove broken URL from Pocket
  5. Try importing again:
    $ archivebox add --depth=1 pocket://myUserName
    [+] [2021-04-28 14:39:35] Adding 1 links to index (crawl depth=1)...
        > Saved verbatim input to sources/1619620775-import.txt
                                                                                                                                   0.1% (0/240sec)
    [X] No links found using Pocket API parser
        Hint: Try a different parser or double check the input?
    
        > Parsed 0 URLs from input (Pocket API)
        > Found 0 new URLs not already in index
    
    [*] [2021-04-28 14:39:35] Writing 0 links to main index...
        √ ./index.sqlite3
    

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.4.79-v7l+-armv7l-with-glibc2.28 armv7l
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.9.4          valid     /usr/local/bin/python3.9
 √  DJANGO_BINARY         v3.1.8          valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.04.07     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v89.0.4389.114  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /data
 √  SOURCES_DIR           28 files        valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           0 files         valid     ./archive
 √  CONFIG_FILE           204.0 Bytes     valid     ./ArchiveBox.conf
 √  SQL_INDEX             204.0 KB        valid     ./index.sqlite3

@mAAdhaTTah commented on GitHub (Apr 28, 2021):

The URLs should still be added to your database, so if you need to download those URLs, you can run `archivebox update` to go through your db and do that process.

@pirate My understanding is this isn't specific to the Pocket API implementation though, is it?


@cpmsmith commented on GitHub (Apr 28, 2021):

> The URLs should still be added to your database, so if you need to download those urls, you can run `archivebox update` to go through your db and do that process.

I thought so as well, but when the add fails, no snapshots get added:

$ archivebox update
[i] [2021-04-28 17:12:12] ArchiveBox v0.6.2: archivebox update
    > /data

[!] No Snapshots matched your filters: [] (exact)
$ archivebox status
[*] Scanning archive main index...
    /data/*
    Index size: 236.0 KB across 3 files

    > SQL Main Index: 0 links        (found in index.sqlite3)
    > JSON Link Details: 0 links     (found in archive/*/index.json)

[*] Scanning archive data directories...
    /data/archive/*
    Size: 0.0 Bytes across 0 files in 0 directories

    > indexed: 0                     (indexed links without checking archive status or data directory validity)
      > archived: 0                  (indexed links that are archived with a valid data directory)
      > unarchived: 0                (indexed links that are unarchived with no data directory or an empty data directory)

    > present: 0                     (dirs that actually exist in the archive/ folder)
      > valid: 0                     (dirs with a valid index matched to the main index and archived content)
      > invalid: 0                   (dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized)
        > duplicate: 0               (dirs that conflict with other directories that have the same link URL or timestamp)
        > orphaned: 0                (dirs that contain a valid index but aren't listed in the main index)
        > corrupted: 0               (dirs that don't contain a valid index and aren't listed in the main index)
        > unrecognized: 0            (dirs that don't contain recognizable archive data and aren't listed in the main index)


[*] Scanning recent archive changes and user logins:
    /data/logs/*
    UI users 1: archivebox
    Last UI login: archivebox @ 2021-04-28 12:50

    ...

@pirate commented on GitHub (Apr 28, 2021):

Yeah, I think it's missing a `try:`/`except:` block around the crawl step. Because the crawl is a preprocessing step done before the actual archiving, it doesn't have our usual "if a snapshot fails, continue anyway" logic around it.

For now you may have to revert to adding URLs one at a time when using `--depth=1` mode until this is fixed in the next version. To fix the immediate issue of not being able to add these pages / re-update URLs that are already present, you can try the `archivebox add --update-all ...` or `archivebox add --overwrite ...` flags, depending on your desired behavior when encountering a previously-added URL.
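
The per-URL guard being described could look something like the following. This is a sketch of the "continue anyway" behavior using stdlib `urllib`, not the actual ArchiveBox crawl code; the `crawl_contents` name is made up for illustration:

```python
import urllib.error
import urllib.request

def crawl_contents(urls):
    """Download each seed URL's contents, skipping failures instead of
    aborting the whole crawl. Returns a dict of url -> raw bytes for
    the URLs that succeeded."""
    results = {}
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                results[url] = resp.read()
        except (urllib.error.URLError, OSError) as err:
            # A dead domain (DNS failure, refused connection) lands here
            # and only costs us this one URL, not the entire import.
            print(f"[!] Failed to download {url}: {err}")
    return results
```

With a wrapper like this, the `Name or service not known` error in the log above would skip `my-defunct-url.com` and keep crawling the remaining seeds.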


@cpmsmith commented on GitHub (Apr 28, 2021):

> To fix the immediate issue of not being able to add these pages / re-update urls that are already present, you can try using the `archivebox add --update-all ...` or `archivebox add --overwrite ...` flags

I tried this as well, but it doesn't work either, as the URLs aren't being indexed at all.

Raw output from `--update-all` and `--overwrite`:
$ archivebox add --depth=1 --update-all pocket://myUserName
[i] [2021-04-28 17:33:44] ArchiveBox v0.6.2: archivebox add --depth=1 --update-all pocket://myUserName
    > /data

[+] [2021-04-28 17:33:48] Adding 1 links to index (crawl depth=1)...
    > Saved verbatim input to sources/1619631228-import.txt
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index

[*] [2021-04-28 17:33:48] Writing 0 links to main index...
    √ ./index.sqlite3
$ archivebox add --depth=1 --overwrite pocket://myUserName
[i] [2021-04-28 17:34:04] ArchiveBox v0.6.2: archivebox add --depth=1 --overwrite pocket://myUserName
    > /data

[+] [2021-04-28 17:34:08] Adding 1 links to index (crawl depth=1)...
    > Saved verbatim input to sources/1619631248-import.txt
    > Parsed 0 URLs from input (Failed to parse)
    > Found 0 new URLs not already in index

[*] [2021-04-28 17:34:08] Writing 0 links to main index...
    √ ./index.sqlite3

This is why I think it's Pocket-specific: because the Pocket API consumer saves the `since` value, it stops reporting the existence of the old URLs at all, even though it never successfully indexed them. Files have been written in `data/sources`, but `data/archive` is still empty.

More to the point, if I delete the `sources/pocket_api.db` file and run `add` again, the behaviour changes: it starts attempting to index all the 1-hop-out URLs again, rather than just saying `No links found`, even without using `--overwrite` or `--update-all`.


@pirate commented on GitHub (Apr 28, 2021):

Ah yeah, you're right, this is Pocket-API-specific then.
