[GH-ISSUE #1178] Bug: Doing a later, successful "Pull" on snapshots classified as "failed" doesn't change their status away from "failed" #2242

Open
opened 2026-03-01 17:57:36 +03:00 by kerem · 5 comments

Originally created by @melyux on GitHub (Jul 13, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1178

Describe the bug

If one of the extractors for a snapshot fails to get its content, the snapshot is classified as "Failed". If you later go in and do a "Pull" on it with the button, it retries these failed extractors. If this operation succeeds, the snapshot's status does not get moved from "failed" to "succeeded".

The "status" in the filter seems to apply to the individual extractor results inside snapshots rather than to the snapshots themselves, since this snapshot shows up under both "succeeded" and "failed", which is weird. Once everything works on subsequent "Pull"s, the error count and the failed statuses should be removed, I think. Otherwise there's no point to these filters.

Steps to reproduce

  1. Add a URL and have one or more of its extractors fail.
  2. Run a "Pull" on the snapshot, and this time have the extractors succeed.
  3. See that the snapshot still shows up when filtering by status "failed".

Screenshots or log output

ArchiveBox version

find: '/.config/chromium/Crash Reports/pending/': No such file or directory
0.6.3
ArchiveBox v0.6.3 Cpython Linux Linux-6.1.0-10-amd64-x86_64-with-glibc2.31 x86_64
DEBUG=False IN_DOCKER=True IS_TTY=True TZ=UTC FS_ATOMIC=True FS_REMOTE=True FS_PERMS=644 999:999 SEARCH_BACKEND=ripgrep

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.3         valid     /usr/local/bin/python3.11                                                   
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py                                 
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py                  
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   

 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v18.16.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 √  GIT_BINARY            v2.30.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2023.03.04     valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v112.0.5615.138  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            7 files @       valid     /data                                                                       
 √  SOURCES_DIR           172 files       valid     ./sources                                                                   
 √  LOGS_DIR              2 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           54 files        valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             776.0 KB        valid     ./index.sqlite3  

@melyux commented on GitHub (Jul 15, 2023):

Thought about this for a while, and snapshot filters should work like this: if the latest update was successful for an extractor for that snapshot, previous failures of that extractor should not count against the snapshot as "failed". Only when the latest update for at least one extractor failed should that snapshot be designated as "failed".
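
The rule proposed above can be sketched as follows. This is a minimal illustration, not ArchiveBox's actual filter code: the `Result` record, the `snapshot_status` helper, and the "pending" label for a snapshot with no results are all hypothetical names assumed for the example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for one extractor run on a snapshot."""
    extractor: str
    status: str        # "succeeded" or "failed"
    end_ts: datetime   # when the run finished


def snapshot_status(results: list[Result]) -> str:
    """Classify a snapshot using only the latest run of each extractor."""
    latest: dict[str, Result] = {}
    for r in results:
        prev = latest.get(r.extractor)
        if prev is None or r.end_ts > prev.end_ts:
            latest[r.extractor] = r
    if not latest:
        return "pending"  # assumed label for "no runs yet"
    # Older failures are ignored; only a failed *latest* run counts.
    if any(r.status == "failed" for r in latest.values()):
        return "failed"
    return "succeeded"


history = [
    Result("wget", "succeeded", datetime(2023, 7, 13, 10, 0)),
    Result("singlefile", "failed", datetime(2023, 7, 13, 10, 1)),
    Result("singlefile", "succeeded", datetime(2023, 7, 14, 9, 0)),  # later "Pull"
]
print(snapshot_status(history))  # prints "succeeded"
```

With only the first two entries of `history`, the same helper returns "failed", which matches the state of the snapshot before the later "Pull".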


@pirate commented on GitHub (Aug 13, 2023):

Yeah, I agree @melyux, that's how they were intended to work already; there must be a bug in the filter logic.


@neel-suthar commented on GitHub (Jan 22, 2024):

I thought Snapshots had some kind of status field, but it seems I was wrong. Is it worth adding a status field to each snapshot? Internally the logic should stay the same, but this could help us fetch snapshots much more efficiently. Just a thought.


@pirate commented on GitHub (Jan 23, 2024):

Nah @neel-suthar, I want Snapshots to stay basically immutable (i.e. no flag/status/etc. fields) because we're moving to an event-driven model soon. But we can add a `@cached_property` that gets the status using a query over `ArchiveResult`s and stores it in cache.
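
A rough sketch of that idea, assuming Python's standard `functools.cached_property` and a plain class in place of the Django model. The `Snapshot` class here, its `(extractor, status)` pairs, and the `queries` counter are hypothetical stand-ins for illustration; the real implementation would query `ArchiveResult` rows and may use a different cache.

```python
from functools import cached_property


class Snapshot:
    """Hypothetical plain-Python stand-in for the Snapshot model."""

    def __init__(self, url, results):
        self.url = url
        # (extractor, status) pairs, oldest first; stands in for a DB query
        self._results = list(results)
        self.queries = 0  # counts how often the status is recomputed

    @cached_property
    def status(self) -> str:
        """Derive the status from results instead of storing a field."""
        self.queries += 1
        latest = {}
        for extractor, status in self._results:
            latest[extractor] = status  # later entries overwrite earlier ones
        return "failed" if "failed" in latest.values() else "succeeded"


snap = Snapshot(
    "https://example.com",
    [("wget", "failed"), ("title", "succeeded"), ("wget", "succeeded")],
)
print(snap.status, snap.status, snap.queries)
```

The status is computed on first access and then served from the instance, so no status column is needed on the snapshot itself. Under this approach the cached value has to be dropped (for example with `del snap.status`) whenever new results arrive.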


@pirate commented on GitHub (Oct 27, 2024):

This longstanding bug should soon be fixed: each model is now a finite state machine with only a few valid states. Everything gets moved towards a final state deterministically on `tick()` (like a game engine), and if a snapshot fails enough times it will eventually be marked "fatal" and will have to be retried as a new snapshot. This should make it much clearer when something is failing intermittently vs. permanently.
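
To illustrate the general shape of such a design, here is a small sketch of a state machine advanced by `tick()`. Only `tick()` and the "fatal" state come from the comment above; the other state names, the `SnapshotMachine` class, and the `MAX_FAILURES` threshold are assumptions made for the example and are not taken from the ArchiveBox codebase.

```python
from enum import Enum


class State(Enum):
    QUEUED = "queued"        # assumed name
    STARTED = "started"      # assumed name
    SUCCEEDED = "succeeded"  # assumed name
    FATAL = "fatal"          # named in the comment above


class SnapshotMachine:
    MAX_FAILURES = 3  # assumed threshold for giving up

    def __init__(self, attempt):
        self.attempt = attempt  # callable returning True on success
        self.state = State.QUEUED
        self.failures = 0

    def tick(self) -> State:
        """Advance one step towards a final state."""
        if self.state in (State.SUCCEEDED, State.FATAL):
            return self.state  # final states never change
        if self.state is State.QUEUED:
            self.state = State.STARTED
            return self.state
        # STARTED: try the work once per tick
        if self.attempt():
            self.state = State.SUCCEEDED
        else:
            self.failures += 1
            if self.failures >= self.MAX_FAILURES:
                self.state = State.FATAL
        return self.state


outcomes = iter([False, False, True])  # fails twice, then succeeds
machine = SnapshotMachine(lambda: next(outcomes))
while machine.state not in (State.SUCCEEDED, State.FATAL):
    machine.tick()
print(machine.state.value, machine.failures)
```

In this sketch an intermittent failure leaves the machine in a retryable state, while repeated failure ends in "fatal", which is the distinction the comment describes.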
