[GH-ISSUE #1102] Bug: running pull (or update --resume via CLI) doesn't finish incomplete youtube-dl download #692

Open
opened 2026-03-01 14:45:34 +03:00 by kerem · 7 comments
Owner

Originally created by @SeanDS on GitHub (Feb 19, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1102

Describe the bug

A YouTube video didn't complete its download before the timeout, so I have lots of files in the site's media directory representing subtitles, partial video files etc. but youtube-dl hasn't finished compiling them into one webm file. However, selecting the site and clicking "pull" in the web UI or running archivebox update --resume <timestamp> doesn't appear to re-run youtube-dl to finish the video archival.

Steps to reproduce

  1. Start archival of a YouTube video.
  2. Cancel the youtube-dl step so that the download is only partial.
  3. Attempt to resume download with web UI or CLI and note that it doesn't complete the video archival.

Screenshots or log output

Partially redacted output, showing "success" for the video in question despite it not finishing the archival.

$ docker compose run -e SAVE_MEDIA=1 archivebox update --resume 1673798553.228281
[i] [2023-02-19 16:45:22] ArchiveBox v0.6.2: archivebox update --resume 1673798553.228281
    > /data


[▶] [2023-02-19 16:46:54] Starting archiving of 14 snapshots in index...

[redacted]

[√] [2023-02-19 16:47:00] "<redacted>"
    https://www.youtube.com/watch?v=<redacted>
    √ ./archive/1673798553.228281
        268 files (310.7 MB) in 0:00:00s 

[redacted]

[√] [2023-02-19 16:47:03] Update of 14 pages complete (9.03 sec)
    - 12 links skipped
    - 2 links updated

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.4.0-139-generic-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9                                                    
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.04.26     valid     /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /data                                                                       
 √  SOURCES_DIR           71 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           1258 files      valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             9.6 MB          valid     ./index.sqlite3

@pirate commented on GitHub (Feb 19, 2023):

This is a valid ask, but is it happening so often that you need it to be automated? Usually people fix it manually, case by case, by deleting the ArchiveResult/Snapshot and re-archiving.

Automatic repair of partial snapshots is not really something I want to add to ArchiveBox right now; that code could be brittle and high-maintenance (it would have to change every time any extractor changes its output schema). Right now ArchiveResults are created atomically on archive success and are immutable (but it's hard to tell whether a result terminated early if archivebox is terminated at the same time). I will eventually add stronger guarantees that the index isn't updated until after the methods succeed (but that is already going to happen as a result of another larger refactor: https://github.com/ArchiveBox/ArchiveBox/issues/91#issuecomment-871343428).

For now I recommend a small bash/python solution to fix it:

https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#python-shell-usage
https://docs.archivebox.io/en/master/archivebox.html#module-archivebox.main

$ archivebox shell

# pseudocode
for each snapshot in archive/ that doesn't have a *.webm file:
    delete the media/ folder and ArchiveResult
    update/re-archive that snapshot

lemme know if you need help translating that pseudocode to a real script
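
As a starting point for the first line of that pseudocode, here is a minimal sketch that only does the filesystem scan. It assumes the layout described in this thread (archive/<timestamp>/media/); the function name and the list of "finished" extensions are made up for illustration and are not part of ArchiveBox. It deletes nothing and does not touch the database, so removing the ArchiveResult and re-archiving would still have to be done separately.

```python
from pathlib import Path

# Assumption: extensions that count as a "finished" video; adjust to your setup.
FINISHED_VIDEO_EXTS = {".webm", ".mp4", ".mkv"}


def snapshots_missing_video(archive_dir):
    """Return snapshot dirs that have a media/ folder but no finished video in it."""
    incomplete = []
    for snapshot_dir in sorted(Path(archive_dir).iterdir()):
        media_dir = snapshot_dir / "media"
        if not media_dir.is_dir():
            continue  # no media output at all for this snapshot, nothing to repair
        has_video = any(
            f.is_file() and f.suffix.lower() in FINISHED_VIDEO_EXTS
            for f in media_dir.rglob("*")
        )
        if not has_video:
            incomplete.append(snapshot_dir)
    return incomplete


if __name__ == "__main__":
    for path in snapshots_missing_video("./archive"):
        print(path)
```

Files such as video.webm.part are ignored because their suffix is .part. A stream that downloaded fully but was never merged may still end in .webm, so treat the output as a list of candidates to inspect rather than a definitive answer.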


@SeanDS commented on GitHub (Feb 21, 2023):

Fair enough, and thanks for the reply.

Just to clarify some things about your pseudocode/description. What is an ArchiveResult, exactly? I don't see any file with that name inside my archives. I guess you refer to the entry in the database, corresponding to a line in the web interface? So when you say delete the ArchiveResult, do you mean I check the entry in the web interface and click Delete? And why then do I not delete the whole directory (e.g. 1670188878.145459) but just the media directory?


@SeanDS commented on GitHub (Feb 21, 2023):

BTW I'm using ArchiveBox via Docker - this seems quite out of date. Is there a Docker image for the dev branch anywhere?


@pirate commented on GitHub (Feb 24, 2023):

Yes, you can just pull archivebox/archivebox:dev and it will pull the latest pre-release branch. https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch

ArchiveResult refers to the row in the database that points to a file/folder inside the snapshot dir archive/<timestamp>/<some result>. You can find it in archivebox/core/models.py:ArchiveResult and interact with it in the ArchiveBox shell as ArchiveResult.objects.all(). In the UI, ArchiveResults can be found on the Log page in the top bar, and you can delete/edit/create them manually from there too.


@melyux commented on GitHub (Jul 13, 2023):

Don't the other extractors have their ArchiveResults repaired when you say "Pull"? What sets the media extractor apart that it can't re-try like the others can? Just ran into this myself

EDIT: Wget also doesn't retry when it fails. You have to do a Reset for it to even attempt retrying a failed Wget extract. But Singlefile, DOM, Screenshot all successfully retry failed extracts upon Pull.


@pirate commented on GitHub (Aug 16, 2023):

ArchiveResults should be effectively immutable; they are only ever added, not repaired.

Wget and youtube-dl don't retry in the same way because they can produce confusing temporary/partial output that broke my retry logic in the past. Codifying the expectations around how partial output is handled is on my roadmap. I'm designing a new spec to work with an event-sourcing model that guarantees extractors will either produce fully complete output or fail hard, since partial output seems to be too tricky to handle well across all extractors without adding a lot of brittle logic.
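
To illustrate the "fully complete output, or fail hard" idea in general terms, here is a small sketch of one common way to get that behavior: stage the output in a temporary directory and rename it into place only on success. This is not the spec mentioned above and not ArchiveBox code; run_atomically and the extract callback are hypothetical names used only for this example.

```python
import os
import shutil
import tempfile
from pathlib import Path


def run_atomically(extract, final_dir):
    """Run extract(tmp_dir); publish the output as final_dir only if it succeeds."""
    final_dir = Path(final_dir)
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the destination so the final rename stays on one filesystem.
    tmp_dir = Path(
        tempfile.mkdtemp(prefix=final_dir.name + ".tmp-", dir=final_dir.parent)
    )
    try:
        extract(tmp_dir)               # may raise on timeout, network error, cancel
        os.rename(tmp_dir, final_dir)  # raises if final_dir already holds output
    except BaseException:
        # Fail hard: discard everything staged so no partial output is left behind.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise
    return final_dir
```

With this pattern a reader of the snapshot directory either sees a complete media/ folder or none at all, so "does the output exist?" becomes a reliable success check.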


@melyux commented on GitHub (Aug 16, 2023):

Right, meant Singlefile/DOM/Screenshot failures are not overwritten but retried while Media/WGET are never retried. Good to hear about the upcoming either/or logic
