mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-26 01:26:00 +03:00
[GH-ISSUE #1102] Bug: running pull (or update --resume via CLI) doesn't finish incomplete youtube-dl download #2202
Originally created by @SeanDS on GitHub (Feb 19, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1102
Describe the bug
A YouTube video didn't complete its download before the timeout, so I have lots of files in the site's media directory representing subtitles, partial video files etc. but youtube-dl hasn't finished compiling them into one webm file. However, selecting the site and clicking "pull" in the web UI or running archivebox update --resume <timestamp> doesn't appear to re-run youtube-dl to finish the video archival.
Steps to reproduce
Screenshots or log output
Partially redacted output, showing "success" for the video in question despite it not finishing the archival.
ArchiveBox version
@pirate commented on GitHub (Feb 19, 2023):
This is a valid ask, but is it happening so often that you need it to be automated? Usually people fix it case-by-case manually, by deleting the ArchiveResult/Snapshot and re-archiving.
Automatic repair of partial snapshots is not really something I want to add to ArchiveBox right now; that code could be brittle and high-maintenance (as it would have to change every time any extractor changes its output schema). Right now ArchiveResults are created atomically on archive success and are immutable (but it's hard to tell whether a result terminated early if archivebox is terminated at the same time). I will eventually add stronger guarantees that the index isn't updated until after the methods succeed (but that is already going to happen as a result of another larger refactor: https://github.com/ArchiveBox/ArchiveBox/issues/91#issuecomment-871343428).
For now I recommend a small bash/python solution to fix it:
https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#python-shell-usage
https://docs.archivebox.io/en/master/archivebox.html#module-archivebox.main
lemme know if you need help translating that pseudocode to a real script
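A minimal Python sketch of the detection half of such a script: it lists snapshots whose media directory still holds partial youtube-dl output. It assumes the archive/<timestamp>/media layout mentioned in this thread and that unfinished downloads carry youtube-dl's .part suffix. The function name find_partial_media is hypothetical and not part of ArchiveBox. Deleting the matching ArchiveResult and re-archiving remains the manual step described above.

```python
from pathlib import Path


def find_partial_media(archive_dir):
    """Return the names (timestamps) of snapshot dirs with leftover *.part files.

    Assumption: snapshots live at <archive_dir>/<timestamp>/ and media output
    lives in a "media" subfolder, as described in this thread.
    """
    partial = []
    for snapshot_dir in sorted(Path(archive_dir).iterdir()):
        media_dir = snapshot_dir / "media"
        if not media_dir.is_dir():
            # Snapshot has no media output at all; nothing to check.
            continue
        # rglob also catches partial files nested in subfolders of media/.
        if any(media_dir.rglob("*.part")):
            partial.append(snapshot_dir.name)
    return partial
```

Calling find_partial_media("archive") from inside the data directory would return the timestamps worth inspecting by hand before removing their media folder and re-running the archive.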
@SeanDS commented on GitHub (Feb 21, 2023):
Fair enough, and thanks for the reply.
Just to clarify some things about your pseudocode/description. What is an ArchiveResult, exactly? I don't see any file with that name inside my archives. I guess you refer to the entry in the database, corresponding to a line in the web interface? So when you say delete the ArchiveResult, do you mean I check the entry in the web interface and click Delete? And why then do I not delete the whole directory (e.g. 1670188878.145459) but just the media directory?
@SeanDS commented on GitHub (Feb 21, 2023):
BTW I'm using ArchiveBox via Docker - this seems quite out of date. Is there a Docker image for the dev branch anywhere?
@pirate commented on GitHub (Feb 24, 2023):
Yes, you can just pull archivebox/archivebox:dev and it will pull the latest pre-release branch. https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch
ArchiveResult refers to the line in the database that points to a file/folder inside the snapshot dir archive/<timestamp>/<some result>. You can find it in archivebox/core/models.py:ArchiveResult and interact with it in the ArchiveBox shell as ArchiveResult.objects.all(). In the UI, ArchiveResults can be found under the Log page at the top, and you can delete/edit/create them manually from there too.
@melyux commented on GitHub (Jul 13, 2023):
Don't the other extractors have their ArchiveResults repaired when you say "Pull"? What sets the media extractor apart that it can't re-try like the others can? Just ran into this myself
EDIT: Wget also doesn't retry when it fails. You have to do a Reset for it to even attempt retrying a failed Wget extract. But Singlefile, DOM, Screenshot all successfully retry failed extracts upon Pull.
@pirate commented on GitHub (Aug 16, 2023):
ArchiveResults should be effectively immutable, they are only ever added, not repaired.
Wget and youtube-dl don't retry in the same way because they can produce confusing temporary or partial output that broke my retry logic in the past. Codifying the expectations around how partial output is handled is on my roadmap. I'm designing a new spec to work with an event-sourcing model that guarantees extractors will only produce fully complete output, or fail hard, since partial output seems to be too tricky to handle well across all extractors without adding a lot of brittle logic.
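To illustrate the "fully complete output, or fail hard" idea in general terms, here is a small Python sketch of one common way to get that guarantee: stage the output in a temporary folder and only move it into place on success. This is not ArchiveBox's implementation or the planned spec. The names run_atomically and extract are hypothetical, and extract stands in for any extractor that writes files into the folder it is given.

```python
import shutil
import tempfile
from pathlib import Path


def run_atomically(extract, output_dir):
    """Run extract(staging_dir) and publish its files only if it succeeds.

    If extract raises, the staging folder is deleted and output_dir is left
    untouched, so readers never see partial output.
    """
    output_dir = Path(output_dir)
    output_dir.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the target so the final rename stays on one filesystem.
    staging = Path(
        tempfile.mkdtemp(prefix=output_dir.name + ".tmp-", dir=output_dir.parent)
    )
    try:
        extract(staging)
    except BaseException:
        # Fail hard: discard whatever was written, then re-raise.
        shutil.rmtree(staging, ignore_errors=True)
        raise
    # Success: drop any older complete result, then move the new one in.
    if output_dir.exists():
        shutil.rmtree(output_dir)
    staging.rename(output_dir)
    return output_dir
```

With this shape, a timed-out download leaves nothing behind in the media folder, so a later retry can start from a clean state instead of having to interpret leftover files.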
@melyux commented on GitHub (Aug 16, 2023):
Right, meant Singlefile/DOM/Screenshot failures are not overwritten but retried while Media/WGET are never retried. Good to hear about the upcoming either/or logic