mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #1477] Bug: archivebox add --update creates a different snapshot directory #3893
Originally created by @blastrock on GitHub (Aug 4, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1477
Describe the bug
I used `archivebox add --update`, but I discovered that every time I used it for a single URL, a new, different archive folder was created. I now have a lot of `invalid` snapshots.

I expected `archivebox add` not to put my archive into an inconsistent state, and to just update the links that were already added.

Steps to reproduce
This is what I (think I) did; not all steps may be necessary:
1. `archivebox add --index-only`
2. `archivebox add --extract headers,title,favicon,media --update`

Every time you repeat step 2, a new snapshot with a different timestamp is created. Only the first one (which is empty) appears in `archivebox list`.

Screenshots or log output
I now have a lot of invalid archive dirs that `archivebox init` ignores.

Do you have an idea about how to fix this? I'd like to keep only the latest snapshot for each URL.
ArchiveBox version
I am using the `archivebox/archivebox:main` docker image.

@huyz commented on GitHub (Aug 4, 2024):
Sounds like my problem: https://github.com/ArchiveBox/ArchiveBox/discussions/1474
@pirate commented on GitHub (Aug 5, 2024):
Yes I believe it's probably the same issue, I'll ensure it's fixed in the upcoming 0.8 mega release. Apologies for the trouble, thanks for reporting.
@huyz commented on GitHub (Aug 5, 2024):
Is there an easy way to identify and nuke the invalid directories?
@pirate commented on GitHub (Aug 6, 2024):
`archivebox list --status=invalid --csv=timestamp,url`

- `--status=xyz` can be any of the statuses listed in `archivebox status` (e.g. `invalid`, `orphaned`, `duplicate`, etc.)
- `--csv=timestamp` will show the timestamp as the first column in the output. The archive dirs are stored in `./data/archive/<ts>`, so you can use that to delete those dirs.
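
For anyone who wants to script that cleanup, here is a minimal sketch in Python, not an official ArchiveBox tool. It assumes you have captured the output of the `archivebox list` command above as text, that the timestamp is the first CSV column, and that timestamps are purely numeric (for example `1722766800` or `1722766800.123456`). The function names `invalid_snapshot_dirs` and `remove_dirs` are hypothetical, and the numeric check is an added safety assumption so that header rows or odd values can never point outside the archive folder. It defaults to a dry run so you can review the list before anything is deleted.

```python
import csv
import io
import re
import shutil
from pathlib import Path

# Assumption: snapshot timestamps are numeric, e.g. "1722766800" or "1722766800.123456".
TIMESTAMP_RE = re.compile(r"\d+(\.\d+)?")


def invalid_snapshot_dirs(csv_text, archive_root):
    """Map CSV rows (timestamp in the first column) to existing dirs under archive_root."""
    root = Path(archive_root)
    dirs = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        ts = row[0].strip()
        # Skip header rows and anything that is not a plain numeric timestamp,
        # so a value like "../something" can never escape archive_root.
        if not TIMESTAMP_RE.fullmatch(ts):
            continue
        candidate = root / ts
        if candidate.is_dir():
            dirs.append(candidate)
    return dirs


def remove_dirs(dirs, dry_run=True):
    """Print each directory; delete it only when dry_run is False."""
    for d in dirs:
        print(("would remove " if dry_run else "removing ") + str(d))
        if not dry_run:
            shutil.rmtree(d)
```

A possible way to use it: save the command's output to a file, read it into a string, call `remove_dirs(invalid_snapshot_dirs(text, "./data/archive"))` to preview, and pass `dry_run=False` only once the printed list looks right. Taking a backup of the data folder first is a sensible precaution.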