mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #1549] Question: invalid link data directories, what to do next? #919
Originally created by @blastrock on GitHub (Oct 18, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1549
Hi,
I tried upgrading to 0.8.5rc44 and I have a lot of invalid data directories. I haven't started using ArchiveBox seriously yet, so I'm not sure whether these directories were already invalid on 0.7.
Anyway, `archivebox init` says `Skipped adding 53 invalid link data directories.`. I have checked the directories, they look fine to me, but I don't know what to look for.
I have run `archivebox status`, which shows `orphaned: 53` and suggests to run `archivebox init` again. And `archivebox list --status=invalid` just gives me the same list as `archivebox init`.
I'm a bit lost here, what's invalid about these directories? What can I do to recover them? Is there a way to make archivebox more verbose?
Note that I may have corrupted the index by manually deleting folders because of #1477, but my understanding is that `archivebox init` should fix the index just fine.
@pirate commented on GitHub (Oct 19, 2024):
Do they each have an index.json file present? If so, could you share one? You can redact parts of it if needed.
@blastrock commented on GitHub (Oct 19, 2024):
I didn't check them all, but all that I have checked had an index.json.
Sure, here's one.
marginalia.nu_index.json
@blastrock commented on GitHub (Jan 4, 2025):
I dug deeper into this, looking at the sources of v0.8.5rc51. I think my issue is that the folder is correctly detected as an orphaned entry in
but the link exists in `all_links` with a different timestamp. As said in my first message, I had duplicates because of #1477, which I deleted manually. So, some links have a different timestamp in the main index because I kept the wrong duplicate.
What can I do to fix that? Deleting the index makes `archivebox init` complain. Is there a way to rebuild the index from just the archive folder? Can I `DELETE FROM core_snapshot` and have `archivebox init` repopulate the table?
Also, is this a bug? Should `archivebox init` just detect that some folders were deleted and remove them from the main index before considering orphaned folders?
@pirate commented on GitHub (Jan 4, 2025):
ArchiveBox is designed to pick up merged archive dirs even when the indexes are not merged. It's an intentional feature to allow merging archives by just dragging and dropping two folders together.
The fix is the same:
1. Run `archivebox init` in a new empty directory
2. Copy `oldarchive/archive/*` into the new `archive/` folder just created by `archivebox init`
3. Run `archivebox init` inside the new data directory a 2nd time to pick up all the snapshots that were just copied in
I recommend doing this process on 0.7.3, not 0.8.5. You can upgrade to 0.8.5 after.
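The steps above can be sketched as a small shell function. This is only an illustration: the function name `rebuild_collection` and the example paths are hypothetical, and it assumes the `archivebox` CLI is on `PATH` and that `archivebox init` creates the `archive/` folder in the new directory, as described above. It copies rather than moves, so the old data directory stays untouched if something goes wrong.

```shell
#!/bin/sh
# Sketch only: rebuild_collection is a hypothetical helper, not an ArchiveBox command.
# Assumes `archivebox` is on PATH (0.7.3 is recommended above for this process).
set -eu

rebuild_collection() {
    old_dir=$1   # existing data directory, containing archive/
    new_dir=$2   # new, empty data directory

    mkdir -p "$new_dir"

    # 1. Initialize a fresh collection in the new empty directory.
    (cd "$new_dir" && archivebox init)

    # 2. Copy the old snapshot folders into the freshly created archive/ folder.
    cp -R "$old_dir"/archive/. "$new_dir"/archive/

    # 3. Run init a second time so it picks up the copied snapshots.
    (cd "$new_dir" && archivebox init)
}

# Example call (hypothetical paths):
# rebuild_collection "$HOME/oldarchive" "$HOME/newarchive"
```

Afterwards, running `archivebox status` in the new directory should show whether any folders are still reported as orphaned.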
@blastrock commented on GitHub (Jan 5, 2025):
Thank you, it worked!
I misunderstood your comment on #1477 and thought that the issue was fixed in 0.8, but I understand that you are still working on it. I fell into the trap again and now have duplicate directories ^^
Anyway, keep up the good work, I can't wait for 0.8 to come out!