[GH-ISSUE #1549] Question: invalid link data directories, what to do next? #3939

Open
opened 2026-03-15 01:03:27 +03:00 by kerem · 5 comments

Originally created by @blastrock on GitHub (Oct 18, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1549

Hi,

I tried upgrading to 0.8.5rc44 and I have a lot of invalid data directories. I haven't started using archivebox seriously yet, so I'm not sure whether these directories were already invalid on 0.7.

Anyway, `archivebox init` says `Skipped adding 53 invalid link data directories.` I have checked the directories and they look fine to me, but I don't know what to look for.

I have run `archivebox status`, which shows `orphaned: 53` and suggests running `archivebox init` again. `archivebox list --status=invalid` just gives me the same list as `archivebox init`.

I'm a bit lost here. What's invalid about these directories? What can I do to recover them? Is there a way to make archivebox more verbose?

Note that I may have corrupted the index by manually deleting folders because of #1477, but my understanding is that `archivebox init` should fix the index just fine.


@pirate commented on GitHub (Oct 19, 2024):

Do they each have an `index.json` file present? If so, could you share one? You can redact parts of it if needed.


@blastrock commented on GitHub (Oct 19, 2024):

I didn't check them all, but every one I checked had an `index.json`.

Sure, here's one.

[marginalia.nu_index.json](https://github.com/user-attachments/files/17443224/marginalia.nu_index.json)

@blastrock commented on GitHub (Jan 4, 2025):

I dug deeper into this by reading the source of v0.8.5rc51. I think my issue is that the folder is correctly detected as an orphaned entry in

```python
# Links in data dir indexes but not in main index
orphaned_data_dir_links = {
    link.url: link
    for link in parse_json_links_details(out_dir)
    if not all_links.filter(url=link.url).exists()
}
```

but the link exists in `all_links` with a different timestamp. As mentioned in my first message, I had duplicates because of #1477, which I deleted manually. So some links have a different timestamp in the main index because I kept the wrong duplicate.
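
A rough way to confirm this kind of mismatch is to compare the timestamp stored in each snapshot folder's `index.json` with the one recorded for the same URL in the main index. The sketch below is not part of ArchiveBox. `find_timestamp_mismatches` is a hypothetical helper, and it assumes that the main index is `index.sqlite3` in the data directory, that `core_snapshot` has `url` and `timestamp` columns, and that each `index.json` carries `url` and `timestamp` keys. It opens the database read-only.

```python
import json
import sqlite3
from pathlib import Path


def find_timestamp_mismatches(data_dir):
    """Return (url, folder_timestamp, index_timestamp) for every snapshot
    folder whose URL is in the main index under a different timestamp."""
    data_dir = Path(data_dir).resolve()

    # Assumption: the main index is index.sqlite3 and core_snapshot has
    # url and timestamp columns. mode=ro keeps this check read-only.
    db_uri = (data_dir / "index.sqlite3").as_uri() + "?mode=ro"
    conn = sqlite3.connect(db_uri, uri=True)
    try:
        indexed = dict(conn.execute("SELECT url, timestamp FROM core_snapshot"))
    finally:
        conn.close()

    mismatches = []
    for index_file in sorted((data_dir / "archive").glob("*/index.json")):
        try:
            details = json.loads(index_file.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or malformed index.json: skip it
        if not isinstance(details, dict):
            continue
        # Assumption: each index.json has "url" and "timestamp" keys.
        url = details.get("url")
        folder_ts = str(details.get("timestamp"))
        index_ts = indexed.get(url)
        if index_ts is not None and str(index_ts) != folder_ts:
            mismatches.append((url, folder_ts, str(index_ts)))
    return mismatches
```

Calling `find_timestamp_mismatches("/path/to/data")` returns one tuple per folder whose URL is indexed under another timestamp. Folders whose URL is missing from the index entirely are not reported.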

What can I do to fix that? Deleting the index makes `archivebox init` complain. Is there a way to rebuild the index from just the archive folder? Can I `DELETE FROM core_snapshot` and have `archivebox init` repopulate the table?

Also, is this a bug? Should `archivebox init` just detect that some folders were deleted and remove them from the main index before considering orphaned folders?


@pirate commented on GitHub (Jan 4, 2025):

ArchiveBox is designed to pick up merged archive dirs even when the indexes are not merged. It's an intentional feature to allow merging archives by just dragging and dropping two folders together.

The fix is the same:

  1. Run `archivebox init` in a new empty directory
  2. Copy `oldarchive/archive/*` into the new `archive/` folder just created by `archivebox init`
  3. Run `archivebox init` inside the new data directory a 2nd time to pick up all the snapshots that were just copied in

I recommend doing this process on 0.7.3, not 0.8.5. You can upgrade to 0.8.5 after.
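
The copy in step 2 can be sketched in Python as follows. This is a minimal illustration and not part of ArchiveBox. `copy_snapshot_dirs` is a hypothetical helper, and it assumes the snapshot folders sit directly under `archive/` in both data directories. `archivebox init` still has to be run by hand before and after the copy, as in steps 1 and 3.

```python
import shutil
from pathlib import Path


def copy_snapshot_dirs(old_data_dir, new_data_dir):
    """Copy each snapshot folder from old archive/ into new archive/.

    Returns the names of the folders that were copied. Folders that
    already exist in the destination are left untouched.
    """
    src = Path(old_data_dir) / "archive"
    dst = Path(new_data_dir) / "archive"
    if not dst.is_dir():
        # Step 1 (`archivebox init` in the new directory) creates archive/.
        raise FileNotFoundError(f"{dst} is missing; run `archivebox init` there first")

    copied = []
    for snapshot_dir in sorted(src.iterdir()):
        if not snapshot_dir.is_dir():
            continue  # only snapshot folders are copied
        target = dst / snapshot_dir.name
        if target.exists():
            continue  # never overwrite an existing snapshot folder
        shutil.copytree(snapshot_dir, target)
        copied.append(snapshot_dir.name)
    return copied
```

After `copy_snapshot_dirs("oldarchive", "newarchive")` returns, running `archivebox init` inside the new data directory corresponds to step 3.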


@blastrock commented on GitHub (Jan 5, 2025):

Thank you, it worked!

I misunderstood your comment on #1477 and thought that the issue was fixed in 0.8, but I understand that you are still working on it. I fell into the trap again and now have duplicate directories ^^

Anyway, keep up the good work, I can't wait for 0.8 to come out!
