[GH-ISSUE #1481] Empty duplicate archive directories are created when update is run on 0.7.2 #2382

Open
opened 2026-03-01 17:58:39 +03:00 by kerem · 0 comments
Owner

Originally created by @pirate on GitHub (Aug 6, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1481

Discussed in https://github.com/ArchiveBox/ArchiveBox/discussions/1474

Originally posted by huyz July 30, 2024
@pirate wrote at https://github.com/ArchiveBox/ArchiveBox/discussions/1375#discussioncomment-8939092:

  • Snapshot 1:1 with each URL added to ArchiveBox, not created unless a new unique URL is added

If there's a 1:1 mapping from URL to snapshot, why do I have 22 URLs (UI says 22 snapshots) but my filesystem has 368 timestamped folders?

I have a schedule every 15 minutes to check on the feed so that ArchiveBox can detect new URLs in my feed fairly quickly:

❯ archivebox schedule --show
[i] [2024-07-30 11:02:27] ArchiveBox v0.7.2: archivebox schedule --show
    > /data

*/15 * * * * cd /data && /usr/local/bin/archivebox add --depth=1 "https://***/feeds/*****/all" >> /data/logs/schedule.log 2>&1 # archivebox_schedule

My understanding is that ArchiveBox wouldn't try to snapshot URLs it has already seen, unless there was an error (assuming ONLY_NEW is set to the default False).
So why do I keep getting more and more timestamped folders in the filesystem?

In the UI, only the oldest timestamp actually shows up. These newer timestamped folders don't seem to be accessible through the web UI. If I manually edit the https://archivebox***/archive/1722239597.311123/index.html URL and replace the oldest timestamp with one of the newer ones, it's not found: No Snapshot directories match the given timestamp or UUID: 1722263974.282957

My archive folder keeps growing by GBs.
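
To measure the mismatch described above (22 snapshots in the UI versus 368 timestamped folders on disk), a small diagnostic along these lines may help. It is only a sketch under stated assumptions: it assumes the data directory contains an archive/ folder with one timestamp-named subfolder per snapshot, as the /archive/1722239597.311123/ URL suggests, and that the timestamps the UI knows about are supplied by hand or from an export. The function name find_unindexed_dirs and its parameters are hypothetical and not part of ArchiveBox.

```python
import re
from pathlib import Path

# Folder names that look like ArchiveBox snapshot timestamps, e.g. 1722239597.311123
TIMESTAMP_DIR = re.compile(r"^\d+(\.\d+)?$")


def find_unindexed_dirs(archive_dir, known_timestamps):
    """Return (unindexed, empty_unindexed) lists of Path objects.

    archive_dir      -- path to the archive/ folder (assumed layout)
    known_timestamps -- timestamps of the snapshots shown in the UI (assumed input)
    """
    known = set(known_timestamps)
    unindexed = [
        entry
        for entry in sorted(Path(archive_dir).iterdir())
        if entry.is_dir()
        and TIMESTAMP_DIR.match(entry.name)
        and entry.name not in known
    ]
    # A folder counts as empty when it contains no files or subfolders at all.
    empty_unindexed = [p for p in unindexed if not any(p.iterdir())]
    return unindexed, empty_unindexed


if __name__ == "__main__":
    # Hypothetical usage: /data is the data directory from the schedule output above.
    unindexed, empty = find_unindexed_dirs("/data/archive", ["1722239597.311123"])
    print(f"{len(unindexed)} folders are not in the supplied list, {len(empty)} of them empty")
```

The script only reads the filesystem and deletes nothing, so it should be safe to run against a live data directory before deciding what to clean up.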
