[GH-ISSUE #433] Bugfix: deleted item re-appears upon next import of URLs #291

Closed
opened 2026-03-01 14:42:08 +03:00 by kerem · 13 comments
Owner

Originally created by @aayio on GitHub (Aug 10, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/433

Thank you in advance for your help,
Sorry if this isn't experienced universally and it's just something I'm not doing right 😕

Describe the bug

Deleted item is re-imported upon the next import of (unrelated) URLs

Steps to reproduce

  1. Delete item from web UI by clicking on item timestamp > Delete
  2. Import new (unrelated) URLs in web UI
  3. New URLs import correctly, but the recently deleted item is also re-imported

Software versions

  • OS: Debian 10
  • ArchiveBox version: Docker c8e3aed

@cdvv7788 commented on GitHub (Aug 10, 2020):

I was able to reproduce the bug. @mauvity for now, as a workaround, you can select the items you want to delete from the list and click the delete button at the top right:
image

I will send a PR to fix the issue soon.


@pirate commented on GitHub (Aug 10, 2020):

@cdvv7788 the timestamp > delete version will be fixed automatically once we remove the json main index

don't bother fixing it for now, it would just add a bunch of workaround complexity for a problem that's going away soon anyway.


@cdvv7788 commented on GitHub (Aug 10, 2020):

Ok. Please leave this open so we don't forget to check back once we merge the index changes.


@cdvv7788 commented on GitHub (Oct 7, 2020):

@mauvity can you please check if the current version on master fixes it? We refactored the index internals.


@pirate commented on GitHub (Oct 7, 2020):

There is still a functional difference between the two ways:

  • Delete button = delete the index record and all the archived files
  • timestamp -> delete = delete only the index record without removing any archived files (they become orphans that will be re-imported on the next archivebox init)
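To make the difference between the two delete paths concrete, here is a minimal sketch. It is not ArchiveBox's actual code: `find_orphans`, `delete_record_only`, `delete_record_and_files`, the dict-based index, and the timestamp-named folders are all hypothetical stand-ins, assumed only for illustration.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path


def find_orphans(archive_dir: Path, indexed: set[str]) -> list[str]:
    """Return snapshot folders on disk that have no matching index record."""
    return sorted(
        p.name for p in archive_dir.iterdir()
        if p.is_dir() and p.name not in indexed
    )


def delete_record_only(index: dict[str, str], timestamp: str) -> None:
    """Hypothetical 'timestamp -> delete' path: drops the index record only."""
    index.pop(timestamp, None)


def delete_record_and_files(index: dict[str, str], archive_dir: Path, timestamp: str) -> None:
    """Hypothetical list-page 'Delete button' path: drops the record and its files."""
    index.pop(timestamp, None)
    shutil.rmtree(archive_dir / timestamp, ignore_errors=True)


def demo() -> tuple[list[str], list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        archive_dir = Path(tmp)
        index: dict[str, str] = {}
        for ts, url in [("1597000000", "https://example.com/a"),
                        ("1597000001", "https://example.com/b")]:
            (archive_dir / ts).mkdir()
            index[ts] = url

        # Record-only delete leaves the folder behind as an orphan.
        delete_record_only(index, "1597000000")
        after_record_only = find_orphans(archive_dir, set(index))

        # Full delete removes the folder too, so it adds no new orphan.
        delete_record_and_files(index, archive_dir, "1597000001")
        after_full_delete = find_orphans(archive_dir, set(index))
        return after_record_only, after_full_delete
```

Under these assumptions, any step that rebuilds the index from what is on disk would pick the orphaned folder up again, which matches the re-import behaviour described above.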

@cdvv7788 commented on GitHub (Oct 7, 2020):

Oh right, the delete functionality has not been touched in the refactor.


@cdvv7788 commented on GitHub (Oct 9, 2020):

@pirate what should we do about this? Maybe add a confirmation and change both methods to remove the actual files? If the admin is a way to maintain the index, leaving orphaned folders may be unnecessary.


@pirate commented on GitHub (Oct 10, 2020):

I think removing the delete button from the snapshot admin detail page is enough for now. (Leave the delete button on the list page the way it is now).


@pirate commented on GitHub (Dec 11, 2020):

@cdvv7788 is this fixed in v0.5.0? If not, can we do that?


@pirate commented on GitHub (Apr 6, 2021):

I'm pretty sure this was already fixed in v0.5.6. Comment back here if you're still seeing the issue and I'll reopen the ticket.


@235 commented on GitHub (Jan 4, 2024):

The bug is re-appearing in ArchiveBox v0.7.1. It's quite odd to see a new import full of entries that were deleted earlier.

I've just observed another bug, which could be related: a handful of deleted entries re-appeared at the top of the list with newer dates. These entries weren't indexed yet; I suspect the extractor already had them in the queue and inserted them back as it went through them.

cc: @pirate


@pirate commented on GitHub (Jan 4, 2024):

@235 Can you confirm this is happening when you delete an older completed Snapshot that does not have the same URL present in a later import?

Deleting does not prevent a URL from being re-added in the future, so if you deleted some Snapshots and then re-imported the same URLs later on, they will re-appear (as new Snapshot entries).

Deleting during an import is also totally broken/not advised. This is the downside of making all my import code immutable/idempotent (it overwrites entries entirely on changes instead of mutating them in-place). Because Snapshots are operated on in-memory, it rewrites the DB and disk entries several times from memory as it does work during the import process, and as long as a Snapshot is still in-memory being operated on, the import doesn't notice when a user deletes the DB/disk entry out from underneath it.
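A rough sketch of why a delete during an import gets undone, assuming a simplified model: a plain dict stands in for the DB, and `Snapshot`, `write`, and `run_import` are hypothetical names, not ArchiveBox's real classes or functions.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Snapshot:
    """Immutable in-memory snapshot (hypothetical, for illustration only)."""
    url: str
    status: str = "pending"


def write(db: dict[str, Snapshot], snap: Snapshot) -> None:
    """Idempotent write: overwrite the whole record instead of mutating it."""
    db[snap.url] = snap


def run_import(db: dict[str, Snapshot], urls: list[str],
               delete_midway: str | None = None) -> dict[str, Snapshot]:
    in_memory = [Snapshot(url) for url in urls]

    # First pass: create the records from the in-memory copies.
    for snap in in_memory:
        write(db, snap)

    # A user deletes one record while the import is still running.
    if delete_midway is not None:
        db.pop(delete_midway, None)

    # Second pass: the import rewrites every record from memory again,
    # so the deleted record comes back without anyone noticing.
    for snap in in_memory:
        write(db, replace(snap, status="archived"))
    return db
```

In this model the record only stays deleted if the delete happens after the import has finished writing; a later import of the same URL would still add it back as a new entry, as noted above.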


@235 commented on GitHub (Jan 17, 2024):

As discussed in the other ticket, this was a deletion DURING an import. We can ignore the report here and focus on the discussion in the other ticket. TY!
