[GH-ISSUE #95] Add a way to delete an entry from the index and archive #1577

Closed
opened 2026-03-01 17:51:53 +03:00 by kerem · 10 comments

Originally created by @pigmonkey on GitHub (Sep 14, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/95

Occasionally I want to remove a URL from my archive. Currently this is a manual process of finding the entry in index.json, pulling out the timestamp, deleting the relevant lines, doing the same for index.html, and finally rm -r output/archive/$timestamp.

It would be nice if there was some slightly more automated way of doing this. Ideally I think this would be done with a final step after archiving, where the script would try to match each directory name in output/ with a timestamp in index.json. If a match isn't found, the user is prompted with something like:

1536723384 not found in bookmark index. Delete output directory? (y/n)

This may be a behavior that is only enabled by an optional config option.


@pirate commented on GitHub (Sep 20, 2018):

👍 Good idea


@pirate commented on GitHub (Oct 12, 2018):

I can't promise I'll get around to this soon since I'm super busy with my day job, but it's on the TODO list!


@f0086 commented on GitHub (Oct 18, 2018):

I've recently imported my complete Pinboard archive, and there were a lot of bookmarks with dead links in it:

[√] [2018-10-17 23:28:37] Update of 4249 links complete (133.12 min)
    - 15219 entries skipped
    - 714 entries updated
    - 1063 errors

(The script crashed a few times with "Too many open files" errors, so I had to rerun it a couple of times)

My idea is to run this script once a day with a fresh dump of my Pinboard export (I've written a little Go program which dumps the whole list from Pinboard). But with those 1063 links with errors, it will take hours (even with small timeouts), and it is totally useless to retry those links.

Because those 1063 dead links will always be in that exported list, the archiver will always retry downloading them. It would be nice if there were a flag or environment variable to skip links which previously failed to download. A "cleanup" flag would be even better, but skipping those links would be sufficient for my use case.


@pirate commented on GitHub (Oct 19, 2018):

How about an --only-new flag which makes it only add new links to the index, without attempting to update/retry older archived links?
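
In essence, `--only-new` is a set difference against the existing index. A minimal sketch of that behavior (an assumption for illustration: deduplicating on exact URL, which the real implementation may refine):

```python
def only_new(imported_urls, indexed_urls):
    """Keep only imported URLs not already present in the archive index,
    preserving their original order."""
    seen = set(indexed_urls)
    return [url for url in imported_urls if url not in seen]
```

With this, previously archived (or previously failed) links are simply never re-queued, which is exactly the trade-off discussed below.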


@f0086 commented on GitHub (Oct 19, 2018):

Sounds good to me!


@f0086 commented on GitHub (Oct 19, 2018):

I've opened a pull request for that.


@pigmonkey commented on GitHub (Oct 26, 2018):

While I like the ONLY_NEW addition, I don't think it closes this issue. Eventually I'd still like to see some cleaner method for removing bookmarks, even if their content has already been downloaded.

One of my use cases is similar to the one @f0086 mentioned: I've imported a Pinboard json file which includes old URLs that now 404. It's annoying to have bookmark-archiver waste time on every run attempting to archive them. However, I don't want to enable ONLY_NEW. If I do that, and I add a good URL, but my machine goes offline during the next run (or the machine serving the URL happens to be offline, or if they happen to be restarting their web server at that second) the archive will fail and it will not try again next time, despite there being a good chance the next attempt would succeed. This use case would be better solved by tracking the number of attempts per URL and having some sort of MAX_URL_ATTEMPTS configuration option. If this was set to 1, it would basically act exactly the same as ONLY_NEW: try once and ignore forever on failure. To address my concerns about temporary connectivity issues, I could set it to a value like 5.

My other use case is that I have successfully imported a URL and archived it, but it no longer provides any value to me and I want to recover the disk space. In this case neither ONLY_NEW nor MAX_URL_ATTEMPTS helps me. This may also be useful if you are serving your archive publicly and receive a take-down request that you wish to comply with.

I do think this is a low priority issue, since bookmark-archiver's focus is sort of the opposite of deleting things.
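
The MAX_URL_ATTEMPTS idea described above could look something like this (hypothetical sketch: the `succeeded` and `attempts` fields are invented for illustration and are not real index fields):

```python
def should_retry(link, max_attempts):
    """Skip links that already succeeded or have exhausted their retry budget."""
    if link.get("succeeded"):
        return False
    return link.get("attempts", 0) < max_attempts

# Example: with max_attempts=5, the dead link is given up on,
# but the flaky one still gets retried.
links = [
    {"url": "https://example.com/ok",    "succeeded": True,  "attempts": 1},
    {"url": "https://example.com/dead",  "succeeded": False, "attempts": 5},
    {"url": "https://example.com/flaky", "succeeded": False, "attempts": 2},
]
retryable = [l["url"] for l in links if should_retry(l, max_attempts=5)]
```

Setting `max_attempts=1` reduces to the try-once-and-ignore behavior of ONLY_NEW, as noted above.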


@pirate commented on GitHub (Oct 26, 2018):

Agreed, I didn't mean to close this issue by merging that PR; GitHub did it automatically.

I like the idea for MAX_URL_ATTEMPTS, that wouldn't be too hard to track in our index.json files.

@pigmonkey I split that out into its own issue: https://github.com/pirate/bookmark-archiver/issues/109


@f0086 commented on GitHub (Oct 26, 2018):

@pigmonkey I've also thought about the fact that a page could be temporarily offline at the time I scrape it. But for that case, you can simply run the archiver with ONLY_NEW=False once a day or so. Most of the time, a page goes offline either forever or for a short time (hours/days). So if I set MAX_URL_ATTEMPTS to 10 and run the script every hour, I have the same problem as with ONLY_NEW for a page which is offline for a day or two. I set up two cronjobs: once a week with ONLY_NEW=False and once an hour with ONLY_NEW=True.


@pirate commented on GitHub (Jul 24, 2020):

The new django version has both the ability to remove snapshots from the archive, and a separate archivebox update command independent from archivebox add so that you can control when to retry previously failed links.

git checkout django
git pull
# or pip install -e . to run it without docker
docker build . -t archivebox
docker run -v $PWD/output:/data archivebox init
docker run -v $PWD/output:/data archivebox add 'https://example.com'
docker run -v $PWD/output:/data archivebox remove --help
docker run -v $PWD/output:/data archivebox remove --delete 'https://example.com'
docker run -v $PWD/output:/data archivebox update

Adding a MAX_URL_ATTEMPTS option will be tracked in this separate issue: https://github.com/pirate/ArchiveBox/issues/109
