mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #95] Add a way to delete an entry from the index and archive #1577
Originally created by @pigmonkey on GitHub (Sep 14, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/95
Occasionally I want to remove a URL from my archive. Currently this is a manual process: finding the entry in `index.json`, pulling out the timestamp, deleting the relevant lines, doing the same for `index.html`, and finally `rm -r output/archive/$timestamp`. It would be nice if there was some slightly more automated way of doing this.

Ideally I think this would be done with a final step after archiving, where the script would try to match each directory name in `output/` with a timestamp in `index.json`. If a match isn't found, the user is prompted with something like a yes/no confirmation to delete that directory. This may be a behavior that is only enabled by an optional config option.
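The proposed final step (match each directory in `output/archive/` against the timestamps in `index.json`, and prompt before deleting orphans) could be sketched roughly like this in Python. The index layout, prompt wording, and the `prune_orphans` helper are assumptions for illustration, not ArchiveBox's actual code:

```python
import json
import shutil
from pathlib import Path

OUTPUT = Path("output")  # assumed layout: output/index.json plus output/archive/<timestamp>/

def prune_orphans(assume_yes: bool = False) -> list:
    """Delete snapshot dirs whose timestamp no longer appears in index.json."""
    index = json.loads((OUTPUT / "index.json").read_text())
    known = {link["timestamp"] for link in index.get("links", [])}

    removed = []
    for snapshot_dir in sorted((OUTPUT / "archive").iterdir()):
        if snapshot_dir.name in known:
            continue  # still referenced by the index, keep it
        prompt = f"{snapshot_dir} is not in index.json. Delete? [y/N] "
        if assume_yes or input(prompt).strip().lower() == "y":
            shutil.rmtree(snapshot_dir)
            removed.append(snapshot_dir.name)
    return removed
```

Gating the whole pass behind a config option (as suggested above) would just mean wrapping the call to `prune_orphans()` in a check of that setting.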
@pirate commented on GitHub (Sep 20, 2018):
👍 Good idea
@pirate commented on GitHub (Oct 12, 2018):
I can't promise I'll get around to this soon since I'm super busy with my day-job, but it's on the TODO list!
@f0086 commented on GitHub (Oct 18, 2018):
I've recently imported my complete Pinboard archive; there were a lot of bookmarks with dead links in it.
(The script crashed a few times with "Too many open files" errors, so I had to rerun it a couple of times.)
My idea is to run this script once a day with a fresh dump of my Pinboard export (I wrote a little Go program which dumps the whole list from Pinboard). But with those 1063 links with errors, it will take hours (even with small timeouts), and it is totally useless to retry those links.
Because those 1063 dead links will always be in the exported list, the archiver will always retry downloading them. It would be nice if there were a flag or environment variable to skip links which previously failed to download. A "cleanup" flag would be even better, but skipping those links would be sufficient for my use case.
@pirate commented on GitHub (Oct 19, 2018):
How about an `--only-new` flag which makes it only add new links to the index, without attempting to update/retry older archived links?

@f0086 commented on GitHub (Oct 19, 2018):
Sounds good to me!
@f0086 commented on GitHub (Oct 19, 2018):
I've opened a pull request for that.
@pigmonkey commented on GitHub (Oct 26, 2018):
While I like the `ONLY_NEW` addition, I don't think it closes this issue. Eventually I'd still like to see some cleaner method for removing bookmarks, even if their content has already been downloaded.

One of my use cases is similar to the one @f0086 mentioned: I've imported a Pinboard JSON file which includes old URLs that now 404. It's annoying to have bookmark-archiver waste time on every run attempting to archive them. However, I don't want to enable `ONLY_NEW`. If I do that, and I add a good URL, but my machine goes offline during the next run (or the machine serving the URL happens to be offline, or they happen to be restarting their web server at that second), the archive will fail and it will not try again next time, despite there being a good chance the next attempt would succeed. This use case would be better solved by tracking the number of attempts per URL and having some sort of `MAX_URL_ATTEMPTS` configuration option. If this was set to 1, it would basically act exactly the same as `ONLY_NEW`: try once and ignore forever on failure. To address my concerns about temporary connectivity issues, I could set it to a value like 5.

My other use case is that I have successfully imported a URL and archived it, but it no longer provides any value to me and I want to recover the disk space. In this case neither `ONLY_NEW` nor `MAX_URL_ATTEMPTS` helps me. This may also be useful if you are serving your archive publicly and receive a take-down request that you wish to comply with.

I do think this is a low-priority issue, since bookmark-archiver's focus is sort of the opposite of deleting things.
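The `MAX_URL_ATTEMPTS` bookkeeping described above could be tracked per link in the index. A minimal Python sketch, where the `attempts`/`archived` field names and both helpers are assumptions rather than bookmark-archiver's actual schema:

```python
MAX_URL_ATTEMPTS = 5  # assumed config option; setting it to 1 behaves like ONLY_NEW

def should_retry(link: dict) -> bool:
    """Skip links that already succeeded or have exhausted their attempts."""
    if link.get("archived"):  # assumed success flag on the index entry
        return False
    return link.get("attempts", 0) < MAX_URL_ATTEMPTS

def record_attempt(link: dict, success: bool) -> dict:
    """Update bookkeeping on the index entry after an archive attempt."""
    link["attempts"] = link.get("attempts", 0) + 1
    link["archived"] = success
    return link
```

With this scheme, a transient failure still gets a handful of retries on later runs, while permanently dead links stop being attempted once the counter is exhausted.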
@pirate commented on GitHub (Oct 26, 2018):
Agreed, I didn't mean to close this issue by merging that PR; GitHub did it automatically.

I like the idea for `MAX_URL_ATTEMPTS`; that wouldn't be too hard to track in our index.json files.

@pigmonkey I split that out into its own issue: https://github.com/pirate/bookmark-archiver/issues/109
@f0086 commented on GitHub (Oct 26, 2018):
@pigmonkey I've also thought about the fact that a page could be temporarily offline at the time I scrape it. But for that case, you can simply run the archiver with `ONLY_NEW=False` once a day or so. Most of the time, a page goes offline either forever or for a short time (hours/days). So if I set `MAX_URL_ATTEMPTS` to 10 and run the script every hour, I have the same problem as with `ONLY_NEW` for a page which is offline for a day or two. I set up two cronjobs: once a week with `ONLY_NEW=False` and once an hour with `ONLY_NEW=True`.

@pirate commented on GitHub (Jul 24, 2020):
The new Django version has both the ability to remove snapshots from the archive, and a separate `archivebox update` command independent from `archivebox add`, so that you can control when to retry previously failed links.

Adding a `MAX_URL_ATTEMPTS` option will be tracked in this separate issue: https://github.com/pirate/ArchiveBox/issues/109
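As a usage sketch of the split described above (assuming an already-initialized ArchiveBox collection directory; exact behavior and flags may differ between versions):

```
# add new links to the index without retrying old ones
archivebox add 'https://example.com/some/page'

# later, explicitly retry/update previously saved or failed links
archivebox update
```

Keeping `add` and `update` separate gives the same control the `ONLY_NEW` cronjob setup discussed earlier approximated with environment variables.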