mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #38] Config: Add a way to blacklist URLs from being archived or indexed #3045
Originally created by @nodiscc on GitHub (Aug 3, 2017).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/38
There are some items in my bookmarks I wish were not archived (to save bandwidth and disk space), for example items I already have archived via other means (youtube-dl), or items that are just bookmarks, with no value in offline use, etc.
Would it be possible to add an option where the user can provide a list of URLs, which will never be archived?
They would still appear in the index archive, maybe as grayed out items? With the files/pdf/screenshot/a.org links removed?
@pirate commented on GitHub (Aug 3, 2017):
How about `env EXCLUDE_URLS=.*youtube.com.*,.*facebook.com/.*,.*.exe` and `env INDEX_EXCLUDED=True/False` to choose whether they appear in the index (without files/screenshot) or are entirely removed.
(defined in `config.py` of course)
@nodiscc commented on GitHub (Aug 3, 2017):
Sounds really good. Having this in the config file would be better than having to copy-paste it onto the command line (especially if the exclude list gets long).
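For illustration, here is a minimal sketch of how the two proposed options might be read from the environment. The option names `EXCLUDE_URLS` and `INDEX_EXCLUDED` come from the comment above; the helper names `parse_bool` and `load_exclude_config`, the accepted truthy strings, and the default of `True` for `INDEX_EXCLUDED` are assumptions, not ArchiveBox's actual `config.py`.

```python
import os


def parse_bool(value, default=False):
    """Interpret common truthy strings; fall back to `default` when unset."""
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")


def load_exclude_config(env=os.environ):
    """Return (patterns, index_excluded) from a mapping of settings.

    EXCLUDE_URLS is treated as a comma-separated list of regex strings.
    Hypothetical helper: splitting on commas means a pattern cannot
    itself contain a comma (for example a quantifier like {1,3}).
    """
    raw = env.get("EXCLUDE_URLS", "")
    patterns = [p.strip() for p in raw.split(",") if p.strip()]
    # Assumed default: excluded URLs stay visible in the index.
    index_excluded = parse_bool(env.get("INDEX_EXCLUDED"), default=True)
    return patterns, index_excluded
```

Passing a plain dict instead of `os.environ` makes the function easy to try out, e.g. `load_exclude_config({"EXCLUDE_URLS": ".*youtube.com.*"})`.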
@issmirnov commented on GitHub (Jul 7, 2018):
Great, started work on this.
@pirate do you prefer to keep `EXCLUDE_URLS` as a comma-separated list of regex rules, or just have users provide strings and assume an implied `.*` suffix and prefix?
Personally I prefer the brevity of `youtube.com,facebook.com` over `.*youtube.com.*,.*facebook.com.*`
@pirate commented on GitHub (Jul 8, 2018):
Explicit > implicit, let's keep full regex. This also allows us to exclude extensions like `.*\.exe$`
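To make the trade-off concrete, here is a small sketch comparing the two styles discussed above. It assumes each pattern must match the whole URL (`re.fullmatch`), which is one plausible reading of the `.*...*` examples; the function names `is_excluded` and `wrap_implied` are hypothetical, and the escaped-dot patterns are illustrative rather than taken from the project.

```python
import re


def is_excluded(url, patterns):
    """Return True if any regex in `patterns` matches the entire URL."""
    return any(re.fullmatch(p, url) for p in patterns)


def wrap_implied(fragment):
    """The implied-wildcard alternative: treat `fragment` as a literal
    substring and surround it with `.*` on both sides."""
    return ".*" + re.escape(fragment) + ".*"


# Explicit full regexes can anchor a match to the end of the URL.
explicit = [r".*youtube\.com.*", r".*\.exe$"]

# Implied wildcards can only express "contains this text".
implied = [wrap_implied("youtube.com"), wrap_implied(".exe")]

print(is_excluded("https://example.com/setup.exe", explicit))       # True
print(is_excluded("https://example.com/setup.exe.html", explicit))  # False
print(is_excluded("https://example.com/setup.exe.html", implied))   # True
```

The last two lines show the difference: only the explicit form can say "ends with `.exe`", while the implied form also excludes any URL that merely contains `.exe`.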
@pirate commented on GitHub (Jan 30, 2019):
Hey @issmirnov, just checking in. I might do this ticket at some point in the next couple of months. If you have code you've already worked on, you're welcome to share it as a PR and I'll incorporate it into my patch for this feature.
@issmirnov commented on GitHub (Jan 30, 2019):
Hey @pirate, got swamped by life obligations so didn't have time to finish anything meaningful. All yours - thanks for checking in!
@pirate commented on GitHub (Mar 5, 2019):
@mlazana this is a great issue to start with. You can practice adding a new config option, going through the archiving/parsing flow to skip certain URLs, and updating the documentation to describe the new config option.
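As a rough sketch of the "skip certain URLs" step in that flow, the snippet below filters a list of parsed URLs against the exclusion patterns. The `Link` class, its `archive` flag, and `apply_exclusions` are hypothetical names used only for illustration; they are not ArchiveBox's real data model, and whole-URL matching is an assumption.

```python
import re
from dataclasses import dataclass


@dataclass
class Link:
    url: str
    archive: bool = True  # False means: list in the index, fetch nothing


def apply_exclusions(urls, patterns, index_excluded):
    """Drop or flag URLs that match any exclusion regex.

    With index_excluded=True, matching URLs are kept as index-only
    entries; with False, they are removed entirely.
    """
    compiled = [re.compile(p) for p in patterns]
    links = []
    for url in urls:
        if any(rx.fullmatch(url) for rx in compiled):
            if index_excluded:
                links.append(Link(url, archive=False))
            continue
        links.append(Link(url))
    return links


urls = ["https://www.youtube.com/watch?v=x", "https://example.com/"]
kept = apply_exclusions(urls, [r".*youtube\.com.*"], index_excluded=True)
print([(link.url, link.archive) for link in kept])
```

Running the filter right after parsing, before any fetching starts, is what would save the bandwidth and disk space the original request asks for.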
@mlazana commented on GitHub (Mar 14, 2019):
Great, I'll work on it!
@pirate commented on GitHub (Mar 30, 2019):
This is done, thanks @mlazana!