[GH-ISSUE #38] Config: Add a way to blacklist URLs from being archived or indexed #1535

Closed
opened 2026-03-01 17:51:32 +03:00 by kerem · 9 comments
Originally created by @nodiscc on GitHub (Aug 3, 2017).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/38

There are some items in my bookmarks I wish were not archived (to save bandwidth and disk space), for example items I already have archived via other means (youtube-dl), or items that are just bookmarks, with no value in offline use, etc.

Would it be possible to add an option where the user can provide a list of URLs, which will never be archived?

They would still appear in the index archive, maybe as grayed out items? With the files/pdf/screenshot/a.org links removed?

@pirate commented on GitHub (Aug 3, 2017):

How about `env EXCLUDE_URLS=.*youtube.com.*,.*facebook.com/.*,.*.exe` and `env INDEX_EXCLUDED=True/False` to choose whether they appear in the index (without files/screenshot) or are entirely removed.

(defined in `config.py` of course)
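
As a rough illustration of the proposal above, here is a minimal Python sketch of how the two settings might be read and applied. The function names (`parse_exclude_patterns`, `parse_bool`, `is_excluded`, `plan_links`, `load_settings`) are hypothetical and not taken from the ArchiveBox code base. The sketch also assumes that each pattern must match the whole URL, since the suggested patterns carry their own leading and trailing `.*`.

```python
import os
import re


def parse_exclude_patterns(raw):
    """Split a comma-separated string into compiled regex patterns."""
    return [re.compile(part.strip()) for part in raw.split(",") if part.strip()]


def parse_bool(raw, default=False):
    """Interpret common truthy strings; fall back to default when unset."""
    if raw is None:
        return default
    return raw.strip().lower() in ("true", "1", "yes")


def is_excluded(url, patterns):
    """True if any pattern matches the whole URL (assumed semantics)."""
    return any(p.fullmatch(url) for p in patterns)


def plan_links(urls, patterns, index_excluded):
    """Return (to_archive, to_index) for the given settings."""
    to_archive = [u for u in urls if not is_excluded(u, patterns)]
    # Excluded URLs stay in the index only when INDEX_EXCLUDED is true.
    to_index = list(urls) if index_excluded else list(to_archive)
    return to_archive, to_index


def load_settings(env=os.environ):
    """Read both hypothetical settings from a mapping such as os.environ."""
    patterns = parse_exclude_patterns(env.get("EXCLUDE_URLS", ""))
    index_excluded = parse_bool(env.get("INDEX_EXCLUDED"))
    return patterns, index_excluded
```

One consequence of splitting on commas is that a pattern cannot itself contain a literal comma (for example a `{1,3}` quantifier), so a real implementation might prefer a different separator or a single combined regex.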

@nodiscc commented on GitHub (Aug 3, 2017):

> `env EXCLUDE_URLS=.*youtube.com.*,.*facebook.com/.*,.*.exe`

sounds really good. Having this in the config file would be better than having to copy-paste it to the command line (especially if the exclude list gets long).

@issmirnov commented on GitHub (Jul 7, 2018):

Great, started work on this.

@pirate do you prefer to keep `EXCLUDE_URLS` as a comma-separated list of regex rules, or just have users provide strings and assume an implied `.*` suffix and prefix?

Personally I prefer the brevity of `youtube.com,facebook.com` over `.*youtube.com.*,.*facebook.com.*`

@pirate commented on GitHub (Jul 8, 2018):

Explicit > implicit, let's keep full regex. This also allows us to exclude extensions like `.*+\.exe$`
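
To make the trade-off concrete, the following Python sketch compares the two options discussed here. `substring_to_regex` is a hypothetical helper that mimics the "plain string with implied `.*` prefix and suffix" idea. The explicit pattern is written as `.*\.exe$`, which is an assumption about the intended extension rule. Both example URLs are made up.

```python
import re


def substring_to_regex(fragment):
    """Hypothetical helper: treat a plain string as .*<literal>.*"""
    return re.compile(".*" + re.escape(fragment) + ".*")


# Implied-wildcard form: matches ".exe" anywhere in the URL.
implicit = substring_to_regex(".exe")

# Full-regex form: anchored so only URLs ending in ".exe" match.
explicit = re.compile(r".*\.exe$")

download = "https://example.com/setup.exe"
article = "https://example.com/why.exe-files-are-risky"

implicit_hits = [u for u in (download, article) if implicit.fullmatch(u)]
explicit_hits = [u for u in (download, article) if explicit.fullmatch(u)]
```

With these inputs the implied-wildcard form selects both URLs, while the anchored regex selects only the download, which is the kind of rule a plain substring list cannot express.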

@pirate commented on GitHub (Jan 30, 2019):

Hey @issmirnov, just checking in. I might do this ticket at some point in the next couple of months. If you have code you've already worked on, you're welcome to share it as a PR and I'll incorporate it into my patch for this feature.

@issmirnov commented on GitHub (Jan 30, 2019):

Hey @pirate, got swamped by life obligations so didn't have time to finish anything meaningful. All yours - thanks for checking in!

@pirate commented on GitHub (Mar 5, 2019):

@mlazana this is a great issue to start with. You can practice adding a new config option, going through the archiving/parsing flow to skip certain URLs, and updating the documentation to describe the new config option.

@mlazana commented on GitHub (Mar 14, 2019):

Great, I'll work on it!

@pirate commented on GitHub (Mar 30, 2019):

This is done, thanks @mlazana!
