[GH-ISSUE #1250] Feature Request: Simple lists as well as regexes for allow/denylists #2277

Open
opened 2026-03-01 17:57:51 +03:00 by kerem · 1 comment
Owner

Originally created by @admiral-Guck on GitHub (Oct 21, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1250

## Type

- [ ] General question or discussion
- [ ] Propose a brand new feature
- [x] Request modification of existing behavior or design

Regexes quickly become unwieldy when dealing with many domains. Two new files like `data/allowlist` and `data/denylist` with a `proto://domain.tld\n`/`proto://*.domain.tld\n` format would make quickly adding domains much easier than modifying or writing regexes. I suppose this represents an additional pass over queued URLs with a regex compiled from these two lists.

I recognise the flexibility a regex provides and don't want to impinge on that functionality, merely add a more comfortable method for the simple cases (https://github.com/ArchiveBox/ArchiveBox/issues/1251).

## How badly do you want this new feature?

- [ ] It's an urgent deal-breaker, I can't live without it
- [x] It's important to add it in the near-mid term future
- [ ] It would be nice to have eventually

---

- [x] I'm willing to contribute [dev time](https://github.com/ArchiveBox/ArchiveBox#archivebox-development) / [money](https://github.com/sponsors/pirate) to fix this issue
- [x] I like ArchiveBox so far / would recommend it to a friend
- [ ] I've had a lot of difficulty getting ArchiveBox set up
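
To make the request concrete, here is a minimal Python sketch of how the proposed list files might be compiled into a single regex and applied as an extra pass over queued URLs. Nothing here is existing ArchiveBox code: `compile_url_list` and `url_permitted` are hypothetical names, and several behaviors are assumptions the issue does not specify (blank lines and `#` comment lines are skipped, `*.domain.tld` matches subdomains only and not the bare domain, matching is case-insensitive, and the denylist takes precedence over the allowlist).

```python
import re
from typing import Iterable, Optional, Pattern


def compile_url_list(lines: Iterable[str]) -> Optional[Pattern[str]]:
    """Compile entries like 'https://example.com' or 'https://*.example.com'
    into one regex. Returns None when the list has no usable entries."""
    alternatives = []
    for raw in lines:
        entry = raw.strip()
        # Assumption: blank lines and '#' comments are ignored.
        if not entry or entry.startswith('#'):
            continue
        # Escape the entry literally, then expand the escaped '*.' wildcard
        # into "one or more subdomain labels".
        escaped = re.escape(entry).replace(r'\*\.', r'(?:[^/.:]+\.)+')
        # Require the host to end here, so 'example.com' does not also
        # match 'example.com.evil.net'.
        alternatives.append(escaped + r'(?:[/:?#]|$)')
    if not alternatives:
        return None
    return re.compile('^(?:' + '|'.join(alternatives) + ')', re.IGNORECASE)


def url_permitted(url: str,
                  allow: Optional[Pattern[str]],
                  deny: Optional[Pattern[str]]) -> bool:
    """Assumption: deny wins over allow; an empty allowlist permits everything."""
    if deny is not None and deny.match(url):
        return False
    return allow is None or bool(allow.match(url))


# Stand-ins for the contents of data/allowlist and data/denylist.
allow = compile_url_list(['https://example.com\n', 'https://*.example.org\n', '\n'])
deny = compile_url_list(['https://ads.example.org\n'])
```

With these lists, `url_permitted('https://sub.example.org/x', allow, deny)` is permitted, while `https://ads.example.org/banner` (denied) and `https://example.com.evil.net/` (not on the allowlist) are rejected. Existing regex-based filtering would be unaffected; this would only run alongside it.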
Author
Owner

@pirate commented on GitHub (Oct 22, 2023):

I totally agree, a better long-term solution is needed here.

I've been toying with the idea of "rules" and "rule sets" lately. Not just as a way to block or allow certain URLs but also as a way to trigger extractors to run in the first place / as a basic architectural building block for a lot of ArchiveBox behavior in a more event-driven reimagining of the current structure.
