[PR #1195] [MERGED] Add method-specific URL allow/deny lists #1341

Closed
opened 2026-03-01 14:49:23 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1195
Author: @overhacked
Created: 7/31/2023
Status: Merged
Merged: 10/28/2023
Merged by: @pirate

Base: dev ← Head: method_allow_deny


📝 Commits (4)

  • 46e80dd Rename URL_(WHITE|BLACK)LIST to URL_(ALLOW|DENY)LIST
  • b44f7e6 Add URL-specific method allow/deny lists
  • 2076474 Drop use of TypeAlias to maintain Python 3.9 compat
  • 63ad43f Merge branch 'dev' into method_allow_deny

📊 Changes

6 files changed (+96 additions, -24 deletions)

📝 archivebox/config.py (+13 -5)
📝 archivebox/config_stubs.py (+1 -1)
📝 archivebox/core/forms.py (+1 -1)
📝 archivebox/extractors/__init__.py (+36 -11)
📝 archivebox/index/__init__.py (+4 -4)
📝 tests/test_extractors.py (+41 -2)

📄 Description

Summary

This adds the ability to toggle extractors (aka methods, aka outputs) on a URL-specific basis. This is useful for sites on which singlepage, for example, does not provide a usable snapshot, or for cases in which you want to download only the media for a URL and nothing else.

This PR also includes a commit to rename URL_(WHITE|BLACK)LIST to URL_(ALLOW|DENY)LIST as proposed in the documentation. The old names are preserved as aliases. I included this change in this PR so as not to have to name the new configuration parameters with the deprecated terms.
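
To illustrate what "preserved as aliases" can mean in practice, here is a minimal sketch of alias resolution in a config loader. It is not the code from this PR: the `ALIASES` table and the `resolve_config` helper are hypothetical, and the rule that a new name wins over its deprecated alias is an assumption.

```python
# Hypothetical alias table: deprecated option name -> current option name.
ALIASES = {
    "URL_WHITELIST": "URL_ALLOWLIST",
    "URL_BLACKLIST": "URL_DENYLIST",
}


def resolve_config(raw):
    """Return a copy of `raw` with deprecated keys mapped to their new names.

    Assumption: if both the old and the new name are set, the new name wins.
    """
    resolved = {}
    # First pass: copy every key that is not a deprecated alias.
    for key, value in raw.items():
        if key not in ALIASES:
            resolved[key] = value
    # Second pass: fill in aliased values only where the new name is unset.
    for key, value in raw.items():
        if key in ALIASES:
            resolved.setdefault(ALIASES[key], value)
    return resolved
```

With this sketch, an existing config that only sets `URL_BLACKLIST` keeps working, while a config that sets both names uses the `URL_DENYLIST` value.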

Config Example

# Only save media from TikTok, with a favicon and title
SAVE_ALLOWLIST = {"tiktok\\.com/": ["favicon", "title", "media"]}
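
As a rough illustration of how a per-URL allow/deny list like this could be applied, the sketch below parses the value above as JSON and filters a list of methods by regex match against the URL. It is not the implementation in `archivebox/extractors/__init__.py`: the `methods_for_url` helper, the `ALL_METHODS` list, the `SAVE_DENYLIST` name, and the choice of `re.search` for matching are all assumptions made for the example.

```python
import json
import re

# Hypothetical list of available methods, in the order they would run.
ALL_METHODS = ["favicon", "title", "media", "wget", "screenshot"]

# The config value from the example above, assumed to be parsed as JSON.
# The JSON escape "\\." decodes to the regex "\." (a literal dot).
SAVE_ALLOWLIST = json.loads(r'{"tiktok\\.com/": ["favicon", "title", "media"]}')
SAVE_DENYLIST = {}  # assumed counterpart: URL pattern -> methods to skip


def methods_for_url(url, allowlist, denylist, all_methods=ALL_METHODS):
    """Return the methods to run for `url` under the given allow/deny lists."""
    allowed = None  # None means no allowlist pattern matched: allow everything
    for pattern, methods in allowlist.items():
        if re.search(pattern, url):
            allowed = (allowed or set()) | set(methods)

    denied = set()
    for pattern, methods in denylist.items():
        if re.search(pattern, url):
            denied |= set(methods)

    return [
        m for m in all_methods
        if (allowed is None or m in allowed) and m not in denied
    ]
```

Under these assumptions, a TikTok URL is limited to `favicon`, `title`, and `media`, while a URL that matches no allowlist pattern still runs every method.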

Documentation

Glad to share some Wiki commits if you'd like to move forward with this PR. You can't PR a wiki, right?

Related issues

None found

Changes these areas

  • [ ] Bugfixes
  • [x] Feature behavior
  • [ ] Command line interface
  • [x] Configuration options
  • [ ] Internal architecture
  • [ ] Snapshot data layout on disk

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
