[PR #1360] [MERGED] Add _EXTRA_ARGS for various extractors #1388

Closed
opened 2026-03-01 14:49:35 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1360
Author: @benmuth
Created: 2/23/2024
Status: Merged
Merged: 3/14/2024
Merged by: @pirate

Base: dev ← Head: extra-args


📝 Commits (6)

  • 4e69d2c Add EXTRA_*_ARGS for wget, curl, and singlefile
  • ab8f395 Add YOUTUBEDL_EXTRA_ARGS
  • 4d9c5a7 Add CHROME_EXTRA_ARGS
  • d74ddd4 Flip dedupe precedence order
  • d8cf09c Remove unnecessary variable length args for dedupe
  • f4deb97 Add ARGS and EXTRA_ARGS for Mercury extractor

📊 Changes

10 files changed (+115 additions, -42 deletions)

📝 archivebox/config.py (+16 -1)
📝 archivebox/extractors/archive_org.py (+9 -2)
📝 archivebox/extractors/favicon.py (+13 -3)
📝 archivebox/extractors/headers.py (+9 -3)
📝 archivebox/extractors/media.py (+9 -2)
📝 archivebox/extractors/mercury.py (+10 -4)
📝 archivebox/extractors/singlefile.py (+6 -19)
📝 archivebox/extractors/title.py (+9 -2)
📝 archivebox/extractors/wget.py (+10 -3)
📝 archivebox/util.py (+24 -3)

📄 Description

Summary

This PR adds a way to configure wget, curl, singlefile, youtube-dl, and chrome without overriding the default options.

The main default options, extra options, and more specific options (like WGET_USER_AGENT) are all deduplicated. It's assumed that options set with more specificity should take precedence, so something like the --user-agent argument for wget will come from WGET_USER_AGENT instead of _ARGS or _EXTRA_ARGS, and options set in _EXTRA_ARGS take precedence over _ARGS.
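
The precedence rule described above can be sketched as follows. This is a minimal illustration, not necessarily the code merged in this PR: the function name `dedupe_args` and all argument values are hypothetical, and it assumes each option is written as a single `--flag=value` string. Options are concatenated from lowest to highest precedence, and a later option replaces an earlier one that has the same flag name.

```python
from typing import List


def dedupe_args(options: List[str]) -> List[str]:
    """Drop repeated options; a later option replaces an earlier one with the same flag."""
    deduped = {}
    for option in options:
        # Key on the flag name only, so "--timeout=60" and "--timeout=10" collide.
        flag = option.split("=", 1)[0]
        deduped[flag] = option
    return list(deduped.values())


# Hypothetical values standing in for the three configuration layers.
default_args = ["--no-verbose", "--timeout=60", "--user-agent=default-agent"]  # like _ARGS
extra_args = ["--timeout=10", "--tries=2"]                                     # like _EXTRA_ARGS
specific_args = ["--user-agent=ArchiveBox-example"]                            # like WGET_USER_AGENT

# Lowest precedence first, highest precedence last.
final_args = dedupe_args([*default_args, *extra_args, *specific_args])
print(final_args)
# ['--no-verbose', '--timeout=10', '--user-agent=ArchiveBox-example', '--tries=2']
```

Note that a sketch like this only collapses options that share the same text before `=`. An option passed as two separate items (`--timeout`, `10`) or under a short alias (`-T 10`) would not be recognized as a duplicate, which is the kind of more complex configuration that may need extra testing.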

This PR might need some more testing with more complex configurations. Hopefully it's simple enough that it won't break anything while still being useful, but I'm not a wizard with curl or wget, so there might be some possibilities I don't know about.

Related issues

#1025

Changes these areas

  • [ ] Bugfixes
  • [ ] Feature behavior
  • [ ] Command line interface
  • [x] Configuration options
  • [ ] Internal architecture
  • [ ] Snapshot data layout on disk

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
