[GH-ISSUE #503] Add ability to configure arbitrary additional CLI args for each dependency #3347

Closed
opened 2026-03-14 22:18:26 +03:00 by kerem · 7 comments

Originally created by @pirate on GitHub (Oct 10, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/503

I'm imagining a feature like this:

config.py:

YOUTUBEDL_ARGS = ['-f', 'bestvideo[filesize<500M][height<=?480]+bestaudio/best']
...
WGET_ARGS = ['--no-warc-compression']
...

extractors/youtubedl.py:

...
CMD = [
    YOUTUBEDL_BIN,
    ...
    *YOUTUBEDL_ARGS,
]
...

This would let people customize each extractor's behavior directly, without requiring a separate ArchiveBox-level config flag for every feature.
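
A minimal sketch of how that splicing could work, assuming a hypothetical `build_cmd` helper and a placeholder URL (neither is taken from the ArchiveBox codebase). The binary name and format selector are just the example values from above:

```python
import shlex
from typing import List, Sequence

# Example values only; in the proposal these would come from config.py.
YOUTUBEDL_BIN = 'youtube-dl'
YOUTUBEDL_ARGS = ['-f', 'bestvideo[filesize<500M][height<=?480]+bestaudio/best']


def build_cmd(binary: str, url: str, extra_args: Sequence[str] = ()) -> List[str]:
    """Hypothetical helper: splice user-configured args between the binary and the URL."""
    return [binary, *extra_args, url]


cmd = build_cmd(YOUTUBEDL_BIN, 'https://example.com/video', YOUTUBEDL_ARGS)

# shlex.join (Python 3.8+) renders the list as a copy-pasteable shell command.
print(shlex.join(cmd))
```

Keeping the args as a list rather than a single shell string means characters such as `<` and `[` in the format selector are passed to the binary untouched when the command is run without a shell.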


@cdvv7788 commented on GitHub (Oct 10, 2020):

How does this interact with the refactor to Django's settings? Can both be done at the same time, or should we get this one out first?


@pirate commented on GitHub (Oct 10, 2020):

Don't bother with the Django settings refactor for now. I have an idea of how to do it, and I'd like to do it myself later as part of a module reorganization. Get this one out first, as it's quite simple.


@cdvv7788 commented on GitHub (Oct 12, 2020):

@pirate we have the following cmd in the media extractor:

cmd = [
    YOUTUBEDL_BINARY,
    '--write-description',
    '--write-info-json',
    '--write-annotations',
    '--write-thumbnail',
    '--no-call-home',
    '--no-check-certificate',
    '--user-agent',
    '--all-subs',
    '--extract-audio',
    '--keep-video',
    '--ignore-errors',
    '--geo-bypass',
    '--audio-format', 'mp3',
    '--audio-quality', '320K',
    '--embed-thumbnail',
    '--add-metadata',
    *(['--yes-playlist'] if SAVE_PLAYLISTS else []),
    *([] if CHECK_SSL_VALIDITY else ['--no-check-certificate']),
    link.url,
]

The idea is to remove SAVE_PLAYLISTS and CHECK_SSL_VALIDITY and pass the new YOUTUBEDL_ARGS config instead, right? Should the default value of YOUTUBEDL_ARGS contain all of the flags above?
(Also, --no-check-certificate is always present in the list, so CHECK_SSL_VALIDITY has no effect in this case.)
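
To illustrate that last point, here is a reduced sketch of the list above, assuming a hypothetical `build_flags` function that keeps only the SSL-related entries. It is not the extractor's real code, just the same pattern in isolation:

```python
from typing import List


def build_flags(check_ssl_validity: bool) -> List[str]:
    """Hypothetical reduction of the cmd list to the entries that matter here."""
    return [
        '--no-call-home',
        '--no-check-certificate',  # hardcoded, so it is always passed
        # The conditional copy only ever adds a duplicate of the flag above.
        *([] if check_ssl_validity else ['--no-check-certificate']),
    ]


for setting in (True, False):
    flags = build_flags(setting)
    print(setting, '--no-check-certificate' in flags)
```

The flag is present for both settings, so toggling the option never changes whether certificate checking is disabled.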


@pirate commented on GitHub (Oct 13, 2020):

Leave CHECK_SSL_VALIDITY separate from the args system, because it applies to many extractors, but yeah, you can make SAVE_PLAYLISTS an arg.
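
A rough sketch of that split, assuming hypothetical helper names (`ssl_flags`, `youtubedl_cmd`, `wget_cmd`), hardcoded binary names, and a default where `--yes-playlist` moves into YOUTUBEDL_ARGS. The actual defaults and structure are up to the implementation:

```python
from typing import List

# Shared toggle: stays a single setting because several extractors honor it.
CHECK_SSL_VALIDITY = True

# Per-dependency args: playlist behavior becomes a plain arg (assumed default).
YOUTUBEDL_ARGS = ['--yes-playlist']
WGET_ARGS = ['--no-warc-compression']


def ssl_flags(check_ssl_validity: bool) -> List[str]:
    """Flags to append when certificate checking is turned off."""
    return [] if check_ssl_validity else ['--no-check-certificate']


def youtubedl_cmd(url: str, check_ssl_validity: bool = CHECK_SSL_VALIDITY) -> List[str]:
    # Hypothetical builder: user args first, then the shared SSL toggle.
    return ['youtube-dl', *YOUTUBEDL_ARGS, *ssl_flags(check_ssl_validity), url]


def wget_cmd(url: str, check_ssl_validity: bool = CHECK_SSL_VALIDITY) -> List[str]:
    # wget accepts the same --no-check-certificate flag, so the helper is reused.
    return ['wget', *WGET_ARGS, *ssl_flags(check_ssl_validity), url]


print(youtubedl_cmd('https://example.com/video'))
print(wget_cmd('https://example.com/page', check_ssl_validity=False))
```

With this layout, one setting controls SSL behavior across extractors, while anything specific to a single tool lives in that tool's args list.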


@cdvv7788 commented on GitHub (Oct 13, 2020):

Ok, I will start working on this one.


@cdvv7788 commented on GitHub (Oct 14, 2020):

@pirate I just pushed a commit (https://github.com/pirate/ArchiveBox/pull/506). Can you please check if that is what you had in mind? I will wait until you check it to move forward with other extractors.


@cdvv7788 commented on GitHub (Nov 13, 2020):

@pirate Closing this. Please reopen if there is something else we are missing for this.
