starred/ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-04-25 17:16:00 +03:00

[GH-ISSUE #824] Question: The correct format for URL_BLACKLIST property? #513

Closed

opened 2026-03-01 14:44:14 +03:00 by kerem · 3 comments

kerem commented

2026-03-01 14:44:14 +03:00

Owner

Originally created by @levitabris on GitHub (Aug 11, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/824

Hi All,

I have a question about the correct format for excluding certain domains/subdomains, e.g. to skip Google search result pages.

My regex looks like this:

^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$

I use

archivebox config --set URL_BLACKLIST = r'^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'

which gives me this in the archivebox.conf:

URL_BLACKLIST = r^http(s)?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$

The config did not work as expected. I also tried entering the value manually, for example

[GENERAL_CONFIG]
TIMEOUT = 20
URL_BLACKLIST = r'^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'

or

[GENERAL_CONFIG]
TIMEOUT = 20
URL_BLACKLIST = '^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'

or

[GENERAL_CONFIG]
TIMEOUT = 20
URL_BLACKLIST= ^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$

None of the above worked. Can anyone share a working archivebox.conf that shows the correct format for the regex?

I also noticed that the regex in the URL_BLACKLIST documentation (https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#url_blacklist) seems to have an unclosed single quote. Could someone please verify?

I'm using ArchiveBox v0.6.2.

kerem closed this issue

2026-03-01 14:44:14 +03:00

kerem commented

2026-03-01 14:44:14 +03:00

Author

Owner

@levitabris commented on GitHub (Aug 11, 2021):

You can play with the regex here: https://regex101.com/r/fdFOi8/1

kerem commented

2026-03-01 14:44:15 +03:00

Author

Owner

@pirate commented on GitHub (Aug 11, 2021):

Thanks for the detailed question and regex playground link!

The r'...' format only applies inside Python source code, where it denotes a raw string literal. Don't add the r prefix when you set the value via the config file or the CLI.
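
To illustrate the point about the r prefix, here is a minimal Python sketch. The sample URL and the variable names are made up for the example, and it uses re.search directly rather than any ArchiveBox code. It compares the pattern written as a Python raw string literal with the same characters kept verbatim, as they would be if they came from a text file.

```python
import re

# Hypothetical sample URL, used only for this illustration
URL = "https://www.google.com/search?q=archivebox"

# Inside Python source, r'...' is literal syntax: the r and the quotes
# are not part of the resulting string value.
python_literal = r'^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'

# Text read from a config file or CLI argument is kept as-is, so a typed
# r'...' wrapper would become part of the regex itself.
text_from_config = "r'" + python_literal + "'"

matches_literal = re.search(python_literal, URL) is not None
matches_config_text = re.search(text_from_config, URL) is not None

print(matches_literal, matches_config_text)  # True False
```

The second pattern can never match a URL here, because it requires a literal r' before the start-of-string anchor.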

The correct format to set via the CLI or config file is like so:

archivebox config --set URL_BLACKLIST='https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*'
# the quotes here are bash quotes to make sure any $ or other symbols in the regex are not parsed by bash
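
As a rough way to see what the shell hands to archivebox, Python's shlex module can approximate POSIX-style word splitting. This is only an approximation of bash (it does no variable expansion), and the variable names are made up for the example. It compares the recommended command with the one from the question.

```python
import shlex

# Recommended form: no spaces around "=", regex wrapped in single quotes
recommended = r"archivebox config --set URL_BLACKLIST='https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*'"

# Form from the question: spaces around "=" and a Python-style r prefix
original = r"archivebox config --set URL_BLACKLIST = r'^https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*$'"

recommended_args = shlex.split(recommended)
original_args = shlex.split(original)

# One argument: KEY=value, with the quotes removed and backslashes kept
print(recommended_args[3:])

# Three arguments: the key, a bare "=", and a value that starts with "r^"
print(original_args[3:])
```

The second result is consistent with the r^... value the question reports seeing in archivebox.conf: the shell strips the quotes but keeps the leading r.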

ArchiveBox.conf

[GENERAL_CONFIG]
URL_BLACKLIST = https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*
# no quotes or extra escaping needed here
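
The "no quotes" advice can be checked with a small sketch, assuming the file is read in the style of Python's configparser (an assumption about ArchiveBox internals that this thread does not confirm). The helper read_blacklist is hypothetical and exists only for this example.

```python
from configparser import ConfigParser


def read_blacklist(conf_text: str) -> str:
    """Parse INI text and return the raw URL_BLACKLIST value (hypothetical helper)."""
    parser = ConfigParser()
    parser.read_string(conf_text)
    return parser["GENERAL_CONFIG"]["URL_BLACKLIST"]


unquoted = read_blacklist(r"""
[GENERAL_CONFIG]
URL_BLACKLIST = https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*
""")

quoted = read_blacklist(r"""
[GENERAL_CONFIG]
URL_BLACKLIST = 'https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*'
""")

# The unquoted value is the regex itself; the quoted one keeps its quote characters
print(unquoted[:6])
print(quoted[:7])
```

Under that assumption, quotes written in the file are not stripped, so they end up inside the regex and stop it from matching.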

(In the screenshot attached to the original comment, both of the added URLs, www.google.com/example/... and www.youtube.com/example/..., are ignored.)
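
The pattern can also be checked offline with Python's re module. This is a sketch under assumptions: the helper is_blacklisted and the sample URLs are hypothetical, and the use of re.search with re.IGNORECASE is a guess at how such a pattern would be applied, not a statement about ArchiveBox's code.

```python
import re

URL_BLACKLIST = r'https?:\/\/(www\.google\..+|\w+\.youtube\.com|\w+\.etools\.ch).*'

# Assumption: the pattern is searched for anywhere in the URL, ignoring case
BLACKLIST_PTN = re.compile(URL_BLACKLIST, re.IGNORECASE)


def is_blacklisted(url: str) -> bool:
    """Return True if the URL matches the blacklist pattern (hypothetical helper)."""
    return BLACKLIST_PTN.search(url) is not None


sample_urls = [
    "https://www.google.com/example/",
    "https://www.youtube.com/example/",
    "https://youtube.com/example/",   # no subdomain, so \w+\. has nothing to match
    "https://example.com/page",
]

results = {url: is_blacklisted(url) for url in sample_urls}
for url, skipped in results.items():
    print(f"{'skip' if skipped else 'keep'}  {url}")
```

Note that the \w+\.youtube\.com alternative requires a subdomain, so a bare youtube.com URL would not be excluded by this pattern.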

If you run into any trouble, please test with the latest ArchiveBox version on dev, which also has some general regex improvements: https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch (don't worry, this dev version is safe to install, and it can be upgraded seamlessly to the next full release when it becomes available)

kerem commented

2026-03-01 14:44:15 +03:00

Author

Owner

@levitabris commented on GitHub (Aug 12, 2021):

Problem solved!

I 1) installed the v0.6.3 dev version, 2) restarted the archiving process (i.e. ran archivebox add < path/to/browser_history.json instead of archivebox update --resume=XXXXXX), and 3) used the regex string without quotes, as in @pirate's answer. Now it works perfectly.

I hope this helps others with the same issue.

Thank you @pirate!

kerem referenced this issue

2026-03-01 14:48:51 +03:00

[PR #520] [CLOSED] Split Snapshot into Link & Snapshot + migrate #1206

kerem referenced this issue

2026-03-01 17:54:11 +03:00

[GH-ISSUE #513] Move ArchiveResults into the SQL db to avoid checking the filesystem for every Snapshot output status #1843

kerem referenced this issue

2026-03-01 18:00:32 +03:00

[PR #520] [CLOSED] Split Snapshot into Link & Snapshot + migrate #2714

kerem referenced this issue

2026-03-14 22:19:56 +03:00

[GH-ISSUE #513] Move ArchiveResults into the SQL db to avoid checking the filesystem for every Snapshot output status #3353