mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #824] Question: The correct format for URL_BLACKLIST property? #513
Originally created by @levitabris on GitHub (Aug 11, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/824
Hi All,
I have questions regarding using the correct format to exclude certain domains/subdomains, e.g. to skip google search result pages.
My regex looks like this:
I use
which gives me this in the archivebox.conf:
The config did not work as expected. I also tried manual input such as
or
or
None of the above worked. Can anyone share a working archivebox.conf to help me set the correct format for the regex?
I also noticed that the documentation for URL_BLACKLIST seems to contain a regex with an unclosed single quote. Could someone please verify?
I'm using ArchiveBox v0.6.2
@levitabris commented on GitHub (Aug 11, 2021):
You can play with the regex here
@pirate commented on GitHub (Aug 11, 2021):
Thanks for the detailed question and regex playground link!
The r'...' format only applies inside Python, where it marks a raw string literal. Don't add the r in front if you're setting the value via the config file or CLI.
The correct format to set via the CLI or config file is like so:
ArchiveBox.conf
(note that it ignores both the www.google.com/example/... and www.youtube.com/example/... URLs added) ^
If you run into any trouble, please test with the latest ArchiveBox version on dev, which also has some general regex improvements: https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch (don't worry, this dev version is safe to install; it can be upgraded seamlessly to the next full release when it becomes available)
@levitabris commented on GitHub (Aug 12, 2021):
Problem solved!
I 1) installed the v0.6.3 dev version, 2) restarted the archiving process (i.e. archivebox add < path/to/browser_history.json instead of archivebox update --resume=XXXXXX), and 3) used the regex string without quotes, as in @pirate's answer. Now it works perfectly.
I hope this helps others with the same issue.
Thank you @pirate !
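For anyone landing here later, the point about the r'...' prefix can be checked outside ArchiveBox with a minimal Python sketch. The pattern below is hypothetical (the original patterns from this thread are not reproduced here), and the is_blacklisted helper only assumes that the configured string is compiled as-is and searched against each URL case-insensitively; it is not ArchiveBox's actual code.

```python
import re

# Hypothetical pattern (not taken from the thread): match any google.com
# or youtube.com host, with or without a subdomain.
BARE_PATTERN = r"://(.*\.)?(google|youtube)\.com"

# What the value becomes if the Python-style r'...' wrapper is typed
# literally into a config file or CLI argument: the r and the quotes
# are now part of the regex itself.
WRAPPED_PATTERN = "r'" + BARE_PATTERN + "'"


def is_blacklisted(url: str, pattern: str) -> bool:
    """Assumed behaviour: skip a URL when the pattern matches anywhere in it."""
    return re.search(pattern, url, re.IGNORECASE) is not None


urls = [
    "https://www.google.com/search?q=archivebox",
    "https://www.youtube.com/watch?v=example",
    "https://example.com/article",
]

for url in urls:
    # Columns: URL, result with the bare pattern, result with the wrapped one.
    print(url, is_blacklisted(url, BARE_PATTERN), is_blacklisted(url, WRAPPED_PATTERN))
```

With the bare pattern the two google.com and youtube.com URLs are flagged and example.com is not. With the wrapped pattern nothing is flagged, because no URL contains the literal text r' in front of ://.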