mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #211] Architecture: Block ads and trackers during archiving #144
Originally created by @brandonocasey on GitHub (Apr 5, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/211
Type
What is the problem that your feature request solves
Archive pages without ads or trackers
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
A flag that defaults to off for archiving without ads/trackers.
What hacks or alternative solutions have you tried to solve the problem?
None
How badly do you want this new feature?
@wintdkyo commented on GitHub (Apr 11, 2019):
One potential solution outside of ArchiveBox would be to use some type of network-wide DNS ad blocking, such as a Pi-hole.
@knowncolor commented on GitHub (Apr 30, 2019):
I haven't tried this, but you might be able to achieve something similar by installing certain extensions (uBlock Origin, Decentraleyes, etc.).
@pirate commented on GitHub (Apr 30, 2019):
Currently, adblocking during PDF/Screenshot/DOM archiving is already possible by using a CHROME_USER_DATA_DIR with a profile that has an adblocker installed (and setting CHROME_HEADLESS=False). In the future it will be doable in a few more ways though:
At the network level:
chromium --proxy-server=http://localhost:8080 --ignore-certificate-errors --disable-web-security https://some.site.withads.com
At the application level:
As we get closer to integrating pyppeteer (#177), pywb (#130), and custom user scripts (#51) I'll check back and update this issue with progress.
@onemenzel commented on GitHub (Aug 3, 2021):
Has anyone tried installing extensions into the chrome profile? I'm running on docker compose and during my research journey on how to install extensions into a headless chrome from the shell, I found this chromium thread saying that there is no extension support in headless: https://bugs.chromium.org/p/chromium/issues/detail?id=706008
@pirate commented on GitHub (Aug 3, 2021):
You're correct that extensions only work in non-headless mode, but at least playwright makes it easy https://playwright.dev/docs/chrome-extensions. One way around that might be to use a fake display and attach chrome to it. Then we could even make an http endpoint where you can watch that display to see a realtime view of what the worker is seeing while archiving.
You can also load extension sources as ordinary JS scripts in context of the page in playwright, instead of using the real extensions system. You just need the original source of the extension, which could be ripped out of the published extensions in minified form, or copied from github if they're open-source.
@onemenzel commented on GitHub (Aug 4, 2021):
I think those would be interesting features for future versions, if some configuration method is exposed to users. For now, I've resigned myself to using dnsforge.de as the DNS server inside the container. They do not log DNS queries and they block ad domains. Not quite as good as uBlock Origin, but it'll do for now.
For everyone who's searching for a method to do this in docker-compose: it's as easy as adding the line
dns: your.dns.servers.ip
to the service in the docker-compose.yml file.
Thanks for your time and that nice piece of software!
@akhilleusuggo commented on GitHub (Feb 17, 2022):
I'm using pihole as my main DNS (on an rpi), and it works perfectly. No ads or trackers during archiving.
It could also be run as a docker container, bridged onto the same network, with the DNS pointed to it.
@pirate commented on GitHub (Mar 16, 2022):
Actually pihole is a great idea, we should add a commented out example container in docker-compose.yml that shows how to set up pihole with the archivebox container's DNS pointed to the pihole container. That would be a very practical solution that doesn't involve a lot of dev work to add scripting during archiving.
I've just added it to the docker-compose.yml file: github.com/ArchiveBox/ArchiveBox@5cd2b328c0
@pirate commented on GitHub (Mar 17, 2022):
Port 80 is not used by the principal UI by default.
@akhilleusuggo commented on GitHub (Mar 19, 2022):
Not sure how a docker-compose file can break Wi-Fi... I'm not gonna ask why.
But what you have to do is add an entry in the compose file (in the archivebox section), with the DNS pointing at pihole.
In my case I'm running them separately, but I'll test them together and leave an example.
@pirate commented on GitHub (Mar 23, 2022):
I've already added an example pihole docker-compose.yml config here you can use: https://github.com/ArchiveBox/ArchiveBox/blob/dev/docker-compose.yml#L49
@akhilleusuggo commented on GitHub (Mar 24, 2022):
@pirate Yes I've seen that, but maybe I'm misunderstanding something.
How does archivebox know that the Pihole local IP is a DNS server? What I'm trying to say is that it requires extra environment variables to point to the Pihole IP as the DNS server. Under the archivebox section, something like this should be added;
And Pihole should be configured with a static IP; the same one under the dns section.
Edit;
Pihole as a bridge should be the way to go, especially with archivebox_schedules and when running other containers on the same machine that require Pihole.
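A rough sketch of the setup described above, in case it helps: the archivebox service gets a dns entry, and the pihole service gets a matching static IP on a shared network. The network name, subnet, IP address and volume path are assumptions chosen for illustration and are not taken from the example in the repo.

```yaml
# Sketch only: merge into an existing docker-compose.yml rather than using it as-is.
services:
  archivebox:
    image: archivebox/archivebox:latest
    dns:
      - 172.20.0.53            # assumed address; must match the Pi-hole static IP below
    networks:
      - dns

  pihole:
    image: pihole/pihole:latest
    volumes:
      - ./etc-pihole:/etc/pihole   # assumed host path for Pi-hole's config
    networks:
      dns:
        ipv4_address: 172.20.0.53  # static IP so the dns entry above stays valid

networks:
  dns:
    ipam:
      config:
        - subnet: 172.20.0.0/24    # assumed subnet; pick one that is free on your host
```

The static address matters because a service's dns entry takes IP addresses, so the Pi-hole container cannot be referenced there by its service name.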
@slikas commented on GitHub (May 22, 2022):
Linking chrome profile with extension ublock-origin doesn't work. (I need to remove cookie consent panels and HTML elements, AdGuardHome doesn't block cookie panels and leaves big blank ad spaces)
I copied the profile dir into dockerized archivebox /data
Then set in ArchiveBox.conf:
CHROME_USER_DATA_DIR=/data/chrome
CHROME_HEADLESS=False
Then docker run it, no issues reported
But as said, no adblock applied to saved contents
@diego898 commented on GitHub (Feb 7, 2023):
Just pinging this issue again!
@pirate commented on GitHub (Jun 13, 2023):
Quick update: The recommended solution is still to use pihole in front of ArchiveBox for now. I can't promise browser-extension-based ArchiveBox adblocking will be implemented anytime soon.
@zero77 commented on GitHub (Sep 24, 2024):
@pirate
Is there a way to use a custom DNS? This may be a quicker and easier option than running pihole or adguard home.
These are some examples of DNS servers that block ads and trackers:
https://avoidthehack.com/best-dns-privacy
@pirate commented on GitHub (Sep 25, 2024):
Using external dns is very easy @zero77, no need for pihole, you can set the lines in the docker-compose.yml to point to any server you want:
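As a minimal sketch of what that could look like (the address below is a documentation placeholder, not a real resolver, and would need to be replaced with the IP of whichever ad-blocking DNS provider you choose):

```yaml
# Sketch only: add the dns key to the existing archivebox service.
services:
  archivebox:
    image: archivebox/archivebox:latest
    dns:
      - 192.0.2.53   # placeholder; replace with your chosen DNS server's IP
```

With this in place no pihole container is needed, since lookups from the archivebox container go to the external server.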
@turian commented on GitHub (Oct 23, 2024):
Just curious about the progress on this. What is the current best practice, and will something easier to set up / more native be added to archivebox?
@pirate commented on GitHub (Oct 23, 2024):
Current best practice that I run myself is:
docker-compose.yml (this gets the last 1%)
URL_DENYLIST to prevent archiving of certain URLs/patterns (this covers other common minor nits/annoyance cases where you might want to avoid archiving some unnecessary content)
This is as "native" as it gets right now. I have plans to make the chrome profile setup process easier (e.g. ability to specify extensions to install automatically), and I think adblocking is just one of the things that will become easier as a result.
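For the URL_DENYLIST part, a hedged sketch of how it might be set through docker-compose, assuming the option takes a regular expression. The two domains in the pattern are illustrative only, and exactly how ArchiveBox matches the pattern against URLs is not covered in this thread.

```yaml
# Sketch only: the pattern below is a hypothetical example, not a recommended list.
services:
  archivebox:
    image: archivebox/archivebox:latest
    environment:
      # Single quotes keep the backslashes literal in YAML.
      URL_DENYLIST: '(doubleclick\.net|googlesyndication\.com)'
```

Going by the comment above, this controls which URLs get archived at all, so it complements DNS-level blocking rather than replacing it.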