[GH-ISSUE #211] Architecture: Block ads and trackers during archiving #144

Open
opened 2026-03-01 14:41:00 +03:00 by kerem · 19 comments

Originally created by @brandonocasey on GitHub (Apr 5, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/211

Type

  • [ ] General Question or Discussion
  • [x] Propose a brand new feature
  • [ ] Request modification of existing behavior or design

What is the problem that your feature request solves

Archive pages without ads or trackers.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

A flag, off by default, for archiving without ads/trackers.

What hacks or alternative solutions have you tried to solve the problem?

None

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [x] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

  • [x] I'm willing to contribute to development / fixing this issue
  • [x] I like ArchiveBox so far / would recommend it to a friend

@wintdkyo commented on GitHub (Apr 11, 2019):

One potential solution outside of ArchiveBox would be some type of network-wide DNS ad blocking, such as Pi-hole.


@knowncolor commented on GitHub (Apr 30, 2019):

I haven't tried this, but you might be able to achieve something similar by installing certain extensions, such as uBlock Origin or Decentraleyes.


@pirate commented on GitHub (Apr 30, 2019):

Currently, adblocking during PDF/Screenshot/DOM archiving is already possible by using a CHROME_USER_DATA_DIR with a profile that has an adblocker installed (and setting CHROME_HEADLESS=False). In the future it will be doable in a few more ways though:

At the network level:

  • Using a proxy for all archiving that removes ads e.g. chromium --proxy-server=http://localhost:8080 --ignore-certificate-errors --disable-web-security https://some.site.withads.com
  • Using a DNS resolver that blocks ads (e.g. Pi-hole)

At the application level:

  • Using a Chrome extension like uBlock Origin and/or Ghostery (this is already possible)
  • Using user scripts loaded during archiving to manually remove ads from specific sites

As we get closer to integrating pyppeteer (#177), pywb (#130), and custom user scripts (#51) I'll check back and update this issue with progress.


@onemenzel commented on GitHub (Aug 3, 2021):

Has anyone tried installing extensions into the chrome profile? I'm running on docker compose and during my research journey on how to install extensions into a headless chrome from the shell, I found this chromium thread saying that there is no extension support in headless: https://bugs.chromium.org/p/chromium/issues/detail?id=706008


@pirate commented on GitHub (Aug 3, 2021):

You're correct that extensions only work in non-headless mode, but at least playwright makes it easy https://playwright.dev/docs/chrome-extensions. One way around that might be to use a fake display and attach chrome to it. Then we could even make an http endpoint where you can watch that display to see a realtime view of what the worker is seeing while archiving.

You can also load extension sources as ordinary JS scripts in context of the page in playwright, instead of using the real extensions system. You just need the original source of the extension, which could be ripped out of the published extensions in minified form, or copied from github if they're open-source.


@onemenzel commented on GitHub (Aug 4, 2021):

> You're correct that extensions only work in non-headless mode, but at least playwright makes it easy https://playwright.dev/docs/chrome-extensions. One way around that might be to use a fake display and attach chrome to it. Then we could even make an http endpoint where you can watch that display to see a realtime view of what the worker is seeing while archiving.
>
> You can also load extension sources as ordinary JS scripts in context of the page in playwright, instead of using the real extensions system. You just need the original source of the extension, which could be ripped out of the published extensions in minified form, or copied from github if they're open-source.

I think those would be interesting features for future versions if some configuration method is exposed to users. For now, I've resigned myself to using dnsforge.de as the DNS server inside the container. They do not log DNS queries and they block ad domains. It's not quite as good as uBlock Origin, but it'll do for now.

For everyone who's searching for a method to do this in docker-compose: it's as easy as adding the line dns: your.dns.servers.ip to the service in the docker-compose.yml file.

Thanks for your time and that nice piece of software!


@akhilleusuggo commented on GitHub (Feb 17, 2022):

I'm using Pi-hole as my main DNS (on a Raspberry Pi), and it works perfectly. No ads or trackers during archiving.
It could also be run in Docker on the same bridged network, with ArchiveBox pointed at it.


@pirate commented on GitHub (Mar 16, 2022):

Actually Pi-hole is a great idea. We should add a commented-out example container in docker-compose.yml that shows how to set up Pi-hole with the archivebox container's DNS pointed at the Pi-hole container. That would be a very practical solution that doesn't involve a lot of dev work to add scripting during archiving.

I've just added it to the docker-compose.yml file: https://github.com/ArchiveBox/ArchiveBox/commit/5cd2b328c006db9e39b919a8d7ab01aaecee2fe9


@pirate commented on GitHub (Mar 17, 2022):

Port 80 is not used by the principal UI by default.


@akhilleusuggo commented on GitHub (Mar 19, 2022):

Not sure how a docker-compose file can break Wi-Fi... I'm not going to ask why.

But what you have to do is add an entry in the compose file (in the archivebox section) with DNS pointing at Pi-hole.

In my case I'm running them separately, but I'll test them and leave an example.


@pirate commented on GitHub (Mar 23, 2022):

I've already added an example pihole docker-compose.yml config here you can use: https://github.com/ArchiveBox/ArchiveBox/blob/dev/docker-compose.yml#L49


@akhilleusuggo commented on GitHub (Mar 24, 2022):

@pirate Yes, I've seen that, but maybe I'm misunderstanding something.

How does archivebox know that the Pi-hole local IP is a DNS server? What I'm trying to say is that it requires extra configuration to point to the Pi-hole IP as the DNS server. Under the archivebox section, something like this should be added:

dns:
      - 172.21.0.2

And Pi-hole should be configured with a static IP, the same one listed under the dns section.

Edit:

Running Pi-hole on a bridge network should be the way to go, especially with archivebox_schedules and when running other containers on the same machine that require Pi-hole.
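
A minimal sketch of the setup described above, for illustration only. The `dns:` entry and the 172.21.0.2 address come from this comment; the network name `dns`, the subnet, and the image names are assumptions, not values copied from ArchiveBox's own docker-compose.yml.

```yaml
# Sketch only: network name, subnet, and images are assumptions.
services:
  archivebox:
    image: archivebox/archivebox
    dns:
      - 172.21.0.2               # must match the Pi-hole container's static address
    networks:
      - dns

  pihole:
    image: pihole/pihole
    networks:
      dns:
        ipv4_address: 172.21.0.2 # fixed address so the dns entry above stays valid

networks:
  dns:
    ipam:
      config:
        - subnet: 172.21.0.0/24  # a static ipv4_address needs an explicit subnet
```

The static address matters because a container's IP can otherwise change between restarts, which would leave the `dns:` entry pointing at nothing.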


@slikas commented on GitHub (May 22, 2022):

Linking a Chrome profile with the uBlock Origin extension doesn't work. (I need to remove cookie consent panels and HTML elements; AdGuard Home doesn't block cookie panels and leaves big blank ad spaces.)

I copied the profile dir into the dockerized ArchiveBox /data directory.
Then I set in ArchiveBox.conf:
CHROME_USER_DATA_DIR=/data/chrome
CHROME_HEADLESS=False
Then I ran it with docker run; no issues were reported.

But as I said, no adblocking was applied to the saved contents.


@diego898 commented on GitHub (Feb 7, 2023):

Just pinging this issue again!


@pirate commented on GitHub (Jun 13, 2023):

Quick update: the recommended solution is still to use Pi-hole in front of ArchiveBox for now. I can't promise browser-extension-based ArchiveBox adblocking will be implemented anytime soon.


@zero77 commented on GitHub (Sep 24, 2024):

@pirate
Is there a way to use a custom DNS server? This may be a quicker and easier option than running Pi-hole or AdGuard Home.
These are some examples of DNS servers that block ads and trackers:
https://avoidthehack.com/best-dns-privacy


@pirate commented on GitHub (Sep 25, 2024):

Using external DNS is very easy @zero77, no need for Pi-hole. You can set these lines in docker-compose.yml to point to any server you want:

services:
  archivebox:
    ...
    dns:
      - 1.1.1.1
      - ... etc any ip can go here ...

@turian commented on GitHub (Oct 23, 2024):

Just curious about the progress on this. What is the current best practice, and will something easier to set up / more native be added to ArchiveBox?


@pirate commented on GitHub (Oct 23, 2024):

Current best practice that I run myself is:

  • install uBlock Origin in a Chrome profile + set up archivebox to use that profile (https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile) (this prevents 99% of ads)
  • set up pihole as indicated in docker-compose.yml (this gets the last 1%)
  • configure URL_DENYLIST to prevent archiving of certain URLs/patterns (this covers other common minor nits/annoyance cases where you might want to avoid archiving some unnecessary content)

This is as "native" as it gets right now. I have plans to make the chrome profile setup process easier (e.g. ability to specify extensions to install automatically), and I think adblocking is just one of the things that will become easier as a result.
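
For the URL_DENYLIST item, a rough sketch of what the setting could look like in docker-compose.yml. It assumes URL_DENYLIST is read from the container environment and treated as a regular expression; the two hostnames in the pattern are illustrative assumptions, not a recommended list.

```yaml
# Sketch only: the pattern below is a hypothetical example.
services:
  archivebox:
    environment:
      # URLs matching this regular expression are skipped instead of archived.
      # Single quotes keep the backslashes literal in YAML.
      - 'URL_DENYLIST=(doubleclick\.net|google-analytics\.com)'
```

This appears to control which URLs get archived at all rather than which requests a page makes while it is being rendered, so it complements the uBlock Origin and Pi-hole steps instead of replacing them.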
