[GH-ISSUE #249] Architecture: Support running all archive methods through a SOCKS/HTTP proxy #1683

Open
opened 2026-03-01 17:52:50 +03:00 by kerem · 16 comments
Owner

Originally created by @ghost on GitHub (Jul 1, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/249

It would be nice to have a `PROXY=socks5h://1.2.3.4:1080` style option to route all traffic through a proxy while archiving.


@pirate commented on GitHub (Jul 5, 2019):

I agree this would be nice eventually, but it's tricky to implement consistently across all the archive methods. For now I recommend running it inside docker and doing the proxying for the entire container.

These docs might help, although I can't confirm this type of setup works as I haven't tried it myself:

- https://docs.docker.com/network/proxy/#configure-the-docker-client
- https://medium.com/datadriveninvestor/how-to-transparently-use-a-proxy-with-any-application-docker-using-iptables-and-redsocks-b8301ddc4e1e

Expect 1yr+ before I get around to implementing this natively; for now I recommend the Docker approach if this is an absolute requirement.
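
A minimal sketch of what whole-container proxying through environment variables could look like in `docker-compose.yml`. The image name, the `proxy.example:3128` address and the `NO_PROXY` list are placeholders and assumptions, not a tested setup, and each archive method's underlying tool decides for itself whether it honours these variables:

```yaml
services:
    archivebox:
        image: archivebox/archivebox        # assumed image name
        environment:
            # hypothetical proxy address; replace with your own
            - HTTP_PROXY=http://proxy.example:3128
            - HTTPS_PROXY=http://proxy.example:3128
            # some tools (e.g. curl for http_proxy) only read the lowercase form
            - http_proxy=http://proxy.example:3128
            - https_proxy=http://proxy.example:3128
            # hosts that should bypass the proxy
            - NO_PROXY=localhost,127.0.0.1
```

Values are left unquoted on purpose: in the list form of `environment`, quotes become part of the value.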

@ghost commented on GitHub (Jul 5, 2019):

Ok, thank you! Appreciate you taking a look.

I will probably look into this if I give ArchiveBox another go.


@issenn commented on GitHub (Apr 22, 2020):

doesn't work with environment inside docker

```
"Unable to detect page title"
Failed:TimeoutExpired Command 'curl' timed out after 60 seconds
Failed:TimeoutExpired Command 'wget' timed out after 60 seconds
Failed:TimeoutExpired Command 'google-chrome-unstable' timed out after 60 seconds
Failed:TimeoutExpired Command 'google-chrome-unstable' timed out after 60 seconds
Failed: Failed to find "content-location" URL header in Archive.org response.
```

@pirate commented on GitHub (Apr 29, 2020):

Can you post the `docker-compose.yml` file you used to test this, @issenn? I have gotten Docker working through WireGuard tunnels with no issues in the past, so I'm sure there is a way to do this.

@pirate commented on GitHub (Jul 28, 2020):

FYI all: you can now run ArchiveBox through a VPN (WireGuard) without too much difficulty: https://github.com/pirate/ArchiveBox/blob/master/docker-compose.yml

I'm still planning on adding HTTP proxy support with an `HTTP_PROXY` config var so that we can pipe archiving through `pywb`'s `wayback --proxy` WARC-recording proxy, but that won't be released until a future version.

@kai11 commented on GitHub (Oct 25, 2020):

I had this problem in my setup and wrote up notes on using a SOCKS5 proxy:
https://gist.github.com/kai11/e91c6fad990c6490b2a4fe8c4defebfe

@marcohald commented on GitHub (May 25, 2022):

@pirate Are you still planning the `HTTP_PROXY` implementation?
It would be very useful in an enterprise environment with an HTTP proxy that requires authentication.

@allen7u commented on GitHub (Oct 10, 2023):

> I agree this would be nice eventually, but it's tricky to implement consistently across all the archive methods. For now I recommend running it inside docker and doing the proxying for the entire container.
>
> These docs might help, although I cant confirm this type of setup works as I haven't tried it myself:
>
> * https://docs.docker.com/network/proxy/#configure-the-docker-client
> * https://medium.com/datadriveninvestor/how-to-transparently-use-a-proxy-with-any-application-docker-using-iptables-and-redsocks-b8301ddc4e1e
>
> Expect 1yr+ before I get around to implementing this natively, for now I recommend the docker approach if this is an absolute requirement.

Any news at this moment?

@pirate commented on GitHub (Oct 11, 2023):

No changes planned to add SOCKS support into ArchiveBox anytime soon because the Docker solutions work well enough for now.

My favorite way to do this is using tailscale, where you can route all Docker traffic through any desired exit node with one line: https://tailscale.com/kb/1103/exit-nodes/#step-4-use-the-exit-node

You can use a docker-compose sidecar container that shares the networking stack with the ArchiveBox container similar to this wireguard example:
https://github.com/pirate/wireguard-docs?tab=readme-ov-file#example-client-container-setup

```yaml
services:
    archivebox:
        ...
        network_mode: 'service:tailscale'
        depends_on:
          - tailscale

    tailscale:
        image: 'tailscale/tailscale:latest'
        command: tailscaled
        environment:
          # values are unquoted: in list form, quotes would become part of the value
          - TS_AUTHKEY=tskey-auth-xxxxxxxxx
          - TS_EXTRA_ARGS=--exit-node=<exit-node-ip>
```

For more info see: https://tailscale.com/kb/1282/docker/

@huyz commented on GitHub (Jul 30, 2024):

@pirate How did you get this working with Tailscale?

I'm running into this [Redditor](https://www.reddit.com/r/Tailscale/comments/sy6tbj/comment/hy629lx/)'s issue: if you assign the `network_mode` to the tailscale container, then all traffic goes through Tailscale. Consequently, my reverse proxy (Caddy) can no longer reach the `archivebox` container and I can't bring up the ArchiveBox web UI.

@pirate commented on GitHub (Jul 30, 2024):

If you share your docker-compose with Caddy I can show you. You just add `network_mode` onto the caddy container too, so they all share one network stack. Incoming connections can still be handled by Caddy even when all outbound traffic goes through Tailscale.

You can also do it with iptables manually if you are running caddy outside docker.
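
A rough, untested sketch of the shared-network-stack layout described above, for the case where Caddy lives in the same compose project. The image names, published ports and the `localhost:8000` upstream are assumptions for illustration:

```yaml
services:
    tailscale:
        image: 'tailscale/tailscale:latest'
        # ports have to be published on the container that owns the network
        # stack; services using network_mode: 'service:...' cannot publish their own
        ports:
            - '80:80'
            - '443:443'

    archivebox:
        image: archivebox/archivebox        # assumed image name
        network_mode: 'service:tailscale'
        depends_on:
            - tailscale

    caddy:
        image: caddy:2                      # assumed image tag
        network_mode: 'service:tailscale'
        depends_on:
            - tailscale
```

Because all three containers share one network namespace, Caddy would reach ArchiveBox at `localhost:8000` (assuming ArchiveBox listens on its usual port 8000) rather than by service name.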


@huyz commented on GitHub (Aug 2, 2024):

@pirate I'm not running Caddy outside of Docker. But I am running Caddy as part of a separate docker-compose project because it needs to reverse-proxy many services besides ArchiveBox. So it sounds like I need to use iptables.
Do you happen to have sample iptables rules I can use?

Thanks so much for your help.


@pirate commented on GitHub (Aug 2, 2024):

In that case I recommend a named bridge network that they both attach to; it will be simpler than iptables. Run `docker network create archivebox` to create a named network, then attach both the archivebox container and the caddy container to that network. ChatGPT should be able to help generate the YAML to attach the containers. If that doesn't work, I can maybe help write it for you.
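
A minimal sketch of that attachment, assuming the network was created beforehand with `docker network create archivebox` and that the two services are named `archivebox` and `caddy` (both names are placeholders). The same top-level `networks` entry goes into each project's compose file:

```yaml
# Put this in BOTH compose files (the ArchiveBox project and the Caddy project)
networks:
    archivebox:
        external: true    # use the pre-created network instead of creating a new one

services:
    # In the Caddy project this service would be `caddy` instead
    archivebox:
        networks:
            - archivebox
```

If the archivebox service uses `network_mode: 'service:tailscale'`, the `networks` entry would go on the tailscale service instead, since `networks` and `network_mode` cannot be combined on one service.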


@huyz commented on GitHub (Aug 2, 2024):

That's actually what I have: a named bridge network.
But it seems that tailscale's iptables rules just mess everything up and I'm not too familiar with how to allow traffic for that bridged network:

```
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ts-input   all  --  anywhere             anywhere

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination
ts-forward  all  --  anywhere             anywhere

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain ts-forward (1 references)
target     prot opt source               destination
MARK       all  --  anywhere             anywhere             MARK xset 0x40000/0xff0000
ACCEPT     all  --  anywhere             anywhere             mark match 0x40000/0xff0000
DROP       all  --  100.64.0.0/10        anywhere
ACCEPT     all  --  anywhere             anywhere

Chain ts-input (1 references)
target     prot opt source               destination
ACCEPT     all  --  100.100.1.108        anywhere
RETURN     all  --  100.115.92.0/23      anywhere
DROP       all  --  100.64.0.0/10        anywhere
ACCEPT     all  --  anywhere             anywhere
ACCEPT     udp  --  anywhere             anywhere             udp dpt:34106
```

@huyz commented on GitHub (Aug 4, 2024):

@pirate I figured it out. Given that I was using Tailscale in order to pass traffic through a Tailscale exit node, what I was missing was `--exit-node-allow-lan-access`:

```yaml
            - TS_EXTRA_ARGS=--exit-node=TS_EXIT_IP --exit-node-allow-lan-access
```

For everyone, here is my full service definition (where I use Jinja2 as templater; hence the `{{ … }}`):

```yaml
    tailscale:
        image: 'tailscale/tailscale:latest'
        container_name: archivebox-tailscale
        # Should set hostname so that the reverse-proxy (e.g. Caddy) can find the archivebox container
        hostname: archivebox
        restart: unless-stopped
        cap_add:
            - net_admin
            - sys_module
        networks:
            # For ad-blocking during archiving, uncomment this section and pihole service section below
            - dns
            - caddy-archivebox
        dns:
            - 192.168.53.53
        volumes:
            - /srv/archivebox/tailscale/state:/var/lib/tailscale
        environment:
            - TS_STATE_DIR=/var/lib/tailscale
            - TS_AUTHKEY={{ TS_AUTH_KEY_ARCHIVEBOX }}
            - TS_USERSPACE=false
            - TS_HOSTNAME=docker-archivebox
            - TS_EXTRA_ARGS=--exit-node={{ TS_EXIT_IP }} --exit-node-allow-lan-access
        #command: tailscaled
```

For `TS_AUTH_KEY_ARCHIVEBOX`, generate a reusable, [ephemeral Auth Key](https://tailscale.com/kb/1111/ephemeral-nodes) (ephemeral, so the machine is automatically cleaned up after logout or inactivity; reusable, so that we don't have to get a new key every time the container comes up).

And of course, don't forget `sudo tailscale set --advertise-exit-node` on your exit node.

@ShockedCoder commented on GitHub (May 27, 2025):

As an expansion to this, it would be great to have a per-site proxy: for example, a regex or similar pattern that matches `.org` and routes those URLs through the designated proxy.

Otherwise, if a proxy doesn't allow connections beyond its defined network, you can't use the same ArchiveBox instance to archive normal webpages at the same time.
