[GH-ISSUE #1581] Question: ArchiveBox scheduler needs same bind mounts set up as main container #3961

Closed
opened 2026-03-15 01:07:28 +03:00 by kerem · 3 comments
Owner

Originally created by @Paulie420 on GitHub (Nov 2, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1581

OK; I'm using ArchiveBox but wanting to store all the archived website data on my NAS/NFS share, so I set up an 'archivebox-archive' volume and mounted ArchiveBox's /data/archive to it. I left its /data local, as per the ArchiveBox documentation... but something isn't right.

When I docker compose up -d ArchiveBox, all runs fine... for a while; however, all memory and swap get eaten up and the lxc container it runs in grinds to a halt. Furthermore, I'm noticing that archived website data is showing up LOCALLY in ~/archivebox/data/archive - a folder that shouldn't ever exist because I'm specifically mounting that archivebox-archive volume to ArchiveBox's /data/archive....

More info: I can jump into the ArchiveBox docker container and I SEE that the NFS share/mount IS there - so something must be leaking or... some other part of my config is wonky?

Last bit of pertinent info - I set the PGID/PUID to 3000:3000 because that's what my NFS/NAS likes - and I chowned ~/archivebox/data to 3000:3000 too....

Can anyone look over my docker-compose.yml and possibly help me out? Much thanks - I'm SURE I'm doing something wrong. Last scrap of info I thought might be useful, though I can't fathom it would make a difference - I'm also mounting my NFS share LOCALLY on the lxc container to ~/archivebox-archive, through its fstab, just so I can see the data - but I wouldn't think this would cause an issue...


docker-compose.yml - with passwords, etc, taken out:

services:
    archivebox:
        image: archivebox/archivebox:latest
        ports:
            - 3000:8000
        volumes:
            - ./data:/data                      # Local configuration data
            - archivebox-archive:/data/archive  # NFS share for archives
            # ./data/personas/Default/chrome_profile/Default:/data/personas/Default/chrome_profile/Default
        environment:
            - PUID=3000  # If I use 3000, NFS share is OK - but ~/archivebox/data
            - PGID=3000  # gets messed up. Opposite if I use 1000 - HOW NFS archivebox!?
            - ADMIN_USERNAME=admin              # create an admin user on first run with the given user/pass combo
            - ADMIN_PASSWORD=poop
            - CSRF_TRUSTED_ORIGINS=https://archivebox.example.com  # REQUIRED for auth, REST API, etc. to work
            - ALLOWED_HOSTS=*                   # set this to the hostname(s) from your CSRF_TRUSTED_ORIGINS
            - PUBLIC_INDEX=True                 # set to False to prevent anonymous users from viewing snapshot list
            - PUBLIC_SNAPSHOTS=True             # set to False to prevent anonymous users from viewing snapshot content
            - PUBLIC_ADD_VIEW=False             # set to True to allow anonymous users to submit new URLs to archive
            - SEARCH_BACKEND_ENGINE=sonic       # tells ArchiveBox to use sonic container below for fast full-text search
            - SEARCH_BACKEND_HOST_NAME=sonic
            - SEARCH_BACKEND_PASSWORD=poops
            # - PUID=911                        # set to your host user's UID & GID if you encounter permissions issues
            # - PGID=911                        # UID/GIDs <500 may clash with existing users and are not recommended
            - MEDIA_MAX_SIZE=999m               # increase this filesize limit to allow archiving larger audio/video files
            - TIMEOUT=120                       # increase this number to 120+ seconds if you see many slow downloads timing out
            # - CHECK_SSL_VALIDITY=True         # set to False to disable strict SSL checking (allows saving URLs w/ broken certs)
            - SAVE_ARCHIVE_DOT_ORG=True       # set to False to disable submitting all URLs to Archive.org when archiving
            # - USER_AGENT="..."                # set a custom USER_AGENT to avoid being blocked as a bot
        restart: always

        # For ad-blocking during archiving, uncomment this section and pihole service section below
        # networks:
        #   - dns
        # dns:
        #   - 172.20.0.53


    ######## Optional Addons: tweak examples below as needed for your specific use case ########

    ### This optional container runs any scheduled tasks in the background, add new tasks like so:
    #   $ docker compose run archivebox schedule --add --every=day --depth=1 'https://example.com/some/rss/feed.xml'
    # then restart the scheduler container to apply any changes to the scheduled task list:
    #   $ docker compose restart archivebox_scheduler
    # https://github.com/ArchiveBox/ArchiveBox/wiki/Scheduled-Archiving

    archivebox_scheduler:

        image: archivebox/archivebox:latest
        command: schedule --foreground --update --every=day
        environment:
            - TIMEOUT=120                       # use a higher timeout than the main container to give slow tasks more time when retrying
            - PUID=3000                        # set to your host user's UID & GID if you encounter permissions issues
            - PGID=3000
        volumes:
            - ./data:/data                  
        # cpus: 2                               # uncomment / edit these values to limit scheduler container resource consumption
        # mem_limit: 2048m
        restart: always


    ### This runs the optional Sonic full-text search backend (much faster than default rg backend).
    # If Sonic is ever started after not running for a while, update its full-text index by running:
    #   $ docker-compose run archivebox update --index-only
    # https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-up-Search

    sonic:
        image: valeriansaliou/sonic:latest
        build:
            # custom build just auto-downloads archivebox's default sonic.cfg as a convenience
            # not needed after first run / if you already have ./etc/sonic.cfg present
            dockerfile_inline: |
                FROM quay.io/curl/curl:latest AS config_downloader
                RUN curl -fsSL 'https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/stable/etc/sonic.cfg' > /tmp/sonic.cfg
                FROM valeriansaliou/sonic:latest
                COPY --from=config_downloader /tmp/sonic.cfg /etc/sonic.cfg
        expose:
            - 1491
        environment:
            - SEARCH_BACKEND_PASSWORD=big_poops
        volumes:
            - ./sonic.cfg:/etc/sonic.cfg:ro    # use this if you prefer to download the config on the host and mount it manually
            - ./data/sonic:/var/lib/sonic/store


    ### This container runs xvfb+noVNC so you can watch the ArchiveBox browser as it archives things,
    # or remote control it to set up a chrome profile w/ login credentials for sites you want to archive.
    # https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile
    # https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#docker-vnc-setup

    novnc:
        image: theasp/novnc:latest
        environment:
            - DISPLAY_WIDTH=1920
            - DISPLAY_HEIGHT=1080
            - RUN_XTERM=no
        ports:
            # to view/control ArchiveBox's browser, visit: http://127.0.0.1:8080/vnc.html
            # restricted to access from localhost by default because it has no authentication
            - 127.0.0.1:8080:8080


    ### Example: Put Nginx in front of the ArchiveBox server for SSL termination and static file serving.
    # You can also use any other ingress provider for SSL like Apache, Caddy, Traefik, Cloudflare Tunnels, etc.

    # nginx:
    #     image: nginx:alpine
    #     ports:
    #         - 443:443
    #         - 80:80
    #     volumes:
    #         - ./etc/nginx.conf:/etc/nginx/nginx.conf
    #         - ./data:/var/www


    ### Example: To run pihole in order to block ad/tracker requests during archiving,
    # uncomment this block and set up pihole using its admin interface

    # pihole:
    #   image: pihole/pihole:latest
    #   ports:
    #     # access the admin HTTP interface on http://localhost:8090
    #     - 127.0.0.1:8090:80
    #   environment:
    #     - WEBPASSWORD=SET_THIS_TO_SOME_SECRET_PASSWORD_FOR_ADMIN_DASHBOARD
    #     - DNSMASQ_LISTENING=all
    #   dns:
    #     - 127.0.0.1
    #     - 1.1.1.1
    #   networks:
    #     dns:
    #       ipv4_address: 172.20.0.53
    #   volumes:
    #     - ./etc/pihole:/etc/pihole
    #     - ./etc/dnsmasq:/etc/dnsmasq.d


    ### Example: run all your ArchiveBox traffic through a WireGuard VPN tunnel to avoid IP blocks.
    # You can also use any other VPN that works at the docker IP level, e.g. Tailscale, OpenVPN, etc.

    # wireguard:
    #   image: linuxserver/wireguard:latest
    #   network_mode: 'service:archivebox'
    #   cap_add:
    #     - NET_ADMIN
    #     - SYS_MODULE
    #   sysctls:
    #     - net.ipv4.conf.all.rp_filter=2
    #     - net.ipv4.conf.all.src_valid_mark=1
    #   volumes:
    #     - /lib/modules:/lib/modules
    #     - ./wireguard.conf:/config/wg0.conf:ro

    ### Example: Run ChangeDetection.io to watch for changes to websites, then trigger ArchiveBox to archive them
    # Documentation: https://github.com/dgtlmoon/changedetection.io
    # More info: https://github.com/dgtlmoon/changedetection.io/blob/master/docker-compose.yml

    # changedetection:
    #     image: ghcr.io/dgtlmoon/changedetection.io
    #     volumes:
    #         - ./data-changedetection:/datastore


    ### Example: Run PYWB in parallel and auto-import WARCs from ArchiveBox

    # pywb:
    #     image: webrecorder/pywb:latest
    #     entrypoint: /bin/sh -c '(wb-manager init default || test $$? -eq 2) && wb-manager add default /archivebox/archive/*/warc/*.warc.gz; wayback;'
    #     environment:
    #         - INIT_COLLECTION=archivebox
    #     ports:
    #         - 8686:8080
    #     volumes:
    #         - ./data:/archivebox
    #         - ./data/wayback:/webarchive


networks:
    # network just used for pihole container to offer :53 dns resolving on fixed ip for archivebox container
    dns:
        ipam:
            driver: default
            config:
                - subnet: 172.20.0.0/24


# HOW TO: Set up cloud storage for your ./data/archive (e.g. Amazon S3, Backblaze B2, Google Drive, OneDrive, SFTP, etc.)
#   https://github.com/ArchiveBox/ArchiveBox/wiki/Setting-Up-Storage
#
#   Follow the steps here to set up the Docker RClone Plugin https://rclone.org/docker/
#     $ docker plugin install rclone/docker-volume-rclone:amd64 --grant-all-permissions --alias rclone
#     $ nano /var/lib/docker-plugins/rclone/config/rclone.conf
#     [examplegdrive]
#     type = drive
#     scope = drive
#     drive_id = 1234567...
#     root_folder_id = 0Abcd...
#     token = {"access_token":...}

volumes:
    archivebox-archive:
        driver_opts:
            type: "nfs"
            o: "addr=x.x.x.x,nolock,soft,rw,nfsvers=4"
            device: ":/some/path/to/ArchiveBox"
kerem closed this issue 2026-03-15 01:07:33 +03:00

@pirate commented on GitHub (Nov 2, 2024):

Your archivebox scheduler container is missing the bind mount, it needs it too:

archivebox_scheduler:
    ...
    volumes:
        - ./data:/data
        - archivebox-archive:/data/archive
    environment:
        - SEARCH_BACKEND_ENGINE=sonic
        - SEARCH_BACKEND_HOST_NAME=sonic
        - SEARCH_BACKEND_PASSWORD=poops
        - MEDIA_MAX_SIZE=999m
        - SAVE_ARCHIVE_DOT_ORG=True
        # you need to add any other config vars you want to apply to both containers too
        # OR better: put it in data/ArchiveBox.conf instead so you don't need to duplicate it

The ArchiveBox scheduler runs archivebox just like the main container does, so it needs all the same config and bind mounts.

In the upcoming v0.9.0 release the scheduler will run inside the main container automatically, so two separate containers won't be needed anymore, but if you are running v0.7.2 then two containers are still needed.
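One way to keep the two containers in sync without maintaining duplicate settings is Compose's YAML anchors and merge keys. A minimal sketch of the idea (the `x-archivebox-defaults` extension field and anchor name are arbitrary labels I chose, not part of ArchiveBox; values are from the compose file above):

```yaml
# Shared settings defined once via a YAML anchor (extension fields starting
# with x- are ignored by Compose except where the anchor is referenced)
x-archivebox-defaults: &archivebox-defaults
    image: archivebox/archivebox:latest
    volumes:
        - ./data:/data
        - archivebox-archive:/data/archive  # both containers get the NFS volume
    environment:
        - PUID=3000
        - PGID=3000
        - SEARCH_BACKEND_ENGINE=sonic
        - SEARCH_BACKEND_HOST_NAME=sonic
        - SEARCH_BACKEND_PASSWORD=poops

services:
    archivebox:
        <<: *archivebox-defaults            # merge in the shared settings
        ports:
            - 3000:8000

    archivebox_scheduler:
        <<: *archivebox-defaults
        command: schedule --foreground --update --every=day
```

Note that merge keys replace whole mappings rather than deep-merging lists, so any service that needs extra `environment` or `volumes` entries must repeat the full list; for that case, putting shared config in data/ArchiveBox.conf as suggested above is simpler.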


@Paulie420 commented on GitHub (Nov 2, 2024):

Thanks for the quick reply - that makes sense, appreciate the extra eyeballs!

It always creates a ~/archivebox/data/archive folder, and I don't understand why... but I updated the archivebox_scheduler. Thanks!

When v0.9.0 comes out I'll prolly have to modify docker-compose.yml again...


@pirate commented on GitHub (Nov 2, 2024):

It's possible the bind mount lags for a split second on startup, giving archivebox enough time to create ./data/archive. As long as the data is appearing in your NFS mount properly, I wouldn't worry about it too much.
