[GH-ISSUE #1513] Bug: Scheduled archiving ignores environment variables for archiving methods in docker-compose.yml #3914

Closed
opened 2026-03-15 00:57:53 +03:00 by kerem · 3 comments

Originally created by @simmeringdeacon on GitHub (Sep 8, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1513

Describe the bug

  • Description: The scheduled archiving process is not respecting the environment variables set in the docker-compose.yml file for specific archiving methods. As a result, it's re-archiving existing URLs using all available methods, including those explicitly set to False in the configuration.
  • Expected behavior: The scheduler should respect the environment variables set in docker-compose.yml and only use the enabled archiving methods during scheduled updates.
  • My observations: This issue appears to be specific to the scheduler process, as manual archiving respects these settings. The problem may be related to how environment variables are passed or interpreted by the scheduler container.

Steps to reproduce

  1. Configure ArchiveBox with specific archiving methods disabled in docker-compose.yml:
services:
    archivebox:
        environment:
            - SAVE_ARCHIVE_DOT_ORG=False
            - SAVE_WGET=False
            - SAVE_WARC=False
            - SAVE_PDF=False
            - SAVE_DOM=False
            # ... (other methods set accordingly)

    archivebox_scheduler:
        image: archivebox/archivebox:latest
        command: schedule --foreground --update --every=day
        environment:
            - TIMEOUT=120
        volumes:
            - ./data:/data
  2. Run the ArchiveBox container with this configuration using docker compose up.
  3. Observe in the terminal output that a cron job is set up:
  4. Observe that during the daily scheduled archiving, existing archives are partially re-archived using methods that were set to False in the configuration.
  • The scheduler adds new archive formats (such as wget, warc, pdf) to existing entries, while preserving previously archived versions (like singlefile HTML and readability).
  • This behavior ignores the False settings for certain archiving methods and results in a more comprehensive archive than intended, without overwriting existing archived content.

Screenshots or log output

Excerpt from schedule.log showing disabled methods being used:

[√] [2024-09-08 04:45:01] "${redctedTitle}"
    ${redactedUrl}
    √ ./archive/1725650052.00000
      > pdf
      > dom
      > wget
      > mercury
      > media
      > archive_org
        25 files (10.5 MB) in 0:00:39s 

Excerpt from the terminal output showing a cron job:

archivebox_scheduler-1  | 
archivebox_scheduler-1  | [!] With the current cron config, ArchiveBox is estimated to run >366 times per year.
archivebox_scheduler-1  |     Congrats on being an enthusiastic internet archiver! 👌
archivebox_scheduler-1  | 
archivebox_scheduler-1  |     Make sure you have enough storage space available to hold all the data.
archivebox_scheduler-1  |     Using a compressed/deduped filesystem like ZFS is recommended if you plan on archiving a lot.
archivebox_scheduler-1  | 
archivebox_scheduler-1  | 
archivebox_scheduler-1  | [√] Scheduled new ArchiveBox cron job for user: archivebox (1 jobs are active).
archivebox_scheduler-1  |   > @daily cd /data && /usr/local/bin/archivebox update >> /data/logs/schedule.log 2>&1 # archivebox_schedule
archivebox_scheduler-1  | [*] Running 1 ArchiveBox jobs in foreground task scheduler...
archivebox_scheduler-1  |   > update

ArchiveBox version

0.7.2
ArchiveBox v0.7.2 COMMIT_HASH=315c9f3 BUILD_TIME=2024-04-24 22:56:58 1713999418
IN_DOCKER=True IN_QEMU=False ARCH=aarch64 OS=Linux PLATFORM=Linux-6.10.0-linuxkit-aarch64-with-glibc2.36 PYTHON=Cpython
FS_ATOMIC=True FS_REMOTE=True FS_USER=911:0 FS_PERMS=644
DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND=sonic LDAP=False

[i] Dependency versions:
 √  PYTHON_BINARY         v3.11.9         valid     /usr/local/bin/python3.11                                                   
 √  SQLITE_BINARY         v2.6.0          valid     /usr/local/lib/python3.11/sqlite3/dbapi2.py                                 
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.11/site-packages/django/__init__.py                  
 √  ARCHIVEBOX_BINARY     v0.7.2          valid     /usr/local/bin/archivebox                                                   

 √  CURL_BINARY           v8.5.0          valid     /usr/bin/curl                                                               
 -  WGET_BINARY           -               disabled  /usr/bin/wget                                                               
 √  NODE_BINARY           v20.12.2        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v1.1.46         valid     /app/node_modules/single-file-cli/single-file                               
 √  READABILITY_BINARY    v0.0.11         valid     /app/node_modules/readability-extractor/readability-extractor               
 -  MERCURY_BINARY        -               disabled  /app/node_modules/@postlight/parser/cli.js                                  
 -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
 -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v124.0.6367.29  valid     /usr/bin/chromium-browser                                                   
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled  None                                                                        

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled  None                                                                        
 -  COOKIES_FILE          -               disabled  None                                                                        

[i] Data locations:
 √  OUTPUT_DIR            9 files @       valid     /data                                                                       
 √  SOURCES_DIR           49 files        valid     ./sources                                                                   
 √  LOGS_DIR              2 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           49 files        valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             684.0 KB        valid     ./index.sqlite3  
kerem closed this issue 2026-03-15 00:57:58 +03:00

@pirate commented on GitHub (Sep 8, 2024):

This is expected behavior as you don't have any environment variables set on the scheduling container. You only set them in the primary container (they only apply to the container they are set on).

If you want to share config between containers, you should put those settings in ArchiveBox.conf instead of environment variables, or you could copy-paste your environment section onto both.

In general, environment variables only apply to the processes that have them in their scope. I wouldn't want processes to copy the env they're getting and try to force-apply it to sibling processes that have a different env.
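
A minimal sketch of the copy-paste option, assuming the same service names and settings as the compose file in the report. The disabled-method variables are repeated under archivebox_scheduler so that the scheduled archivebox update job sees them too. Only the variables listed in the report are shown; any other toggles would need to be repeated in the same way.

```yaml
services:
    archivebox:
        environment:
            - SAVE_ARCHIVE_DOT_ORG=False
            - SAVE_WGET=False
            - SAVE_WARC=False
            - SAVE_PDF=False
            - SAVE_DOM=False

    archivebox_scheduler:
        image: archivebox/archivebox:latest
        command: schedule --foreground --update --every=day
        environment:
            - TIMEOUT=120
            # Repeated from the archivebox service: each container only
            # sees the variables listed in its own environment section.
            - SAVE_ARCHIVE_DOT_ORG=False
            - SAVE_WGET=False
            - SAVE_WARC=False
            - SAVE_PDF=False
            - SAVE_DOM=False
        volumes:
            - ./data:/data
```

The trade-off is that the two lists have to be kept in sync by hand. Settings stored in ArchiveBox.conf avoid this, as long as both containers read the same data directory.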


@simmeringdeacon commented on GitHub (Sep 8, 2024):

Would it be possible to make this more obvious in https://github.com/ArchiveBox/ArchiveBox/wiki/Docker#configuration and/or "Caveats - Archiving Private Content" section of https://archivebox.io/? This was not immediately obvious to me, even after thoroughly reading the documentation, and it was only after testing for a day that I realized save_archive_dot_org was still being triggered accidentally.

I ended up tweaking the codebase myself to prevent this from happening, but I believe it would be beneficial to a wider user base if this caveat were made clearer in the official documentation.


@pirate commented on GitHub (Sep 9, 2024):

I could add more, but it's explained in a few places already: https://github.com/ArchiveBox/ArchiveBox/wiki/Docker#configuration

[image: https://github.com/user-attachments/assets/8194aad9-b190-428f-977a-6b6b5fbbfb83]
[image: https://github.com/user-attachments/assets/5a36ea3f-98ea-48e5-a734-a7c83b17bc24]

I understand it can be confusing if you're new to Docker, but I don't necessarily want to make my docs a thorough re-introduction to all of Docker's idiosyncrasies. I do expect people to learn Docker to some degree if they are using the Docker install method 🤷