[GH-ISSUE #993] Running archivebox as an unprivileged container #2130

Closed
opened 2026-03-01 17:56:43 +03:00 by kerem · 3 comments
Owner

Originally created by @tofran on GitHub (Jun 21, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/993

Bug description

ArchiveBox Docker images have always been quite opinionated about permissions, but as of v0.6.2 it became impossible to run ArchiveBox as an arbitrary non-privileged user.

/bin/docker_entrypoint.sh is the main reason for this, though I understand why it exists: it makes these Docker images more noob-proof. I would really like a way to configure this entrypoint via env vars and avoid chown/gosu.
Until now I have been bypassing the entrypoint completely, which worked great until v0.6.2 because the code was agnostic to the user and did not require root permissions. Now it requires the user ID to be present in /etc/passwd (for no apparent reason?).

Previous working behaviour (v0.6.0):

docker run \
    --user 1000 \
    --entrypoint sh \
    -e USER=archivebox -e USERNAME=archivebox -e PUID=1000 -e PGID=1000 \
    archivebox/archivebox:0.6.0 \
    -c -- "archivebox version"

Steps to reproduce (v0.6.2):

docker run \
    --user 1000 \
    --entrypoint sh \
    -e USER=archivebox -e USERNAME=archivebox -e PUID=1000 -e PGID=1000 \
    archivebox/archivebox:0.6.2 \
    -c -- "archivebox version"

Output

Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 33, in <module>
    sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
  File "/usr/local/bin/archivebox", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/local/lib/python3.10/importlib/metadata/__init__.py", line 171, in load
    module = import_module(match.group('module'))
  File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/app/archivebox/cli/__init__.py", line 11, in <module>
    from ..config import OUTPUT_DIR, check_data_folder, check_migrations
  File "/app/archivebox/config.py", line 55, in <module>
    SYSTEM_USER = pwd.getpwuid(os.geteuid()).pw_name or SYSTEM_USER
KeyError: 'getpwuid(): uid not found: 1000'

Affecting code

As we can see, we are no longer fighting the entrypoint but the code itself:

github.com/ArchiveBox/ArchiveBox@03eb7e5875/archivebox/config.py (L59)

Affecting commit: 79e19ecd47

Why isn't getpass.getuser() enough?

github.com/ArchiveBox/ArchiveBox@03eb7e5875/archivebox/config.py (L55)

getpass.getuser() already returns archivebox, so why would we need to enforce /etc/passwd with pwd.getpwuid?

Even if this is really required for some edge case, why enforce it? If needed, we could catch the error or find a way to bypass it.
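
The difference between the two lookups can be seen in isolation. The following is a minimal sketch, assuming a POSIX system and Python 3; the helpers `find_unmapped_uid` and `lookup_name` are hypothetical names for this illustration and are not part of ArchiveBox. It shows that getpass.getuser() is satisfied by an environment variable, while pwd.getpwuid() raises KeyError for a UID that has no passwd entry.

```python
import getpass
import os
import pwd
from unittest import mock


def find_unmapped_uid(start=60000, stop=61000):
    """Return a UID with no passwd entry (hypothetical helper for this demo)."""
    for uid in range(start, stop):
        try:
            pwd.getpwuid(uid)
        except KeyError:
            return uid
    raise RuntimeError("no unmapped UID found in the searched range")


def lookup_name(uid):
    """Return the passwd name for uid, or None when the UID has no entry."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # Same error as in the traceback: "getpwuid(): uid not found: ..."
        return None


# getpass.getuser() checks LOGNAME, USER, LNAME and USERNAME in that order,
# so it does not consult the passwd database when one of them is set.
with mock.patch.dict(os.environ, {"LOGNAME": "archivebox"}):
    env_name = getpass.getuser()

# A UID without a passwd entry, like 1000 inside the image, resolves to nothing.
unmapped_name = lookup_name(find_unmapped_uid())
```

Here `env_name` comes purely from the environment, which matches what `-e USER=archivebox` provides in the docker run command above, while `unmapped_name` is None because the lookup failed.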

ArchiveBox version

ArchiveBox v0.6.3
Cpython Linux Linux-5.10.104-linuxkit-aarch64-with-glibc2.31 aarch64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.10.4         valid     /usr/local/bin/python3.10
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.10/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.30.2         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2022.04.08     valid     /usr/local/bin/yt-dlp
 √  CHROME_BINARY         v101.0.4951.41  valid     /usr/bin/chromium
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           24 files        valid     /app/archivebox
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled


[i] Data locations:

Note: since archivebox version also loads the config that enforces /etc/passwd, it is the best way to demonstrate the problem, although in a production environment the command would be archivebox server 0.0.0.0:8000 --quick-init.

Why all this hassle?

I run my containers unprivileged, and this is the direction the industry has been moving in. That's why I face this problem.
Note that running the container as root and then calling setuid inside it, although a good start, is not the same as running entirely without capabilities.
I think ArchiveBox should support running this way, or at least allow it (as it did prior to v0.6.2).

I'm willing to get my hands dirty, but first I want the opinion of this project's contributors on the direction this should take.
Thank you.

@pirate commented on GitHub (Jun 21, 2022):

Oh this is just a bug in getpwuid not being able to resolve a numbered user to a name across different OSs. It's not intended to enforce any kind of privilege or /etc/passwd behavior.

I added this to show whether users are running with root, www-data, or a personal account in the archivebox version output for easier support troubleshooting.

If it's causing bugs we can just remove it and all references to it though.

@Goorzhel commented on GitHub (Sep 13, 2023):

Untested, but:

 try:
     import pwd
     SYSTEM_USER = pwd.getpwuid(os.geteuid()).pw_name or SYSTEM_USER
+except KeyError:
+    # Process' UID might not map to a user in cases such as running the Docker image
+    # (where `archivebox` is 999) as a different UID.
+    pass
 except ModuleNotFoundError:
     # pwd is only needed for some linux systems, doesn't exist on windows
     pass

This way SYSTEM_USER falls back to whatever was previously set:
github.com/ArchiveBox/ArchiveBox@03eb7e5875/archivebox/config.py (L55)
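
The same fallback can be written as a small standalone function, which makes it easy to try outside of ArchiveBox. This is only a sketch of the pattern in the diff above, not the code that was eventually committed; the function name `resolve_system_user` and its `uid` parameter are assumptions added here so the unmapped-UID case can be exercised directly.

```python
import os


def resolve_system_user(fallback, uid=None):
    """Return the passwd name for the effective UID, or `fallback`.

    Hypothetical helper: `fallback` plays the role of the previously set
    SYSTEM_USER value, and `uid` exists only to make the function testable.
    """
    try:
        import pwd  # not available on Windows

        if uid is None:
            uid = os.geteuid()
        return pwd.getpwuid(uid).pw_name or fallback
    except ModuleNotFoundError:
        # No pwd module on this platform: keep the previous value.
        return fallback
    except KeyError:
        # The UID has no passwd entry, e.g. `docker run --user 1000`
        # against an image that does not define that user.
        return fallback
```

Calling `resolve_system_user("archivebox")` inside a container started with an arbitrary `--user` would then return "archivebox" instead of raising, provided the fallback string was obtained some other way, for example from the USER environment variable.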

@pirate commented on GitHub (Sep 14, 2023):

fixed, thanks 5c1a14e4f2
