[GH-ISSUE #952] Bug: CHROME_USER_DATA_DIR not working for login #591

Closed
opened 2026-03-01 14:44:49 +03:00 by kerem · 11 comments
Owner

Originally created by @ga-it on GitHub (Mar 21, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/952

Describe the bug

Hi

I wish to use ArchiveBox to save content from subscriptions behind paywalls.

I have exported cookies in Netscape format (for wget, as I understand it, via COOKIES_FILE) and shared my Chromium user data path with ArchiveBox (CHROME_USER_DATA_DIR).

Neither has resulted in a logged-in session (evident from the screenshots and denied access).

The Chromium user data folder has been particularly problematic: ArchiveBox rejected the path provided (no "Default" profile found, even though it was there) before suddenly accepting it.

To ensure a usable cookie, I created a VNC session on the server, browsed to the site in Chromium, logged in, and then ran ArchiveBox with the path to the user data folder.

Great project - but the login feature is critical for me to archive content behind paywalls.

Regards

Marc

Steps to reproduce

[SERVER_CONFIG]
SECRET_KEY = XXXXXXXXXXXXX

[ARCHIVE_METHOD_OPTIONS]
COOKIES_FILE = /home/XX/cookies.txt
CHROME_USER_DATA_DIR = /home/XX/.config/chromium

[ARCHIVE_METHOD_TOGGLES]
SAVE_ARCHIVE_DOT_ORG = False

Screenshots or log output

Could not find profile "Default" in CHROME_USER_DATA_DIR.
archivebox_1  |     /home/XX/.config/chromium
archivebox_1  |     Make sure you set it to a Chrome user data directory containing a Default profile folder.

ArchiveBox version

 ArchiveBox v0.6.3: archivebox server

ArchiveBox Dev docker image running on Debian Testing

kerem closed this issue 2026-03-01 14:44:49 +03:00
Author
Owner

@pirate commented on GitHub (Mar 22, 2022):

Are you mounting the Chrome profile inside of Docker? Keep in mind that the Chrome version inside Docker and outside is different; you must create the profile with the exact same browser binary. E.g. you can't use a Chrome profile generated outside Docker if you're using Chromium inside Docker for ArchiveBox CHROME_BINARY.

Also please post the full output of archivebox version and docker-compose.yml; don't redact it or I can't help. The ticket instructions are there for a reason.

Try setting CHROME_HEADLESS=False and making sure the browser GUI that shows is loading with the correct profile when running archivebox add.

What path are you using for the user data dir, and can you post a screenshot of that dir so we can make sure it's the right one?

https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#chrome_binary
https://github.com/ArchiveBox/ArchiveBox/wiki/Configuration#chrome_headless

Author
Owner

@pirate commented on GitHub (Mar 22, 2022):

Are you mounting the Chrome profile inside of Docker? Keep in mind that the Chrome version inside Docker and outside is different; you must create the profile with the exact same browser binary.

Author
Owner

@terxw commented on GitHub (Mar 24, 2022):

I had the same problem. It's probably an upstream problem with headless Chromium or, as pirate says, incompatible Chrome versions.
I got it working after replicating my setup outside Docker with google-chrome as CHROME_BINARY; after that, the profile with cookies and the logged-in session works.

Author
Owner

@mwnoo commented on GitHub (Mar 26, 2022):

I'm using docker-compose to run ArchiveBox (v0.6.2) with chromium 90.0.4430.93 running inside the container. Outside docker I'm running chromium 99.0.4844.82 on Ubuntu 20.04.

How can I use the same chromium binary to create the profile?

On Ubuntu 20.04 I cannot install chromium version 90.0.4430.93:
sudo apt install chromium-browser=90.0.4430.93
E: Version '90.0.4430.93' for 'chromium-browser' was not found

Do I need to mount the folder containing chromium 99.0.4844.82 on Ubuntu 20.04 (/snap/bin/chromium) inside the docker container? Or can I update chromium inside the docker container?

If I add /snap/bin:/snap/bin as a volume in docker-compose.yml, set CHROME_BINARY=/snap/bin/chromium, and run docker-compose run archivebox --version, I get this error: ! CHROME_BINARY: /snap/bin/chromium (unable to detect version)

Author
Owner

@mwnoo commented on GitHub (Mar 27, 2022):

UPDATE
I was able to install chromium 90.0.4430.93 on MX Linux (same version as in the ArchiveBox docker image)

$ chromium --version 
Chromium 90.0.4430.93 built on Debian bullseye/sid, running on Debian 11.1

On the host I visited a few sites and accepted the cookies.
Then I copied the profile folder to a folder that is mounted in the docker image:
cp -r ~/.config/chromium/ chromium

When I check the configuration, the CHROME_USER_DATA_DIR seems valid:

$ sudo docker-compose run archivebox --version

ArchiveBox v0.6.2
Cpython Linux Linux-5.10.0-9-amd64-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9                                                    
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git                                                                
 -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 √  CHROME_USER_DATA_DIR  28 files        valid     /chromium                                                                   
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            10 files        valid     /data                                                                       
 √  SOURCES_DIR           140 files       valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           826 files       valid     ./archive                                                                   
 √  CONFIG_FILE           349.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             7.0 MB          valid     ./index.sqlite3                                                             

Using sqlitebrowser I can see that the Cookie database contains the data.
Also the hashes are the same:

$ shasum chromium/Default/Cookies
ec465fa8c3e280de02d1e0a86dd0a01dcab7f090  chromium/Default/Cookies
$ shasum ~/.config/chromium/Default/Cookies
ec465fa8c3e280de02d1e0a86dd0a01dcab7f090  /home/test/.config/chromium/Default/Cookies

However, when I archive the same sites using ArchiveBox, the cookie banners are still shown in the output (pdf, screenshot, etc.).
Any ideas what is going wrong here?

docker-compose.yml
I mount the profile folder as read-only (ro); otherwise the contents of the Cookie database are cleared by ArchiveBox.

# Usage:
#     docker-compose run archivebox init --setup
#     docker-compose up
#     echo "https://example.com" | docker-compose run archivebox archivebox add
#     docker-compose run archivebox add --depth=1 https://example.com/some/feed.rss
#     docker-compose run archivebox config --set PUBLIC_INDEX=True
#     docker-compose run archivebox help
# Documentation:
#     https://github.com/ArchiveBox/ArchiveBox/wiki/Docker#docker-compose

version: '2.4'

services:
    archivebox:
        # build: .                              # for developers working on archivebox
        image: ${DOCKER_IMAGE:-archivebox/archivebox:master}
        command: server --quick-init 0.0.0.0:8000
        ports:
            - 8000:8000
        environment:
            - PUID=1000
            - PGID=1000
            - ALLOWED_HOSTS=*                   # add any config options you want as env vars
            - MEDIA_MAX_SIZE=750m
            - SEARCH_BACKEND_ENGINE=sonic     # uncomment these if you enable sonic below
            - SEARCH_BACKEND_HOST_NAME=sonic
            - SEARCH_BACKEND_PASSWORD=SecretPassword
        volumes:
            - ./data:/data
            - ./chromium:/chromium:ro
            # - ./archivebox:/app/archivebox    # for developers working on archivebox

    # To run the Sonic full-text search backend, first download the config file to sonic.cfg
    # curl -O https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/etc/sonic.cfg
    # after starting, backfill any existing Snapshots into the index: docker-compose run archivebox update --index-only
    sonic:
        image: valeriansaliou/sonic:v1.3.0
        expose:
            - 1491
        environment:
            - SEARCH_BACKEND_PASSWORD=SecretPassword
        volumes:
            - ./sonic.cfg:/etc/sonic.cfg:ro
            - ./data/sonic:/var/lib/sonic/store


    ### Optional Addons: tweak these examples as needed for your specific use case

    # Example: Run scheduled imports in a docker instead of using cron on the
    # host machine, add tasks and see more info with archivebox schedule --help
    # scheduler:
    #    image: archivebox/archivebox:latest
    #    command: schedule --foreground --every=day --depth=1 'https://getpocket.com/users/USERNAME/feed/all'
    #    environment:
    #        - USE_COLOR=True
    #        - SHOW_PROGRESS=False
    #    volumes:
    #        - ./data:/data

    # Example: Put Nginx in front of the ArchiveBox server for SSL termination
    # nginx:
    #     image: nginx:alpine
    #     ports:
    #         - 443:443
    #         - 80:80
    #     volumes:
    #         - ./etc/nginx/nginx.conf:/etc/nginx/nginx.conf
    #         - ./data:/var/www

    # Example: run all your ArchiveBox traffic through a WireGuard VPN tunnel
    # wireguard:
    #   image: linuxserver/wireguard
    #   network_mode: 'service:archivebox'
    #   cap_add:
    #     - NET_ADMIN
    #     - SYS_MODULE
    #   sysctls:
    #     - net.ipv4.conf.all.rp_filter=2
    #     - net.ipv4.conf.all.src_valid_mark=1
    #   volumes:
    #     - /lib/modules:/lib/modules
    #     - ./wireguard.conf:/config/wg0.conf:ro

    # Example: Run PYWB in parallel and auto-import WARCs from ArchiveBox
    # pywb:
    #     image: webrecorder/pywb:latest
    #     entrypoint: /bin/sh 'wb-manager add default /archivebox/archive/*/warc/*.warc.gz; wayback --proxy;'
    #     environment:
    #         - INIT_COLLECTION=archivebox
    #     ports:
    #         - 8080:8080
    #     volumes:
    #         - ./data:/archivebox
    #         - ./data/wayback:/webarchive

ArchiveBox.conf

[SERVER_CONFIG]
SECRET_KEY = XXXXXXXXXXXXXXXXXXXXXXXXX
SNAPSHOTS_PER_PAGE = 100

[ARCHIVE_METHOD_TOGGLES]
SAVE_ARCHIVE_DOT_ORG = False
SAVE_MEDIA = False
SAVE_WGET = False
SAVE_READABILITY = True
SAVE_MERCURY = True
SAVE_DOM = True

[ARCHIVE_METHOD_OPTIONS]
CHROME_USER_DATA_DIR = /chromium

[GENERAL_CONFIG]
TIMEOUT = 180
Author
Owner

@szenrom commented on GitHub (Apr 6, 2022):

Hi!

I'm not sure if I should create a new issue, as my problem is somewhat similar to OP's (ArchiveBox doesn't seem to properly use the Google Chrome user directory, as seen in OP's update), but I don't use Docker; I installed it with pip.

Please let me know if I should repost it as a new issue.

My setup (details below):

  1. macOS 12.3.1
  2. ArchiveBox installed within a pyenv virtualenv (Python 3.10.4)
  3. Node.js provided by nvm (node 17.8.0)
  4. Tools like wget installed with MacPorts

My issues:

  1. ArchiveBox seems to locate all dependencies properly, yet with a simple archivebox add it doesn't seem to use the browser user profile (websites are captured in their logged-out version and all cookie banners are up).
  2. After adding CHROME_USER_DATA_DIR and CHROME_BINARY the problem became worse: after making sure I'm logged in on the website, closing Chrome, and running ArchiveBox, I get the same output as before, but I'm also logged out of the website when I restart Chrome.
    • It looks like the cookies are lost for the website that I tried to archive.
  3. Running ArchiveBox with CHROME_HEADLESS=False opens the proper user profile (I verified this by seeing my add-ons) but with the cookies already lost.
    • Also, it seems that the headless argument isn't passed to single-file, as it doesn't show Chrome.
    • Other extractors always time out: Chrome opens, nothing happens, and it closes after the timeout.
  4. I tried the dev branch version, but it had the same issues plus a few more (there was an issue with checking the version of installed Node modules).

Details of the setup:

Output of command to check versions after manual installation

> python3 --version | head -n 1 &&
git --version | head -n 1 &&
wget --version | head -n 1 &&
curl --version | head -n 1 &&
youtube-dl --version | head -n 1 &&
echo "[√] All dependencies installed."
Python 3.10.4
git version 2.32.0 (Apple Git-132)
GNU Wget 1.21.3 built on darwin20.6.0.
curl 7.82.0 (x86_64-apple-darwin20.6.0) libcurl/7.82.0 OpenSSL/3.0.2 zlib/1.2.12 zstd/1.5.2 libidn2/2.3.2 libpsl/0.21.1 (+libidn2/2.3.2)
2021.12.17
[√] All dependencies installed.

Output of archivebox --version

> archivebox --version
ArchiveBox v0.6.2
Cpython Darwin macOS-12.3.1-x86_64-i386-64bit x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /Users/szenrom/.pyenv/versions/3.10.4/envs/ArchiveBox-pip/bin/archivebox
 √  PYTHON_BINARY         v3.10.4         valid     /Users/szenrom/.pyenv/versions/3.10.4/bin/python3.10
 √  DJANGO_BINARY         v3.1.14         valid     /Users/szenrom/.pyenv/versions/3.10.4/envs/ArchiveBox-pip/lib/python3.10/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.82.0         valid     /opt/local/bin/curl
 √  WGET_BINARY           v1.21.3         valid     /opt/local/bin/wget
 √  NODE_BINARY           v17.8.0         valid     /Users/szenrom/.nvm/versions/node/v17.8.0/bin/node
 √  SINGLEFILE_BINARY     v0.3.32         valid     ./node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.3          valid     ./node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.32.0         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /Users/szenrom/.pyenv/versions/ArchiveBox-pip/bin/youtube-dl
 √  CHROME_BINARY         v100.0.4896.75  valid     "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"
 √  RIPGREP_BINARY        v13.0.0         valid     /opt/local/bin/rg

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /Users/szenrom/.pyenv/versions/3.10.4/envs/ArchiveBox-pip/lib/python3.10/site-packages/archivebox
 √  TEMPLATES_DIR         3 files         valid     /Users/szenrom/.pyenv/versions/3.10.4/envs/ArchiveBox-pip/lib/python3.10/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 √  CHROME_USER_DATA_DIR  40 files        valid     "/Users/szenrom/Library/Application Support/Google/Chrome"
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            9 files         valid     /Users/szenrom/Documents/ArchiveBox-pip
 √  SOURCES_DIR           20 files        valid     ./sources
 √  LOGS_DIR              1 files         valid     ./logs
 √  ARCHIVE_DIR           9 files         valid     ./archive
 √  CONFIG_FILE           310.0 Bytes     valid     ./ArchiveBox.conf
 √  SQL_INDEX             244.0 KB        valid     ./index.sqlite3

Contents of ArchiveBox.conf

> cat ArchiveBox.conf
[SERVER_CONFIG]
SECRET_KEY = xxx

[ARCHIVE_METHOD_OPTIONS]
CHROME_HEADLESS = False
CHROME_USER_DATA_DIR = /Users/szenrom/Library/Application Support/Google/Chrome/

[DEPENDENCY_CONFIG]
CHROME_BINARY = /Applications/Google Chrome.app/Contents/MacOS/Google Chrome

@ga-it commented on GitHub (Apr 9, 2022):

Thanks @pirate

I have now successfully got ArchiveBox to use my Chromium profile.

Resolution:

  1. Correct volume mount in Docker, with the matching path referenced in the config
  2. Exactly the same version of Chromium outside and inside the Docker image
  3. Chromium outside Docker closed while ArchiveBox is running
  4. Read/write permissions on the Chromium profile

ArchiveBox.conf

[SERVER_CONFIG]
SECRET_KEY = XX

[ARCHIVE_METHOD_OPTIONS]
CHROME_USER_DATA_DIR = /data/chromium

[ARCHIVE_METHOD_TOGGLES]
SAVE_ARCHIVE_DOT_ORG = False

docker-compose.yml

services:
  archivebox:
    XX
    ports:
      XX
    environment:
      XX
    volumes:
      XX
      - /XX/.config/chromium:/data/chromium
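
Some of the resolution points above (profile present at the mounted path, Chromium closed, read/write permissions) can be checked before a run. The following is a minimal sketch, not part of ArchiveBox: `check_profile_dir` is a hypothetical helper, and it assumes that a running Chromium on Linux leaves a `SingletonLock` entry in the user data directory.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for a Chromium user data directory.
# check_profile_dir is an illustrative helper, not an ArchiveBox command.
check_profile_dir() {
    local dir="$1"

    # ArchiveBox complains when there is no "Default" profile folder.
    if [ ! -d "$dir/Default" ]; then
        echo "missing Default profile in $dir"
        return 1
    fi

    # The profile must be readable and writable by the current user.
    if [ ! -r "$dir/Default" ] || [ ! -w "$dir/Default" ]; then
        echo "profile in $dir is not readable and writable"
        return 1
    fi

    # Assumption: a running Chromium keeps a SingletonLock symlink here.
    # The symlink is usually dangling, so test with -L as well as -e.
    if [ -e "$dir/SingletonLock" ] || [ -L "$dir/SingletonLock" ]; then
        echo "Chromium appears to be running against $dir"
        return 1
    fi

    echo "ok"
}
```

Inside the container this could be called as `check_profile_dir /data/chromium`, using the mount path from the config above.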


@ga-it commented on GitHub (Apr 10, 2022):

I have since found that keeping the Chromium version in the Docker image and on the host in sync is a nightmare.

If the versions are not exactly the same, the profiles become incompatible.

The Chromium version in the current dev image is 90.0.4430.212. Downloading this version via https://chromium.cypress.io/ actually yields 90.0.4430.0, so the profile cannot be reopened after the Docker image has used it.

To prevent this, I used the following workaround:

  1. Install the desired Chromium version in a directory shared as a Docker volume (I used "data" for ease)
  2. Run vncserver as the archivebox user and run Chromium in the VNC session to generate cookies
  3. Close Chromium in the vncserver session
  4. chmod -R ugo+rwx /opt/archivebox/.config/chromium
  5. Mount /opt/archivebox/.config/chromium as the Docker volume /data/chromium
  6. Set CHROME_USER_DATA_DIR = /data/chromium
  7. Set CHROME_BINARY = /data/chrome-linux/chrome (the installed version of Chrome is now common to the host VNC session and the Docker container)
  8. chown -R archivebox:archivebox /opt/archivebox/

I found each step to be crucial, especially the permissions.

Now the profile is generated and used by the same Chrome build on both the Docker host and the container.
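
Steps 4 to 8 above can be sketched as a small script. This is only an illustration under stated assumptions: `run` and `apply_profile_workaround` are hypothetical helpers, the paths are the ones quoted in the comment, and the two config values are assumed to be set with `archivebox config --set` through `docker-compose run`. The volume mount in step 5 belongs in `docker-compose.yml` and is not covered here. The script defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of steps 4, 6, 7 and 8 from the workaround above.
# DRY_RUN, run and apply_profile_workaround are illustrative names.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

apply_profile_workaround() {
    # Step 4: open up permissions on the host-side profile.
    run chmod -R ugo+rwx /opt/archivebox/.config/chromium
    # Step 8: give the archivebox user ownership of the whole tree.
    run chown -R archivebox:archivebox /opt/archivebox/
    # Steps 6 and 7: point ArchiveBox at the shared profile and binary
    # (assumes the config is changed via the archivebox CLI in the container).
    run docker-compose run archivebox config --set CHROME_USER_DATA_DIR=/data/chromium
    run docker-compose run archivebox config --set CHROME_BINARY=/data/chrome-linux/chrome
}
```

Calling `apply_profile_workaround` prints the four commands for review; `DRY_RUN=0 apply_profile_workaround` would execute them, which typically needs root on the host.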


@pirate commented on GitHub (Apr 11, 2022):

Yeah, those steps sound right. Unfortunately that is the status quo right now: there's no easy way to make profiles compatible across versions. The next release will update Chrome to the latest version, which may make things slightly easier.
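
Since profiles are not compatible across versions, it can help to compare the two version strings before sharing a profile. A minimal sketch follows; `extract_version` and `same_chromium_version` are hypothetical helpers, and the check assumes the browser prints a four-part dotted version number such as the `90.0.4430.212` quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing host and container Chromium versions.

# Read text on stdin and print the first four-part dotted version found.
extract_version() {
    grep -oE '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

# Require an exact string match, since even a differing last
# component was reported above to break the profile.
same_chromium_version() {
    local host_ver="$1" container_ver="$2"
    if [ "$host_ver" = "$container_ver" ]; then
        echo "match"
    else
        echo "mismatch: host=$host_ver container=$container_ver"
        return 1
    fi
}
```

The two arguments could come from something like `chromium --version | extract_version` on the host and the same command run inside the container; the binary name differs between installs, so treat it as an assumption.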


@pirate commented on GitHub (Apr 12, 2022):

I've added the instructions from your steps @ga-it to the wiki for future reference: https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile

If possible, can you provide the steps you used to install and set up the vncserver connected to Chrome? Thanks @ga-it!


@OlegShevtsov1 commented on GitHub (Jun 17, 2023):

@pirate

I'm following the steps from https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile
√ CHROME_BINARY v90.0.4430.93 valid /usr/bin/chromium

mkdir -p ~/projects/archivebox/chromium/.config
cp -R ~/.config/chromium ~/projects/archivebox/chromium/.config

I noticed one difference: an extra step has to be run as root:

sudo cp -R ~/projects/archivebox/chromium ~/projects/archivebox/data
sudo chown -R systemd-coredump:systemd-coredump ~/projects/archivebox/data/chromium/

This is needed because the data directory is owned by the user systemd-coredump after running init:
docker-compose run archivebox init --setup

And then your steps:

docker-compose run --rm archivebox /bin/bash
chown -R archivebox:archivebox /data/chromium/
chmod -R ugo+rwx /data/chromium

Unfortunately, it still cannot access the protected page.

OS Ubuntu 22.04.
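
One possible explanation for the `systemd-coredump` owner, which the thread does not confirm, is that the host is showing its own name for the numeric UID that the container user writes files as. Comparing numeric IDs avoids that ambiguity. A minimal sketch, assuming GNU `stat` (as on Ubuntu); `numeric_owner` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the numeric uid:gid that owns a path.
# Assumes GNU coreutils stat; BSD/macOS stat uses different flags.
numeric_owner() {
    stat -c '%u:%g' "$1"
}
```

Running `numeric_owner` on the profile directory on the host and again inside the container shows whether both sides see the same numeric owner, regardless of the user name each side prints.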
