[GH-ISSUE #254] FR: Use custom cookies for PDF and screenshot generation #1688

Closed
opened 2026-03-01 17:52:53 +03:00 by kerem · 11 comments

Originally created by @adan89lion on GitHub (Aug 10, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/254

Type

  • [ ] General Question or Discussion
  • [ ] Propose a brand new feature
  • [X] Request modification of existing behavior or design

What is the problem that your feature request solves

I am archiving some websites with an 18+ confirmation (mildly adult content) that show an entry page to verify your age. I have exported the cookies.txt file and linked it in the configuration, and the Local Archive passed the confirmation successfully; however, I noticed that the cookie file from my configuration is not applied to the other archive types (e.g. HTML, PDF and screenshot), so all they captured was the 18+ confirmation screen.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

FYI this (https://www.ptt.cc/bbs/Gossiping/M.1553393933.A.A92.html) is the web page I tried to archive (and other posts under this sub forum). I hope the archiving process can use the custom cookie file I provided in the configuration, so that the PDF, HTML and screenshot archives come out correctly.

What hacks or alternative solutions have you tried to solve the problem?

I've looked into the index.json file for each archive and found that only wget was given my cookie file. It would be nice to include a cookies flag for the headless chrome/chromium command as well.

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [X] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

  • [ ] I'm willing to contribute to development / fixing this issue
  • [X] I like ArchiveBox so far / would recommend it to a friend

P.S. I don't have coding experience, so please excuse my limited IT knowledge.

kerem closed this issue 2026-03-01 17:52:53 +03:00

@pirate commented on GitHub (Aug 11, 2019):

The trick is your Chrome data dir used for archiving needs to be from a Chrome instance that's logged into the site. Try opening chromium-browser or whatever binary is the same chrome instance you're using for archiving, and logging into the site, then running archivebox. If you're doing it on a remote server you'll need to rsync your chrome data dir to the server.

You can find it at one of these paths depending on what OS you're on and what Chrome version you're using:

            # if using chromium
            '~/.config/chromium',                      # linux
            '~/Library/Application Support/Chromium',   # mac
            '~/AppData/Local/Chromium/User Data',   # windows

            # if using normal Google Chrome
            '~/.config/chrome',
            '~/.config/google-chrome',
            '~/Library/Application Support/Google/Chrome',
            '~/AppData/Local/Google/Chrome/User Data',
            '~/.config/google-chrome-stable',

            # if using beta/canary chrome
            '~/.config/google-chrome-beta',
            '~/Library/Application Support/Google/Chrome Canary',
            '~/AppData/Local/Google/Chrome SxS/User Data',
            '~/.config/google-chrome-unstable',
            '~/.config/google-chrome-dev',
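
To see which of the paths listed above actually exists on a given machine, something like the following could be used. This is only a sketch: the function name `find_chrome_user_data_dirs` is made up for illustration, and it assumes that a usable data dir contains a `Default` profile folder, which may not hold for every setup.

```python
from pathlib import Path
from typing import List

# Candidate locations, relative to the home directory, taken from the list above.
CANDIDATE_DIRS = [
    ".config/chromium",
    "Library/Application Support/Chromium",
    "AppData/Local/Chromium/User Data",
    ".config/chrome",
    ".config/google-chrome",
    "Library/Application Support/Google/Chrome",
    "AppData/Local/Google/Chrome/User Data",
    ".config/google-chrome-stable",
    ".config/google-chrome-beta",
    "Library/Application Support/Google/Chrome Canary",
    "AppData/Local/Google/Chrome SxS/User Data",
    ".config/google-chrome-unstable",
    ".config/google-chrome-dev",
]


def find_chrome_user_data_dirs(home: Path) -> List[Path]:
    """Return candidate data dirs under `home` that contain a Default profile."""
    found = []
    for relative in CANDIDATE_DIRS:
        candidate = home / relative
        # Assumption: a real data dir has a "Default" profile folder inside it.
        if (candidate / "Default").is_dir():
            found.append(candidate)
    return found


if __name__ == "__main__":
    for path in find_chrome_user_data_dirs(Path.home()):
        print(path)
```

If more than one directory is printed, several Chrome/Chromium flavours have profiles on the machine, and the one to copy is the one belonging to the binary used for archiving.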

@adan89lion commented on GitHub (Aug 29, 2019):

Hi @pirate

It seems like the headless Chrome does not respect my cookies file placed in ~/.config/google-chrome-unstable/Default (I'm using the official docker image).

I've tried syncing my cookies file and even the entire user config folder (not just Cookies file), but headless Chrome doesn't load the cookies file I manually added. Did I miss something here?


@pirate commented on GitHub (Sep 6, 2019):

You'll need the entire user data dir, not just the Cookies file or any other individual config folders.

Can you share exactly what path you got your data dir from on the host system? And exactly where you copied the folder inside the container (e.g. the full command used like cp ... ... or rsync ... ...). You'll also have to make sure the permissions are correct for the container user to read the data.
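
One way to check the permissions point is to walk the copied data dir as the user that runs Chrome and list anything that user cannot read. The sketch below is an illustration only: `unreadable_paths` is a hypothetical helper, and it has to be run inside the container as the container user to be meaningful.

```python
import os
from pathlib import Path
from typing import List


def unreadable_paths(data_dir: Path) -> List[Path]:
    """Return files and directories under data_dir the current user cannot read."""
    problems = []
    for root, dirs, files in os.walk(data_dir):
        for name in dirs + files:
            path = Path(root) / name
            # Skip symlinks: they may dangle and would be reported misleadingly.
            if path.is_symlink():
                continue
            # Directories also need the execute bit to be traversed.
            needed = os.R_OK | os.X_OK if path.is_dir() else os.R_OK
            if not os.access(path, needed):
                problems.append(path)
    return problems


if __name__ == "__main__":
    import sys

    for problem in unreadable_paths(Path(sys.argv[1])):
        print("cannot read:", problem)
```

An empty result only rules out read permissions as the cause; it says nothing about whether Chrome accepts the profile.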


@adan89lion commented on GitHub (Sep 7, 2019):

I copied the user data from %localappdata%/Google/Chrome/User Data/Default to ~/.config/google-chrome-unstable/Default, and that didn't work. Syncing (mirroring) the full directory didn't work either.

Perhaps the Chrome version is affecting the issue? (My Windows PC is on Stable 76) I don't have GUI access to the container so I couldn't confirm whether the user data were properly copied.

Update (2019-09-16):
I found that I had missed the ArchiveBox config. The docker image loads the Chrome config from ~/.config/google-chrome/Default, but the image has google-chrome-unstable pre-installed instead of google-chrome, so it does not recognise the synced browser configuration.

However, after creating ArchiveBox.conf for both the docker image and the container's environments, ArchiveBox still cannot load the synced configurations properly. I've also tried a clean install of Lubuntu in VirtualBox and run the software, but it still misses my browser configuration.

The problem is that the cookie for the website I want to archive is Session Only, so an extension like EditThisCookie is needed to remove the Session Only tag before syncing the cookies with ArchiveBox. Since there is no GUI for the docker image, the cookie files must be modified outside the container; I also tried this configuration in Lubuntu, and it failed there too.
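
For the exported cookies.txt (the Netscape-format file that wget reads), session-only cookies can be spotted by their expiry field. The sketch below assumes the usual seven tab-separated fields with an expiry of `0` meaning "session only"; `session_only_cookies` is a made-up helper name, and the sample domains and cookie names are placeholders. It does not read Chrome's own `Cookies` file, which is stored in a different format.

```python
from typing import List, Tuple


def session_only_cookies(cookies_txt: str) -> List[Tuple[str, str]]:
    """Return (domain, name) for every cookie whose expiry field is 0."""
    result = []
    for line in cookies_txt.splitlines():
        # Some exporters mark HttpOnly cookies with this prefix; strip it.
        if line.startswith("#HttpOnly_"):
            line = line[len("#HttpOnly_"):]
        elif line.startswith("#") or not line.strip():
            continue  # comment or blank line
        fields = line.split("\t")
        if len(fields) != 7:
            continue  # not a cookie line in the assumed format
        domain, _subdomains, _path, _secure, expiry, name, _value = fields
        # Assumption: expiry "0" marks a cookie that ends with the browser session.
        if expiry == "0":
            result.append((domain, name))
    return result


if __name__ == "__main__":
    sample = (
        "# Netscape HTTP Cookie File\n"
        ".example.com\tTRUE\t/\tFALSE\t0\tage_confirmed\t1\n"
        ".example.com\tTRUE\t/\tFALSE\t1893456000\tsid\tabc\n"
    )
    print(session_only_cookies(sample))
```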


@pirate commented on GitHub (Sep 19, 2019):

Can you try putting the Chrome config dir in a docker volume, then symlinking your system Config dir to the path of the volume config dir on the host so that your host version of Chrome uses that data dir? If it works then the only thing I can think of would be permissions issues, in which case you can try chmod -R 777 ./Default as a last resort.


@adan89lion commented on GitHub (Sep 29, 2019):

I bind-mounted the ~/.config/chromium folder to /home/pptruser/.config/chromium in the docker container and set the environment variable CHROME_USER_DATA_DIR to that path. Still, nothing happens. I will follow up on a future release (v0.4) and try again.
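
The setup described in this comment can be written down as a docker invocation in which the mount target and CHROME_USER_DATA_DIR point at the same container path. The sketch below only builds the argument list: `docker_run_args` is a hypothetical helper, `IMAGE` is a placeholder for whatever image is in use, and the ArchiveBox command that would follow the image name is left out.

```python
from typing import List


def docker_run_args(host_dir: str, container_dir: str, image: str) -> List[str]:
    """Build a `docker run` argument list that mounts a Chrome data dir."""
    return [
        "docker", "run",
        # Bind-mount the host data dir into the container.
        "-v", f"{host_dir}:{container_dir}",
        # Point ArchiveBox at the same path inside the container.
        "-e", f"CHROME_USER_DATA_DIR={container_dir}",
        image,
    ]


if __name__ == "__main__":
    args = docker_run_args(
        "/home/me/.config/chromium",        # hypothetical host path
        "/home/pptruser/.config/chromium",  # container path from the comment
        "IMAGE",                            # placeholder image name
    )
    print(" ".join(args))
```

Building the list in one place makes it harder for the two paths to drift apart, which is one of the easier mistakes to make here.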


@jdkang commented on GitHub (May 21, 2020):

Was this ever resolved? I am interested in using the archivebox container on sites which require logins.


@pirate commented on GitHub (May 22, 2020):

It works for me with v0.4.3 right now. Unfortunately I was never able to reproduce this person's specific issue, but I've left it open in case more info surfaces. If you are encountering the same issue, please post screenshots, logs, commands run, etc.; anything is helpful!


@pirate commented on GitHub (Jul 15, 2020):

Going to close this for now, with the recommendation that the Chrome version outside the container match the one inside if you're using docker. The session cookies issue described is also a dealbreaker: the cookies you're relying on have to be long-lived cookies that don't expire when the browser is closed. We may figure out a way around that issue in the future, but it's not a priority right now.
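
A quick way to compare versions is to run the browser with `--version` on the host and inside the container and compare the four-part version numbers. The sketch below assumes output such as `Chromium 76.0.3809.100`; the helper names are invented for illustration, and the check covers only the version number, not architecture or release channel.

```python
import re
from typing import Optional, Tuple


def parse_chrome_version(version_output: str) -> Optional[Tuple[int, ...]]:
    """Extract the four-part version number from `--version` output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def same_chrome_build(host_output: str, container_output: str) -> bool:
    """True only if both outputs contain the same four-part version."""
    host = parse_chrome_version(host_output)
    container = parse_chrome_version(container_output)
    return host is not None and host == container


if __name__ == "__main__":
    # Hypothetical outputs from the host and from inside the container.
    print(same_chrome_build("Google Chrome 76.0.3809.100", "Google Chrome 77.0.3865.75 dev"))
```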


@terxw commented on GitHub (Mar 21, 2022):

This issue is still ongoing, see https://github.com/gladiatortoise/node-apiless-youtube-upload/issues/21 and https://github.com/puppeteer/puppeteer/issues/921

I am also experiencing this issue: chromium stays logged in to my sites even after a restart, but headless chromium in docker doesn't respect the session in the same folder.

e.g.

chromium --enable-logging=stderr --v=1 --headless --no-sandbox --disable-gpu --disable-dev-shm-usage --disable-software-rasterizer --run-all-compositor-stages-before-draw --hide-scrollbars --disable-web-security --ignore-certificate-errors --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1920,1080 --timeout=180000 --user-data-dir=/data/chrome_user_dir https://examplesite.com/members/ --print-to-pdf=test.pdf

and

chromium -user-data-dir=/data/chrome_user_dir

both pointing to the same directory.

Could be useful to try chrome-browser instead of chromium...


@pirate commented on GitHub (Mar 23, 2022):

Are you using the same Chromium version inside and outside Docker to generate that profile? It must be exactly the same version, architecture, release type, etc. for it to work. @terxw
You can try setting CHROME_HEADLESS=False and checking the GUI that pops up to make sure it's using it correctly.

Reference
starred/ArchiveBox#1688