[GH-ISSUE #204] Archive Method: CHROME_USER_DATA_DIR is being ignored, authenticated site archiving fails #139

Closed
opened 2026-03-01 14:40:58 +03:00 by kerem · 3 comments

Originally created by @jamelait on GitHub (Apr 1, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/204

Hi,
There is a website that requires me to be logged in that I want to archive.

I logged in using chromium-browser (launched from bash) and made sure to set these environment variables (with the export command):
CHROME_BINARY is set to the value returned by the command "which chromium-browser"
CHROME_USER_DATA_DIR is set to the value that I saw in "chrome://version"

I then do

echo 'https://members.website.com/my-account' | ./archive

The command runs fine, but it looks like my credentials are not used to access the website (the archived HTML shows a message saying that I should be logged in).

What could be the issue?

chromium-browser --version: Chromium 73.0.3683.75 Built on Ubuntu, running on Ubuntu 18.04


@pirate commented on GitHub (Apr 2, 2019):

Huh, strange. Try setting CHROME_HEADLESS=False so you can watch the browser UI as it archives; it may reveal the problem.

You can also try running the chrome command manually like this (try it both with and without --headless, and replace the user-data-dir path with the correct one if yours differs):

chromium-browser --headless --timeout=60000 --user-data-dir="$HOME/.config/chromium" --window-size=1440,2000 --screenshot http://example.com

@jamelait commented on GitHub (Apr 3, 2019):

Setting CHROME_HEADLESS=False did reveal the problem: it was using the wrong user data directory.
I was setting the env variable like this: CHROME_USER_DATA_DIR=/home/jamel/.config/chromium/Default
But it seems that in that configuration, Chromium creates a new profile directory inside it, so the profile it actually uses ends up at /home/jamel/.config/chromium/Default/Default.
So of course I wasn't logged in.

Problem solved!
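
For anyone hitting the same thing: chrome://version shows the "Profile Path", which ends in the profile directory (Default here), while --user-data-dir expects its parent. A minimal sketch of deriving one from the other follows. The function name is hypothetical and not part of ArchiveBox, and it assumes profile directories are named "Default" or "Profile N".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ArchiveBox): given the "Profile Path" from
# chrome://version, print the parent directory to use as CHROME_USER_DATA_DIR.
user_data_dir_from_profile_path() {
    local path="${1%/}"          # drop one trailing slash, if any
    case "${path##*/}" in
        Default|"Profile "*)     # last component looks like a profile directory
            printf '%s\n' "${path%/*}" ;;
        *)                       # otherwise assume it is already a user data dir
            printf '%s\n' "$path" ;;
    esac
}

# Example using the path from this thread.
CHROME_USER_DATA_DIR="$(user_data_dir_from_profile_path /home/jamel/.config/chromium/Default)"
export CHROME_USER_DATA_DIR
echo "$CHROME_USER_DATA_DIR"
```

With the variable set to the parent directory, Chromium should pick up the existing Default profile instead of creating a nested one.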

I did notice a weird thing: it seems that /ArchiveBox/output/archive/.../members.website.com/my-account/index.html was not fetched with a logged-in profile, but /ArchiveBox/output/archive/.../output.html was.


@pirate commented on GitHub (Apr 3, 2019):

Ah yes, this problem is common. I’ve made the same mistake of using the Default directory before too.

The index.html output is generated using Wget, not Chrome. To make that one logged in as well, you have to pass a COOKIES_FILE=... parameter.
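
Wget reads cookies in the Netscape cookies.txt format (one cookie per line, seven tab-separated fields), so a quick sanity check on the file before pointing COOKIES_FILE at it may save a confusing run. The sketch below is only an illustration: the function name is hypothetical, and lines starting with # are simply treated as comments.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ArchiveBox or Wget): succeed if FILE is
# readable and has at least one non-comment line with 7 tab-separated fields,
# which is the layout of a Netscape-format cookies.txt entry.
has_netscape_cookies() {
    [ -r "$1" ] || return 1
    awk -F '\t' '!/^#/ && NF == 7 { found = 1 } END { exit !found }' "$1"
}
```

If the check passes for a cookies file exported from the logged-in browser, exporting COOKIES_FILE with that file's path before running ./archive should let the Wget-based output use the session as well.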
