starred/ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-04-26 01:26:00 +03:00

[GH-ISSUE #204] Archive Method: CHROME_USER_DATA_DIR is being ignored, authenticated site archiving fails #139

Closed

opened 2026-03-01 14:40:58 +03:00 by kerem · 3 comments

kerem commented

2026-03-01 14:40:58 +03:00

Owner

Originally created by @jamelait on GitHub (Apr 1, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/204

Hi,
There is a website that requires me to be logged in that I want to archive.

I have logged in using chromium-browser (launched from bash) and I made sure to set the env (with export command) for:
CHROME_BINARY is set to value returned by the command "which chromium-browser"
CHROME_USER_DATA_DIR is set to value that I saw in "chrome://version"

I then do

echo 'https://members.website.com/my-account' | ./archive

The command runs fine, but it looks like my credentials are not used to access the website (the archived HTML shows a message saying that I should be logged in).

What could be the issue?

chromium-browser --version: Chromium 73.0.3683.75 Built on Ubuntu, running on Ubuntu 18.04

kerem closed this issue and added the "status: needs followup" and "size: easy" labels (2026-03-01 14:40:58 +03:00)

kerem commented

2026-03-01 14:41:00 +03:00

Author

Owner

@pirate commented on GitHub (Apr 2, 2019):

Huh, strange. Try setting CHROME_HEADLESS=False so you can watch the browser UI as it archives; it may reveal the problem.

You can also try running the chrome command manually like this (try it both with and without --headless, and replace the user-data-dir path with your correct one if it differs):

chromium-browser --headless --timeout=60000 --user-data-dir="$HOME/.config/chromium" --window-size=1440,2000 --screenshot http://example.com
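
To compare the two runs suggested above, it can help to print both command lines first and then run them one at a time. The following is only a sketch: `print_chrome_checks` is a hypothetical helper, not part of ArchiveBox, and it reuses only the flags already shown in this thread. It assumes the path and URL contain no spaces, since the output is meant for reading and copying, not for `eval`.

```shell
# Hypothetical helper (not part of ArchiveBox): prints, without running, the
# manual check command from this thread in both headless and headed form.
# $1 = user data dir to test, $2 = URL (defaults to http://example.com).
# Falls back to "chromium-browser" when CHROME_BINARY is unset.
print_chrome_checks() {
    for mode_flag in "--headless " ""; do
        printf '%s %s--timeout=60000 --user-data-dir=%s --window-size=1440,2000 --screenshot %s\n' \
            "${CHROME_BINARY:-chromium-browser}" "$mode_flag" "$1" "${2:-http://example.com}"
    done
}
```

Calling `print_chrome_checks "$HOME/.config/chromium"` prints two lines that differ only in the `--headless` flag, which makes it easy to check that both runs use the same user data directory.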

kerem commented

2026-03-01 14:41:00 +03:00

Author

Owner

@jamelait commented on GitHub (Apr 3, 2019):

Setting CHROME_HEADLESS=False did reveal the problem: it was using the wrong user data directory.
I was setting the env variable like this: CHROME_USER_DATA_DIR=/home/jamel/.config/chromium/Default
But it seems that in that configuration, chromium creates a new profile directory so the new user-data-dir becomes /home/jamel/.config/chromium/Default/Default.
So of course I wasn't logged in.

Problem solved!

I did notice a weird thing: it seems that /ArchiveBox/output/archive/.../members.website.com/my-account/index.html was not fetched with a logged-in profile, but /ArchiveBox/output/archive/.../output.html was.
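
The mistake described above (pointing CHROME_USER_DATA_DIR at the profile folder instead of its parent) can be guarded against with a small check. This is a minimal sketch, not ArchiveBox code: `user_data_dir_for` is a hypothetical helper, and it assumes the profile folder is named `Default`, as in this thread.

```shell
# Hypothetical helper (not part of ArchiveBox): if the given path ends in a
# profile folder named "Default", print its parent directory; otherwise
# print the path unchanged. A single trailing slash is ignored.
user_data_dir_for() {
    case "${1%/}" in
        */Default) dirname "${1%/}" ;;
        *)         printf '%s\n' "${1%/}" ;;
    esac
}

user_data_dir_for /home/jamel/.config/chromium/Default
# prints /home/jamel/.config/chromium
```

The printed value is the one to export as CHROME_USER_DATA_DIR, so that Chromium finds the existing `Default` profile inside it and does not create a nested `Default/Default`.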

kerem commented

2026-03-01 14:41:00 +03:00

Author

Owner

@pirate commented on GitHub (Apr 3, 2019):

Ah yes, this problem is common; I’ve made the same mistake of pointing at the Default directory before too.

The index.html output is generated using Wget, not Chrome. To make that one logged in as well, you have to pass a COOKIES_FILE=... parameter.
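
Wget reads cookies from a text file in the Netscape cookies.txt format (its `--load-cookies` option), where each cookie is one line of seven tab-separated fields. Before pointing COOKIES_FILE at an exported file, a quick sanity check can confirm that it contains cookie entries at all. The sketch below is an illustration only: `count_cookie_entries` is a hypothetical helper, and the file path in the usage comment is an assumption.

```shell
# Hypothetical helper (not part of ArchiveBox or Wget): counts lines that look
# like Netscape-format cookie entries, i.e. exactly 7 tab-separated fields,
# ignoring lines that start with "#". Prints 0 when none are found.
count_cookie_entries() {
    awk -F '\t' 'NF == 7 && $0 !~ /^#/ { n++ } END { print n + 0 }' "$1"
}

# Usage (the path is an assumption; export the file from the logged-in browser first):
#   count_cookie_entries "$HOME/cookies.txt"
```

A count of 0 suggests the file is empty or not in the expected format, in which case the Wget-generated index.html would still be fetched without a login.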

kerem referenced this issue

2026-03-01 17:52:09 +03:00

[GH-ISSUE #139] AttributeError: 'NoneType' object has no attribute 'replace' #1604

kerem referenced this issue

2026-03-14 21:08:39 +03:00

[GH-ISSUE #139] AttributeError: 'NoneType' object has no attribute 'replace' #3115
