mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #254] FR: Use custom cookies for PDF and screenshot generation #1688
Originally created by @adan89lion on GitHub (Aug 10, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/254
Type
What is the problem that your feature request solves
I am archiving some websites that show an 18+ confirmation page (mildly adult content) to verify your age before entering. I exported a cookies.txt file and linked it in the configuration, and the Local Archive passed the confirmation successfully. However, I noticed that the other archive types (e.g. HTML, PDF and screenshot) do not use the cookie file from my configuration, so all they captured was the 18+ confirmation screen.
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
FYI this is the web page I tried to archive (along with other posts under this sub-forum). I hope the archiving process can use the custom cookie file I provided in the configuration, so that the PDF, HTML and screenshot archives render correctly.
What hacks or alternative solutions have you tried to solve the problem?
I've looked into the index.json file for each archive and found that it would be nice to include a cookies flag in the headless chrome/chromium command. Only wget has been given my cookie file.
How badly do you want this new feature?
P.S. I don't have coding experience, so please excuse my lack of IT knowledge.
@pirate commented on GitHub (Aug 11, 2019):
The trick is that the Chrome data dir used for archiving needs to come from a Chrome instance that's logged into the site. Try opening chromium-browser, or whatever binary is the same Chrome instance you're using for archiving, logging into the site, and then running archivebox. If you're doing it on a remote server you'll need to rsync your Chrome data dir to the server.
You can find it at one of these paths depending on what OS you're on and what Chrome version you're using:
@adan89lion commented on GitHub (Aug 29, 2019):
Hi @pirate
It seems like headless Chrome does not respect my cookies file placed in ~/.config/google-chrome-unstable/Default (I'm using the official docker image).
I've tried syncing my cookies file and even the entire user config folder (not just the Cookies file), but headless Chrome doesn't load the cookies file I manually added. Did I miss something here?
@pirate commented on GitHub (Sep 6, 2019):
You'll need the entire user data dir, not just the Cookies file or any other individual config folders.
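A minimal sketch of mirroring the whole user data dir rather than individual files, as described above. The helper name `copy_user_data_dir` is hypothetical and not part of ArchiveBox; it relies only on Python's standard `shutil.copytree`, and it assumes Chrome is not running while the copy happens.

```python
import shutil
from pathlib import Path


def copy_user_data_dir(src: str, dst: str) -> Path:
    """Copy an entire Chrome user data dir (hypothetical helper)."""
    src_path = Path(src).expanduser()
    dst_path = Path(dst).expanduser()
    if not src_path.is_dir():
        raise FileNotFoundError(f"user data dir not found: {src_path}")
    # Copy the full tree, not just the Cookies file. dirs_exist_ok
    # (Python 3.8+) lets the copy merge into an existing destination.
    shutil.copytree(src_path, dst_path, dirs_exist_ok=True)
    return dst_path
```

The source and destination are whatever paths apply to your own host and container; none are assumed here.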
Can you share exactly what path you got your data dir from on the host system? And exactly where you copied the folder inside the container (e.g. the full command used, like cp ... ... or rsync ... ...). You'll also have to make sure the permissions are correct for the container user to read the data.
@adan89lion commented on GitHub (Sep 7, 2019):
I copied the user data from %localappdata%/Google/Chrome/User Data/Default to ~/.config/google-chrome-unstable/Default, but even syncing (mirroring) the full directory didn't work.
Perhaps the Chrome version is affecting the issue? (My Windows PC is on Stable 76.) I don't have GUI access to the container, so I couldn't confirm whether the user data was copied properly.
Update (2019-09-16):
I found that I had missed the ArchiveBox config. The docker image loads the Chrome config from ~/.config/google-chrome/Default, but the image comes with google-chrome-unstable pre-installed instead of google-chrome, so it does not recognise the synced browser configuration.
However, after creating ArchiveBox.conf for both the docker image and the container's environments, ArchiveBox still cannot load the synced configuration properly. I've also tried a clean install of Lubuntu in VirtualBox and ran the software there, but it still misses my browser configuration.
The problem is that the cookie for the website I want to archive is Session Only, so I need an extension like EditThisCookie to remove the Session Only tag before syncing the cookies with ArchiveBox. Since there is no GUI for the docker image, the cookie files must be modified outside the container; I also tried this configuration in Lubuntu, and it failed there too.
@pirate commented on GitHub (Sep 19, 2019):
Can you try putting the Chrome config dir in a docker volume, then symlinking your system config dir to the path of the volume config dir on the host so that your host version of Chrome uses that data dir? If it works then the only thing I can think of would be permissions issues, in which case you can try chmod -R 777 ./Default as a last resort.
@adan89lion commented on GitHub (Sep 29, 2019):
I bound the ~/.config/chromium folder to /home/pptruser/.config/chromium in the docker container and set the CHROME_USER_DATA_DIR environment variable to it. Still, nothing happens. I will follow up on a future release (v0.4) and try again.
@jdkang commented on GitHub (May 21, 2020):
Was this ever resolved? I am interested in using the archivebox container on sites which require logins.
@pirate commented on GitHub (May 22, 2020):
It works for me with v0.4.3 right now. Unfortunately I was never able to reproduce this person's specific issue, but I've left it open in case more info surfaces. If you are encountering the same issue please post screenshots, logs, commands run, etc. Anything is helpful!
@pirate commented on GitHub (Jul 15, 2020):
Going to close this for now, and recommend that the Chrome version outside the container must match the one inside if you're using docker. The session cookies issue described is also a dealbreaker, the cookies you're relying on have to be long-lived cookies that don't expire when the browser is closed. We may figure out a way around that issue in the future, but it's not a priority right now.
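To check whether an exported cookies.txt relies on session-only cookies, something like the following sketch may help. It assumes the file is in the Netscape cookie format (seven tab-separated fields per line) and that an expiry of 0 marks a session cookie, which is the convention curl and wget follow; the names `Cookie`, `parse_cookies_txt` and `session_only` are hypothetical.

```python
from typing import List, NamedTuple


class Cookie(NamedTuple):
    domain: str
    name: str
    expires: int  # Unix timestamp; 0 is treated as "session only"


def parse_cookies_txt(text: str) -> List[Cookie]:
    """Parse Netscape-format cookies.txt content (assumed format)."""
    cookies = []
    for line in text.splitlines():
        if line.startswith("#HttpOnly_"):
            # curl marks HttpOnly cookies with this prefix; keep the cookie.
            line = line[len("#HttpOnly_"):]
        elif not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split("\t")
        if len(fields) != 7 or not fields[4].isdigit():
            continue  # not a well-formed cookie line
        domain, _subdomains, _path, _secure, expires, name, _value = fields
        cookies.append(Cookie(domain, name, int(expires)))
    return cookies


def session_only(cookies: List[Cookie]) -> List[Cookie]:
    """Return the cookies that would vanish when the browser closes."""
    return [c for c in cookies if c.expires == 0]
```

Any cookie this reports would need a real expiry date before it can survive being reused by a separate browser instance.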
@terxw commented on GitHub (Mar 21, 2022):
This issue is still ongoing, see https://github.com/gladiatortoise/node-apiless-youtube-upload/issues/21 and https://github.com/puppeteer/puppeteer/issues/921
I am also experiencing this issue. Chromium stays logged in to my sites even after a restart, but headless Chromium in docker doesn't respect the session in the same folder
e.g.
and
both pointing to the same directory.
It could be useful to try chrome-browser instead of chromium...
@pirate commented on GitHub (Mar 23, 2022):
Are you using the same Chromium version inside and outside Docker to generate that profile? It must be exactly the same version, architecture, release type, etc. for it to work. @terxw
You can try setting CHROME_HEADLESS=False and checking the GUI that pops up to make sure it's using it correctly.
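A rough sketch of the version check suggested above, assuming you have captured each browser's version string, for example from running the binary with `--version` on Linux. The function names are hypothetical, and the check compares only the four-part version number, so architecture and release type still have to be verified by hand.

```python
import re
from typing import Optional, Tuple


def parse_chrome_version(output: str) -> Optional[Tuple[int, ...]]:
    """Extract a four-part version such as 99.0.4844.51 from a string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def versions_match(host_output: str, container_output: str) -> bool:
    """True only if both strings contain the same full version number."""
    host = parse_chrome_version(host_output)
    container = parse_chrome_version(container_output)
    return host is not None and host == container
```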