mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #1125] Archivebox stopped saving DOM, screenshot and PDF with v111 update #706
Originally created by @Giger22 on GitHub (Mar 21, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1125
Archivebox fails to save DOM, screenshot and PDF.
Steps to reproduce
For example:
I get the same error in the normal version.
Command '['chromium', '--headless', '--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)', '--window-size=1440,2000', '--timeout=180000', '--screenshot', 'https://www.dailymail.co.uk/sport/formulaone/article-11883195/Mercedes-chief-Toto-Wolff-bemoans-Red-Bulls-early-season-dominance-not-great-show.html']' timed out after 180 seconds
And just these errors in the dev version.
Failed to save DOM
Failed to save screenshot
Failed to save PDF
Screenshots or log output
ArchiveBox version
ArchiveBoxDev
@pirate commented on GitHub (Mar 21, 2023):
Chrome unfortunately changed the behavior of a bunch of their CLI options between versions with little warning, so we're scrambling to cover all the edge cases still. Thanks for your patience 😅
Can you try running
chromium --headless=new --screenshot 'https://example.com'
in a terminal and posting the output?
@Giger22 commented on GitHub (Mar 22, 2023):
@adamcecc commented on GitHub (Apr 7, 2023):
I had the same issue. I removed the dead symlinks for /home/username/.config/google-chrome/SingletonCookie and /home/username/.config/google-chrome/SingletonLock and it works again.
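The cleanup described above can be scripted. This is a minimal sketch, not something ArchiveBox ships: the function name `remove_dead_singleton_links` is hypothetical, and the two file names are taken from the comment above. It assumes no Chrome/Chromium process is currently using the profile, since these links are how a running instance marks the profile as in use.

```python
import os
from pathlib import Path

# File names taken from the comment above (assumption: these are the only
# stale links that need clearing).
SINGLETON_NAMES = ("SingletonCookie", "SingletonLock")


def remove_dead_singleton_links(profile_dir):
    """Delete Singleton* symlinks in profile_dir whose targets no longer exist.

    Returns the names of the links that were removed.
    """
    removed = []
    for name in SINGLETON_NAMES:
        link = Path(profile_dir) / name
        # is_symlink() is True even for a dangling link; exists() follows the
        # link, so it is False when the target is gone.
        if link.is_symlink() and not link.exists():
            link.unlink()
            removed.append(name)
    return removed


if __name__ == "__main__":
    # Hypothetical default location, matching the paths in the comment above.
    profile = os.path.expanduser("~/.config/google-chrome")
    print(remove_dead_singleton_links(profile))
```

Only run this while the browser is stopped; removing the links under a live instance lets a second instance open the same profile.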
@Dontkickmi22 commented on GitHub (Apr 11, 2023):
I did that and it worked. But I got the same error again today, and had to do it again to clear it. I wonder why.
@mwnoo commented on GitHub (Apr 11, 2023):
Using dev version 0.6.3 with the headless=new command successfully archives pdf, screenshots and DOM. The only thing is that with chromium v111 and v112 the --user-data-dir information (accepted cookies, etc.) is not used anymore. New screenshots and pdf's still have the cookie banners inside although I accepted them in the browser and copied the profile to my archivebox folders. This worked fine with previous versions of chromium (e.g. v101).
Also from the command line Chromium does not use the cookie information in --user-data-dir.
Any ideas to get the new chromium headless to use the profile data (user-data-dir) and make screenshots without the cookie banners?
@pirate commented on GitHub (Apr 14, 2023):
Argh that's frustrating, two major breaking changes to chromium's CLI without much care toward backwards compatibility on their part 😓
I'll take a look but probably going to focus on the ongoing playwright/browsertrix-crawler integration before trying to fix this.
@turian commented on GitHub (May 3, 2023):
How do I run that command if I'm running archivebox from within docker?
@pirate commented on GitHub (May 3, 2023):
docker-compose run archivebox chromium --headless=new --screenshot 'https://example.com/'
@turian commented on GitHub (May 3, 2023):
Ah thanks. I also found:
The docker compose command doesn't work for me because I'm on a system where I just run from dockerhub (I assume dev is the most stable recent tag?). I'm back to using archivebox again, and am trying to find a docker image that has recent fixes but is relatively stable.
Dropping into bash using that command, I get:
BTW, this is my ArchiveBox.conf, are there current best defaults that I should update?
@pirate commented on GitHub (May 4, 2023):
Sorry, try this:
Then open ./data/screenshot.png to make sure it succeeded.
@turian commented on GitHub (May 4, 2023):
docker-compose doesn't work for me because I'm just using the latest tag on docker. But if you think I should switch, I will.
screenshot.png doesn't come through as per the following:
@turian commented on GitHub (May 4, 2023):
What I'm looking for is a "relatively stable, relatively recent" docker tag so I can start crawling again.
@pirate commented on GitHub (May 4, 2023):
archivebox/archivebox:dev is definitely the most recent/stable tag.
I'm unable to replicate this
[0504/103129.626416:ERROR:chrome_main.cc(164)] Multiple targets are not supported in headless mode.
message on my side, so it's tricky to debug :/
docker run -v $PWD/archivebox:/data archivebox/archivebox:dev bash
@mrled commented on GitHub (May 7, 2023):
I think "multiple targets are not supported in headless mode" is because Chromium thinks that you're trying to navigate to two URLs: https://example.com/ and screenshot.png. You probably want
/usr/bin/chromium --headless=new --screenshot --no-sandbox --no-first-run --disable-sync --disable-gpu 'https://example.com/'
without screenshot.png at the end. (Should work the same whether the container is started from plain Docker or from docker-compose.)
@rpcope1 commented on GitHub (Jun 8, 2023):
I don't know if this helps or not, but I ran into the "Multiple targets are not supported in headless mode." and narrowed it down to Chromium interpreting the "--user-agent" argument as a URL instead of the way you would expect it should. This was with Chromium 114 on Debian 11.7. It probably has to do with Chromium no longer taking a user-agent argument. I ended up patching around it with this hack:
But the archiver still times out (maybe because this same LXC container doesn't have a GPU exposed). It sounds like the thing to do for now maybe is to find a version of Chromium before 111 and use that instead.
@dcalano commented on GitHub (Jul 1, 2023):
Looking at my error logs, the dom/screenshot calls that fail for me seem to fail to interpolate the {VERSION} variable. Interestingly, the wget capture resolved it fine to 0.6.2 and completed successfully, but in the logged cmd string it remains as {VERSION} for dom/screenshot when the crawl command is called.
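One way to confirm this kind of problem is to scan the argument list for leftover placeholders before launching the browser. The following is only an illustrative sketch: `unresolved_placeholders`, the shortened user-agent template, and the example command are hypothetical and are not ArchiveBox's actual config code. The version string 0.6.2 is the one reported in the comment above.

```python
import re

# Matches {NAME}-style placeholders such as {VERSION}.
PLACEHOLDER = re.compile(r"\{[A-Z_]+\}")


def unresolved_placeholders(cmd):
    """Return the arguments in cmd that still contain a {NAME} placeholder."""
    return [arg for arg in cmd if PLACEHOLDER.search(arg)]


# Hypothetical, shortened version of the user-agent argument seen in the logs.
template = "--user-agent=ArchiveBox/{VERSION} (+https://github.com/ArchiveBox/ArchiveBox/)"
raw_cmd = ["chromium", "--headless", template, "--screenshot", "https://example.com"]

# Fill the placeholder only in arguments that actually contain one, so that
# URLs or other arguments with literal braces are left untouched.
fixed_cmd = [
    arg.format(VERSION="0.6.2") if PLACEHOLDER.search(arg) else arg
    for arg in raw_cmd
]

if __name__ == "__main__":
    print(unresolved_placeholders(raw_cmd))
    print(unresolved_placeholders(fixed_cmd))
```

A check like this only detects the symptom; it does not explain why the wget extractor received the formatted value and the Chrome-based extractors did not.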
@sclu1034 commented on GitHub (Jul 6, 2023):
I was seeing the same issues, including the unresolved {VERSION} fields in the error logs.
Removing the broken symlinks mentioned by @adamcecc fixed it for now.
@sclu1034 commented on GitHub (Jul 11, 2023):
That didn't actually help all that much. It breaks again as soon as there are multiple processes adding URLs running at the same time, and then remains broken until those files are removed again.
@pirate commented on GitHub (Aug 16, 2023):
afaik there is no way to run multiple chrome instances with a single profile, Chrome does not support it. the best we could do is clone the profile directory and create a few temporary copies and use those, or migrate to an event sourcing model with a single playwright-based chrome worker that handles all the jobs as separate tabs in a single chrome instance.
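The first option, cloning the profile into temporary copies, could look roughly like this. It is a sketch under assumptions, not the approach ArchiveBox adopted: `clone_profile` is a hypothetical helper, and skipping `Singleton*` entries assumes those are the only per-instance lock files that must not be carried over.

```python
import shutil
import tempfile
from pathlib import Path


def clone_profile(profile_dir, prefix="chrome-profile-"):
    """Copy profile_dir into a fresh temporary directory and return the copy's path."""
    tmp_root = Path(tempfile.mkdtemp(prefix=prefix))
    dest = tmp_root / "profile"
    shutil.copytree(
        profile_dir,
        dest,
        symlinks=True,  # copy links as links instead of following them
        ignore=shutil.ignore_patterns("Singleton*"),  # skip per-instance lock files
    )
    return dest


def discard_profile(clone_path):
    """Remove a clone created by clone_profile, including its temporary parent."""
    shutil.rmtree(Path(clone_path).parent, ignore_errors=True)
```

Each worker would then pass its own copy via `--user-data-dir` and call `discard_profile` afterwards. Cookies or logins created inside a copy are lost when it is discarded, and copying a large profile for every job may be slow.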
@pirate commented on GitHub (Oct 20, 2023):
I made major changes to the Dockerfile last night and bumped all the dependency versions so it should be on the latest Chrome v119 now. Not all the cross-platform builds are on Docker Hub yet, but you can try it by pulling and running
docker build . -t archivebox-dev; docker run -it -v $PWD:/data archivebox-dev ...
@pirate commented on GitHub (Nov 9, 2023):
I think this should be fixed now on dev. Please pull the latest image/pip package and try again:
docker pull archivebox/archivebox:dev
https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch
Comment back here if any of you are still encountering issues and I'll reopen the ticket.
@unlostify commented on GitHub (Feb 9, 2024):
@pirate I'm having this same problem with 0.7.0-0.7.3.
For Dom, PDF and Screenshot the logs show: Extractor timed out after 900s.
There also appears to be a problem with SingleFile – the logs show: SingleFile was not able to archive the page
As you can see, I set the timeout to 900 seconds, but that didn't help.
For context, ArchiveBox is running in Docker Compose. This instance is several years old and includes ~8000 saved pages. I've run archivebox init to ensure the necessary migrations have run.
Through v0.6.3 I never had any problems, but it seems like this started after I upgraded to 0.7.0 at some point late last year, and persisted when I later upgraded to 0.7.3. This has been an issue ever since then, but it is only now that I'm getting around to troubleshooting this.
Today I rolled back to a backup from Oct 2023, running on 0.7.0. It had this issue. I ran the necessary migrations via archivebox init, updated to 0.7.1, ran the migrations again, and then did the same for 0.7.2 and 0.7.3.
This is the result of archivebox version now that I'm back on 0.7.3
Thank you kindly for the work you do on this wonderful project!
@pirate commented on GitHub (Feb 10, 2024):
Thanks for the info and the version output. I've experienced this intermittently with chrome sometimes but it usually went away on its own. I think it's caused by chrome not exiting correctly after a job finishes, it just hangs indefinitely (singlefile uses chrome too)
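If the hang is Chrome failing to exit, a caller can at least make sure nothing is left behind once the timeout fires. The sketch below is an assumption-laden illustration rather than ArchiveBox's extractor code: `run_with_hard_timeout` is a hypothetical name, and the process-group approach is POSIX-only.

```python
import os
import signal
import subprocess


def run_with_hard_timeout(cmd, timeout):
    """Run cmd and kill its whole process group if it exceeds timeout seconds.

    Returns the exit code, or None if the command had to be killed.
    """
    # start_new_session=True puts the child in its own session and process
    # group, so helper processes it spawns can be signalled together.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited between the timeout and the kill
        proc.wait()  # reap the child so it does not linger as a zombie
        return None
```

This does not fix the underlying hang; it only keeps a stuck browser from holding the profile lock or piling up across jobs.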
I'll take a deeper look next week!
@unlostify commented on GitHub (Feb 10, 2024):
Thanks @pirate ! Let me know if there's anything else I can provide (logs etc), or do, that would help. Happy to do anything I can to help diagnose the issue =)
@pirate commented on GitHub (Feb 29, 2024):
@unlostify you may want to subscribe to this issue as well, I have a more in-depth comment trying to figure out the underlying cause here: https://github.com/cypress-io/cypress/issues/27264#issuecomment-1972167140
@unlostify commented on GitHub (Mar 1, 2024):
Thanks so much! Will do =)
In the other issue you mentioned it's hard to reproduce, but the 'good' thing is that this bug happens 100% of the time in my instance. I'm running in Docker, and the issue always appears, even if I regenerate the container. Since the /data directory is all that persists, it's presumably something in there that's the problem. So perhaps it would be helpful if I provided you a copy of my /data directory?
It's currently ~8GB. However, if you think it'd be useful for troubleshooting, I can prune it down as much as possible by duplicating it and removing all of the sites I've archived.
If you'd like me to do that, just let me know and I'll put it on my todo list.
@pirate commented on GitHub (Mar 1, 2024):
It's probably not something in your /data directory actually. I think it's more likely to be correlated with the chrome version in the docker container combined with your CPU architecture, host kernel, core count/threading support, docker storage driver, underlying host filesystem, network conditions, etc. (which is why I've gradually added all these things to the archivebox version output)
x86 in Docker appears to hit this issue much more than arm64 for example. (I personally run arm64 on macOS, where it almost never happens, which is partly why it's been hard for me to debug without running test cloud servers all the time)
@unlostify commented on GitHub (Mar 1, 2024):
Whoops! That makes way more sense.
For what it's worth, I am indeed running on x86 (also on macOS).
@pirate commented on GitHub (Mar 1, 2024):
Hah of course not 30 seconds after posting this I tried again just for fun and managed to reproduce this on arm64 on macOS!
I didn't even add any of our normal ArchiveBox args, it hung immediately on the first try with only --headless=new and --screenshot!
This dispelled the last of my doubts, this is 100% an upstream Chromium bug and has nothing to do with ArchiveBox. I just opened an upstream bug report on the Chromium bug tracker: https://issues.chromium.org/issues/327583144
@unlostify commented on GitHub (Mar 1, 2024):
You're the best @pirate! Thanks for looking into this, and for all of your hard work on this fantastic project =)