mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #997] Chromium Headless is Detected by Cloudflare (Error 1010) even with Custom User Agent #3642
Originally created by @Morpheus0x on GitHub (Jul 12, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/997
Describe the bug
Unable to make a snapshot of a website that uses Cloudflare. The snapshot contains only this message:
Error 1010: The owner of this website has banned your access based on your browser's signature.
This occurs even with a custom user agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36"
Also, I have set the following chromium option:
--disable-blink-features=AutomationControlled
but the browser is still detected by Cloudflare. I even set CHROME_HEADLESS to False, but it still doesn't work.
Searching for a way to make headless Chrome undetectable, I found several promising resources:
https://blog.m157q.tw/posts/2020/09/11/bypass-cloudflare-detection-while-using-selenium-with-chromedriver/
https://intoli.com/blog/making-chrome-headless-undetectable/
https://github.com/ultrafunkamsterdam/undetected-chromedriver
https://stackoverflow.com/questions/65760004/making-chrome-headless-undetectable-in-python
https://stackoverflow.com/questions/65994908/selenium-cant-open-a-second-page/65998533#65998533
Most of these require WebDriver, Selenium, or something similar. As far as I can tell, ArchiveBox only uses the chromium executable without any scraper wrapper. For that setup, the best option I found would be to inject JavaScript into every scraped page, as explained here. That approach uses mitmproxy, which isn't ideal.
I would propose switching to Selenium in order to take advantage of selenium-stealth.
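For reference, the setup described above corresponds roughly to the following Chromium invocation. This is a minimal sketch, not ArchiveBox's actual extractor code: the `build_chromium_command` helper and the `chromium` binary name are assumptions made for illustration, and only the command-line switches themselves are standard Chromium flags.

```python
import shlex


def build_chromium_command(url, binary="chromium", user_agent=None, extra_args=()):
    """Build a headless Chromium command that dumps the rendered DOM of `url`.

    Hypothetical helper for illustration; `binary` is an assumed executable name.
    """
    cmd = [binary, "--headless"]
    if user_agent:
        # Override the User-Agent header and navigator.userAgent.
        cmd.append(f"--user-agent={user_agent}")
    # Additional switches, e.g. the AutomationControlled one from this report.
    cmd.extend(extra_args)
    # --dump-dom prints the serialized DOM to stdout once the page has loaded.
    cmd += ["--dump-dom", url]
    return cmd


if __name__ == "__main__":
    command = build_chromium_command(
        "https://example.com",
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36"
        ),
        extra_args=["--disable-blink-features=AutomationControlled"],
    )
    # Print a copy-pasteable shell command instead of running the browser.
    print(shlex.join(command))
```

As reported above, these flags alone were not enough to avoid the Error 1010 page.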
Steps to reproduce
Archive any Quora or Discord link.
Look at the resulting snapshot; it should contain only the Cloudflare Error 1010 message shown above.
Screenshots or log output
ArchiveBox version
ArchiveBox v0.6.3
Cpython Linux Linux-5.10.0-11-amd64-x86_64-with-glibc2.31 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=sonic
[i] Dependency versions:
√ ARCHIVEBOX_BINARY v0.6.3 valid /usr/local/bin/archivebox
√ PYTHON_BINARY v3.10.4 valid /usr/local/bin/python3.10
√ DJANGO_BINARY v3.1.14 valid /usr/local/lib/python3.10/site-packages/django/bin/django-admin.py
√ CURL_BINARY v7.74.0 valid /usr/bin/curl
√ WGET_BINARY v1.21 valid /usr/bin/wget
√ NODE_BINARY v17.9.0 valid /usr/bin/node
√ SINGLEFILE_BINARY v0.3.16 valid /node/node_modules/single-file/cli/single-file
√ READABILITY_BINARY v0.0.2 valid /node/node_modules/readability-extractor/readability-extractor
√ MERCURY_BINARY v1.0.0 valid /node/node_modules/@postlight/mercury-parser/cli.js
√ GIT_BINARY v2.30.2 valid /usr/bin/git
√ YOUTUBEDL_BINARY v2022.04.08 valid /usr/local/bin/yt-dlp
√ CHROME_BINARY v101.0.4951.41 valid /usr/bin/chromium
√ RIPGREP_BINARY v12.1.1 valid /usr/bin/rg
[i] Source-code locations:
√ PACKAGE_DIR 24 files valid /app/archivebox
√ TEMPLATES_DIR 4 files valid /app/archivebox/templates
[i] Secrets locations:
[i] Data locations:
√ OUTPUT_DIR 5 files valid /data
√ SOURCES_DIR 20 files valid ./sources
√ LOGS_DIR 1 files valid ./logs
√ ARCHIVE_DIR 49 files valid ./archive
√ CONFIG_FILE 81.0 Bytes valid ./ArchiveBox.conf
√ SQL_INDEX 660.0 KB valid ./index.sqlite3
@pirate commented on GitHub (Jul 12, 2022):
Definitely not going to switch to Selenium; we're partway through a refactor to Pyppeteer. This is a hard problem in general, and it will always be a cat-and-mouse game with providers like Cloudflare trying to block bots.
@derRichter commented on GitHub (Nov 8, 2023):
Same problem here.
With headless mode off and no user agent set, it works. (In that case the browser reports a normal user agent such as
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36.)
With headless mode off and that same real user agent set explicitly in the Chrome driver options, Cloudflare blocks!
What is happening here? What is the difference between setting the user agent manually and leaving the option unset, when the resulting user agent string is the same?
Greetings
@pirate commented on GitHub (Jan 19, 2024):
I figured out one potential reason why Cloudflare blocks people beyond USER_AGENT detection! They do a sneaky thing where they set a cookie on every request, and if later requests don't have that cookie set, they assume it's a headless browser without persistent state and block it.
Going to be very tricky for us to solve this since we intentionally use ephemeral contexts for every archive method. But I'm working on integrating browsertrix which may help https://github.com/ArchiveBox/ArchiveBox/pull/1327
In the meantime, if you set up a Chrome profile and browse with that profile normally for 20 minutes, maybe this will go away, as you'll collect enough of these tracer cookies to satisfy their human-detection algorithm.
https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile
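To make the tracer-cookie idea concrete, here is a toy simulation of the behaviour described above. It is a sketch under stated assumptions and not Cloudflare's actual algorithm: the `ToyTracerServer` class, the `crawl` helper, and the threshold of two cookieless requests are all invented for illustration.

```python
import secrets


class ToyTracerServer:
    """Toy model only: issues a tracer cookie and blocks clients that never return one."""

    def __init__(self, max_cookieless_requests=2):
        self.issued = set()        # tracer cookies this server has handed out
        self.cookieless_hits = {}  # client address -> requests seen without a valid cookie
        self.max_cookieless_requests = max_cookieless_requests

    def handle(self, client_ip, cookie=None):
        """Return (http_status, cookie_to_set) for a single request."""
        if cookie in self.issued:
            # The client kept state between requests, so treat it as a normal browser.
            return 200, cookie
        hits = self.cookieless_hits.get(client_ip, 0) + 1
        self.cookieless_hits[client_ip] = hits
        if hits > self.max_cookieless_requests:
            # Too many requests without a tracer cookie: assume a stateless bot.
            return 403, None
        token = secrets.token_hex(8)
        self.issued.add(token)
        return 200, token


def crawl(server, client_ip, n_requests, persistent):
    """Send n_requests; a persistent client keeps its cookie, an ephemeral one drops it."""
    cookie = None
    statuses = []
    for _ in range(n_requests):
        status, new_cookie = server.handle(client_ip, cookie)
        statuses.append(status)
        cookie = new_cookie if persistent else None
    return statuses


if __name__ == "__main__":
    server = ToyTracerServer()
    print("persistent profile:", crawl(server, "10.0.0.1", 5, persistent=True))
    print("ephemeral contexts:", crawl(server, "10.0.0.2", 5, persistent=False))
```

In this model the client with a persistent profile is never blocked, while the client that starts from a fresh context on every request is blocked from its third request onward, which is the situation ephemeral archive contexts end up in.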
@pirate commented on GitHub (Jan 20, 2024):
worth looking into: https://github.com/FlareSolverr/FlareSolverr
@pirate commented on GitHub (Mar 29, 2024):
I've verified FlareSolverr works great, got it working doing puppeteer-powered archiving. It's for a paying client but I'm just noting it here because I plan to come back later and add it to ArchiveBox.
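For anyone who wants to try it before it lands in ArchiveBox, here is a minimal sketch of calling FlareSolverr's HTTP API from Python using only the standard library. It assumes a FlareSolverr instance is already running locally on its default port (8191); the helper names `build_request`, `extract_solution`, and `fetch_via_flaresolverr` are made up for this example, and none of this reflects how the integration will eventually look in ArchiveBox.

```python
import json
import urllib.request

# Assumption: FlareSolverr is running locally on its default port.
FLARESOLVERR_URL = "http://localhost:8191/v1"


def build_request(target_url, max_timeout_ms=60000):
    """Build the POST request asking FlareSolverr to fetch target_url."""
    payload = {"cmd": "request.get", "url": target_url, "maxTimeout": max_timeout_ms}
    return urllib.request.Request(
        FLARESOLVERR_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_solution(response_body):
    """Pull the rendered HTML, cookies, and user agent out of a FlareSolverr reply."""
    data = json.loads(response_body)
    if data.get("status") != "ok":
        raise RuntimeError(f"FlareSolverr error: {data.get('message')}")
    solution = data["solution"]
    return solution["response"], solution.get("cookies", []), solution.get("userAgent")


def fetch_via_flaresolverr(target_url):
    """Fetch a page through FlareSolverr (requires the service to be reachable)."""
    with urllib.request.urlopen(build_request(target_url), timeout=90) as resp:
        return extract_solution(resp.read())


if __name__ == "__main__":
    html, cookies, user_agent = fetch_via_flaresolverr("https://example.com")
    print(len(html), "bytes of HTML,", len(cookies), "cookies, UA:", user_agent)
```

The returned cookies and user agent have to be reused together on follow-up requests, since the clearance appears to be tied to both.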
Also found these alternatives to keep an eye on in the future: