[GH-ISSUE #997] Chromium Headless is Detected by Cloudflare (Error 1010) even with Custom User Agent #2133

Open
opened 2026-03-01 17:56:45 +03:00 by kerem · 5 comments
Owner

Originally created by @Morpheus0x on GitHub (Jul 12, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/997

Describe the bug

Unable to make a snapshot of a website that uses Cloudflare. The snapshot shows only:
Error 1010: The owner of this website has banned your access based on your browser's signature.
This occurs even with a custom user agent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36"
I have also set the Chromium option --disable-blink-features=AutomationControlled, but the browser is still detected by Cloudflare. I even set CHROME_HEADLESS to False, but it still doesn't work.
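
For readers reproducing this, the combination of settings described above can be sketched as a Chromium command line. This is only an illustration: `build_chromium_args` is a hypothetical helper, not ArchiveBox's actual launch code, and per this report these flags alone did not avoid the Error 1010 block.

```python
import shlex


# Hypothetical helper for illustration; not ArchiveBox's actual launch code.
def build_chromium_args(binary, url, user_agent=None, headless=True, extra_flags=()):
    """Assemble a Chromium argv list from the settings tried in this report."""
    args = [binary]
    if headless:
        args.append("--headless")
    if user_agent:
        # Overrides the User-Agent header and navigator.userAgent.
        args.append(f"--user-agent={user_agent}")
    args.extend(extra_flags)
    args.append(url)
    return args


if __name__ == "__main__":
    ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
          "(KHTML, like Gecko) Chrome/102.0.5005.115 Safari/537.36")
    argv = build_chromium_args(
        "/usr/bin/chromium",
        "https://example.com",  # placeholder URL
        user_agent=ua,
        headless=False,
        extra_flags=["--disable-blink-features=AutomationControlled"],
    )
    # Print a copy-pasteable shell command instead of launching the browser.
    print(shlex.join(argv))
```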

Searching for a way to make Chrome Headless not be detected, I found several promising sites:
https://blog.m157q.tw/posts/2020/09/11/bypass-cloudflare-detection-while-using-selenium-with-chromedriver/
https://intoli.com/blog/making-chrome-headless-undetectable/
https://github.com/ultrafunkamsterdam/undetected-chromedriver
https://stackoverflow.com/questions/65760004/making-chrome-headless-undetectable-in-python
https://stackoverflow.com/questions/65994908/selenium-cant-open-a-second-page/65998533#65998533

Most of these require WebDriver, Selenium, or something similar. As far as I can tell, ArchiveBox only uses the Chromium executable without any scraper wrapper. For that setup, the best option I found is to inject JavaScript into every scraped page, as explained in the intoli.com article (https://intoli.com/blog/making-chrome-headless-undetectable/). That approach uses mitmproxy (https://mitmproxy.org/), which isn't an ideal option.
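
The injection idea can be sketched independently of any proxy: rewrite each HTML response so that a small script runs before the page's own scripts. The function below is a minimal sketch using only the standard library; `inject_script` and `STEALTH_JS` are hypothetical names, the single `navigator.webdriver` override is just one example property, and wiring this into a mitmproxy response hook is left out.

```python
import re

# Illustrative override only; real detection scripts check many more properties.
STEALTH_JS = "Object.defineProperty(navigator, 'webdriver', {get: () => undefined});"


def inject_script(html: str, js: str = STEALTH_JS) -> str:
    """Insert a <script> tag right after the opening <head> tag.

    If the document has no <head>, the tag is prepended to the document
    so that it still runs before any other script.
    """
    tag = f"<script>{js}</script>"
    new_html, count = re.subn(
        r"(<head\b[^>]*>)",
        lambda m: m.group(1) + tag,  # callable avoids escaping issues in the JS
        html,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        new_html = tag + html
    return new_html
```

A proxy would call something like this on every `text/html` response body before passing it on to the browser.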

I propose switching to Selenium in order to take advantage of selenium-stealth (https://github.com/diprajpatra/selenium-stealth).

Steps to reproduce

Archive any Quora or Discord link.
Look at the resulting snapshot; it should contain only the Cloudflare Error 1010 message above.

Screenshots or log output

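screenshot: https://user-images.githubusercontent.com/22265595/178444736-dda29159-49f4-4cac-9d76-c42a700739a0.png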

ArchiveBox version

ArchiveBox v0.6.3
Cpython Linux Linux-5.10.0-11-amd64-x86_64-with-glibc2.31 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
√ ARCHIVEBOX_BINARY v0.6.3 valid /usr/local/bin/archivebox
√ PYTHON_BINARY v3.10.4 valid /usr/local/bin/python3.10
√ DJANGO_BINARY v3.1.14 valid /usr/local/lib/python3.10/site-packages/django/bin/django-admin.py
√ CURL_BINARY v7.74.0 valid /usr/bin/curl
√ WGET_BINARY v1.21 valid /usr/bin/wget
√ NODE_BINARY v17.9.0 valid /usr/bin/node
√ SINGLEFILE_BINARY v0.3.16 valid /node/node_modules/single-file/cli/single-file
√ READABILITY_BINARY v0.0.2 valid /node/node_modules/readability-extractor/readability-extractor
√ MERCURY_BINARY v1.0.0 valid /node/node_modules/@postlight/mercury-parser/cli.js
√ GIT_BINARY v2.30.2 valid /usr/bin/git
√ YOUTUBEDL_BINARY v2022.04.08 valid /usr/local/bin/yt-dlp
√ CHROME_BINARY v101.0.4951.41 valid /usr/bin/chromium
√ RIPGREP_BINARY v12.1.1 valid /usr/bin/rg

[i] Source-code locations:
√ PACKAGE_DIR 24 files valid /app/archivebox
√ TEMPLATES_DIR 4 files valid /app/archivebox/templates

  • CUSTOM_TEMPLATES_DIR - disabled

[i] Secrets locations:

  • CHROME_USER_DATA_DIR - disabled
  • COOKIES_FILE - disabled

[i] Data locations:
√ OUTPUT_DIR 5 files valid /data
√ SOURCES_DIR 20 files valid ./sources
√ LOGS_DIR 1 files valid ./logs
√ ARCHIVE_DIR 49 files valid ./archive
√ CONFIG_FILE 81.0 Bytes valid ./ArchiveBox.conf
√ SQL_INDEX 660.0 KB valid ./index.sqlite3


@pirate commented on GitHub (Jul 12, 2022):

Definitely not going to switch to Selenium; we're partway through a refactor to Pyppeteer. This is just a hard problem in general, and it will always be a cat-and-mouse game with providers like Cloudflare trying to block bots.


@derRichter commented on GitHub (Nov 8, 2023):

Same problem here.
With headless mode off and no user agent set, it works.
(I get a normal user agent like
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36)

With headless mode off and that same real user agent set explicitly in the Chrome driver options, Cloudflare blocks!

What is happening here? What is the difference between setting the user agent manually and leaving the option unset, when the resulting user agent is the same?

Regards


@pirate commented on GitHub (Jan 19, 2024):

I figured out one potential reason why Cloudflare blocks people beyond USER_AGENT detection! They do a sneaky thing where they set a cookie on every request, and if later requests don't have that cookie set, they assume it's a headless browser without persistent state and block it.
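
The "persistent state" part can be illustrated with a plain HTTP client rather than Chrome. The sketch below keeps cookies in a Netscape-format cookies.txt file between runs, so a cookie set on one request is sent back on later ones. `open_session` and `close_session` are hypothetical helper names, and this only demonstrates the mechanism; it is not a tested way around Cloudflare.

```python
import http.cookiejar
import urllib.request


def open_session(cookie_path):
    """Return (opener, jar); cookies are loaded from cookie_path if it exists."""
    jar = http.cookiejar.MozillaCookieJar(cookie_path)
    try:
        # Keep session cookies too, since tracer cookies may have no expiry.
        jar.load(ignore_discard=True, ignore_expires=True)
    except FileNotFoundError:
        pass  # first run: start with an empty jar
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def close_session(jar):
    """Write the jar back to disk so the next run starts with the same cookies."""
    jar.save(ignore_discard=True, ignore_expires=True)
```

An ephemeral context corresponds to never calling `close_session`: every run starts with an empty jar, so the cookie set by the previous response is never returned.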

Going to be very tricky for us to solve this since we intentionally use ephemeral contexts for every archive method. But I'm working on integrating browsertrix which may help https://github.com/ArchiveBox/ArchiveBox/pull/1327

In the meantime, if you set up a Chrome profile and browse with that profile normally for 20 minutes or so, this may go away, as you'll collect enough of these tracer cookies to satisfy their human-detection algorithm.
https://github.com/ArchiveBox/ArchiveBox/wiki/Chromium-Install#setting-up-a-chromium-user-profile


@pirate commented on GitHub (Jan 20, 2024):

worth looking into: https://github.com/FlareSolverr/FlareSolverr


@pirate commented on GitHub (Mar 29, 2024):

I've verified that FlareSolverr works great; I got it working for Puppeteer-powered archiving. It was for a paying client, but I'm noting it here because I plan to come back later and add it to ArchiveBox.
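
For anyone who wants to try FlareSolverr in the meantime, a request to a locally running instance might look roughly like the sketch below. The endpoint `http://localhost:8191/v1`, the `request.get` command, and the `solution` fields are written from memory of the FlareSolverr README and should be checked against its current documentation; the helper names are hypothetical, and this is not the integration mentioned above.

```python
import json
import urllib.request

# Assumed default endpoint of a locally running FlareSolverr instance.
FLARESOLVERR_URL = "http://localhost:8191/v1"


def build_payload(url, max_timeout_ms=60000):
    """JSON body asking FlareSolverr to fetch `url` in its own browser."""
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}


def extract_solution(reply):
    """Pull the rendered HTML, cookies, and user agent out of a reply dict."""
    if reply.get("status") != "ok":
        raise RuntimeError(f"FlareSolverr error: {reply.get('message')}")
    solution = reply["solution"]
    return {
        "html": solution.get("response", ""),
        "cookies": solution.get("cookies", []),
        "user_agent": solution.get("userAgent"),
    }


def fetch_via_flaresolverr(url, endpoint=FLARESOLVERR_URL):
    """POST the payload and return the parsed solution (needs a running instance)."""
    data = json.dumps(build_payload(url)).encode("utf-8")
    req = urllib.request.Request(
        endpoint, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=90) as resp:
        return extract_solution(json.load(resp))
```

The returned cookies and user agent are meant to be reused together in follow-up requests, since a clearance cookie is usually tied to the browser signature that obtained it.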

Also found these alternatives to keep an eye on in the future:

- https://github.com/omkarcloud/botasaurus
- https://github.com/ultrafunkamsterdam/nodriver
- https://github.com/Akmal-CloudFreed/CloudFreed-CloudFlare-bypass
- https://github.com/VeNoMouS/cloudscraper