[GH-ISSUE #180] Specifying User-Agent to Chromium #124

Closed
opened 2026-03-01 14:40:48 +03:00 by kerem · 2 comments
Owner

Originally created by @n0ncetonic on GitHub (Mar 19, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/180

Type:

  • [x] General Question or Discussion
  • [ ] Propose a brand new feature
  • [x] Request modification of existing behavior or design

I'm currently archiving https://developer.apple.com/library/archive/navigation/ with my own custom crawlers and processors, but I wanted to supplement my primarily PDF-based archive with the multi-output archive provided by ArchiveBox. I fed a list of URLs into ArchiveBox, but Apple is serving constant 40X errors. I noticed the same detection while writing my own headless Chrome crawler: by default the Chromium instance has "Headless" in its User-Agent string, which many websites check for in order to block spiders and scrapers.

Ideally I'd like to be able to set a User-Agent value per importing session in order to address this issue. ~~I'm not sure that it's possible to set via the Chromium command-line options, but it may be possible via the interface being used to control Chromium.~~

I haven't yet looked at the code base and was posting this first as I hadn't seen an issue pertaining to this and wasn't sure if I was just missing something somewhere.

Great project, I'm excited to begin using it 👍

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [ ] It's important to add it in the near-mid term future
  • [x] It would be nice to have eventually
  • [x] I'm willing to contribute to development
kerem closed this issue 2026-03-01 14:40:49 +03:00

@n0ncetonic commented on GitHub (Mar 19, 2019):

Minor update: I actually looked at the code and was able to remedy this by adding a new `HEADLESS_USER_AGENT` configuration key and adding logic to `archive_methods.py` to add the specified User-Agent if it is present.

This seems to have been enough to avoid detection by sites that check for "Headless" in the User-Agent string and refuse to serve content. Is this something you'd like me to submit a PR for, or am I more of a one-off in running into this issue?
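
For readers following along, here is a minimal sketch of the kind of change described above. It is not the code from the actual patch: the `chrome_args` helper, the `chromium-browser` binary name, and reading the value from an environment variable are all assumptions made for illustration. Only the `HEADLESS_USER_AGENT` key name comes from this thread, and `--user-agent` is Chromium's command-line switch for overriding the User-Agent string.

```python
import os
from typing import List, Optional


def chrome_args(binary: str = "chromium-browser",
                user_agent: Optional[str] = None) -> List[str]:
    """Build a headless Chromium command line (hypothetical helper).

    The --user-agent switch is appended only when a non-empty value is
    given, so the browser default is left untouched otherwise.
    """
    args = [binary, "--headless"]
    if user_agent:
        # Passed as a single list element, so no shell quoting is needed
        # when the list is handed to subprocess.run().
        args.append(f"--user-agent={user_agent}")
    return args


# Assumption: the config key is read from the environment; unset or empty
# means "do not override the User-Agent".
HEADLESS_USER_AGENT = os.environ.get("HEADLESS_USER_AGENT") or None

cmd = chrome_args(user_agent=HEADLESS_USER_AGENT) + [
    "--dump-dom", "https://example.com",
]
```

The resulting `cmd` list could then be handed to `subprocess.run(cmd)`. Keeping the override optional means existing setups behave as before unless the key is set.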


@n0ncetonic commented on GitHub (Mar 19, 2019):

Closing this issue, as PR #181 addresses it.

Reference
starred/ArchiveBox#124