mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #180] Specifying User-Agent to Chromium #3146
Originally created by @n0ncetonic on GitHub (Mar 19, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/180
I'm currently archiving https://developer.apple.com/library/archive/navigation/ by writing my own custom crawlers and processors, but I wanted to supplement my primarily PDF-based archive with the multi-output archive provided by ArchiveBox. I fed a list of URLs into ArchiveBox, but Apple is serving constant 4xx errors. As I noticed while writing my own headless Chrome crawler, the cause is that the Chromium instance has "Headless" in its User-Agent string by default, and many websites check for this in order to block spiders and scrapers.
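To illustrate the kind of check described above, here is a minimal sketch of how a site might flag a headless browser from its User-Agent string. The function name `looks_headless` is hypothetical, the sample strings are illustrative only (version numbers are examples), and this is a guess at the general technique, not Apple's actual logic.

```python
def looks_headless(user_agent: str) -> bool:
    """Return True if the User-Agent string advertises a headless browser."""
    # Headless Chrome/Chromium identifies itself with a "HeadlessChrome" token
    # by default, so a case-insensitive substring check is enough to catch it.
    return "headless" in user_agent.lower()


# Illustrative User-Agent strings; the version numbers are examples only.
HEADLESS_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/73.0.3683.75 Safari/537.36"
)
REGULAR_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"
)

if __name__ == "__main__":
    print(looks_headless(HEADLESS_UA))
    print(looks_headless(REGULAR_UA))
```

A check this simple is defeated by sending any User-Agent that lacks the "Headless" token, which is why overriding the string is usually sufficient.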
Ideally I'd like to be able to set a User-Agent value per importing session in order to address this issue.
I'm not sure whether it's possible to set this via the Chromium command-line options, but it may be possible via the interface being used to control Chromium.

I haven't looked at the code base yet; I'm posting this first because I hadn't seen an issue about it and wasn't sure whether I was just missing something somewhere.
Great project, I'm excited to begin using it 👍
@n0ncetonic commented on GitHub (Mar 19, 2019):
Minor update: I actually looked at the code and was able to remedy this by adding a new `HEADLESS_USER_AGENT` configuration key and adding logic to `archive_methods.py` to add the specified User-Agent if it is present.

This seems to have been enough to avoid detection by sites that check for "Headless" in the User-Agent string and refuse to serve content. Is this something you'd like me to submit a PR for, or am I more of a one-off in running into this issue?
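For readers wondering what such a change could look like, here is a rough sketch, not the code from the actual patch. It assumes the `HEADLESS_USER_AGENT` key is read from the environment and passed to Chromium through its `--user-agent` command-line switch. The helper names `user_agent_from_env` and `build_chrome_args` and the default binary name are hypothetical.

```python
import os
from typing import List, Mapping, Optional


def user_agent_from_env(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Read the optional HEADLESS_USER_AGENT key; return None if unset or empty."""
    env = os.environ if env is None else env
    return env.get("HEADLESS_USER_AGENT") or None


def build_chrome_args(binary: str = "chromium-browser",
                      user_agent: Optional[str] = None) -> List[str]:
    """Build a headless Chromium argument list, adding a User-Agent override if given."""
    args = [binary, "--headless"]
    if user_agent:
        # Only override the User-Agent when one was configured, so the
        # default behaviour is unchanged for everyone else.
        args.append(f"--user-agent={user_agent}")
    return args


if __name__ == "__main__":
    # Passing the list to subprocess.run() avoids shell quoting problems
    # with the spaces found in typical User-Agent strings.
    print(build_chrome_args(user_agent=user_agent_from_env()))
```

Keeping the override optional means existing setups behave exactly as before unless the key is set.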
@n0ncetonic commented on GitHub (Mar 19, 2019):
Closing this issue, as PR #181 addresses it.