mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #35] Set User-Agent of Wget to Chrome or Custom Value #22
Originally created by @tscs37 on GitHub (Jul 5, 2017).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/35
Some websites do not let wget crawl the entire site; they respond with a 403 or an empty page.
To work around this, it would be useful to set the `--user-agent` parameter of wget to another value. A good default would be the Chrome user agent, which would make websites render as they do in a browser instead of the way they treat wget's default.
Alternatively, the User-Agent `Lynx` can be used: websites that are aware of Lynx will reduce JS and CSS content, making the page easier to archive.

@pirate commented on GitHub (Jul 5, 2017):
I thought about this actually, and I'd rather respect sites' desire not to be crawled unless explicitly overridden.
I'm happy adding a user agent change, but I want to stick it behind a configuration variable like `WGET_USER_AGENT` or something, and leave the default as `wget`. If people want to change it they can.

@pirate commented on GitHub (Jul 6, 2017):
Done: acf59fa
Docs: 7196486
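
For readers who want a picture of the opt-in behaviour discussed in this thread, here is a minimal sketch. It is not the code from the commits referenced above, which are not shown here. The function name `wget_command` and the choice of reading the value from an environment variable are assumptions made for illustration; only the `WGET_USER_AGENT` name and wget's `--user-agent` option come from the discussion. When the variable is unset, no flag is passed and wget identifies itself with its own default.

```python
import os


def wget_command(url, env=None):
    """Build a wget argument list (hypothetical helper, for illustration only).

    The --user-agent flag is added only when WGET_USER_AGENT is set, so the
    default behaviour (wget announcing itself as wget) is left untouched.
    """
    env = os.environ if env is None else env
    user_agent = env.get("WGET_USER_AGENT", "").strip()

    cmd = ["wget", "--no-verbose"]
    if user_agent:
        # Explicit override requested by the user, e.g. a Chrome UA string or "Lynx".
        cmd.append(f"--user-agent={user_agent}")
    cmd.append(url)
    return cmd


if __name__ == "__main__":
    # Print the command that would be run for the current environment.
    print(" ".join(wget_command("https://example.com")))
```

Building the command as a list keeps a user agent string containing spaces as a single argument when it is handed to something like `subprocess.run`.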