[GH-ISSUE #35] Set User-Agent of Wget to Chrome or Custom Value #22

Closed
opened 2026-03-01 14:39:55 +03:00 by kerem · 2 comments
Owner

Originally created by @tscs37 on GitHub (Jul 5, 2017).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/35

Some websites do not let wget crawl the entire site; they respond with a 403 or an empty page.

To work around this, it would be useful to be able to set wget's `--user-agent` parameter to another value.

A good default would be the Chrome user agent, which would make websites serve the page as they would to a browser rather than the way they treat wget's default user agent.

Alternatively, the `Lynx` user agent could be used: websites that are aware of Lynx will serve less JS and CSS, making the page easier to archive.

kerem closed this issue 2026-03-01 14:39:55 +03:00

@pirate commented on GitHub (Jul 5, 2017):

I thought about this, actually, and I'd rather respect a site's desire not to be crawled unless explicitly overridden.

I'm happy to add a user-agent change, but I want to put it behind a configuration variable like `WGET_USER_AGENT` or something, and leave the default as `wget`. If people want to change it, they can.
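As a rough illustration of that opt-in design, here is a minimal sketch, not the code that was actually committed. It assumes the setting is read from an environment variable named `WGET_USER_AGENT`; the helper `build_wget_command` is hypothetical. The only wget option used is `--user-agent`, and it is passed only when the variable is set, so wget's own default stays in effect otherwise.

```python
import os


def build_wget_command(url, env=None):
    """Build a wget argument list, overriding the User-Agent only on request.

    Hypothetical helper for illustration; WGET_USER_AGENT is assumed to be
    supplied as an environment variable.
    """
    env = os.environ if env is None else env
    user_agent = env.get("WGET_USER_AGENT")

    cmd = ["wget"]
    if user_agent:
        # Explicit override: the user has chosen to present a different agent.
        cmd.append(f"--user-agent={user_agent}")
    # With no override, wget identifies itself with its default User-Agent.
    cmd.append(url)
    return cmd
```

The returned list can be handed to `subprocess.run` as-is; passing arguments as a list rather than a shell string avoids quoting problems with user-agent strings that contain spaces or parentheses.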


@pirate commented on GitHub (Jul 6, 2017):

Done: acf59fa Docs: 7196486
