[GH-ISSUE #469] Feature/Question: is there a way to get Chromium to spoof its browser agent? #1817

Closed
opened 2026-03-01 17:53:56 +03:00 by kerem · 2 comments

Originally created by @dannguyen on GitHub (Sep 7, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/469

Type

  • [x] General question or discussion
  • [ ] Propose a brand new feature
  • [ ] Request modification of existing behavior or design

When trying to capture a tweet via a standard archivebox call, e.g.

  $ archivebox add https://twitter.com/jack/status/20

The task seems to complete successfully apart from an irrelevant youtube-dl error (see the output at the bottom), but Twitter appears to be blocking Chromium (or whatever it is that does the WARC) based on its user agent, showing an error popup that reads: "This browser is no longer supported. Please switch to a supported browser or disable the extension which masks your browser to continue using twitter.com"

image

The PDF and Screenshot and HTML versions are also broken. However, the SingleFile snapshot works mostly as expected:

image

I know it's not in the scope of ArchiveBox to navigate the many kinds of blockers that individual sites put up. I was just wondering whether there is a way to spoof the user agent through Chromium (though obviously a general configuration option may not fool whatever method Twitter uses to detect the browser).

Output of my archivebox invocation (note: I ran this on top of an existing project, hence the 179 links being updated, etc.):

$ archivebox add https://twitter.com/jack/status/20
[i] [2020-09-07 23:17:49] ArchiveBox v0.4.21: archivebox add https://twitter.com/jack/status/20
    > /private/tmp/archivx

[+] [2020-09-07 23:17:51] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1599520671-import.txt
    > Parsed 1 URLs from input (Plain Text)
    > Found 1 new URLs not already in index

[*] [2020-09-07 23:17:51] Writing 179 links to main index...
    √ /private/tmp/archivx/index.sqlite3
    √ /private/tmp/archivx/index.json
    √ /private/tmp/archivx/index.html

[▶] [2020-09-07 23:17:51] Collecting content for 1 Snapshots in archive...

[+] [2020-09-07 23:17:51] "twitter.com/jack/status/20"
    https://twitter.com/jack/status/20
    > ./archive/1599520671.2744
      > title
        Extractor failed:
             Unable to detect page title
        Run to see full output:
            cd /private/tmp/archivx/archive/1599520671.2744;
            curl --silent --max-time 60 --location --compressed --user-agent "ArchiveBox/0.4.21 (+https://github.com/pirate/ArchiveBox/) curl/curl 7.64.1 (x86_64-apple-darwin19.0)" https://twitter.com/jack/status/20

      > favicon
      > wget
      > singlefile
      > pdf
      > screenshot
      > dom
      > readability
      > media
        Extractor failed:
             Failed to save media
            Got youtube-dl response code: 1.
            ERROR: There's no video in this tweet.; please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see  https://yt-dl.org/update  on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
        Run to see full output:
            cd /private/tmp/archivx/archive/1599520671.2744;
            youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs --extract-audio --keep-video --ignore-errors --geo-bypass --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata --yes-playlist https://twitter.com/jack/status/20

      > archive_org

[√] [2020-09-07 23:18:38] Update of 1 pages complete (46.42 sec)
    - 0 links skipped
    - 1 links updated
    - 1 links had errors

    Hint: To view your archive index, open:
        /private/tmp/archivx/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2020-09-07 23:18:38] Writing 179 links to main index...
    √ /private/tmp/archivx/index.sqlite3
    √ /private/tmp/archivx/index.json
    √ /private/tmp/archivx/index.html
kerem closed this issue 2026-03-01 17:53:56 +03:00
@pirate commented on GitHub (Sep 8, 2020):

https://github.com/pirate/ArchiveBox/wiki/Configuration#chrome_user_agent
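
For readers landing here, the option behind that link can be sketched roughly as follows. This is a minimal, hedged example and not something posted in the thread: the user-agent string is an assumed placeholder (substitute whatever your own desktop browser reports), and, as the question already notes, a changed user agent may still not get past Twitter's detection.

```shell
# Assumed example user-agent string; replace it with the one your own browser sends.
UA="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36"

if command -v archivebox >/dev/null 2>&1; then
    # One-off override: ArchiveBox reads its config options from environment variables.
    CHROME_USER_AGENT="$UA" archivebox add 'https://twitter.com/jack/status/20'

    # To persist the setting for the collection instead (run inside the collection directory):
    # archivebox config --set CHROME_USER_AGENT="$UA"
else
    echo "archivebox is not installed; nothing was run"
fi
```

The environment-variable form only affects that single invocation, which makes it convenient for checking whether a given string helps before saving it to the collection's config.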

@dannguyen commented on GitHub (Sep 8, 2020):

Thanks!
