[GH-ISSUE #239] Architecture: Archived JS executes in a context shared with all other archived content (and the admin UI!) #1677

Open
opened 2026-03-01 17:52:47 +03:00 by kerem · 8 comments
Owner

Originally created by @s7x on GitHub (May 14, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/239

Describe the bug

Hi there!
There's an XSS vulnerability when you open your index.html if you saved a page with a title containing an XSS vector.

Steps to reproduce

  1. Save this page, for example: [Twitter of @garethheyes](https://twitter.com/garethheyes/status/1126526480614416395)
  2. Open your index.html
  3. Get XSS'd by sir @garethheyes

Source code:

```
<a href="archive/1557816881/twitter.com/garethheyes/status/1126526480614416395.html" title="\u2028\u2029 op Twitter: "Another way to use throw without a semi-colon:
<script>{onerror=alert}throw 1</script>"">
```

Software versions

  • OS: ArchLinux
  • ArchiveBox version: 903.59da482-1
  • Python version: python3.7
  • Chrome version: Chromium 74.0.3729.131 Arch Linux

@pirate commented on GitHub (May 18, 2019):

I'm aware of this already. The reason I haven't immediately locked it down is that archived pages can already run arbitrary JavaScript in the context of your archive domain, so there's not much we can do to protect against that attack vector without breaking interactivity across all archived pages. If I add bleach-style XSS stripping to titles it'll make the index page less likely to break from a UX perspective, but it doesn't make it any more secure, because archived pages can just request the index page at any time directly using JavaScript.

v0.4 is going to add some security headers that will make it more difficult for pages to use JS to access other archived pages, but it's never going to be perfect unless we have each archived page stored on its own domain.

I'm having long conversations with several people this week about the security model of ArchiveBox. It's a difficult problem, but I think we'll have to end up disabling all JavaScript in the static HTML archives and only allowing proxied replay of WARCs if people want interactivity preserved. I'm going to move all the filesystem stuff into hash-bucketed folders to discourage people from opening the saved HTML files directly instead of accessing them via nginx or the Django webserver, as allowing archived JS to have filesystem access is disastrously bad security.
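The comment doesn't specify the bucketing scheme; a common approach (purely hypothetical here) is to nest each snapshot directory under a couple of levels derived from a hash of its name, so on-disk paths aren't trivially guessable from URLs:

```python
import hashlib

def bucketed_path(timestamp: str, levels: int = 2) -> str:
    """Place a snapshot dir under hash-derived subfolders, e.g.
    '1557816881' -> '<xx>/<yy>/1557816881'. The scheme is hypothetical,
    not the layout ArchiveBox actually adopted."""
    digest = hashlib.sha256(timestamp.encode("utf-8")).hexdigest()
    buckets = [digest[i * 2 : i * 2 + 2] for i in range(levels)]
    return "/".join(buckets + [timestamp])

path = bucketed_path("1557816881")
print(path)
```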


@pirate commented on GitHub (May 20, 2019):

Because this is fairly serious, I've temporarily struck out the instructions for running ArchiveBox with private data: https://github.com/pirate/ArchiveBox/wiki/Security-Overview#important-dont-use-archivebox-for-private-archived-content-right-now-as-were-in-the-middle-of-resolving-some-security-issues-with-how-js-is-executed-in-archived-content

Unfortunately my day job is getting super busy right now, so I don't know how soon I can change the design (fixing this is a big architectural change), but I think I might add a notice to the README as well to warn people that running archived pages can leak the index and content in the current state. The primary use case is archiving public pages and feeds, so it's not as bad as if it were doing private session archiving by default, but I don't want to give users a false sense of security, so we should definitely be transparent about the risks.


@s7x commented on GitHub (May 24, 2019):

Hi @pirate! I know the issue is not that critical when you're using ArchiveBox only locally (like I do), because you're aware of what you're doing (supposedly, at least) when you save pages and such, but still, I think some people would be happy to know there's no random JS popping up in their hoarding box :)

Thanks for your time & consideration. And for sure, thanks for this awesome tool.

Cheers!


@andrewzigerelli commented on GitHub (Jul 17, 2019):

Why does this only affect the title? Is it possible that this XSS opportunity exists elsewhere?


@pirate commented on GitHub (Sep 25, 2019):

@andrewzigerelli see my comment above. A primary goal of ArchiveBox is to preserve JS and interactivity in archived pages, but that means pages necessarily have to be able to execute their own arbitrary JS.

XSS-stripping titles or any of the other little metadata fields is like putting up a small picket fence to try and stop a tsunami. Why would an attacker bother going to the trouble of stuffing some XSS payload into page titles when they can just put JS on the page directly, knowing it will be executed by ArchiveBox users on a domain shared with the index and all the other pages? (The whole traditional browser security model breaks without cross-origin protections: the invisible wall that stops xxxhacker.com from accessing your data on facebook.com is the fact that it's on a different domain, but all archived pages are served from the same domain.)

> archived pages can already run arbitrary Javascript in the context of your archive domain, so there's not much we can do to protect against that attack vector without breaking interactivity across all archived pages

> archived pages can just request the index page at any time directly using JavaScript


@pirate commented on GitHub (May 12, 2021):

Idea h/t for encouragement from @FiloSottile, and similar to how Wikimedia and many other services do it:

  • serve all "dirty" archived content from one port, e.g. 9595, including static archive/<timestamp>/index.html indexes, archived content with live JS, etc. that could be dangerous
  • serve the Django admin interface from 9594; the login screen, the ability to add new snapshots, remove URLs, etc. should not be on the same origin as the risky archived content

These can be mapped to separate domains/ports by the user (subdomains are dangerous? maybe; full domains are likely required), but this will require adding some new config options to tune what port/domain the admin and dirty content are listening on, e.g.:
HTTP_DIRTY_LISTEN=https://demousercontent.archivebox.io
HTTP_ADMIN_LISTEN=https://demo.archivebox.io

This would close a pretty crucial security hole where archived content can mess with the execution of extractors (and potentially run arbitrary shell scripts if they chain together a series of injection attacks).
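The two listen addresses above are proposed config names, not shipped settings. Whatever form they take, the invariant they exist to enforce is that the admin UI and the dirty content never share an origin (scheme + host + port). A small sketch of that check, assuming the env var names from the proposal and placeholder localhost defaults:

```python
import os
from urllib.parse import urlsplit

# Hypothetical env vars taken from the proposal above; the defaults are
# placeholder localhost ports, not real ArchiveBox defaults.
dirty = urlsplit(os.environ.get("HTTP_DIRTY_LISTEN", "http://localhost:9595"))
admin = urlsplit(os.environ.get("HTTP_ADMIN_LISTEN", "http://localhost:9594"))

def origin(u):
    """The (scheme, host, port) triple is what the browser compares
    when enforcing the same-origin policy."""
    return (u.scheme, u.hostname, u.port)

# If these ever match, archived JS runs same-origin with the admin UI.
same_origin = origin(dirty) == origin(admin)
print("shared origin:", same_origin)
```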

Semi-related, using sandboxed iframes for replay: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Sec-Fetch-Mode

Extractor methods that replay JS:

  • wget
  • dom

Proposed behavior:

  • if dirty content is loaded from within an iframe (with sandbox protections): allow JS, because iframe sandboxes protect us (verify this first)
  • if dirty content is loaded outside an iframe (e.g. if someone visits the URL directly): serve strict CSP/CORS headers to prevent JS execution entirely
  • prevent right-clicking the iframe to get the unsafe URL and opening it in a new tab directly? Or detect server-side if a dirty URL is visited outside an iframe and block it?

Config option to enable bypassing sandboxing:

  • DANGER_ALLOW_BYPASSING_SANDBOX=True/False
  • once ^ is enabled, a checkbox appears on a per-snapshot basis that allows disabling iframe/CSP sandbox protections when replaying that snapshot
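The framed-vs-direct branch above can be sketched using the Sec-Fetch-Dest request header, which browsers set to "iframe" when the load is an iframe navigation. This is an illustration of the proposed behavior under that assumption, not ArchiveBox's actual code (note that older browsers that omit Sec-Fetch-Dest would fall into the locked-down branch):

```python
def replay_headers(request_headers: dict) -> dict:
    """Choose response headers for 'dirty' archived content based on how
    the browser says it is fetching it. Sketch of the proposal only."""
    if request_headers.get("Sec-Fetch-Dest") == "iframe":
        # Framed replay: rely on the parent page's <iframe sandbox=...>
        # attribute for containment, so the snapshot's own JS can still run.
        return {"Vary": "Sec-Fetch-Dest"}
    # Direct visit: refuse script execution outright with a strict CSP
    # (script-src falls back to default-src 'none', blocking all JS).
    return {
        "Vary": "Sec-Fetch-Dest",
        "Content-Security-Policy": "sandbox; default-src 'none'",
    }
```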

@FiloSottile commented on GitHub (Jun 21, 2021):

I talked about the ArchiveBox scenario with a couple of experts, and we came up with a better option than <iframe sandbox>: Content-Security-Policy: sandbox, which instructs the browser to treat the load as its own unique origin.

This is much more robust and convenient than detecting iframe loads.

We also went through the list of security headers to pick the ones that would protect ArchiveBox pages from Spectre, too. They should involve no maintenance.

```
Content-Security-Policy: sandbox
Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Resource-Policy: same-origin [FOR HTML] / cross-origin [FOR NOT HTML]
Vary: Sec-Fetch-Site
X-Content-Type-Options: nosniff
```

On top of that, it would still be a good idea to have the admin API on a different origin (a different subdomain is enough), and make its cookie SameSite=Strict.

This should stop any cross-contamination between archived pages, but it won't stop them from detecting other archived pages. That might be possible, but it will require more complex server logic.
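The header set above maps directly onto a small helper one might call from Django middleware. A sketch only (the function name is illustrative, not ArchiveBox's API), with the HTML/non-HTML split applied to Cross-Origin-Resource-Policy as suggested:

```python
def snapshot_security_headers(content_type: str) -> dict:
    """Return the response headers suggested above for archived content.
    Sketch only; not ArchiveBox's actual middleware."""
    is_html = content_type.split(";")[0].strip() == "text/html"
    return {
        # Treat every response as its own unique, opaque origin.
        "Content-Security-Policy": "sandbox",
        "Cross-Origin-Opener-Policy": "same-origin",
        "Cross-Origin-Embedder-Policy": "require-corp",
        # HTML documents stay same-origin; subresources (images, CSS, ...)
        # must remain loadable cross-origin by the sandboxed documents.
        "Cross-Origin-Resource-Policy": "same-origin" if is_html else "cross-origin",
        "Vary": "Sec-Fetch-Site",
        "X-Content-Type-Options": "nosniff",
    }
```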


@agnosticlines commented on GitHub (Jul 24, 2022):

Hi! Sorry to post on such an old issue; just wondering if this is going to be implemented? I would love to be able to use WARC instead of SingleFile.
