mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #239] Architecture: Archived JS executes in a context shared with all other archived content (and the admin UI!) #1677
Originally created by @s7x on GitHub (May 14, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/239
Describe the bug
Hi there!
There's an XSS vulnerability when you open your index.html if you saved a page with a title containing an XSS vector.
Steps to reproduce
Source code:
Software versions
@pirate commented on GitHub (May 18, 2019):
I'm aware of this already. The reason I haven't immediately locked it down is that archived pages can already run arbitrary JavaScript in the context of your archive domain, so there's not much we can do to protect against that attack vector without breaking interactivity across all archived pages. If I add bleach-style XSS stripping to titles it'll make the index page less likely to break from a UX perspective, but it doesn't make it any more secure, because archived pages can just request the index page at any time directly using JavaScript.
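As an aside, the narrower title-escaping fix discussed here is straightforward. The sketch below is not ArchiveBox's actual rendering code; it just illustrates escaping user-controlled fields with the stdlib before they land in index HTML, so a hostile title renders as inert text:

```python
import html

def render_index_row(url: str, title: str) -> str:
    """Return one <tr> for the archive index with all user-controlled
    fields escaped. quote=True also escapes quotes, which matters when
    the value is interpolated into an HTML attribute like href."""
    safe_title = html.escape(title, quote=True)
    safe_url = html.escape(url, quote=True)
    return f'<tr><td><a href="{safe_url}">{safe_title}</a></td></tr>'

row = render_index_row("https://example.com", "<script>alert(1)</script>")
# The payload survives only as text (&lt;script&gt;...), never as markup.
```

As the comment above notes, though, this is cosmetic hardening: it keeps the index from breaking, but it does nothing about archived pages running their own JS on the same origin.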
v0.4 is going to add some security headers that will make it more difficult for pages to use JS to access other archived pages, but it's never going to be perfect unless we have each archived page stored on its own domain.
I'm having long conversations with several people this week about the security model of ArchiveBox. It's a difficult problem, but I think we'll have to end up disabling all JavaScript in the static HTML archives and only allowing proxy replaying of WARCs if people want interactivity preserved. I'm also going to move all the filesystem stuff into hash-bucketed folders to discourage people from opening the saved HTML files directly and to push them toward accessing snapshots only via nginx or the Django webserver, as allowing archived JS to have filesystem access is disastrously bad security.
@pirate commented on GitHub (May 20, 2019):
Because this is fairly serious, I've temporarily struck out the instructions for running ArchiveBox with private data: https://github.com/pirate/ArchiveBox/wiki/Security-Overview#important-dont-use-archivebox-for-private-archived-content-right-now-as-were-in-the-middle-of-resolving-some-security-issues-with-how-js-is-executed-in-archived-content
Unfortunately my day job is getting super busy right now, so I don't know how soon I can change the design (fixing this is a big architectural change). I think I'll add a notice to the README as well to warn people that, in the current state, running archived pages can leak the index and content. The primary use case is archiving public pages and feeds, so it's not as bad as if it were doing private session archiving by default, but I don't want to give users a false sense of security, so we should definitely be transparent about the risks.
@s7x commented on GitHub (May 24, 2019):
Hi @pirate! I know the issue is not that critical when you're using ArchiveBox only locally (like I do), because you're aware of what you're doing (supposedly, at least) when you save pages and such, but I still think some people would be happy to know there's no random JS popping up in their hoarding box :)
Thanks for your time & consideration. And for sure, thanks for this awesome tool.
Cheers!
@andrewzigerelli commented on GitHub (Jul 17, 2019):
Why does this only affect title? Is it possible that this XSS opportunity exists elsewhere?
@pirate commented on GitHub (Sep 25, 2019):
@andrewzigerelli see my comment above. A primary goal of ArchiveBox is to preserve JS and interactivity in archived pages, but that means pages necessarily have to be able to execute their own arbitrary JS.
XSS-stripping titles or any of the other little metadata fields is like putting up a small picket fence to try and stop a tsunami. Why would an attacker bother going to the trouble to stuff some XSS payload in page titles when they can just put JS on the page directly knowing it will be executed by ArchiveBox users on a domain shared with the index and all the other pages? (the whole traditional browser security model breaks without CORS protections. the invisible wall that stops xxxhacker.com from accessing your data on facebook.com is the fact that it's on a different domain, but all archived pages are served from the same domain)
@pirate commented on GitHub (May 12, 2021):
Idea h/t for encouragement from @FiloSottile, and similar to how Wikimedia and many other services do it:
- archive/<timestamp>/index.html indexes, archived content with live JS, etc. that could be dangerous

These can be mapped to separate domains/ports (subdomains are dangerous? maybe; full domains likely required) by the user, but this will require adding some new config options to tune what port/domain the admin and dirty content are listening on, e.g.:

HTTP_DIRTY_LISTEN=https://demousercontent.archivebox.io
HTTP_ADMIN_LISTEN=https://demo.archivebox.io

This would close a pretty crucial security hole where archived content can mess with the execution of extractors (and potentially run arbitrary shell scripts if they chain together a series of injection attacks).
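A minimal sketch of how such a two-origin config might be loaded. The option names HTTP_DIRTY_LISTEN / HTTP_ADMIN_LISTEN come from the comment above, but the parsing and validation logic here is purely illustrative, not ArchiveBox's implementation:

```python
from urllib.parse import urlsplit

def load_listen_config(env: dict) -> dict:
    """Read the two listen URLs and enforce that the admin UI and the
    'dirty' archived content are served from different hostnames."""
    admin = env.get("HTTP_ADMIN_LISTEN", "http://127.0.0.1:8000")
    dirty = env.get("HTTP_DIRTY_LISTEN", "http://127.0.0.1:8001")
    # The whole point is origin separation; per the comment above, full
    # separate domains (not just ports or subdomains) are likely required,
    # so reject any config where the two share a hostname.
    if urlsplit(admin).hostname == urlsplit(dirty).hostname:
        raise ValueError("admin and dirty content must be on different hosts")
    return {"admin": admin, "dirty": dirty}

cfg = load_listen_config({
    "HTTP_ADMIN_LISTEN": "https://demo.archivebox.io",
    "HTTP_DIRTY_LISTEN": "https://demousercontent.archivebox.io",
})
```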
Semi-Related, using sandbox iframes for replay: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Sec-Fetch-Mode
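To illustrate the sandboxed-iframe replay idea from the MDN link above: archived pages could be served behind a small wrapper page that embeds the real snapshot in a sandboxed frame. The function and attribute choices below are this sketch's assumptions, not ArchiveBox's replay implementation:

```python
import html

def sandbox_wrapper(snapshot_url: str, title: str) -> str:
    """Return a wrapper page embedding a snapshot in a sandboxed iframe.
    'allow-scripts' lets archived JS run, while the deliberate absence of
    'allow-same-origin' keeps the frame in an opaque origin, so its scripts
    cannot read the parent index or other snapshots' data."""
    safe_title = html.escape(title, quote=True)
    safe_url = html.escape(snapshot_url, quote=True)
    return (
        "<!doctype html>"
        f"<title>{safe_title}</title>"
        f'<iframe sandbox="allow-scripts" src="{safe_url}" '
        'style="border:0;width:100%;height:100vh"></iframe>'
    )

page = sandbox_wrapper("/archive/1557849021/index.html", "Example snapshot")
```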
Extractor methods that replay JS:
Proposed behavior:
config option to enable bypassing sandboxing:
DANGER_ALLOW_BYPASSING_SANDOX=True/False

@FiloSottile commented on GitHub (Jun 21, 2021):
I talked about the ArchiveBox scenario with a couple of experts, and we came up with a better option than <iframe sandbox>: the Content-Security-Policy: sandbox header, which instructs the browser to treat the load as its own unique origin. This is much more robust and convenient than detecting iframe loads.
We also went through the list of security headers to pick the ones that would protect ArchiveBox pages from Spectre, too. They should involve no maintenance.
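The comment doesn't enumerate the header list, so the set below is this editor's reading of the suggestion (CSP sandbox for per-page origin isolation, plus the cross-origin isolation headers commonly recommended against Spectre-style leaks), framed as a framework-agnostic sketch rather than a confirmed ArchiveBox configuration:

```python
# Headers to attach to every archived-content response (assumed set, see above).
ARCHIVE_CONTENT_HEADERS = {
    # Treat each archived page as its own unique, opaque origin:
    "Content-Security-Policy": "sandbox allow-scripts allow-forms",
    # Keep other origins from embedding or reading these responses:
    "Cross-Origin-Resource-Policy": "same-origin",
    "Cross-Origin-Opener-Policy": "same-origin",
    "Cross-Origin-Embedder-Policy": "require-corp",
    "X-Content-Type-Options": "nosniff",
}

def add_security_headers(headers: dict) -> dict:
    """Merge the isolation headers into an outgoing response's headers,
    leaving any existing non-conflicting headers untouched."""
    merged = dict(headers)
    merged.update(ARCHIVE_CONTENT_HEADERS)
    return merged
```

Being static values, these are set-and-forget, which matches the "no maintenance" point above.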
On top of that, it would still be a good idea to have the admin API on a different origin (a different subdomain is enough), and make its cookie SameSite=Strict.

This should stop any cross-contamination between archived pages, but it won't stop them from detecting other archived pages. That might be possible, but it will require more complex server logic.
@agnosticlines commented on GitHub (Jul 24, 2022):
Hi! Sorry to post on such an old issue; just wondering whether this is going to be implemented? I'd love to be able to use WARC instead of SingleFile.