[GH-ISSUE #51] Add ability to run JS scripts during archiving with Playwright/Puppeteer

kerem commented

2026-03-01 17:51:36 +03:00

Owner

Originally created by @pirate on GitHub (Nov 2, 2017).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/51

https://github.com/GoogleChrome/puppeteer is fantastic for scripting actions on pages before making a screenshot or PDF.

I could add support for custom Puppeteer scripts for certain URLs that need a user action to be performed before archiving (e.g. logging in or closing a welcome message popup).

Puppeteer code looks like this:

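        const puppeteer = require('puppeteer')

        // await is only valid inside an async function, so wrap the steps in one
        ;(async () => {
            const browser = await puppeteer.launch({headless: false})
            const page = await browser.newPage()

            await page.goto('https://carbon.now.sh')

            // click into the code editor, select everything (Meta+A), then delete it
            const code_input = 'div.ReactCodeMirror div.CodeMirror-code > pre:nth-child(11)'
            await page.click(code_input)
            await page.keyboard.down('Meta')
            await page.keyboard.down('a')
            await page.keyboard.up('a')
            await page.keyboard.up('Meta')
            await page.keyboard.press('Backspace')
        })()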
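
Since the idea is to run custom scripts only for "certain urls", one possible way to pick them is a table of URL patterns. This is a minimal sketch, not an ArchiveBox API: the `URL_SCRIPTS` table, the script file names, and `scripts_for_url` are all hypothetical and exist only for illustration.

```python
from fnmatch import fnmatch

# Hypothetical table of URL glob patterns and the scripts to run for each match.
URL_SCRIPTS = [
    ("https://carbon.now.sh/*", ["clear_editor.js"]),
    ("https://*.example.com/*", ["dismiss_welcome_popup.js"]),
    ("*", ["autoscroll.js"]),  # catch-all entry applied to every URL
]


def scripts_for_url(url, table=URL_SCRIPTS):
    """Return the scripts of every pattern that matches the URL, in table order."""
    matched = []
    for pattern, scripts in table:
        if fnmatch(url, pattern):
            matched.extend(scripts)
    return matched


print(scripts_for_url("https://carbon.now.sh/"))
```

Glob patterns keep the table readable for users; a real implementation might prefer regexes or per-domain config instead.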


kerem

2026-03-01 17:51:36 +03:00

closed this issue
added the
size: hard

why: functionality

touches: data/schema/architecture

status: done

expected: next release
labels

kerem commented

2026-03-01 17:51:37 +03:00

Author

Owner

@pirate commented on GitHub (Sep 12, 2018):

Would go well with: https://github.com/checkly/puppeteer-recorder


kerem commented

2026-03-01 17:51:37 +03:00

Author

Owner

@FiloSottile commented on GitHub (Dec 1, 2018):

archive.is has a nice set of scripts that do things like expanding all Reddit threads or scrolling through Twitter timelines before taking a snapshot. This is the kind of thing I've seen a nice community develop around with youtube-dl.


kerem commented

2026-03-01 17:51:37 +03:00

Author

Owner

@pirate commented on GitHub (Mar 15, 2019):

The beginnings of this will start to be implemented with our move from chromium-browser to ~~pyppeteer~~ playwright: #177, then these will be possible:

support for scripted user flows (this ticket)
dismissing GDPR / cookie / subscription / donation popups automatically: #175
autoscroll before archiving with full-page dynamic height screenshots: #80
dynamic/interactive requests saving into the WARC with pyppeteer running through pywb: #130


kerem commented

2026-03-01 17:51:37 +03:00

Author

Owner

@n0ncetonic commented on GitHub (Mar 23, 2019):

I have experience with coding Puppeteer scripts, and I'm willing to either start implementing fixes for #175, #80, and #130 as independent code samples in preparation for pyppeteer, or start a branch that just replicates current functionality with pyppeteer, depending on whether you've started a private branch or prefer to implement it yourself.


kerem commented

2026-03-01 17:51:37 +03:00

Author

Owner

@pirate commented on GitHub (Mar 23, 2019):

Sweet, the super-rough planned design is for ArchiveBox to run user-provided scripts like this:

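import asyncio
from pyppeteer import launch

archive_scripts = {
    # NodeList has no .delete(); remove each matching element instead
    'dismiss_modals': '() => {document.querySelectorAll(".modal").forEach(el => el.remove())}',
    # ...
}


async def archive(links):
    browser = await launch()
    page = await browser.newPage()

    for link in links:
        await page.goto(link['url'])

        # run every user-provided script in the page and record what it returns
        for script_name, script_js in archive_scripts.items():
            link['history'][script_name].append(await page.evaluate(script_js))

        link['history']['screenshot'].append(await page.screenshot({'path': 'screenshot.png'}))
        link['history']['pdf'].append(await page.pdf({'path': 'output.pdf'}))

    await browser.close()

# run with: asyncio.run(archive(links))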

The final implementation will be more fully-featured than this of course. Scripts run in the page context and any output returned gets saved as an ArchiveResult entry like any other extractor.
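
To make the "output gets saved like any other extractor" idea concrete, here is a rough sketch of how each script's return value could be turned into a result record. It assumes nothing about the real ArchiveResult schema: the field names (`extractor`, `status`, `output`, `start`) and the `FakePage` stand-in are hypothetical, and `FakePage` exists only so the example runs without a browser.

```python
from datetime import datetime, timezone


def run_scripts(page, scripts):
    """Evaluate each script on the page and return one result record per script."""
    results = []
    for name, source in scripts.items():
        record = {"extractor": name, "start": datetime.now(timezone.utc).isoformat()}
        try:
            record["output"] = page.evaluate(source)
            record["status"] = "succeeded"
        except Exception as err:  # a broken user script should not abort the snapshot
            record["output"] = str(err)
            record["status"] = "failed"
        results.append(record)
    return results


class FakePage:
    """Hypothetical stand-in for a browser page; only mimics evaluate()."""

    def evaluate(self, source):
        if "throw" in source:
            raise RuntimeError("script error")
        return f"ran {len(source)} chars"


records = run_scripts(FakePage(), {"ok": "() => 1", "bad": "() => { throw 1 }"})
print([r["status"] for r in records])
```

Catching the exception per script means one failing user script is recorded as a failed result instead of stopping the remaining scripts.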


kerem commented

2026-03-01 17:51:37 +03:00

Author

Owner

@n0ncetonic commented on GitHub (Mar 23, 2019):

Alright cool, I will start working on getting that implemented on my fork.

Planning to do this in 3 phases across two milestones which I think align well with the current roadmap.

Phase I. import pyppeteer and replace all current chromium-browser calls with pyppeteer equivalents.

Milestone I. ArchiveBox migrated to pyppeteer

Phase II. Implement minimalist scripting support allowing users to extend browser-based modules using javascript.

Milestone II. Codebase aligned with Roadmap's Long Term Change to allow user-defined scripting of the browser.

Phase III. Bootstrap collection of browser scripts by creating and including

autoscroll_screenshot.js - screenshot capturing of the entire page by autoscrolling #80
anti_detection.js - bypasses detection/blocking of headless browsers by selectively overwriting page-wide getter properties (this is something I have working for a personal project that leveraged Puppeteer)
cookie_accept.js - generic enumeration and dismissal of GDPR/subscription/cookie popups #175

Note: As my primary aim will be to make progress on the Roadmap, #130 will not be a requisite for Phase III completion. Once Phase III is complete and merged into master, a separate Pull Request will address extending WARC generation.

We'll go to next steps (like mimicking archive_methods.py loading of scripts) after Phase III provides a working, basic scripting subsystem
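
As an illustration of what loading a bundled script collection might look like, here is a minimal sketch. It assumes the scripts live as `*.js` files in one directory; the `load_scripts` helper and that directory layout are hypothetical, not something ArchiveBox defines.

```python
from pathlib import Path


def load_scripts(script_dir):
    """Map each script's name (file stem) to its JavaScript source.

    Files are read in sorted order so the run order is deterministic.
    """
    return {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted(Path(script_dir).glob("*.js"))
    }
```

A directory containing `autoscroll_screenshot.js` and `cookie_accept.js` would yield a dict keyed by `autoscroll_screenshot` and `cookie_accept`, which could then be passed to whatever evaluates scripts in the page.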


kerem commented

2026-03-01 17:51:37 +03:00

Author

Owner

@pirate commented on GitHub (Mar 23, 2019):

If possible, work on the Phase III scripts first. Those would be most helpful to me, as I've already started work on the phase I and II steps you outlined above over the last few months.

You can test your scripts using the pyppeteer demo code from their README, and I'll make sure the ArchiveBox API is compatible to work with them.


kerem commented

2026-03-01 17:51:37 +03:00

Author

Owner

@pirate commented on GitHub (Apr 17, 2019):

I found some huge repositories of Selenium/Puppeteer scripts for dismissing modals and logging in to lots of sites. These are going to be super useful:

https://github.com/CriseLYJ/awesome-python-login-model
https://github.com/facert/awesome-spider
https://github.com/duyetdev/awesome-web-scraper
https://github.com/BruceDone/awesome-crawler

kerem commented

2026-03-01 17:51:37 +03:00

Author

Owner

@pirate commented on GitHub (Jan 22, 2021):

Whoops, closed/reopened by accident. Quick update for those following this issue. We have a number of blocking tasks to finish before we get around to this:

1. Finish refactoring extractors into independent plugin-style modules that define their own config and dependencies
2. Refactor ArchiveBox to use a message-passing/event-sourcing architecture so that all tasks are handled by workers listening on queues
3. Create a playwright/puppeteer message queue worker to handle all headless-browser related tasks in a single browser instance (to avoid launching and closing a browser for each URL/extractor run)
4. Define a spec for user-contributed playwright scripts that are callable by the playwright worker during archiving

Lots of work has been done so far to get us to step 1, but we're still at the foothills of what will be required before this feature is ready for prime-time. It's still high up on our list of desired features but don't expect it anytime soon.
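
The queue-worker idea (one browser instance serving many URLs) can be sketched with the standard library alone. This is only an illustration under stated assumptions: `FakeBrowser`, `browser_worker`, and `archive_all` are hypothetical names, and the fake browser stands in for a real playwright/puppeteer instance so the launch count is easy to observe.

```python
import queue
import threading


class FakeBrowser:
    """Hypothetical stand-in for a headless browser; counts launches."""
    launches = 0

    def __init__(self):
        FakeBrowser.launches += 1

    def snapshot(self, url):
        return f"snapshot of {url}"

    def close(self):
        pass


def browser_worker(tasks, results):
    """Consume URLs from the queue, reusing one browser for the worker's lifetime."""
    browser = FakeBrowser()  # launched once, not once per URL
    try:
        while True:
            url = tasks.get()
            if url is None:  # sentinel: no more work
                break
            results.append(browser.snapshot(url))
    finally:
        browser.close()


def archive_all(urls):
    """Feed URLs to a single worker thread and collect its results."""
    tasks, results = queue.Queue(), []
    worker = threading.Thread(target=browser_worker, args=(tasks, results))
    worker.start()
    for url in urls:
        tasks.put(url)
    tasks.put(None)
    worker.join()
    return results
```

Calling `archive_all` with several URLs creates exactly one `FakeBrowser`, which is the property the single-instance worker is meant to provide.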


kerem commented

2026-03-01 17:51:37 +03:00

Author

Owner

@UmutAlihan commented on GitHub (Jun 7, 2021):

Very much looking forward to these features being implemented. Since Cloudflare now effectively blocks content crawling for the many sites that have adopted it (such as Medium), ArchiveBox risks being limited to sites that do not yet use Cloudflare's bot detection systems.

(p.s.: Many sites are transitioning to Cloudflare or similar services. For those sites it is currently very likely that a crawl produces an empty/failing archive like the example below.)


kerem commented

2026-03-01 17:51:37 +03:00

Author

Owner

@GlassedSilver commented on GitHub (Jun 7, 2021):

> Very much looking forward to these features being implemented. Since Cloudflare now effectively blocks content crawling for the many sites that have adopted it (such as Medium), ArchiveBox risks being limited to sites that do not yet use Cloudflare's bot detection systems.
>
> (p.s.: Many sites are transitioning to Cloudflare or similar services. For those sites it is currently very likely that a crawl produces an empty/failing archive like the example below.)
>
> [snipped image]

+1 to that.

Really, REALLY annoying to see Cloudflare being so overly aggressive.

Like sure, you can be hammering down on me if I try to pull pages from sites protected by your network in the hundreds per minute, fine. I get that.

But to blatantly block me only because JS execution doesn't verify my humanness? Sketchy at best!

Edit: btw, I'd strip that IP address from that screenshot of yours if I were you, unless it's dynamic and you're due for a new one soon. ;)

At the very least however it allows someone to possibly (roughly) geolocate you.


kerem commented

2026-03-01 17:51:37 +03:00

Author

Owner

@pirate commented on GitHub (Dec 8, 2021):

I've started mocking up what a playwright-based pluginized refactor would look like for ArchiveBox, and I think it's pretty elegant so far! This is still a ways away, but I'm starting to crystallize what I want the plugin-style interface to be between the browser and the extractors.

Please note almost all the classes are stateless namespaces, but I still need to figure out a more elegant composition solution than all this inheritance madness.

import time
from pathlib import Path

from playwright.sync_api import sync_playwright

CRUCIAL_STEPS = (
    BrowserSetupStep,
    PageLoadStep,
)

MINIMAL_STEPS = (
    BrowserSetupStep,
    PageLoadStep,
    TitleRecorderStep,
    HTMLRecorderStep,
    ScreenshotRecorderStep,
)

ALL_STEPS = (
    BrowserSetupStep,
    ExtensionSetupStep,
    SecuritySetupStep,
    ProxySetupStep,

    DialogInterceptorStep,
    TrafficInterceptorStep,
    
    DownloadRecorderStep,
    ConsoleLogRecorderStep,
    WebsocketRecorderStep,
    TrafficRecorderStep,
    TimingRecorderStep,

    PageLoadStep,

    ScriptRunnerStep,
    TitleRecorderStep,
    HTMLRecorderStep,
    PDFRecorderStep,
    TextRecorderStep,
    StorageRecorderStep,
    ScreenshotRecorderStep,
    VideoRecorderStep,
)

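# NOTE: in a runnable module, the step tuples above and this usage
# must come after the class definitions below.
r = CompleteRunner()
r.run(url='https://example.com', config={})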


class BrowserRunner:
    steps = ()

    # runtime mutable state
    url = None
    browser = None
    context_args = None
    context = None
    page = None
    config = None

    def run(self, url, config=None):
        self.url = url
        self.config = config or {}

        self.setup_browser()
        self.setup_context()
        self.setup_page()
        self.run_pre_load()
        self.run_load()
        self.run_post_load()

    def setup_browser(self):
        for step in self.steps:
            step.setup_browser(runner=self)

        return self.browser

    def setup_context(self):
        for step in self.steps:
            step.setup_context(runner=self)

    def setup_page(self):
        for step in self.steps:
            step.setup_page(runner=self)

    # hook names match both run() above and the BrowserRunnerStep methods
    def run_pre_load(self):
        for step in self.steps:
            step.run_pre_load(runner=self)

    def run_load(self):
        for step in self.steps:
            step.run_load(runner=self)

    def run_post_load(self):
        for step in self.steps:
            step.run_post_load(runner=self)


# the concrete runners come after BrowserRunner, since it is their base class
class EmptyRunner(BrowserRunner):
    steps = CRUCIAL_STEPS

class MinimalRunner(BrowserRunner):
    steps = MINIMAL_STEPS

class CompleteRunner(BrowserRunner):
    steps = ALL_STEPS


class BrowserRunnerStep:
    @staticmethod
    def setup_browser(runner):
        pass

    @staticmethod
    def setup_context(runner):
        return {}

    @staticmethod
    def setup_page(runner):
        pass

    @staticmethod
    def run_pre_load(runner):
        pass

    @staticmethod
    def run_load(runner):
        pass

    @staticmethod
    def run_post_load(runner):
        pass


class BrowserSetupStep(BrowserRunnerStep):
    @staticmethod
    def setup_browser(runner):
        # sync_playwright is a function; start() returns the Playwright object
        runner.browser = sync_playwright().start().chromium

    @staticmethod
    def setup_page(runner):
        runner.context = runner.browser.launch_persistent_context(**runner.context_args)
        runner.page = runner.context.new_page()

    @staticmethod
    def setup_context(runner):
        # dict.update() returns None, so build a merged dict instead
        runner.context_args = {
            **(runner.context_args or {}),
            'executable_path': "path-to-chromium",
            'timeout': 30_000,
        }

class PageLoadStep(BrowserRunnerStep):
    @staticmethod
    def run_load(runner):
        runner.page.goto(runner.url)


class ExtensionSetupStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'args': ["--load-extension=./my-extension"],
        }


class EmulationSetupStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'headless': True,
            'user_agent': runner.config['BROWSER_USER_AGENT'],
            'viewport': { 'width': 1280, 'height': 1024 },
            'has_touch': False,
            'is_mobile': False,
            'device_scale_factor': 2,
            'locale': 'de-DE',
            'timezone_id': 'Europe/Berlin',
            'permissions': ['geolocation', 'notifications'],
            'geolocation': {"latitude": 48.858455, "longitude": 2.294474},
            'color_scheme': 'light',
            # device presets such as 'Pixel 2' live in the .devices dict of the object
            # returned by sync_playwright().start(), not on sync_playwright itself
        }

class SecuritySetupStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'user_agent': 'My user agent',
            'java_script_enabled': True,
            'chromium_sandbox': True,
            'permissions': ['geolocation', 'notifications'],
            'extra_http_headers': '...',
            'bypass_csp': True,
            'ignore_https_errors': True,
        }

class ProxySetupStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'proxy': {
              "server": "http://myproxy.com:3128",
              "username": "usr",
              "password": "pwd",
              "bypass": "github.com,apple.com"
            },
        }


class DialogInterceptorStep(BrowserRunnerStep):
    @staticmethod
    def run_pre_load(runner):
        # handle any dialog boxes
        runner.page.on("dialog", lambda dialog: dialog.accept())

class TrafficInterceptorStep(BrowserRunnerStep):
    @staticmethod
    def run_pre_load(runner):
        # intercept certain requests
        runner.page.route("**/xhr_endpoint", lambda route: route.fulfill(path="mock_data.json"))

class DownloadRecorderStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'accept_downloads': True,
            'downloads_path': '.',
        }

    @staticmethod
    def run_pre_load(runner):
        # handle any download events
        runner.page.on("download", lambda download: print(download.path()))

class ConsoleLogRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_pre_load(runner):
        # save console.log to file
        runner.page.on("console", lambda msg: print(msg.text))

class WebsocketRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_pre_load(runner):
        # handle any websockets opening/closing
        runner.page.on("websocket", lambda websocket: print(
            websocket.url,
            # web_socket.on("close", lambda event: print(event))
            # web_socket.on("framereceived", lambda event: print(event))
            # web_socket.on("framesent", lambda event: print(event))
            # web_socket.on("socketerror", lambda event: print(event))
        ))

class TrafficRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_pre_load(runner):
        # save requests and responses to file
        runner.page.on("request", lambda request: print(">>", request.method, request.url, request.all_headers()))
        runner.page.on("response", lambda response: print(
            "<<",
            response.request.method,
            response.request.url,
            response.request.headers,
            response.status,
            response.status_text,
            response.url,
            response.headers,
        ))

class TimingRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_pre_load(runner):
        # staticmethods have no self, so keep the start time on the runner
        runner.start_time = time.time()

        # measure timing
        runner.page.once("load", lambda *_: print("page loaded!", runner.start_time, time.time()))

class ScriptRunnerStep(BrowserRunnerStep):
    @staticmethod
    def run_post_load(runner):
        # run any scripts in the page
        someresult = runner.page.evaluate('object => object.foo', { 'foo': 'bar' })

        # get page dimensions
        dimensions = runner.page.evaluate('''() => {
          return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio
          }
        }''')

class TitleRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_post_load(runner):
        # get title
        return runner.page.title()

class HTMLRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_post_load(runner):
        # get full page html
        html = runner.page.content()

class TextRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_post_load(runner):
        # get page innerText
        text = runner.page.inner_text("body")

class StorageRecorderStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'user_data_dir': "/tmp/test-user-data-dir",
            'storage_state': "./state.json",
        }

    @staticmethod
    def run_post_load(runner):
        # Save storage state into the file.
        runner.context.storage_state(path="state.json")

class ScreenshotRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_post_load(runner):
        runner.page.screenshot(path='screenshot.png', full_page=True)

class PDFRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_post_load(runner):
        # generates a pdf with "screen" media type.
        runner.page.emulate_media(media="screen")
        runner.page.pdf(path="page.pdf")

class HARRecorderStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'record_har_omit_content': True,
            'record_har_path': './har',
        }

    @staticmethod
    def run_post_load(runner):
        # TODO: save the HAR file path to output dir
        pass


class VideoRecorderStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'record_video_dir': './video',
            'slow_mo': 0,
        }

    @staticmethod
    def run_post_load(runner):
        # save the video path (pathlib has no move_to(); rename() moves the file)
        Path(runner.page.video.path()).rename('./screenrecording')
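
On the "inheritance madness" point, one possible alternative is to compose a runner from step instances and let each step implement only the hooks it needs. The following is a sketch under stated assumptions, not the planned ArchiveBox design: `Runner`, `PHASES`, and the three example steps are hypothetical names, and the steps only write to a log so the example runs without a browser.

```python
PHASES = ("setup_browser", "setup_context", "setup_page",
          "run_pre_load", "run_load", "run_post_load")


class Runner:
    """Runs a list of step objects; a step implements only the hooks it needs."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.context_args = {}
        self.log = []

    def run(self, url, config=None):
        self.url, self.config = url, config or {}
        for phase in PHASES:  # phases always run in this fixed order
            for step in self.steps:
                hook = getattr(step, phase, None)  # a missing hook means "skip"
                if hook is not None:
                    hook(self)


class ProxySetup:
    def setup_context(self, runner):
        # update() mutates in place and returns None, so do not reassign its result
        runner.context_args.update({"proxy": {"server": "http://myproxy.com:3128"}})


class PageLoad:
    def run_load(self, runner):
        runner.log.append(f"goto {runner.url}")


class TitleRecorder:
    def run_post_load(self, runner):
        runner.log.append("title")


runner = Runner([TitleRecorder(), PageLoad(), ProxySetup()])
runner.run("https://example.com")
print(runner.log)
```

Because the phase order is fixed by the runner, the order in which steps are listed only matters within a single phase, and no shared base class of empty hook methods is needed.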

@pirate commented on GitHub (Dec 8, 2021): I've started mocking up what a playwright-based pluginized refactor would look like for ArchiveBox, and I think it's pretty elegant so far! This is still a ways away, but I'm starting to crystalize what I want the plugin-style interface to be between the browser and the extractors. Please note almost all the classes are stateless namespaces, but I still need to figure out a more elegant composition solution than all this inheritance madness. ```python3 from playwright.sync_api import sync_playwright CRUCIAL_STEPS = ( BrowserSetupStep, PageLoadStep, ) MINIMAL_STEPS = ( BrowserSetupStep, PageLoadStep, TitleRecorderStep, HTMLRecorderStep, ScreenshotRecordeStep, ) ALL_STEPS = ( BrowserSetupStep, ExtensionSetupStep, SecuritySetupStep, ProxySetupStep, DialogInterceptorStep, TrafficInterceptorStep, DownloadRecorderStep, ConsoleLogRecorderStep, WebsocketRecorderStep, TrafficRecorderStep, TimingRecorderStep, PageLoadStep, ScriptRunnerStep, TitleRecorderStep, HTMLRecorderStep, PDFRecorderStep, TextRecorderStep, StorageRecorderStep, ScreenshotRecorderStep, VideoRecorderStep, ) r = CompleteRunner() r.run(url='https://example.com') class EmptyRunner(BrowserRunner): steps = CRUCIAL_STEPS class MinimalRunner(BrowserRunner): steps = MINIMAL_STEPS class CompleteRunner(BrowserRunner): steps = ALL_STEPS class BrowserRunner: steps = () # runtime mutable state url = None browser = None context_args = None context = None page = None config = None def run(self, url, config): self.url = url self.config = config self.setup_browser() self.setup_context() self.setup_page() self.run_pre_load() self.run_load() self.run_post_load() def setup_browser(self): for step in self.steps: step.setup_browser(runner=self) return self.browser def setup_context(self): for step in self.steps: step.setup_context(runner=self) def setup_page(self): for step in self.steps: step.setup_page(runner=self) def pre_load(self): for step in self.steps: 
            step.run_pre_load(runner=self)

    def load(self):
        for step in self.steps:
            step.run_load(runner=self)

    def post_load(self):
        for step in self.steps:
            step.run_post_load(runner=self)


class BrowserRunnerStep:
    """Base step: every hook is a no-op unless a subclass overrides it."""

    @staticmethod
    def setup_browser(runner):
        pass

    @staticmethod
    def setup_context(runner):
        return {}

    @staticmethod
    def setup_page(runner):
        pass

    @staticmethod
    def run_pre_load(runner):
        pass

    @staticmethod
    def run_load(runner):
        pass

    @staticmethod
    def run_post_load(runner):
        pass


class BrowserSetupStep(BrowserRunnerStep):
    @staticmethod
    def setup_browser(runner):
        # sync_playwright() must be started before .chromium is available
        runner.playwright = sync_playwright().start()
        runner.browser = runner.playwright.chromium

    @staticmethod
    def setup_page(runner):
        runner.context = runner.browser.launch_persistent_context(**runner.context_args)
        runner.page = runner.context.new_page()

    @staticmethod
    def setup_context(runner):
        # dict.update() returns None, so merge into a new dict instead
        runner.context_args = {
            **(runner.context_args or {}),
            'executable_path': "path-to-chromium",
            'timeout': 30_000,
        }


class PageLoadStep(BrowserRunnerStep):
    @staticmethod
    def run_load(runner):
        # assumes the runner stores the target URL as runner.url
        runner.page.goto(runner.url)


class ExtensionSetupStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'args': ["--load-extension=./my-extension"],
        }


class EmulationSetupStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'headless': True,
            'user_agent': runner.config['BROWSER_USER_AGENT'],
            'viewport': {'width': 1280, 'height': 1024},
            'has_touch': False,
            'is_mobile': False,
            'device_scale_factor': 2,
            'locale': 'de-DE',
            'timezone_id': 'Europe/Berlin',
            'permissions': ['geolocation', 'notifications'],
            'geolocation': {"latitude": 48.858455, "longitude": 2.294474},
            'color_scheme': 'light',
            # device presets override the individual settings above
            **runner.playwright.devices['Pixel 2'],
        }


class SecuritySetupStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'user_agent': 'My user agent',
            'java_script_enabled': True,
            'chromium_sandbox': True,
            'permissions': ['geolocation', 'notifications'],
            'extra_http_headers': '...',  # placeholder: expects a dict of header name -> value
            'bypass_csp': True,
            'ignore_https_errors': True,
        }


class ProxySetupStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'proxy': {
                "server": "http://myproxy.com:3128",
                "username": "usr",
                "password": "pwd",
                "bypass": "github.com,apple.com",
            },
        }


class DialogInterceptorStep(BrowserRunnerStep):
    @staticmethod
    def run_pre_load(runner):
        # handle any dialog boxes
        runner.page.on("dialog", lambda dialog: dialog.accept())


class TrafficInterceptorStep(BrowserRunnerStep):
    @staticmethod
    def run_pre_load(runner):
        # intercept certain requests
        runner.page.route("**/xhr_endpoint", lambda route: route.fulfill(path="mock_data.json"))


class DownloadRecorderStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'accept_downloads': True,
            'downloads_path': '.',
        }

    @staticmethod
    def run_pre_load(runner):
        # handle any download events
        runner.page.on("download", lambda download: print(download.path()))


class ConsoleLogRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_pre_load(runner):
        # save console.log to file
        runner.page.on("console", lambda msg: print(msg.text))


class WebsocketRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_pre_load(runner):
        # handle any websockets opening/closing
        runner.page.on("websocket", lambda websocket: print(
            websocket.url,
            # web_socket.on("close", lambda event: print(event))
            # web_socket.on("framereceived", lambda event: print(event))
            # web_socket.on("framesent", lambda event: print(event))
            # web_socket.on("socketerror", lambda event: print(event))
        ))


class TrafficRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_pre_load(runner):
        # save requests and responses to file
        runner.page.on("request", lambda request: print(">>", request.method, request.url, request.all_headers()))
        runner.page.on("response", lambda response: print(
            "<<",
            response.request.method,
            response.request.url,
            response.request.headers,
            response.status,
            response.status_text,
            response.url,
            response.headers,
        ))


class TimingRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_pre_load(runner):
        # measure timing (requires `import time`); a staticmethod has no self,
        # so the start time is stored on the runner
        runner.start_time = time.time()
        runner.page.once("load", lambda _page: print("page loaded!", runner.start_time, time.time()))


class ScriptRunnerStep(BrowserRunnerStep):
    @staticmethod
    def run_post_load(runner):
        # run any scripts in the page
        someresult = runner.page.evaluate('object => object.foo', {'foo': 'bar'})

        # get page dimensions
        dimensions = runner.page.evaluate('''() => {
            return {
                width: document.documentElement.clientWidth,
                height: document.documentElement.clientHeight,
                deviceScaleFactor: window.devicePixelRatio
            }
        }''')


class TitleRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_post_load(runner):
        # get title
        return runner.page.title()


class HTMLRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_post_load(runner):
        # get full page html
        html = runner.page.content()


class TextRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_post_load(runner):
        # get page innerText
        text = runner.page.inner_text("body")


class StorageRecorderStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        # NOTE: storage_state is an option of browser.new_context(); a persistent
        # context keeps its state in user_data_dir instead
        runner.context_args = {
            **(runner.context_args or {}),
            'user_data_dir': "/tmp/test-user-data-dir",
            'storage_state': "./state.json",
        }

    @staticmethod
    def run_post_load(runner):
        # Save storage state into the file.
        runner.context.storage_state(path="state.json")


class ScreenshotRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_post_load(runner):
        runner.page.screenshot(path='screenshot.png', full_page=True)


class PDFRecorderStep(BrowserRunnerStep):
    @staticmethod
    def run_post_load(runner):
        # generates a pdf with "screen" media type.
        runner.page.emulate_media(media="screen")
        runner.page.pdf(path="page.pdf")


class HARRecorderStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'record_har_omit_content': True,
            'record_har_path': './har',
        }

    @staticmethod
    def run_post_load(runner):
        # TODO: save the HAR file path to output dir
        pass


class VideoRecorderStep(BrowserRunnerStep):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            'record_video_dir': './video',
            'slow_mo': 0,
        }

    @staticmethod
    def run_post_load(runner):
        # save the video path (requires `from pathlib import Path`)
        Path(runner.page.video.path()).rename('./screenrecording')
```
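Each `setup_context` hook above contributes keyword arguments to one shared `context_args` dict before the browser context is launched. Here is a minimal, browser-free sketch of that merging. The names `Step`, `ProxyStep`, `HARStep` and `build_context_args` are hypothetical stand-ins for illustration, not the actual refactor:

```python
from types import SimpleNamespace


class Step:
    """Base step: the hook is a no-op unless a subclass overrides it."""

    @staticmethod
    def setup_context(runner):
        pass


class ProxyStep(Step):
    @staticmethod
    def setup_context(runner):
        # Build a new merged dict. dict.update() returns None, so its
        # result must never be assigned back to runner.context_args.
        runner.context_args = {
            **(runner.context_args or {}),
            "proxy": {"server": "http://myproxy.com:3128"},
        }


class HARStep(Step):
    @staticmethod
    def setup_context(runner):
        runner.context_args = {
            **(runner.context_args or {}),
            "record_har_path": "./har",
        }


def build_context_args(steps):
    """Run each step's setup_context hook in order and return the merged kwargs."""
    runner = SimpleNamespace(context_args=None)
    for step in steps:
        step.setup_context(runner)
    return runner.context_args or {}


if __name__ == "__main__":
    print(build_context_args([ProxyStep, HARStep]))
```

Because later steps are merged on top of earlier ones, the order of the step list decides which step wins when two of them set the same key.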

kerem commented

2026-03-01 17:51:38 +03:00

Author

Owner

@pellaeon commented on GitHub (Jan 8, 2022):

@pirate Do you have the playwright-based refactor in a public branch? I'd love to contribute if possible. :-)


kerem commented

2026-03-01 17:51:38 +03:00

Author

Owner

@pirate commented on GitHub (Jan 8, 2022):

Not yet but soon! It's just in a gist right now. Will publish it once I've moved >50% of the old codebase into the new structure. I'm traveling in Mexico right now with limited work time but will keep everyone posted as it progresses!

The new design is quite exciting, I'm able to add new features as plugins with <10min of boilerplate work per feature.

https://gist.github.com/pirate/7193ab54557b051aa1e3a83191b69793


kerem commented

2026-03-01 17:51:38 +03:00

Author

Owner

@pirate commented on GitHub (Nov 18, 2022):

Useful scripts for testing and evading archivebox-bot detection blocking with playwright/puppeteer in the future:

- https://github.com/niespodd/browser-fingerprinting#technical-insights-into-bypassing-bot-detection
- https://github.com/matomo-org/device-detector/blob/master/README.md

I'm also leaning towards implementing this using Conifer/Rhizome's well-defined spec for scripted archiving behaviors here: https://github.com/webrecorder/browsertrix-behaviors/blob/main/docs/TUTORIAL.md
Their behaviors framework is also open-source, so if we are compatible with their behaviors our communities can share behavior JS scripts and help grow the capabilities of both tools simultaneously 🎉

kerem commented

2026-03-01 17:51:38 +03:00

Author

Owner

@pirate commented on GitHub (Apr 12, 2023):

Chrome now supports a new framework-agnostic JSON user flow export from the DevTools recording pane. I'd like to use this format if possible instead of playwright/puppeteer directly.

Waiting for a response to see if playwright will implement replay support for it. If so browsertrix-crawler will get support soon after, and I'm likely to just build on top of browsertrix crawler.

Related:

- https://github.com/microsoft/playwright/issues/22345
- https://github.com/webrecorder/browsertrix-crawler/issues/283
- https://developer.chrome.com/blog/new-in-devtools-101/#recorder
- https://developer.chrome.com/blog/new-in-devtools-92/#puppeteer-recorder
- https://developer.chrome.com/docs/devtools/recorder/reference/

Otherwise we could also add a custom recorder format for ArchiveBox to the archivebox-extension so that export scripts can be generated directly in our own format (but I prefer the JSON approach above instead^):

- https://developer.chrome.com/blog/extend-recorder/

Part of why this feature is taking so long is that I think all of the solutions for automating JS browser scripting right now are extremely high-maintenance/brittle, and I've been waiting for the ecosystem to mature so as not to overload limited time to work on ArchiveBox by adding a new brittle stack of things I have to maintain.
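To illustrate what consuming such an export might involve: a Recorder JSON flow is, roughly, a title plus an ordered list of typed steps. The sketch below parses a trimmed, hypothetical flow and maps each step to an abstract action tuple that a replay backend could later execute. The example flow, the `plan_actions` helper and the action names are assumptions for illustration. Only three step types are handled, and real exports carry more fields per step.

```python
import json

# Hypothetical, trimmed example in the general shape of a Recorder export.
EXAMPLE_FLOW = """
{
  "title": "close welcome popup",
  "steps": [
    {"type": "setViewport", "width": 1280, "height": 1024},
    {"type": "navigate", "url": "https://example.com/"},
    {"type": "click", "selectors": [["#welcome-popup button.close"]]}
  ]
}
"""


def plan_actions(flow_json):
    """Translate a JSON user flow into a list of backend-neutral action tuples."""
    flow = json.loads(flow_json)
    actions = []
    for step in flow.get("steps", []):
        kind = step.get("type")
        if kind == "setViewport":
            actions.append(("set_viewport", step["width"], step["height"]))
        elif kind == "navigate":
            actions.append(("goto", step["url"]))
        elif kind == "click":
            # Each step lists alternative selectors; this sketch simply
            # takes the first part of the first alternative.
            actions.append(("click", step["selectors"][0][0]))
        else:
            # Keep unknown step types visible instead of silently dropping them.
            actions.append(("unsupported", kind))
    return actions


if __name__ == "__main__":
    for action in plan_actions(EXAMPLE_FLOW):
        print(action)
```

Keeping the plan as plain tuples means the same flow could be replayed by playwright, puppeteer, or another driver without changing the parser.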

kerem commented

2026-03-01 17:51:38 +03:00

Author

Owner

@pirate commented on GitHub (Mar 21, 2024):

I've been doing some work for paying [ArchiveBox consulting](https://zulip.archivebox.io/#narrow/stream/167-enterprise/topic/welcome/near/1191102) clients to implement advanced puppeteer-based archiving. Here's an overview of what I'm running for them (all of these are implemented and working well right now), with more on the way:

<img width="905" alt="image" src="https://github.com/ArchiveBox/ArchiveBox/assets/511499/718c1c7f-4f17-43a9-93d5-31109fbfa76d">

My clients (all still non-profits) pay for the time I need to learn about and implement these features, so getting it working for their immediate needs is my priority, but the plan is to integrate these new features back into the main open-source ArchiveBox codebase so everyone can benefit!

kerem commented

2026-03-01 17:51:38 +03:00

Author

Owner

@a10kiloham commented on GitHub (May 15, 2024):

For what it's worth this functionality is quite similar to how ChangeDetection works.
https://github.com/dgtlmoon/changedetection.io/tree/master/changedetectionio/content_fetchers
It's very useful to just plug in an additional service into the docker-compose for browserless.io and then pass the websocket URL to the app and get a full browser experience totally headlessly on Ubuntu.

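The remote-browser setup described above could look roughly like this with playwright's Python API: read a websocket URL from the environment and attach to that browser over CDP, falling back to a local launch when none is set. The `BROWSER_WS_ENDPOINT` variable name and both helper functions are hypothetical. `connect_over_cdp` is playwright's call for attaching to an already-running Chromium, and whether a given browserless deployment accepts it as-is is an assumption here.

```python
import os


def browser_ws_endpoint(env=None):
    """Return the remote browser's websocket URL, or None to launch locally."""
    env = os.environ if env is None else env
    return env.get("BROWSER_WS_ENDPOINT") or None


def fetch_title(url, ws_endpoint=None):
    """Load a page in a remote (or local) Chromium and return its title."""
    # Imported lazily so this module loads even without playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        if ws_endpoint:
            # Attach to a browser running in another container/service.
            browser = p.chromium.connect_over_cdp(ws_endpoint)
        else:
            browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        title = page.title()
        browser.close()
        return title


if __name__ == "__main__":
    print(fetch_title("https://example.com/", browser_ws_endpoint()))
```

With docker-compose, the variable would point at the browser service's hostname, so the archiving container itself needs no Chromium install.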

kerem commented

2026-03-01 17:51:38 +03:00

Author

Owner

@pirate commented on GitHub (Dec 29, 2025):

This is now implemented on `dev`: the `archivebox/plugins/chrome` plugin sets up a persistent Chrome tab for the snapshot, and the other plugins use it for their archiving. It's easy to add JS scripts using the new hook system in `archivebox/hooks.py`; for an example, see `archivebox/plugins/infiniscroll`.

kerem referenced this issue

2026-03-01 17:58:56 +03:00

[GH-ISSUE #1544] Bug: failing to start supervisord in v0.8.5rc44 #2425

kerem referenced this issue

2026-03-15 01:01:43 +03:00

[GH-ISSUE #1544] Bug: failing to start supervisord in v0.8.5rc44 #3935


[GH-ISSUE #51] Add ability to run JS scripts during archiving with Playwright/Puppeteer #1544