[GH-ISSUE #80] Autoscroll before archiving and take full-height screenshots #3076

Open
opened 2026-03-14 20:55:04 +03:00 by kerem · 8 comments

Originally created by @pirate on GitHub (Jun 19, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/80

I've submitted a feature request on the Chromium bug tracker for adding a `--full-page` flag: https://bugs.chromium.org/p/chromium/issues/detail?id=854013

Hopefully it gets merged, allowing us to screenshot the full height of pages instead of limiting them to the size set by the `DIMENSIONS` config option.

@pirate commented on GitHub (Mar 15, 2019):

This will be easy with user scripts the moment pyppeteer is merged in #177. Or if we switch to playwright it's also easy using playwright's `--full-page` flag. https://github.com/ArchiveBox/ArchiveBox/issues/51

@mtvu commented on GitHub (Jun 10, 2021):

The code provided in this playwright issue solves the full-page screenshot problem for me
https://github.com/microsoft/playwright/issues/620

Here is the code I use to take a full-page screenshot with playwright:

```
const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({
    channel: 'chrome' // or 'msedge', 'chrome-beta', 'msedge-beta', 'msedge-dev', etc.
  });
  const context = await browser.newContext();
  const page = await context.newPage();

  await page.goto('https://apple.com/');
  // Scroll through the whole page before capturing it.
  await scrollFullPage(page);

  await page.screenshot({
    path: 'apple.png',
    fullPage: true
  });

  await browser.close();
})();

// Scrolls down 100px every 100ms until the total distance scrolled
// reaches the document's current scroll height.
async function scrollFullPage(page) {
  await page.evaluate(async () => {
    await new Promise(resolve => {
      let totalHeight = 0;
      const distance = 100;
      const timer = setInterval(() => {
        // Re-read the height on every tick, since it can grow while scrolling.
        const scrollHeight = document.body.scrollHeight;
        window.scrollBy(0, distance);
        totalHeight += distance;

        if (totalHeight >= scrollHeight) {
          clearInterval(timer);
          resolve();
        }
      }, 100);
    });
  });
}
```

@timdonovanuk commented on GitHub (Jun 11, 2021):

Is this feature natively available now or only via hacking in user scripts?

@pirate commented on GitHub (Jun 11, 2021):

Not available natively yet, it's blocked on https://github.com/ArchiveBox/ArchiveBox/issues/51

@timdonovanuk commented on GitHub (Jun 11, 2021):

Ah fair enough, thanks! Seems like #51 encapsulates a whole ton of effort to make this happen, so thanks and good luck!

@DeoLeung commented on GitHub (Mar 7, 2025):

It would be great to have the ability to take full-height screenshots! Any update on this after 4 years?

@pirate commented on GitHub (Mar 10, 2025):

My conclusion after a lot of work on this issue is that full-page screenshots up to ~8000px maximum height are ok, but many many pages are longer than that, and most common image formats actually don't support images that big. Even the formats that do (png) cause most image viewers to crash when you try to open them. You need to mess with Chrome's GPU memory settings to even get it to take more than 16,000px in one image, let alone the 90,000px+ that some long comment thread pages have.

Multiple screenshots are the better solution. My solution so far is one 4:3 screenshot at the top of the page, and then numbered 16:10 screenshots for like ~15 full-height scrolls down the page. Also works great for feeding it to vision and OCR models for analysis.

I built this ^ more advanced puppeteer based screenshot approach for a paying client last year, and it's still in active development. It's all in TS and ArchiveBox is all Python, so it takes time to bridge that gap, refactor, open source it, document it, package it, ship it, etc. for the public.
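
For illustration only, the tiling layout described above (one 4:3 shot at the top, then numbered 16:10 shots for up to ~15 viewport-height scrolls) could be planned roughly as in the sketch below. This is not the TypeScript implementation mentioned in the comment: the function name `planScreenshots`, the default width of 1440px, the tile naming, and the choice to start the numbered tiles at the very top of the page are all assumptions.

```javascript
// Hypothetical helper (not from ArchiveBox): computes the rectangles to capture
// for a "one overview + numbered tiles" screenshot plan.
// Assumptions: tiles start at y = 0 and step down one tile height at a time.
function planScreenshots(pageHeight, viewportWidth = 1440, maxTiles = 15) {
  const shots = [];

  // One 4:3 overview of the top of the page, cropped if the page is shorter.
  const overviewHeight = Math.round((viewportWidth * 3) / 4);
  shots.push({
    name: 'overview',
    x: 0,
    y: 0,
    width: viewportWidth,
    height: Math.min(overviewHeight, pageHeight),
  });

  // Numbered 16:10 tiles, capped at maxTiles so very long pages stay bounded.
  const tileHeight = Math.round((viewportWidth * 10) / 16);
  const tileCount = Math.min(maxTiles, Math.ceil(pageHeight / tileHeight));
  for (let i = 0; i < tileCount; i++) {
    const y = i * tileHeight;
    shots.push({
      name: `tile-${String(i + 1).padStart(2, '0')}`,
      x: 0,
      y,
      width: viewportWidth,
      // The last tile may be shorter than a full tile.
      height: Math.min(tileHeight, pageHeight - y),
    });
  }
  return shots;
}

// Example: a 90,000px-tall page is capped at 1 overview + 15 tiles.
console.log(planScreenshots(90000, 1600).length);
```

Each rectangle could then be handed to whatever screenshot call is in use. Capping the tile count keeps every image small enough to open in ordinary viewers, at the cost of not covering the bottom of extremely long pages.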

@pirate commented on GitHub (Jan 8, 2026):

`dev` now has an `infiniscroll` plugin out-of-the-box which scrolls the page up to N times and expands comments and detail blocks in the process. It doesn't implement the multiple screenshots approach yet, but it's a step in the right direction.
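
As a rough idea of what "scroll the page up to N times" can look like, here is a minimal sketch of a bounded scroll loop. It is not the `infiniscroll` plugin's code: the function name `scrollUpTo` and the three adapter methods (`getHeight`, `scrollToBottom`, `waitForContent`) are hypothetical, and expanding comments or detail blocks is left out.

```javascript
// Hypothetical sketch, not the infiniscroll plugin's actual implementation.
// `page` is any object exposing three async methods (all names are assumptions):
//   getHeight()       -> current document height in px
//   scrollToBottom()  -> scrolls to the bottom of the document
//   waitForContent()  -> waits briefly so newly triggered content can load
async function scrollUpTo(page, maxScrolls = 10) {
  let lastHeight = await page.getHeight();
  let scrolls = 0;

  while (scrolls < maxScrolls) {
    await page.scrollToBottom();
    await page.waitForContent();
    scrolls += 1;

    const height = await page.getHeight();
    // Stop early once scrolling no longer makes the page any taller.
    if (height <= lastHeight) break;
    lastHeight = height;
  }

  return { scrolls, finalHeight: lastHeight };
}
```

With a Playwright page, the adapter methods could be thin wrappers around `page.evaluate()` (reading `document.body.scrollHeight` and calling `window.scrollTo`) plus a short timeout. The hard cap on scrolls is what keeps truly infinite feeds from running forever.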
