mirror of https://github.com/timf34/pagesource.git synced 2026-04-27 00:05:58 +03:00

pagesource

A Python CLI tool that captures all resources loaded by a webpage (similar to the Sources tab in browser DevTools) and saves them with their original directory structure.

Installation

pip install pagesource

# IMPORTANT: Install Playwright browser after package installation
playwright install chromium

Usage

Basic Usage

# Capture all resources from a webpage
pagesource https://example.com

This saves the captured resources to ./pagesource_output/, preserving the URL directory structure.

Options

# Specify custom output directory
pagesource https://example.com -o ./my-output

# Wait extra time for JavaScript content (useful for SPAs)
pagesource https://example.com --wait 5

# Include external resources (CDN assets, third-party scripts)
pagesource https://example.com --include-external

# Combine options
pagesource https://example.com -o ./output --wait 3 --include-external

CLI Reference

pagesource <url> [OPTIONS]

Arguments:
  url                     URL of the webpage to capture resources from

Options:
  -o, --output PATH       Output directory (default: ./pagesource_output)
  -w, --wait INTEGER      Additional seconds to wait after page load
  -e, --include-external  Include external resources (CDN, third-party)
  -v, --version           Show version and exit
  --help                  Show help message
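For readers who want to see how the options above fit together, here is a minimal sketch of an equivalent parser built with the standard-library argparse module. It is not the tool's actual implementation: the function name build_parser, the default of 0 for --wait, and the placeholder version string are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the CLI reference (not pagesource's own code)."""
    # add_help=False so that only --help is registered, as in the reference.
    parser = argparse.ArgumentParser(prog="pagesource", add_help=False)
    parser.add_argument("url", help="URL of the webpage to capture resources from")
    parser.add_argument(
        "-o", "--output",
        default="./pagesource_output",
        help="Output directory (default: ./pagesource_output)",
    )
    parser.add_argument(
        "-w", "--wait",
        type=int,
        default=0,  # assumed default; the reference does not state one
        help="Additional seconds to wait after page load",
    )
    parser.add_argument(
        "-e", "--include-external",
        action="store_true",
        help="Include external resources (CDN, third-party)",
    )
    parser.add_argument(
        "-v", "--version",
        action="version",
        version="%(prog)s 0.0.0",  # placeholder version string
        help="Show version and exit",
    )
    parser.add_argument("--help", action="help", help="Show help message")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.url, args.output, args.wait, args.include_external)
```

Parsing the "Combine options" example from the Usage section with this sketch yields output set to ./output, wait set to 3, and include_external set to True.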

Output Structure

Resources are saved preserving the URL path structure:

pagesource_output/
└── example.com/
    ├── index.html
    ├── assets/
    │   ├── css/
    │   │   └── style.css
    │   └── js/
    │       └── app.js
    └── images/
        └── logo.png

If --include-external is used, external resources are saved in their own host directories:

pagesource_output/
├── example.com/
│   └── ...
├── cdn.example.com/
│   └── libs/
│       └── library.js
└── fonts.googleapis.com/
    └── css/
        └── font.css
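The layouts above suggest a simple mapping from resource URL to file path: one directory per host, then the URL path. The sketch below illustrates that mapping under stated assumptions and is not taken from the pagesource source. The function names output_path and is_external are hypothetical, directory-like URLs are assumed to map to index.html, and "external" is assumed to mean "served from a different hostname than the page".

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def is_external(resource_url: str, page_url: str) -> bool:
    """Assumed rule: a resource is external if its hostname differs from the page's."""
    return urlsplit(resource_url).hostname != urlsplit(page_url).hostname


def output_path(resource_url: str, root: str = "pagesource_output") -> PurePosixPath:
    """Map a resource URL to <root>/<host>/<url path>, ignoring the query string."""
    parts = urlsplit(resource_url)  # the query string lands in parts.query, not parts.path
    host = parts.hostname or "unknown-host"
    path = parts.path
    if not path or path.endswith("/"):
        path += "index.html"  # assumed filename for directory-like URLs
    return PurePosixPath(root, host, path.lstrip("/"))


if __name__ == "__main__":
    page = "https://example.com/"
    for url in (
        "https://example.com/",
        "https://example.com/assets/css/style.css?v=3",
        "https://cdn.example.com/libs/library.js",
    ):
        print(is_external(url, page), output_path(url))
```

Keeping each host in its own top-level directory means two resources with the same path on different hosts cannot overwrite each other.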

Features

Captures all network resources loaded by the page (HTML, CSS, JS, images, fonts, etc.)
Preserves original directory structure
Handles query strings (strips them from filenames)
Infers file extensions from Content-Type when missing
Handles duplicate filenames
Sanitizes paths for filesystem safety
Optional wait time for JavaScript-heavy pages
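Three of the filename-related features (extension inference, duplicate handling, and path sanitization) can be illustrated with a few small helpers. This is a sketch of one plausible approach, not the tool's actual code: the helper names, the Content-Type table, the set of replaced characters, and the _1, _2 suffix scheme for duplicates are all assumptions.

```python
import mimetypes
import os
import re

# Small explicit table for common types (assumed); mimetypes is the fallback.
COMMON_EXTENSIONS = {
    "text/html": ".html",
    "text/css": ".css",
    "text/javascript": ".js",
    "application/javascript": ".js",
    "application/json": ".json",
    "image/png": ".png",
    "image/svg+xml": ".svg",
}


def sanitize_segment(segment: str) -> str:
    """Replace characters that are unsafe in filenames on common filesystems."""
    cleaned = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "_", segment)
    # Never return an empty name or a directory-traversal segment.
    return "_" if cleaned in ("", ".", "..") else cleaned


def ensure_extension(filename: str, content_type: str) -> str:
    """Append an extension derived from Content-Type if the name has none."""
    if os.path.splitext(filename)[1]:
        return filename
    mime = content_type.split(";", 1)[0].strip().lower()  # drop "; charset=..."
    ext = COMMON_EXTENSIONS.get(mime) or mimetypes.guess_extension(mime) or ""
    return filename + ext


def unique_name(filename: str, taken: set[str]) -> str:
    """Return a name not in `taken`, adding _1, _2, ... before the extension."""
    stem, ext = os.path.splitext(filename)
    candidate, counter = filename, 1
    while candidate in taken:
        candidate = f"{stem}_{counter}{ext}"
        counter += 1
    taken.add(candidate)
    return candidate


if __name__ == "__main__":
    seen: set[str] = set()
    name = ensure_extension(sanitize_segment("font"), "text/css; charset=utf-8")
    print(unique_name(name, seen), unique_name(name, seen))
```

Duplicates can arise once query strings are stripped, since app.js?v=1 and app.js?v=2 both reduce to app.js, so a suffix scheme like the one sketched here keeps both copies.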

Requirements

Python 3.10+
Playwright (with Chromium browser)

License

MIT