S3 Incremental Upload Chunker

A Python utility for efficiently uploading large files to S3-compatible storage services, with intelligent chunking, incremental uploads, and sparse-file optimization. Well suited to backing up disk images, virtual machine files, and other large binary files.

Introduction

The S3 Incremental Upload Chunker is designed to solve the problem of efficiently backing up large files to S3-compatible storage. It provides several key features:

  • Intelligent Chunking: Splits large files into manageable chunks with VMDK block alignment
  • Incremental Uploads: Only uploads changed chunks, saving bandwidth and time
  • Sparse File Optimization: Detects and optimizes sparse (mostly empty) chunks to save storage space
  • Hash Verification: Uses SHA-256 hashing to ensure data integrity
  • S3-Compatible: Works with AWS S3, MinIO, and other S3-compatible services
  • Comprehensive Management: List, delete, and verify backups with detailed statistics

Key Features

  • Incremental Backup: Only uploads chunks that have changed since the last backup
  • Sparse File Detection: Automatically detects and optimizes sparse chunks (mostly zeros)
  • VMDK Optimization: Chunk sizes are aligned to 64KB VMDK blocks for optimal performance
  • Metadata Tracking: Stores comprehensive metadata about each backup
  • Verification Tools: Built-in file verification and integrity checking
  • Flexible Configuration: Environment-based configuration with fallback options
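
The 64KB VMDK alignment mentioned above can be sketched in a few lines; `align_chunk_size` is an illustrative helper, not necessarily the function name used inside chunker_s3.py:

```python
VMDK_BLOCK = 64 * 1024  # 64KB VMDK block size

def align_chunk_size(requested: int, block: int = VMDK_BLOCK) -> int:
    """Round a requested chunk size up to the nearest VMDK block boundary."""
    return ((requested + block - 1) // block) * block

print(align_chunk_size(4_000_000))  # → 4063232 (the next 64KB multiple)
```

Aligning chunk boundaries to the storage format's block size means a change inside one VMDK block never straddles two chunks, which keeps incremental uploads small.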

Installation

Prerequisites

  • Python 3.6 or higher
  • Access to an S3-compatible storage service (AWS S3, MinIO, etc.)

Install Dependencies

pip install -r requirements.txt

Or install manually:

pip install "boto3>=1.26.0"

Configuration

Create a .env file in the project directory with your S3 credentials:

# S3 Configuration
AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here
S3_ENDPOINT_URL=your-s3-endpoint.com
S3_BUCKET_NAME=your-bucket-name

Note: The script will also look for a config.env file if .env is not found.

Usage

The script provides several commands for managing file chunks and backups. All commands support both long-form (--option) and short-form (-o) flags.

Basic Syntax

python chunker_s3.py <command> [options]

Commands

1. Split and Upload (split)

Splits a file into chunks and uploads them to S3 with incremental support.

python chunker_s3.py split [options]

Required Flags:

  • --input-file, --input_file, -i: Path to the input file to split

Optional Flags:

  • --bucket-name, --bucket_name, -b: S3 bucket name (uses config if not specified)
  • --number-of-chunks, --number_of_chunks: Number of chunks to split into
  • --chunk-size, --chunk_size: Size of each chunk in bytes (default: 4MB)
  • --force: Force upload all chunks (disable incremental mode)

Examples:

# Basic split with default 4MB chunks
python chunker_s3.py split --input-file large_file.vmdk --bucket-name my-backups

# Split into 100 chunks
python chunker_s3.py split --input-file disk_image.img --number-of-chunks 100

# Use custom chunk size (8MB)
python chunker_s3.py split --input-file backup.tar.gz --chunk-size 8388608

# Force full upload (ignore existing chunks)
python chunker_s3.py split --input-file file.iso --force

2. Reassemble (reassemble)

Downloads and reassembles chunks from S3 into the original file.

python chunker_s3.py reassemble [options]

Required Flags:

  • --output-file, --output_file, -o: Path for the reassembled output file
  • --prefix, -p: S3 prefix for the chunk files

Optional Flags:

  • --bucket-name, --bucket_name, -b: S3 bucket name (uses config if not specified)

Examples:

# Reassemble a backup
python chunker_s3.py reassemble --output-file restored_file.vmdk --prefix myfile

# Reassemble from specific bucket
python chunker_s3.py reassemble --output-file backup.img --prefix diskimage --bucket-name my-backups

3. List Backups (list)

Lists all available backups in the S3 bucket.

python chunker_s3.py list [options]

Optional Flags:

  • --bucket-name, --bucket_name, -b: S3 bucket name (uses config if not specified)
  • --prefix, -p: Filter backups by prefix

Examples:

# List all backups
python chunker_s3.py list

# List backups in specific bucket
python chunker_s3.py list --bucket-name my-backups

# List backups with specific prefix
python chunker_s3.py list --prefix myfile

4. Delete Backup (delete)

Deletes a backup and all its chunks from S3.

python chunker_s3.py delete [options]

Required Flags:

  • --prefix, -p: Prefix of the backup to delete

Optional Flags:

  • --bucket-name, --bucket_name, -b: S3 bucket name (uses config if not specified)
  • --force: Skip confirmation prompt

Examples:

# Delete a backup (with confirmation)
python chunker_s3.py delete --prefix old-backup

# Delete without confirmation
python chunker_s3.py delete --prefix temp-backup --force

# Delete from specific bucket
python chunker_s3.py delete --prefix backup --bucket-name my-backups

5. Verify Files (verify)

Verifies that a reassembled file matches the original file.

python chunker_s3.py verify [options]

Required Flags:

  • --original-file, --original_file: Path to the original file
  • --reassembled-file, --reassembled_file: Path to the reassembled file

Optional Flags:

  • --chunk-size, --chunk_size: Chunk size for comparison (default: 4MB)

Examples:

# Verify a reassembled file
python chunker_s3.py verify --original-file original.vmdk --reassembled-file restored.vmdk

# Verify with custom chunk size
python chunker_s3.py verify --original-file file.img --reassembled-file restored.img --chunk-size 8388608

Advanced Usage

Environment Variables

You can also set configuration via environment variables:

export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export S3_ENDPOINT_URL=your-s3-endpoint.com
export S3_BUCKET_NAME=your-bucket-name

Sparse File Optimization

The script automatically detects sparse chunks (mostly zeros) and optimizes them:

  • Sparse chunks are stored as small placeholders (6 bytes) instead of full chunks
  • Original size is preserved in metadata for accurate reconstruction
  • Significant storage savings for disk images and virtual machine files
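
A sparse check can be as simple as counting zero bytes. The 99% threshold below is illustrative only; the exact threshold chunker_s3.py uses is not documented here:

```python
def is_sparse(chunk: bytes, threshold: float = 0.99) -> bool:
    """Treat a chunk as sparse when at least `threshold` of its bytes are zero."""
    if not chunk:
        return False
    return chunk.count(0) / len(chunk) >= threshold
```

A chunk flagged this way is replaced by a tiny placeholder object, while the metadata keeps its original size so reassembly can write the zeros back.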

Incremental Uploads

  • Only uploads chunks that have changed since the last backup
  • Uses SHA-256 hashing to detect changes
  • Maintains metadata about chunk types and sizes
  • Provides detailed statistics about upload efficiency
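
The change-detection step can be sketched as a hash comparison against the previous backup's metadata (`chunks_to_upload` is a hypothetical helper, not the script's API):

```python
import hashlib

def chunks_to_upload(chunks, previous_hashes):
    """Given (index, data) pairs and a {index: sha256_hex} map from the
    last backup, return the indices whose content changed or is new."""
    changed = []
    for index, data in chunks:
        digest = hashlib.sha256(data).hexdigest()
        if previous_hashes.get(index) != digest:
            changed.append(index)
    return changed
```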

Output and Statistics

The script provides detailed output including:

  • Upload progress with chunk type information
  • Storage efficiency statistics
  • Sparse file optimization savings
  • Upload speed and duration
  • File verification results

Example output:

[INFO] Starting incremental upload of large_file.vmdk
[DATA] myfile_chunk_0000001 (4,194,304 bytes, hash: a1b2c3d4...)
[SPARSE] myfile_chunk_0000002 (4,194,304 bytes -> 6 bytes placeholder, 99.9% saved)
[SKIP] myfile_chunk_0000003 unchanged (data, hash: e5f6a7b8...)

=== Upload Complete ===
Total chunks: 1000
  - Data chunks: 750
  - Sparse chunks: 250
Uploaded chunks: 200
Skipped chunks: 800
Incremental efficiency: 80.0% data reduction
Sparse optimization: 15.2% storage reduction
Total efficiency: 87.5% storage reduction
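
The "Incremental efficiency" figure above is simply the skipped fraction of chunks; in the example run, 800 of 1,000 chunks were skipped:

```python
def incremental_efficiency(total_chunks: int, uploaded_chunks: int) -> float:
    """Percentage of chunks skipped thanks to incremental mode."""
    skipped = total_chunks - uploaded_chunks
    return 100.0 * skipped / total_chunks

print(incremental_efficiency(1000, 200))  # → 80.0, matching the example output
```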

Contribution

We welcome contributions to improve the S3 Incremental Upload Chunker! Here's how you can help:

Development Setup

  1. Fork the repository and clone your fork
  2. Create a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install dependencies:
    pip install -r requirements.txt
    
  4. Create a test configuration:
    cp .env.sample .env
    # Edit .env with your test S3 credentials
    

Contributing Tasks

🐛 Bug Reports

  • Test the issue with the latest version
  • Provide detailed reproduction steps
  • Include error messages and logs
  • Specify your Python version and OS

💡 Feature Requests

  • Check existing issues first
  • Describe the use case and expected behavior
  • Consider backward compatibility
  • Provide implementation ideas if possible

🔧 Code Contributions

  • Follow the existing code style
  • Add tests for new features
  • Update documentation as needed
  • Ensure all tests pass

📚 Documentation

  • Improve README clarity
  • Add more usage examples
  • Document edge cases
  • Translate to other languages

🧪 Testing

  • Test with different file types and sizes
  • Verify S3-compatible service compatibility
  • Test error handling scenarios
  • Performance testing with large files

Development Guidelines

  1. Code Style: Follow PEP 8 Python style guidelines
  2. Testing: Test your changes with various file types
  3. Documentation: Update README for new features
  4. Commits: Use clear, descriptive commit messages
  5. Pull Requests: Provide detailed description of changes

Reporting Issues

When reporting issues, please include:

  • Python version (python --version)
  • Operating system
  • S3 service being used (AWS S3, MinIO, etc.)
  • Complete error messages
  • Steps to reproduce
  • Sample files (if applicable)

Getting Help

  • Check existing issues and discussions
  • Create a new issue with detailed information
  • Provide minimal reproduction examples
  • Include relevant configuration details

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built with boto3 for AWS S3 compatibility
  • Designed for efficient backup of large files and disk images
  • Optimized for VMDK and other virtual machine file formats