S3 Incremental Upload Chunker
A powerful Python utility for efficiently uploading large files to S3-compatible storage services with intelligent chunking, incremental uploads, and sparse file optimization. Perfect for backing up disk images, virtual machine files, and other large binary files.
Introduction
The S3 Incremental Upload Chunker is designed to solve the problem of efficiently backing up large files to S3-compatible storage. It provides several key features:
- Intelligent Chunking: Splits large files into manageable chunks with VMDK block alignment
- Incremental Uploads: Only uploads changed chunks, saving bandwidth and time
- Sparse File Optimization: Detects and optimizes sparse (mostly empty) chunks to save storage space
- Hash Verification: Uses SHA-256 hashing to ensure data integrity
- S3-Compatible: Works with AWS S3, MinIO, and other S3-compatible services
- Comprehensive Management: List, delete, and verify backups with detailed statistics
Key Features
- Incremental Backup: Only uploads chunks that have changed since the last backup
- Sparse File Detection: Automatically detects and optimizes sparse chunks (mostly zeros)
- VMDK Optimization: Chunk sizes are aligned to 64KB VMDK blocks for optimal performance
- Metadata Tracking: Stores comprehensive metadata about each backup
- Verification Tools: Built-in file verification and integrity checking
- Flexible Configuration: Environment-based configuration with fallback options
Installation
Prerequisites
- Python 3.6 or higher
- Access to an S3-compatible storage service (AWS S3, MinIO, etc.)
Install Dependencies
pip install -r requirements.txt
Or install manually:
pip install "boto3>=1.26.0"
Configuration
Create a .env file in the project directory with your S3 credentials:
# S3 Configuration
AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here
S3_ENDPOINT_URL=your-s3-endpoint.com
S3_BUCKET_NAME=your-bucket-name
Note: The script will also look for a config.env file if .env is not found.
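The fallback described in the note can be pictured as a small lookup. A minimal sketch of that logic, assuming the two file names above (the actual parsing code in chunker_s3.py may differ):

```python
import os

def find_config_file():
    """Return the first config file that exists, mirroring the
    .env -> config.env fallback described above (illustrative only)."""
    for candidate in (".env", "config.env"):
        if os.path.exists(candidate):
            return candidate
    return None  # fall back to plain environment variables
```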
Usage
The script provides several commands for managing file chunks and backups. All commands support both long-form (--option) and short-form (-o) flags.
Basic Syntax
python chunker_s3.py <command> [options]
Commands
1. Split and Upload (split)
Splits a file into chunks and uploads them to S3 with incremental support.
python chunker_s3.py split [options]
Required Flags:
--input-file, --input_file, -i: Path to the input file to split
Optional Flags:
--bucket-name, --bucket_name, -b: S3 bucket name (uses config if not specified)
--number-of-chunks, --number_of_chunks: Number of chunks to split into
--chunk-size, --chunk_size: Size of each chunk in bytes (default: 4MB)
--force: Force upload all chunks (disable incremental mode)
Examples:
# Basic split with default 4MB chunks
python chunker_s3.py split --input-file large_file.vmdk --bucket-name my-backups
# Split into 100 chunks
python chunker_s3.py split --input-file disk_image.img --number-of-chunks 100
# Use custom chunk size (8MB)
python chunker_s3.py split --input-file backup.tar.gz --chunk-size 8388608
# Force full upload (ignore existing chunks)
python chunker_s3.py split --input-file file.iso --force
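When --number-of-chunks is given, a chunk size has to be derived from the file size, and the 64KB VMDK alignment mentioned under Key Features suggests rounding that size up to a 64KB boundary. A minimal sketch of that arithmetic (the helper name and rounding policy are assumptions, not the script's actual code):

```python
import os

VMDK_BLOCK = 64 * 1024  # 64KB VMDK block size

def aligned_chunk_size(input_file, number_of_chunks):
    """Derive a chunk size from a target chunk count, rounded up
    to a 64KB boundary (illustrative; chunker_s3.py may differ)."""
    file_size = os.path.getsize(input_file)
    raw = -(-file_size // number_of_chunks)    # ceiling division
    return -(-raw // VMDK_BLOCK) * VMDK_BLOCK  # round up to the next 64KB
```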
2. Reassemble (reassemble)
Downloads and reassembles chunks from S3 into the original file.
python chunker_s3.py reassemble [options]
Required Flags:
--output-file, --output_file, -o: Path for the reassembled output file
--prefix, -p: S3 prefix for the chunk files
Optional Flags:
--bucket-name, --bucket_name, -b: S3 bucket name (uses config if not specified)
Examples:
# Reassemble a backup
python chunker_s3.py reassemble --output-file restored_file.vmdk --prefix myfile
# Reassemble from specific bucket
python chunker_s3.py reassemble --output-file backup.img --prefix diskimage --bucket-name my-backups
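Conceptually, reassembly downloads the chunks in key order and expands sparse placeholders back to zeros. A sketch of that flow, assuming the zero-padded key naming shown in the example output later in this README (myfile_chunk_0000001) and hypothetical metadata keys ('chunk-type', 'original-size') that stand in for whatever chunker_s3.py actually stores:

```python
import boto3

def reassemble(bucket, prefix, output_file, s3=None):
    """Download chunks in key order and expand sparse placeholders.
    Metadata key names here are assumptions, not the script's real layout."""
    s3 = s3 or boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    with open(output_file, "wb") as out:
        for page in paginator.paginate(Bucket=bucket, Prefix=f"{prefix}_chunk_"):
            for item in page.get("Contents", []):  # zero-padded keys sort in chunk order
                obj = s3.get_object(Bucket=bucket, Key=item["Key"])
                meta = obj["Metadata"]
                if meta.get("chunk-type") == "sparse":
                    out.write(b"\x00" * int(meta["original-size"]))
                else:
                    out.write(obj["Body"].read())
```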
3. List Backups (list)
Lists all available backups in the S3 bucket.
python chunker_s3.py list [options]
Optional Flags:
--bucket-name, --bucket_name, -b: S3 bucket name (uses config if not specified)
--prefix, -p: Filter backups by prefix
Examples:
# List all backups
python chunker_s3.py list
# List backups in specific bucket
python chunker_s3.py list --bucket-name my-backups
# List backups with specific prefix
python chunker_s3.py list --prefix myfile
4. Delete Backup (delete)
Deletes a backup and all its chunks from S3.
python chunker_s3.py delete [options]
Required Flags:
--prefix, -p: Prefix of the backup to delete
Optional Flags:
--bucket-name, --bucket_name, -b: S3 bucket name (uses config if not specified)
--force: Skip confirmation prompt
Examples:
# Delete a backup (with confirmation)
python chunker_s3.py delete --prefix old-backup
# Delete without confirmation
python chunker_s3.py delete --prefix temp-backup --force
# Delete from specific bucket
python chunker_s3.py delete --prefix backup --bucket-name my-backups
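Under the hood, deleting by prefix on S3 typically means listing the matching keys and removing them in batches. A minimal boto3 sketch of that pattern (not necessarily the script's exact implementation):

```python
import boto3

def delete_backup(bucket, prefix):
    """List every object under the prefix and delete in batches of up
    to 1000 keys (the S3 delete_objects limit). Illustrative only."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": item["Key"]} for item in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
```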
5. Verify Files (verify)
Verifies that a reassembled file matches the original file.
python chunker_s3.py verify [options]
Required Flags:
--original-file, --original_file: Path to the original file
--reassembled-file, --reassembled_file: Path to the reassembled file
Optional Flags:
--chunk-size, --chunk_size: Chunk size for comparison (default: 4MB)
Examples:
# Verify a reassembled file
python chunker_s3.py verify --original-file original.vmdk --reassembled-file restored.vmdk
# Verify with custom chunk size
python chunker_s3.py verify --original-file file.img --reassembled-file restored.img --chunk-size 8388608
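Verification amounts to hashing both files chunk by chunk and comparing digests. A self-contained sketch of the idea (the function name is illustrative, not the script's API):

```python
import hashlib

def files_match(original, reassembled, chunk_size=4 * 1024 * 1024):
    """Compare two files chunk by chunk via SHA-256. Returns the index
    of the first mismatching chunk, or -1 if the files are identical."""
    with open(original, "rb") as a, open(reassembled, "rb") as b:
        index = 0
        while True:
            ca, cb = a.read(chunk_size), b.read(chunk_size)
            if hashlib.sha256(ca).digest() != hashlib.sha256(cb).digest():
                return index  # covers length mismatches too
            if not ca:  # both streams exhausted
                return -1
            index += 1
```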
Advanced Usage
Environment Variables
You can also set configuration via environment variables:
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export S3_ENDPOINT_URL=your-s3-endpoint.com
export S3_BUCKET_NAME=your-bucket-name
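These variables map directly onto a boto3 client. A minimal sketch of how such a client could be constructed, keeping in mind that boto3 requires endpoint_url to include a scheme such as https:// (whether chunker_s3.py adds one itself is not shown here):

```python
import os
import boto3

def make_s3_client():
    """Build an S3 client from the environment variables above."""
    endpoint = os.environ["S3_ENDPOINT_URL"]
    if not endpoint.startswith(("http://", "https://")):
        endpoint = "https://" + endpoint  # boto3 needs a full URL
    return boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )
```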
Sparse File Optimization
The script automatically detects sparse chunks (mostly zeros) and optimizes them:
- Sparse chunks are stored as small placeholders (6 bytes) instead of full chunks
- Original size is preserved in metadata for accurate reconstruction
- Significant storage savings for disk images and virtual machine files
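A zeroed chunk can be detected cheaply. A sketch of the idea, where the detection threshold and the 6-byte placeholder value (shown here as a literal b"SPARSE" marker, which happens to be 6 bytes) are assumptions rather than the script's confirmed behavior:

```python
def is_sparse(chunk: bytes, threshold: float = 1.0) -> bool:
    """True when the fraction of zero bytes meets the threshold.
    The real detection rule in chunker_s3.py may differ."""
    return chunk.count(0) >= threshold * len(chunk)

SPARSE_PLACEHOLDER = b"SPARSE"  # hypothetical 6-byte placeholder
```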
Incremental Uploads
- Only uploads chunks that have changed since the last backup
- Uses SHA-256 hashing to detect changes
- Maintains metadata about chunk types and sizes
- Provides detailed statistics about upload efficiency
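The skip decision reduces to comparing a freshly computed SHA-256 digest with the one recorded for the previous backup. A sketch under the assumption that the prior hashes are available as a key-to-digest mapping (how the script actually stores them may differ):

```python
import hashlib

def should_upload(chunk: bytes, previous_hashes: dict, key: str) -> bool:
    """Upload only when the chunk's SHA-256 differs from the digest
    recorded for this key in the last backup (illustrative)."""
    digest = hashlib.sha256(chunk).hexdigest()
    return previous_hashes.get(key) != digest
```

With --force, this check would simply be bypassed and every chunk uploaded.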
Output and Statistics
The script provides detailed output including:
- Upload progress with chunk type information
- Storage efficiency statistics
- Sparse file optimization savings
- Upload speed and duration
- File verification results
Example output:
[INFO] Starting incremental upload of large_file.vmdk
[DATA] myfile_chunk_0000001 (4,194,304 bytes, hash: a1b2c3d4...)
[SPARSE] myfile_chunk_0000002 (4,194,304 bytes -> 6 bytes placeholder, 99.9% saved)
[SKIP] myfile_chunk_0000003 unchanged (data, hash: e5f6a7b8...)
=== Upload Complete ===
Total chunks: 1000
- Data chunks: 750
- Sparse chunks: 250
Uploaded chunks: 200
Skipped chunks: 800
Incremental efficiency: 80.0% data reduction
Sparse optimization: 15.2% storage reduction
Total efficiency: 87.5% storage reduction
Contribution
We welcome contributions to improve the S3 Incremental Upload Chunker! Here's how you can help:
Development Setup
- Fork the repository and clone your fork
- Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Create a test configuration:
cp .env.sample .env  # Edit .env with your test S3 credentials
Contributing Tasks
🐛 Bug Reports
- Test the issue with the latest version
- Provide detailed reproduction steps
- Include error messages and logs
- Specify your Python version and OS
✨ Feature Requests
- Check existing issues first
- Describe the use case and expected behavior
- Consider backward compatibility
- Provide implementation ideas if possible
🔧 Code Contributions
- Follow the existing code style
- Add tests for new features
- Update documentation as needed
- Ensure all tests pass
📚 Documentation
- Improve README clarity
- Add more usage examples
- Document edge cases
- Translate to other languages
🧪 Testing
- Test with different file types and sizes
- Verify S3-compatible service compatibility
- Test error handling scenarios
- Performance testing with large files
Development Guidelines
- Code Style: Follow PEP 8 Python style guidelines
- Testing: Test your changes with various file types
- Documentation: Update README for new features
- Commits: Use clear, descriptive commit messages
- Pull Requests: Provide detailed description of changes
Reporting Issues
When reporting issues, please include:
- Python version (python --version)
- Operating system
- S3 service being used (AWS S3, MinIO, etc.)
- Complete error messages
- Steps to reproduce
- Sample files (if applicable)
Getting Help
- Check existing issues and discussions
- Create a new issue with detailed information
- Provide minimal reproduction examples
- Include relevant configuration details
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with boto3 for AWS S3 compatibility
- Designed for efficient backup of large files and disk images
- Optimized for VMDK and other virtual machine file formats