S3 Incremental Upload Chunker
A powerful Python utility for efficiently uploading large files to S3-compatible storage services with intelligent chunking, incremental uploads, and sparse file optimization. Perfect for backing up disk images, virtual machine files, and other large binary files.
Introduction
The S3 Incremental Upload Chunker is designed to solve the problem of efficiently backing up large files to S3-compatible storage. It provides several key features:
- Intelligent Chunking: Splits large files into manageable chunks with VMDK block alignment
- Incremental Uploads: Only uploads changed chunks, saving bandwidth and time
- Sparse File Optimization: Detects and optimizes sparse (mostly empty) chunks to save storage space
- Hash Verification: Uses SHA-256 hashing to ensure data integrity
- S3-Compatible: Works with AWS S3, MinIO, and other S3-compatible services
- Comprehensive Management: List, delete, and verify backups with detailed statistics
Key Features
- Incremental Backup: Only uploads chunks that have changed since the last backup
- Sparse File Detection: Automatically detects and optimizes sparse chunks (mostly zeros)
- VMDK Optimization: Chunk sizes are aligned to 64KB VMDK blocks for optimal performance
- Metadata Tracking: Stores comprehensive metadata about each backup
- Verification Tools: Built-in file verification and integrity checking
- Flexible Configuration: Environment-based configuration with fallback options
Installation
Prerequisites
- Python 3.6 or higher
- Access to an S3-compatible storage service (AWS S3, MinIO, etc.)
Install Dependencies
pip install -r requirements.txt
Or install manually:
pip install "boto3>=1.26.0"
Configuration
Create a .env file in the project directory with your S3 credentials:
# S3 Configuration
AWS_ACCESS_KEY_ID=your_access_key_here
AWS_SECRET_ACCESS_KEY=your_secret_key_here
S3_ENDPOINT_URL=your-s3-endpoint.com
S3_BUCKET_NAME=your-bucket-name
Note: The script will also look for a config.env file if .env is not found.
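The fallback described in the note can be pictured as a small lookup. A minimal sketch of that logic, assuming the two file names above (the actual parsing code in chunker_s3.py may differ):

```python
import os

def find_config_file():
    """Return the first config file that exists, mirroring the
    .env -> config.env fallback described above (illustrative only)."""
    for candidate in (".env", "config.env"):
        if os.path.exists(candidate):
            return candidate
    return None  # fall back to plain environment variables
```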
Usage
The script provides several commands for managing file chunks and backups. All commands support both long-form (--option) and short-form (-o) flags.
Basic Syntax
python chunker_s3.py <command> [options]
Commands
1. Split and Upload (split)
Splits a file into chunks and uploads them to S3 with incremental support.
python chunker_s3.py split [options]
Required Flags:
--input-file, --input_file, -i: Path to the input file to split
Optional Flags:
--bucket-name, --bucket_name, -b: S3 bucket name (uses config if not specified)
--number-of-chunks, --number_of_chunks: Number of chunks to split into
--chunk-size, --chunk_size: Size of each chunk in bytes (default: 4MB)
--force: Force upload all chunks (disable incremental mode)
Examples:
# Basic split with default 4MB chunks
python chunker_s3.py split --input-file large_file.vmdk --bucket-name my-backups
# Split into 100 chunks
python chunker_s3.py split --input-file disk_image.img --number-of-chunks 100
# Use custom chunk size (8MB)
python chunker_s3.py split --input-file backup.tar.gz --chunk-size 8388608
# Force full upload (ignore existing chunks)
python chunker_s3.py split --input-file file.iso --force
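When --number-of-chunks is given, a chunk size has to be derived from the file size, and the 64KB VMDK alignment mentioned under Key Features suggests rounding that size up to a 64KB boundary. A minimal sketch of that arithmetic (the helper name and rounding policy are assumptions, not the script's actual code):

```python
import os

VMDK_BLOCK = 64 * 1024  # 64KB VMDK block size

def aligned_chunk_size(input_file, number_of_chunks):
    """Derive a chunk size from a target chunk count, rounded up
    to a 64KB boundary (illustrative; chunker_s3.py may differ)."""
    file_size = os.path.getsize(input_file)
    raw = -(-file_size // number_of_chunks)    # ceiling division
    return -(-raw // VMDK_BLOCK) * VMDK_BLOCK  # round up to the next 64KB
```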
2. Reassemble (reassemble)
Downloads and reassembles chunks from S3 into the original file.
python chunker_s3.py reassemble [options]
Required Flags:
--output-file, --output_file, -o: Path for the reassembled output file
--prefix, -p: S3 prefix for the chunk files
Optional Flags:
--bucket-name, --bucket_name, -b: S3 bucket name (uses config if not specified)
Examples:
# Reassemble a backup
python chunker_s3.py reassemble --output-file restored_file.vmdk --prefix myfile
# Reassemble from specific bucket
python chunker_s3.py reassemble --output-file backup.img --prefix diskimage --bucket-name my-backups
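Conceptually, reassembly downloads the chunks in key order and expands sparse placeholders back to zeros. A sketch of that flow, assuming the zero-padded key naming shown in the example output later in this README (myfile_chunk_0000001) and hypothetical metadata keys ('chunk-type', 'original-size') that stand in for whatever chunker_s3.py actually stores:

```python
import boto3

def reassemble(bucket, prefix, output_file, s3=None):
    """Download chunks in key order and expand sparse placeholders.
    Metadata key names here are assumptions, not the script's real layout."""
    s3 = s3 or boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    with open(output_file, "wb") as out:
        for page in paginator.paginate(Bucket=bucket, Prefix=f"{prefix}_chunk_"):
            for item in page.get("Contents", []):  # zero-padded keys sort in chunk order
                obj = s3.get_object(Bucket=bucket, Key=item["Key"])
                meta = obj["Metadata"]
                if meta.get("chunk-type") == "sparse":
                    out.write(b"\x00" * int(meta["original-size"]))
                else:
                    out.write(obj["Body"].read())
```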
3. List Backups (list)
Lists all available backups in the S3 bucket.
python chunker_s3.py list [options]
Optional Flags:
--bucket-name, --bucket_name, -b: S3 bucket name (uses config if not specified)
--prefix, -p: Filter backups by prefix
Examples:
# List all backups
python chunker_s3.py list
# List backups in specific bucket
python chunker_s3.py list --bucket-name my-backups
# List backups with specific prefix
python chunker_s3.py list --prefix myfile
4. Delete Backup (delete)
Deletes a backup and all its chunks from S3.
python chunker_s3.py delete [options]
Required Flags:
--prefix, -p: Prefix of the backup to delete
Optional Flags:
--bucket-name, --bucket_name, -b: S3 bucket name (uses config if not specified)
--force: Skip confirmation prompt
Examples:
# Delete a backup (with confirmation)
python chunker_s3.py delete --prefix old-backup
# Delete without confirmation
python chunker_s3.py delete --prefix temp-backup --force
# Delete from specific bucket
python chunker_s3.py delete --prefix backup --bucket-name my-backups
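Under the hood, deleting by prefix on S3 typically means listing the matching keys and removing them in batches. A minimal boto3 sketch of that pattern (not necessarily the script's exact implementation):

```python
import boto3

def delete_backup(bucket, prefix):
    """List every object under the prefix and delete in batches of up
    to 1000 keys (the S3 delete_objects limit). Illustrative only."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": item["Key"]} for item in page.get("Contents", [])]
        if keys:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})
```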
5. Verify Files (verify)
Verifies that a reassembled file matches the original file.
python chunker_s3.py verify [options]
Required Flags:
--original-file, --original_file: Path to the original file
--reassembled-file, --reassembled_file: Path to the reassembled file
Optional Flags:
--chunk-size, --chunk_size: Chunk size for comparison (default: 4MB)
Examples:
# Verify a reassembled file
python chunker_s3.py verify --original-file original.vmdk --reassembled-file restored.vmdk
# Verify with custom chunk size
python chunker_s3.py verify --original-file file.img --reassembled-file restored.img --chunk-size 8388608
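Verification amounts to hashing both files chunk by chunk and comparing digests. A self-contained sketch of the idea (the function name is illustrative, not the script's API):

```python
import hashlib

def files_match(original, reassembled, chunk_size=4 * 1024 * 1024):
    """Compare two files chunk by chunk via SHA-256. Returns the index
    of the first mismatching chunk, or -1 if the files are identical."""
    with open(original, "rb") as a, open(reassembled, "rb") as b:
        index = 0
        while True:
            ca, cb = a.read(chunk_size), b.read(chunk_size)
            if hashlib.sha256(ca).digest() != hashlib.sha256(cb).digest():
                return index  # covers length mismatches too
            if not ca:  # both streams exhausted
                return -1
            index += 1
```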
Advanced Usage
Environment Variables
You can also set configuration via environment variables:
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export S3_ENDPOINT_URL=your-s3-endpoint.com
export S3_BUCKET_NAME=your-bucket-name
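These variables map directly onto a boto3 client. A minimal sketch of how such a client could be constructed, keeping in mind that boto3 requires endpoint_url to include a scheme such as https:// (whether chunker_s3.py adds one itself is not shown here):

```python
import os
import boto3

def make_s3_client():
    """Build an S3 client from the environment variables above."""
    endpoint = os.environ["S3_ENDPOINT_URL"]
    if not endpoint.startswith(("http://", "https://")):
        endpoint = "https://" + endpoint  # boto3 needs a full URL
    return boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )
```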
Sparse File Optimization
The script automatically detects sparse chunks (mostly zeros) and optimizes them:
- Sparse chunks are stored as small placeholders (6 bytes) instead of full chunks
- Original size is preserved in metadata for accurate reconstruction
- Significant storage savings for disk images and virtual machine files
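A zeroed chunk can be detected cheaply. A sketch of the idea, where the detection threshold and the 6-byte placeholder value (shown here as a literal b"SPARSE" marker, which happens to be 6 bytes) are assumptions rather than the script's confirmed behavior:

```python
def is_sparse(chunk: bytes, threshold: float = 1.0) -> bool:
    """True when the fraction of zero bytes meets the threshold.
    The real detection rule in chunker_s3.py may differ."""
    return chunk.count(0) >= threshold * len(chunk)

SPARSE_PLACEHOLDER = b"SPARSE"  # hypothetical 6-byte placeholder
```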
Incremental Uploads
- Only uploads chunks that have changed since the last backup
- Uses SHA-256 hashing to detect changes
- Maintains metadata about chunk types and sizes
- Provides detailed statistics about upload efficiency
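The skip decision reduces to comparing a freshly computed SHA-256 digest with the one recorded for the previous backup. A sketch under the assumption that the prior hashes are available as a key-to-digest mapping (how the script actually stores them may differ):

```python
import hashlib

def should_upload(chunk: bytes, previous_hashes: dict, key: str) -> bool:
    """Upload only when the chunk's SHA-256 differs from the digest
    recorded for this key in the last backup (illustrative)."""
    digest = hashlib.sha256(chunk).hexdigest()
    return previous_hashes.get(key) != digest
```

With --force, this check would simply be bypassed and every chunk uploaded.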
Output and Statistics
The script provides detailed output including:
- Upload progress with chunk type information
- Storage efficiency statistics
- Sparse file optimization savings
- Upload speed and duration
- File verification results
Example output:
[INFO] Starting incremental upload of large_file.vmdk
[DATA] myfile_chunk_0000001 (4,194,304 bytes, hash: a1b2c3d4...)
[SPARSE] myfile_chunk_0000002 (4,194,304 bytes -> 6 bytes placeholder, 99.9% saved)
[SKIP] myfile_chunk_0000003 unchanged (data, hash: e5f6a7b8...)
=== Upload Complete ===
Total chunks: 1000
- Data chunks: 750
- Sparse chunks: 250
Uploaded chunks: 200
Skipped chunks: 800
Incremental efficiency: 80.0% data reduction
Sparse optimization: 15.2% storage reduction
Total efficiency: 87.5% storage reduction
Contribution
We welcome contributions to improve the S3 Incremental Upload Chunker! Here's how you can help:
Development Setup
- Fork the repository and clone your fork
- Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
- Create a test configuration:
cp .env.sample .env  # Edit .env with your test S3 credentials
Contributing Tasks
🐛 Bug Reports
- Test the issue with the latest version
- Provide detailed reproduction steps
- Include error messages and logs
- Specify your Python version and OS
✨ Feature Requests
- Check existing issues first
- Describe the use case and expected behavior
- Consider backward compatibility
- Provide implementation ideas if possible
🔧 Code Contributions
- Follow the existing code style
- Add tests for new features
- Update documentation as needed
- Ensure all tests pass
📚 Documentation
- Improve README clarity
- Add more usage examples
- Document edge cases
- Translate to other languages
🧪 Testing
- Test with different file types and sizes
- Verify S3-compatible service compatibility
- Test error handling scenarios
- Performance testing with large files
Development Guidelines
- Code Style: Follow PEP 8 Python style guidelines
- Testing: Test your changes with various file types
- Documentation: Update README for new features
- Commits: Use clear, descriptive commit messages
- Pull Requests: Provide detailed description of changes
Reporting Issues
When reporting issues, please include:
- Python version (python --version)
- Operating system
- S3 service being used (AWS S3, MinIO, etc.)
- Complete error messages
- Steps to reproduce
- Sample files (if applicable)
Getting Help
- Check existing issues and discussions
- Create a new issue with detailed information
- Provide minimal reproduction examples
- Include relevant configuration details
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Built with boto3 for AWS S3 compatibility
- Designed for efficient backup of large files and disk images
- Optimized for VMDK and other virtual machine file formats