[GH-ISSUE #941] automatically tune multipart sizes #532

Open
opened 2026-03-04 01:46:26 +03:00 by kerem · 2 comments

Originally created by @gaul on GitHub (Jan 30, 2019).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/941

s3fs should automatically use larger multipart sizes when object sizes are large. For example, `multipart_size` defaults to 10 MB, which means that with the S3 maximum of 10,000 parts s3fs can only write objects up to 100 GB instead of the 5 TB limit. Similarly, `singlepart_copy_limit` should start smaller to improve parallel uploads but increase as object size gets larger. Propose giving these `-1` defaults to allow users to modify behavior but otherwise letting s3fs choose the sizes. References #940.

@ffeldhaus commented on GitHub (Feb 1, 2019):

I would suggest dividing the filesize by `parallel_count` (or a multiple of `parallel_count`) and determining `multipart_size` that way. It is also helpful if `multipart_size` is rounded down to the nearest power of 2 (e.g. 16 MB or 1 GB), in case someone wants to check the ETag of a downloaded file and needs to guess the part size used for the multipart upload.

@gaul commented on GitHub (Feb 2, 2019):

Using more parts than `parallel_count` helps because network errors do not retransmit as much data. The underlying curl reuses connections, so TCP window scaling only affects some connections. Note that there are many possible values for part sizes; let's scope this one to simply allow the full range of object sizes, and we can follow on with optimizations.
