[PR #1802] [MERGED] Add the stream upload which starts uploading parts before Flush #2185

Closed
opened 2026-03-04 02:04:13 +03:00 by kerem · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/s3fs-fuse/s3fs-fuse/pull/1802
Author: @ggtakec
Created: 11/2/2021
Status: Merged
Merged: 7/17/2022
Merged by: @gaul

Base: master ← Head: stream_upload


📝 Commits (5)

  • 6585356 Add the stream upload which starts uploading parts before Flush
  • f3a7fb6 Reflected the result of the review in the code
  • 3ca3cd8 Reflect the result of the review in the code again
  • fd81f63 Fixed an error which reported by cppcheck 2.8
  • 7a578b6 Merged the code corresponding to the mknod fix(f11eb7d)

📊 Changes

15 files changed (+1771 additions, -135 deletions)


📝 src/Makefile.am (+1 -0)
📝 src/curl.cpp (+66 -0)
📝 src/curl.h (+7 -4)
📝 src/fdcache_entity.cpp (+287 -1)
📝 src/fdcache_entity.h (+5 -0)
📝 src/fdcache_fdinfo.cpp (+826 -20)
📝 src/fdcache_fdinfo.h (+56 -13)
📝 src/fdcache_untreated.cpp (+82 -86)
📝 src/fdcache_untreated.h (+8 -9)
📝 src/psemaphore.h (+17 -0)
📝 src/s3fs.cpp (+28 -0)
➕ src/threadpoolman.cpp (+261 -0)
➕ src/threadpoolman.h (+97 -0)
📝 src/types.h (+29 -2)
📝 test/small-integration-test.sh (+1 -0)

📄 Description

Relevant Issue (if applicable)

n/a

Overview

For multipart upload (both mixupload and nomixupload), this PR adds a function that uploads file parts sequentially before the file is flushed.

Details

Currently, s3fs starts uploading a file only when flush is called for that file.
This PR adds an option called streamupload that allows s3fs to upload file parts before the file is flushed.
The streamupload option is effective only when multipart upload (mixupload or nomixupload) is enabled.

The individual explanations are as follows:

(1) streamupload option

This option enables the stream upload function.
It is a tentative option: I will remove it once this PR has been merged and fully tested.
Stream upload should be the default behavior of s3fs, so I plan to enable it by default, as is done for multipart upload.
At that time, I will add a nostream option (provisional name) instead, similar to nomultipart etc.

(2) Multipart size

When stream upload is enabled, the size of each part in a multipart upload is fixed (specified by the multipart_size option).
In other words, part boundaries fall at multiples of multipart_size counted from the beginning of the file, and each part is uploaded on those boundaries.
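
The fixed boundaries described above can be sketched as follows. This is a minimal illustration, not code from this PR: the PartRange type and both function names are hypothetical, and part_size stands in for the value of the multipart_size option in bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper type for illustration; not taken from the s3fs source.
struct PartRange {
    int     number;  // 1-based part number, as used by S3 multipart upload
    int64_t start;   // byte offset of the first byte in the part
    int64_t size;    // number of bytes in the part
};

// Split [0, file_size) into parts aligned to multiples of part_size.
// Only the last part may be shorter than part_size.
std::vector<PartRange> split_into_parts(int64_t file_size, int64_t part_size)
{
    assert(part_size > 0 && file_size >= 0);
    std::vector<PartRange> parts;
    int number = 1;
    for (int64_t start = 0; start < file_size; start += part_size, ++number) {
        int64_t remaining = file_size - start;
        parts.push_back({number, start, remaining < part_size ? remaining : part_size});
    }
    return parts;
}

// Return the 0-based index of the fixed part that contains the byte offset.
int64_t part_index_for_offset(int64_t offset, int64_t part_size)
{
    assert(part_size > 0 && offset >= 0);
    return offset / part_size;
}
```

Because every boundary is a multiple of the part size, the part that a write falls into depends only on its offset, not on the order in which earlier writes arrived.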

(3) Part upload conditions

Once all the data for one of the fixed-range parts described in (2) has been written, the upload of that part starts. (The multipart upload starts even if the file has not been flushed.)
If a range belonging to a part that has already been uploaded is written again, that part is uploaded again.

If the written area does not fill the range of a part, that part is not uploaded until flush is called.
It is uploaded at flush time.
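
The conditions above can be sketched with a small range tracker. This is an illustrative model under stated assumptions, not the data structure used in this PR: PartTracker and its methods are hypothetical names, and the sketch treats any write that touches an already complete part as a reason to upload that part again.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

// Hypothetical tracker: records written byte ranges and reports which
// fixed-size parts are completely covered.
class PartTracker {
public:
    explicit PartTracker(int64_t part_size) : part_size_(part_size) { assert(part_size > 0); }

    // Record a write. Returns the 0-based indexes of the parts touched by this
    // write that are now fully covered, i.e. parts to upload (or re-upload).
    std::vector<int64_t> on_write(int64_t offset, int64_t size)
    {
        std::vector<int64_t> ready;
        if (offset < 0 || size <= 0) return ready;
        add_range(offset, offset + size);
        int64_t last = (offset + size - 1) / part_size_;
        for (int64_t idx = offset / part_size_; idx <= last; ++idx) {
            if (covered(idx * part_size_, (idx + 1) * part_size_)) ready.push_back(idx);
        }
        return ready;
    }

    // Parts that hold some written data but are not full; these wait for flush.
    std::vector<int64_t> partial_parts() const
    {
        std::vector<int64_t> result;
        for (const auto& r : ranges_) {  // r.first = start, r.second = end (exclusive)
            for (int64_t idx = r.first / part_size_; idx <= (r.second - 1) / part_size_; ++idx) {
                bool full = covered(idx * part_size_, (idx + 1) * part_size_);
                if (!full && (result.empty() || result.back() != idx)) result.push_back(idx);
            }
        }
        return result;
    }

private:
    // Insert [start, end) and merge it with overlapping or adjacent ranges.
    void add_range(int64_t start, int64_t end)
    {
        auto it = ranges_.upper_bound(start);
        if (it != ranges_.begin()) {
            auto prev = std::prev(it);
            if (prev->second >= start) it = prev;
        }
        while (it != ranges_.end() && it->first <= end) {
            start = std::min(start, it->first);
            end   = std::max(end, it->second);
            it    = ranges_.erase(it);
        }
        ranges_[start] = end;
    }

    // True if one merged range covers all of [start, end).
    bool covered(int64_t start, int64_t end) const
    {
        auto it = ranges_.upper_bound(start);
        if (it == ranges_.begin()) return false;
        --it;
        return it->first <= start && it->second >= end;
    }

    int64_t part_size_;
    std::map<int64_t, int64_t> ranges_;  // start -> end (exclusive), disjoint and merged
};
```

In this model, a caller would queue an upload for every index returned by on_write, and upload whatever partial_parts reports when flush is called.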

(4) Thread pool

This feature is implemented with a thread pool.
The thread pool is used for each part's upload call.
It is initialized when s3fs starts; all threads are started and put into a standby state.
The max_thread_count option (provisional) has been added to specify the number of threads in this pool.

Like streamupload, this is a temporary option.
It will be replaced by the parallel_count option, etc., when the s3fs refurbishment (including this PR) is completed.
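
The thread pool idea can be sketched as follows: a fixed number of workers is started up front and waits in a standby state until a task is submitted. This is a generic illustration built on the C++11 standard library, not the implementation in src/threadpoolman.cpp; the SimpleThreadPool name is hypothetical, and thread_count stands in for the value of max_thread_count.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal pool for illustration only.
class SimpleThreadPool {
public:
    // Start all workers immediately; they block until work arrives.
    explicit SimpleThreadPool(int thread_count)
    {
        assert(thread_count > 0);
        for (int i = 0; i < thread_count; ++i) {
            workers_.emplace_back([this] { worker_loop(); });
        }
    }

    // Let the workers finish all queued tasks, then join them.
    ~SimpleThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cond_.notify_all();
        for (auto& w : workers_) w.join();
    }

    SimpleThreadPool(const SimpleThreadPool&) = delete;
    SimpleThreadPool& operator=(const SimpleThreadPool&) = delete;

    // Queue one task, for example the upload of a single part.
    void submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        cond_.notify_one();
    }

private:
    void worker_loop()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cond_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and nothing left to run
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so other workers can proceed
        }
    }

    std::mutex mutex_;
    std::condition_variable cond_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

Starting the workers once avoids creating a thread for every part, and the pool size bounds how many part uploads run at the same time.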

(5) About test

The existing tests are sufficient for file uploads.
Opening files, writing to non-contiguous areas, and closing files are covered by the recently added write_multiblock test.

Large files were tested separately; please see (6).

(6) Performance

Performance comparisons involving large files were run separately and are summarized in the Gist below:
https://gist.github.com/ggtakec/0482aca53643681e2e410ed4032b780f

The upload speed for 5GB files improved by about 40%.

NOTE

This PR is intended for performance tuning and source code cleanup.
It is one of a series of modifications that make up the refurbishment.
In later changes in that series, I plan to use the thread pool mentioned above and to fix downloads, HEAD requests, and so on.
When the series of refurbishments is complete, the two tentative options mentioned above will also be sorted out.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
