[PR #1567] [CLOSED] Enhanced parallel processing for list_bucket and multihead request #2050

opened 2026-03-04 02:03:26 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/s3fs-fuse/s3fs-fuse/pull/1567
Author: @ggtakec
Created: 2/12/2021
Status: Closed

Base: master ← Head: update_listbucket


📝 Commits (2)

  • a53854c Enhanced parallel processing for list_bucket and multihead request
  • 1412f1b Enhanced parallel processing for list_bucket and multihead request

📊 Changes

8 files changed (+605 additions, -247 deletions)

View changed files

📝 src/curl.cpp (+8 -9)
📝 src/curl.h (+2 -6)
📝 src/curl_multi.cpp (+372 -99)
📝 src/curl_multi.h (+37 -10)
📝 src/psemaphore.h (+2 -0)
📝 src/s3fs.cpp (+167 -123)
📝 src/s3objlist.cpp (+15 -0)
📝 src/s3objlist.h (+2 -0)

📄 Description

Relevant Issue (if applicable)

#1541

Details

Current

The readdir process first builds a list of the files (objects) in the directory with the list_bucket function (ListBucket Request).
A ListBucket Request returns at most 1000 files, so the list is built in units of 1000 files.
For directories with more than 1000 files, listing continues until all files have been listed.
After the list is complete, s3fs sends HEAD requests for all the files to get their Stats information (readdir_multi_head).
Finally, once all Stats information has been collected, s3fs returns the file list to FUSE, and the entries are also registered in the Stats cache.
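To make the two phases concrete, here is a minimal, self-contained sketch of the pre-change flow. It is not the s3fs code; `list_bucket_page`, `readdir_sequential`, and the fake per-key "HEAD" are hypothetical stand-ins that only model the ordering: finish the whole listing first, then stat every object.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical simplified model of the pre-change readdir flow:
// 1) page through ListBucket (max 1000 keys per request) until done,
// 2) only then issue a HEAD per key and register its stats.
struct Stat { long size = 0; };

std::vector<std::string> list_bucket_page(const std::vector<std::string>& all,
                                          size_t offset, size_t page = 1000) {
    // One ListBucket request returns at most `page` keys.
    std::vector<std::string> out;
    for (size_t i = offset; i < all.size() && i < offset + page; ++i)
        out.push_back(all[i]);
    return out;
}

std::map<std::string, Stat> readdir_sequential(const std::vector<std::string>& bucket) {
    // Phase 1: build the complete listing before any HEAD is sent.
    std::vector<std::string> names;
    for (size_t off = 0; off < bucket.size(); off += 1000) {
        auto page = list_bucket_page(bucket, off);
        names.insert(names.end(), page.begin(), page.end());
    }
    // Phase 2: "HEAD" every object (faked here) and fill the stat cache.
    std::map<std::string, Stat> cache;
    for (const auto& n : names)
        cache[n] = Stat{static_cast<long>(n.size())}; // stand-in for a HEAD request
    return cache;
}
```

The point of the sketch is that phase 2 cannot start until phase 1 has paged through the entire directory.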

Change

I changed the above process as follows.
Calling the list_bucket function is unchanged: ListBucket Request still lists files in units of 1000 as before.

However, s3fs now starts sending HEAD requests as soon as each batch of 1000 files is listed, instead of waiting for the full listing.
This work is done in a separate thread, and each HEAD request itself also runs in a separate thread.
As a result, listing files and executing HEAD requests proceed in parallel.

It also registers the Stats information in the Stats cache as soon as each HEAD request completes.
This means the Stats cache is updated while other HEAD requests and the listing are still in progress.
If the Stats cache is updated while a HEAD request is still waiting to be sent and that file's information becomes a hit, the HEAD request is not sent.
(It is unclear whether this situation actually occurs, but it would be beneficial if the Stats cache were updated by another process.)
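The producer/consumer shape described above can be sketched as follows. This is a hypothetical simplification, not the PR's code: the lister pushes each page onto a queue as soon as it arrives, while one worker thread drains the queue, skips keys that are already in the stat cache, and registers each result immediately.

```cpp
#include <condition_variable>
#include <map>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

// Hypothetical sketch of the post-change flow: listing and HEADs overlap.
struct ParallelReaddir {
    std::queue<std::string> pending;        // keys listed but not yet stat'ed
    std::map<std::string, long> stat_cache; // filled as each "HEAD" completes
    std::mutex mtx;
    std::condition_variable cv;
    bool done = false;

    void head_worker() {
        std::unique_lock<std::mutex> lk(mtx);
        while (!done || !pending.empty()) {
            cv.wait(lk, [&] { return done || !pending.empty(); });
            while (!pending.empty()) {
                std::string key = pending.front();
                pending.pop();
                if (stat_cache.count(key)) continue;       // cache hit: HEAD not sent
                lk.unlock();
                long size = static_cast<long>(key.size()); // stand-in for a HEAD request
                lk.lock();
                stat_cache[key] = size;                    // register as soon as done
            }
        }
    }

    std::map<std::string, long> run(const std::vector<std::string>& bucket) {
        std::thread worker(&ParallelReaddir::head_worker, this);
        for (size_t off = 0; off < bucket.size(); off += 1000) {
            std::lock_guard<std::mutex> lk(mtx);
            for (size_t i = off; i < bucket.size() && i < off + 1000; ++i)
                pending.push(bucket[i]); // hand each 1000-entry page off immediately
            cv.notify_one();
        }
        {
            std::lock_guard<std::mutex> lk(mtx);
            done = true;
        }
        cv.notify_one();
        worker.join();
        return stat_cache;
    }
};
```

Because the worker re-checks the queue under the lock before sleeping, no wakeup is lost; the cache-hit check before the simulated HEAD mirrors the "hit means no HEAD is sent" behaviour described above.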

About performance

In the end, because of the large number of individual HEAD requests, the total HEAD request time dominates the ListBucket time.
Therefore, I do not expect performance to change much.
(There was not much difference in my measurements.)
However, I think this PR should still be merged as groundwork for future changes,
because the multi-request logic can be reused for requests other than HEAD.

Others

400 HTTP response code

I noticed that sending a HEAD request to certain objects, such as SSE-encrypted ones, returns a 400 error from S3.
In the S3fsMultiCurl processing, the request was retried whenever a 400 was received.
I changed this retry process so that it is skipped in the case of a HEAD request.
I think this was pointed out in an earlier issue, but I could not find which one.
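The retry decision described above can be sketched as a small predicate. This is a hypothetical illustration, not the actual S3fsMultiCurl code; `ReqType` and `should_retry` are made-up names, and the set of retryable codes is simplified.

```cpp
// Hypothetical sketch: a 400 from a HEAD request (e.g. on certain SSE
// objects) is treated as final, while other request types keep their
// retry behaviour for 400 and server-side (5xx) errors.
enum class ReqType { Head, Get, Put };

bool should_retry(ReqType type, long http_code) {
    if (http_code == 400 && type == ReqType::Head)
        return false;              // HEAD + 400: do not retry
    return http_code == 400 || http_code >= 500;
}
```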

About libcrypto and multi-threading

When using OpenSSL (libcrypto) from multiple threads, a double free error can occur on OSX.
To avoid this, the insertV4(2)Headers functions are now exclusively controlled by a mutex.
I wrote a little more detail in a gist: https://gist.github.com/ggtakec/a743affecf153e78f6b5d74e2bb1fcd5.
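The mutual-exclusion change can be sketched like this. It is a hypothetical illustration only: `insert_v4_headers` and its body are stand-ins, and the real functions do V4 request signing via libcrypto rather than simply setting a header.

```cpp
#include <map>
#include <mutex>
#include <string>

// Hypothetical sketch: the signing step is guarded by a single mutex so
// that concurrent threads cannot enter the libcrypto-using code at the
// same time (which could trigger the double free on OSX).
static std::mutex sign_mutex;

void insert_v4_headers(std::map<std::string, std::string>& headers,
                       const std::string& payload_hash) {
    std::lock_guard<std::mutex> lock(sign_mutex); // exclusive: one signer at a time
    // Stand-in for the real signing work done under the lock.
    headers["x-amz-content-sha256"] = payload_hash;
}
```

The trade-off is that signing becomes serialized across threads, which is acceptable here because the expensive part of each request is the network round trip, not the signature.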


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
