mirror of
https://github.com/s3fs-fuse/s3fs-fuse.git
synced 2026-04-25 13:26:00 +03:00
[PR #1567] [CLOSED] Enhanced parallel processing for list_bucket and multihead request #2050
📋 Pull Request Information
Original PR: https://github.com/s3fs-fuse/s3fs-fuse/pull/1567
Author: @ggtakec
Created: 2/12/2021
Status: ❌ Closed
Base: master ← Head: update_listbucket

📝 Commits (2)

- a53854c Enhanced parallel processing for list_bucket and multihead request
- 1412f1b Enhanced parallel processing for list_bucket and multihead request

📊 Changes

8 files changed (+605 additions, -247 deletions)

- src/curl.cpp (+8 -9)
- src/curl.h (+2 -6)
- src/curl_multi.cpp (+372 -99)
- src/curl_multi.h (+37 -10)
- src/psemaphore.h (+2 -0)
- src/s3fs.cpp (+167 -123)
- src/s3objlist.cpp (+15 -0)
- src/s3objlist.h (+2 -0)

📄 Description
Relevant Issue (if applicable)
#1541
Details
Current
The readdir process first creates a list of the files (objects) in the directory with the list_bucket function (ListBucket request).
A ListBucket request returns at most 1000 files, so the list is built in units of 1000 files.
For directories with more than 1000 files, list_bucket repeats the request until all files are listed.
After the list is complete, s3fs sends HEAD requests for all the files to get their Stats information (readdir_multi_head).
Finally, once all the Stats information is gathered, the file list is returned to FUSE, and the results are also registered in the Stats cache.
Change
I changed the above process as follows.
First, calling the list_bucket function is unchanged: the ListBucket request still lists files in units of 1000.
However, s3fs now starts sending HEAD requests after each batch of 1000 files is listed, instead of waiting for the whole listing to finish.
This is done in a separate thread, and each HEAD request also runs in its own thread, so listing files and executing HEAD requests now proceed in parallel.
The Stats information is registered in the Stats cache as soon as each HEAD request completes, which means the Stats cache is updated while other HEAD requests and the listing are still in progress.
If the Stats cache is updated while a HEAD request is waiting and that file's information becomes a cache hit, the HEAD request is not sent.
(It is not clear whether this situation actually arises, but it would be beneficial when the Stats cache is updated by another process.)
About performance
Ultimately, because of the large number of individual HEAD requests, their total processing time dominates the processing time of the ListBucket requests.
Therefore, I think overall performance will not change much.
(There was not much difference in my measurements.)
However, I think this PR should be merged for the sake of future changes, because the multi-request logic can be reused for requests other than HEAD.
Others
400 HTTP response code
I noticed that sending a HEAD request to certain objects (such as ones using SSE) returns a 400 error from S3.
In the S3fsMultiCurl processing, a 400 response used to trigger a retry.
I changed this retry process so that it is not performed for HEAD requests.
I think this issue was pointed out in a previous Issue, but I could not find which one.
#### About libcrypt and multi-threading

When using openssl (libcrypt) from multiple threads, a double free error may occur on OSX. To avoid this, the insertV4(2)Headers function was changed to be exclusively controlled by a mutex. I wrote a little more information in a gist: https://gist.github.com/ggtakec/a743affecf153e78f6b5d74e2bb1fcd5

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.