[GH-ISSUE #1465] Trying to improve scan time of files in s3fs-fuse mount #772

Closed
opened 2026-03-04 01:48:39 +03:00 by kerem · 16 comments

Originally created by @matrush900 on GitHub (Oct 30, 2020).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/1465

Additional Information

The following information is very important in order to help us to help you. Omission of the following details may delay your support request or cause it to receive no attention at all.
Keep in mind that the commands we provide to retrieve information are oriented to GNU/Linux distributions, so you may need to use other commands if you use s3fs on macOS or BSD.

Version of s3fs being used (s3fs --version)

1.86

Version of fuse being used (pkg-config --modversion fuse, rpm -qi fuse, dpkg -s fuse)

2.9.2

Kernel information (uname -r)

3.10.0-1127.13.1.el7.x86_64

GNU/Linux Distribution, if applicable (cat /etc/os-release)

NAME="Red Hat Enterprise Linux Server"
VERSION="7.8 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.8"
PRETTY_NAME=RHEL
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.8:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.8"

s3fs command line used, if applicable

/etc/fstab entry, if applicable

accumulo-cold-archive /data-cold fuse.s3fs kernel_cache,max_background=1000,max_stat_cache_size=300000,enable_noobj_cache,multipart_size=52,parallel_count=15,multireq_max=15,dbglevel=warn,_netdev,allow_other,mp_umask=0022,nonempty,use_path_request,iam_role=auto 0 0

s3fs syslog messages (grep s3fs /var/log/syslog, journalctl | grep s3fs, or s3fs outputs)

If you execute s3fs with the dbglevel or curldbg option, you can get detailed debug messages.

Details about issue
We are running an HDFS cluster with Accumulo. We have 48 datanodes in the cluster, each with five 2 TB EBS volumes plus an S3 mount using s3fs-fuse that we've recently added. The configuration within HDFS has each node pointing to its own folder within the S3 mount, e.g. /data-cold/cloud-int-data1a/dfs/dn. We are running into a bit of an issue when restarting a datanode: part of the startup process catalogs/scans each object under each of the five 2 TB drives along with the /data-cold s3fs mountpoint. Each EBS volume takes ~45 seconds, whereas the /data-cold mount takes 14-20 minutes. I realize that s3fs will be slower, but are there any parameter changes we should try to speed this up?
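
For context, a rough way to reproduce the difference outside of HDFS (the EBS path below is a placeholder; the S3 path is the one from this report) is to time a recursive listing of each storage directory, since the datanode's startup scan has to stat every block file:

```
# Hypothetical timing comparison between one EBS data dir and the s3fs mount.
time ls -lR /data1/dfs/dn > /dev/null                          # local EBS volume (placeholder path)
time ls -lR /data-cold/cloud-int-data1a/dfs/dn > /dev/null     # s3fs mount
```

On s3fs, each stat that misses the cache typically turns into a HEAD request, which is where the extra minutes go.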

@matrush900 commented on GitHub (Oct 30, 2020):

I forgot to mention that each HDFS block in S3 is 128MB, and there are 150000-200000 objects in each datanode folder in the bucket.


@gaul commented on GitHub (Oct 30, 2020):

Could you try increasing the value of -o multireq_max? s3fs is issuing many HEAD requests to get the stat information for readdir.
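
For reference, a minimal sketch of a mount with a larger value (the numbers are illustrative, not a recommendation from this comment; credential and other site-specific options are omitted):

```
# Hypothetical remount with more concurrent metadata requests.
s3fs accumulo-cold-archive /data-cold \
    -o iam_role=auto \
    -o multireq_max=100 \
    -o parallel_count=30 \
    -o enable_noobj_cache,max_stat_cache_size=300000
```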


@tke273 commented on GitHub (Oct 30, 2020):

I work with Mat and we have that and parallel_count set to 15. When we had it set to 30, the s3fs mount went offline on multiple nodes, and a umount/remount was needed to get going again. It is now stable at 15, but the latency after mount is where we are looking for guidance. Is there a way to determine what the setting should be?


@gaul commented on GitHub (Oct 30, 2020):

Doubling the number should halve the scan time and so on. Please test again with 1.87; it includes fixes that might address your symptoms. We expect that s3fs should support a hundred or more concurrent requests. If it crashes we can investigate further.


@matrush900 commented on GitHub (Nov 2, 2020):

We've tried a few different combinations of parallel_count and multireq_max, first with 1.86, then with 1.87. None of these combinations made any appreciable difference.
1.86 parallel_count=15, multireq_max=30 - 28 minutes
1.86 parallel_count=15, multireq_max=90 - 26 minutes
1.87 parallel_count=20, multireq_max=90 - 26 minutes
1.87 parallel_count=30, multireq_max=90 - 25 minutes

Are there any other parameter changes we may want to try? It doesn't appear that this s3fs mount is crashing like before, but we may have to watch it for a day to find out.


@matrush900 commented on GitHub (Nov 3, 2020):

It looks like our GetRequests to the bucket are topping out at 10,500/minute during these times as well.
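
One way to double-check that per-minute figure, assuming S3 request metrics are enabled on the bucket (the FilterId and time window below are placeholders), is to query CloudWatch directly:

```
# Hypothetical CloudWatch query: GetRequests on the bucket, summed per minute.
# Requires a request-metrics configuration (FilterId) on the bucket.
aws cloudwatch get-metric-statistics \
    --namespace AWS/S3 \
    --metric-name GetRequests \
    --dimensions Name=BucketName,Value=accumulo-cold-archive Name=FilterId,Value=EntireBucket \
    --start-time 2020-11-03T00:00:00Z --end-time 2020-11-03T01:00:00Z \
    --period 60 --statistics Sum
```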


@gaul commented on GitHub (Nov 4, 2020):

You might also try -o enable_noobj_cache. If this does not help, can you analyze the requests being sent via the logs in -o curldbg?
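
If curldbg is enabled, one rough way to see the mix of request types during the scan (this assumes the debug output reaches syslog and includes curl's request lines; the log path is a guess for RHEL) might be:

```
# Hypothetical tally of HTTP verbs issued by s3fs during the scan window.
grep s3fs /var/log/messages | grep -oE '\b(HEAD|GET|PUT|POST|DELETE)\b' | sort | uniq -c
```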


@matrush900 commented on GitHub (Nov 11, 2020):

We already had enable_noobj_cache implemented, but I've added curldbg to several of our nodes. I'll let you know if we see anything strange from the curldbg output. We've also created a second bucket, and mounted it using s3fs-fuse. This looks promising so far, but needs some more load testing.


@gaul commented on GitHub (Jan 1, 2021):

I tried various values of multireq_max when running ls --color=always /path | wc -l on a bucket with 3,000 objects and ~100 ms latency:

| multireq_max | run 1 | run 2 | run 3 |
| -----------: | ----: | ----: | ----: |
| 10 | 74.663s | 75.892s | 77.274s |
| 20 | 55.648s | 57.224s | 60.967s |
| 40 | 47.310s | 49.731s | 50.041s |
| 80 | 43.335s | 45.279s | 46.692s |
| 160 | 42.993s | 43.069s | 44.234s |

I used a new instance of s3fs in each run to prevent caching. Increasing multireq_max clearly helps, but the improvement falls well short of the linear speedup that the added parallelism should provide. This needs more investigation.
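
For anyone reproducing this measurement, a rough sketch of the procedure (bucket name, mountpoint, and path are placeholders, and credential options are omitted) might look like:

```
# Hypothetical benchmark loop: a fresh mount per run so s3fs's stat cache and
# the kernel's dentry/attribute caches start cold each time.
for n in 10 20 40 80 160; do
    s3fs test-bucket /mnt/s3 -o multireq_max=$n
    time ls --color=always /mnt/s3/path | wc -l
    fusermount -u /mnt/s3
done
```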


@matrush900 commented on GitHub (Jan 5, 2021):

Thanks for continuing to look at this. Here's an update on our current settings. We are connecting 10 buckets to each HDFS node with kernel_cache,max_background=1000,max_stat_cache_size=300000,enable_noobj_cache,multipart_size=52,parallel_count=15,multireq_max=15,dbglevel=warn,_netdev,allow_other,mp_umask=0022,use_path_request,iam_role=auto 0 0

At this point each s3fs process is using 7-20% CPU after an HDFS restart, when it's cataloging/listing the objects contained under each bucket; each HDFS datanode has eight CPUs. I'm not sure if the 20% CPU limit is due to an s3fs or HDFS limitation.
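
A quick way to see how busy each s3fs process actually is during the post-restart scan (the tool choice here is a suggestion, not something from this thread):

```
# Hypothetical per-process CPU sampling for every running s3fs instance,
# refreshed every 5 seconds.
pidstat -u -p "$(pgrep -d, -x s3fs)" 5
```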


@matrush900 commented on GitHub (Jan 7, 2021):

It looks like HDFS is generating about 1 billion list requests a day through s3fs-fuse across our 10 buckets. Would we benefit from increasing list_object_max_keys from 1000 to 100000? Each s3fs mount is attached to a bucket with a folder containing around 85000 objects.
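
For reference, the option in question is set at mount time; a hedged sketch (options trimmed to the relevant ones) is below. Note that S3 itself caps a single ListObjects response at 1,000 keys, so values above 1000 may not reduce the number of LIST round trips.

```
# Hypothetical mount showing where list_object_max_keys is set; S3 returns at
# most 1,000 keys per ListObjects response regardless of the requested value.
s3fs accumulo-cold-archive /data-cold -o iam_role=auto -o list_object_max_keys=1000
```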


@matrush900 commented on GitHub (Jan 8, 2021):

We've removed the enable_noobj_cache flag, but no luck: LIST API calls are still high, no change.


@gaul commented on GitHub (Jan 11, 2021):

Related to #1482.


@matrush900 commented on GitHub (Jan 12, 2021):

We determined that as part of HDFS's routine, it was running a du -sk every 15 minutes on each storage directory to find out how much space was available. This was creating 90-95% of the 1 billion list requests per day. We swapped du for df, which dramatically cut our ListObject requests.
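
The difference in cost is easy to see on an s3fs mount: du has to walk and list every object, while df only issues a statvfs call, which (as noted in the next comment) s3fs answers with fixed values rather than by listing the bucket. A quick illustration, using the storage directory from this report:

```
# du recursively lists every object under the directory (many LIST requests),
# while df issues a single statvfs call that s3fs answers with fixed values.
time du -sk /data-cold/cloud-int-data1a/dfs/dn
time df -k /data-cold
```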


@gaul commented on GitHub (Jan 13, 2021):

> We determined that as part of HDFS's routine, it was running a du -sk every 15 minutes on each storage directory to find out how much space was available. This was creating 90-95% of the 1 billion list requests per day. We swapped du for df, which dramatically cut our ListObject requests.

I am glad you could diagnose this! Note that the s3fs statvfs call just returns the maximum value for available space and 0 for used, so you will not get an accurate count.

The periodic du issue is similar to the updatedb symptoms we see sometimes. I wonder if there is some way for users to see which processes are querying s3fs, so that they can more easily diagnose similar issues.

Please let us know if your performance improves without the periodic du. Let's leave this issue open, though, so I can follow up on https://github.com/s3fs-fuse/s3fs-fuse/issues/1465#issuecomment-753275976 at some point.
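
As a low-tech starting point for the "which process is hitting the mount" question (these are generic Linux tools, not anything s3fs provides, and they only show processes that currently hold files open or have their working directory on the mount):

```
# Hypothetical check for processes currently using the s3fs mountpoint.
fuser -vm /data-cold
lsof /data-cold
```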


@gaul commented on GitHub (Apr 25, 2021):

Closing since a workaround addresses the symptoms.
