[GH-ISSUE #2193] s3fs is order of magnitude slower in scanning directory tree than direct s3 access #1117

Open
opened 2026-03-04 01:51:32 +03:00 by kerem · 4 comments

Originally created by @kgabor on GitHub (Jun 23, 2023).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/2193

Similar problem was reported in #1465 .

We have several zarr datasets stored in S3 buckets that consist of hundreds of thousands of 10-40 MB chunk objects arranged in an index-tree-like directory structure.

A public example, e.g.
`s3://aind-open-data/exaSPIM_653431_2023-05-06_10-23-15/exaSPIM.zarr/tile_x_0000_y_0000_z_0000_ch_488.zarr/`,
has 159,911 objects and a total size of 1.1 TB.

Traversing (listing) these directory structures (or `stat`-ing a pre-existing list of these objects) is an order of magnitude slower via s3fs than using the S3 API directly. I could not get any notable performance improvement from the `multireq_max` or `parallel_count` parameters; setting `multireq_max` to high values such as 1024 seems to make performance even worse. The use case is that the processing application uses s3fs and checks each chunk for existence at open time, so the overall input data rate stays very limited (at most ~1.2 GB/min), irrespective of the number of reader threads and s3fs mount parameters. Why?
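One likely factor (an assumption on my part, not something measured here): each POSIX `stat` typically maps to one HEAD request per object, whereas a direct listing can use ListObjectsV2-style pagination at up to 1,000 keys per request, so the example tile above needs vastly fewer round trips:

```python
# Rough request-count arithmetic for the example tile (159,911 objects).
# Assumption: one HEAD request per stat'd object, versus LIST pagination
# at the S3 maximum of 1,000 keys per request.
objects = 159_911

head_requests = objects              # one HEAD per per-object stat
list_requests = -(-objects // 1000)  # ceiling division: LIST pages needed

print(head_requests, list_requests)  # 159911 160
```

Under that assumption the listing path issues roughly a thousandth of the requests, which alone would explain an order-of-magnitude gap even before per-request latency is considered.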

```
ubuntu@ip-172-31-1-50:~$ date; find aind-scratch-data/gabor.kovacs/2023-06-16_653431_2200/ -type f > scratch_filelist.txt; date
Thu Jun 22 20:35:51 UTC 2023
Thu Jun 22 21:16:10 UTC 2023

ubuntu@ip-172-31-1-50:~$ wc scratch_filelist.txt
  306436   306436 25620646 scratch_filelist.txt

# =====

ubuntu@ip-172-31-1-50:~$ date; rclone ls aind_scratch_data:aind-scratch-data/gabor.kovacs/2023-06-16_653431_2200 > rclone_filelist.txt; date
Thu Jun 22 14:53:39 PDT 2023
Thu Jun 22 14:54:15 PDT 2023

ubuntu@ip-172-31-1-50:~$ wc rclone_filelist.txt
  306436   612872 12137462 rclone_filelist.txt
```
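Putting numbers on the two transcripts above (elapsed times read off the `date` pairs):

```python
# Throughput arithmetic from the two transcripts above.
files = 306_436

s3fs_seconds = 40 * 60 + 19    # 20:35:51 -> 21:16:10 UTC
rclone_seconds = 36            # 14:53:39 -> 14:54:15 PDT

print(round(files / s3fs_seconds))           # 127 files/s via find over s3fs
print(round(files / rclone_seconds))         # 8512 files/s via rclone
print(round(s3fs_seconds / rclone_seconds))  # ~67x slower
```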

Additional Information

Version of s3fs being used (`s3fs --version`)

V1.90

Version of fuse being used (`pkg-config --modversion fuse`, `rpm -qi fuse` or `dpkg -s fuse`)

3.10.5-1build1

Kernel information (`uname -r`)

5.19.0-1027-aws

GNU/Linux Distribution, if applicable (`cat /etc/os-release`)

Ubuntu 22.04.2 LTS

How to run s3fs, if applicable

[ ] command line
[ ] /etc/fstab

```
sudo s3fs aind-scratch-data ./aind-scratch-data -o rw,allow_other,umask=0002,uid=$(id -u),gid=$(id -g),use_cache=/home/ubuntu/s3cache,ensure_diskfree=200000,parallel_count=16,nomultipart,multireq_max=32
```


@gaul commented on GitHub (Jun 24, 2023):

s3fs 1.91 reduces the number of HEAD requests, but something is wrong if we don't get more speedup with more parallelism. See #1482 for background on how to make `readdir` much faster at the cost of POSIX compatibility.


@ggtakec commented on GitHub (Jun 25, 2023):

@kgabor
I think that if s3fs performs recursive checks on directories, etc., for the calls issued by the `find` command, that may slow down the operation.

To address this, it may be effective to increase the size of the file stat cache with `max_stat_cache_size`.
This cache holds stat information for files that have been read once, so in your case set it higher than 159,911.

Hopefully this will improve performance.
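For intuition, here is a toy LRU cache sketch (the `StatCache` class is a hypothetical stand-in, not s3fs code, and LRU eviction is an assumption about s3fs's policy) showing why a stat cache smaller than the working set never hits on repeated sequential scans:

```python
from collections import OrderedDict

# Toy LRU stat cache, assuming least-recently-used eviction once
# max_stat_cache_size is exceeded. If the cache is smaller than the
# working set, a full directory scan evicts every entry before reuse.
class StatCache:
    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        self.entries[key] = "stat"             # simulate HEAD, then cache
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)   # evict the LRU entry
        return self.entries[key]

# Two sequential scans over 1,000 objects:
small, big = StatCache(max_size=500), StatCache(max_size=2000)
for cache in (small, big):
    for _ in range(2):
        for key in range(1000):
            cache.get(key)

print(small.hits, big.hits)  # 0 1000: the undersized cache never hits
```

Which is why the suggestion is to set `max_stat_cache_size` above the 159,911-object count: below it, a scan can thrash the cache and gain nothing.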


@kgabor commented on GitHub (Jun 27, 2023):

@gaul I'm experimenting with `max_stat_cache_size=5000000,stat_cache_expire=1300000`. Would pre-filling the stat cache by running a `find` command work? Is this a memory-only cache? (I only see entries in the `.aind-open-data.stat` cache dir for files that were actually opened.)

A first experiment, starting the processing job along with `find` in parallel, did not give any performance improvement. (I expected a nice speedup once `find` filled up the chunk-file stat cache, but nothing...)


@kgabor commented on GitHub (Jun 28, 2023):

Latest caching experiment: cache dir empty, s3fs mount:

```
sudo s3fs aind-open-data ./aind-open-data -o rw,allow_other,umask=0002,uid=$(id -u),gid=$(id -g),use_cache=/home/ubuntu/s3cache,ensure_diskfree=200000,multireq_max=32,parallel_count=16,passwd_file=/home/ubuntu/.passwd_open_s3,max_dirty_data=256,nomultipart,max_stat_cache_size=5000000,stat_cache_interval_expire=1300000
```

```
find ~/aind-open-data/exaSPIM_653431_2023-05-06_10-23-15/exaSPIM.zarr/ -type f > filelist.txt
```

This proceeds at 100-200 files/s and finishes in a few hours (959,467 objects). Nothing appears in `~/s3cache`; the s3fs process uses 3-4 GB of memory.

Now, if `find` is repeated, it's much faster, several thousand files/s: the cache is working.

However, if I start the data processing, which reads these very same dirs on 32 threads, re-running `find` in the meantime gets very slow, <100 files/s! If I stop the data processing, `find` gets fast again.

I suspect there must be a locking bottleneck in cache access, or something similar in concurrency handling. This is also supported by the observation that I/O throughput (in data processing) is mostly independent of the number of data-processing threads, and that CPU usage remains well below 100% (i.e. limited by data rate). The s3fs `parallel_count` and `multireq_max` options also make little difference.
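The suspected serialization pattern can be illustrated with a hypothetical stand-in (not s3fs code): a shared dict guarded by a single global lock, the shape under which adding threads adds contention rather than throughput, since every lookup funnels through the same mutex:

```python
import threading

# Hypothetical stand-in for a stat cache behind one global lock. Every
# lookup from every thread serializes on cache_lock, so one slow holder
# stalls all readers regardless of how many threads are running.
cache = {}
cache_lock = threading.Lock()
hits = 0

def lookup(key):
    global hits
    with cache_lock:          # single lock: all threads serialize here
        if key in cache:
            hits += 1
        else:
            cache[key] = True  # simulate filling the stat cache

def worker(n):
    for i in range(n):
        lookup(i % 100)        # 100-key working set, repeatedly scanned

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(cache), hits)  # 100 7900: 100 first-touch misses, rest hits
```

This only illustrates the contention hypothesis; confirming it in s3fs itself would need profiling of its cache-lock hold times.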
