[GH-ISSUE #2193] s3fs is order of magnitude slower in scanning directory tree than direct s3 access #1117

Open
opened 2026-03-04 01:51:32 +03:00 by kerem · 4 comments

Originally created by @kgabor on GitHub (Jun 23, 2023).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/2193

Similar problem was reported in #1465 .

We have several zarr datasets stored in S3 buckets that consist of hundreds of thousands of 10-40 MB chunk objects arranged in an index-tree-like directory structure.

A public example, e.g.
`s3://aind-open-data/exaSPIM_653431_2023-05-06_10-23-15/exaSPIM.zarr/tile_x_0000_y_0000_z_0000_ch_488.zarr/`,
has 159,911 objects and a total size of 1.1 TB.

Traversing (listing) these directory structures (or `stat`-ing a pre-existing list of these objects) is an order of magnitude slower via s3fs than using the S3 API directly. I could not get any notable performance improvement from the `multireq_max` or `parallel_count` parameters; setting `multireq_max` to high values such as 1024 seems to make performance even worse. The use case is that the processing application uses s3fs and checks each chunk for existence at open time, so the overall input data rate stays very limited (at most ~1.2 GB/min), irrespective of the number of reader threads and s3fs mount parameters. Why?
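One likely factor (an assumption on my part, not something measured here): each POSIX `stat` typically maps to one HEAD request per object, whereas a direct listing can use ListObjectsV2-style pagination at up to 1,000 keys per request, so the example tile above needs vastly fewer round trips:

```python
# Rough request-count arithmetic for the example tile (159,911 objects).
# Assumption: one HEAD request per stat'd object, versus LIST pagination
# at the S3 maximum of 1,000 keys per request.
objects = 159_911

head_requests = objects              # one HEAD per per-object stat
list_requests = -(-objects // 1000)  # ceiling division: LIST pages needed

print(head_requests, list_requests)  # 159911 160
```

Under that assumption the listing path issues roughly a thousandth of the requests, which alone would explain an order-of-magnitude gap even before per-request latency is considered.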

```
ubuntu@ip-172-31-1-50:~$ date; find aind-scratch-data/gabor.kovacs/2023-06-16_653431_2200/ -type f > scratch_filelist.txt; date
Thu Jun 22 20:35:51 UTC 2023
Thu Jun 22 21:16:10 UTC 2023

ubuntu@ip-172-31-1-50:~$ wc scratch_filelist.txt
  306436   306436 25620646 scratch_filelist.txt

# =====

ubuntu@ip-172-31-1-50:~$ date; rclone ls aind_scratch_data:aind-scratch-data/gabor.kovacs/2023-06-16_653431_2200 > rclone_filelist.txt; date
Thu Jun 22 14:53:39 PDT 2023
Thu Jun 22 14:54:15 PDT 2023

ubuntu@ip-172-31-1-50:~$ wc rclone_filelist.txt
  306436   612872 12137462 rclone_filelist.txt
```
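Putting numbers on the two transcripts above (elapsed times read off the `date` pairs):

```python
# Throughput arithmetic from the two transcripts above.
files = 306_436

s3fs_seconds = 40 * 60 + 19    # 20:35:51 -> 21:16:10 UTC
rclone_seconds = 36            # 14:53:39 -> 14:54:15 PDT

print(round(files / s3fs_seconds))           # 127 files/s via find over s3fs
print(round(files / rclone_seconds))         # 8512 files/s via rclone
print(round(s3fs_seconds / rclone_seconds))  # ~67x slower
```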

Additional Information

Version of s3fs being used (`s3fs --version`)

V1.90

Version of fuse being used (`pkg-config --modversion fuse`, `rpm -qi fuse` or `dpkg -s fuse`)

3.10.5-1build1

Kernel information (`uname -r`)

5.19.0-1027-aws

GNU/Linux Distribution, if applicable (`cat /etc/os-release`)

Ubuntu 22.04.2 LTS

How to run s3fs, if applicable

[ ] command line
[ ] /etc/fstab

```
sudo s3fs aind-scratch-data ./aind-scratch-data -o rw,allow_other,umask=0002,uid=$(id -u),gid=$(id -g),use_cache=/home/ubuntu/s3cache,ensure_diskfree=200000,parallel_count=16,nomultipart,multireq_max=32
```


@gaul commented on GitHub (Jun 24, 2023):

s3fs 1.91 reduces the number of HEAD requests, but something is wrong if we don't get more speedup with more parallelism. See #1482 for background on how to make `readdir` much faster at the cost of POSIX compatibility.


@ggtakec commented on GitHub (Jun 25, 2023):

@kgabor
I think that if s3fs performs recursive checks on directories, etc., for the calls issued by the `find` command, that may slow down the operation.

To address this, it may be effective to increase the size of the file stat cache with `max_stat_cache_size`.
This cache holds stat information for files that have been read once, so in your case set it higher than 159,911.

Hopefully this will improve performance.
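For intuition, here is a toy LRU cache sketch (the `StatCache` class is a hypothetical stand-in, not s3fs code, and LRU eviction is an assumption about s3fs's policy) showing why a stat cache smaller than the working set never hits on repeated sequential scans:

```python
from collections import OrderedDict

# Toy LRU stat cache, assuming least-recently-used eviction once
# max_stat_cache_size is exceeded. If the cache is smaller than the
# working set, a full directory scan evicts every entry before reuse.
class StatCache:
    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        self.entries[key] = "stat"             # simulate HEAD, then cache
        if len(self.entries) > self.max_size:
            self.entries.popitem(last=False)   # evict the LRU entry
        return self.entries[key]

# Two sequential scans over 1,000 objects:
small, big = StatCache(max_size=500), StatCache(max_size=2000)
for cache in (small, big):
    for _ in range(2):
        for key in range(1000):
            cache.get(key)

print(small.hits, big.hits)  # 0 1000: the undersized cache never hits
```

Which is why the suggestion is to set `max_stat_cache_size` above the 159,911-object count: below it, a scan can thrash the cache and gain nothing.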


@kgabor commented on GitHub (Jun 27, 2023):

@gaul I'm experimenting with `max_stat_cache_size=5000000,stat_cache_expire=1300000`. Would pre-filling the stat cache by running a `find` command work? Is this a memory-only cache? (I only see entries in the `.aind-open-data.stat` cache dir for files that were actually opened.)

A first experiment, starting the processing job along with `find` in parallel, did not give any performance improvement. (I expected a nice speedup once `find` filled up the chunk-file stat cache, but nothing...)


@kgabor commented on GitHub (Jun 28, 2023):

Latest caching experiment: cache dir empty, s3fs mount:

```
sudo s3fs aind-open-data ./aind-open-data -o rw,allow_other,umask=0002,uid=$(id -u),gid=$(id -g),use_cache=/home/ubuntu/s3cache,ensure_diskfree=200000,multireq_max=32,parallel_count=16,passwd_file=/home/ubuntu/.passwd_open_s3,max_dirty_data=256,nomultipart,max_stat_cache_size=5000000,stat_cache_interval_expire=1300000
```

```
find ~/aind-open-data/exaSPIM_653431_2023-05-06_10-23-15/exaSPIM.zarr/ -type f > filelist.txt
```

This proceeds at 100-200 files/s and finishes in a few hours (959,467 objects). Nothing appears in `~/s3cache`; the s3fs process uses 3-4 GB of memory.

Now, if `find` is repeated, it's much faster, several thousand files/s: the cache is working.

However, if I start the data processing, which reads these very same dirs on 32 threads, re-running `find` in the meantime gets very slow, <100 files/s! If I stop the data processing, `find` gets fast again.

I suspect there must be a locking bottleneck in cache access, or something similar in concurrency handling. This is also supported by the observation that I/O throughput (in data processing) is mostly independent of the number of data-processing threads, and that CPU usage remains well below 100% (i.e. limited by data rate). The s3fs `parallel_count` and `multireq_max` options also make little difference.
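The suspected serialization pattern can be illustrated with a hypothetical stand-in (not s3fs code): a shared dict guarded by a single global lock, the shape under which adding threads adds contention rather than throughput, since every lookup funnels through the same mutex:

```python
import threading

# Hypothetical stand-in for a stat cache behind one global lock. Every
# lookup from every thread serializes on cache_lock, so one slow holder
# stalls all readers regardless of how many threads are running.
cache = {}
cache_lock = threading.Lock()
hits = 0

def lookup(key):
    global hits
    with cache_lock:          # single lock: all threads serialize here
        if key in cache:
            hits += 1
        else:
            cache[key] = True  # simulate filling the stat cache

def worker(n):
    for i in range(n):
        lookup(i % 100)        # 100-key working set, repeatedly scanned

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(cache), hits)  # 100 7900: 100 first-touch misses, rest hits
```

This only illustrates the contention hypothesis; confirming it in s3fs itself would need profiling of its cache-lock hold times.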
