[GH-ISSUE #705] s3fs re-downloading data rather than checking the cache #399

Open
opened 2026-03-04 01:45:10 +03:00 by kerem · 9 comments
Owner

Originally created by @gkiar on GitHub (Jan 8, 2018).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/705

Additional Information

| Field | Value |
|-----:|:------|
| `s3fs --version` | `1.83` |
| `pkg-config --modversion fuse` | `2.9.7` |
| `uname -r` | `16.7.0` |
| Distribution | Mac OSX Sierra + Ubuntu 16.04 (in Docker, for data access) |

s3fs command line used

```
s3fs mybucket /data/mymount/ -o passwd_file=/etc/awspasswd,umask=0007,use_cache=/data/cache -d -d -f
```

Details about issue

I am running the above command in a terminal session, and then in another I launch a tool which processes some of the data on my S3 bucket. The first time I run my tool, the accessed data is downloaded to the cache and processing occurs as expected. However, subsequent attempts to run processes on the same data do not use the cached version, but re-download it, effectively ignoring the cache. Thanks for your help!
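One way to check whether the cache is actually being hit is to time a repeated read and look for the cached copy on disk. A minimal sketch (the paths come from the mount command above; `check_cache` and the cache layout `<cache>/<bucket>/<path>` are assumptions for illustration):

```python
import os
import time

MOUNT = "/data/mymount"  # s3fs mount point from the command line above
CACHE = "/data/cache"    # use_cache directory from the command line above

def timed_read(path):
    """Read a file end to end and return the elapsed wall-clock time."""
    start = time.monotonic()
    with open(path, "rb") as f:
        while f.read(1 << 20):  # read in 1 MiB chunks
            pass
    return time.monotonic() - start

def check_cache(relpath, bucket="mybucket"):
    target = os.path.join(MOUNT, relpath)
    first = timed_read(target)   # expected to download and populate the cache
    second = timed_read(target)  # expected to be served from the local cache
    cached_copy = os.path.join(CACHE, bucket, relpath)
    print(f"first read:  {first:.3f}s")
    print(f"second read: {second:.3f}s")
    print(f"cached copy present: {os.path.exists(cached_copy)}")
```

If the second read is not noticeably faster and the `-d -d -f` debug output still shows GET requests for the object, the local file cache is being bypassed.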


@gkiar commented on GitHub (Jan 9, 2018):

Update: when I am accessing files from the same system from which I've launched the connection, the cache is correctly used (as shown [here](https://gist.github.com/gkiar/ce80df4d66ee16911d2a2e56eb7b651e)). However, when I access the data from within a Docker container with the mount and cache attached at the same location, the cache seems to be skipped and the output log is much more verbose (as shown [here](https://gist.github.com/gkiar/c5b92cb3578288d8ba235713660a5394)).

Do you have any idea why the cache is being ignored in the Docker container? Thanks!


@sqlbot commented on GitHub (Jan 10, 2018):

With a shared cache directory, each process ignoring anything written by any of the others seems like it would be the expected behavior.


@gkiar commented on GitHub (Jan 10, 2018):

@sqlbot based on your response I'm not sure you understand the issue (or I need you to elaborate in order for me to understand your response). Regardless, allow me to clarify:

  1. I have mounted the bucket via s3fs on my host system.
  2. I run a task in a Docker container, sharing both the cache and mount directories with it.
  3. Each subsequent access of a file (whether from the same Docker container or from containers launched after the files have been downloaded) ignores the cached copy and re-downloads the requested files, despite them appearing in the cache.

I do not think this would be expected behaviour? This has also been observed when the S3 mount was shared with the Docker container but the cache was not.
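For reference, the sharing described above can be expressed as bind mounts; a minimal docker-compose sketch (the service name and image are placeholders, the host paths are the ones from the mount command in the issue). Note that a FUSE mount made on the host is only visible inside the container if mount propagation allows it, which is one thing worth ruling out:

```yaml
services:
  worker:
    image: my-processing-tool        # placeholder image
    volumes:
      # Share the s3fs mount point; rshared lets a FUSE mount made on
      # the host propagate into the container.
      - /data/mymount:/data/mymount:rshared
      # Share the local file cache used by use_cache=/data/cache.
      - /data/cache:/data/cache
```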


@sqlbot commented on GitHub (Jan 10, 2018):

Yes, I think I did misinterpret this part:

> when I access the data from within a Docker container with the mount and cache attached at the same location

I interpreted "with the mount" to mean each container had "the mount" because each container was individually running s3fs and mounting the bucket itself, and also using the same shared cache location.


@gkiar commented on GitHub (Jan 10, 2018):

Ah I understand your confusion - thanks. Now that we're on the same page, do you have any ideas as to why this may be happening/how to correct it? 😄


@ggtakec commented on GitHub (Jan 14, 2018):

@gkiar @sqlbot I'm sorry for my late reply.

s3fs can cache the contents of objects as local files.
Whether or not to use this cache is decided by the stats cache held in the internal memory of the s3fs process.
The stats cache holds the stat information for each object (file), which is retrieved via a HEAD request for that object.

In other words, even when s3fs processes in different containers share a common cache directory, each process has no stat information of its own, so I think s3fs issues a GET request instead of using the cache.

Since the stats cache lives in process memory, each process independently decides whether to update its cached copy of an object.
If multiple s3fs processes happen to hold stat cache entries with the same expiration time, the cache data will not necessarily be re-downloaded every time.
However, the current version cannot be expected to behave optimally when the cache directory is shared with another process.
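The interaction @ggtakec describes can be sketched roughly as follows (a simplified model for illustration, not actual s3fs code; the class and method names are invented): each process keeps its own in-memory stats cache, so a data file left in a shared cache directory by one process is not trusted by another process that holds no stat entry for it.

```python
import os
import time

class StatCacheModel:
    """Simplified model of s3fs's per-process stats cache (illustrative only)."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.stats = {}  # path -> (etag, expiry); lives only in this process

    def head(self, path, etag):
        # Stand-in for the HEAD request that populates the stats cache.
        self.stats[path] = (etag, time.monotonic() + 900)

    def read(self, path, remote_etag):
        cached_file = os.path.join(self.cache_dir, path)
        entry = self.stats.get(path)
        if entry and entry[0] == remote_etag and os.path.exists(cached_file):
            return "local cache"  # stat entry confirms the cached file is valid
        # No stat entry in *this* process: the file on disk cannot be
        # trusted, so the object is fetched again.
        return "GET from S3"
```

Two instances sharing the same `cache_dir` behave differently: the one that issued the HEAD serves reads from the local cache, while a freshly started one falls back to a GET even though the data file is already on disk.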


@sqlbot commented on GitHub (Jan 14, 2018):

@ggtakec I think you are making the same assumption that I was about this issue. @gkiar seems to be reporting that only a single s3fs process exists, and the single mounted filesystem is accessible from multiple containers (which would mean that whether the cache directory is shared doesn't actually have any significance).


@ggtakec commented on GitHub (Jan 15, 2018):

@sqlbot and @gkiar I'm sorry, I made the same misunderstanding.
I will look into the problem in more detail.
Regards,


@Oldsouldier commented on GitHub (Aug 29, 2018):

Also noticing this issue.
v1.84

```
/usr/bin/s3fs mybucket -o use_cache=/tmp -o allow_other -o iam_role=myrole -o mp_umask=022 -o multireq_max=5 -o multipart_size=20 -o ensure_diskfree=10000 /mybucket
```

I have an inotify watch on the directory that immediately processes each file after it finishes being created, and I see that the file is read from S3 over the network rather than from the cache.

Running natively on Ubuntu 16.04 (+required dependencies). No containers, single process.
