[GH-ISSUE #705] s3fs re-downloading data rather than checking the cache #399

Open
opened 2026-03-04 01:45:10 +03:00 by kerem · 9 comments
Owner

Originally created by @gkiar on GitHub (Jan 8, 2018).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/705

Additional Information

| Field | Value |
|-----:|:------|
| `s3fs --version` | `1.83` |
| `pkg-config --modversion fuse` | `2.9.7` |
| `uname -r` | `16.7.0` |
| Distribution | Mac OSX Sierra + Ubuntu 16.04 (in Docker, for data access) |

s3fs command line used

```
s3fs mybucket /data/mymount/ -o passwd_file=/etc/awspasswd,umask=0007,use_cache=/data/cache -d -d -f
```

Details about issue

I am running the above command in a terminal session, and then in another I launch a tool which processes some of the data on my S3 bucket. The first time I run my tool, the accessed data is downloaded to the cache and processing occurs as expected. However, subsequent attempts to run processes on the same data do not use the cached version, but re-download it, effectively ignoring the cache. Thanks for your help!
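One way to check whether the cache is actually being hit is to time a repeated read and look for the cached copy on disk. A minimal sketch (the paths come from the mount command above; `check_cache` and the cache layout `<cache>/<bucket>/<path>` are assumptions for illustration):

```python
import os
import time

MOUNT = "/data/mymount"  # s3fs mount point from the command line above
CACHE = "/data/cache"    # use_cache directory from the command line above

def timed_read(path):
    """Read a file end to end and return the elapsed wall-clock time."""
    start = time.monotonic()
    with open(path, "rb") as f:
        while f.read(1 << 20):  # read in 1 MiB chunks
            pass
    return time.monotonic() - start

def check_cache(relpath, bucket="mybucket"):
    target = os.path.join(MOUNT, relpath)
    first = timed_read(target)   # expected to download and populate the cache
    second = timed_read(target)  # expected to be served from the local cache
    cached_copy = os.path.join(CACHE, bucket, relpath)
    print(f"first read:  {first:.3f}s")
    print(f"second read: {second:.3f}s")
    print(f"cached copy present: {os.path.exists(cached_copy)}")
```

If the second read is not noticeably faster and the `-d -d -f` debug output still shows GET requests for the object, the local file cache is being bypassed.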


@gkiar commented on GitHub (Jan 9, 2018):

Update: when I am accessing files from the same system from which I've launched the connection, the cache is correctly used (as shown [here](https://gist.github.com/gkiar/ce80df4d66ee16911d2a2e56eb7b651e)). However, when I access the data from within a Docker container with the mount and cache attached at the same location, the cache seems to be skipped and the output log is much more verbose (as shown [here](https://gist.github.com/gkiar/c5b92cb3578288d8ba235713660a5394)).

Do you have any idea why the cache is being ignored in the Docker container? Thanks!


@sqlbot commented on GitHub (Jan 10, 2018):

With a shared cache directory, each process ignoring anything written by any of the others seems like it would be the expected behavior.


@gkiar commented on GitHub (Jan 10, 2018):

@sqlbot based on your response I'm not sure you understand the issue (or I need you to elaborate in order for me to understand your response). Regardless, allow me to clarify:

  1. I have mounted the bucket via s3fs on my host system.
  2. I run a task in a Docker container, sharing both the cache and mount directories with it.
  3. Each subsequent access of a file (whether from the same Docker container or from containers launched after the files have been downloaded) ignores the cached copy and re-downloads the requested files, despite them appearing in the cache.

I do not think this would be expected behaviour? This has also been observed when the S3 mount was shared with the Docker container but the cache was not.
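For reference, the sharing described above can be expressed as bind mounts; a minimal docker-compose sketch (the service name and image are placeholders, the host paths are the ones from the mount command in the issue). Note that a FUSE mount made on the host is only visible inside the container if mount propagation allows it, which is one thing worth ruling out:

```yaml
services:
  worker:
    image: my-processing-tool        # placeholder image
    volumes:
      # Share the s3fs mount point; rshared lets a FUSE mount made on
      # the host propagate into the container.
      - /data/mymount:/data/mymount:rshared
      # Share the local file cache used by use_cache=/data/cache.
      - /data/cache:/data/cache
```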


@sqlbot commented on GitHub (Jan 10, 2018):

Yes, I think I did misinterpret this part:

> when I access the data from within a Docker container with the mount and cache attached at the same location

I interpreted "with the mount" to mean each container had "the mount" because each container was individually running s3fs and mounting the bucket itself, and also using the same shared cache location.


@gkiar commented on GitHub (Jan 10, 2018):

Ah I understand your confusion - thanks. Now that we're on the same page, do you have any ideas as to why this may be happening/how to correct it? 😄


@ggtakec commented on GitHub (Jan 14, 2018):

@gkiar @sqlbot I'm sorry for my late reply.

s3fs can cache the contents of objects as local files.
Whether or not to use this cache is decided by the stats cache held in the internal memory of the s3fs process.
The stats cache holds the stat information for each object (file), which is retrieved via a HEAD request for that object.

In other words, even when s3fs processes in different containers share a common cache directory, each process has no stat information of its own, so I think s3fs issues a GET request instead of using the cache.

Since the stats cache lives in process memory, each process independently decides whether to update its cached copy of an object.
If multiple s3fs processes happen to hold stat cache entries with the same expiration time, the cache data will not necessarily be re-downloaded every time.
However, the current version cannot be expected to behave optimally when the cache directory is shared with another process.
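The interaction @ggtakec describes can be sketched roughly as follows (a simplified model for illustration, not actual s3fs code; the class and method names are invented): each process keeps its own in-memory stats cache, so a data file left in a shared cache directory by one process is not trusted by another process that holds no stat entry for it.

```python
import os
import time

class StatCacheModel:
    """Simplified model of s3fs's per-process stats cache (illustrative only)."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.stats = {}  # path -> (etag, expiry); lives only in this process

    def head(self, path, etag):
        # Stand-in for the HEAD request that populates the stats cache.
        self.stats[path] = (etag, time.monotonic() + 900)

    def read(self, path, remote_etag):
        cached_file = os.path.join(self.cache_dir, path)
        entry = self.stats.get(path)
        if entry and entry[0] == remote_etag and os.path.exists(cached_file):
            return "local cache"  # stat entry confirms the cached file is valid
        # No stat entry in *this* process: the file on disk cannot be
        # trusted, so the object is fetched again.
        return "GET from S3"
```

Two instances sharing the same `cache_dir` behave differently: the one that issued the HEAD serves reads from the local cache, while a freshly started one falls back to a GET even though the data file is already on disk.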


@sqlbot commented on GitHub (Jan 14, 2018):

@ggtakec I think you are making the same assumption that I was about this issue. @gkiar seems to be reporting that only a single s3fs process exists, and the single mounted filesystem is accessible from multiple containers (which would mean that whether the cache directory is shared doesn't actually have any significance).


@ggtakec commented on GitHub (Jan 15, 2018):

@sqlbot and @gkiar I'm sorry, I made the same misunderstanding.
I will look into the problem in more detail.
Regards,


@Oldsouldier commented on GitHub (Aug 29, 2018):

Also noticing this issue.
v1.84

```
/usr/bin/s3fs mybucket -o use_cache=/tmp -o allow_other -o iam_role=myrole -o mp_umask=022 -o multireq_max=5 -o multipart_size=20 -o ensure_diskfree=10000 /mybucket
```

I have an inotify watch on the directory that immediately processes each file after it finishes being created, and I see that the file is read from S3 over the network rather than from the cache.

Running natively on Ubuntu 16.04 (+required dependencies). No containers, single process.
