mirror of
https://github.com/s3fs-fuse/s3fs-fuse.git
synced 2026-04-25 13:26:00 +03:00
[GH-ISSUE #705] s3fs re-downloading data rather than checking the cache #399
Originally created by @gkiar on GitHub (Jan 8, 2018).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/705
Additional Information
s3fs --version: 1.83
pkg-config --modversion fuse: 2.9.7
uname -r: 16.7.0
s3fs command line used:
Details about issue
I am running the above command in a terminal session, and then in another I launch a tool which processes some of the data on my S3 bucket. The first time I run my tool, the accessed data is downloaded to the cache and processing occurs as expected. However, subsequent attempts to run processes on the same data do not use the cached version, but re-download it, effectively ignoring the cache. Thanks for your help!
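The exact mount command was not preserved in the mirror. For context, a typical cache-enabled s3fs invocation might look like the following; the bucket name, mount point, and cache path here are placeholders, not the reporter's actual values:

```shell
# Illustrative only -- not the reporter's command.
# use_cache enables the local file cache; stat_cache_expire controls how long
# the in-memory stat entries (which gate cache reuse) remain valid, in seconds.
s3fs mybucket /mnt/mybucket \
    -o use_cache=/tmp/s3fs-cache \
    -o stat_cache_expire=900 \
    -o allow_other
```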
@gkiar commented on GitHub (Jan 9, 2018):
Update: when I am accessing files from the same system from which I've launched the connection, the cache is correctly used (as shown here). However, when I access the data from within a Docker container with the mount and cache attached at the same locations, the cache seems to be skipped and the output log is much more verbose (as shown here).
Do you have any idea why the cache is being ignored in the Docker container? Thanks!
@sqlbot commented on GitHub (Jan 10, 2018):
With a shared cache directory, for each process to ignore anything written by any of the others seems like it would be the expected behavior.
@gkiar commented on GitHub (Jan 10, 2018):
@sqlbot based on your response I'm not sure you understood the issue (or I need you to elaborate so I can understand your response). Regardless, allow me to clarify:
I do not think this would be expected behaviour. The same problem has also been observed when the S3 mount was shared with the Docker container but the cache was not.
@sqlbot commented on GitHub (Jan 10, 2018):
Yes, I think I did misinterpret this part:
I interpreted "with the mount" to mean each container had "the mount" because each container was individually running s3fs and mounting the bucket itself, and also using the same shared cache location.
@gkiar commented on GitHub (Jan 10, 2018):
Ah I understand your confusion - thanks. Now that we're on the same page, do you have any ideas as to why this may be happening/how to correct it? 😄
@ggtakec commented on GitHub (Jan 14, 2018):
@gkiar @sqlbot I'm sorry for my late reply.
s3fs can cache the contents of objects as local files.
Whether or not to use this file cache is decided by the stat cache held in the internal memory of each s3fs process.
The stat cache holds the stat information for each object (file), retrieved by a HEAD request for that object.
In other words, even if s3fs processes in different containers use a common cache directory, a process that has no stat information for an object will issue a GET request instead of using the cached file.
Since the stat cache lives in process memory, each process independently decides whether to update the cached copy of an object.
If multiple s3fs processes happen to hold stat cache entries with the same expiration time, the cached data will not be re-downloaded every time, but this cannot be relied on.
In short, the current version cannot be expected to behave optimally when the cache directory is shared between processes.
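The mechanism described above can be sketched as a toy simulation. This is not s3fs's actual implementation (s3fs is C++; the names and structure here are invented): each "process" keeps its own stat table, while the file cache directory is shared, so a process with a cold stat table re-fetches even though the bytes are already on disk.

```shell
#!/usr/bin/env bash
# Toy model of s3fs's per-process stat cache (names invented, bash >= 4.3).
# The data cache directory is shared; the stat cache is per-process memory.

cache_dir=$(mktemp -d)                        # shared use_cache directory
echo "object-bytes" > "$cache_dir/mybucket_file.txt"

# Each "process" is modelled as its own associative array of stat entries.
declare -A stats_proc_a=( ["mybucket_file.txt"]="cached" )
declare -A stats_proc_b=()                    # fresh process: empty stat cache

read_object() {                               # $1 = stat-array name, $2 = key
  local -n stats=$1
  if [[ -n ${stats[$2]:-} && -f "$cache_dir/$2" ]]; then
    echo "local-cache-hit"
  else
    echo "GET-from-S3"                        # bytes on disk, but no stat entry
    stats[$2]="cached"                        # after the GET, the cache is warm
  fi
}

read_object stats_proc_a mybucket_file.txt    # prints local-cache-hit
read_object stats_proc_b mybucket_file.txt    # prints GET-from-S3
read_object stats_proc_b mybucket_file.txt    # prints local-cache-hit
```

The third call hits the cache only because process B's stat table was warmed by its own GET, which matches the behaviour @ggtakec describes: the shared data directory alone is not enough.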
@sqlbot commented on GitHub (Jan 14, 2018):
@ggtakec I think you are making the same assumption that I was about this issue. @gkiar seems to be reporting that only a single s3fs process exists, and the single mounted filesystem is accessible from multiple containers (which would mean that whether the cache directory is shared doesn't actually have any significance).
@ggtakec commented on GitHub (Jan 15, 2018):
@sqlbot and @gkiar I'm sorry, I made the same misunderstanding.
I will see the problem in more detail.
Regards,
@Oldsouldier commented on GitHub (Aug 29, 2018):
Also noticing this issue.
V1.84
/usr/bin/s3fs mybucket -o use_cache=/tmp -o allow_other -o iam_role=myrole -o mp_umask=022 -o multireq_max=5 -o multipart_size=20 -o ensure_diskfree=10000 /mybucket

I have an inotify watch on the directory that immediately processes the file after it is done being created, and I see that it is being read from S3 over the network rather than from the cache.
Running natively on Ubuntu 16.04 (+required dependencies). No containers, single process.
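For anyone reproducing this, a few diagnostic commands can show whether reads are served from the cache. These are a sketch under the assumption that, with `use_cache=/tmp`, s3fs stores object data under `/tmp/<bucket>/<object path>`; the file name is a placeholder:

```shell
# Inspect the local file cache for the mounted bucket (path assumed from
# the use_cache=/tmp option in the command above).
ls -lR /tmp/mybucket

# Re-read a file and watch the s3fs log for GET requests; for verbose
# logging, run s3fs in the foreground with: -f -o dbglevel=info
cat /mybucket/somefile > /dev/null
```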