mirror of
https://github.com/s3fs-fuse/s3fs-fuse.git
synced 2026-04-25 13:26:00 +03:00
[GH-ISSUE #1962] Architectural suggestion regarding s3fs #991
Originally created by @giulianoc on GitHub (Jun 12, 2022).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/1962
Hi,
I need to share S3 storage among several virtual machines.
My requirement is that a file created in an s3fs-mounted folder on one virtual machine must become accessible from the other virtual machines as soon as possible.
I guess I can use two approaches:
1. Every virtual machine mounts the S3 storage directly with s3fs.
2. A single 'gateway/nfs-server' machine mounts the S3 storage with s3fs and exports the mounted folder to the other virtual machines via NFS.
Of course, in scenario 1 every virtual machine needs the S3 credentials to mount the S3 storage and a folder to use as the s3fs cache, while in scenario 2 only the gateway machine needs them.
Anyway, which approach would be best, considering the requirement above?
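The two approaches above could be sketched roughly as follows. This is a hedged illustration, not a tested setup: the bucket name, mount points, cache path, and gateway hostname are all hypothetical, and the options shown (passwd_file, use_cache, allow_other) are standard s3fs mount options.

```shell
# Scenario 1: each VM mounts the bucket directly with s3fs.
# "mybucket", the mount point, and the cache path are hypothetical examples.
s3fs mybucket /mnt/s3 \
    -o passwd_file=/etc/passwd-s3fs \
    -o use_cache=/var/cache/s3fs

# Scenario 2: only the gateway mounts the bucket; the other VMs use NFS.
# On the gateway (allow_other lets the NFS server daemon read the FUSE mount):
s3fs mybucket /mnt/s3gw -o passwd_file=/etc/passwd-s3fs -o allow_other
# On each client VM:
mount -t nfs gateway-host:/mnt/s3gw /mnt/shared
```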
Best regards
@vguilleaume commented on GitHub (Jun 13, 2022):
Hi Giulianoc,
These two approaches can both work, I believe.
Depending on where your S3 storage lives (public cloud, on-premises object storage, ...), I see an advantage in putting a caching system between the 'gateway/nfs-server' and the S3 storage. The cache can keep data locally for a certain amount of time even though a copy also exists in the bucket.
I have some experience combining an NFS server with a local cache and, on the backend, s3fs to copy data to the S3 bucket through a workflow mechanism.
The first option, direct access to S3 from each VM, will work as well, but it can require a little more configuration because management is decentralized.
I think both can work; beyond that, it is perhaps a question of the volume and type of data that each virtual machine has to access from the centralized S3 storage.
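For scenario 2, exporting a FUSE mount such as s3fs through the kernel NFS server needs a small amount of extra configuration. A minimal sketch, assuming hypothetical paths and a hypothetical client subnet (and assuming s3fs was mounted with allow_other so the NFS server can read it):

```shell
# /etc/exports entry on the gateway (hypothetical paths and subnet).
# fsid= is required because FUSE filesystems lack a stable device number
# that the kernel NFS server could otherwise use to identify the export.
/mnt/s3gw  10.0.0.0/24(rw,sync,no_subtree_check,fsid=100)
```

After editing /etc/exports, `exportfs -ra` reloads the export table without restarting the NFS server.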
@ggtakec commented on GitHub (Jun 13, 2022):
@giulianoc
Assuming the s3fs cache works effectively once the problem in #1961 is solved, you may be able to decide based on the following conditions.
If the s3fs cache (limited by local disk space) is sufficient for the requests from those NFS clients, then the s3fs host could serve as the gateway.
And if there are few file updates, the cache hit rate will be higher.
If you run out of disk space, you can expect that even with s3fs as the gateway, the cache hit rate will drop and you will not get the expected performance.
In that case, I feel it is better to run s3fs on each host.
In either case, the issue in #1961 seems to need to be resolved.
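The disk-space concern above can be mitigated with s3fs's cache-related mount options. A hedged example with hypothetical bucket, mount point, and sizes; use_cache and ensure_diskfree are standard s3fs options:

```shell
# Hypothetical gateway mount that bounds local cache disk usage.
# ensure_diskfree keeps at least the given number of MB free on the
# cache filesystem by evicting cached objects when space runs low.
s3fs mybucket /mnt/s3gw \
    -o passwd_file=/etc/passwd-s3fs \
    -o use_cache=/var/cache/s3fs \
    -o ensure_diskfree=10240
```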
@giulianoc commented on GitHub (Jun 13, 2022):
In scenario 1, I guess there is the issue that, if Client 1 (running s3fs) creates file_1 and, very soon after, Client 2 (also running s3fs) needs this file for some processing, it is not guaranteed that Client 2 sees file_1, because that depends on whether its local cache was refreshed fast enough.
I guess this issue is not present in scenario 2, because the s3fs cache deployed on the 'gateway/nfs-server' is shared among all the NFS clients. So, if NFS Client 1 creates file_1, NFS Client 2 is certainly able to process it, because both access the same cache through the NFS mount of the s3fs folder.
Do you agree?
Since scenario 2 is currently not an option because of the NFS issue (#1961), do you think I can handle the file_1 issue just described in some other way?
Best regards
@ggtakec commented on GitHub (Jun 15, 2022):
Let me add a supplement about how Client 2 catches file updates in scenario 1.
s3fs can cache a file's stat information and its content, and complete accesses locally.
However, when a file is accessed, the behavior depends on the cache expiration of the file's stat information.
If that expiration has passed, s3fs contacts the server (S3) to check whether the file still exists and whether it has been deleted or modified (by comparing the ETag value).
If an update is detected, the cached stat entry and the cached file content are invalidated.
Simply put, once the stat cache has expired (default 900 seconds), s3fs will catch file updates and refresh the file content as well.
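Following the explanation above, the stat cache expiry can be shortened so that other VMs notice new or changed objects sooner. A sketch with a hypothetical bucket and mount point; stat_cache_expire is a standard s3fs option, and the trade-off is more metadata requests to S3:

```shell
# Hypothetical mount that shortens the stat cache expiry so other VMs
# notice new or changed objects within ~30 seconds instead of 900.
# Shorter expiry means fresher metadata but more HEAD requests to S3.
s3fs mybucket /mnt/s3 \
    -o passwd_file=/etc/passwd-s3fs \
    -o stat_cache_expire=30
```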
@ggtakec commented on GitHub (Feb 12, 2023):
@giulianoc
I believe #1964 and #2016 have been merged and your proposed scenario 2 works.
Try using the master branch code with the update_parent_dir_stat option.
This issue is closed, but if you're still having issues, please reopen it or submit a new issue.
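A minimal sketch of the suggested setup, assuming a build from the master branch that includes the option (bucket name and mount point are hypothetical; my reading is that update_parent_dir_stat refreshes the parent directory's stat information when entries change, which helps NFS clients detect directory updates):

```shell
# Hypothetical gateway mount using the update_parent_dir_stat option
# from the master branch, combined with allow_other for NFS export.
s3fs mybucket /mnt/s3gw \
    -o passwd_file=/etc/passwd-s3fs \
    -o allow_other \
    -o update_parent_dir_stat
```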