[GH-ISSUE #1962] Architectural suggestion regarding s3fs #991

Closed
opened 2026-03-04 01:50:29 +03:00 by kerem · 5 comments

Originally created by @giulianoc on GitHub (Jun 12, 2022).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/1962

Hi,
I need to share an s3 storage among several virtual machines.

My requirement is that a file created in an s3fs-mounted folder on one virtual machine
has to be accessible as soon as possible by the other virtual machines.

I guess I can use two approaches:

  1. Every virtual machine mounts the S3 storage directly with s3fs.
  2. We could use a 'gateway/nfs-server' virtual machine where the S3 storage is mounted by s3fs, and all other virtual machines mount the exported folder from the gateway over NFS.

Of course, in scenario 1, every virtual machine needs the S3 credentials to mount the storage, plus
a folder to be used as the s3fs cache, while in scenario 2 only the gateway machine needs them.
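
For concreteness, the two setups could be sketched roughly as follows (the bucket name, mount points, credential file paths, and client network are placeholders, not values from this thread):

```shell
# Scenario 1: every VM mounts the bucket directly with s3fs.
# Each VM needs the credentials file and its own local cache folder.
s3fs mybucket /mnt/s3 \
  -o passwd_file=/etc/passwd-s3fs \
  -o use_cache=/var/cache/s3fs

# Scenario 2: only the gateway VM runs s3fs and re-exports it over NFS;
# client VMs need no S3 credentials and no local s3fs cache.
# On the gateway (allow_other lets the NFS server daemon read the mount):
s3fs mybucket /mnt/s3 \
  -o passwd_file=/etc/passwd-s3fs \
  -o use_cache=/var/cache/s3fs \
  -o allow_other
# /etc/exports on the gateway:
#   /mnt/s3  10.0.0.0/24(rw,no_subtree_check,fsid=1)

# On each client VM:
mount -t nfs gateway:/mnt/s3 /mnt/s3
```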

Anyway, which of the two approaches would be best, considering the requirement above?

Best regards

kerem closed this issue 2026-03-04 01:50:29 +03:00

@vguilleaume commented on GitHub (Jun 13, 2022):

Hi Giulianoc,

I believe both approaches can work.

Depending on where your S3 storage lives (public cloud, an object storage appliance, etc.), I see an advantage in using a caching layer between the 'gateway/nfs-server' and the S3 storage. Such a cache can keep the data locally for a certain amount of time, even while a copy also lives in the bucket.

I have some experience combining an NFS server with a local cache and, in the backend, s3fs copying the data to the S3 bucket through a workflow mechanism.

The first option, direct access to S3 from each VM, will work as well, but it can require a bit more configuration since management is decentralized.

I think both can work; beyond that, it is perhaps a question of the volume and type of data each virtual machine has to access from the central S3 storage.


@ggtakec commented on GitHub (Jun 13, 2022):

@giulianoc
If you expect the s3fs cache to work effectively once the problem in #1961 is solved, you may be able to decide based on the following conditions.

If the s3fs cache (which is limited by local disk space) is large enough for the requests coming from the NFS clients, then the s3fs host could serve as the gateway. If files are rarely updated, the cache hit rate will also improve.

If you run out of disk space, you can imagine that even with s3fs as the gateway, the cache hit rate will drop and you will not get the expected performance. In that case, I feel it is better to run s3fs on each host.

In either case, the issue in #1961 seems to need to be resolved.
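
The disk-space concern above can be bounded at mount time; a sketch of the gateway mount (the cache directory and free-space threshold are placeholder values):

```shell
# Gateway s3fs mount with a bounded local disk cache:
# - use_cache enables the local file cache in the given directory
# - ensure_diskfree makes s3fs keep at least that many MB free on the
#   cache filesystem, evicting cached objects as needed
s3fs mybucket /mnt/s3 \
  -o passwd_file=/etc/passwd-s3fs \
  -o use_cache=/var/cache/s3fs \
  -o ensure_diskfree=10240 \
  -o allow_other
```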


@giulianoc commented on GitHub (Jun 13, 2022):

Scenario 1, I guess, has the issue that if Client 1 (running s3fs) creates file_1 and, very soon after, Client 2 (also running s3fs) needs this file for some processing, it is not guaranteed that Client 2 sees file_1, because that depends on how quickly its local cache fetches the file.

This issue, I guess, is not present in scenario 2, because the s3fs cache deployed on the 'gateway/nfs-server' is shared among all the NFS clients. So, if NFS Client 1 creates file_1, NFS Client 2 is certainly able to process this file, because both access the same cache through the NFS mount of the s3fs folder.

Do you agree?

Since scenario 2 is currently not an option because of the NFS issue (#1961), do you think I can manage the file_1 issue just described in some other way?

Best regards


@ggtakec commented on GitHub (Jun 15, 2022):

Let me add a supplement about how Client 2 catches file updates and the like in scenario 1.

s3fs can cache a file's stat information and its content, and complete access to it locally.
However, when a file is accessed, the behavior depends on the cache expiration of the file's stat information.
If that expiration has passed, s3fs contacts the server (S3) to check whether the file still exists, has been deleted, or has been modified (by comparing the ETag value).
If an update is detected, the cached stat entry and the cached file content are invalidated.

Simply put, once the stat cache has expired (default 900 seconds), a client will catch file updates and refresh the file content as well.
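
So one knob for scenario 1 is shortening that expiration; a sketch of such a mount (bucket and paths are placeholders, and a shorter expiration trades freshness for extra S3 requests):

```shell
# Lowering stat_cache_expire from the 900s default makes each client
# re-check S3 sooner, so files created by other VMs are noticed faster,
# at the cost of more HEAD requests against the bucket.
s3fs mybucket /mnt/s3 \
  -o passwd_file=/etc/passwd-s3fs \
  -o stat_cache_expire=30
```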


@ggtakec commented on GitHub (Feb 12, 2023):

@giulianoc
I believe #1964 and #2016 have been merged and your proposed scenario 2 works.
Try the master branch code with the `update_parent_dir_stat` option.

This issue is closed, but if you're still having issues, please reopen or submit a new issue.
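
A sketch of the suggested mount for scenario 2, assuming s3fs has been built from the master branch (bucket and paths are placeholders):

```shell
# Mount with the option mentioned above so that parent directory stat
# information is updated when entries change, which the NFS re-export
# relies on; allow_other lets the NFS server daemon read the mount.
s3fs mybucket /mnt/s3 \
  -o passwd_file=/etc/passwd-s3fs \
  -o update_parent_dir_stat \
  -o allow_other
```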
