[GH-ISSUE #2156] OOM Killer kills s3fs running in container

kerem commented

2026-03-04 01:51:22 +03:00

Owner

Originally created by @tanguofu on GitHub (May 11, 2023).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/2156

when write the big file for example the file is tens of gigabytes in size. the active_file cache in memory of system because very big， which will be trigger the oom when the s3fs run in a pod which has poor mem limit.

so it is possible to add the options make s3fs write use O_DIRECT flags to reduce the system memory cache for write large file

many thanks！

Originally created by @tanguofu on GitHub (May 11, 2023). Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/2156 when write the big file for example the file is tens of gigabytes in size. the active_file cache in memory of system because very big， which will be trigger the oom when the s3fs run in a pod which has poor mem limit. so it is possible to add the options make s3fs write use O_DIRECT flags to reduce the system memory cache for write large file many thanks！

kerem added the

need info

help wanted

labels

2026-03-04 01:51:22 +03:00

kerem commented

2026-03-04 01:51:23 +03:00

Author

Owner

@ggtakec commented on GitHub (May 13, 2023):

@tanguofu
Could you tell us what version of s3fs you are using and the contents of the command line(or fstab entry) when starting s3fs?
I'm wondering why the s3fs process gets tens of gigabytes in a single object download.
(If this is occurred, I believe it can be avoided with an option, so please tell me how to start it.)
Thanks in advance for your assistance.

@ggtakec commented on GitHub (May 13, 2023): @tanguofu Could you tell us what version of s3fs you are using and the contents of the command line(or fstab entry) when starting s3fs? I'm wondering why the s3fs process gets tens of gigabytes in a single object download. (If this is occurred, I believe it can be avoided with an option, so please tell me how to start it.) Thanks in advance for your assistance.

kerem commented

2026-03-04 01:51:23 +03:00

Author

Owner

@gaul commented on GitHub (May 14, 2023):

If s3fs has unbounded memory then this is something we should fix. This has nothing to do with O_DIRECT which limits the kernel page cache.

@gaul commented on GitHub (May 14, 2023): If s3fs has unbounded memory then this is something we should fix. This has nothing to do with `O_DIRECT` which limits the kernel page cache.

kerem commented

2026-03-04 01:51:23 +03:00

Author

Owner

@tanguofu commented on GitHub (May 16, 2023):

the memory is used by page cache of system memory， so i add the fdatasync to flush cache fix this.

@tanguofu commented on GitHub (May 16, 2023): the memory is used by page cache of system memory， so i add the fdatasync to flush cache fix this.

kerem commented

2026-03-04 01:51:23 +03:00

Author

Owner

@ggtakec commented on GitHub (May 27, 2023):

@tanguofu
(Please let us continue the discussion of comment #2157 in this issue.)

The problem is that when running s3fs in a container(kubernetes/docker), downloading files of tens of gigabytes will increase the ative_file cache(page cache), which will hit the OOM threshold, isn't it?

And since it is in a Container, the active_file cache increases within the free memory area of host/node, and it becomes over the OOM limit.

Ant you know that the solution is to either call sync/fsync/fdatasync(#2157), use the O_DIRECT flag, or set drop_caches.

I think drop_caches is the only means of flushing the cache from outside the process.
How bad was the performance when you tried it? (Was there a difference with or without the sync command?)

Modifying s3fs itself is either using fdatasync like #2157 or using the O_DIRECT flag, but personally if it can be handled with the O_DIRECT flag(switched by option), that method is acceptable.
@gaul How about do you think?

I understand that this issue is due to Container's Limit and OOM behavior, so it is different from bare metal, VM, etc. environments.

@ggtakec commented on GitHub (May 27, 2023): @tanguofu _(Please let us continue the discussion of comment #2157 in this issue.)_ The problem is that when running s3fs in a container(kubernetes/docker), downloading files of tens of gigabytes will increase the ative_file cache(page cache), which will hit the OOM threshold, isn't it? And since it is in a Container, the active_file cache increases within the free memory area of host/node, and it becomes over the OOM limit. Ant you know that the solution is to either `call sync/fsync/fdatasync`(#2157), `use the O_DIRECT flag`, or `set drop_caches`. I think `drop_caches` is the only means of flushing the cache from outside the process. How bad was the performance when you tried it? (Was there a difference with or without the sync command?) Modifying s3fs itself is either using fdatasync like #2157 or using the `O_DIRECT` flag, but personally if it can be handled with the `O_DIRECT` flag(switched by option), that method is acceptable. @gaul How about do you think? I understand that this issue is due to Container's Limit and OOM behavior, so it is different from bare metal, VM, etc. environments.

kerem commented

2026-03-04 01:51:23 +03:00

Author

Owner

@gaul commented on GitHub (May 27, 2023):

@tanguofu Absolutely not. O_DIRECT is strictly worse than your previous PR to fdatasync at some interval, instead flushing on every write. I don't believe that you understand how operating systems work so I will try to explain but you should start with something like https://www.linuxatemyram.com/.

The kernel buffers writes (dirty data) in-memory to improve performance. This also allows different applications to share resources or for the application to run on different hardware without configuration. The kernel decides when to flush based on its view of the system. If every application randomly starts flushing data this hurts performance. Why do you think that cp and similar utilities lack these data flushing policies? I don't understand why you believe you are experiencing out of memory situations which would mean that the kernel is killing processes. I believe you are experiencing the buffer cache growing which the kernel will naturally sync over time and should not concern you.

If you want to influence the kernel's behavior, the application is the wrong place to do this. You can do this via control groups or many other mechanisms: https://unix.stackexchange.com/questions/253816/restrict-size-of-buffer-cache-in-linux. We should not add more broken flags to s3fs which already has too many knobs that users misunderstand and misuse. If you absolutely must have this behavior in your local setup you can do this via LD_PRELOAD.

@gaul commented on GitHub (May 27, 2023): @tanguofu Absolutely not. `O_DIRECT` is strictly worse than your previous PR to `fdatasync` at some interval, instead flushing *on every write*. I don't believe that you understand how operating systems work so I will try to explain but you should start with something like https://www.linuxatemyram.com/. The kernel buffers writes (dirty data) in-memory to improve performance. This also allows different applications to share resources or for the application to run on different hardware without configuration. The kernel decides when to flush based on its view of the system. If every application randomly starts flushing data this hurts performance. Why do you think that `cp` and similar utilities lack these data flushing policies? I don't understand why you believe you are experiencing out of memory situations which would mean that the kernel is killing processes. I believe you are experiencing the buffer cache growing which the kernel will naturally sync over time and should not concern you. If you want to influence the kernel's behavior, the application is the wrong place to do this. You can do this via control groups or many other mechanisms: https://unix.stackexchange.com/questions/253816/restrict-size-of-buffer-cache-in-linux. We should not add more broken flags to s3fs which already has too many knobs that users misunderstand and misuse. If you absolutely must have this behavior in your local setup you can do this via `LD_PRELOAD`.

kerem commented

2026-03-04 01:51:23 +03:00

Author

Owner

@ggtakec commented on GitHub (May 28, 2023):

@gaul You misunderstand something about this issue and me.
I understand that calling fdatasync/fsync etc. directly, and the performance degradation caused by it, should be done at the OS or Driver level.
(This is because I know, and on that premise, I find the next problem.)

Now let me explain why this issue bothers me.

It is possible to predict the size of a memory area requested by the user program(allocation).
Also, the OS checks the unused area of memory, expands the cache (active_file) on the memory for file I/O of the program, and uses it.
And this cache memory(active_file) size is a size that the user cannot comprehend.

This issue proposes a workaround for the s3fs process being killed by OOM Killer in Docker containers and Kubernetes.
In this container environment, the OOM killer has the potential to kill s3fs.

I hope the OOM Killer solves this problem, but as it stands it doesn't seem to.
This issue’s main problem is the cache size(active_file) at using container.
User can check and limit the memory usage of programs which is running in a container.
However, the user cannot accurately estimate the maximum size of the cache if the OS creates it in the case of writing files inside the container.
This is because the free memory size used for the OS cache is the memory size of the parent host (NODE), not the memory size allocated to the container.
This means that the cache size used by the OS can exceed the container's memory size limit, and the user cannot estimate this.

For this reason, programs that exceed the container's memory size limit will eventually be killed by the OOM Killer.

With the above background in mind, let's talk about s3fs.
Take s3fs downloading a huge object(file) as an example.(An example like this issue)
In this case, even if the user has prepared both a disk to store files and a disk for s3fs cache, the cache memory will be used by the OS due to writes after the download.

In general, the parent HOST(NODE) that runs the container has a large size, so I think there is a lot of free memory.
So this cache memory size will easily exceed the POD's memory size limit.

To resolve this case for the user, the user would have to set the maximum size of the object they are trying to download to the POD's limit.(There may be other ways.)
And the file to be downloaded may not be just one object, but may be multiple objects. (Memory size cannot be predicted)

In order to claim that s3fs supports running on containers, I think this issue needs to be addressed.
If we don't face this issue, we have to make it clear that s3fs is not available for Docker and k8s containers.
(I hope the OOM Killer problem is solved)

So I think this issue should not be closed yet.

For this issue, I'll give you the site I found with a quick search. (Because I'm not very good at explaining)
https://codefresh.io/blog/docker-memory-usage/
https://faun.pub/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d?gi=05246282f475

This issue will be reopened, please discuss for just a little longer.

@ggtakec commented on GitHub (May 28, 2023): @gaul You misunderstand something about this issue and me. I understand that calling `fdatasync`/`fsync` etc. directly, and the performance degradation caused by it, should be done at the OS or Driver level. (This is because I know, and on that premise, I find the next problem.) Now let me explain why this issue bothers me. It is possible to predict the size of a memory area requested by the user program(allocation). Also, the OS checks the unused area of memory, expands the cache (active_file) on the memory for file I/O of the program, and uses it. And this cache memory(active_file) size is a size that the user cannot comprehend. This issue proposes a workaround for the s3fs process being killed by OOM Killer in Docker containers and Kubernetes. In this container environment, the OOM killer has the potential to kill s3fs. I hope the OOM Killer solves this problem, but as it stands it doesn't seem to. This issue’s main problem is the cache size(active_file) at using container. User can check and limit the memory usage of programs which is running in a container. However, the user cannot accurately estimate the maximum size of the cache if the OS creates it in the case of writing files inside the container. This is because the free memory size used for the OS cache is the memory size of the parent host (NODE), not the memory size allocated to the container. This means that the cache size used by the OS can exceed the container's memory size limit, and the user cannot estimate this. For this reason, programs that exceed the container's memory size limit will eventually be killed by the OOM Killer. With the above background in mind, let's talk about s3fs. Take s3fs downloading a huge object(file) as an example.(An example like this issue) In this case, even if the user has prepared both a disk to store files and a disk for s3fs cache, the cache memory will be used by the OS due to writes after the download. In general, the parent HOST(NODE) that runs the container has a large size, so I think there is a lot of free memory. So this cache memory size will easily exceed the POD's memory size limit. To resolve this case for the user, the user would have to set the maximum size of the object they are trying to download to the POD's limit.(There may be other ways.) And the file to be downloaded may not be just one object, but may be multiple objects. (Memory size cannot be predicted) In order to claim that s3fs supports running on containers, I think this issue needs to be addressed. If we don't face this issue, we have to make it clear that s3fs is not available for Docker and k8s containers. (I hope the OOM Killer problem is solved) So I think this issue should not be closed yet. For this issue, I'll give you the site I found with a quick search. (Because I'm not very good at explaining) https://codefresh.io/blog/docker-memory-usage/ https://faun.pub/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d?gi=05246282f475 This issue will be reopened, please discuss for just a little longer.

kerem commented

2026-03-04 01:51:24 +03:00

Author

Owner

@tanguofu commented on GitHub (May 29, 2023):

completely agree with @ggtakec perspective. Using s3fs to download large files in Kubernetes containers makes it difficult to set pod limits. In my opinion, the best solution is to call fdatasync to refresh the system cache.
and based on my testing, there is no difference in performance when using or not using fdatasync.

@tanguofu commented on GitHub (May 29, 2023): completely agree with @ggtakec perspective. Using s3fs to download large files in Kubernetes containers makes it difficult to set pod limits. In my opinion, the best solution is to call fdatasync to refresh the system cache. and based on my testing, there is no difference in performance when using or not using fdatasync.

kerem commented

2026-03-04 01:51:24 +03:00

Author

Owner

@gaul commented on GitHub (May 29, 2023):

The reason syncing is a completely broken suggestion is that it conflates the durability of data that O_DIRECT and fdatasync control with the memory use of the buffer cache. Using the former to influence the latter is a severe performance pessimization and you should read more fully about when O_DIRECT is (rarely) useful, specifically for databases and other situations with transactional guarantees. s3fs does not and should not provide these for its data cache and trying to sync data at random intervals (or on every write as proposed here!) will hurt 99% of users. A configuration flag will further confuse unsophisticated users who believe that they are somehow improving performance or memory usage.

The correct solution is for the kernel to stall s3fs (or any program) writes for the IO system to catch up. I don't believe that @tanguofu has shown when the out of memory killer actually kills s3fs or any other process due to excessive buffer cache and I strongly suspect that they have a misconfigured their system. I suggest closing this issue and they can open a new issue with the actual symptom of your container memory problems that we must reproduce before merging radical solutions like syncing writes.

I have offered you other workarounds that are best suited for your niche situation and there are many resources easily found via searching. If there is indeed a general problem that affects a broad cross-section of users then we can discuss and address this but you need to make the case since this is the first time in 10 years that anyone has claimed that s3fs writing causes out of memory issues due to the buffer cache. Again you should explain why this uniquely affects s3fs and not other IO utilities like cp and rsync which copy large files. There are surely better ways for s3fs to interact with IO but these are generally small optimizations, e.g., posix_fadvise, O_TMPFILE.

@gaul commented on GitHub (May 29, 2023): The reason syncing is a completely broken suggestion is that it conflates the *durability* of data that `O_DIRECT` and `fdatasync` control with the *memory use* of the buffer cache. Using the former to influence the latter is a severe performance pessimization and you should read more fully about when `O_DIRECT` is (rarely) useful, specifically for databases and other situations with transactional guarantees. s3fs does not and should not provide these for its data cache and trying to sync data at random intervals (or on every write as proposed here!) will hurt 99% of users. A configuration flag will further confuse unsophisticated users who believe that they are somehow improving performance or memory usage. The correct solution is for the *kernel* to stall s3fs (or any program) writes for the IO system to catch up. I don't believe that @tanguofu has shown when the out of memory killer actually kills s3fs or any other process due to excessive buffer cache and I strongly suspect that they have a misconfigured their system. I suggest closing this issue and they can open a new issue with the actual symptom of your container memory problems that we must reproduce before merging radical solutions like syncing writes. I have offered you other workarounds that are best suited for your niche situation and there are [many resources](https://www.baeldung.com/linux/restrict-size-buffer-cache#how-to-restrict-the-size-of-the-buffer-cache) easily found via searching. If there is indeed a general problem that affects a broad cross-section of users then we can discuss and address this but you need to make the case since this is the first time in 10 years that anyone has claimed that s3fs writing causes out of memory issues due to the buffer cache. Again you should explain why this uniquely affects s3fs and not other IO utilities like `cp` and `rsync` which copy large files. There are surely better ways for s3fs to interact with IO but these are generally small optimizations, e.g., `posix_fadvise`, `O_TMPFILE`.

kerem commented

2026-03-04 01:51:24 +03:00

Author

Owner

@ggtakec commented on GitHub (May 29, 2023):

@gaul
I reverted the merged code (#2157) because it was my deeply thoughtless decision.
As you pointed out, I think that decision should have been made after considering other things.

As I said before, I want to make it support that s3fs can work inside containers.
For that reason, I argue that it is necessary to solve this problem (including other methods and settings without sticking to fdatasync).

Again, "why is this problem happening?":

That is, a huge file cache is created aside where the container(s3fs run on) does not manage the memory range.
And the OOM Killer kills the s3fs POD. (I think that the same thing will happen in the future)
My understanding is that in non-container environments(VM or bearmetal) the file cache is created within a memory range within the system(OS) where s3fs is running.
On this case, if the system runs out of free memory, it will naturally cache out.
But the problem is when the container.
It means that writes to disk from the container are file cached by the drivers in the base OS(parent host).
In other words, the memory used is outside the limits of the cgroup for the container. (it is different layers)
When s3fs is running inside a container, the cgroup has a memory limit set, but this file cache can use memory outside of that.
And the OOM Killer is detecting it.

At the time of reverting #2157, I was going to look into setting up that container and other solutions to this problem instead of fixing s3fs.
So, instead of just discussing fdatasync, I'd like to think of all possible ways to avoid POD outages with the OOM Killer.
(I still haven't fully grasped the behavior of OOM Killer regarding file caching)

I think that s3fs may be different than other simple programs.
That is creating and writing cache files/download files for s3fs internally without the user's knowledge.
We are facing this issue with s3fs working inside a container.

Therefore, I would like to consolidate the information in this issue as it is.
And Im sorry for change this issue subject.

If we can solve this, I hope we can make a strong case that s3fs works fine in containers.

@ggtakec commented on GitHub (May 29, 2023): @gaul I reverted the merged code (#2157) because it was my deeply thoughtless decision. As you pointed out, I think that decision should have been made after considering other things. As I said before, I want to make it support that s3fs can work inside containers. For that reason, I argue that it is necessary to solve this problem (including other methods and settings without sticking to fdatasync). Again, "why is this problem happening?": That is, a huge file cache is created aside where the container(s3fs run on) does not manage the memory range. And the `OOM Killer` kills the s3fs POD. (I think that the same thing will happen in the future) My understanding is that in non-container environments(VM or bearmetal) the file cache is created within a memory range within the system(OS) where s3fs is running. On this case, if the system runs out of free memory, it will naturally cache out. But the problem is when the container. It means that writes to disk from the container are file cached by the drivers in the base OS(parent host). In other words, the memory used is outside the limits of the `cgroup` for the container. (it is different layers) When s3fs is running inside a container, the `cgroup` has a memory limit set, but this file cache can use memory outside of that. And the OOM Killer is detecting it. At the time of reverting #2157, I was going to look into setting up that container and other solutions to this problem instead of fixing s3fs. So, instead of just discussing fdatasync, I'd like to think of all possible ways to avoid POD outages with the OOM Killer. (I still haven't fully grasped the behavior of OOM Killer regarding file caching) I think that s3fs may be different than other simple programs. That is creating and writing cache files/download files for s3fs internally without the user's knowledge. We are facing this issue with s3fs working inside a container. Therefore, I would like to consolidate the information in this issue as it is. And Im sorry for change this issue subject. If we can solve this, I hope we can make a strong case that s3fs works fine in containers.

kerem commented

2026-03-04 01:51:24 +03:00

Author

Owner

@gaul commented on GitHub (May 29, 2023):

Again, "why is this problem happening?":

Let's start with, "is this problem happening?" to which I have doubts. @tanguofu please provide a self-contained test case for us to reproduce your symptoms.

@gaul commented on GitHub (May 29, 2023): > Again, "why is this problem happening?": Let's start with, "is this problem happening?" to which I have doubts. @tanguofu please provide a self-contained test case for us to reproduce your symptoms.

kerem commented

2026-03-04 01:51:24 +03:00

Author

Owner

@tanguofu commented on GitHub (May 31, 2023):

OS：

[root@tcs-172-16-2-5 ~]# cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

the memory of machine which has large than 320GB

[root@tcs-172-16-2-5 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:         321698      155769       21807          46      144121      164216
Swap:             0           0           0

start a testpod with limits of 100Mib memory and in this pod use s3fs mount a bucket which contain a file large than 10GB

s3fs bucket_for_test  /mnt/s3fs-fuse -o passwd_file=/etc/passwd-s3fs -o url=http://minio.ti-inf.svc.cluster.local:80 -o use_path_request_style -f

then run copy file in this pod to copy the file of bucket to local dir such as data.

cp -f /mnt/s3fs-fuse/10GB-file /data

the pod will be crashed and the s3fs kill by OOM kill.

the pod use memory can be watch by, which will be large than 8GB if the pod

kubectl top pod -n xx testpod

but if you watch the memory of s3fs , which is very little

[root@tcs-172-16-2-5 ~]# ps aux |grep s3fs
root     1151862  0.0  0.0 1550700 10016 ?       Sl   May27   0:19 s3fs tibackup /mnt/unifiedcsi/s3fs/csi-objectstorage-ti-platform-fs-tibackup -o passwd_file=/etc/passwd-s3fs -o url=http://minio.ti-inf.svc.cluster.local:80 -o use_path_request_style -f

and the more logs of oom canbe get by

dmesg -T

@tanguofu commented on GitHub (May 31, 2023): OS： ``` [root@tcs-172-16-2-5 ~]# cat /etc/os-release NAME="CentOS Linux" VERSION="7 (Core)" ID="centos" ID_LIKE="rhel fedora" VERSION_ID="7" PRETTY_NAME="CentOS Linux 7 (Core)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:centos:centos:7" HOME_URL="https://www.centos.org/" BUG_REPORT_URL="https://bugs.centos.org/" CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7" ``` the memory of machine which has large than 320GB ``` [root@tcs-172-16-2-5 ~]# free -m total used free shared buff/cache available Mem: 321698 155769 21807 46 144121 164216 Swap: 0 0 0 ``` start a testpod with limits of 100Mib memory and in this pod use s3fs mount a bucket which contain a file large than 10GB ``` s3fs bucket_for_test /mnt/s3fs-fuse -o passwd_file=/etc/passwd-s3fs -o url=http://minio.ti-inf.svc.cluster.local:80 -o use_path_request_style -f ``` then run copy file in this pod to copy the file of bucket to local dir such as data. ``` cp -f /mnt/s3fs-fuse/10GB-file /data ``` the pod will be crashed and the s3fs kill by OOM kill. the pod use memory can be watch by, which will be large than 8GB if the pod ``` kubectl top pod -n xx testpod ``` but if you watch the memory of s3fs , which is very little ``` [root@tcs-172-16-2-5 ~]# ps aux |grep s3fs root 1151862 0.0 0.0 1550700 10016 ? Sl May27 0:19 s3fs tibackup /mnt/unifiedcsi/s3fs/csi-objectstorage-ti-platform-fs-tibackup -o passwd_file=/etc/passwd-s3fs -o url=http://minio.ti-inf.svc.cluster.local:80 -o use_path_request_style -f ``` and the more logs of oom canbe get by ``` dmesg -T ```

kerem commented

2026-03-04 01:51:24 +03:00

Author

Owner

@ggtakec commented on GitHub (Jun 3, 2023):

@tanguofu Thanks for your reports and kindness.

I tried this on my cheap host and was unable to reproduce the problem.

I prepared the environment with Ubuntu + minikube, started the ubuntu 22.04 POD, and run s3fs on it, and I tried downloading a 10GB file(object).
Indeed, the active_file in memory.stat seen from the HOST side of minikube and inside the POD will decrease, but it will return after about 50MB decrease.
It repeats while the file is downloading.
In my reproducing tests, the cache for files(curl's output file, s3fs7s cache and save file) appears to be periodically flushed to the disk.

@tanguofu Are there any hosts in your environment where this phenomenon does not occur?
If you have, is there any difference between the environment that works fine and the one that causes the error(which may include drivers)?

As @gaul mentions, currently I can not find any constant glitches caused by Docker containers(or pods on kubernetes).

Additional Information
We have open Issue #2035 similar to this issue.

More research is needed and I thinnk community help is needed to resolve this issue.
This issue is most likely related to the active_file cache or something and maybe not a s3fs dependent issue, but we would like to resolve about s3fs on the containers. )

@ggtakec commented on GitHub (Jun 3, 2023): @tanguofu Thanks for your reports and kindness. I tried this on my cheap host and was unable to reproduce the problem. I prepared the environment with `Ubuntu` + `minikube`, started the `ubuntu 22.04 POD`, and run s3fs on it, and I tried downloading a 10GB file(object). Indeed, the `active_file` in `memory.stat` seen from the HOST side of `minikube` and inside the POD will decrease, but it will return after about 50MB decrease. It repeats while the file is downloading. In my reproducing tests, the cache for files(curl's output file, s3fs7s cache and save file) appears to be periodically flushed to the disk. @tanguofu Are there any hosts in your environment where this phenomenon does not occur? If you have, is there any difference between the environment that works fine and the one that causes the error(which may include drivers)? As @gaul mentions, currently I can not find any constant glitches caused by Docker containers(or pods on kubernetes). - Additional Information We have open Issue #2035 similar to this issue. More research is needed and I thinnk community help is needed to resolve this issue. This issue is most likely related to the `active_file` cache or something and maybe not a s3fs dependent issue, but we would like to resolve about s3fs on the containers. )

kerem commented

2026-03-04 01:51:24 +03:00

Author

Owner

@tanguofu commented on GitHub (Jun 5, 2023):

@ggtakec could you test with the memory is 64GB and the kernel is 4.19， this is my system info：

i think the /proc/sys/vm/dirty_ratio maybe make system use more cache。

[root@tcs-172-16-3-7 ~]# uname  -a
Linux tcs-172-16-3-7 4.19.90-24.4.v2101.ky10.x86_64 #1 SMP Mon May 24 12:14:55 CST 2021 x86_64 x86_64 x86_64 GNU/Linux

[root@tcs-172-16-3-7 ~]# ls  /proc/sys/vm/dirty*  |xargs -i bash -c "echo {} && cat {}"
/proc/sys/vm/dirty_background_bytes
0
/proc/sys/vm/dirty_background_ratio
5
/proc/sys/vm/dirty_bytes
0
/proc/sys/vm/dirty_expire_centisecs
3000
/proc/sys/vm/dirty_ratio
40
/proc/sys/vm/dirtytime_expire_seconds
43200
/proc/sys/vm/dirty_writeback_centisecs
500

@tanguofu commented on GitHub (Jun 5, 2023): @ggtakec could you test with the memory is 64GB and the kernel is 4.19， this is my system info： i think the `/proc/sys/vm/dirty_ratio` maybe make system use more cache。 ``` [root@tcs-172-16-3-7 ~]# uname -a Linux tcs-172-16-3-7 4.19.90-24.4.v2101.ky10.x86_64 #1 SMP Mon May 24 12:14:55 CST 2021 x86_64 x86_64 x86_64 GNU/Linux [root@tcs-172-16-3-7 ~]# ls /proc/sys/vm/dirty* |xargs -i bash -c "echo {} && cat {}" /proc/sys/vm/dirty_background_bytes 0 /proc/sys/vm/dirty_background_ratio 5 /proc/sys/vm/dirty_bytes 0 /proc/sys/vm/dirty_expire_centisecs 3000 /proc/sys/vm/dirty_ratio 40 /proc/sys/vm/dirtytime_expire_seconds 43200 /proc/sys/vm/dirty_writeback_centisecs 500 ```

kerem commented

2026-03-04 01:51:24 +03:00

Author

Owner

@ggtakec commented on GitHub (Jun 10, 2023):

This comment(https://github.com/s3fs-fuse/s3fs-fuse/issues/873#issuecomment-1584002365) may be the same.

@ggtakec commented on GitHub (Jun 10, 2023): This comment(https://github.com/s3fs-fuse/s3fs-fuse/issues/873#issuecomment-1584002365) may be the same.

kerem commented

2026-03-04 01:51:24 +03:00

Author

Owner

@ggtakec commented on GitHub (Jun 12, 2023):

@tanguofu
I haven't found a clear answer yet. (I'm sorry about I haven't been able to test it on the 4.19 kernel.)
However, I have found that this phenomenon has been raised by some users who are using old kernel(and cgroups).

And I think that new cgroups and kernels don't seem to have this problem.(but I haven't found anything definitive yet.)

Given this situation, building a proactive sync(flush) into s3fs itself doesn't seem to be the right workaround(and I agree with @gaul).

Is it possible to tune the following values in your node host?

vm.dirty_background_ratio
vm.dirty_ratio
Tuning this parameter seems to be the best option, although it may degrade overall performance.

I will continue to investigate, but please let us know your opinion and findings.
Thanks in advance for your assistance.

@ggtakec commented on GitHub (Jun 12, 2023): @tanguofu I haven't found a clear answer yet. (I'm sorry about I haven't been able to test it on the 4.19 kernel.) However, I have found that this phenomenon has been raised by some users who are using old kernel(and cgroups). And I think that new cgroups and kernels don't seem to have this problem.(but I haven't found anything definitive yet.) Given this situation, building a proactive sync(flush) into s3fs itself doesn't seem to be the right workaround(and I agree with @gaul). Is it possible to tune the following values in your node host? - vm.dirty_background_ratio - vm.dirty_ratio Tuning this parameter seems to be the best option, although it may degrade overall performance. I will continue to investigate, but please let us know your opinion and findings. Thanks in advance for your assistance.

kerem referenced this issue

2026-03-04 02:02:17 +03:00

[PR #1101] [MERGED] Avoid race when using thread-unsafe gmtime #1814

Rows
Columns

[GH-ISSUE #2156] OOM Killer kills s3fs running in container #1101