mirror of
https://github.com/s3fs-fuse/s3fs-fuse.git
synced 2026-04-25 21:35:58 +03:00
[GH-ISSUE #2156] OOM Killer kills s3fs running in container #1101
Originally created by @tanguofu on GitHub (May 11, 2023).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/2156
When writing a big file, for example one that is tens of gigabytes in size, the active_file cache in system memory becomes very large, which triggers the OOM killer when s3fs runs in a pod with a low memory limit.
So, would it be possible to add an option that makes s3fs write with the O_DIRECT flag, to reduce the system memory cache when writing large files?
Many thanks!
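For illustration, a minimal sketch of what the proposed O_DIRECT write path could look like (this is not s3fs code; the function name and block size are my own). O_DIRECT requires the buffer, length, and file offset to be block-aligned, which is one reason it is awkward to adopt:

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real requirement depends on the device

def direct_write(path, data):
    """Write `data` to `path` bypassing the page cache via O_DIRECT."""
    # O_DIRECT demands that buffer address, length, and file offset are
    # block-aligned, so pad the payload up to a multiple of BLOCK.
    padded = len(data) + (-len(data)) % BLOCK
    buf = mmap.mmap(-1, padded)  # anonymous mmap gives page-aligned memory
    buf[: len(data)] = data
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DIRECT, 0o644)
    try:
        written = os.write(fd, buf)
    finally:
        os.close(fd)
        buf.close()
    os.truncate(path, len(data))  # trim the alignment padding
    return written
```

Note that some filesystems (notably tmpfs) reject O_DIRECT entirely, which is part of the argument made later in this thread against baking it into s3fs.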
@ggtakec commented on GitHub (May 13, 2023):
@tanguofu
Could you tell us what version of s3fs you are using and the contents of the command line(or fstab entry) when starting s3fs?
I'm wondering why the s3fs process gets tens of gigabytes in a single object download.
(If this occurs, I believe it can be avoided with an option, so please tell me how you start it.)
Thanks in advance for your assistance.
@gaul commented on GitHub (May 14, 2023):
If s3fs has unbounded memory use then this is something we should fix. This has nothing to do with O_DIRECT, which limits the kernel page cache.
@tanguofu commented on GitHub (May 16, 2023):
The memory is used by the system's page cache, so I added an fdatasync call to flush the cache, which fixes this.
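The workaround can be sketched as follows (a minimal illustration, not the actual patch; the chunk size and function name are invented). fdatasync forces writeback of the dirty pages, and pairing it with posix_fadvise(POSIX_FADV_DONTNEED), mentioned later in this thread, additionally asks the kernel to drop the now-clean pages:

```python
import os

CHUNK = 16 * 1024 * 1024  # hypothetical flush interval: every 16 MiB written

def write_with_flush(path, data):
    """Write `data`, flushing and dropping cached pages every CHUNK bytes."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        for off in range(0, len(data), CHUNK):
            written += os.write(fd, data[off:off + CHUNK])
            os.fdatasync(fd)  # force writeback of dirty pages to disk
            # Advisory: tell the kernel the flushed range is no longer needed,
            # so its page-cache pages become eligible for reclaim.
            os.posix_fadvise(fd, 0, written, os.POSIX_FADV_DONTNEED)
        return written
    finally:
        os.close(fd)
```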
@ggtakec commented on GitHub (May 27, 2023):
@tanguofu
(Please let us continue the discussion of comment #2157 in this issue.)
The problem is that when running s3fs in a container (Kubernetes/Docker), downloading files of tens of gigabytes increases the active_file cache (page cache), which hits the OOM threshold, isn't it?
And since it is in a container, the active_file cache grows within the free memory area of the host/node, and exceeds the OOM limit.
And as you know, the solution is to either call sync/fsync/fdatasync (#2157), use the O_DIRECT flag, or set drop_caches.
I think drop_caches is the only means of flushing the cache from outside the process. How bad was the performance when you tried it? (Was there a difference with or without the sync command?)
Modifying s3fs itself means either using fdatasync as in #2157 or using the O_DIRECT flag; personally, if it can be handled with the O_DIRECT flag (switched by an option), that method is acceptable. @gaul, what do you think?
I understand that this issue is due to the container's limits and OOM behavior, so it is different from bare-metal, VM, etc. environments.
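Of the three options, drop_caches is the only one that works from outside the process. A sketch (the helper name is mine; writing the real /proc/sys/vm/drop_caches requires root, and it drops clean caches system-wide, so it is a blunt instrument):

```python
def drop_page_cache(procfile="/proc/sys/vm/drop_caches", level=1):
    """Ask the kernel to drop clean caches.

    level 1 = page cache, 2 = dentries and inodes, 3 = both.
    Only clean pages are dropped, so a sync should be issued first.
    Writing the real procfile requires root.
    """
    with open(procfile, "w") as f:
        f.write("%d\n" % level)
```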
@gaul commented on GitHub (May 27, 2023):
@tanguofu Absolutely not. O_DIRECT is strictly worse than your previous PR to fdatasync at some interval, since it instead flushes on every write. I don't believe that you understand how operating systems work, so I will try to explain, but you should start with something like https://www.linuxatemyram.com/.
The kernel buffers writes (dirty data) in memory to improve performance. This also allows different applications to share resources, and lets an application run on different hardware without configuration. The kernel decides when to flush based on its view of the system. If every application randomly starts flushing data, this hurts performance. Why do you think that cp and similar utilities lack these data-flushing policies? I don't understand why you believe you are experiencing out-of-memory situations, which would mean that the kernel is killing processes. I believe you are experiencing the buffer cache growing, which the kernel will naturally sync over time and which should not concern you.
If you want to influence the kernel's behavior, the application is the wrong place to do it. You can do this via control groups or many other mechanisms: https://unix.stackexchange.com/questions/253816/restrict-size-of-buffer-cache-in-linux. We should not add more broken flags to s3fs, which already has too many knobs that users misunderstand and misuse. If you absolutely must have this behavior in your local setup, you can do it via LD_PRELOAD.
@ggtakec commented on GitHub (May 28, 2023):
@gaul You misunderstand something about this issue and me.
I understand that calling fdatasync/fsync etc. directly, and the performance degradation it causes, should be handled at the OS or driver level. (I know this, and on that premise, I find the next problem.)
Now let me explain why this issue bothers me.
It is possible to predict the size of a memory area requested by a user program (allocation).
The OS also checks the unused area of memory and expands the cache (active_file) in memory for the program's file I/O.
But the size of this cache memory (active_file) is something the user cannot know in advance.
This issue proposes a workaround for the s3fs process being killed by OOM Killer in Docker containers and Kubernetes.
In this container environment, the OOM killer has the potential to kill s3fs.
I hope the OOM Killer solves this problem, but as it stands it doesn't seem to.
This issue's main problem is the cache size (active_file) when using containers.
Users can check and limit the memory usage of programs running in a container.
However, the user cannot accurately estimate the maximum size of the cache that the OS creates when writing files inside the container.
This is because the free memory size used for the OS cache is the memory size of the parent host (NODE), not the memory size allocated to the container.
This means that the cache size used by the OS can exceed the container's memory size limit, and the user cannot estimate this.
For this reason, programs that exceed the container's memory size limit will eventually be killed by the OOM Killer.
With the above background in mind, let's talk about s3fs.
Take s3fs downloading a huge object (file) as an example. (An example like this issue.)
In this case, even if the user has prepared both a disk to store files and a disk for the s3fs cache, cache memory will still be consumed by the OS for the writes that follow the download.
In general, the parent HOST(NODE) that runs the container has a large size, so I think there is a lot of free memory.
So this cache memory size will easily exceed the POD's memory size limit.
To resolve this, the user would have to set the POD's memory limit to at least the maximum size of the object they are trying to download. (There may be other ways.)
And the download may involve not just one object but multiple objects, so the memory size cannot be predicted.
In order to claim that s3fs supports running on containers, I think this issue needs to be addressed.
If we don't face this issue, we have to make it clear that s3fs is not available for Docker and k8s containers.
(I hope the OOM Killer problem is solved)
So I think this issue should not be closed yet.
For this issue, here are some sites I found with a quick search. (Because I'm not very good at explaining.)
https://codefresh.io/blog/docker-memory-usage/
https://faun.pub/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d?gi=05246282f475
This issue will be reopened; please discuss it a little longer.
@tanguofu commented on GitHub (May 29, 2023):
I completely agree with @ggtakec's perspective. Using s3fs to download large files in Kubernetes containers makes it difficult to set pod limits. In my opinion, the best solution is to call fdatasync to flush the system cache.
And based on my testing, there is no difference in performance with or without fdatasync.
@gaul commented on GitHub (May 29, 2023):
The reason syncing is a completely broken suggestion is that it conflates the durability of data, which O_DIRECT and fdatasync control, with the memory use of the buffer cache. Using the former to influence the latter is a severe performance pessimization, and you should read more fully about when O_DIRECT is (rarely) useful, specifically for databases and other situations with transactional guarantees. s3fs does not and should not provide these for its data cache, and trying to sync data at random intervals (or on every write, as proposed here!) will hurt 99% of users. A configuration flag will further confuse unsophisticated users who believe that they are somehow improving performance or memory usage.
The correct solution is for the kernel to stall s3fs (or any program) writes so the IO system can catch up. I don't believe that @tanguofu has shown that the out-of-memory killer actually kills s3fs or any other process due to excessive buffer cache, and I strongly suspect that they have misconfigured their system. I suggest closing this issue; they can open a new issue with the actual symptom of their container memory problems, which we must reproduce before merging radical solutions like syncing writes.
I have offered you other workarounds that are best suited to your niche situation, and there are many resources easily found via searching. If there is indeed a general problem that affects a broad cross-section of users then we can discuss and address it, but you need to make the case, since this is the first time in 10 years that anyone has claimed that s3fs writing causes out-of-memory issues due to the buffer cache. Again, you should explain why this uniquely affects s3fs and not other IO utilities like cp and rsync, which copy large files. There are surely better ways for s3fs to interact with IO, but these are generally small optimizations, e.g., posix_fadvise, O_TMPFILE.
@ggtakec commented on GitHub (May 29, 2023):
@gaul
I reverted the merged code (#2157) because it was a thoughtless decision on my part.
As you pointed out, that decision should have been made after considering other things.
As I said before, I want to ensure that s3fs can work inside containers.
For that reason, I argue that it is necessary to solve this problem (including via other methods and settings, without sticking to fdatasync).
Again, "why is this problem happening?":
A huge file cache is created outside the memory range that the container (which s3fs runs in) manages.
And the OOM Killer kills the s3fs POD. (I think the same thing will keep happening.)
My understanding is that in non-container environments (VM or bare metal), the file cache is created within the memory range of the system (OS) where s3fs is running.
In this case, if the system runs out of free memory, the cache is naturally evicted.
But the problem arises with containers.
Writes to disk from the container are file-cached by the drivers in the base OS (parent host).
In other words, the memory used is outside the limits of the container's cgroup. (They are different layers.)
When s3fs is running inside a container, the cgroup has a memory limit set, but this file cache can use memory outside of it. And the OOM Killer detects it.
At the time of reverting #2157, I was going to look into setting up that container and other solutions to this problem instead of fixing s3fs.
So, instead of just discussing fdatasync, I'd like to think of all possible ways to avoid POD outages with the OOM Killer.
(I still haven't fully grasped the behavior of OOM Killer regarding file caching)
I think s3fs may be different from other simple programs, in that it internally creates and writes cache files and download files without the user's knowledge.
We are facing this issue with s3fs working inside a container.
Therefore, I would like to consolidate the information in this issue as it is.
And I'm sorry for changing this issue's subject.
If we can solve this, I hope we can make a strong case that s3fs works fine in containers.
@gaul commented on GitHub (May 29, 2023):
Let's start with, "is this problem happening?" to which I have doubts. @tanguofu please provide a self-contained test case for us to reproduce your symptoms.
@tanguofu commented on GitHub (May 31, 2023):
OS:
The machine has more than 320GB of memory.
Start a test pod with a memory limit of 100MiB, and in that pod use s3fs to mount a bucket containing a file larger than 10GB.
Then, in the pod, copy the file from the bucket to a local dir such as data.
The pod will crash and s3fs will be killed by the OOM killer.
The pod's memory usage, which will grow larger than 8GB, can be watched by:
But if you watch the memory of the s3fs process itself, it is very small.
And more logs about the OOM can be obtained by:
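As a concrete illustration of this reproduction setup, a pod spec along these lines (hypothetical; the names and image are mine, and only the 100Mi limit comes from the description above) pins the container's cgroup memory limit far below the 10GB object:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: s3fs-oom-repro        # hypothetical name
spec:
  containers:
  - name: test
    image: ubuntu:22.04       # assumed image
    resources:
      limits:
        memory: 100Mi         # the low limit described above
    securityContext:
      privileged: true        # FUSE mounts typically need this (or /dev/fuse + SYS_ADMIN)
```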
@ggtakec commented on GitHub (Jun 3, 2023):
@tanguofu Thanks for your reports and kindness.
I tried this on my cheap host and was unable to reproduce the problem.
I prepared the environment with Ubuntu + minikube, started an ubuntu 22.04 POD, ran s3fs on it, and tried downloading a 10GB file (object).
Indeed, the active_file in memory.stat, seen from the HOST side of minikube and from inside the POD, decreases, but it comes back after about a 50MB decrease. This repeats while the file is downloading.
In my reproduction tests, the cache for files (curl's output file, s3fs's cache and save file) appears to be periodically flushed to disk.
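To watch the value being discussed, one can parse active_file out of the cgroup's memory.stat (paths vary: /sys/fs/cgroup/memory/memory.stat under cgroup v1, /sys/fs/cgroup/memory.stat under v2; the helpers below are my own sketch):

```python
def parse_active_file(stat_text):
    """Return the active_file value (bytes) from memory.stat text, or None."""
    # memory.stat is one "key value" pair per line.
    for line in stat_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "active_file":
            return int(value)
    return None

def read_active_file(path="/sys/fs/cgroup/memory.stat"):  # assumed cgroup v2 path
    with open(path) as f:
        return parse_active_file(f.read())
```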
@tanguofu Are there any hosts in your environment where this phenomenon does not occur?
If you have, is there any difference between the environment that works fine and the one that causes the error(which may include drivers)?
As @gaul mentions, currently I can not find any constant glitches caused by Docker containers(or pods on kubernetes).
We have open Issue #2035 similar to this issue.
More research is needed, and I think community help is needed to resolve this issue.
This issue is most likely related to the active_file cache or something similar, and may not be an s3fs-dependent issue, but we would like to resolve it for s3fs in containers.
@tanguofu commented on GitHub (Jun 5, 2023):
@ggtakec Could you test with 64GB of memory and kernel 4.19? This is my system info:
I think /proc/sys/vm/dirty_ratio may make the system use more cache.
@ggtakec commented on GitHub (Jun 10, 2023):
This comment(https://github.com/s3fs-fuse/s3fs-fuse/issues/873#issuecomment-1584002365) may be the same.
@ggtakec commented on GitHub (Jun 12, 2023):
@tanguofu
I haven't found a clear answer yet. (I'm sorry that I haven't been able to test on the 4.19 kernel.)
However, I have found that this phenomenon has been reported by some users who are using old kernels (and cgroups).
And newer cgroups and kernels don't seem to have this problem. (But I haven't found anything definitive yet.)
Given this situation, building a proactive sync (flush) into s3fs itself doesn't seem to be the right workaround (and I agree with @gaul).
Is it possible to tune the following values in your node host?
Tuning this parameter seems to be the best option, although it may degrade overall performance.
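The list of values did not survive in this mirror, but given the earlier mention of /proc/sys/vm/dirty_ratio, writeback tuning on the node host usually involves sysctls along these lines (my assumption, not the original list):

```
# hypothetical values, not the ones from the original comment
vm.dirty_background_ratio = 5     # start background writeback earlier
vm.dirty_ratio = 10               # throttle writers sooner as dirty pages pile up
vm.dirty_expire_centisecs = 1500  # consider dirty pages flushable after 15s
```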
I will continue to investigate, but please let us know your opinion and findings.
Thanks in advance for your assistance.