mirror of
https://github.com/s3fs-fuse/s3fs-fuse.git
synced 2026-04-25 13:26:00 +03:00
[GH-ISSUE #964] s3fs: segfault error 4 in libc-2.27.so #539
Originally created by @woodcoder on GitHub (Feb 22, 2019).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/964
I'm having issues with s3fs using very high CPU under load after a few days of running -- after I switched on debug logging, it segfaulted. Details below.
The other symptom is occasionally ending up with `Transport endpoint is not connected` errors. In all cases restarting the mount recovers the situation (although occasionally I get left with `target is busy` errors and have to `umount -l` to get things back).
Additional Information
The following information is very important in helping us to help you. Omitting these details may delay your support request or leave it unanswered.
Keep in mind that the commands we suggest for retrieving this information are oriented to GNU/Linux distributions, so you may need different ones if you run s3fs on macOS or BSD.
Version of s3fs being used (s3fs --version)
Amazon Simple Storage Service File System V1.84(commit:unknown) with OpenSSL
Version of fuse being used (pkg-config --modversion fuse, rpm -qi fuse, dpkg -s fuse)
2.9.7
Kernel information (uname -r)
4.15.0-1029-aws
GNU/Linux Distribution, if applicable (cat /etc/os-release)
Ubuntu 18.04.1 LTS
systemd mount options
s3fs syslog messages (grep s3fs /var/log/syslog, journalctl | grep s3fs, or s3fs outputs)
Details about issue
@woodcoder commented on GitHub (Feb 22, 2019):
May be related to #805, #547 and #759.
@gaul commented on GitHub (Feb 22, 2019):
Looks like heap corruption -- could you try running s3fs under Valgrind?
@woodcoder commented on GitHub (Feb 22, 2019):
That might be possible (it's a live service) -- how would I build the s3fs to provide the most useful info? At the moment our s3fs install process is:
And then rather than using the systemd unit I presume we just run s3fs under valgrind from the command line (with the same options)?
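The install snippet itself did not survive the mirror; for reference, the standard from-source build that the s3fs-fuse README documents looks roughly like this (package names assume Debian/Ubuntu):

```shell
# Build dependencies on Debian/Ubuntu (per the s3fs-fuse README)
sudo apt-get install build-essential automake autotools-dev libtool \
    libfuse-dev libcurl4-openssl-dev libxml2-dev libssl-dev

# Standard autotools build of s3fs-fuse from source
git clone https://github.com/s3fs-fuse/s3fs-fuse.git
cd s3fs-fuse
./autogen.sh
./configure
make
sudo make install
```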
@gaul commented on GitHub (Feb 23, 2019):
The default configuration will likely work but you may help by disabling optimization but retaining debug symbols:
Testing against master will make it easier to correlate. Valgrind does impose CPU overhead but if s3fs is IO bound as it usually is then Valgrind should not slow access too much. We would really appreciate your assistance here since this issue affects several people and we cannot reproduce it ourselves.
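With the autotools build, one way to disable optimization while retaining debug symbols is to override the compiler flags at configure time (a sketch of the usual idiom):

```shell
# -g keeps debug symbols so Valgrind can map reports to source lines;
# -O0 disables optimization so stack traces aren't mangled by inlining.
./configure CXXFLAGS='-g -O0'
make
```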
@woodcoder commented on GitHub (Feb 25, 2019):
Hi!
Here are the results of a period of running s3fs under valgrind (`sudo valgrind --leak-check=full --log-file=valgrind.log s3fs...`).
This is the 1.84 s3fs version as downloaded from the releases and built as above. We didn't see any segfault (yet?), but I wondered if this valgrind.log was what you were looking for?
I can run it again, but let me know if I need to use any further parameters (or update the source to master).
Many thanks!
@gaul commented on GitHub (Mar 3, 2019):
@woodcoder This output is very helpful! I am still puzzling over why this happens -- do you know which sequence of operations triggers it? Also which flags do you provide to s3fs?
@woodcoder commented on GitHub (Mar 3, 2019):
The flags are at the top of the logfile (although bucket-name isn't the real bucket name):
`-o rw,allow_other,use_sse=1,iam_role=bucket-name-iam-role,host=https://s3.amazonaws.com,use_cache=/tmp/bucket-name-cache,retries=5,dev,suid`
I'm not sure of the exact sequence of operations that triggers it -- however, the filesystem is used for uploading image files, so in general it's a case of uploading an image, sometimes creating a thumbnail, and then lots of reads for downloads. I would imagine there might be some directory listing (to ensure unique filenames) and file attribute checking (to ensure up-to-dateness before serving files) going on too.
The only other info that might be relevant is that:
The crontab is running `/usr/bin/find /tmp/bucket-name-cache -type f -daystart -mtime +5 -delete`.
Version 1.83, while memory-hungry, doesn't seem to suffer the same problem (at least on Ubuntu 16.04 LTS -- we've rolled back to this version now to see if it's stable on Ubuntu 18.04 LTS too).
@gaul commented on GitHub (Mar 4, 2019):
I successfully reproduced these symptoms with a simple concurrent test creating and removing files. However, the reference counting and locking discipline confuses me and I will need some more time to investigate.
@woodcoder commented on GitHub (Apr 19, 2019):
@gaul @ggtakec I wanted to say thank you for fixing this issue! We've been running 1.85 for a while now and it seems much more stable.
I still have a feeling that the memory usage is very slowly creeping up, so I ran valgrind again in the same way as above, but with the 1.85 release. It reports one error (valgrind.log) -- I don't know if that's an issue?
Nonetheless, I've not seen any more segfaults so far so thank you!!!
@ggtakec commented on GitHub (Apr 22, 2019):
@woodcoder Thanks for your help.
There may still be problems; I will examine this in detail.
@gaul commented on GitHub (Apr 27, 2019):
@woodcoder I am glad that s3fs works better and the Valgrind feedback you provided suggests that this is not yet completely fixed. Reopening...
@woodcoder commented on GitHub (Jun 8, 2019):
I'm unfortunately definitely still seeing memory usage grow with 1.85, so I've done some more logging in the hope it will be useful in locating the problem. Following the advice in another issue, I tried running it with `valgrind --tool=massif` this time. Logs are below:
valgrind-massif.log
massif.out.9798.txt
massif.out.9801.txt
The first time I tried this I actually recreated the segfault, but didn't realise that massif created separate log files, so unfortunately only have the valgrind file for this:
valgrind-massif.log
Finally, I ran it again using `valgrind --leak-check=full` in the hope that this might help:
valgrind.log
These are all running against the 1.85 release compiled from source as above.
@woodcoder commented on GitHub (Aug 13, 2019):
Hi @gaul and @ggtakec
I'm still seeing s3fs memory usage climbing from about 400M resident to 900M+ after several days.
I ran a couple more valgrind massif logging sessions against the latest source (as of 11 Aug 2019, commit `ccc79ec139`):
valgrind-massif.log
massif.out.22949.txt
massif.out.22952.txt
valgrind-massif.log
massif.out.1411.txt
massif.out.1413.txt
These logs (and the ones against the 1.85 release in the above comment) only cover a fairly short period of time, so I don't know whether they capture the leak or help at all?
Very happy to run some longer/different valgrind settings to help resolve this.
@gaul commented on GitHub (Sep 4, 2019):
Sorry for the delayed response, but I still do not understand the error. Using `ms_print` shows the top memory consumer is gnutls. 100 MB of ASN-related data seems unlikely, but the way we use libcurl seems correct. I wonder whether running `s3fs -o no_check_certificate` works around your issue? Testing this might point us in the right direction.
@woodcoder commented on GitHub (Sep 5, 2019):
Hi @gaul -- thank you for looking into this! I will give that option a try and let you know if it improves things. Would a valgrind massif log against the latest source be useful for you?
I see what you mean about 100MB of ASN data seeming unlikely! I'm using IAM and SSE options -- is that relevant at all (or does that all happen outside curl)?
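For anyone following along, the `ms_print` step @gaul mentions above can be run directly on one of the massif output files attached to this thread (file name taken from the logs above):

```shell
# ms_print ships with Valgrind; it renders a massif output file as a
# heap-growth chart plus snapshot tables showing which allocation
# sites dominate the heap.
ms_print massif.out.9798.txt | less
```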
@woodcoder commented on GitHub (Jan 26, 2020):
Hi @gaul and @ggtakec
Here's a further valgrind massif logging session against the source (as of 8 Sep 2019, commit `81102a5963`):
valgrind-massif.log
massif.out.19228.txt
massif.out.19230.txt
This is using the `no_check_certificate` option. I'm not convinced it solves the problem. Please let me know if I can try any other options/logging with different versions etc. to help get to the bottom of this.
@woodcoder commented on GitHub (Jan 26, 2020):
BTW, while getting set up for the above valgrind run I was also seeing the Transport endpoint error (mentioned in #1228) from the 23 Sep 2019 commit `58b3cce320` onwards. If I build any version from source after that commit, I get a segfault on startup and the mount point requires a `umount -l` to clear. Here's the valgrind logging from that version in case it's useful:
valgrind-massif.log
massif.out.17875.txt
massif.out.17877.txt
@woodcoder commented on GitHub (Mar 28, 2020):
This issue has been confirmed as due to excessive memory usage when using curl with the GnuTLS backend; see https://github.com/curl/curl/issues/5102.
The workaround is to use OpenSSL.
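A sketch of applying that workaround, assuming a Debian/Ubuntu system (on those distributions libcurl's TLS backend is fixed at link time by which `-dev` package is installed, and the two packages conflict, so installing one replaces the other):

```shell
# Check which TLS backend your libcurl uses; the leak affects GnuTLS builds.
# The first line of output names the backend, e.g. "OpenSSL/1.1.1" or "GnuTLS".
curl --version | head -n1

# Swap the libcurl dev package to the OpenSSL variant, then rebuild s3fs
# so it links against the OpenSSL-backed libcurl.
sudo apt-get install libcurl4-openssl-dev   # replaces libcurl4-gnutls-dev
./autogen.sh && ./configure && make && sudo make install
```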