[GH-ISSUE #988] d-state processes on RHEL7.6 #548
Originally created by @muryoutaisuu on GitHub (Mar 22, 2019).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/988
Version of s3fs being used (s3fs --version)
Version of fuse being used (pkg-config --modversion fuse, rpm -qi fuse, dpkg -s fuse)
Kernel information (uname -r)
GNU/Linux Distribution, if applicable (cat /etc/os-release)
s3fs command line used, if applicable
/etc/fstab entry, if applicable
s3fs syslog messages (grep s3fs /var/log/syslog, journalctl | grep s3fs, or s3fs outputs)
If you execute s3fs with the dbglevel and curldbg options, you can get detailed debug messages (see the command sketch after this list)
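For reference, a minimal shell sketch (not part of the original report) that collects the details requested above; the bucket name mybucket and mountpoint /mnt/s3 are placeholders:

```sh
# Collect the diagnostics requested by the issue template.
s3fs --version                    # s3fs version
pkg-config --modversion fuse      # FUSE version (or: rpm -qi fuse / dpkg -s fuse)
uname -r                          # kernel version
cat /etc/os-release               # distribution
grep s3fs /proc/mounts            # active s3fs mounts and their options
journalctl | grep s3fs            # s3fs syslog messages

# For detailed debug output, run s3fs in the foreground with debug options
# (mybucket and /mnt/s3 are placeholders).
s3fs mybucket /mnt/s3 -f -o dbglevel=info -o curldbg
```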
I have changed the hostname and S3 endpoint (we have S3 on-prem).
These are the last logs before I had to reboot due to D-state processes, which I only noticed on Mar 22 at 09:58.
Details about issue
After upgrading from RHEL 7.5 to RHEL 7.6, we encountered D-state processes that referenced s3fs-fuse mounts. Everything seemed to work fine on RHEL 7.5. I also opened issue #981, but I don't know if the two are related.
I did not find anything suspicious in the s3fs-fuse logs in rsyslog. The hanging processes (be it df, ls, or anything else) seemed to be stuck in the kernel function autofs4_expire_wait. We haven't discovered how or why this happens, and we haven't seen any pattern; we can't point in any particular direction.
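As an aside, a minimal sketch (not from the original report) of how such hangs can be inspected on RHEL 7; the PID is a placeholder:

```sh
# List processes in uninterruptible sleep (D state) and the kernel
# function each one is waiting in (wchan).
ps -eo pid,stat,wchan:32,comm | awk 'NR == 1 || $2 ~ /D/'

# Dump the full kernel stack of a hung process (requires root);
# 12345 is a placeholder PID.
sudo cat /proc/12345/stack
```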
As already mentioned in #976, we had to find a different workaround for our production system. Nevertheless, I have started s3fs-fuse in my lab environment. Should the issue arise again, I may be able to provide additional information.
@ggtakec commented on GitHub (Mar 24, 2019):
@muryoutaisuu Thank you for the log.
I checked your log, and it suggests the following:
Is this a log from the double start described in #981?
If it is, I think this log comes from the first s3fs you started (not the second one).
The first s3fs probably wrote no further logs after the second s3fs started, because the second s3fs handles everything.
I'm not familiar with Prometheus, but isn't this the s3fs that you started first for Prometheus? (And did Prometheus report any problem?)
If this log is associated with #981, I think the cause of this issue is as described above.
@gaul commented on GitHub (Mar 25, 2019):
Is this a regression from upgrading RHEL from 7.5 to 7.6, or from upgrading s3fs from 1.84 to 1.85? If the former, we should understand what changed and perhaps file a Red Hat Bugzilla issue.
@muryoutaisuu commented on GitHub (Mar 26, 2019):
@ggtakec I'm not sure if it was mounted twice; I didn't check at the time. It could potentially be the case, since the nonempty option was set.
The mount only contains Prometheus alert rules and Alertmanager configurations, which are loaded only on startup and on config reload. Hence access to the mount should be quite sparse. The mount itself is not monitored by Prometheus; in fact, it could be any other application using the mount.
@gaul The issues started as soon as we upgraded to RHEL 7.6 (still using s3fs v1.84 at that time). I then rebuilt s3fs at v1.85 (to ensure that it was compiled against the new libraries), but the issue still occurred.
I will remount without the nonempty option in my lab environment and see whether it still happens (and is therefore unrelated to the nonempty option).
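For what it's worth, a minimal sketch (with /mnt/s3 as a placeholder) for checking whether a mountpoint carries stacked mounts or a second s3fs process, since nonempty allows mounting over a non-empty directory and can mask a second mount:

```sh
# More than one matching line indicates stacked (double) mounts;
# /mnt/s3 is a placeholder mountpoint.
grep ' /mnt/s3 ' /proc/mounts

# List the mount(s) resolved at that path.
findmnt /mnt/s3

# More than one s3fs process can indicate a double start (see #981).
pgrep -a s3fs
```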
@muryoutaisuu commented on GitHub (Mar 26, 2019):
I just checked our Puppet configuration. A colleague added the nonempty option around 3 weeks ago (coincidentally, also around the time we patched our RHEL to 7.6). So the issue might have nothing to do with the OS patch.
I've sent him an email to ask why he added the option. Unfortunately, he's on vacation until next month.
I will report back.
@muryoutaisuu commented on GitHub (Apr 2, 2019):
The colleague couldn't really give me an answer as to why he configured the nonempty option. Either way, I was not able to reproduce the issue on my lab system. I tried with and without the nonempty option, and even tried with the nonempty option while intentionally mounting twice. I always gave it at least 24 hours to happen, but nothing has happened so far. I'd rather not try reproducing it on our production system.
Therefore, I can't say whether the RHEL 7.6 upgrade, the nonempty option, or both in conjunction led to the issue.
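For completeness, a minimal sketch of the reproduction attempt described above; the bucket name, mountpoint, and credentials path are placeholders:

```sh
# Mount once with nonempty, then intentionally mount a second time over
# the same mountpoint (mybucket, /mnt/s3, and the passwd file are placeholders).
mkdir -p /mnt/s3
s3fs mybucket /mnt/s3 -o passwd_file=/etc/passwd-s3fs -o nonempty
s3fs mybucket /mnt/s3 -o passwd_file=/etc/passwd-s3fs -o nonempty
```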
@ggtakec commented on GitHub (Apr 7, 2019):
@muryoutaisuu Thank you for the reply.
For now, I understand that the cause was either a double mount, or remains unknown (whether it was due to RHEL 7.6 is unclear).
However, judging from the log and the behavior you reported, I think it is highly likely that this was caused by double mounting.
I am closing this issue; please reopen it if it happens again.
Thanks for your help.