[GH-ISSUE #94] s3fs: failed to read - randomly occurring #56

Closed
opened 2026-03-04 01:41:36 +03:00 by kerem · 22 comments
Owner

Originally created by @mknwebsolutions on GitHub (Dec 9, 2014).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/94

I've got s3fs mounted and working, but every once in a while I'll see a handful of errors like the ones below:

Dec  9 05:40:23 ip-10-0-0-16 s3fs: failed to read(remaining: 0 code: 28  msg: Timeout was reached), so retry this.
Dec  9 05:40:34 ip-10-0-0-16 s3fs: failed to read(remaining: 4 code: 28  msg: Timeout was reached), so retry this.
Dec  9 05:40:34 ip-10-0-0-16 s3fs: failed to read(remaining: 3 code: 28  msg: Timeout was reached), so retry this.
Dec  9 05:40:34 ip-10-0-0-16 s3fs: failed to read(remaining: 2 code: 28  msg: Timeout was reached), so retry this.
Dec  9 05:40:34 ip-10-0-0-16 s3fs: failed to read(remaining: 1 code: 28  msg: Timeout was reached), so retry this.
Dec  9 05:40:34 ip-10-0-0-16 s3fs: failed to read(remaining: 0 code: 28  msg: Timeout was reached), so retry this.
Dec  9 05:40:44 ip-10-0-0-16 s3fs: failed to read(remaining: 4 code: 28  msg: Timeout was reached), so retry this.
Dec  9 05:40:44 ip-10-0-0-16 s3fs: failed to read(remaining: 3 code: 28  msg: Timeout was reached), so retry this.
Dec  9 05:40:44 ip-10-0-0-16 s3fs: failed to read(remaining: 2 code: 28  msg: Timeout was reached), so retry this.
Dec  9 05:40:44 ip-10-0-0-16 s3fs: failed to read(remaining: 1 code: 28  msg: Timeout was reached), so retry this.
Dec  9 05:40:44 ip-10-0-0-16 s3fs: failed to read(remaining: 0 code: 28  msg: Timeout was reached), so retry this.

Eventually the retries reach the limit and the file is not pulled over s3fs. Not sure why this is happening; it's been pretty random... I want to say I see it occurring more often when files are 20 MB+.

I'm able to switch into the s3fs directory and view the actual files, touch new files, etc.

kerem closed this issue 2026-03-04 01:41:36 +03:00

@gaul commented on GitHub (Dec 9, 2014):

@mknwebsolutions What does "not reach limit" mean? Is the file corrupt, or does your application not report an error? If you are running 1.78 and have an intermittent network connection, you may have encountered #64.


@mknwebsolutions commented on GitHub (Dec 9, 2014):

@andrewgaul the "not reach limit" is this error below:

Dec 9 05:40:13 ip-10-0-0-16 s3fs: Over retry count(3) limit(/file-name-here:1).

The file isn't corrupt. I've tried copying random files of various sizes across and I still see this issue - definitely not corrupt files. The network is through Amazon AWS, and I'm sure AWS isn't having any network issues.


@chrislovecnm commented on GitHub (Jan 6, 2015):

I am getting the exact same problem. How can we help you debug this? I have compiled master.


@mknwebsolutions commented on GitHub (Jan 6, 2015):

So I was actually able to get everything working after rebooting the server. It has just worked (and is still working) since then.


@ggtakec commented on GitHub (Jan 6, 2015):

Hi all,
(Sorry for the late reply.)

s3fs supports multipart requests (sending several requests in parallel), and I suspect this problem depends on the number of parallel requests.
If you can, please try setting small values for the multireq_max and parallel_count options.
I would like to know the result.

Thanks in advance for your help.
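The suggestion above can be expressed as mount options. A minimal sketch as an /etc/fstab entry - the bucket name, mount point, and the specific values are placeholders chosen for testing, not recommendations:

```
# Hypothetical fstab entry: small multireq_max / parallel_count values
# to reduce the number of parallel requests, per the suggestion above.
mybucket /mnt/s3 fuse.s3fs _netdev,multireq_max=5,parallel_count=2 0 0
```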


@chrislovecnm commented on GitHub (Jan 6, 2015):

I am having the problem specifically on the AWS Amazon Linux AMI. I am fine on a Gentoo distro running it locally. I am spinning up a Gentoo Docker container to see if I am OK on Gentoo in AWS.

What AMIs are confirmed to work?

@ggtakec I will test your recommendations as well.


@chrislovecnm commented on GitHub (Jan 6, 2015):

@ggtakec initial testing is showing that this appears to be a distro issue. The Amazon Linux AMI is throwing those errors constantly, while a Gentoo Docker container running on the same damn box is working like a champ. Man, at times I HATE CentOS and RHEL...


@mknwebsolutions commented on GitHub (Jan 13, 2015):

I was able to solve the issue by just installing the latest s3fs and rebooting.


@csgyuricza commented on GitHub (Jan 13, 2015):

Thank you - I am now able to run it with the latest version, but I still get that same timeout error occasionally.


@mknwebsolutions commented on GitHub (Jan 13, 2015):

What does your bash mount command look like?


@mknwebsolutions commented on GitHub (Jan 13, 2015):

Actually, I take that back - it looks like my mounted S3 went bad a few hours ago: "Transport endpoint is not connected".


@mknwebsolutions commented on GitHub (Jan 13, 2015):

I'm going to try out the -f option (foreground) from https://github.com/s3fs-fuse/s3fs-fuse/issues/57


@mknwebsolutions commented on GitHub (Jan 13, 2015):

-f didn't work, back to having the same issue, log below:

s3fs_init(2595): init
s3fs_check_service(2894): check services.
    CheckBucket(2228): check a bucket.
    RequestPerform(1483): HTTP response code 200
s3fs_getattr(691): [path=/]
s3fs_getattr(691): [path=/]
s3fs_getattr(691): [path=/]
s3fs_getattr(691): [path=/]
s3fs_getattr(691): [path=/]
s3fs_getattr(691): [path=/]
s3fs_access(2646): [path=/][mask=X_OK ]
s3fs_opendir(2050): [path=/][flags=100352]
s3fs_readdir(2182): [path=/]
  list_bucket(2225): [path=/]
    ListBucketRequest(2270): [tpath=/]
    RequestPerform(1483): HTTP response code 200
  readdir_multi_head(2105): [path=/][list=0]
    Request(3150): [count=20]
MultiRead(3113): failed to read(remaining: 19 code: 28  msg: Timeout was reached), so retry this.
MultiRead(3113): failed to read(remaining: 18 code: 28  msg: Timeout was reached), so retry this.
MultiRead(3113): failed to read(remaining: 17 code: 28  msg: Timeout was reached), so retry this.
MultiRead(3113): failed to read(remaining: 16 code: 28  msg: Timeout was reached), so retry this.
MultiRead(3113): failed to read(remaining: 15 code: 28  msg: Timeout was reached), so retry this.
MultiRead(3113): failed to read(remaining: 14 code: 28  msg: Timeout was reached), so retry this.
MultiRead(3113): failed to read(remaining: 13 code: 28  msg: Timeout was reached), so retry this.
MultiRead(3113): failed to read(remaining: 12 code: 28  msg: Timeout was reached), so retry this.
MultiRead(3113): failed to read(remaining: 11 code: 28  msg: Timeout was reached), so retry this.

@mknwebsolutions commented on GitHub (Jan 13, 2015):

I bumped my instance up to a Medium instance on EC2 -- the errors are immediately gone. This is the second server that followed suit. Micro and Small EC2 instances constantly fail; must be a weak connection or something?


@chrislovecnm commented on GitHub (Jan 28, 2015):

@mknwebsolutions regardless of connection speed, I have this problem with the Amazon AMI.


@mknwebsolutions commented on GitHub (Jan 28, 2015):

It's a very weird issue. After my last comment here, my Medium instance did fail again a few times. After numerous restarts, s3fs finally locked in and is still steady today. The bug is unknown so far. Could be a DNS issue or something.


@ggtakec commented on GitHub (Mar 8, 2015):

Hi, all

I heard the libcurl problem(?) about this issue from @boazrf in #117.

> I was able to overcome the problem by downgrading libcurl to version 7.31. It seems that there is a known bug in newer versions of curl that causes a failure with its DNS cache in some cases. The bug doesn't exist in version 7.31 (http://stackoverflow.com/questions/27093467/curl-hostname-was-not-found-in-dns-cache-error).
>
> Because yum downgrade didn't work, I did the following:
>
>   1. Downloaded the curl 7.31 source (wget http://www.execve.net/curl/curl-7.31.0.tar.gz)
>   2. Built it
>   3. Manually replaced libcurl (cp libcurl.so.4.3.0 /usr/lib64/libcurl.so.4.3.0)
>   4. Restarted the s3fs mount
>
> Following that, the "Hostname was NOT found in DNS cache" message disappeared, and so did CURLE_COULDNT_RESOLVE_HOST - and most important, file opens stopped failing.
>
> So - I consider this issue closed. I believe this workaround will also resolve issue #94. It might be a good idea to update the install doc and add a check for a valid libcurl version.

In one case, when s3fs gets a CURLE_COULDNT_RESOLVE_HOST error, it turns into this timeout error.
If you have the same problem, please check your libcurl version.

Thanks in advance for your assistance.
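A quick way to act on the advice above is to check which curl/libcurl is installed and which library s3fs is linked against. A hedged sketch - the exact paths, and whether s3fs is on your PATH, will vary by distro:

```shell
# Print the installed curl/libcurl version (first line of `curl --version`).
curl --version | head -n 1

# Show which libcurl shared object s3fs is actually linked against;
# prints nothing if s3fs is not installed on this machine.
ldd "$(command -v s3fs)" 2>/dev/null | grep libcurl || true
```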


@mknwebsolutions commented on GitHub (Mar 10, 2015):

@ggtakec makes sense - I figured it was something with DNS. Downgrading, I'd say, is a temporary solution until a real fix is rolled out.


@ggtakec commented on GitHub (Mar 24, 2015):

I'm looking into the cause of this problem now, but have not been able to solve it yet.

I think the cause of #117 was CURLE_COULDNT_RESOLVE_HOST, i.e. a failure to resolve the host name. (There are many possible direct causes for this.)

Separately, I ran s3fs with a very small connect timeout on EC2, and I was able to reproduce the same retry error as mknwebsolutions's result.
If the cause is the connect timeout, we should set a larger value for the "connect_timeout" option (the default is 10s). It may also be necessary to set the "readwrite_timeout" option (its default is 30s).

Last, about the #105 "transport endpoint not connected" error: this is probably ENOTCONN(errno), which might be a bug in s3fs. However, this error is also connection-related, so we may be able to avoid it with the above options.

If you can, please try specifying those options and let me know the result.
Regards,
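The options mentioned above can be sketched as an /etc/fstab entry - the bucket name, mount point, and the chosen values are placeholders; the values simply raise the defaults discussed in the comment:

```
# Hypothetical fstab entry raising connect_timeout (default 10s) and
# readwrite_timeout (default 30s), per the suggestion above.
mybucket /mnt/s3 fuse.s3fs _netdev,connect_timeout=30,readwrite_timeout=60 0 0
```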


@mknwebsolutions commented on GitHub (Mar 24, 2015):

@ggtakec I'm recalling some prior experience where a 10s timeout is definitely too low; it should be at least 30 seconds. If I remember correctly, I had issues in the past with AWS endpoints taking more than 10s to resolve "X".


@ggtakec commented on GitHub (Apr 12, 2015):

@mknwebsolutions I updated the default timeout values in #167.
Please check it.
If the timeout error still occurs, please try changing the timeout values with the connect_timeout and readwrite_timeout options.
Regards,


@ggtakec commented on GitHub (Jan 17, 2016):

I'm closing this issue; if you still have a problem, please open a new issue or reopen this one.

Thanks in advance for your help.
