mirror of
https://github.com/s3fs-fuse/s3fs-fuse.git
synced 2026-04-25 13:26:00 +03:00
[GH-ISSUE #1353] [Ceph v15.2.2] s3fs random data corruption at read #724
Originally created by @pkoutsov on GitHub (Aug 6, 2020).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/1353
Additional Information
Version of s3fs being used (s3fs --version)
s3fs/1.86 (commit hash e0a38ad; OpenSSL)
Version of fuse being used (pkg-config --modversion fuse, rpm -qi fuse, dpkg -s fuse)
FUSE library version: 2.9.7
Kernel information (uname -r)
4.4.0-116-generic
GNU/Linux Distribution, if applicable (cat /etc/os-release)
Ubuntu 16.04.6 LTS
s3fs command line used, if applicable
s3fs syslog messages (grep s3fs /var/log/syslog, journalctl | grep s3fs, or s3fs outputs)
s3fs_debug.zip
Details about issue
First, I would like to thank you for the effort you put into s3fs-fuse. Now to the issue. I am experiencing random data corruption when I read the training tfrecords of ResNet-50, which were uploaded to an S3 mountpoint backed by Ceph v15.2.2. To investigate further, I wrote a Python script that reads all the training tfrecords through the s3fs mountpoint in parallel (32 reading workers) and compares the MD5 of each resulting file to the ETag reported by Ceph (this comparison is valid because I uploaded each tfrecord as a single PUT object, so the ETag is the MD5 of the contents). I have captured and attached s3fs debug output from a run in which my script reported tfrecord train-00639-of-01024 as corrupted. Interestingly, the length of the corrupted tfrecord as read through s3fs matches the one on the S3 endpoint. Also, if I rerun my script, this file is no longer corrupted, but I get corruption in other tfrecords instead.
PS: I have repeated this validation with other S3 mounters, such as goofys, and I don't face this issue with them. I would prefer to use s3fs, though.
Thanks
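The validation described above can be sketched roughly as follows. This is not the reporter's actual script; the helper names, the worker count (32, as stated in the report), and the source of the expected ETags are assumptions. Note the MD5-equals-ETag check only holds for single-part uploads:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def md5_of(path, chunk_size=1 << 20):
    """Stream a file from the s3fs mountpoint and return its hex MD5."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_corrupted(expected, workers=32):
    """expected: dict mapping mountpoint path -> ETag reported by the
    S3 endpoint (equal to the content MD5 only for single-part PUTs).
    Reads all files in parallel and returns paths whose MD5 mismatches.
    """
    def check(item):
        path, etag = item
        # ETags are often quoted in API responses; strip the quotes.
        return path, md5_of(path) == etag.strip('"')

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [path for path, ok in pool.map(check, expected.items()) if not ok]
```

Running many such readers concurrently against large objects is what appears to trigger the intermittent mismatches in the report.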
@gaul commented on GitHub (Aug 16, 2020):
@pkoutsov can you test with the latest master? This has a concurrency fix that may address your symptoms. If it doesn't, it would be great if you can minimize this test case in some way that I can reproduce it on my system. We take data corruption seriously and want to fix this as soon as possible.
@pkoutsov commented on GitHub (Aug 16, 2020):
@gaul thanks for looking into this. I repeated my test case and I still get data corruption at read, although it is less frequent with the current master #1363. While I look for a way to minimize my test case so you can reproduce it, my hint is this: when s3fs reads many objects concurrently (~35+ concurrent object readers), each ~140MB, from an endpoint that can serve all of them very fast (a Ceph cluster here, though perhaps a MinIO instance would also work 🤔), you should see the same data corruption.
@gaul commented on GitHub (Oct 10, 2020):
@pkoutsov Could you test again with the latest master? It includes a race condition fix (0e895f60a0). I also added a test for concurrent readers (3bc565b986). If you can reproduce these symptoms using a test, this would help us find a solution.
@pkoutsov commented on GitHub (Oct 12, 2020):
@gaul I tested again with the latest upstream and unfortunately I am still experiencing data corruption. I started wondering whether another component in my stack (Ceph -> s3fs -> TensorFlow) causes it, but with other S3 mounters the corruption is not present. OK, I will try to adapt the concurrent-readers test to reproduce my corruption.
@gaul commented on GitHub (Nov 15, 2020):
@pkoutsov any update? We would really like to track down any possible corruptions.
@gaul commented on GitHub (Feb 8, 2021):
Please reopen if symptoms persist.