[GH-ISSUE #1972] Questions about file integrity
Originally created by @vicasong on GitHub (Jun 29, 2022).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/1972
If I understand correctly, when the `enable_content_md5` option is turned on, large file uploads send a Content-MD5 header for each part to ensure that the uploaded data is correct, and data integrity during download is ensured by downloading each part and checking its ETag. Part ETag values are not explicitly documented and are implicitly treated as an MD5 hash.

We ran into a very strange problem some time ago: running the `md5sum` command twice (about one hour between the two runs) on the same file in the s3fs mount directory gave different results; in fact, the first result was wrong. We didn't find any relevant logs or other clues, which is confusing. Maybe it is an issue with an older version of s3fs, but I suspect it is related to the integrity of the downloaded data.
According to the AWS documentation (Checking object integrity), additional checksum support is provided in SDK v2, which is more explicit than the ETag.
There is also a tricky problem: files uploaded through s3fs and then downloaded elsewhere need their integrity verified. We don't want to do calculations against the implicit ETag, but files uploaded without AWS S3's checksum feature cannot be verified with it. So, is there a plan to support that?
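For reference, the implicit ETag calculation mentioned above can be reproduced locally if the part size used at upload is known. The sketch below is only an illustration and assumes a plain multipart upload (no SSE-KMS) with a fixed part size; 10 MB matches s3fs's default `multipart_size`, but the value actually used for a given object is an assumption.

```sh
#!/bin/sh
# Sketch: recompute a multipart-style ETag (MD5 of the per-part MD5 digests,
# plus "-N") for a local copy of a file. PART_MB must match the part size
# actually used when the object was uploaded.
FILE="$1"
PART_MB="${2:-10}"

TMP=$(mktemp -d)
split -b $((PART_MB * 1024 * 1024)) "$FILE" "$TMP/part_"

NPARTS=$(ls "$TMP" | wc -l | tr -d ' ')
ETAG=$(for p in "$TMP"/part_*; do
         md5sum "$p" | cut -d' ' -f1              # hex MD5 of each part
       done | xxd -r -p | md5sum | cut -d' ' -f1) # MD5 of the concatenated digests

echo "expected multipart ETag: ${ETAG}-${NPARTS}"
rm -rf "$TMP"
```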
@gaul commented on GitHub (Jun 29, 2022):
`enable_content_md5` is kind of an older option since `sigv4` implies SHA-256 integrity, unless `enable_unsigned_payload` is enabled. Further, HTTPS has some content integrity, so transit is an unlikely source for data corruption.

More likely is a bug in s3fs itself. Please make sure you use the latest version 1.91, which fixes several data corruption issues. Note that some Linux distributions have very old versions, e.g., Ubuntu 18.04.
Next, please share which flags you used to mount s3fs, the exact operations where you observed data corruption, and other details such as file sizes. Also please specify which other clients might be writing to the same S3 bucket.
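One way to capture those details is to remount in the foreground with s3fs's debug options and reproduce the failing `md5sum`; the log then shows the exact requests around the bad read. The bucket name, mount point, and log path below are placeholders; `dbglevel` and `curldbg` are standard s3fs options.

```sh
# Sketch: remount with verbose logging to capture the operations around a bad read.
s3fs ${bucketName} /mnt/s3 -f \
    -o passwd_file=${HOME}/.passwd-s3fs \
    -o dbglevel=info \
    -o curldbg > /tmp/s3fs-debug.log 2>&1 &

# wait for the mount to come up, then reproduce the check while logging is active
md5sum /mnt/s3/key-001
```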
@vicasong commented on GitHub (Jun 30, 2022):
The problem is very strange. We use a shell script that outputs the file's metadata and its calculated md5 value. The md5 output was different for the two calls, but other information, such as the file size, was the same. The last modification time of these files in S3 was June of last year. This problem suddenly appeared at the beginning of this month, and it does not seem easy to reproduce.
I have one more question; I know very little about this area. Can we really rely on signatures and the HTTPS protocol for data integrity? Earlier this year, AWS rolled out checksum support for S3 objects; does that mean it is necessary?
s3fs mount has no other options:
```sh
s3fs ${bucketName} /mnt/s3 -o passwd_file=${HOME}/.passwd-s3fs
```
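For what it's worth, a check script along the lines described above might look like the following; the file path and log location are placeholders, and the exact metadata recorded is an assumption.

```sh
#!/bin/sh
# Sketch: record the file's metadata and md5 so that two runs
# (e.g. an hour apart) can be compared afterwards.
FILE=/mnt/s3/key-001
LOG=/var/log/s3fs-md5-check.log

{
  date -u +%FT%TZ
  stat -c 'size=%s mtime=%y' "$FILE"   # GNU stat format
  md5sum "$FILE"
} >> "$LOG"
```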
@ggtakec commented on GitHub (Jul 3, 2022):
@vicasong
I'm sorry if my comment is irrelevant.
I would like to confirm a little about your comparison of md5 values.
I think that you had this problem with a large file that took an hour to upload and download.
And you run s3fs with only the minimum required options.
In this case, s3fs will do the multipart upload with the default part size.
For example, I inspected a 26214405 byte file as follows:
When uploading, two parts were uploaded, each with its own md5 value.
Then, when I downloaded this file (e.g., using the cat command), it was divided into three parts and downloaded.
You'll notice that uploads and downloads use different splits.
(The size of the final part is adjusted during uploading.)
I want to know which md5 values you compared.
If you compared the md5 of each part, did those parts cover the same byte ranges?
Thanks in advance for your help.
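As a side note, if the credentials allow it, the way an object was actually split at upload time can be read back from S3 itself via GetObjectAttributes; the bucket and key below are placeholders, and part-level checksum details are only returned for objects uploaded with one of the newer checksum algorithms.

```sh
# Sketch: ask S3 how the object was split when it was uploaded
# (ETag, size, and multipart part information where available).
aws s3api get-object-attributes \
    --bucket my-bucket \
    --key key-001 \
    --object-attributes ETag Checksum ObjectParts ObjectSize
```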
@vicasong commented on GitHub (Jul 4, 2022):
Maybe my description wasn't detailed enough for you to understand it.
For example:

s3fs is mounted at /mnt/s3, and there is a file named `key-001` in the bucket. So I just run `wc -c /mnt/s3/key-001` to get the file size and `md5sum /mnt/s3/key-001` to get the md5 of the whole file.

I'm not sure how s3fs handles these; the md5sum command reads the contents of the entire file from beginning to end for its calculation. That shouldn't be affected by partial downloads, right?
@ggtakec commented on GitHub (Jul 4, 2022):
@vicasong Thank you for the detailed explanation. (I could understand that.)
When the md5sum command runs, s3fs downloads the file.
To be precise, the md5sum command passes the read system call through the system to s3fs via FUSE.
When reading, s3fs fetches about 10 MB at a time if the data at the requested start position has not already been loaded.
The downloaded data is stored on the local disk (a temporary file or cache file).
If it is a bug in s3fs, I can think of a few possibilities.

An additional question:
Do you have enough free local disk space for the downloaded files?
There is a slight difference in the behavior of s3fs between when the local disk has enough space and when it is nearly full.
(It has special logic to save local disk space.)
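If it helps to isolate this, the cache location and the free-space threshold can both be made explicit when mounting, so it is clear whether the space-saving code path was active; the directory and size below are only examples (`use_cache` and `ensure_diskfree` are standard s3fs options).

```sh
# Sketch: mount with an explicit local cache directory and a fixed
# free-space floor (in MB) for the cache.
s3fs ${bucketName} /mnt/s3 \
    -o passwd_file=${HOME}/.passwd-s3fs \
    -o use_cache=/var/cache/s3fs \
    -o ensure_diskfree=10240
```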
@vicasong commented on GitHub (Jul 5, 2022):
Disk space is sufficient; we always keep an eye on this.
We feel that s3fs cannot accurately guarantee that the data downloaded to disk for each file is correct, so we will consider additional safeguards. For this reason, we plan to change our business process around these files.
Thanks for your answer.
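One simple safeguard of the kind mentioned above is to compare the digest of the file read through the mount against the same object fetched directly from S3; the bucket and key below are placeholders, and this assumes the AWS CLI is configured with access to the same bucket.

```sh
# Sketch: cross-check the mount against a direct download of the same object.
MOUNT_MD5=$(md5sum /mnt/s3/key-001 | cut -d' ' -f1)
DIRECT_MD5=$(aws s3 cp s3://my-bucket/key-001 - | md5sum | cut -d' ' -f1)

if [ "$MOUNT_MD5" = "$DIRECT_MD5" ]; then
    echo "OK: $MOUNT_MD5"
else
    echo "MISMATCH: mount=$MOUNT_MD5 direct=$DIRECT_MD5" >&2
fi
```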
@ggtakec commented on GitHub (Jul 5, 2022):
@vicasong
Since your disk space is sufficient, I understand that the data inconsistency may have occurred in the normal logic rather than the special (space-saving) logic.
We will consider whether s3fs can check the integrity of the downloaded file.
If you find a similar bug again, please add additional information to this issue.
Thanks in advance for your help.