[GH-ISSUE #808] Random write behavior #466

Closed
opened 2026-03-04 01:45:51 +03:00 by kerem · 10 comments

Originally created by @kunallillaney on GitHub (Aug 8, 2018).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/808

I had a few questions about how s3fs handles random writes, since I am seeing some weird network stats (given the disclaimer in the README and the subsequent post #607).

Background:

  • Setup: Existing file on S3 - 100MB. I am using fio for random writes on this and monitoring the network I/O using slurm.
  • From my understanding, s3fs prefetches data into the cache in 40MB blocks (multipart block size × number of threads) when a region is read. This is verified by slurm and by the sparse file in the cache.
  • However, when I perform random writes through s3fs, say changing a single byte, no data is fetched into the cache or shown by slurm (double-checked with /proc/net/dev), but a full file's worth of data (100MB in this case) does get transmitted to S3. Here s3fs is not overwriting the other parts of the file with garbage data but correctly updating the single byte.

Question:

  • In the random write case, my assumption was that s3fs did a multipart download, merged the changes, and did a multipart upload. Is this not the case? Does it instead perform a multipart copy for the unchanged regions (similar to the operation for rename)? The network stats do not support that either, since it should then have transmitted only 40MB of data and copied the other 60MB. Am I missing something?

Thanks.
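To make the prefetch arithmetic above concrete, here is a small sketch. The 10 MB part size and 4 threads are assumptions taken from my setup, not values s3fs guarantees; they combine into the 40MB block observed via slurm:

```python
# Sketch of the prefetch-window arithmetic described above.
# ASSUMPTIONS: 10 MB multipart part size and 4 download threads,
# giving the 40 MB prefetch block seen in the network stats.
PART_SIZE = 10 * 2**20       # assumed multipart part size
THREADS = 4                  # assumed parallel threads
BLOCK = PART_SIZE * THREADS  # 40 MB prefetch block

def prefetch_range(offset):
    """Return the (start, end) byte range a read at `offset` would pull into the cache."""
    start = (offset // BLOCK) * BLOCK
    return start, start + BLOCK

print(prefetch_range(0))           # the first 40 MB block
print(prefetch_range(50 * 2**20))  # a read at 50 MB falls in the second block
```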

kerem 2026-03-04 01:45:51 +03:00
  • closed this issue
  • added the dataloss label

@kunallillaney commented on GitHub (Aug 8, 2018):

And a follow-on question: if it does indeed use multipart copy, then how does it get around the [5GB target file size limit](https://docs.aws.amazon.com/AmazonS3/latest/API/mpUploadUploadPartCopy.html) for range reads placed by S3?


@gaul commented on GitHub (Sep 11, 2018):

Which options do you mount s3fs with? Have you enabled the data cache? This could explain the read behavior you experience.

For writes, s3fs does issue multipart uploads and copies data for the unchanged regions. Hence you should expect a 1-byte random write to a 100 MB file to copy nine 10 MB chunks and upload one 10 MB chunk. If the cache does not contain this range, it will have to download it before uploading the whole part.
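To illustrate the part bookkeeping described in the comment above, here is a rough sketch. The helper is illustrative, not s3fs's actual code; in S3 API terms, each "copy" part would be a server-side UploadPartCopy call (no data transmitted) and each "upload" part an UploadPart carrying the modified bytes:

```python
PART_SIZE = 10 * 2**20  # 10 MB parts, as in the example in the comment above

def plan_parts(file_size, dirty_offset, dirty_len=1):
    """Classify each part of a multipart upload as server-side 'copy'
    (unchanged region) or 'upload' (overlaps the written bytes)."""
    plan = []
    for part_start in range(0, file_size, PART_SIZE):
        part_end = min(part_start + PART_SIZE, file_size)
        dirty = dirty_offset < part_end and dirty_offset + dirty_len > part_start
        plan.append("upload" if dirty else "copy")
    return plan

# A 1-byte write at offset 0 of a 100 MB file: 9 copies, 1 upload.
plan = plan_parts(100 * 2**20, dirty_offset=0)
print(plan.count("copy"), plan.count("upload"))
```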


@kunallillaney commented on GitHub (Sep 11, 2018):

The data cache is enabled. I tried this by opening the file in "r+" mode to update. In this case, the file ends up with correct data in the updated region and zeros everywhere else (which sounds like a bug to me). I don't think s3fs copies data for the other 9 chunks. If it does, can you please explain how it gets around the 5GB file size limit I mention above? S3FS needs to do range reads on the remaining 90MB of data to do so but S3 does not allow it since the target file is only 100MB.
I believe this was verified by @orozery as well.


@gaul commented on GitHub (Sep 11, 2018):

If you observe data corruption, with unexpected zeros, this is a serious issue and I will take a look at it. Can you minimize a test case with exact steps to reproduce this behavior?

s3fs should copy ranges which are not updated on the server. S3 parts, either uploaded or copied, are limited to >= 5 MB and <= 5 GB but the total MPU object size can be up to 5 TB.
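As a sanity check on the limits quoted above: S3 also allows up to 10,000 parts per multipart upload (a documented S3 limit, not stated in this thread), so the maximum MPU object size comfortably exceeds the 5 TB object limit:

```python
# Part sizes are bounded to 5 MB - 5 GB, and S3 allows up to
# 10,000 parts per multipart upload, so the theoretical MPU
# ceiling (~48.8 TiB) is well above the 5 TB object limit.
MIN_PART, MAX_PART = 5 * 2**20, 5 * 2**30
MAX_PARTS = 10_000
max_mpu_bytes = MAX_PARTS * MAX_PART
print(max_mpu_bytes >= 5 * 2**40)  # True
```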


@kunallillaney commented on GitHub (Sep 11, 2018):

@gaul You are correct in saying that S3 parts are limited to >= 5 MB and <= 5 GB, but this is the case only for uploaded ones. They make explicit mention of this in their [boto3 documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.upload_part_copy) under the CopySourceRange parameter, where it is stated "_You can copy a range only if the source object is greater than 5 GB_". I don't think this is limited to boto3, since I tried this over the HTTP API as well and it threw an error.

I will post steps or a small script to recreate the issue tomorrow morning.


@gaul commented on GitHub (Sep 17, 2018):

The boto3 documentation for part size is incorrect. This phrase seems to originate from some AWS code generation whose source I cannot find, so I opened aws/aws-cli#3577, which has the same issue, in the hope they can direct me to the proper location.

Did you reproduce your data loss issue? Please provide instructions so I can investigate; otherwise, can you close the issue?


@kunallillaney commented on GitHub (Sep 17, 2018):

@gaul Sorry I have been busy with a paper and have been unable to look at this. When I had looked at it last, I was able to reproduce it multiple times and the issue was also confirmed by another contributor to the repository (mentioned above). I am positive that this issue exists and I will post the steps sometime next week.


@kunallillaney commented on GitHub (Oct 31, 2018):

@gaul Sorry, I was busy with a paper and then traveling for a conference. Here are the steps to reproduce this bug:

  1. Mount s3fs with the cache enabled option.
  2. Copy in a file greater than number of threads × multipart size. I chose a file of about 200MB for a 4-thread, 10 MB part size setup. Please do not create this file via dd, since that creates a file filled with \x00.
  3. Unmount s3fs and clear the cache.
  4. Mount s3fs again with the same options as before.
  5. Open the file from C or Python in r+ mode. (I checked this with both languages and the issue occurs in both.)
  6. Write to the file at the start or at some offset and close the file.
  7. Read the file again and check offsets other than the one you wrote to; you will see \x00.
  8. Alternatively, you can unmount the filesystem, clear the cache, and then repeat Step 6; the same issue occurs.

Please let me know if this is unclear or you have further questions.
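A minimal self-checking sketch of steps 5 through 7 above. The mount path in the usage line is a placeholder for a file on your s3fs mount; the file should not legitimately contain zero bytes (per step 2, do not create it with dd):

```python
def check_random_write(path, offset, payload):
    """Steps 5-7: write `payload` at `offset` in r+ mode, re-read the
    whole file, and return the offsets of any unexpected \x00 bytes
    outside the written region (empty list = no corruption observed)."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(payload)
    with open(path, "rb") as f:
        data = f.read()
    # The written region itself must read back correctly.
    assert data[offset:offset + len(payload)] == payload
    return [i for i, b in enumerate(data)
            if b == 0 and not (offset <= i < offset + len(payload))]

# Usage (path is a placeholder for a file on the s3fs mount):
# bad = check_random_write("/mnt/s3fs/testfile", 0, b"X")
# print(bad[:10])  # non-empty list reproduces the zeroed-data symptom
```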


@gaul commented on GitHub (Jan 23, 2019):

@kunallillaney Could you test again with master? We fixed a few zero data issues and #918 is the most similar to your symptom.


@gaul commented on GitHub (Jun 25, 2019):

Closing due to inactivity.
