[GH-ISSUE #850] Incorrect etag value after large file upload #496

Closed
opened 2026-03-04 01:46:07 +03:00 by kerem · 5 comments
Owner

Originally created by @pawelmarkowski on GitHub (Nov 2, 2018).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/850

Additional Information

Incorrect ETag in Object Storage after a large file upload. If you upload a 1 GB file to Ceph Object Storage, you receive an incorrect ETag value.
file_hash_large = '7917e22de415e3943220abef484c8526'
file_size_large = 1040000000 [B]

Nevertheless, in the small-file case it looks fine. If I download the large file that was uploaded earlier by s3fs and calculate its MD5, it is correct, so the file is not broken, but s3fs does something wrong with the metadata.

Version of s3fs being used (s3fs --version)

Amazon Simple Storage Service File System V1.84(commit:f36ac3d) with OpenSSL

Version of fuse being used (pkg-config --modversion fuse, rpm -qi fuse, dpkg -s fuse)

2.9.7

Platform: Linux-4.15.0-38-generic-x86_64-with-LinuxMint-19-tara
Plugins: {'xdist': '1.24.0', 'repeat': '0.7.0', 'metadata': '1.7.0', 'html': '1.19.0', 'forked': '0.2'}

s3fs command line used, if applicable

We tried:
-o use_path_request_style -o umask=0222 -o allow_other -o enable_content_md5 -o dbglevel=debug -f
and:
-o use_path_request_style -o umask=0222 -o allow_other -o dbglevel=debug -f

Details about issue

AssertionError:
MD5 value: 7917e22de415e3943220abef484c8526
ETag:      1a5409445aa4a897571415c264201158-100

kerem closed this issue 2026-03-04 01:46:07 +03:00
Author
Owner

@gaul commented on GitHub (Nov 4, 2018):

Could you provide steps to reproduce this along with the source of the error, e.g., s3fs or ceph? I successfully ran dd if=/dev/zero of=gaulbackup/tmp bs=1M count=1000 status=progress with -o enable_content_md5 against AWS so I wonder if ceph does something different.

Author
Owner

@pawelmarkowski commented on GitHub (Nov 6, 2018):

We mount with:
/usr/local/bin/s3fs products /mnt/buck -o passwd_file=~/.passwd-s3fs -o url=https://endpointurlcom.com:8080 -o use_path_request_style -o umask=0222 -o allow_other -o enable_content_md5 -o dbglevel=debug -f -o uid=1000 -o gid=1000

  1. dd if=/dev/zero of=/mnt/buck/zero bs=1M count=1000 status=progress

  2. Run the test:

```
self = <test_read.TestReading object at 0x7f7b73c5fc88>, key = 'zero'

    @pytest.mark.slow
    def test_checksum(self, key):
        file_hash = hashlib.md5(file_as_bytes(
            open(os.path.join(c['MOUNTPOINT'], key), 'rb'))).hexdigest()
        object_hash = self.s3.head_object(Bucket=c['BUCKET'], Key=key)[
            'ETag'].strip('"')

>       assert file_hash == object_hash
E       AssertionError: assert 'e5c834fbdaa6...5eb9404eefdd4' == '210d5322e146a...dc36e067f-100'
E         - e5c834fbdaa6bfd8eac5eb9404eefdd4
E         + 210d5322e146ac65333e9a8dc36e067f-100

tests/test_read.py:84: AssertionError
```

Author
Owner

@pawelmarkowski commented on GitHub (Nov 6, 2018):

Logs look fine. I will send you an email @gaul

Author
Owner

@sqlbot commented on GitHub (Nov 6, 2018):

I believe the problem is with your expectations.

ETag == MD5 is an assumption that does not always hold.

Multipart uploads always result in multipart ETags in the form shown (the -100 means the upload was sent using 100 chunks, and the hex portion is the hex-encoded md5 of the result of concatenating the bytes of the binary MD5s of the 100 individual chunks, in order).

The ETag, in any event, is created by the storage service, not s3fs.

Using -o nomultipart disables multipart uploads and should result in the storage service assigning the ETag you expect. However, it limits your largest possible upload to 5 GB and will probably hurt performance, since you lose both the parallel uploads that multipart allows and any possibility of partial retry.
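For anyone who wants to verify this locally, the multipart ETag described above can be recomputed from the file itself. A minimal sketch, with one assumption: the part size must match whatever part size the uploader actually used (the 10 MB default below is a guess; s3fs's multipart part size is configurable and has varied across versions):

```python
import hashlib

def multipart_etag(path, part_size=10 * 1024 * 1024):
    """Recompute an S3-style multipart ETag for a local file.

    part_size is an assumption: it must equal the part size the
    uploader used, or the resulting digest will not match.
    """
    digests = []
    with open(path, "rb") as f:
        # Binary (not hex) MD5 of each part, in upload order.
        for chunk in iter(lambda: f.read(part_size), b""):
            digests.append(hashlib.md5(chunk).digest())
    if len(digests) == 1:
        # A single-part upload gets a plain MD5 ETag.
        return digests[0].hex()
    # Multipart: MD5 of the concatenated binary part digests,
    # followed by "-<number of parts>".
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return "{}-{}".format(combined, len(digests))
```

For the 1040000000-byte file above, a 100-part ETag suffix (`-100`) is consistent with roughly 10 MB parts, which is why the part size matters when reproducing the value.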

Author
Owner

@pawelmarkowski commented on GitHub (Nov 15, 2018):

Thanks @sqlbot for the explanation.
