mirror of
https://github.com/s3fs-fuse/s3fs-fuse.git
synced 2026-04-25 05:16:00 +03:00
[GH-ISSUE #566] Question about where to add new metadata features #320
Originally created by @colakong on GitHub (Apr 20, 2017).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/566
Hi everybody,
I'm looking at adding some features to s3fs for our users, to help them validate downloaded objects.
Adding extra checksum metadata
The first feature would add an MD5 checksum to the metadata of any uploaded object (separate from the ETag). This is meant to help users validate object downloads without needing to implement a client-side version of S3 multi-part checksums for some objects.
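The client-side validation this metadata would enable can be sketched as a streaming whole-object MD5 (a minimal illustration, not s3fs code; the function name is hypothetical):

```python
import hashlib

def file_md5(path, chunk_size=1024 * 1024):
    """Compute the MD5 of a whole file in streaming fashion,
    so large objects never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A user would compare this digest against the checksum stored in the object's metadata after downloading.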
It looks like that may be possible in create_file_object() in s3fs.cpp. Is that the right place for it?
Adding checksum directory structure
The second feature would add a read-only .s3_obj_chksum directory at the root of the mount-point, which has the same hierarchy as the mount-point except that the contents of files return the corresponding object's checksum.
Is this something that seems reasonable to do, given the current structure of the project? Do you know where a good place to add that feature might be?
Together the features would look like this:
@gaul commented on GitHub (Apr 20, 2017):
I strongly prefer to work with and improve the existing ETag-based checksums instead of adding new metadata that serves only a single user. I understand that multipart upload and range requests make this more complicated but it seems like you can solve your problem today by disabling MPU and exposing the ETag via extended attributes. Further HTTPS should ensure data in-flight while the ETag ensures data at-rest. What corruption vector are you worried about?
If you want to do programmatic things with the S3 protocol, perhaps you can write a middleware to S3Proxy?
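The approach suggested above (disable MPU and validate against the ETag) works because, for a simple non-multipart upload without SSE-KMS/SSE-C encryption, the ETag is the hex MD5 of the object body. A sketch of that check (hypothetical helper name):

```python
import hashlib

def etag_matches(body, etag):
    """Validate a downloaded object body against its S3 ETag.
    Only valid for simple (non-multipart, non-KMS-encrypted)
    uploads, where the ETag is the hex MD5 of the body."""
    # ETags are usually returned wrapped in double quotes.
    return hashlib.md5(body).hexdigest() == etag.strip('"')
```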
@colakong commented on GitHub (Apr 21, 2017):
Thanks for the response Andrew. Good point about the extended attributes 👍
I'm not sure if it'll be practical for us to disable MPU based on the object sizes we work with, but it's something to consider :)
The features were meant for bit-rot detection, where after an object was uploaded some portion of it was changed/corrupted in a way that wasn't detected by the underlying storage system being exposed through s3.
@gaul commented on GitHub (Apr 23, 2017):
How do you propose to calculate a single checksum over the entire object for a multi-part upload where parts may upload simultaneously or out of order?
The multipart ETag is actually a single-level Merkle tree hash of part hashes concatenated with the number of parts, e.g., ffffffffffffffffffffffffffffffff-31. If you used a known part size and if s3fs exposed the ETag, you could actually calculate this in your application.
However this should not be needed; most S3 storage systems scrub data behind the scenes and proactively repair it based on ETag.
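The multipart ETag construction described above can be reproduced client-side when the part size is known. A sketch, assuming `parts` holds the raw bytes of each uploaded part in order (hypothetical helper, not s3fs code):

```python
import hashlib

def multipart_etag(parts):
    """Reproduce S3's multipart ETag: the hex MD5 of the
    concatenated per-part MD5 digests, suffixed with
    '-<number of parts>'."""
    part_digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return "{}-{}".format(hashlib.md5(part_digests).hexdigest(), len(parts))
```

With a known, fixed part size an application can split a local file into the same parts and compare the result against the ETag s3fs exposes.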
@colakong commented on GitHub (Apr 23, 2017):
The checksum could be computed before a multi-part upload is started.
The extra checksum metadata was intended to help users validate object downloads without needing to implement the s3 multi-part checksums on the client side. I've provided an example of the multi-part checksum calculation, but the preference is to use a checksum for the entire object.
I agree that corruption isn't likely. Our users care very much about the integrity of their data, so the features' primary benefit is to help them feel more comfortable with our object storage system.
@gaul commented on GitHub (Apr 23, 2017):
s3fs supports extended attributes which turn into S3 object metadata. Thus you can implement these checksums with setfattr and getfattr today.
@colakong commented on GitHub (Apr 24, 2017):
Great, thank you Andrew :)