[GH-ISSUE #715] Cache file corruption #405

Closed
opened 2026-03-04 01:45:13 +03:00 by kerem · 10 comments

Originally created by @spectre683 on GitHub (Feb 6, 2018).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/715

Cache file corruption

We are experiencing an issue in which files become corrupt on the local machine. When this issue occurs the file at the S3FS mount point and the file in the S3FS cache become corrupt. The file on S3 is not affected. File corruption in this case means that the file size appears to be correct but the file is effectively empty and reading from it only returns 0x00 bytes. We periodically clear our cache with a cron job that executes a command something like this:

/usr/bin/find /tmp/<redacted bucket name>/ /tmp/.<redacted bucket name>.stat/ -type f -amin +5 -exec rm {} \; > /dev/null 2>&1

Corruption will occur if a cached file is removed from the cache directory but not from the stat directory and then the file is accessed from the mount point. For example, if we have a file called a_file.dat stored in a bucket called a_bucket then the steps to reproduce this problem would be like this:

  1. Observe that the file is OK:
$ stat /tmp/a_bucket/a_file.dat
  File: '/tmp/a_bucket/a_file.dat'
  Size: 6447104       Blocks: 12592      IO Block: 4096   regular file
Device: ca02h/51714d    Inode: 525618      Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2018-02-04 02:17:50.872290054 +0900
Modify: 2018-01-24 10:38:32.000000000 +0900
Change: 2018-02-04 02:17:50.868290094 +0900
 Birth: -
  2. Delete the file from the cache directory only:
rm /tmp/a_bucket/a_file.dat
  3. Check on the file:
$ stat /tmp/a_bucket/a_file.dat
stat: cannot stat '/tmp/a_bucket/a_file.dat': No such file or directory
  4. Access (read) the file from the mount point.

  5. Check on the file:

$ stat /tmp/a_bucket/a_file.dat
  File: '/tmp/a_bucket/a_file.dat'
  Size: 6447104       Blocks: 0          IO Block: 4096   regular file
Device: ca02h/51714d    Inode: 525607      Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2018-02-04 02:18:51.511672646 +0900
Modify: 2018-01-24 10:38:32.000000000 +0900
Change: 2018-02-04 02:18:51.511672646 +0900
 Birth: -

Note that the blocks are now 0. In this condition reading from the file will only produce 0x00 values. Deleting the file from the stat directory and re-reading it from the mount point will cause S3FS to reacquire it from S3 and thus replace the file in the cache directory with a valid file.
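Since the corruption window is a cache file removed without its matching stat entry, a cleanup that always deletes the two together avoids the stale-stat state. A minimal sketch in that spirit, assuming (as the cron command earlier implies) that the stat directory mirrors the cache tree; the function name and example paths are illustrative:

```python
import os
import time

def prune_pair(cache_dir, stat_dir, max_age_seconds=300):
    """Delete cache files not accessed within max_age_seconds,
    removing the matching stat entry in the same pass so the two
    directories never get out of sync."""
    now = time.time()
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            cache_path = os.path.join(root, name)
            if now - os.stat(cache_path).st_atime < max_age_seconds:
                continue  # still fresh; keep both files
            rel = os.path.relpath(cache_path, cache_dir)
            for victim in (cache_path, os.path.join(stat_dir, rel)):
                try:
                    os.remove(victim)
                except FileNotFoundError:
                    pass  # already absent; nothing to do

# Example paths matching the report (bucket name is illustrative):
# prune_pair("/tmp/a_bucket", "/tmp/.a_bucket.stat")
```

This mirrors the `-amin +5` policy of the cron command while closing the gap the report describes.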

Additional Information

The following information is very important in helping us to help you. Omitting these details may delay your support request or cause it to receive no attention at all.

Version of s3fs being used (s3fs --version)

$ s3fs --version
Amazon Simple Storage Service File System V1.80(commit:unknown) with GnuTLS(gcrypt)
Copyright (C) 2010 Randy Rizun <rrizun@gmail.com>
License GPL2: GNU GPL version 2 <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Version of fuse being used (pkg-config --modversion fuse)

$ dpkg -l fuse
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                         Version                     Architecture                Description
+++-============================================-===========================-===========================-=============================================================================================
ii  fuse                                         2.9.3-15+deb8u2             amd64                       Filesystem in Userspace

System information (uname -r)

$ uname -r
3.16.0-5-amd64

Distro (cat /etc/issue)

$ cat /etc/issue
Debian GNU/Linux 8 \n \l
$ cat /etc/debian_version 
8.10

s3fs command line used (if applicable)

N/A

/etc/fstab entry (if applicable):

s3fs#<redacted> /var/local/<redacted>/data fuse uid=<redacted>,gid=<redacted>,use_cache=/tmp,allow_other,_netdev,iam_role=<redacted> 0 0

s3fs syslog messages (grep s3fs /var/log/syslog, or s3fs outputs)

If you execute s3fs with the dbglevel or curldbg options, you can get detailed debug messages.

N/A

Details about issue

N/A

kerem 2026-03-04 01:45:13 +03:00
  • closed this issue
  • added the dataloss label

@MrMoose commented on GitHub (Aug 6, 2018):

I have the same problem. Gentoo, ~x64, Kernel 4.17.11

This has happened several times in the few days I have been using it, causing substantial file loss. Is there any workaround, or something I can do?


@CMCDragonkai commented on GitHub (Oct 4, 2018):

I'm not sure if this is related, but I have a Python script that works correctly on a local filesystem, writing CSV files with the native csv module and line buffering. Running it against the current release of s3fs 1.83 results in a bunch of null bytes written at the beginning of the CSV file. The beginning of the hexdump is:

00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
003a74d0  00 00 69 70 5f 32 31 34  30 35 37 36 65 62 38 63  |..ip_2140576eb8c|

It's a crazy number of null bytes. It also happens again later.

003a7b00  38 64 35 62 61 38 2e 6a  70 67 2c 32 36 38 31 0d  |8d5ba8.jpg,2681.|
003a7b10  0a 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
003a7b20  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
003a8910  00 00 00 00 00 00 00 00  00 00 00 00 00 69 70 5f  |.............ip_|

I have no idea what's causing this, but it literally corrupts the file.

I do want to mention that in order to check what was going on, I went into the cache directory while the file was being written, and used less to inspect the cached file that was being written, but that shouldn't cause these kinds of bugs.
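Damage like the hexdump above can be quantified by scanning for long runs of null bytes, mirroring what hexdump's `*` lines collapse. A sketch (the function name and the 512-byte threshold are arbitrary choices, not anything from s3fs):

```python
import re

def null_runs(data, min_run=512):
    """Return (offset, length) pairs for every run of 0x00 bytes
    at least min_run bytes long in the given bytes object."""
    pattern = re.compile(b"\x00{%d,}" % min_run)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]

# Usage on a suspect file:
# with open("out.csv", "rb") as f:
#     for off, length in null_runs(f.read()):
#         print(f"{length} null bytes at offset {off:#x}")
```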


@gaul commented on GitHub (Oct 4, 2018):

If you provide a way to reproduce these symptoms I can look into this.


@gaul commented on GitHub (Jan 26, 2019):

master includes a fix for #918 which may address your symptoms.


@ggtakec commented on GitHub (Mar 29, 2019):

@spectre683 @MrMoose @CMCDragonkai
The s3fs cache file is a sparse file. That is, s3fs uses ftruncate to adjust the size of the cache file to the size of the S3 object.
Cache files are created as sparse files and divided into fixed-size segments, which s3fs manages individually.
As needed, s3fs reads fixed-size ranges of the object from S3 and gradually fills in the cache file.
If the target segment is already cached, s3fs reads that range from the cache file instead.

Given this behavior, the block count of the cache file may indicate less than the size of the object.
This condition is not abnormal; it is a normal cache state.
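The sparse-file behavior described here is easy to observe outside s3fs: truncating an empty file to a target size yields the full st_size with few or no allocated blocks, and reads of the hole return 0x00. A sketch on a typical Linux filesystem (exact block counts vary by filesystem; the object size reuses the figure from the report):

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)

# Grow the file to the object size without writing any data,
# analogous to how s3fs sizes its cache file with ftruncate.
os.truncate(path, 6447104)

st = os.stat(path)
print(st.st_size)        # 6447104: size matches the object
print(st.st_blocks)      # typically 0: the file is one big hole

with open(path, "rb") as f:
    head = f.read(16)
print(head == b"\x00" * 16)  # reading the hole returns null bytes

os.remove(path)
```

This matches the `Blocks: 0` stat output in the report: a correct-looking size, but every unfetched byte reads back as 0x00.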


@spectre683 commented on GitHub (Mar 31, 2019):

@ggtakec I have not had time to study the internals of s3fs - thank you for the explanation regarding the s3fs caching mechanism. However, I do not think your explanation and the original bug report are inconsistent. In other words, I think it is possible that the s3fs caching mechanism is expected to work as you describe and there is also a bug that results in the behavior described in the original bug report.

All the original bug report is saying is that if the steps described in the report are followed I was able to consistently create a situation in which unexpected data was read from a file. In short, my expectation was that s3fs would return the actual file content and not a string of 0x00 bytes.

I have not had the chance to re-evaluate this with #918 so maybe it has been fixed. However, we are only using s3fs to read files and since the description for #918 describes data loss when writing to the middle of a file I'm not sure that this is relevant to the issue I described. When I get a chance I will try to re-evaluate the issue I describe and comment as to whether it has been resolved or not.


@gaul commented on GitHub (Apr 3, 2019):

@spectre683 We are very concerned about data loss issues; please reopen if you can reproduce this with the latest build. 1.84 and 1.85 include several data loss fixes, including #918, so retesting would really help us, especially if you can provide a reproducible test case. Please let us know the results either way!


@gaul commented on GitHub (Apr 9, 2019):

Closing due to inactivity. Please reopen if symptoms persist.


@spectre683 commented on GitHub (Nov 2, 2020):

@gaul With sincere apologies for how long it took to get to this, I can confirm that the issue no longer seems to exist. I looked at 1.80 (to confirm I could still reproduce the issue) and then at 1.84, 1.85, and 1.87 to confirm it no longer occurs. I don't have the environment in which the issue originally occurred anymore, but I had a VM lying around that was reasonably close and used that. Specifically, fuse = 2.9.3-15+deb8u3 and uname -r = 3.16.0-6-amd64. The distro information is the same as in the original report. I created a bucket with a file to test with and used a combination of md5sum and od on the first 16 bytes of the file for the access/read step (step 4). Nothing super exacting, but I'm reasonably confident that I recreated the issue I was experiencing and showed it does not reproduce in 1.84, 1.85, or 1.87. Apologies again for the delay in responding, and thank you for your assistance.
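The md5sum-plus-od check described above can be scripted into a single probe. A sketch (the function name and 16-byte probe size are illustrative, matching the od check in the comment):

```python
import hashlib

def probe_file(path, head_bytes=16):
    """Return (md5_hexdigest, head_is_all_null) for a file,
    combining the md5sum and od-style checks: the digest covers
    the whole file, the flag covers only the leading bytes."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        head = f.read(head_bytes)
        md5.update(head)
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest(), head != b"" and head == b"\x00" * len(head)
```

A healthy file yields its normal checksum and a False flag; a corrupted file as described in this issue yields a True flag and a checksum that differs from the object on S3.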


@gaul commented on GitHub (Nov 2, 2020):

Thanks for testing this!
