mirror of
https://github.com/s3fs-fuse/s3fs-fuse.git
synced 2026-04-25 13:26:00 +03:00
[GH-ISSUE #973] slow writes with PHP fputcsv #542
Originally created by @adonig on GitHub (Mar 9, 2019).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/973
Additional Information
I use s3fs every month to mount a bucket and then transfer around 1.6GB of DynamoDB records into a CSV file in the bucket. Today I tried it twice with the current master, and after around one hour the write performance became so bad that it wasn't possible to finish the dump. I did a checkout of commit e9297f39ea because that was the version I used last month. Thereafter everything went through as expected.

Version of s3fs being used (s3fs --version)

current master (0d43d070cc)

Version of fuse being used (pkg-config --modversion fuse, rpm -qi fuse, dpkg -s fuse)
2.9.4
Kernel information (uname -r)
3.14.48-33.39.amzn1.x86_64
GNU/Linux Distribution, if applicable (cat /etc/os-release)
NAME="Amazon Linux AMI"
VERSION="2015.03"
ID="amzn"
ID_LIKE="rhel fedora"
VERSION_ID="2015.03"
PRETTY_NAME="Amazon Linux AMI 2015.03"
ANSI_COLOR="0;33"
CPE_NAME="cpe:/o:amazon:linux:2015.03:ga"
HOME_URL="http://aws.amazon.com/amazon-linux-ami/"
/etc/fstab entry, if applicable
mybucket /mnt/mybucket fuse.s3fs _netdev,allow_other,default_acl=public-read-write,uid=14,gid=50,umask=0002 0 0
@gaul commented on GitHub (Mar 10, 2019):
Could you be more specific about terrible performance, e.g., MB/s? You might want to try some smaller files to test.
@adonig commented on GitHub (Mar 10, 2019):
I don't have MB/s figures. Normally I dump around 100,000 records every 13 s. In both failing cases, after roughly an hour, the throughput dropped to 100,000 records in 1,900 s and then to 100,000 in 9,200 s, at which point I stopped it.
I can make another run tomorrow. Is there anything I can do to make it as informative as possible for you? Enabling debug output or something like that?
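To put the reported rates in per-call terms (the record counts and durations come from the comment above; the rest is just arithmetic):

```python
# Convert the reported dump rates into average time per record written.
# Figures come from the comment above: batches of 100,000 records.
RECORDS = 100_000

def ms_per_call(batch_seconds: float, n: int = RECORDS) -> float:
    """Average milliseconds spent per record written."""
    return batch_seconds / n * 1000

normal = ms_per_call(13)     # healthy rate: 100,000 records in 13 s
slow = ms_per_call(1_900)    # first degradation: 100,000 in 1,900 s
worst = ms_per_call(9_200)   # rate just before the dump was stopped

print(f"normal: {normal:.2f} ms/call")  # ~0.13 ms
print(f"slow:   {slow:.0f} ms/call")    # ~19 ms, ~150x slower
print(f"worst:  {worst:.0f} ms/call")   # ~92 ms, ~700x slower
```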
@gaul commented on GitHub (Mar 11, 2019):
I benchmarked master (895d5006bb); writes are 50% faster than 1.84 (06032aa661) on an EC2 m5a.large in us-east:

@adonig commented on GitHub (Mar 11, 2019):
I'll try to determine the exact commit that introduced my problem.
@adonig commented on GitHub (Mar 11, 2019):
I did a phone-book search to find the first commit that causes the problem. It's commit 10d9f75366. I have no idea what curl has to do with it. I'm just writing DynamoDB records into a CSV file in a bucket, but I'm able to reliably reproduce the problem with that commit. I'm willing to help further with finding out what exactly is causing the problem, but I have no idea what I can do to help you :-)

@adonig commented on GitHub (Mar 11, 2019):
Looking at the CloudWatch monitoring for DynamoDB, I noticed the working commits have a different load pattern. Before the dump finishes, the load normally dips, then goes up and down again. With the commit that causes the problem, the load goes down and stays down, while the fputcsv function blocks for longer and s3fs uses around 100% CPU. Normally fputcsv gets called 100,000 times in around 13 seconds. When the blocking happens, it gets called 100,000 times in ~5,500 seconds, so a call to fputcsv blocks for roughly 60 ms instead of a tenth of a millisecond.
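The "phone-book search" described two comments up is a binary search over the commit range between the last known-good and first known-bad builds — the procedure `git bisect` automates. A minimal sketch of the idea in Python, using a mostly hypothetical ordered commit list and an `is_bad` predicate that stands in for "build this commit and rerun the dump":

```python
def first_bad(commits, is_bad):
    """Binary search: return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest-to-newest and that once a commit
    is bad, every later commit is bad too (a clean regression).
    """
    lo, hi = 0, len(commits) - 1  # commits[hi] is a known-bad commit
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid      # regression is at mid or earlier
        else:
            lo = mid + 1  # regression is strictly after mid
    return commits[lo]

# Hypothetical history: only the commit ids named in this thread are real;
# a real run would rebuild s3fs and rerun the dump at each step.
history = ["aaaa111", "bbbb222", "10d9f75366", "dddd444", "0d43d070cc"]
bad = set(history[2:])  # pretend the regression landed at 10d9f75366
print(first_bad(history, lambda c: c in bad))  # -> 10d9f75366
```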
@gaul commented on GitHub (Apr 9, 2019):
Can you share a test case that reproduces this behavior? Are you copying the files, using rsync, etc.?
@adonig commented on GitHub (Apr 9, 2019):
This is roughly what we do:
`
`
@gaul commented on GitHub (Apr 9, 2019):
Could you make this self-contained in some way, e.g., emit bogus data? I don't know PHP and cannot modify this. Please share usage instructions too so I can reproduce this locally.
@adonig commented on GitHub (Apr 9, 2019):
I think if the size of the fields and some other factors like timing are not important, it might be possible to reproduce the bug using this script (save it as foo.php, change $filename, and run it via php foo.php). I will also try it later, when I find the time. It just creates a CSV file in a bucket and then adds 16 million entries.
@ggtakec commented on GitHub (Apr 9, 2019):
@adonig Thanks for the sample code.
I do not know the internals of the fputcsv function, but does it call fflush every time it writes one line?
(Or does it repeatedly open, write, close?)
If the function repeats the flush operation, s3fs will try to upload the file on every flush.
This affects performance.
If you can, please try writing the fputcsv output to a local file instead.
After the fputcsv loop is complete, copy the local file to the s3fs-mounted directory.
If performance improves this way, I think it is necessary to prevent the fputcsv function from flushing continuously.
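The write-locally-then-copy workaround suggested above can be sketched as follows (Python's csv.writer stands in for PHP's fputcsv, and the destination path is a made-up example of an s3fs mount point):

```python
import csv
import shutil
import tempfile
from pathlib import Path

def dump_csv_locally_then_copy(rows, dest_dir, name="dump.csv"):
    """Write all rows to a local temp file, then copy it once to dest_dir.

    Writing locally means per-row flushes never reach s3fs; the mount
    sees a single sequential upload when the finished file is copied.
    """
    with tempfile.NamedTemporaryFile(
        "w", newline="", suffix=".csv", delete=False
    ) as tmp:
        writer = csv.writer(tmp)
        for row in rows:
            writer.writerow(row)  # any flushing stays on local disk
        local_path = Path(tmp.name)
    dest = Path(dest_dir) / name
    shutil.copy(local_path, dest)  # one copy through the s3fs mount
    local_path.unlink()            # clean up the temp file
    return dest

# Example usage; /mnt/mybucket is a hypothetical s3fs mount point:
# dump_csv_locally_then_copy(record_iterator, "/mnt/mybucket")
```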
@gaul commented on GitHub (Feb 3, 2020):
If your application calls fsync excessively, it's possible that libeatmydata could work around this. Be careful, though, since this can remove durability guarantees.

@gaul commented on GitHub (Jul 26, 2020):
Please test with the suggested workaround and reopen if the symptoms persist.