[GH-ISSUE #973] slow writes with PHP fputcsv #542

Closed
opened 2026-03-04 01:46:31 +03:00 by kerem · 13 comments

Originally created by @adonig on GitHub (Mar 9, 2019).
Original GitHub issue: https://github.com/s3fs-fuse/s3fs-fuse/issues/973

Additional Information

I use s3fs every month to mount a bucket and then transfer around 1.6 GB of DynamoDB records into a CSV file in the bucket. Today I tried it twice with the current master, and after around one hour the writing performance became so poor that it wasn't possible to finish the dump. I checked out commit e9297f39ea because that was the version I used last month, and with it everything went through as expected.

Version of s3fs being used (s3fs --version)

current master (0d43d070cc)

Version of fuse being used (pkg-config --modversion fuse, rpm -qi fuse, dpkg -s fuse)

2.9.4

Kernel information (uname -r)

3.14.48-33.39.amzn1.x86_64

GNU/Linux Distribution, if applicable (cat /etc/os-release)

NAME="Amazon Linux AMI"
VERSION="2015.03"
ID="amzn"
ID_LIKE="rhel fedora"
VERSION_ID="2015.03"
PRETTY_NAME="Amazon Linux AMI 2015.03"
ANSI_COLOR="0;33"
CPE_NAME="cpe:/o:amazon:linux:2015.03:ga"
HOME_URL="http://aws.amazon.com/amazon-linux-ami/"

/etc/fstab entry, if applicable

mybucket /mnt/mybucket fuse.s3fs _netdev,allow_other,default_acl=public-read-write,uid=14,gid=50,umask=0002 0 0

kerem closed this issue 2026-03-04 01:46:32 +03:00

@gaul commented on GitHub (Mar 10, 2019):

Could you be more specific about the terrible performance, e.g., MB/s? You might want to test with some smaller files.


@adonig commented on GitHub (Mar 10, 2019):

I don't have MB/s figures. Normally I dump around 100,000 records every 13 s. In both failing cases, after roughly an hour the throughput dropped to 100,000 records in 1,900 s and then 100,000 in 9,200 s, at which point I stopped it.

I can make another run tomorrow. Is there anything I can do to make it as informative as possible for you? Enabling debug output or something similar?


@gaul commented on GitHub (Mar 11, 2019):

I benchmarked writes on an EC2 m5a.large in us-east: master (895d5006bb) is 50% faster than 1.84 (06032aa661):

```
$ dd if=/dev/zero of=mnt/2gb count=2048 bs=1M
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 59.9399 s, 35.8 MB/s

$ dd if=/dev/zero of=mnt/2gb count=2048 bs=1M
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB) copied, 39.696 s, 54.1 MB/s
```

@adonig commented on GitHub (Mar 11, 2019):

I'll try to determine the exact commit that introduced my problem.


@adonig commented on GitHub (Mar 11, 2019):

I did a phone-book (binary) search to find the first commit causing the problem: it's commit 10d9f75366. I have no idea what curl has to do with it. I'm just writing DynamoDB records into a CSV file in a bucket, but I can reliably reproduce the problem with that commit. I'm willing to help further in finding out what exactly is causing the problem, but I have no idea what I can do to help you :-)


@adonig commented on GitHub (Mar 11, 2019):

Looking at the CloudWatch monitoring for DynamoDB, I noticed the working commits have [a different load pattern](https://imgur.com/a/CneRfvm). Before the dump finishes, the load normally [goes down a bit](https://imgur.com/a/oL5zfkJ), then up and down again. With the commit causing the problem, [the load goes down](https://imgur.com/a/uNwqukO) and stays down, while fputcsv blocks for longer and [s3fs uses around 100% CPU](https://imgur.com/a/H7MmpFT). Normally fputcsv gets called 100,000 times in around 13 seconds. When the blocking happens, it gets called 100,000 times in ~5,500 seconds, so a call to fputcsv blocks for roughly 55 ms instead of about a tenth of a millisecond.

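As a quick sanity check on those figures (my arithmetic, not part of the original report), the implied per-call latency works out as follows:

```shell
# Per-call fputcsv latency implied by 100,000 calls per interval.
awk 'BEGIN {
    printf "normal:   %.2f ms/call\n", 13.0   / 100000 * 1000
    printf "degraded: %.2f ms/call\n", 5500.0 / 100000 * 1000
}'
# normal:   0.13 ms/call
# degraded: 55.00 ms/call
```

That is roughly a 400x slowdown per write, consistent with a per-call cost (such as a flush-triggered upload) rather than a bandwidth limit.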

@gaul commented on GitHub (Apr 9, 2019):

Can you share a test case that reproduces this behavior? Are you copying the files, using rsync, etc.?


@adonig commented on GitHub (Apr 9, 2019):

This is roughly what we do:

```php
$request = [
    'TableName' => 'a_dynamodb_table',
    'FilterExpression' => 'begins_with(#ts, :prefix)',
    'ExpressionAttributeNames' => [
        '#ts' => 'timestamp',
    ],
    'ExpressionAttributeValues' => [
        ':prefix' => ['S' => '2019-02'],  // i.e. all items from February (around 16 million)
    ],
];

/* Do a paginated Scan and write the items to a CSV file in an S3 bucket mounted via s3fs. */
do {
    if (isset($response) && isset($response['LastEvaluatedKey'])) {
        $request['ExclusiveStartKey'] = $response['LastEvaluatedKey'];
    }

    $response = $dynamodb->scan($request);

    foreach ($response['Items'] as $item) {
        $fields = [ /* some fields from the item */ ];
        fputcsv($file, $fields);  // Starts to block.
    }
} while (isset($response['LastEvaluatedKey']));
```


@gaul commented on GitHub (Apr 9, 2019):

Could you make this self-contained in some way, e.g., emit bogus data? I don't know PHP and cannot modify this. Please share usage instructions too so I can reproduce this locally.


@adonig commented on GitHub (Apr 9, 2019):

I think if the size of the fields and some other factors like timing are not important, it might be possible to reproduce the bug using this script (save as `foo.php`, change `$filename`, and run it via `php foo.php`). I will also try it later, when I find the time.

```php
<?php

$filename = "/path/to/mounted/bucket/filename.csv";

$file = fopen($filename, 'w');
for ($i = 0; $i < 16000000; $i++) {
    fputcsv($file, [
        'field1' => $i,
        'field2' => $i + 1,
        'field3' => $i + 2,
        'field4' => $i + 3,
        'field5' => $i + 4,
        'field6' => $i + 5,
        'field7' => $i + 6,
    ]);
}
fclose($file);
```

It just creates a CSV file in a bucket and then adds 16 million entries.


@ggtakec commented on GitHub (Apr 9, 2019):

@adonig Thanks for the sample code.
I don't know the internals of the fputcsv function, but does it call fflush every time it writes one line (or repeatedly open, write, and close)?
If this function repeats the flush operation, s3fs will try to upload the file on every flush, which hurts performance.

If you can try it, please have fputcsv write to a local file instead.
After the fputcsv loop completes, copy the local file to the s3fs-mounted directory.
If performance improves this way, I think it is necessary to prevent fputcsv from flushing to s3fs continuously.

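A minimal shell sketch of that workaround (all paths are placeholders, and the loop merely stands in for the PHP fputcsv loop):

```shell
# Build the CSV on local disk first, then copy the finished file to the
# mount in a single pass. MOUNT is a stand-in for /mnt/mybucket here.
LOCAL=$(mktemp)
MOUNT=$(mktemp -d)

# Stand-in for the fputcsv loop: append rows to the local file.
i=0
while [ "$i" -lt 1000 ]; do
    echo "$i,$((i + 1)),$((i + 2))" >> "$LOCAL"
    i=$((i + 1))
done

# One sequential copy gives s3fs a single large write instead of a
# potential flush-and-reupload per line.
cp "$LOCAL" "$MOUNT/dump.csv"
wc -l < "$MOUNT/dump.csv"

rm -rf "$LOCAL" "$MOUNT"
```

The trade-off is needing enough local disk for the full dump (~1.6 GB here) before the copy starts.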

@gaul commented on GitHub (Feb 3, 2020):

If your application calls `fsync` excessively, it's possible that [libeatmydata](https://github.com/stewartsmith/libeatmydata) could work around this. Be careful, though, since this removes durability guarantees.

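A sketch of how that workaround would typically be invoked (this assumes the libeatmydata package, which ships the `eatmydata` wrapper, is installed; `foo.php` is the repro script from earlier in the thread):

```shell
# eatmydata preloads a shim that turns fsync()/fdatasync() into no-ops,
# so frequent flushes no longer each force an s3fs re-upload.
if command -v eatmydata >/dev/null 2>&1; then
    echo "run: eatmydata php foo.php"
else
    echo "eatmydata not installed (e.g. 'apt-get install eatmydata' on Debian/Ubuntu)"
fi
```

The same preload can be done by hand with `LD_PRELOAD` pointing at the libeatmydata shared library (its path varies by distribution).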

@gaul commented on GitHub (Jul 26, 2020):

Please test with the suggested workaround and reopen if the symptoms persist.
