[GH-ISSUE #156] Improve performance by not locking the database #91

Closed
opened 2026-02-27 10:15:41 +03:00 by kerem · 11 comments
Owner

Originally created by @mokurin000 on GitHub (Apr 6, 2025).
Original GitHub issue: https://github.com/matze/wastebin/issues/156

In my testing branch [human-readable-perf](https://github.com/mokurin000/wastebin/tree/human-readable-perf), I replaced the current `Arc<Mutex<Connection>>` pattern with a background thread that receives commands and sends results back over a oneshot channel.

This replacement increases average insertion performance by about 12.5%; see [report-.zip](https://github.com/user-attachments/files/19622038/report-.zip) generated by [`wastebin-bench`](https://github.com/mokurin000/wastebin-bench).

> TL;DR: from ~40k QPS to ~45k QPS, benchmarked on an `i7-12700H` running Arch Linux (mainline kernel)

```bash
cargo r -r -- --host http://127.0.0.1:8088 --run-time 30
```

The main cost is code readability (I have not found a less ugly design so far).

As it is just a performance test, only `insert` is implemented in the -perf branch.
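The branch's code is not shown in the issue, but the command-plus-reply-channel design can be sketched with the standard library only. Here `DbCommand`, `spawn_db_thread`, and the `HashMap` store are hypothetical stand-ins: the real branch owns a rusqlite `Connection` on the background thread and replies over an async oneshot channel rather than a second `mpsc` channel.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;

// Hypothetical command enum; each command carries its own reply channel,
// playing the role of the oneshot channel from the -perf branch.
enum DbCommand {
    Insert {
        key: String,
        value: String,
        reply: mpsc::Sender<Result<(), String>>,
    },
}

// Spawn the background thread that exclusively owns the "connection",
// so no Mutex is needed anywhere.
fn spawn_db_thread() -> mpsc::Sender<DbCommand> {
    let (tx, rx) = mpsc::channel::<DbCommand>();
    thread::spawn(move || {
        // Stand-in for the rusqlite `Connection` owned by this thread.
        let mut store: HashMap<String, String> = HashMap::new();
        for cmd in rx {
            match cmd {
                DbCommand::Insert { key, value, reply } => {
                    store.insert(key, value);
                    let _ = reply.send(Ok(()));
                }
            }
        }
    });
    tx
}

fn main() {
    let db = spawn_db_thread();
    let (reply_tx, reply_rx) = mpsc::channel();
    db.send(DbCommand::Insert {
        key: "paste1".into(),
        value: "hello".into(),
        reply: reply_tx,
    })
    .unwrap();
    assert!(reply_rx.recv().unwrap().is_ok());
    println!("insert acknowledged");
}
```

The readability cost mentioned above comes from this shape: every database operation needs a command variant, a send, and a reply await, instead of a plain method call through the mutex.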

kerem closed this issue 2026-02-27 10:15:41 +03:00

@mokurin000 commented on GitHub (Apr 6, 2025):

Alternatively, we could also replace `self.conn` with a `thread_local!`-stored `Connection` (we obviously do not need multiple database instances inside one wastebin server instance). The mutex would then no longer be needed, and the code could stay clean. Performance should be similar to the flume-based solution.
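A minimal sketch of that `thread_local!` idea, with a dummy `Connection` standing in for the real rusqlite one (all names here are illustrative, not from the branch):

```rust
use std::cell::RefCell;

// Stand-in for rusqlite::Connection.
struct Connection {
    inserted: usize,
}

impl Connection {
    fn open() -> Self {
        Connection { inserted: 0 }
    }
    fn insert(&mut self, _data: &str) {
        self.inserted += 1;
    }
}

thread_local! {
    // Each worker thread lazily opens its own Connection,
    // so no Mutex is needed and method calls stay plain.
    static CONN: RefCell<Connection> = RefCell::new(Connection::open());
}

fn insert(data: &str) -> usize {
    CONN.with(|c| {
        let mut conn = c.borrow_mut();
        conn.insert(data);
        conn.inserted
    })
}

fn main() {
    let first = insert("one");
    let second = insert("two");
    assert_eq!(second, first + 1);
    println!("inserted {second} rows on this thread");
}
```

Note that with SQLite this means one connection per worker thread; concurrent writers from multiple connections would likely need WAL mode to avoid lock contention at the database level.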


@matze commented on GitHub (Apr 6, 2025):

Can you re-run the test on a single-core machine/VM/container? Just to see how tokio's background threads deal with such a situation compared to a dedicated one.


@mokurin000 commented on GitHub (Apr 6, 2025):

> Can you re-run the test on a single-core machine/VM/container? Just to see how tokio's background threads deal with such a situation compared to a dedicated one.

I will do so when I am available.

BTW I found a command that can limit the process to a single CPU core:

```bash
numactl --physcpubind=+0 ./target/release/wastebin
```

@mokurin000 commented on GitHub (Apr 7, 2025):

> Can you re-run the test on a single-core machine/VM/container? Just to see how tokio's background threads deal with such a situation compared to a dedicated one.

In the single-core scenario, there is no difference between the kanal version and the `spawn_blocking` version: 20472 vs 20770 QPS, the latter being the `spawn_blocking` version.

In the multi-core scenario (14c20t), the kanal version is 11% faster: 45458 QPS vs 50274.89 QPS.

Thanks to @qaqland, with a generic `call()` method we can get the performance gain without big refactoring work.
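qaqland's actual `call()` method is not quoted in the thread, but one plausible shape, sketched here with blocking std channels (`Connection`, `Db`, and `DbJob` are illustrative names), ships a boxed closure to the database thread and waits for its result. This keeps the call sites looking like ordinary method calls instead of a large command enum:

```rust
use std::sync::mpsc;
use std::thread;

// Stand-in for the state owned by the database thread.
struct Connection {
    rows: Vec<String>,
}

// A boxed closure that runs against the connection on the database thread.
type DbJob = Box<dyn FnOnce(&mut Connection) + Send>;

struct Db {
    tx: mpsc::Sender<DbJob>,
}

impl Db {
    fn new() -> Self {
        let (tx, rx) = mpsc::channel::<DbJob>();
        thread::spawn(move || {
            let mut conn = Connection { rows: Vec::new() };
            for job in rx {
                job(&mut conn);
            }
        });
        Db { tx }
    }

    // Generic call(): send a closure to the db thread and block on the reply.
    fn call<R, F>(&self, f: F) -> R
    where
        R: Send + 'static,
        F: FnOnce(&mut Connection) -> R + Send + 'static,
    {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.tx
            .send(Box::new(move |conn: &mut Connection| {
                let _ = reply_tx.send(f(conn));
            }))
            .unwrap();
        reply_rx.recv().unwrap()
    }
}

fn main() {
    let db = Db::new();
    db.call(|conn| conn.rows.push("hello".to_string()));
    let n = db.call(|conn| conn.rows.len());
    assert_eq!(n, 1);
    println!("rows: {n}");
}
```

As noted later in the thread, sending a boxed closure costs an allocation per call that a command enum avoids, but it spares every operation its own enum variant.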


@matze commented on GitHub (Apr 12, 2025):

> Thanks to @qaqland, with a generic `call()` method we can get the performance gain without big refactoring work.

Is it also in your branch? Sorry for being late, this week has been a bit short on time.


@mokurin000 commented on GitHub (Apr 14, 2025):

> > Thanks to @qaqland, with a generic `call()` method we can get the performance gain without big refactoring work.
>
> Is it also in your branch? Sorry for being late, this week has been a bit short on time.

Hi! Sorry for the slow response.

Yeah, I have implemented it in https://github.com/mokurin000/wastebin/tree/human-readable-perf-kanal

I tried to cherry-pick the performance patch, but there were too many conflicts, so it is now based on the human-readable branch.


@matze commented on GitHub (May 17, 2025):

Can you try [this branch](https://github.com/matze/wastebin/tree/improve-perf) and check the results? It's a similar design but uses a bog-standard `mpsc` channel. On my system I see even bigger improvements of around 45% rather than 12.5%.


@mokurin000 commented on GitHub (May 18, 2025):

> Can you try [this branch](https://github.com/matze/wastebin/tree/improve-perf) and check the results? It's a similar design but uses a bog-standard `mpsc` channel. On my system I see even bigger improvements of around 45% rather than 12.5%.

Nice work!

My kanal implementation sends boxed closures, which is more expensive than sending commands, but requires less work.

We could also try replacing the tokio `mpsc` channels with kanal ones? That is currently the fastest channel.


@mokurin000 commented on GitHub (May 18, 2025):

Strange... on my i7-12700H, running Arch with kernel 6.14.6:

| branch | RPS |
| ------ | ----- |
| master | 46191.89 |
| kanal | 48767.67 |
| tokio-mpsc | 50692.79 |

Is that due to the CPU difference?
For benchmarking, my parameters are a 5-second warmup and a 30-second bench.


@matze commented on GitHub (May 18, 2025):

> Is that due to the CPU difference?

Perhaps. I get wildly different results on an i7-13700H (20 threads): 16209.17 (master) vs 24695.38 (mpsc) vs 33287.72 (kanal). And the differences become smaller for smaller user numbers. So yeah, I will go for kanal even though it's yet another dependency :-/

Another difference is that my implementation runs the database handler in a `spawn_blocking` thread that is managed by tokio, and both the server listener and database handler futures are scheduled with `futures_concurrency`'s `join()` rather than by spawning a tokio task.


@matze commented on GitHub (May 18, 2025):

One last thing: these huge numbers are of course only possible with an in-memory database. These changes do not do much when the disk is hammered with writes. But in any case, a lot more reads than writes is probably the norm for a pastebin.
