[GH-ISSUE #1276] "unable to enqueue message" when AsyncClient<UdpResponse> sends too many requests #641
Originally created by @LEXUGE on GitHub (Nov 8, 2020).
Original GitHub issue: https://github.com/hickory-dns/hickory-dns/issues/1276
Describe the bug

During pressure tests at a scale like 890 qps, I found errors like `unable to enqueue message`, where `droute` is the name of my project.

To Reproduce

Hard to get down to a minimal reproducible code snippet. However, I did the following:

- Create an `AsyncClient` and send queries through it. There is only one `AsyncClient`, but it is cloned several times.
- I tested with a delay between queries: 1 millisecond doesn't help (the `unable to enqueue` issue persists), and 2 milliseconds result in timeouts (I set the timeout to about 2 seconds per query).
- I also tested with multiple `AsyncClient`s, which results in a high rate of timeouts.

The related code can be found here.

Expected behavior

No error.

System:

Version:

- Crate: client
- Version: 0.19.5

Additional context

I also used `tokio-compat`, but the issue doesn't seem to occur with it.

@LEXUGE commented on GitHub (Nov 8, 2020):
It might be caused by `CHANNEL_BUFFER_SIZE`. Can we have an option to use an unbounded mpsc channel?

@LEXUGE commented on GitHub (Nov 9, 2020):
I found out this is not related to the buffer size: I was holding an async `Mutex` across `send`, which causes the problem. I don't know why holding the async `Mutex` across the call causes it, but it is solved.

@djc commented on GitHub (Nov 9, 2020):
Can you explain a bit more about the changes that fixed the problem you were seeing? It sounds like there might be a potential deadlock in trust-dns that you triggered?
@LEXUGE commented on GitHub (Nov 9, 2020):
Sure, here is the problematic code. I have roughly an async `Mutex<HashMap<usize, AsyncClient<UdpResponse>>>`, and I locked it across the `send().await` operation. After a while of running free of errors, it starts to time out, and finally gives me `unable to enqueue: couldn't find receivers` or `receiver is gone` (I don't remember exactly).

Either taking the async `Mutex` only before and after the `send` operation (not across it) or using a sync `Mutex` helps me mitigate the issue. In conclusion, I can't hold a `MutexGuard<AsyncClient>` across `send().await`, or else it causes the issue.
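A minimal sketch of the two patterns, using a stand-in `AsyncClient` type and an argument-less `send` (the real `send` takes a DNS request; names here are illustrative):

```rust
use std::{collections::HashMap, sync::Arc};
use tokio::sync::Mutex;

// Stand-in for trust-dns's AsyncClient, which is cheap to clone because it
// only holds the sending half of a channel to the background task.
#[derive(Clone)]
struct AsyncClient;

impl AsyncClient {
    async fn send(&mut self) { /* enqueue a query and await the response */ }
}

// Problematic: the MutexGuard is held across `send().await`, so every other
// task blocks on the lock for the full duration of the in-flight query.
async fn query_holding_lock(clients: Arc<Mutex<HashMap<usize, AsyncClient>>>, key: usize) {
    let mut guard = clients.lock().await;
    if let Some(client) = guard.get_mut(&key) {
        client.send().await; // guard is still held here
    }
}

// Mitigation: clone the client out of the map; the guard is a temporary that
// is dropped at the end of the first statement, before the await.
async fn query_cloned(clients: Arc<Mutex<HashMap<usize, AsyncClient>>>, key: usize) {
    let client = clients.lock().await.get(&key).cloned();
    if let Some(mut client) = client {
        client.send().await; // no lock held during the await
    }
}
```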
Also, unbounded vs. bounded `mpsc` is not related to this issue: I tried an unbounded channel and got the same issue. In addition, both `0.19.5` and `0.20-alpha3` give me the issue with the same error: `receiver is gone`. The `HashMap<usize, AsyncClient>` is a correspondence between the configurations the user specified and the clients; it only has 4 or 5 keys.

@djc commented on GitHub (Nov 9, 2020):
Why are you holding on to so many separate `AsyncClient` instances in the first place?

@LEXUGE commented on GitHub (Nov 9, 2020):
No, I am not. I created a `HashMap` to "cache" an `AsyncClient` per configuration. I clone the same `AsyncClient` for each incoming query to save the time needed to create an extra client.

@LEXUGE commented on GitHub (Nov 29, 2020):
I tested the same code with Tokio 0.3 on the main branch; it seems able to enqueue messages, but it has to wait for a long time. Can we make the channel unbounded, or have an option for that?
@djc commented on GitHub (Nov 29, 2020):
Effectively this is the `AsyncClient` applying backpressure towards your application, because the internal queue isn't being emptied fast enough. I think the solution here is (a) investigating what the performance problem is on the other side of the queue, or (b) keeping a buffer in your application. We could make the channel larger, but in the limit that would just mean that your queries start to time out.

@LEXUGE commented on GitHub (Nov 29, 2020):
Does having multiple `AsyncClient`s help? I suppose each client has its own (background) queue?
@djc commented on GitHub (Nov 29, 2020):
If you set up multiple async clients separately, that should mitigate the problem, yes. But again, that just means you're now potentially opening multiple connections to the same backend servers, which might cause your client to get rate-limited sooner.
It would be useful to know what the bottleneck in draining the queue is, in your application.
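For illustration, spreading queries over separately connected clients could be as simple as a round-robin pool (a sketch with a stand-in client type; trust-dns doesn't provide such a pool itself):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in for trust-dns's AsyncClient (Clone, channel-backed).
#[derive(Clone)]
struct AsyncClient;

// Each entry was connected separately, so each has its own background task
// and its own internal queue; clones of one entry still share that queue.
// Assumes a non-empty pool.
struct ClientPool {
    clients: Vec<AsyncClient>,
    next: AtomicUsize,
}

impl ClientPool {
    fn pick(&self) -> AsyncClient {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.clients.len();
        self.clients[i].clone()
    }
}
```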
@LEXUGE commented on GitHub (Nov 29, 2020):
Thanks, I will investigate further to see where the exact bottleneck is.
@djc commented on GitHub (Nov 29, 2020):
dcompass looks like a cool project, BTW!
Is there a particular reason you're using trust-dns-client here rather than trust-dns-resolver?
@LEXUGE commented on GitHub (Nov 29, 2020):
To tell you the truth, the first prototype used the resolver because I didn't understand how the client works. However, that meant I had to fill the IP response back into a newly created packet and send it back, which is cumbersome. Using the client means I only need to forward the query and send back the answers, which is kind of more native.
@LEXUGE commented on GitHub (Nov 29, 2020):
Seems like it was a network issue only. It can now top 3000 qps!

@djc commented on GitHub (Nov 30, 2020):
So this can be closed again, right?
@LEXUGE commented on GitHub (Nov 30, 2020):
I think so. However, regarding the queue size, I hope it could be increased or exposed as an argument.
@djc commented on GitHub (Nov 30, 2020):
I don't really see a good reason to do that: if there's a network delay for example, it's better for your application to become aware of that sooner rather than later (through timeouts).
@LEXUGE commented on GitHub (Jan 14, 2021):
Update: although backpressure is expected, on `0.20` the underlying channel seems to stay full forever once it has been filled, i.e. it's irrecoverable. I suspect this is a bug.

@djc commented on GitHub (Jan 14, 2021):
Would be nice to have a minimal reproduction that demonstrates the issue.
@LEXUGE commented on GitHub (Jan 14, 2021):
I believe this is the minimal sample. However, since I have a "good" network environment now, I cannot really test it out. (It works under the current environment.)
@djc commented on GitHub (Jan 14, 2021):
I don't think it would be a bug if that fails. The application needs to be able to handle backpressure from the underlying library, by slowing down the rate of requests if necessary. The channel always being full just means the receiver cannot handle requests as fast as the sender is trying to send them. If that happens with your application, you should find some way for it to handle the backpressure, for example by temporarily delaying further requests or by load balancing over another channel.
Maybe do some reading on backpressure if you're not familiar with it:
https://medium.com/@jayphelps/backpressure-explained-the-flow-of-data-through-software-2350b3e77ce7
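A sketch of one way to absorb this backpressure on the application side: cap the number of in-flight queries with a `tokio::sync::Semaphore`, so requests queue (and can be timed out) in the application instead of overflowing the client's internal channel. The stand-in client type and the cap value are illustrative:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Stand-in for trust-dns's AsyncClient.
#[derive(Clone)]
struct AsyncClient;

impl AsyncClient {
    async fn send(&mut self) { /* enqueue a query and await the response */ }
}

// Keep the cap below the client's internal channel capacity so "unable to
// enqueue message" is never hit; excess callers wait on the semaphore.
async fn query_limited(limit: Arc<Semaphore>, mut client: AsyncClient) {
    let _permit = limit.acquire().await.expect("semaphore never closed");
    client.send().await;
    // _permit is dropped here, releasing the slot to the next caller.
}
```

A single `Arc::new(Semaphore::new(16))` shared by all request handlers would do; 16 is an arbitrary choice that would need tuning.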
@LEXUGE commented on GitHub (Jan 14, 2021):
However, the situation here is that the channel is full even though the sender stopped sending any messages for a considerable amount of time. That is what I didn't expect. I think by either cancelling, dropping, or handling it internally, trust-dns should be able to empty the channel somehow.

@LEXUGE commented on GitHub (Jan 14, 2021):
For example, after I stressed the requestor out, the channel stays full even after 1 minute, as I tested with `www.example.cn` (I tested longer intervals as well).

But I still cannot reproduce it with that minimal sample. Maybe this issue is related to my codebase; however, I don't see any substantial difference between the sample and my codebase.
@djc commented on GitHub (Jan 15, 2021):
Okay, that does sound like a bug. Without some way to reproduce it or a more detailed problem report, I'm not sure I have any avenues for fixing it, though.
@LEXUGE commented on GitHub (Jan 16, 2021):
I tweaked the logging a little and found this bizarre situation.
This might be related to my codebase. I am not sure if I misused the `UdpClientStream`. What I did is create a client once and clone it for later usage.

It's weird to see the receiver being dropped, though.
@LEXUGE commented on GitHub (Jan 16, 2021):
Seems like before the receiver was gone, there was always some error like `failed to associate send_message response to the sender`. This may cause the background and the receiver to drop, resulting in this issue.

@LEXUGE commented on GitHub (Jan 17, 2021):
The root cause of this issue is that the background task encountered some error and exited, while the client then tried to send a message and failed because the background no longer existed. What I did is create a new `AsyncClient` whenever it fails.
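That workaround might look roughly like this (stand-in types: the real `AsyncClient::connect` takes a connection future and also yields a background task that must be spawned, and `send` returns a response rather than `()`):

```rust
// Stand-ins for the real client and error types.
#[derive(Clone)]
struct AsyncClient;
#[derive(Debug)]
struct SendError;

impl AsyncClient {
    async fn connect() -> AsyncClient { AsyncClient }
    async fn send(&mut self) -> Result<(), SendError> { Ok(()) }
}

// If a send fails (e.g. the background task already exited), replace the
// client with a freshly connected one and retry the query once.
async fn send_with_reconnect(client: &mut AsyncClient) -> Result<(), SendError> {
    match client.send().await {
        Ok(resp) => Ok(resp),
        Err(_background_gone) => {
            *client = AsyncClient::connect().await;
            client.send().await
        }
    }
}
```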
@djc commented on GitHub (Jan 17, 2021):

Did you figure out what error the background task encountered? We should maybe make the background task more resilient to failure if the error is at all recoverable.
@LEXUGE commented on GitHub (Jan 17, 2021):
Mainly `failed to associate send_message response to the sender`.

For HTTPS clients, there may also be `io_stream hit an error, shutting down: not an error` (approximately, as the underlying streams get a `close_notify`).

@bluejekyll commented on GitHub (Jan 18, 2021):
Thanks for all the research on this. I'm wondering if the issue here is that we are hitting a network error, but the request future waiting for the result can't be notified, because there is no ID associated back to the stream (thus forcing the timeout to expire before resolving itself).

Do we need a better method of binding the request ID and the IO stream together, such that when the IO stream fails we can immediately return a result to the channel waiting for a response? (I'm guessing this might be the issue based on @LEXUGE's research.)
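That binding could be sketched with `oneshot` channels (hypothetical types, not trust-dns's actual internals):

```rust
use std::collections::HashMap;
use tokio::sync::oneshot;

struct DnsResponse;
#[derive(Debug)]
struct IoError;

// Pending requests keyed by DNS message ID.
struct Pending {
    waiters: HashMap<u16, oneshot::Sender<Result<DnsResponse, IoError>>>,
}

impl Pending {
    // A response arrived: complete the matching waiter, if any.
    fn complete(&mut self, id: u16, resp: DnsResponse) {
        if let Some(tx) = self.waiters.remove(&id) {
            let _ = tx.send(Ok(resp));
        }
        // else: "failed to associate send_message response to the sender"
    }

    // The IO stream died: fail every waiter immediately instead of
    // leaving each one to hit its own timeout.
    fn fail_all(&mut self) {
        for (_, tx) in self.waiters.drain() {
            let _ = tx.send(Err(IoError));
        }
    }
}
```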
@LEXUGE commented on GitHub (Jan 18, 2021):
There are two separate cases.

One is that the background encountered some PERMANENT error; in that case the `AsyncClient` should make the caller aware of the situation, or maybe destruct itself, since a client without a background makes no sense.

The other is that the `AsyncClient` is irresponsibly dropped. In that case, I expect the background to exit by itself (as it does currently).

However, if the error is transient (like a network issue, if the background is able to tell), I expect the background to survive through those errors.
@djc commented on GitHub (Jan 18, 2021):
The behavior in the case of failing to associate seems wrong; see if #1356 improves the situation?
For the other one, is that the exact message? I cannot find where the "not an error" phrase would have come from.
@LEXUGE commented on GitHub (Jan 18, 2021):
It is not exact. The exact message (for DoH clients) is `io_stream hit an error, shutting down: h2 stream errored: protocol error: not a result of an error`. This is actually not an error, I suppose, as it is part of the action h2 takes for `CloseNotify`.

@LEXUGE commented on GitHub (Jan 18, 2021):
And for the `failed to associate` error, that PR doesn't seem to be the right behavior either, as I pointed out. Currently, if there are multiple `AsyncClient`s and a single `DnsExchangeBackground`, and one of the `AsyncClient`s sent a query but went away, then the `DnsExchangeBackground` dies out. However, with that PR, even if all the clients are dropped, the background still carries on, which is not right either.

Can we figure out how to let the background exit if and only if all clients are dropped or some irrecoverable error is encountered (and better to let all the clients know in that case)?
@djc commented on GitHub (Jan 18, 2021):
That's not how I understand the change I made. It seems to me that the change only lets the task go on if sending a response to a receiver fails. However, the `DnsExchangeBackground` will also poll the `outbound_messages` queue of requests, and if that goes empty (`outbound_messages.as_mut().poll_next(cx)` yielding `Poll::Ready(None)`, meaning all senders are gone) the underlying stream will shut down, and the `DnsExchangeBackground` will return `Poll::Ready(Ok(()))` once shutdown is complete. This should take care of the scenario you describe.
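That lifetime tie-in mirrors plain `mpsc` semantics: the receiving side only sees end-of-stream once every sender handle (here, every cloned client) has been dropped. A toy demonstration:

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<u32>(32);
    let tx2 = tx.clone(); // clones behave like cloned AsyncClients

    drop(tx);
    drop(tx2); // the last sender is gone

    // recv() returns None only once all senders are dropped; this is the
    // signal a background loop can use to begin an orderly shutdown.
    assert_eq!(rx.recv().await, None);
}
```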
@LEXUGE commented on GitHub (Jan 18, 2021):

Thanks, that seems right to me now.