[GH-ISSUE #228] intermittent long stalling when using UDP EDIT: TCP likely affected too. Problem resolvable by Cloak client restart #184

Open
opened 2026-02-26 12:34:12 +03:00 by kerem · 4 comments

Originally created by @LindaFerum on GitHub (Aug 2, 2023).
Original GitHub issue: https://github.com/cbeuw/Cloak/issues/228

It appears UDP has a problem: after serving a very active shadowsocks UDP forwarding instance for a few hours, the connection will simply drop and permanently stall on the shadowsocks side (shadowsocks-rust's log at this point shows that the services behind it are trying to send more packets, but nothing arrives from Cloak's side), while the Cloak client appears reasonably okay.

I believe it is a Cloak problem and not a shadowsocks, network, or downstream software problem, because so far I have been able to resolve the issue immediately by restarting the Cloak client (while restarting shadowsocks-rust or the program actually generating the UDP packets does nothing).

It may be pretty hard to reproduce because it requires a very active UDP connection (just web browsing doesn't cut it; EDIT: watching a YouTube video seems to "do the trick"; browsing was done using OpenVPN (UDP) + SOCKS5 to the shadowsocks client).
Configuring shadowsocks UDP to use more than one worker thread (the shadowsocks-rust port allows that) seems to trigger the issue faster.


@LindaFerum commented on GitHub (Aug 3, 2023):

Additional observation:
I tried to obtain a log using "verbosity trace".

While it is quite obvious that something is not all right (the log moves extremely fast at first in stdout but then suddenly slows to a crawl), no error is ever displayed: messages in the style of TRAC[2023-08-03T08:08:28+03:00] 135 read from stream 1 with err <nil> just slow to a crawl as the problem asserts itself.

Restarting the Cloak client (as I mentioned before) resolves the problem for a while.

EDITED TO ADD

This also happens when using TCP mode, albeit far less frequently

TCP configurations tested:
OpenVPN(TCP)<->Cloak
OpenVPN(TCP)<->shadowsocks-rust<->Cloak

(OpenVPN used to convert UDP traffic to TCP)

At consistently high load, the connection would eventually just stall (OpenVPN losing connection).

TCP connections tend to recover from a stall eventually (within seconds to minutes), so yes, TCP works better, but there's definitely something odd going on on Cloak's part: during a stall, restarting OpenVPN or shadowsocks does not help, while restarting the Cloak client does, which suggests it's the same problem I initially ran into with UDP.


@LindaFerum commented on GitHub (Aug 4, 2023):

Can consistently reproduce the "TCP variant" of hiccup problem via following procedure:

VM1 (runs browser with youtube video and a terminal with ping constantly trying to ping 8.8.8.8)
|
VM2 (OpenVPN, TCP mode with SOCKS proxy option (TCP) enabled, config is ProtonVPN's free tier TCP server with socks-proxy directive added)
|
VM3 (runs cloak configured to serve TCP connection to the SOCKS proxy)
|
internet
|
VPS, with Cloak server and SOCKS proxy to which TCP connection is delivered via cloak
|
more internet :)
|
ProtonVPN's free VPN (TCP of course)
|
more internet :)

The connection starts great and works reliably for 4-15 minutes.
Then ping suddenly stalls for multiple seconds.
Sometimes it self-recovers fast.
Sometimes it takes a while.

Usually it does not break the connection.

Nothing in Cloak's log.
Nothing in the OpenVPN log (unless the connection breaks, in which case it does the usual TCP OpenVPN dance).

Evidence it is a Cloak issue and not say, networking:

Replacing Cloak in VM3 with Dante in a [chaining config](https://www.inet.no/dante/doc/1.3.x/config/chaining.html) completely resolves the situation; no more hiccups.

When running Dante and Cloak in VM3 in parallel (on different ports), just switching between two OpenVPN configs (exactly the same, except one points to Dante's port on VM3 and the other to Cloak's port) lets me switch immediately between "hiccups present" and "no hiccups".

EDIT: I will continue running this VM periodically from the "lab" (rich term for my rickety setup) and see how it goes in terms of "TCP hiccuping"; I will also set up a roughly similar VM testbed for UDP, but it's a bit trickier to get a good comparator there (UDP support in SOCKS kinda sux).

EDIT:
So running those two (the "through TCP cloak" and "raw TCP socks" chain) on same uplink (good country, no filtering/blocking)

I'm finding that

  1. the hiccups with Cloak are actually very intermittent and "luck based", so maybe something external (network conditions?) is triggering them
  2. they never happen on the SOCKS-TCP variant, so it's not entirely reducible to network problems
  3. playing around with the number of connections (and, for some reason, StreamTimeout, though this may be placebo :) ) seems to have some effect; I've found that on my particular connection 5 is the happy connection number (Cloak TCP almost never "hiccups") while 3, 4, and 6 all perform worse.
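For reference, both knobs mentioned in point 3 are fields in Cloak's client config file. A minimal fragment (values here are just the ones that happened to work in this setup, not recommendations; other required fields such as UID and ServerName are omitted):

```json
{
  "NumConn": 5,
  "StreamTimeout": 300
}
```

NumConn controls how many underlying TLS connections the client multiplexes streams over, which is plausibly why it interacts with stall behaviour.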

@notsure2 commented on GitHub (Aug 7, 2023):

Cloak tunnels UDP packets inside TCP, and since it's now TCP there is no more UDP packet loss, so protocols that depend on sensing UDP packet loss to optimize their rate get confused. UDP is also affected by the same issue you described for TCP.


@LindaFerum commented on GitHub (Aug 7, 2023):

Hm, I think it has something to do with how Cloak handles its "outer layer" TCP connection (possibly some small, unavoidable intermittent connectivity issue eventually triggers it to manifest), and UDP just gets hit harder by being encapsulated inside the affected TCP, so you get "two problems" instead of one in some weird way.
