[GH-ISSUE #642] [BUG] Successful task is being re-run because of a "context deadline exceeded" error #1333

Closed
opened 2026-03-07 22:08:41 +03:00 by kerem · 5 comments

Originally created by @nikhilsaraf on GitHub (Apr 10, 2023).
Original GitHub issue: https://github.com/hibiken/asynq/issues/642

Originally assigned to: @hibiken on GitHub.

Describe the bug
I'm not sure what the bug is, but can answer more questions or share information as needed.

If there are any logs I can look at to help understand when a task is completed and when the context deadline may have been exceeded, I'm happy to share those.

To Reproduce
Steps to reproduce the behavior (Code snippets if applicable):

  1. Not sure. But I ended my handler with a defer() statement in which I enqueued new messages using a new client; this is the only thing I can think of that could have triggered it.

Also, this has never happened to me before, even with the same workflow, and nothing has changed in my workflow.

Is it possible that I have a memory leak somewhere which is keeping my thread active?

Expected behavior
Message should have been run exactly once and should have ended when my handler function finished.

Screenshots

Last Failure
Failed at:     Mon Apr 10 13:41:37 UTC 2023
Error message: context deadline exceeded

Environment (please complete the following information):

  • OS = Linux
  • Version of asynq package = v0.24.0

Additional context
First task (started at 20230410T131137UTC):

2023/04/10T13:11:37.441733UTC   INFO    taskID=73ed809e-4f92-4421-bc49-a74855e70981, taskType=VerifyTransferTask        handlers/managedHandler.go:69                   processing task with TaskID=73ed809e-4f92-4421-bc49-a74855e70981
2023/04/10T13:53:57.043366UTC   INFO    taskID=73ed809e-4f92-4421-bc49-a74855e70981, taskType=VerifyTransferTask        queueclient/asynqClient.go:78                   enqueued task: id=2b22b003-3b0b-4643-a72f-494fa1a50c2b (parentID=73ed809e-4f92-4421-bc49-a74855e70981)

There is no log line for `wrote task result for TaskID`, which is probably why it was retried (even though it succeeded)?
Not sure why the code that writes `wrote task result for TaskID` was not run (see below), since that has never happened before. I think it has something to do with the context deadline exceeded, which "cut off" the task's time.
I'm thinking maybe it did not execute because the system was busy and the thread was paused and didn't get enough CPU time?

Second task (started at 20230410T134157UTC)

2023/04/10T13:41:57.708944UTC   INFO    taskID=73ed809e-4f92-4421-bc49-a74855e70981, taskType=VerifyTransferTask        handlers/managedHandler.go:69                   processing task with TaskID=73ed809e-4f92-4421-bc49-a74855e70981
2023/04/10T13:53:13.31521UTC    INFO    taskID=73ed809e-4f92-4421-bc49-a74855e70981, taskType=VerifyTransferTask        queueclient/asynqClient.go:78                   enqueued task: id=6927cb7f-366f-4902-ad03-3f5b0250dc9b (parentID=73ed809e-4f92-4421-bc49-a74855e70981)
2023/04/10T13:53:13.317213UTC   INFO    taskID=73ed809e-4f92-4421-bc49-a74855e70981, taskType=VerifyTransferTask        handlers/managedHandler.go:199                  wrote task result for TaskID=73ed809e-4f92-4421-bc49-a74855e70981

I am currently calling `writeOutputToResultWriter` in a `defer` statement, which never got executed for some reason. This is the function that writes the `wrote task result for TaskID` log line.

The solution I am thinking of now is to remove all my defer statements, as I was expecting the defer to be called no matter what happened, but that doesn't seem to be the case here.
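For context, here is a rough sketch of the structure being described. The helper name is the reporter's; its body below is an assumption based on asynq's ResultWriter API, not code from this issue:

```go
package handlers

import (
	"context"
	"log"

	"github.com/hibiken/asynq"
)

// writeOutputToResultWriter is a guess at the reporter's helper: it persists
// the handler output via the task's ResultWriter and logs the
// "wrote task result for TaskID=..." line.
func writeOutputToResultWriter(t *asynq.Task, output []byte) {
	w := t.ResultWriter()
	if _, err := w.Write(output); err != nil {
		log.Printf("failed to write task result for TaskID=%s: %v", w.TaskID(), err)
		return
	}
	log.Printf("wrote task result for TaskID=%s", w.TaskID())
}

// HandleVerifyTransferTask shows the deferred-write pattern: the result is
// written in a deferred call that runs when the handler returns.
func HandleVerifyTransferTask(ctx context.Context, t *asynq.Task) error {
	var output []byte
	defer func() { writeOutputToResultWriter(t, output) }()

	// ... verification work that populates output ...
	return nil
}
```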

Please also note that I have a `panic` statement in one of my defer calls that enqueues more messages; it panics if that enqueue fails. I can't access the log line for that since I run the queue server in a `screen` session on Linux and can't scroll back up to see any panic that may have been triggered. The only plausible explanation is that the enqueue failed and caused the panic to run, which caused my task to be re-executed. That leaves the question of whether there is any way for me to find out why the `panic` was called when enqueuing the subsequent messages from inside the task handler. I am updating my code to be a little more debuggable, but I'm wondering if you have any advice on where to go from here on this specific issue.

kerem closed this issue and added the bug label (2026-03-07 22:08:41 +03:00).

@nikhilsaraf commented on GitHub (Apr 10, 2023):

The larger question I'm trying to understand is why the second (retry) task started at 2023/04/10T13:41:57.708944UTC even though the first task was still running at 2023/04/10T13:53:57.043366UTC

Is there something I'm doing wrong that is causing the same task to be run twice concurrently?


@bfwbbrj commented on GitHub (Apr 11, 2023):

It's probably caused by your code running longer than usual. Currently, `ProcessTask` has a SOFT EXECUTION TIME LIMIT of 30 seconds that is checked and refreshed every 5 seconds, and a HARD EXECUTION TIME LIMIT that defaults to 30 seconds and can be overridden by setting `Timeout`, `Deadline`, or both on your [Msg](https://github.com/hibiken/asynq/blob/cc777ebdaa62b69bd6e985fa97117b854e7d1cd6/client.go#L121). Note that if neither is set, the 30-second HARD LIMIT will be used, and this should be the reason you are seeing the `context deadline exceeded` error. I suggest you set one of them to a large enough value to avoid this kind of issue; you can find the docs [here](https://github.com/hibiken/asynq/blob/cc777ebdaa62b69bd6e985fa97117b854e7d1cd6/internal/base/base.go#L269).
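For reference, setting a per-task timeout at enqueue time looks roughly like this (the Redis address, payload, and two-hour value are placeholders, not values from this issue):

```go
package main

import (
	"log"
	"time"

	"github.com/hibiken/asynq"
)

func main() {
	// Placeholder Redis address; adjust to your environment.
	client := asynq.NewClient(asynq.RedisClientOpt{Addr: "localhost:6379"})
	defer client.Close()

	task := asynq.NewTask("VerifyTransferTask", []byte(`{"transfer_id": "..."}`))

	// Override the default timeout so long-running work is not cut off.
	// asynq.Deadline(someTime) can set an absolute deadline instead.
	info, err := client.Enqueue(task, asynq.Timeout(2*time.Hour))
	if err != nil {
		log.Fatalf("could not enqueue task: %v", err)
	}
	log.Printf("enqueued task: id=%s queue=%s", info.ID, info.Queue)
}
```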

> Please also note that I have a panic statement in one of my defer calls that enqueues more messages and would panic if that failed. I can't access the log line for that since I run the queue server in a screen session on Linux and can't scroll back up to see any panic statement that may have been triggered. The only plausible explanation is that the enqueue failed and caused the panic call to get run, which caused my task to be re-executed. So then that leaves the question of whether there is any way for me to find out why the panic was called when enqueuing the subsequent messages from inside the taskHandler. I am updating my code to be a little more debuggable but wondering if you have any advice on where to go from here on this specific issue

You should be able to use `defer` and `recover` to recover from your `panic` and log the relevant info; you can find more in this [blog post](https://go.dev/blog/defer-panic-and-recover).
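As a sketch (the handler name is illustrative), recovering inside the handler lets you log the panic value and stack trace before asynq retries the task:

```go
package handlers

import (
	"context"
	"fmt"
	"log"
	"runtime/debug"

	"github.com/hibiken/asynq"
)

// HandleVerifyTransferWithRecover recovers from any panic raised in the
// handler body or in its deferred calls, logs the cause, and reports it as an
// error so the failure message is visible instead of being lost.
func HandleVerifyTransferWithRecover(ctx context.Context, t *asynq.Task) (err error) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("panic while processing %s: %v\n%s", t.Type(), r, debug.Stack())
			err = fmt.Errorf("task handler panicked: %v", r)
		}
	}()

	// ... actual task work and any deferred enqueues go here ...
	return nil
}
```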

P.S. Note that asynq only guarantees at least once message delivery, so it'd be better to make sure all your tasks are idempotent.
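A rough sketch of the idempotency idea, using an in-memory map as a stand-in for durable storage (a real handler would check a database or similar) and asynq's GetTaskID context helper:

```go
package handlers

import (
	"context"
	"sync"

	"github.com/hibiken/asynq"
)

// completed stands in for durable storage keyed by task ID; a map is used
// here only to keep the sketch self-contained.
var completed sync.Map

// IdempotentHandler skips the side effects if an earlier delivery of the same
// task already completed them, so a retry of a successful task is harmless.
func IdempotentHandler(ctx context.Context, t *asynq.Task) error {
	id, ok := asynq.GetTaskID(ctx)
	if !ok {
		id = t.Type() + ":" + string(t.Payload()) // fall back to a payload-derived key
	}
	if _, done := completed.Load(id); done {
		return nil // already processed by a previous delivery
	}

	// ... perform the actual work exactly once here ...

	completed.Store(id, struct{}{})
	return nil
}
```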


@nikhilsaraf commented on GitHub (Apr 11, 2023):

The soft and hard execution time limits are very helpful, thank you.

The first task that failed ran for 30 minutes

I will of course restructure the task to run within the deadline; however, I'm now wondering why it didn't fail much sooner because of the context deadline exceeded?

(see timings in the log lines posted above)

Note: I don’t set any deadlines today


@bfwbbrj commented on GitHub (Apr 11, 2023):

> I’m now wondering why it didn’t fail much sooner because of the context deadline exceeded?

I made a typo: the default timeout is 30 minutes. That should explain why it didn't fail much sooner.

> I will of course restructure the task to run within the deadline

Actually, you don't have to. You can set a larger timeout value to avoid task failures due to timeouts. (Though restructuring the task works as well.) :)


@nikhilsaraf commented on GitHub (Apr 11, 2023):

Ok great, it all makes sense now. Thank you for addressing all my questions!
