[GH-ISSUE #1079] [FEATURE REQUEST] Prioritized Execution of Failed Retry Tasks #2541

Open
opened 2026-03-15 20:49:25 +03:00 by kerem · 1 comment

Originally created by @HarvestWu on GitHub (Oct 15, 2025).
Original GitHub issue: https://github.com/hibiken/asynq/issues/1079

Originally assigned to: @hibiken, @kamikazechaser on GitHub.

Is your feature request related to a problem? Please describe.
I'm frustrated when a task fails, is scheduled for a retry, and gets placed at the end of a long queue. If the system is processing a backlog of tasks, this failed task must wait for the entire queue to be processed before it can be retried. This can lead to significant delays in processing critical tasks that may have failed due to a transient issue (e.g., a brief network hiccup or a temporary dependency unavailability), effectively causing "starvation" for the retrying task.

Describe the solution you'd like
I would like Asynq to support an option where a task that fails and is scheduled for a retry can be re-queued to the front of its queue instead of the back. This would ensure that the next available worker picks up the failed task immediately, minimizing delay. This could be implemented as a new RetryMode or as an option when enqueuing a task, for example: asynq.RetryQueueFront().

Describe alternatives you've considered

  1. Increasing RetryDelay significantly: This doesn't solve the problem; it just spaces out the long waits between retries.
  2. Creating a separate high-priority queue for retries: This is a workaround where I would manually handle failures by enqueuing a new task into a "retry" queue with higher priority. However, it requires complex error-handling logic outside Asynq's built-in retry mechanism and loses the original task's state and history.
  3. Reducing the number of retries and relying on external monitoring to re-enqueue failed tasks: This adds significant operational complexity and is not a robust solution.

Additional context
The current behavior, where retries are placed at the back of the queue, is safe for preventing a constantly failing task from blocking the queue. However, a configuration option to prioritize retries would be invaluable for use cases where immediate retry for potentially transient failures is critical for application performance and user experience. This feature is available in other message queue systems and would greatly enhance Asynq's flexibility.


@RychEmrycho commented on GitHub (Dec 11, 2025):

Curious, are you clubbing non-critical and critical tasks in the same queue? If all the tasks in the queue are at the same level of criticality, then I think it's fair for the failed task to wait its turn while the other critical tasks complete; after all, they are equally critical, right?

Maybe we can also look at it from a different angle: why does a critical task need to wait so long at all? Can we speed up processing in general, for example by adding consumer instances, increasing the worker pool size, or optimizing the task processing itself?

Additionally, since you mention network hiccups/transient errors, I assume an HTTP client is involved? In that case, you could consider a retrier mechanism, for example using heimdall (https://github.com/gojek/heimdall?tab=readme-ov-file#creating-an-http-client-with-a-retry-mechanism); with this you can attempt a few retries before marking the task failed.

However, I'm not quite sure about the details of your problem statement; if none of the above applies, then perhaps you actually do need this functionality 😅

<!-- gh-comment-id:3642665943 -->