[GH-ISSUE #1046] [BUG] Scheduler Tasks Frequently Skipped When Large Number of Handlers Registered #2522

Open
opened 2026-03-15 20:46:40 +03:00 by kerem · 5 comments

Originally created by @lingcoder on GitHub (May 7, 2025).
Original GitHub issue: https://github.com/hibiken/asynq/issues/1046

Originally assigned to: @hibiken, @kamikazechaser on GitHub.

Describe the bug
We have encountered a critical issue where scheduled tasks managed by the Scheduler are often skipped and not executed as expected. Our application registers around 25 Scheduler tasks and about 30 regular worker handlers. The problem occurs even though we are running a single asynq instance with asynq.Config.Concurrency set to 20,000.

Environment (please complete the following information):

  • OS: Ubuntu 22.04
  • asynq package version: v0.25.1
  • Redis/Valkey version: 7.0.11

To Reproduce
Steps to reproduce the behavior:

  1. Set up background processing with a single asynq instance, configuring asynq.Config.Concurrency to 20,000.
  2. Register approximately 25 Scheduler tasks and 30 regular worker handlers.
  3. Observe the execution over time.
  4. Notice that several scheduled tasks are skipped and do not get executed as expected.

Expected behavior
All Scheduler tasks should be executed according to their schedule without being skipped.


Additional context
We have tried adjusting the concurrency and monitoring resource usage, but the issue persists. This bug is causing critical scheduled jobs to be missed, which impacts our application's reliability.
We suspect that this issue may be related to the load on Redis or the application instance, and possibly the high number of registered scheduled tasks and task handlers. We did not experience this problem when the number of scheduled tasks was smaller, but it started to appear as the number increased recently. This is especially problematic for tasks that are scheduled to run only once per day—if they are skipped, it has a significant impact on our business.


@wangz1x commented on GitHub (May 8, 2025):

@lingcoder Does the number of handlers, or the time each task takes, affect the result?

Now I have

  • registered 25 simple tasks in one scheduler
  • each task takes 2–5 seconds
  • each runs every two minutes
  • 5 handlers registered in the worker

I haven't noticed any skipped runs.

But I'm still quite concerned because the new feature will depend on the scheduler.


@kamikazechaser commented on GitHub (May 15, 2025):

20,000 is an unusually high (and arbitrary) concurrency setting for a single instance. Could you describe your setup, i.e.:

  • Are the workloads CPU or IO bound
  • What are the specs of the machine
  • Where is Redis hosted
  • Are you collecting both go_process and asynq metrics?

> several scheduled tasks are skipped

Could you describe this in more detail? What state are the tasks in? See https://github.com/hibiken/asynq/wiki/Life-of-a-Task#task-lifecycle


@lingcoder commented on GitHub (May 23, 2025):

@kamikazechaser
Thanks for your quick response! Here is some additional background and clarifications:

  • Workload: Our tasks are quite IO-intensive and involve a lot of database batch processing.
  • Deployment: We are running on AWS EKS, and Redis is hosted on AWS ElastiCache (Redis 7.0.x).
  • Monitoring: We have monitoring in place for our infrastructure, but we haven't established detailed metrics specifically for asynq yet. I will try to gather more specifics, but unfortunately our operations engineer recently left, so we are currently limited in this regard.
  • Skipped/Unfired Tasks: The skipped scheduled tasks do not appear in our application logs at all – it's as if those scheduled moments are missed entirely, rather than the job being delayed long enough to eventually appear. The cron expressions we use specify precise run points, not just intervals (e.g., 0 0 * * *). Sadly, when a task is skipped, it leaves no log or trace (failed or otherwise), and I haven't yet checked the exact Redis state at the skipped times.
  • Smaller Scale: When we ran fewer scheduled tasks earlier, everything seemed fine. The problems started after we scaled up the schedules.
  • Issue Scope: At this stage, as a developer (not infra), I don't have deep-dive visibility into the infra/runtime/metrics, but this issue has high business impact for us.
I'm opening this issue partly to find out if anyone else has encountered similar problems when scaling up the number of scheduled tasks/handlers, particularly with a relatively high IO workload and on managed Redis. If this is a known limit or has a recommended workaround, it would help us justify prioritizing actions or design changes internally.

If you or others have insight into what to further investigate, or if there are known scheduler/accounting/concurrency limits in asynq, please point me in the right direction.

I will also try to arrange for someone to further analyze and investigate the issue.

Thanks again for your support and for an awesome project!


@kamikazechaser commented on GitHub (May 23, 2025):

20k is too high for a single instance. I would set it to min(file_descriptors, max_db_connection_pool_size, max_redis_connection_pool_size), then tune from there. After that, scale horizontally across multiple machines while also scaling Redis and the DB.
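The suggested floor can be sketched in a few lines of Go. This is a hedged illustration of the formula above, not asynq code: the pool sizes passed in are hypothetical, and the file-descriptor limit is read from the process rlimit with a conservative fallback.

```go
package main

import (
	"fmt"
	"syscall"
)

// safeConcurrency returns the smallest of the practical connection limits:
// a worker cannot usefully run more concurrent handlers than it has file
// descriptors or DB/Redis connections to serve them.
func safeConcurrency(maxDBConns, maxRedisConns int) int {
	fds := 1024 // conservative fallback if the limit cannot be read
	var rlim syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rlim); err == nil {
		fds = int(rlim.Cur)
	}
	n := fds
	if maxDBConns < n {
		n = maxDBConns
	}
	if maxRedisConns < n {
		n = maxRedisConns
	}
	return n
}

func main() {
	// Hypothetical pool sizes for illustration.
	fmt.Println(safeConcurrency(100, 50))
}
```

The result would then be passed as asynq.Config.Concurrency and adjusted from observed behavior.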


@lingcoder commented on GitHub (May 23, 2025):

Thanks for your suggestion! I’ve already reduced it to 1k for now. If the issue persists, I’ll adjust further as you recommended. Appreciate your help!
