[PR #428] Batch upsert SEO page audits, progress reporting, and idempotency for onboarding full-site analysis #734

opened 2026-03-13 21:06:25 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/AJaySi/ALwrity/pull/428
Author: @AJaySi
Created: 3/12/2026
Status: 🔄 Open

Base: main ← Head: codex/refactor-onboarding_full_website_analysis_executor

📝 Commits (1)

4e17391 (https://github.com/AJaySi/ALwrity/commit/4e173918c7cd86134684b71c7947e8658b7f1468) Refactor onboarding full-site audit batching and progress reporting

📊 Changes

1 file changed (+244 additions, -78 deletions)

📝 backend/services/scheduler/executors/onboarding_full_website_analysis_executor.py (+244 -78)

📄 Description

Motivation

Avoid frequent DB commits from concurrent URL workers by decoupling network audits from persistence, reducing contention during full-site onboarding audits.
Improve observability and robustness by tracking per-run progress, preserving per-URL failure details, and producing a structured execution summary.
Ensure idempotent re-runs update existing SEOPageAudit rows without creating duplicate records.
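
The idempotency goal above can be illustrated with a minimal sketch. It is not the PR's implementation: the PR does not state which columns identify a page audit or how the upsert is issued, so the table name `seo_page_audit`, its columns, and the `(site_id, url)` unique key are all assumptions. The standard-library `sqlite3` module stands in for the project's database layer, and the `ON CONFLICT ... DO UPDATE` clause needs SQLite 3.24 or newer.

```python
import sqlite3


def upsert_page_audits(conn, records):
    """Insert audit rows, updating in place when (site_id, url) already exists."""
    conn.executemany(
        """
        INSERT INTO seo_page_audit (site_id, url, status, score)
        VALUES (:site_id, :url, :status, :score)
        ON CONFLICT (site_id, url) DO UPDATE SET
            status = excluded.status,
            score = excluded.score
        """,
        records,
    )
    conn.commit()  # one commit per batch, not one per URL


# Hypothetical schema: the unique key is an assumption, not taken from the PR.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seo_page_audit ("
    "site_id INTEGER, url TEXT, status TEXT, score REAL, "
    "UNIQUE (site_id, url))"
)

first_run = [
    {"site_id": 1, "url": "https://example.com/", "status": "ok", "score": 80.0},
    {"site_id": 1, "url": "https://example.com/a", "status": "failed", "score": None},
]
upsert_page_audits(conn, first_run)

# A re-run over the same URLs updates the existing rows instead of adding new ones.
second_run = [dict(r, status="ok", score=90.0) for r in first_run]
upsert_page_audits(conn, second_run)

row_count = conn.execute("SELECT COUNT(*) FROM seo_page_audit").fetchone()[0]
scores = [r[0] for r in conn.execute("SELECT score FROM seo_page_audit ORDER BY url")]
```

After both runs the table still holds one row per URL, each carrying the values from the latest run.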

Description

Introduced a configurable persist_batch_size (default 50) and refactored _audit_urls to collect per-page audit records in memory and flush them in batches via a new _bulk_upsert_page_audits function.
Changed _audit_single_url to stop performing DB writes and instead return structured results including audit_record and failure_reason, so concurrent tasks do not share mutable session-side writes.
Added _build_audit_record to create per-page payloads and _update_progress to persist periodic progress into task.payload and task_log.result_data; the final result now also carries aggregated failure analytics (top_fail_reasons) and idempotency metadata.
Preserved per-URL failure records and continued auditing the remaining pages; failures are persisted alongside successes during batch upserts, and task results include failure_details and execution_summary (including success_rate and duration_ms).
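
The flow described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: `audit_single_url`, `audit_urls`, and the `persist_batch` callback are hypothetical stand-ins for `_audit_single_url`, `_audit_urls`, and `_bulk_upsert_page_audits`, the network audit is simulated, and the exact shape of the result dictionary is guessed from the field names mentioned in the description. Progress reporting is left out for brevity.

```python
import asyncio
import time
from collections import Counter


async def audit_single_url(url):
    """Simulated network audit: returns a result dict and performs no DB writes."""
    await asyncio.sleep(0)
    failed = "broken" in url  # stand-in for a real fetch error
    return {
        "url": url,
        "audit_record": {"url": url, "status": "failed" if failed else "ok"},
        "failure_reason": "timeout" if failed else None,
    }


async def audit_urls(urls, persist_batch, persist_batch_size=50):
    """Audit URLs concurrently and hand records to persist_batch in batches."""
    started = time.monotonic()
    pending, failures, persisted = [], [], 0
    for next_done in asyncio.as_completed([audit_single_url(u) for u in urls]):
        result = await next_done
        # Failed pages still produce a record, so they are persisted too.
        pending.append(result["audit_record"])
        if result["failure_reason"]:
            failures.append({"url": result["url"], "reason": result["failure_reason"]})
        if len(pending) >= persist_batch_size:
            persist_batch(pending)
            persisted += len(pending)
            pending = []  # start a fresh list; the flushed batch is left untouched
    if pending:  # flush the final partial batch
        persist_batch(pending)
        persisted += len(pending)
    total = len(urls)
    return {
        "failure_details": failures,
        "top_fail_reasons": Counter(f["reason"] for f in failures).most_common(3),
        "execution_summary": {
            "total": total,
            "persisted": persisted,
            "success_rate": (total - len(failures)) / total if total else 0.0,
            "duration_ms": int((time.monotonic() - started) * 1000),
        },
    }


# Usage: collect the batches in a list instead of writing to a database.
batches = []
urls = [f"https://example.com/p{i}" for i in range(4)] + ["https://example.com/broken"]
result = asyncio.run(audit_urls(urls, batches.append, persist_batch_size=2))
```

Because only the collecting loop calls `persist_batch`, the concurrent audit coroutines never touch shared persistence state, which is the decoupling the description aims for.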

Testing

Compiled the modified module with python -m compileall backend/services/scheduler/executors/onboarding_full_website_analysis_executor.py and it succeeded.

Codex Task: https://chatgpt.com/codex/tasks/task_e_69b26c26484883289ed1e6162de9815d

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


kerem added the pull-request label 2026-03-13 21:06:25 +03:00
