[PR #1678] [CLOSED] Adding MAX_URL_ATTEMPTS to stop retrying failed URLs #4473

Closed
opened 2026-03-15 01:46:36 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1678
Author: @warenhaus
Created: 4/24/2025
Status: Closed

Base: dev ← Head: dev


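📝 Commits (8)

  • 20c86a7 Update common.py
  • 818aea4 Create 0075_add_max_url_retries.py
  • 2eadf3c Update models.py
  • b726198 Update ArchiveBox.conf.default
  • b647b12 Update __init__.py
  • d00bffb Update __init__.py
  • e9a8bbf Update common.py
  • 912eba6 Update ArchiveBox.conf.default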

📊 Changes

5 files changed (+115 additions, -80 deletions)

📝 archivebox/config/common.py (+2 -0)
➕ archivebox/core/migrations/0075_add_max_url_retries.py (+16 -0)
📝 archivebox/core/models.py (+1 -0)
📝 archivebox/extractors/__init__.py (+95 -80)
📝 etc/ArchiveBox.conf.default (+1 -0)

📄 Description

Summary

This PR adds a MAX_URL_ATTEMPTS configuration option (default 0, meaning unlimited) to stop ArchiveBox from retrying failed URLs indefinitely, together with a retry_count column in the database that tracks the number of attempts. Once a snapshot has been retried MAX_URL_ATTEMPTS times, it is skipped on further updates.
With MAX_URL_ATTEMPTS = 0, retries stay unlimited, which is ArchiveBox's behaviour before this change.
The change does not take any date information into account, as was suggested here (https://github.com/ArchiveBox/ArchiveBox/issues/109#issuecomment-439201532).

The logic change itself is only a couple of lines; the diff looks larger because of changed indentation.
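
The skip rule described above can be sketched roughly as follows. This is not the code from the PR: the Snapshot dataclass and the should_skip and update helpers are hypothetical stand-ins, and only the names MAX_URL_ATTEMPTS and retry_count come from the description. The sketch also assumes that the counter is incremented on failed attempts only and that negative values are treated like 0.

```python
from dataclasses import dataclass

# Hypothetical value for illustration; the PR's default is 0 (unlimited).
MAX_URL_ATTEMPTS = 3


@dataclass
class Snapshot:
    """Stand-in for a snapshot row; only retry_count is named in the PR."""
    url: str
    retry_count: int = 0


def should_skip(snapshot: Snapshot, max_attempts: int = MAX_URL_ATTEMPTS) -> bool:
    """Return True once the snapshot has used up its allowed attempts."""
    if max_attempts <= 0:
        # 0 keeps the behaviour from before the change: never skip.
        # Treating negative values the same way is an assumption.
        return False
    return snapshot.retry_count >= max_attempts


def update(snapshot: Snapshot, archive_ok: bool,
           max_attempts: int = MAX_URL_ATTEMPTS) -> str:
    """Run one update pass over a snapshot and report what happened."""
    if should_skip(snapshot, max_attempts):
        return "skipped"
    if archive_ok:
        return "succeeded"
    # Assumption: only failed attempts count towards the limit.
    snapshot.retry_count += 1
    return "failed"
```

With a limit of 2, a URL that keeps failing is attempted twice and skipped afterwards, while a limit of 0 never skips, whatever the stored retry_count is.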

Related issues

#109

Changes these areas

  • Feature behavior
  • Configuration options
  • Database

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
