[PR #1751] [MERGED] Clean up on_Crawl hooks and remove dead code #1500

Closed
opened 2026-03-01 14:50:03 +03:00 by kerem · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1751
Author: @pirate
Created: 12/31/2025
Status: Merged
Merged: 12/31/2025
Merged by: @pirate

Base: dev ← Head: claude/cleanup-on-crawl-hooks-TtLF6


📝 Commits (1)

  • 4c77949 Clean up on_Crawl hooks: remove duplicates and standardize naming

📊 Changes

21 files changed (+109 additions, -1729 deletions)

View changed files

archivebox/plugins/captcha2/config.json (+0 -21)
archivebox/plugins/captcha2/on_Crawl__01_captcha2.js (+0 -121)
archivebox/plugins/captcha2/on_Crawl__11_captcha2_config.js (+0 -279)
archivebox/plugins/captcha2/templates/icon.html (+0 -0)
archivebox/plugins/captcha2/tests/test_captcha2.py (+0 -184)
archivebox/plugins/chrome/on_Crawl__00_chrome_install.py (+0 -184)
📝 archivebox/plugins/chrome/on_Crawl__01_chrome_install.py (+0 -0)
📝 archivebox/plugins/chrome/on_Crawl__10_chrome_validate.py (+0 -0)
📝 archivebox/plugins/chrome/on_Crawl__20_chrome_launch.bg.js (+109 -31)
archivebox/plugins/chrome/on_Crawl__30_chrome_launch.bg.js (+0 -323)
📝 archivebox/plugins/istilldontcareaboutcookies/on_Crawl__02_istilldontcareaboutcookies_install.js (+0 -0)
archivebox/plugins/istilldontcareaboutcookies/on_Crawl__20_install_istilldontcareaboutcookies_extension.js (+0 -59)
📝 archivebox/plugins/search_backend_ripgrep/on_Crawl__00_ripgrep_install.py (+0 -0)
📝 archivebox/plugins/singlefile/on_Crawl__04_singlefile_install.js (+0 -0)
archivebox/plugins/singlefile/on_Crawl__20_install_singlefile_extension.js (+0 -281)
📝 archivebox/plugins/twocaptcha/on_Crawl__05_twocaptcha_install.js (+0 -0)
📝 archivebox/plugins/twocaptcha/on_Crawl__25_twocaptcha_config.js (+0 -0)
archivebox/plugins/ublock/on_Crawl__03_ublock.js (+0 -116)
📝 archivebox/plugins/ublock/on_Crawl__03_ublock_install.js (+0 -0)
archivebox/plugins/wget/on_Crawl__10_wget_validate_config.py (+0 -130)

...and 1 more file

📄 Description

Deleted dead/duplicate hooks:

  • wget/on_Crawl__10_install_wget.py (duplicate of __10_wget_validate_config.py)
  • chrome/on_Crawl__00_chrome_install.py (simpler version, kept full one)
  • chrome/on_Crawl__20_chrome_launch.bg.js (legacy, kept __30 version)
  • singlefile/on_Crawl__20_install_singlefile_extension.js (disabled/dead)
  • istilldontcareaboutcookies/on_Crawl__20_install_*.js (legacy)
  • ublock/on_Crawl__03_ublock.js (legacy, kept __20 version)
  • Entire captcha2/ plugin (legacy version of twocaptcha/)

Renamed hooks to follow a consistent pattern: on_Crawl__XX_<plugin>_<action>.<ext>

Priority bands:

  • 00-09: Binary/extension installation
  • 10-19: Config validation
  • 20-29: Browser launch and post-launch config
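The naming pattern (on_Crawl__XX_<plugin>_<action>.<ext>) and the three priority bands can be checked mechanically. The following is a minimal sketch, not code from the PR: `HOOK_RE`, `BANDS`, `Hook`, and `parse_hook` are hypothetical names, and the regex assumes a two-digit priority, lowercase names, and an extension of `py` or `js` with an optional `bg.` marker, as seen in the filenames listed on this page.

```python
import re
from typing import NamedTuple

# Assumed filename shape: on_Crawl__<2-digit priority>_<plugin>_<action>.<ext>
# The extension may carry a "bg." marker, as in chrome_launch.bg.js.
HOOK_RE = re.compile(
    r"^on_Crawl__(?P<priority>\d{2})_(?P<name>[a-z0-9_]+)\.(?P<ext>(?:bg\.)?(?:py|js))$"
)

# Priority bands as listed in the PR description.
BANDS = [
    (range(0, 10), "install"),
    (range(10, 20), "validate"),
    (range(20, 30), "launch/config"),
]


class Hook(NamedTuple):
    priority: int
    name: str
    ext: str
    band: str


def parse_hook(filename: str) -> Hook:
    """Split a hook filename into its parts and look up its priority band."""
    match = HOOK_RE.match(filename)
    if match is None:
        raise ValueError(f"not a recognised on_Crawl hook name: {filename}")
    priority = int(match["priority"])
    for band_range, label in BANDS:
        if priority in band_range:
            return Hook(priority, match["name"], match["ext"], label)
    raise ValueError(f"priority {priority:02d} is outside the documented bands")


if __name__ == "__main__":
    print(parse_hook("on_Crawl__20_chrome_launch.bg.js"))
    print(parse_hook("on_Crawl__11_wget_validate.py"))
```

Under these assumptions a pre-cleanup name such as on_Crawl__30_chrome_launch.bg.js is rejected because priority 30 falls outside the 00-29 bands.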

Final hooks:

  • 00 ripgrep_install.py
  • 01 chrome_install.py
  • 02 istilldontcareaboutcookies_install.js
  • 03 ublock_install.js
  • 04 singlefile_install.js
  • 05 twocaptcha_install.js
  • 10 chrome_validate.py
  • 11 wget_validate.py
  • 20 chrome_launch.bg.js
  • 25 twocaptcha_config.js

Summary

Related issues

Changes these areas

  • [ ] Bugfixes
  • [ ] Feature behavior
  • [ ] Command line interface
  • [ ] Configuration options
  • [ ] Internal architecture
  • [ ] Snapshot data layout on disk

Summary by cubic

Cleaned up Crawl-level hooks by removing legacy/duplicate code and standardizing hook names and priorities. Chrome launch is now a single, updated hook with better extension detection and cleaner outputs.

  • Refactors

    • Removed dead hooks (legacy chrome install/launch, singlefile extension, old ublock/cookies scripts, duplicate wget validate) and the legacy captcha2 plugin in favor of twocaptcha.
    • Renamed hooks to on_Crawl__XX_<plugin>_<action> with priority bands: 00-09 install, 10-19 validate, 20-29 launch/config.
    • Consolidated Chrome launch into on_Crawl__20_chrome_launch.bg.js; writes outputs to the current dir, resolves real extension IDs via chrome://extensions, and records extensions.json after verification.
  • Migration

    • If you used captcha2, switch to the twocaptcha hooks (on_Crawl__05_twocaptcha_install.js and on_Crawl__25_twocaptcha_config.js).
    • Update any docs/scripts that reference old hook filenames.
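One way to act on the second migration item is to search local docs and scripts for the hook filenames this PR deleted. The sketch below is an illustration only: `REMOVED_HOOKS` is copied from the changed-files list and description on this page, while `find_stale_references`, `scan_tree`, and the default file suffixes are hypothetical choices rather than anything shipped with ArchiveBox.

```python
from pathlib import Path

# Hook filenames deleted by this PR, taken from the changed-files list and
# the PR description on this page.
REMOVED_HOOKS = [
    "on_Crawl__01_captcha2.js",
    "on_Crawl__11_captcha2_config.js",
    "on_Crawl__00_chrome_install.py",
    "on_Crawl__30_chrome_launch.bg.js",
    "on_Crawl__20_install_istilldontcareaboutcookies_extension.js",
    "on_Crawl__20_install_singlefile_extension.js",
    "on_Crawl__03_ublock.js",
    "on_Crawl__10_wget_validate_config.py",
    "on_Crawl__10_install_wget.py",
]


def find_stale_references(text):
    """Return (line_number, hook_name) pairs for each removed hook named in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name in REMOVED_HOOKS:
            if name in line:
                hits.append((lineno, name))
    return hits


def scan_tree(root, suffixes=(".md", ".sh", ".py")):
    """Yield (path, line_number, hook_name) for matching files under root.

    The suffix filter is an assumption; widen it to suit your own tree.
    """
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            for lineno, name in find_stale_references(text):
                yield path, lineno, name


if __name__ == "__main__":
    for path, lineno, name in scan_tree("."):
        print(f"{path}:{lineno}: references removed hook {name}")
```

Plain substring matching is enough here because none of the removed names is a substring of a surviving hook name (for example, on_Crawl__03_ublock.js does not occur inside on_Crawl__03_ublock_install.js).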

Written for commit 4c77949197. Summary will update on new commits.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
