[PR #1242] [MERGED] Fix HTML title parsing bugs. #1360

Closed

opened 2026-03-01 14:49:28 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1242
Author: @benmuth
Created: 10/9/2023
Status: ✅ Merged
Merged: 10/10/2023
Merged by: @pirate

Base: dev ← Head: fix-titles-with-empty-tag

📝 Commits (1)

77917e9 Fix HTML title parsing bugs.

📊 Changes

1 file changed (+1 additions, -1 deletions)

View changed files

📝 archivebox/extractors/title.py (+1 -1)

📄 Description

Summary

This slightly modifies HTML_TITLE_REGEX to fix two parsing errors. The first occurred when the title tag was empty (e.g. <title></title>), which was parsed as </title. The second occurred when the title was a single character (e.g. <title>A</title>), which the regex did not match, so the title fell back to link.base_url.

With this change, an empty title tag falls back to link.base_url, and single-character titles are parsed correctly.
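The two cases above can be reproduced with a minimal sketch. The patterns below are assumptions reconstructed from the behavior described here, not copied from archivebox/extractors/title.py, and the names OLD_TITLE_REGEX, NEW_TITLE_REGEX, extract_title, and base_url are hypothetical stand-ins used only for illustration.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL | re.UNICODE

# Assumed reconstruction of the pattern before the fix: the leading "." can
# match "<", and ".[^<>]+" needs at least two characters to succeed.
OLD_TITLE_REGEX = re.compile(r'<title.*?>(.[^<>]+)', FLAGS)

# Assumed reconstruction of the pattern after the fix: one or more characters
# that are not angle brackets.
NEW_TITLE_REGEX = re.compile(r'<title.*?>([^<>]+)', FLAGS)


def extract_title(html, pattern, fallback):
    """Return the captured title, or `fallback` when the pattern finds nothing."""
    match = pattern.search(html)
    return match.group(1) if match else fallback


base_url = "example.com/page"  # hypothetical stand-in for link.base_url

for html in ("<title></title>", "<title>A</title>"):
    print(html,
          "| old:", extract_title(html, OLD_TITLE_REGEX, base_url),
          "| new:", extract_title(html, NEW_TITLE_REGEX, base_url))
```

Under these assumed patterns, the old regex captures </title for the empty tag and finds no match for the single-character title, while the new regex falls back for the empty tag and captures A.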

I tested the new regex with the edge cases I could think of and found that some malformed HTML will still lead to undesired behavior. A more robust regex could probably be used in the future.
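As one possible illustration of a remaining edge case, the sketch below runs an assumed form of the fixed pattern against a title containing a bare "<". Both the pattern and the input are assumptions chosen for this example; the PR does not say which inputs were tested.

```python
import re

# Assumed form of the fixed pattern (not copied from title.py).
TITLE_REGEX = re.compile(r'<title.*?>([^<>]+)', re.IGNORECASE | re.DOTALL)

html = "<title>a < b</title>"  # hypothetical input with a bare "<" in the title
match = TITLE_REGEX.search(html)
captured = match.group(1) if match else None
print(repr(captured))
```

With this assumed pattern the capture stops at the first angle bracket, so only the text before the "<" is kept and the title is truncated.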

Related issues

#1222

Changes these areas

- [x] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.