[PR #1242] Fix HTML title parsing bugs. #2873

Closed

opened 2026-03-01 18:01:00 +03:00 by kerem · 0 comments

kerem commented

2026-03-01 18:01:00 +03:00

Owner

Original Pull Request: https://github.com/ArchiveBox/ArchiveBox/pull/1242

State: closed
Merged: Yes

Summary

This slightly modifies the HTML_TITLE_REGEX and fixes two parsing errors. The first occurred when the title tag was empty (e.g. <title></title>), in which case the title was parsed as </title. The second occurred when the title was a single character (e.g. <title>A</title>): the regex did not match it, so the title fell back to link.base_url.

With this change, an empty title tag falls back to link.base_url, and single-character titles are parsed correctly.
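The two failure modes and the fixed behavior can be illustrated with a minimal sketch. The patterns `BEFORE` and `AFTER`, the helper `extract_title`, and the fallback string are all hypothetical stand-ins chosen to reproduce the symptoms described above; they are not quoted from the ArchiveBox source, and the real HTML_TITLE_REGEX may differ.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

# Hypothetical "before" pattern (an assumption, not the real regex):
# the leading `.` may itself match `<`, and `.` followed by `[^<>]+`
# requires at least two captured characters.
BEFORE = re.compile(r'<title.*?>(.[^<>]+)', FLAGS)

# Hypothetical "after" pattern: capture one or more characters that
# are neither `<` nor `>`.
AFTER = re.compile(r'<title.*?>([^<>]+)', FLAGS)


def extract_title(html, pattern, fallback):
    """Return the captured title, or `fallback` when nothing usable matches."""
    match = pattern.search(html)
    if match is None:
        return fallback
    # Treat a whitespace-only capture the same as a missing title.
    return match.group(1).strip() or fallback


# Stand-in for link.base_url; the value is purely illustrative.
base_url = 'example.com/page'

for html in ('<title></title>', '<title>A</title>'):
    print(repr(html),
          'before:', repr(extract_title(html, BEFORE, base_url)),
          'after:', repr(extract_title(html, AFTER, base_url)))
```

Under these assumptions, the "before" pattern captures `</title` for the empty tag and fails to match the single-character title, while the "after" pattern falls back for the empty tag and captures `A`. An unescaped `<` inside a title would still truncate the capture with either pattern, which is one example of malformed HTML that a regex of this shape does not handle.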

I tested the new regex against the edge cases I could think of and found that some malformed HTML still leads to undesired behavior. A more robust regex could probably be used in the future.

Related issues

#1222

Changes these areas

- [x] Bugfixes
- [ ] Feature behavior
- [ ] Command line interface
- [ ] Configuration options
- [ ] Internal architecture
- [ ] Snapshot data layout on disk

kerem closed this issue and added the pull-request label on 2026-03-01 18:01:00 +03:00


Reference

starred/ArchiveBox#2873
