mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[PR #1242] [MERGED] Fix HTML title parsing bugs. #1360
📋 Pull Request Information
Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1242
Author: @benmuth
Created: 10/9/2023
Status: ✅ Merged
Merged: 10/10/2023
Merged by: @pirate
Base: dev ← Head: fix-titles-with-empty-tag
📝 Commits (1)
77917e9 Fix HTML title parsing bugs.
📊 Changes
1 file changed (+1 additions, -1 deletions)
archivebox/extractors/title.py (+1 -1)
📄 Description
Summary
This slightly modifies the HTML_TITLE_REGEX and fixes two parsing errors. The first occurred when title tags were empty (e.g. <title></title>), which was parsed as </title. The second occurred when titles were a single character (e.g. <title>A</title>), which the regex did not match, so the title would fall back to link.base_url.
With this change, when tags are empty the title falls back to link.base_url, and single-character titles are parsed correctly.
I tested the new regex with the edge cases I could think of and found that some malformed HTML will still lead to undesired behavior. A more robust regex could probably be used in the future.
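The two edge cases described above can be reproduced with a short sketch. The patterns below are assumptions chosen to show the same symptoms; they are not copied from archivebox/extractors/title.py, and the names OLD_TITLE_REGEX, NEW_TITLE_REGEX and extract_title are hypothetical. The "before" pattern requires one arbitrary character followed by at least one non-bracket character, which is one plausible way to get both bugs; the "after" pattern drops the leading wildcard.

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE | re.DOTALL | re.UNICODE

# Hypothetical "before" pattern: the leading "." can swallow a "<",
# and the capture group needs at least two characters to match.
OLD_TITLE_REGEX = re.compile(r'<title.*?>(.[^<>]+)', FLAGS)

# Hypothetical "after" pattern: the capture group accepts one or more
# characters, none of which may be an angle bracket.
NEW_TITLE_REGEX = re.compile(r'<title.*?>([^<>]+)', FLAGS)


def extract_title(html, pattern, fallback):
    """Return the captured title text, or the fallback when nothing matches."""
    match = pattern.search(html)
    if match is None:
        return fallback
    return match.group(1).strip()


# Stand-in for link.base_url in this sketch.
base_url = "example.com"

for html in ("<title></title>", "<title>A</title>", "<title>Hello</title>"):
    old = extract_title(html, OLD_TITLE_REGEX, base_url)
    new = extract_title(html, NEW_TITLE_REGEX, base_url)
    print(f"{html!r}: old={old!r} new={new!r}")
```

Under these assumed patterns, the empty tag yields </title with the old regex and the fallback with the new one, while the single-character title yields the fallback with the old regex and A with the new one. Input such as a title containing a literal < would still be cut short, which is in line with the note above about malformed HTML.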
Related issues
#1222
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.