[PR #493] [MERGED] Feature: add og:title metadata as alternative title #2700

Closed
opened 2026-03-01 18:00:28 +03:00 by kerem · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/493
Author: @ttimasdf
Created: 9/27/2020
Status: Merged
Merged: 10/30/2020
Merged by: @pirate

Base: master ← Head: feat-ogtitle


📝 Commits (1)

  • eda3836 feat: add og:title metadata as alternative title

📊 Changes

1 file changed (+30 additions, -7 deletions)


📝 archivebox/extractors/title.py (+30 -7)

📄 Description

Summary

Open Graph tags (https://www.redclayinteractive.com/what-are-open-graph-tags/) are a mechanism for a website to provide metadata when a URL is shared on social media. I encountered some rare cases where the page's <title> tag is set dynamically through JavaScript at page load, but og:title metadata is present.

This PR uses Open Graph tags as a fallback for grabbing the page title (and has the potential to gather other metadata). It extends the title extractor with HTMLParser without breaking any existing behaviour. The <title> tag content still has the highest priority.
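As a rough illustration of the fallback described above, here is a minimal sketch built on the standard library's `html.parser.HTMLParser`. It is not the code from `archivebox/extractors/title.py`; the names `TitleParser` and `extract_title` are hypothetical, and the only behaviour taken from the description is that `<title>` content wins and `og:title` is used when it is missing.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects <title> text and og:title (hypothetical name, not the PR's code)."""

    def __init__(self):
        super().__init__()
        self.title_tag = ""
        self.og_title = ""
        self._inside_title = False

    @property
    def title(self):
        # <title> content has the highest priority; og:title is only a fallback.
        return self.title_tag.strip() or self.og_title.strip() or None

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also arrive here, because the
        # default handle_startendtag() calls handle_starttag().
        if tag == "title" and not self.title_tag:
            self._inside_title = True
        elif tag == "meta" and not self.og_title:
            attr_map = dict(attrs)  # attribute order does not matter
            if attr_map.get("property") == "og:title":
                self.og_title = attr_map.get("content") or ""

    def handle_data(self, data):
        if self._inside_title:
            self.title_tag += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._inside_title = False


def extract_title(html):
    """Return the <title> text, else og:title, else None."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title
```

With this sketch, a page whose `<title>` is empty in the static HTML but which carries an `og:title` meta tag still yields a title, while pages with a populated `<title>` behave as before.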

I did not extend the existing regular expression because HTML is complicated and contains many special cases that a regex cannot handle easily. In the examples below, the two attributes in <meta> appear in a different order, and only the second one uses the self-closing <xxx /> form. I didn't even try to parse it with a regex. 💫

Examples:

link2: Blogspot (https://icatch99.blogspot.com/2019/11/icatch-dvr.html)

<meta content="分析台湾可取icatch DVR固件" property="og:title">

link1: WeChat Platform (https://mp.weixin.qq.com/s/KErkNbtnlpaVZRvJd4J-pQ)

<meta property="og:title" content="Cocos2dx-js 逆向分析乘凉" />
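To make the attribute-order point concrete, the sketch below runs the two tags above through a parser-based lookup and through a deliberately naive regex. The pattern `NAIVE_OG_TITLE` is a hypothetical example of a regex that assumes `property` comes before `content`; it is not the regex used in ArchiveBox, and the function names are illustrative only.

```python
import re
from html.parser import HTMLParser

# Hypothetical naive pattern: assumes property="..." precedes content="...".
NAIVE_OG_TITLE = re.compile(r'<meta\s+property="og:title"\s+content="([^"]*)"')


class OgTitleParser(HTMLParser):
    """Records the content of the first og:title meta tag it sees."""

    def __init__(self):
        super().__init__()
        self.og_title = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)  # order-independent attribute lookup
        if (
            tag == "meta"
            and self.og_title is None
            and attr_map.get("property") == "og:title"
        ):
            self.og_title = attr_map.get("content")


def og_title_via_parser(html):
    parser = OgTitleParser()
    parser.feed(html)
    parser.close()
    return parser.og_title


def og_title_via_regex(html):
    match = NAIVE_OG_TITLE.search(html)
    return match.group(1) if match else None


# The two tags quoted in the examples above.
BLOGSPOT = '<meta content="分析台湾可取icatch DVR固件" property="og:title">'
WECHAT = '<meta property="og:title" content="Cocos2dx-js 逆向分析乘凉" />'

for snippet in (BLOGSPOT, WECHAT):
    print(og_title_via_parser(snippet), "|", og_title_via_regex(snippet))
```

The parser returns the title for both tags, while the naive pattern only matches the WeChat tag, where `property` happens to come first. A more elaborate regex could cover both orders, but each further variation (single quotes, extra attributes, different whitespace) would need its own special case.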

Changes these areas

  • [ ] Bugfixes
  • [x] Feature behavior
  • [ ] Command line interface
  • [ ] Configuration options
  • [ ] Internal architecture
  • [ ] Archived data layout on disk

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
