[PR #478] [MERGED] Feature: Add @postlight/mercury-parser as an alternative to Readability #1181

Closed
opened 2026-03-01 14:48:45 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/478
Author: @ttimasdf
Created: 9/22/2020
Status: Merged
Merged: 9/22/2020
Merged by: @cdvv7788

Base: master ← Head: with-mercury


📝 Commits (7)

  • e521a2e feat: Add mercury-parser
  • 7167bb6 add environmental variable for mercury-parser
  • 0a79927 add mercury-parser npm dependency
  • 75e78b6 tests: add test for mercury-parser
  • 1c052b3 fix: add mercury-parser to extractors list
  • ed84730 add MERCURY_BINARY to CI env
  • 59a29a4 feat: Add mercury-parsed content to summary page

📊 Changes

12 files changed (+804 additions, -6 deletions)

View changed files

📝 .github/workflows/test.yml (+1 -0)
📝 Dockerfile (+3 -1)
📝 archivebox/config/__init__.py (+14 -0)
📝 archivebox/config/stubs.py (+3 -0)
📝 archivebox/extractors/__init__.py (+2 -0)
➕ archivebox/extractors/mercury.py (+95 -0)
📝 archivebox/index/schema.py (+2 -0)
📝 archivebox/themes/legacy/link_details.html (+13 -0)
📝 package-lock.json (+661 -5)
📝 package.json (+1 -0)
📝 tests/fixtures.py (+1 -0)
📝 tests/test_extractors.py (+8 -0)

📄 Description

Summary

This PR adds mercury-parser (https://github.com/postlight/mercury-parser) as a backend engine for extracting meaningful content from HTML, as an alternative to Readability.

"Mercury Parser powers the Mercury AMP Converter and Mercury Reader, a Chrome extension that removes ads and distractions, leaving only text and images for a beautiful reading view on any site."

Mercury Parser is also used by NewsBlur (https://github.com/samuelclay/NewsBlur), my favorite RSS reader 😄
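
For readers who want a feel for what such a backend involves, here is a minimal sketch of calling the mercury-parser command-line tool from Python. It is not the code added in archivebox/extractors/mercury.py. The function names, the "mercury-parser" fallback binary name, and the 60-second timeout are assumptions made for illustration. It also assumes the CLI takes the URL as its first argument and prints a JSON document to stdout. Only the MERCURY_BINARY variable name comes from this PR.

```python
import json
import os
import subprocess


def build_mercury_cmd(url, env=None):
    """Return the argv list used to invoke the parser CLI for one URL."""
    env = os.environ if env is None else env
    # MERCURY_BINARY is the variable named in this PR's commits; the
    # "mercury-parser" fallback is an assumption made for this sketch.
    binary = env.get("MERCURY_BINARY", "mercury-parser")
    return [binary, url]


def extract_article(url, timeout=60, run=subprocess.run, env=None):
    """Run the CLI and decode the JSON document it prints to stdout.

    `run` is injectable so the function can be exercised without the
    real binary installed (hypothetical design choice for this sketch).
    """
    result = run(
        build_mercury_cmd(url, env),
        capture_output=True,  # collect stdout/stderr instead of inheriting
        text=True,            # decode output as str rather than bytes
        timeout=timeout,      # do not hang forever on a slow page
        check=True,           # raise CalledProcessError on non-zero exit
    )
    return json.loads(result.stdout)
```

With the CLI installed, `extract_article("https://bbs.pediy.com/thread-229215.htm")` would return a dict. Which keys it contains depends on the parser, so read them defensively, for example with `.get("content")`.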

Changes these areas

  • [ ] Bugfixes
  • [x] Feature behavior
  • [ ] Command line interface
  • [x] Configuration options
  • [ ] Internal architecture
  • [x] Archived data layout on disk

Motivation

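I'm trying to archive this article: https://bbs.pediy.com/thread-229215.htm

Readability could only extract the last code block inside this article.

[image: https://user-images.githubusercontent.com/2762704/93863908-b3fbcf00-fcf6-11ea-945b-73d0f493e6c2.png]

[image: https://user-images.githubusercontent.com/2762704/93863973-cd9d1680-fcf6-11ea-8c72-5ee98b46a975.png]

Mercury Parser is able to extract the whole content.

[image: https://user-images.githubusercontent.com/2762704/93864049-e9082180-fcf6-11ea-8a04-2b472b79d19b.png]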


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
