[PR #492] [CLOSED] Feature: Pre-download images inside HTML content extracted by mercury parser #2699

Closed
opened 2026-03-01 18:00:28 +03:00 by kerem · 0 comments
Owner

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/492
Author: @ttimasdf
Created: 9/27/2020
Status: Closed

Base: master ← Head: with-mercury-fulltext


📝 Commits (9)

  • ff880c2 feat: extract content from singlepage (WIP)
  • dc6d5c7 feat: Inline images while extracting
  • 4a59ef1 fix: solved encoding issue and other common issues
  • c448b46 fix: output HTML encoding, wrong html input
  • ef36aec remove additional field
  • 0f35a67 Revert "feat: Inline images while extracting"
  • 87abb40 make mercury-extractor async everywhere
  • f102352 fix: remove unnecessary import
  • 9926109 fix: add missing imports

📊 Changes

3 files changed (+145 additions, -11 deletions)


📝 Dockerfile (+1 -1)
📝 archivebox/extractors/mercury.py (+35 -10)
➕ bin/mercury-extractor (+109 -0)

📄 Description

Summary

This PR adds a feature that makes the mercury parser somewhat like readability, but implemented with a different approach. It inlines all images in data:<mime>;base64,**** form, making the output fully (or partially?) self-contained.

At the moment, readability extracts content from the ./singlepage.html variant of the archived content. This can be less accurate in some special cases (e.g. the article I mentioned in #478, https://bbs.pediy.com/thread-229215.htm) because the HTML differs from the original, and the inlined styles/media can become noise. This parser instead extracts content from ./output.html, which is exactly what the browser sees, yielding more accurate results. Personally I think it could be a supplement to the readability parser, not a replacement.

However, some of the embedded media (mostly images, which are what I care about) still link to online websites/CDNs that may expire later or be blocked by hotlink protection. This PR adds a wrapper script for HTML post-processing: it finds the images, downloads them from their original URLs, and puts them back into the HTML output in base64-encoded form. It also adds a <meta charset="UTF-8"> at the beginning of the content to prevent garbling of non-ASCII content in some browsers (at least Edge Chromium cannot correctly display either the readability or the mercury parser output).
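
The post-processing step described above can be sketched roughly as follows. This is a minimal Python illustration, not the code in bin/mercury-extractor: the names `inline_images`, `ensure_charset`, `to_data_uri`, and the `fetch` callback are hypothetical, the download is injected as a function so the sketch makes no network assumptions, and a regex stands in for a real HTML parser.

```python
import base64
import re
from typing import Callable, Optional, Tuple

# Hypothetical callback type: given an image URL, return (mime_type, raw_bytes),
# or None when the download fails (expired link, hotlink protection, ...).
Fetcher = Callable[[str], Optional[Tuple[str, bytes]]]

# Simplistic matcher for the src attribute of <img> tags. A real
# implementation would more likely walk the DOM with an HTML parser.
IMG_SRC_RE = re.compile(r'(<img\b[^>]*?\ssrc=)(["\'])(.*?)\2',
                        re.IGNORECASE | re.DOTALL)


def to_data_uri(mime: str, payload: bytes) -> str:
    """Encode raw bytes as a data URI of the form data:<mime>;base64,<payload>."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def inline_images(html: str, fetch: Fetcher) -> str:
    """Replace each remote <img src> with a base64 data URI when it can be fetched."""
    def replace(match: "re.Match[str]") -> str:
        prefix, quote, src = match.groups()
        if src.startswith("data:"):
            return match.group(0)  # already inlined
        result = fetch(src)
        if result is None:
            return match.group(0)  # keep the original link on failure
        mime, payload = result
        return f"{prefix}{quote}{to_data_uri(mime, payload)}{quote}"

    return IMG_SRC_RE.sub(replace, html)


def ensure_charset(html: str) -> str:
    """Prepend <meta charset="UTF-8"> unless a charset declaration already exists."""
    if re.search(r"<meta[^>]+charset", html, re.IGNORECASE):
        return html
    return '<meta charset="UTF-8">\n' + html
```

With a stub fetcher that returns `("image/png", b"abc")` for URLs ending in `.png` and `None` otherwise, `ensure_charset(inline_images(html, stub))` rewrites only the PNG `src` attributes and leaves every other image link untouched, so a failed download does not break the page.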

This feature was tested manually against this article; at least it works for me 😄 https://worldwonderer.github.io/%E9%80%86%E5%90%91%E6%9F%90%E9%9F%B3%E7%9F%AD%E8%A7%86%E9%A2%91App%E4%B9%8B%E8%AE%BE%E5%A4%87%E6%BF%80%E6%B4%BB/

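image: https://user-images.githubusercontent.com/2762704/94355052-38fa3600-00b3-11eb-9bc1-0cd2d17e0cd9.png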

Changes these areas

  • [ ] Bugfixes
  • [x] Feature behavior
  • [ ] Command line interface
  • [x] Configuration options
  • [ ] Internal architecture
  • [ ] Archived data layout on disk

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.

Reference
starred/ArchiveBox#2699