[PR #492] [CLOSED] Feature: Pre-download images inside HTML content extracted by mercury parser #2699

Closed

opened 2026-03-01 18:00:28 +03:00 by kerem · 0 comments

kerem (Owner) commented 2026-03-01 18:00:28 +03:00

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/492
Author: @ttimasdf
Created: 9/27/2020
Status: ❌ Closed

Base: master ← Head: with-mercury-fulltext

📝 Commits (9)

ff880c2 feat: extract content from singlepage (WIP)
dc6d5c7 feat: Inline images while extracting
4a59ef1 fix: solved encoding issue and other common issues
c448b46 fix: output HTML encoding, wrong html input
ef36aec remove additional field
0f35a67 Revert "feat: Inline images while extracting"
87abb40 make mercury-extractor async everywhere
f102352 fix: remove unnecessary import
9926109 fix: add missing imports

📊 Changes

3 files changed (+145 additions, -11 deletions)

View changed files

📝 Dockerfile (+1 -1)
📝 archivebox/extractors/mercury.py (+35 -10)
➕ bin/mercury-extractor (+109 -0)

📄 Description

Summary

This PR adds a feature that makes the mercury parser behave somewhat like readability, but implemented with a different approach. It inlines all images as base64-encoded data: URIs, making the output fully (or partially?) self-contained.

At the moment, readability extracts content from the ./singlepage.html variant of the archived content. This can be less accurate in some special cases (e.g. the article I mentioned in #478, https://bbs.pediy.com/thread-229215.htm), because the HTML differs from the original and the inlined styles/media may become noise. This parser instead extracts content from ./output.html, which is exactly what the browser saw, yielding more accurate results. Personally, I think it could be a supplement to the readability parser, not a replacement.

However, some of the media it contains (mostly images, which are what I care about) still link to the online website/CDN, where they may expire later or be blocked by hotlink protection. This PR adds a wrapper script for HTML post-processing: it finds the images, downloads them from their original URLs, and puts them back into the HTML output in base64-encoded form. It also adds a <meta charset="UTF-8"> at the beginning of the content to prevent garbling of non-ASCII content in some browsers (at least Edge Chromium cannot correctly display either the readability or the mercury parser output).
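
The post-processing step described above can be sketched roughly as follows. This is a minimal Python illustration, not the code from bin/mercury-extractor: the helper names (fetch_image, to_data_uri, inline_images, ensure_utf8_meta) are hypothetical, and a regular expression stands in for a real HTML parser to keep the sketch short. Images that fail to download keep their original URL.

```python
import base64
import re
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin
from urllib.request import Request, urlopen

# All names below are hypothetical; they do not come from the PR.
Fetcher = Callable[[str], Optional[Tuple[bytes, str]]]

# Simplification: a regex is used instead of a real HTML parser.
IMG_SRC = re.compile(r'(<img\b[^>]*?\ssrc=)(["\'])(.*?)\2', re.IGNORECASE)


def fetch_image(url: str, timeout: float = 10.0) -> Optional[Tuple[bytes, str]]:
    """Download one image; return (bytes, MIME type) or None on failure."""
    try:
        req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(req, timeout=timeout) as resp:
            mime = resp.headers.get_content_type()
            return resp.read(), mime
    except (OSError, ValueError):
        return None


def to_data_uri(data: bytes, mime: str) -> str:
    """Encode raw bytes as a data: URI of the form data:<mime>;base64,<payload>."""
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"


def inline_images(html: str, base_url: str, fetch: Fetcher = fetch_image) -> str:
    """Replace each <img src> with a data: URI when the download succeeds."""
    def replace(match: "re.Match[str]") -> str:
        prefix, quote, src = match.groups()
        if src.startswith("data:"):
            return match.group(0)  # already inlined
        result = fetch(urljoin(base_url, src))
        if result is None:
            return match.group(0)  # keep the original link on failure
        data, mime = result
        return f"{prefix}{quote}{to_data_uri(data, mime)}{quote}"

    return IMG_SRC.sub(replace, html)


def ensure_utf8_meta(html: str) -> str:
    """Prepend <meta charset="UTF-8"> unless a charset is already declared."""
    if re.search(r"<meta[^>]+charset", html, re.IGNORECASE):
        return html
    return '<meta charset="UTF-8">' + html
```

A caller would run something like ensure_utf8_meta(inline_images(content_html, page_url)) on the extracted content before writing it to disk. Passing the fetcher as a parameter is a design choice of this sketch so the inlining logic can be exercised without network access.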

This feature was tested manually against this article. At least it works for me 😄 https://worldwonderer.github.io/%E9%80%86%E5%90%91%E6%9F%90%E9%9F%B3%E7%9F%AD%E8%A7%86%E9%A2%91App%E4%B9%8B%E8%AE%BE%E5%A4%87%E6%BF%80%E6%B4%BB/

Changes these areas

[ ] Bugfixes
[x] Feature behavior
[ ] Command line interface
[x] Configuration options
[ ] Internal architecture
[ ] Archived data layout on disk

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


kerem closed this issue and added the pull-request label 2026-03-01 18:00:28 +03:00
