mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-26 01:26:00 +03:00
[PR #492] [CLOSED] Feature: Pre-download images inside HTML content extracted by mercury parser #2699
📋 Pull Request Information
Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/492
Author: @ttimasdf
Created: 9/27/2020
Status: ❌ Closed
Base: master ← Head: with-mercury-fulltext
📝 Commits (9)
ff880c2 feat: extract content from singlepage (WIP)
dc6d5c7 feat: Inline images while extracting
4a59ef1 fix: solved encoding issue and other common issues
c448b46 fix: output HTML encoding, wrong html input
ef36aec remove additional field
0f35a67 Revert "feat: Inline images while extracting"
87abb40 make mercury-extractor async everywhere
f102352 fix: remove unnecessary import
9926109 fix: add missing imports
📊 Changes
3 files changed (+145 additions, -11 deletions)
📝 Dockerfile (+1 -1)
📝 archivebox/extractors/mercury.py (+35 -10)
➕ bin/mercury-extractor (+109 -0)
📄 Description
Summary
This PR adds a feature that makes the mercury parser behave somewhat like readability, but with a different approach. It inlines all images in data:base64,**** form (base64-encoded data URIs), making the output fully (or partially?) self-contained.

At the moment, readability extracts content from the ./singlepage.html variant of the archived content. This can be less accurate in some special cases (e.g. the article I mentioned in #478, https://bbs.pediy.com/thread-229215.htm) because that HTML differs from the original, and the inlined styles/media can become noise. This parser instead extracts content from ./output.html, which is exactly what the browser saw, yielding more accurate results. Personally I think it could be a supplement to the readability parser, not a replacement.

However, some of the contained media (mostly images, which are what I care about) still link to an online website/CDN that may expire later or be blocked by hotlink protection. This PR adds a wrapper script for HTML post-processing: it finds the images, fetches each one from its original URL, and puts it back into the HTML output in base64-encoded form. It also adds a <meta charset="UTF-8"> at the beginning of the content to prevent garbling of non-ASCII content in some browsers (Chromium-based Edge, at least, displays neither the readability nor the mercury parser output correctly).

This feature was tested manually against the following article. At least it works for me 😄 https://worldwonderer.github.io/%E9%80%86%E5%90%91%E6%9F%90%E9%9F%B3%E7%9F%AD%E8%A7%86%E9%A2%91App%E4%B9%8B%E8%AE%BE%E5%A4%87%E6%BF%80%E6%B4%BB/
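For readers who want a feel for the post-processing described above (fetch each image, re-embed it as a base64 data URI, prepend a charset meta tag), here is a minimal Python sketch. It is not the code in bin/mercury-extractor; the names fetch_image, to_data_uri, inline_images and ensure_utf8_meta are hypothetical, and the regex-based matching of img tags is a simplifying assumption that a real HTML parser would handle more robustly.

```python
import base64
import re
import urllib.request
from typing import Callable, Tuple

# All names below are illustrative assumptions, not taken from the PR.
Fetcher = Callable[[str], Tuple[bytes, str]]

# Matches <img ... src="http(s)://..."> with single or double quotes.
IMG_SRC_RE = re.compile(
    r'(<img\b[^>]*?\ssrc=)(["\'])(https?://[^"\']+)\2',
    re.IGNORECASE,
)


def fetch_image(url: str, timeout: float = 10.0) -> Tuple[bytes, str]:
    """Download an image and return (raw bytes, MIME type)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        mime = resp.headers.get_content_type()
        return resp.read(), mime


def to_data_uri(data: bytes, mime: str) -> str:
    """Encode raw bytes as a data:<mime>;base64,<payload> URI."""
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{payload}"


def inline_images(html: str, fetch: Fetcher = fetch_image) -> str:
    """Replace remote <img src> URLs with base64 data URIs."""

    def replace(match: "re.Match[str]") -> str:
        prefix, quote, url = match.group(1), match.group(2), match.group(3)
        try:
            data, mime = fetch(url)
        except Exception:
            # Keep the original link if the download fails
            # (expired CDN, hotlink protection, timeout, ...).
            return match.group(0)
        return f"{prefix}{quote}{to_data_uri(data, mime)}{quote}"

    return IMG_SRC_RE.sub(replace, html)


def ensure_utf8_meta(html: str) -> str:
    """Prepend <meta charset="UTF-8"> unless a charset is already declared."""
    if re.search(r"<meta[^>]+charset=", html, re.IGNORECASE):
        return html
    return '<meta charset="UTF-8">' + html
```

Passing the fetcher in as a parameter keeps the rewriting logic testable without network access; for example, inline_images(html, fetch=lambda url: (b"abc", "image/png")) exercises the substitution with a stub download.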
Changes these areas
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.