mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-26 01:26:00 +03:00
[PR #478] [MERGED] Feature: Add @postlight/mercury-parser as an alternative to Readability #2690
📋 Pull Request Information
Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/478
Author: @ttimasdf
Created: 9/22/2020
Status: ✅ Merged
Merged: 9/22/2020
Merged by: @cdvv7788
Base: master ← Head: with-mercury
📝 Commits (7)
e521a2e feat: Add mercury-parser
7167bb6 add environmental variable for mercury-parser
0a79927 add mercury-parser npm dependency
75e78b6 tests: add test for mercury-parser
1c052b3 fix: add mercury-parser to extractors list
ed84730 add MERCURY_BINARY to CI env
59a29a4 feat: Add mercury-parsed content to summary page
📊 Changes
12 files changed (+804 additions, -6 deletions)
📝 .github/workflows/test.yml (+1 -0)
📝 Dockerfile (+3 -1)
📝 archivebox/config/__init__.py (+14 -0)
📝 archivebox/config/stubs.py (+3 -0)
📝 archivebox/extractors/__init__.py (+2 -0)
➕ archivebox/extractors/mercury.py (+95 -0)
📝 archivebox/index/schema.py (+2 -0)
📝 archivebox/themes/legacy/link_details.html (+13 -0)
📝 package-lock.json (+661 -5)
📝 package.json (+1 -0)
📝 tests/fixtures.py (+1 -0)
📝 tests/test_extractors.py (+8 -0)
📄 Description
Summary
This PR adds mercury-parser as a backend engine for extracting meaningful content from HTML, as an alternative to Readability.
Mercury Parser is also used by NewsBlur, my favorite RSS reader 😄
Changes these areas
Motivation
I'm trying to archive this article: https://bbs.pediy.com/thread-229215.htm
Readability could extract only the last code block in this article.
Mercury Parser is able to extract the whole content.
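For readers unfamiliar with how an extractor like this is typically wired in, here is a minimal sketch of calling the mercury-parser CLI from Python and reading its JSON output. It is not the code from archivebox/extractors/mercury.py. The function names, the fallback binary name `mercury-parser`, and the injectable `run` parameter are assumptions made for illustration; only the `MERCURY_BINARY` variable name comes from the commit list above.

```python
import json
import os
import subprocess


def build_mercury_cmd(url, binary=None):
    """Build the command line for the parser (hypothetical helper).

    MERCURY_BINARY is the variable named in this PR's commits; the
    'mercury-parser' fallback is an assumption for this sketch.
    """
    binary = binary or os.environ.get("MERCURY_BINARY", "mercury-parser")
    return [binary, url]


def extract_article(url, run=subprocess.run, timeout=60):
    """Run the parser on a URL and return its JSON output as a dict.

    `run` is injectable so the function can be exercised without the
    npm package installed (an illustration choice, not ArchiveBox's design).
    """
    result = run(
        build_mercury_cmd(url),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"mercury-parser failed: {result.stderr.strip()}")
    # The CLI is assumed to print a single JSON document on stdout.
    return json.loads(result.stdout)


if __name__ == "__main__":
    article = extract_article("https://bbs.pediy.com/thread-229215.htm")
    print(article.get("title"))
```

Running the script directly requires the @postlight/mercury-parser npm package to be installed and reachable on PATH or through MERCURY_BINARY.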
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.