mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[PR #426] [MERGED] feat: Initial version of readability extractor #1161
📋 Pull Request Information
Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/426
Author: @cdvv7788
Created: 8/7/2020
Status: ✅ Merged
Merged: 8/14/2020
Merged by: @pirate
Base: master ← Head: readability-extractor
📝 Commits (10+)
7e2b249 feat: Initial version of readability extractor
b33c66a feat: Split output of readability into multiple files
61e08a7 docs: Update docs link
a147626 feat: Avoid running readability when the target is a file
0ec747f feat: Look in wget, singlefile or dom outputs before attempting to download the information again
dc87d8b tests: Update failing tests
8aa7b34 tests: Add readability to ignored methods in tests
2a68af1 tests: Add readability tests
5dc7e63 feat: Update dockerfile to support readability
4d44b17 tests: Add readability steps to CI
📊 Changes
9 files changed (+182 additions, -1 deletions)
📝 Dockerfile (+2 -0)
📝 archivebox/config/__init__.py (+15 -1)
📝 archivebox/extractors/__init__.py (+3 -0)
➕ archivebox/extractors/readability.py (+113 -0)
📝 archivebox/index/schema.py (+2 -0)
📝 archivebox/themes/legacy/link_details.html (+12 -0)
📝 tests/fixtures.py (+1 -0)
📝 tests/test_extractors.py (+32 -0)
📝 tests/test_init.py (+2 -0)
📄 Description
Summary
Initial working version of the readability extractor, using https://github.com/cdvv7788/readability-extractor as a wrapper (JSDOM, readability, dompurify).
This has a long way to go, but for now it is functional.
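One of the commits in this PR looks in the wget, singlefile or dom outputs before attempting to download the page again. A minimal sketch of that lookup is shown below. The function name `find_saved_html`, the candidate file names, and the `extra_candidates` parameter are assumptions made for illustration and are not taken from the PR's code.

```python
from pathlib import Path
from typing import Iterable, Optional

# Assumed file names for outputs saved by other extractors (hypothetical;
# the real extractor may look at different paths).
CANDIDATE_OUTPUTS = ("singlefile.html", "output.html")


def find_saved_html(link_dir: Path, extra_candidates: Iterable[str] = ()) -> Optional[str]:
    """Return the first non-empty saved HTML found under link_dir, else None.

    extra_candidates are checked first, e.g. a wget output path that
    depends on the archived URL.
    """
    for name in (*extra_candidates, *CANDIDATE_OUTPUTS):
        path = link_dir / name
        if path.is_file() and path.stat().st_size > 0:
            return path.read_text(encoding="utf-8", errors="replace")
    # Nothing usable on disk: the caller would fall back to downloading.
    return None
```

Returning `None` rather than raising leaves the decision to re-download with the caller, which keeps the lookup itself free of network code.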
Questions:
readability.json inside of the archive. It has several fields. What is the output you are expecting? Is it ok to keep this JSON or should we extract a specific field?
Related issues: #69
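One possible answer to the question above, in line with the commit that splits the readability output into multiple files, is to write the large fields to their own files and keep the rest as metadata. The sketch below assumes the wrapper's JSON carries the `content` and `textContent` fields that Readability.js returns. The function name and the output file names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict


def split_readability_output(raw_json: str, out_dir: Path) -> Dict[str, Path]:
    """Split a readability JSON document into HTML, text and metadata files."""
    article = json.loads(raw_json)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Pull the two large fields out; a missing or null field becomes "".
    html = article.pop("content", "") or ""
    text = article.pop("textContent", "") or ""

    # Hypothetical file names chosen for this example.
    paths = {
        "html": out_dir / "content.html",
        "text": out_dir / "content.txt",
        "meta": out_dir / "article.json",
    }
    paths["html"].write_text(html, encoding="utf-8")
    paths["text"].write_text(text, encoding="utf-8")
    # Whatever remains (title, byline, excerpt, ...) is kept as metadata.
    paths["meta"].write_text(json.dumps(article, indent=2), encoding="utf-8")
    return paths
```

Keeping the remaining fields in a small JSON file means nothing from the original output is discarded while the readable HTML and plain text can be opened directly.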
Changes these areas
Pending work
readability.json to the detail html index
Dockerfile
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.