[PR #426] [MERGED] feat: Initial version of readability extractor #1161

Closed
opened 2026-03-01 14:48:40 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/426
Author: @cdvv7788
Created: 8/7/2020
Status: Merged
Merged: 8/14/2020
Merged by: @pirate

Base: master ← Head: readability-extractor


📝 Commits (10+)

  • 7e2b249 feat: Initial version of readability extractor
  • b33c66a feat: Split output of readability into multiple files
  • 61e08a7 docs: Update docs link
  • a147626 feat: Avoid running readability when the target is a file
  • 0ec747f feat: Look in wget, singlefile or dom outputs before attempting to download the information again
  • dc87d8b tests: Update failing tests
  • 8aa7b34 tests: Add readability to ignored methods in tests
  • 2a68af1 tests: Add readability tests
  • 5dc7e63 feat: Update dockerfile to support readability
  • 4d44b17 tests: Add readability steps to CI
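
Commit 0ec747f describes reusing HTML that other extractors have already saved before downloading the page again. A minimal sketch of that lookup is shown here; it is not the code from this PR. The function name and the candidate file names are assumptions chosen for illustration, and the wget output is left out of the defaults because its path depends on the URL, so a caller would pass it explicitly.

```python
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical output names, relative to one snapshot directory.
# A caller would add the wget output path, which varies per URL.
CANDIDATE_OUTPUTS = ("singlefile.html", "output.html")


def find_existing_html(
    snapshot_dir: str,
    candidates: Iterable[str] = CANDIDATE_OUTPUTS,
) -> Optional[str]:
    """Return the HTML of the first non-empty candidate file, or None."""
    root = Path(snapshot_dir)
    for relative_path in candidates:
        path = root / relative_path
        # Skip missing or empty files so a failed extractor run is ignored.
        if path.is_file() and path.stat().st_size > 0:
            return path.read_text(encoding="utf-8", errors="replace")
    # Nothing usable on disk: the caller falls back to downloading the page.
    return None
```

Returning `None` rather than raising keeps the fallback decision with the caller, which can then download the page only when no saved copy exists.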

📊 Changes

9 files changed (+182 additions, -1 deletions)


📝 Dockerfile (+2 -0)
📝 archivebox/config/__init__.py (+15 -1)
📝 archivebox/extractors/__init__.py (+3 -0)
➕ archivebox/extractors/readability.py (+113 -0)
📝 archivebox/index/schema.py (+2 -0)
📝 archivebox/themes/legacy/link_details.html (+12 -0)
📝 tests/fixtures.py (+1 -0)
📝 tests/test_extractors.py (+32 -0)
📝 tests/test_init.py (+2 -0)

📄 Description

Summary

Initial working version of the readability extractor, using https://github.com/cdvv7788/readability-extractor as a wrapper (JSDOM, readability, dompurify).
It still has a long way to go, but for now it is functional.

Questions:

  • Currently, it accepts the HTML directly as a command-line argument. When something fails, it prints the whole command, which includes the entire (potentially huge) HTML document. Is that acceptable, or would it be better to save the HTML to a temporary file and load the contents from there?
  • The wrapper is pretty basic: it does the bare minimum needed to work. Improvements there will definitely be necessary.
  • Currently, the output is saved in a file named readability.json inside the archive, and it has several fields. What output are you expecting? Is it OK to keep this JSON, or should we extract a specific field?
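
The first and third questions can be explored with a small sketch. The Python below is illustrative only and is not the code in this PR. `run_with_html_file` writes the HTML to a temporary file and passes only its path to the command, so a failure message contains a short path instead of the page contents. `split_article` writes the two bulky fields of the JSON to their own files and keeps the remaining metadata together. The function names, the command prefix, and the output file names are assumptions. The `content` and `textContent` keys are those of the article object returned by Mozilla Readability's `parse()`, on the assumption that the wrapper emits that object unchanged.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_html_file(cmd_prefix, html, timeout=60):
    """Run cmd_prefix with the path of a temp file holding `html` appended."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", encoding="utf-8", delete=False
    ) as tmp:
        tmp.write(html)
        tmp_path = Path(tmp.name)
    try:
        cmd = [*cmd_prefix, str(tmp_path)]
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
        if result.returncode != 0:
            # The message shows the command and stderr, never the HTML itself.
            raise RuntimeError(f"command failed: {cmd!r}\n{result.stderr.strip()}")
        return result.stdout
    finally:
        # Remove the temp file whether or not the command succeeded.
        tmp_path.unlink(missing_ok=True)


def split_article(article_json, out_dir):
    """Write the large fields to separate files; return the small metadata."""
    article = json.loads(article_json)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # pop() takes the bulky fields out so the metadata file stays small.
    html = article.pop("content", "") or ""
    text = article.pop("textContent", "") or ""
    (out_dir / "content.html").write_text(html, encoding="utf-8")
    (out_dir / "content.txt").write_text(text, encoding="utf-8")
    (out_dir / "article.json").write_text(
        json.dumps(article, indent=2), encoding="utf-8"
    )
    return article
```

As a usage example, `run_with_html_file([sys.executable, "-c", "import sys; print(open(sys.argv[1]).read())"], html)` echoes the file back through a stand-in command; the real wrapper's invocation would replace that prefix.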

Related issues: #69

Changes these areas

  • [ ] Bugfixes
  • [x] Feature behavior
  • [ ] Command line interface
  • [ ] Configuration options
  • [ ] Internal architecture
  • [x] Archived data layout on disk

Pending work

  • [x] Add tests
  • [x] Add readability.json to the detail html index
  • [x] Add support for readability to the Dockerfile
  • [x] Add support for readability to the CI pipeline

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
