mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #69] Archive Method: Markdown output by using full-text extraction provided by a "reader mode" library #47
Originally created by @swhib on GitHub (Mar 10, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/69
Originally assigned to: @cdvv7788 on GitHub.
Your program is great. I think it would be even better if it also saved a readability version which only tries to keep the relevant parts (reader view, clean view). I see two advantages
@pirate commented on GitHub (Mar 13, 2018):
This is already in progress, I'm working on adding a new archive method based on mediumexporter that extracts article body as markdown. Once it's in markdown format any reader view can be used to view it.
@pirate commented on GitHub (Jun 11, 2018):
Debating between these options:
@swhib commented on GitHub (Jun 11, 2018):
Thank you very much for considering this issue. Here are some ideas (note that I'm not a programmer):
Is readability comparable to tomd or turndown? The first one cleans and the latter ones only convert? At least when I put a full site into the turndown online converter the result is only gibberish. So if you really wanted markdown, wouldn't you have to chain these tools?
But I wonder if you really want to only offer markdown: Bookmark-archiver results are mainly viewed in the browser, so there should be an option to view the cleaned version in the browser, too? If you need html, why convert to markdown and then convert it back? I regularly convert articles to markdown for my personal offline Evernote alternative, so I can see why an additional markdown export/option would be very useful.
For converting from html to markdown I personally prefer pandoc. Again, my preference for pandoc is motivated by general considerations: it's by far the most widely used converter, has the most contributors, etc. Also, Turndown seems to require hundreds of dependencies, whereas with pandoc you only have to trust one source (though this is a question of personal preference). I know that Mozilla's readability also requires quite a few dependencies, but a) much fewer and b) I hope that Mozilla checks the sources of the dependencies (because it should have the manpower). I think requiring pandoc is no problem: there are binaries for every platform. Putting such a binary in your PATH is easier than downloading your software from the github repo and running it.
If you want an alternative to Mozilla's readability, there are many different ones: There is python-readability. But the last time I tested, I liked the output of Mozilla's better. python-readability is a one-man project that gets very few updates. There is python-newspaper ...
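To make the "chain these tools" idea concrete, here is a rough sketch of a two-step pipeline: clean the page first, then convert only the cleaned fragment to markdown. Everything in it is an assumption for illustration: the names `ArticleExtractor`, `clean_html`, `pandoc_command` and `html_to_markdown` are made up, and keeping only what sits inside `<article>` is a naive stand-in for what a real reader-mode library such as readability does, not its actual algorithm. The conversion step assumes a `pandoc` binary is on the PATH and is skipped if it is not.

```python
import shutil
import subprocess
from html import escape
from html.parser import HTMLParser


class ArticleExtractor(HTMLParser):
    """Naive stand-in for a reader-mode library: keeps only markup inside <article>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0   # greater than 0 while we are inside an <article> element
        self.parts = []  # pieces of the cleaned HTML fragment

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1
            return  # drop the <article> wrapper itself
        if self.depth:
            # Re-emit the tag with its attributes (e.g. href on links).
            attr_text = ""
            for name, value in attrs:
                attr_text += ' {}="{}"'.format(name, escape(value or "", quote=True))
            self.parts.append("<{}{}>".format(tag, attr_text))

    def handle_endtag(self, tag):
        if tag == "article":
            self.depth = max(0, self.depth - 1)
        elif self.depth:
            self.parts.append("</{}>".format(tag))

    def handle_data(self, data):
        if self.depth:
            # Entities were decoded by the parser, so escape them again.
            self.parts.append(escape(data, quote=False))


def clean_html(page_html):
    """Step 1 (clean): return only the article body as an HTML fragment."""
    extractor = ArticleExtractor()
    extractor.feed(page_html)
    extractor.close()
    return "".join(extractor.parts).strip()


def pandoc_command(source="html", target="markdown"):
    """Build the pandoc invocation; pandoc reads stdin when no file is given."""
    return ["pandoc", "--from", source, "--to", target]


def html_to_markdown(fragment):
    """Step 2 (convert): pipe the fragment through pandoc, or return None if pandoc is missing."""
    if shutil.which("pandoc") is None:
        return None
    result = subprocess.run(
        pandoc_command(), input=fragment, capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `html_to_markdown(clean_html(page))` runs both steps in order. Feeding the whole page straight into the converter skips the cleaning step, which is presumably why the full-site test in the online converter came out as gibberish.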
@rsuhada commented on GitHub (Jul 12, 2018):
Just my 5 cents: after experimenting with several libraries, python-newspaper seemed the most robust to me at the moment. Pandoc would be nice (easy dependency) and works decently, but python-newspaper still performed better.
I'm very interested in this feature!
If I already have an existing archive, will there be an option to convert it directly to markdown, or will it require re-download?
@pirate commented on GitHub (Apr 23, 2019):
A quick update: after investigating all the options, I like Mozilla's https://github.com/mozilla/readability the best so far.
I plan on releasing full-text extraction sometime after v0.4.0 lands, stay tuned for updates!
@pirate commented on GitHub (Aug 14, 2020):
This is done in #426! Thanks @cdvv7788