[GH-ISSUE #69] Archive Method: Markdown output by using full-text extraction provided by a "reader mode" library #3067

Closed
opened 2026-03-14 20:51:53 +03:00 by kerem · 6 comments

Originally created by @swhib on GitHub (Mar 10, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/69

Originally assigned to: @cdvv7788 on GitHub.

Your program is great. I think it would be even better if it also saved a readability version which only tries to keep the relevant parts (reader view, clean view). I see two advantages:

  • Full-text searches should be more accurate and contain fewer irrelevant matches. I save a lot of interesting articles, and usually they are surrounded by headings/links to other articles, which show up in my searches.
  • It makes reading an article later much nicer. At the moment, in Chromium 63, I can't use the most popular readability extension, Mercury Reader, to transform a site from my disk that I have saved with bookmark-archiver. It's the same for Firefox 58.0.2. So I have to copy the actual content into a text editor or word processor to get a distraction-free reading view.

@pirate commented on GitHub (Mar 13, 2018):

This is already in progress, I'm working on adding a new archive method based on mediumexporter that extracts article body as markdown. Once it's in markdown format any reader view can be used to view it.


@pirate commented on GitHub (Jun 11, 2018):

Debating between these options:

  • https://github.com/domchristie/turndown
  • https://github.com/gaojiuli/tomd
  • https://github.com/mozilla/readability

@swhib commented on GitHub (Jun 11, 2018):

Thank you very much for considering this issue. Here are some ideas (note that I'm not a programmer):

Is readability comparable to tomd or turndown? The first one cleans and the latter ones only convert? At least when I put a full site into the turndown online converter (http://domchristie.github.io/turndown/), the result is only gibberish. So if you really wanted markdown, wouldn't you have to chain these tools?

But I wonder if you really want to offer only markdown: bookmark-archiver results are mainly viewed in the browser, so there should be an option to view the cleaned version in the browser, too? If you need HTML, why convert to markdown and then convert it back? I regularly convert articles to markdown for my personal offline Evernote alternative, so I can see why an additional markdown export/option would be very useful.

For converting from HTML to markdown I personally prefer pandoc. Again, my preference for pandoc is motivated by general considerations: it is by far the most widely used converter, has the most contributors, etc. Also, Turndown seems to require hundreds of dependencies (https://github.com/domchristie/turndown/blob/master/package-lock.json), whereas with pandoc you only have to trust one source (though this is a question of personal preference). I know that Mozilla's readability also requires quite a few dependencies, but a) far fewer and b) I hope that Mozilla checks the sources of the dependencies (because it should have the manpower). I think requiring pandoc is no problem: there are binaries for every platform. Putting such a binary in your PATH is easier than downloading your software from the GitHub repo and running it.

If you want an alternative to Mozilla's readability, there are many different ones. There is python-readability (https://github.com/buriy/python-readability), but the last time I tested, I liked Mozilla's output better. python-readability is a one-man project that gets very few updates. There is python-newspaper (https://github.com/codelucas/newspaper) ...
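
To make the "clean first, then convert" chain from this comment concrete, here is a minimal sketch in Python using only the standard library. It is not how readability, turndown, pandoc, or ArchiveBox work: the rule "keep only headings and paragraphs inside `<article>`, drop `nav`/`aside`/`footer`" is a toy assumption standing in for a real content extractor, and the names `ArticleToMarkdown`, `html_to_markdown`, and `SAMPLE` are hypothetical.

```python
from html.parser import HTMLParser

# Tags treated as page chrome and dropped (an assumption of this sketch).
SKIP_TAGS = {"script", "style", "nav", "aside", "footer"}
# Block-level tags we keep, mapped to the Markdown prefix used for them.
BLOCK_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "p": ""}


class ArticleToMarkdown(HTMLParser):
    """Toy pipeline: keep only <article> content, emit Markdown blocks."""

    def __init__(self):
        super().__init__()
        self.in_article = 0   # nesting depth of <article>
        self.skip_depth = 0   # nesting depth of skipped tags
        self.current = None   # block tag currently being collected
        self.buffer = []      # text fragments of the current block
        self.blocks = []      # finished Markdown blocks

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "article":
            self.in_article += 1
        elif tag in BLOCK_PREFIX and self.in_article and not self.skip_depth:
            self.current = tag
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "article":
            self.in_article = max(0, self.in_article - 1)
        elif tag == self.current:
            # Collapse whitespace, then emit the block with its prefix.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.blocks.append(BLOCK_PREFIX[tag] + text)
            self.current = None

    def handle_data(self, data):
        if self.current and not self.skip_depth:
            self.buffer.append(data)


def html_to_markdown(html):
    """Stage 1 (clean) and stage 2 (convert) in a single parser pass."""
    parser = ArticleToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)


SAMPLE = """
<html><body>
<nav><p>Home | Other articles</p></nav>
<article>
  <h1>Title</h1>
  <p>First <a href="#">paragraph</a> of the article.</p>
  <aside><p>Related: something else</p></aside>
  <p>Second paragraph.</p>
</article>
<footer><p>Copyright</p></footer>
</body></html>
"""

if __name__ == "__main__":
    print(html_to_markdown(SAMPLE))
```

Feeding a whole page to a plain converter keeps the navigation and footer text, which is the "gibberish" effect described above; restricting the conversion to the extracted article body is what the chaining buys. A real setup would swap the toy `<article>` rule for a proper extractor and the prefix table for a full HTML-to-markdown converter.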

@rsuhada commented on GitHub (Jul 12, 2018):

Just my 5 cents: after experimenting with several libraries, python-newspaper seemed the most robust to me at the moment. Pandoc would be nice (easy dependency) and works decently, but python-newspaper still performed better.

I'm very interested in this feature!
If I already have an existing archive, will there be an option to convert it directly to markdown, or will it require a re-download?


@pirate commented on GitHub (Apr 23, 2019):

A quick update: after investigating all the options, I like Mozilla's https://github.com/mozilla/readability the best so far.

I plan on releasing full-text extraction sometime after v0.4.0 (https://github.com/pirate/ArchiveBox/pull/207) lands, stay tuned for updates!


@pirate commented on GitHub (Aug 14, 2020):

This is done in #426! Thanks @cdvv7788

```bash
git checkout master
git pull
docker build . -t archivebox
docker run -v $PWD:/data archivebox add 'https://www.nytimes.com/2020/06/11/books/internet-archive-national-emergency-library-coronavirus.html'
```