[GH-ISSUE #1143] Feature Request: new inscriptis / trafilatura extractor for AI-powered article text and metadata extraction #2226

Open
opened 2026-03-01 17:57:27 +03:00 by kerem · 2 comments
Owner

Originally created by @turian on GitHub (May 5, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1143

Type

  • [ ] General question or discussion
  • [x] Propose a brand new feature
  • [ ] Request modification of existing behavior or design

What is the problem that your feature request solves

For NLP work with web pages, I've found that inscriptis (https://github.com/weblyzard/inscriptis) is the best text extractor.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

It would be an optional extension.

What hacks or alternative solutions have you tried to solve the problem?

Crawling the Chromium HTML and using inscriptis myself.
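The workaround above could look roughly like the sketch below. It assumes the page HTML has already been saved by a Chromium-based step and is available as a string. `inscriptis.get_text` is the library's documented entry point. The `html_to_text` helper and the stdlib fallback are illustrative names, not part of ArchiveBox or inscriptis.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Rough stdlib fallback: keeps text nodes, skips <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Convert HTML to plain text, preferring inscriptis when it is installed."""
    try:
        # Optional dependency: pip install inscriptis
        from inscriptis import get_text
    except ImportError:
        # Hypothetical fallback so the extractor stays optional.
        collector = _TextCollector()
        collector.feed(html)
        return "\n".join(collector.chunks)
    return get_text(html)
```

A caller would read the saved HTML file from wherever the snapshot stores it and pass its contents to `html_to_text`. The fallback is far cruder than inscriptis and is only there to keep the dependency optional.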

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [x] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

  • [x] I'm willing to contribute dev time / money to fix this issue
  • [x] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up

@pirate commented on GitHub (May 5, 2023):

After some quick research I'm leaning towards this one instead: https://github.com/adbar/trafilatura

The reason is that it supports extracting comments and other useful metadata beyond just article text, which I think brings it a distinct advantage that mercury/readability don't provide already.
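As a rough illustration of that advantage, the sketch below asks trafilatura for JSON output with comments and metadata included. `trafilatura.extract` and its `output_format`, `include_comments`, and `with_metadata` parameters are part of the library. The `ArticleRecord` type, the helper names, and the exact JSON keys read here are assumptions, so the keys are read defensively.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArticleRecord:
    """Hypothetical container for the fields an extractor might store."""
    title: Optional[str]
    author: Optional[str]
    date: Optional[str]
    text: str
    comments: str


def record_from_json(payload: str) -> ArticleRecord:
    """Build a record from a JSON string; absent or null keys become None or ''."""
    data = json.loads(payload)
    return ArticleRecord(
        title=data.get("title"),
        author=data.get("author"),
        date=data.get("date"),
        text=data.get("text") or "",
        comments=data.get("comments") or "",
    )


def extract_article(html: str) -> Optional[ArticleRecord]:
    """Run trafilatura on an HTML string; returns None if nothing was extracted."""
    # Optional dependency: pip install trafilatura
    import trafilatura

    payload = trafilatura.extract(
        html,
        output_format="json",
        include_comments=True,
        with_metadata=True,
    )
    return record_from_json(payload) if payload else None
```

Keeping comments in a separate field from the article text is what lets a downstream index treat them differently, which mercury and readability output does not allow.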


@turian commented on GitHub (May 5, 2023):

@pirate TBH, it would be nice to have both options. I was using readability and mercury, but they were not good enough. jusText was recommended to me by EleutherAI, but it was too aggressive.

Anyway, it really depends upon the use-case, but having a few options is great. I guess I would also include jusText in that list then.

Different tasks might need different levels of aggressiveness and a different precision/recall tradeoff.

I'm doing NLP stuff and using LLMs, and many people are these days too.
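One way to offer several extractors side by side is a small name-to-function registry, sketched below. The library calls used (`inscriptis.get_text`, `trafilatura.extract` with `favor_precision`, `justext.justext` with `justext.get_stoplist`) exist in those packages. The registry, the `extract_text` function, and the method names are hypothetical and do not reflect how ArchiveBox is actually structured.

```python
from typing import Callable, Dict


def _inscriptis(html: str) -> str:
    # Layout-preserving conversion; keeps most of the page text.
    from inscriptis import get_text
    return get_text(html)


def _trafilatura(html: str) -> str:
    # favor_precision=True trades recall for less boilerplate.
    import trafilatura
    return trafilatura.extract(html, favor_precision=True) or ""


def _justext(html: str) -> str:
    # Keeps only paragraphs jusText does not classify as boilerplate.
    import justext
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)


# Libraries are imported lazily, so only the chosen one must be installed.
EXTRACTORS: Dict[str, Callable[[str], str]] = {
    "inscriptis": _inscriptis,
    "trafilatura": _trafilatura,
    "justext": _justext,
}


def extract_text(html: str, method: str = "inscriptis") -> str:
    """Dispatch to the extractor registered under `method`."""
    try:
        extractor = EXTRACTORS[method]
    except KeyError:
        known = ", ".join(sorted(EXTRACTORS))
        raise ValueError(f"unknown extractor {method!r}; choose from: {known}") from None
    return extractor(html)
```

With this shape, a user who finds one extractor too aggressive for a task can switch to another by changing a single setting.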
