starred/ArchiveBox

mirror of https://github.com/ArchiveBox/ArchiveBox.git synced 2026-04-25 09:06:02 +03:00

[GH-ISSUE #1143] Feature Request: new inscriptis / trafilatura extractor for AI-powered article text and metadata extraction #3736

Open

opened 2026-03-15 00:13:38 +03:00 by kerem · 2 comments

kerem commented

2026-03-15 00:13:38 +03:00

Owner

Originally created by @turian on GitHub (May 5, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1143

Type

- [ ] General question or discussion
- [X] Propose a brand new feature
- [ ] Request modification of existing behavior or design

What is the problem that your feature request solves

For NLP work with web-pages, I've found that [inscriptis](https://github.com/weblyzard/inscriptis) is the best text extractor.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

It would be an optional extension.

What hacks or alternative solutions have you tried to solve the problem?

Crawling the Chromium HTML and using inscriptis myself.

How badly do you want this new feature?

- [ ] It's an urgent deal-breaker, I can't live without it
- [X] It's important to add it in the near-mid term future
- [ ] It would be nice to have eventually

- [X] I'm willing to contribute [dev time](https://github.com/ArchiveBox/ArchiveBox#archivebox-development) / [money](https://github.com/sponsors/pirate) to fix this issue
- [X] I like ArchiveBox so far / would recommend it to a friend
- [ ] I've had a lot of difficulty getting ArchiveBox set up
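
The workaround described in the issue (taking the HTML that Chromium saved and running inscriptis over it) could look roughly like the sketch below. `get_text` is the inscriptis entry point; the `html_to_text` wrapper and the standard-library fallback parser are hypothetical names added for illustration and are not part of ArchiveBox. The fallback only exists so the snippet still runs when inscriptis is not installed, which mirrors the "optional extension" idea.

```python
from html.parser import HTMLParser

try:
    # Optional dependency: pip install inscriptis
    from inscriptis import get_text
except ImportError:
    get_text = None


class _FallbackTextParser(HTMLParser):
    """Very rough stdlib fallback: collects visible text, skips script/style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html: str) -> str:
    """Convert an HTML string to plain text (hypothetical helper)."""
    if get_text is not None:
        # inscriptis renders the HTML with layout-aware whitespace.
        return get_text(html)
    parser = _FallbackTextParser()
    parser.feed(html)
    return parser.text()
```

In practice the HTML string would be read from whatever file the Chromium-based extractor wrote for a snapshot; the exact path depends on the ArchiveBox version and is not assumed here.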

kerem added the labels status: idea-phase, touches: dependencies/packaging, and expected: unlikely unless contributed on 2026-03-15 00:13:38 +03:00

kerem commented

2026-03-15 00:13:49 +03:00

Author

Owner

@pirate commented on GitHub (May 5, 2023):

After some quick research I'm leaning towards this one instead: https://github.com/adbar/trafilatura

The reason is that it supports extracting comments and other useful metadata beyond just article text, which I think brings it a distinct advantage that mercury/readability don't provide already.

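
As a rough illustration of the advantage mentioned above, trafilatura can return comments and page metadata alongside the article text. The sketch below uses `trafilatura.extract` with its `include_comments`, `with_metadata`, and `output_format` parameters; the `extract_article` wrapper is a hypothetical name, and the import guard is only there so the snippet degrades gracefully when trafilatura is not installed.

```python
import json
from typing import Optional

try:
    # Optional dependency: pip install trafilatura
    import trafilatura
except ImportError:
    trafilatura = None


def extract_article(html: str) -> Optional[dict]:
    """Return article text, comments, and metadata as a dict (hypothetical helper).

    Returns None when trafilatura is unavailable or finds nothing to extract.
    """
    if trafilatura is None:
        return None
    result = trafilatura.extract(
        html,
        include_comments=True,   # keep user comments found on the page
        with_metadata=True,      # include metadata such as title, author, date
        output_format="json",    # serialized as a JSON string
    )
    if result is None:
        # trafilatura returns None when it cannot find usable content.
        return None
    return json.loads(result)
```

Which metadata fields actually come back depends on the page and on the trafilatura version, so callers should treat every key as optional.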

kerem commented

2026-03-15 00:13:54 +03:00

Author

Owner

@turian commented on GitHub (May 5, 2023):

@pirate TBH, it would be nice to have both options. I was using readability and mercury, but they were not good enough. jusText was recommended to me by EleutherAI, but it was too aggressive.

Anyway, it really depends upon the use-case, but having a few options is great. I guess I would also include jusText in that list then.

Because for different tasks, you might need different levels of aggressiveness + precision/recall tradeoff.

I'm doing NLP stuff and using LLMs, and many people are these days too.

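
One way to offer several extractors with different aggressiveness is a small name-to-function registry, so each task can pick its own precision/recall tradeoff. The sketch below is an assumption about how that could be wired up, not ArchiveBox's plugin mechanism: `EXTRACTORS`, `register`, and `extract_with` are hypothetical names. The library calls it wraps (`trafilatura.extract` with `favor_precision` / `favor_recall`, `justext.justext` with `justext.get_stoplist`, and inscriptis `get_text`) are registered only when the corresponding package is installed.

```python
from typing import Callable, Dict, Optional

Extractor = Callable[[str], Optional[str]]
EXTRACTORS: Dict[str, Extractor] = {}  # hypothetical registry: name -> function


def register(name: str):
    """Decorator that adds an extractor function to the registry."""
    def decorator(func: Extractor) -> Extractor:
        EXTRACTORS[name] = func
        return func
    return decorator


try:
    import trafilatura

    @register("trafilatura-precision")
    def _trafilatura_precise(html: str) -> Optional[str]:
        # Stricter filtering: less noise, may drop some real content.
        return trafilatura.extract(html, favor_precision=True)

    @register("trafilatura-recall")
    def _trafilatura_recall(html: str) -> Optional[str]:
        # Looser filtering: keeps more text, may include boilerplate.
        return trafilatura.extract(html, favor_recall=True)
except ImportError:
    pass

try:
    import justext

    @register("justext")
    def _justext(html: str) -> Optional[str]:
        paragraphs = justext.justext(html, justext.get_stoplist("English"))
        return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)
except ImportError:
    pass

try:
    from inscriptis import get_text

    register("inscriptis")(get_text)
except ImportError:
    pass


def extract_with(name: str, html: str) -> Optional[str]:
    """Run the named extractor; raises KeyError if it is not registered."""
    if name not in EXTRACTORS:
        raise KeyError(f"extractor not available: {name}")
    return EXTRACTORS[name](html)
```

With this shape, a missing optional package simply means its entry is absent from `EXTRACTORS`, and a caller can list `sorted(EXTRACTORS)` to see what is available.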
