[GH-ISSUE #1139] Feature Request: Add AI-assisted summarization, tagging, search, and more using LLMs / RAG #2223

Open
opened 2026-03-01 17:57:25 +03:00 by kerem · 9 comments

Originally created by @hbd on GitHub (Apr 18, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1139

Type

  • [x] General question or discussion
  • [x] Propose a brand new feature
  • [ ] Request modification of existing behavior or design

What is the problem that your feature request solves

I'm fairly new to ArchiveBox, but I really like the simplicity and robustness of the tool. For a long time I have considered building a similar tool to ensure I can deliberately save and revisit content from the internet that is important to me. However, I think ArchiveBox provides everything I want and more.

With the advent of open-source LLMs, what does the community here think about a feature that provides a ChatGPT-like experience on top of the content in an ArchiveBox deployment, to allow a user to interact with the content in ArchiveBox? This could be useful for doing things like fine-tuned research, recalling content that was saved in a period of time ("Summarize the content I saved yesterday"), and probably a lot more.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

Ideally this would be a tool on top of ArchiveBox that consumes the data saved in an archive to fine-tune an LLM and provides a chat interface for interacting with the data.

What hacks or alternative solutions have you tried to solve the problem?

Nothing specific to ArchiveBox :D

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [ ] It's important to add it in the near-mid term future
  • [x] It would be nice to have eventually

  • [x] I'm willing to contribute dev time / money to fix this issue
  • [x] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up

@pirate commented on GitHub (Apr 19, 2023):

I'm already thinking about some potential applications of LLMs in ArchiveBox 😉

Will post about my plans as they develop, but curious to hear about specific ideas from the community (beyond auto-summarization and auto-tagging) in the meantime...


@OleVik commented on GitHub (Aug 12, 2023):

Like hbd, I am new to ArchiveBox, but a use case I regularly come across is traversing old copies of websites on Archive.org, where content is - naturally - structured in child pages. That is, the topmost page has been archived many times across dates, but not the subsequent pages below it, and at best it becomes a slog to find some, if any, content that belongs or relates to the main page.

I would think an LLM could help operationalize strategies to find related archived content, to "fill in the gaps", across time and space. Content that in the early 2000s existed under one domain has sometimes later moved to other domains, been updated, changed, removed, or lost, and finding sources that relate to one another can be quite an effort. Machine aid in digging deeper could yield a much higher degree of completeness in archival efforts.


@viraptor commented on GitHub (Jan 12, 2024):

If you want to create something that works today, you can try deploying a separate RAG system (like https://github.com/weaviate/Verba - an example, not a recommendation). Then you can import the processed text, like the readability result, and query it.

While having some native-to-ArchiveBox ability to process documents would be nice, with a small wrapper around `archivebox add` you can import the data into whatever other system you want. Maybe it would even be a good idea to integrate another, established system into ArchiveBox's UI.
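A wrapper like the one described above could be sketched as follows. This is a hedged sketch, not part of ArchiveBox: it assumes ArchiveBox's usual on-disk layout (`archive/<timestamp>/readability/content.txt`), and the `ingest` callback is a stub standing in for whatever RAG system you run alongside.

```python
"""Sketch: wrap `archivebox add`, then hand the newest snapshot's readability
text to an external RAG system (represented here by a plain callback)."""
import subprocess
from pathlib import Path


def latest_readability_text(data_dir="."):
    """Return the readability plain-text of the most recent snapshot, or None.

    Snapshot folders are timestamp-named, so a lexical sort approximates
    chronological order.
    """
    snapshots = sorted(Path(data_dir).glob("archive/*/readability/content.txt"))
    if not snapshots:
        return None
    return snapshots[-1].read_text(encoding="utf-8", errors="replace")


def add_and_ingest(url, data_dir=".", ingest=print):
    """Archive a URL (must be run inside an initialized ArchiveBox collection),
    then pass the extracted text to the ingest callback."""
    subprocess.run(["archivebox", "add", url], cwd=data_dir, check=True)
    text = latest_readability_text(data_dir)
    if text:
        ingest(text)  # e.g. POST to your RAG system's import endpoint
```

Swapping `ingest` for an HTTP POST to your deployed RAG system's import endpoint is all the glue that's needed for a first experiment.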


@elsatch commented on GitHub (Feb 11, 2024):

First of all, I am so happy to find this thread! I've searched the website, the zulip conversations and found no traces of LLM integration.

So here is my plan for ArchiveBox <-> LLM integration: create a LlamaIndex Loader for ArchiveBox, so I can query all contents seamlessly using one of the most popular RAG frameworks out there. This also ensures compatibility with most of the LLM ecosystem. Let me break it down.

LlamaIndex is a data framework for LLM applications, focused mostly on Retrieval-Augmented Generation (RAG) scenarios. It's available under the MIT license and built with Python and TypeScript support. It can work with cloud-based LLMs but also, and most importantly, with locally hosted LLMs. Repo: https://github.com/run-llama/llama_index

LlamaHub is a website that collects all kinds of connectors for accessing third-party data. There are connectors for local files in PDF, doc, JSON, etc. formats, but also connectors for other platforms like Notion, Slack, GitHub, RSS, etc. The term used for these connectors is Loaders/Readers.

The resulting documents can be used with LlamaIndex or any of the other popular LLM frameworks like LangChain, Semantic Kernel, etc.

So building the loader would open the door to any kind of integration that uses ArchiveBox as a data backend.

LlamaIndex offers a streamlined method to import directories called SimpleDirectoryReader. We could point this at the archivebox folder, but it would get really noisy with all the replicas. When calling the function, we can filter by extension to get only the .txt files, but then we would be missing the UI metadata.
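The loader idea above can be prototyped without LlamaIndex installed: walk the collection, keep only the plain-text extractor outputs, and attach the snapshot folder as metadata. Each dict below stands in for what would become a llama_index `Document`; the folder layout (`archive/<timestamp>/<extractor>/...`) is ArchiveBox's usual structure but should be verified locally.

```python
"""Sketch of an ArchiveBox -> LlamaIndex loader, using only the stdlib."""
from pathlib import Path


def load_archive_documents(data_dir, exts=(".txt",)):
    """Collect extractor outputs matching `exts`, with snapshot metadata attached."""
    root = Path(data_dir) / "archive"
    docs = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in exts:
            docs.append({
                "text": path.read_text(encoding="utf-8", errors="replace"),
                "metadata": {
                    "snapshot": path.relative_to(root).parts[0],  # timestamp folder
                    "extractor": path.parent.name,  # e.g. "readability"
                },
            })
    return docs
```

In a real loader, each dict would be wrapped as `Document(text=..., metadata=...)`; the metadata then becomes filterable at the vector store level, which is exactly where tags would slot in once they are exportable.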

Things I am still missing in order to proceed with developing the connector:

  • From the ArchiveBox side: I'd love to extract the tags from the UI and pass those as metadata to filter on later at the vector store level, but I've found no way to export them. (`archivebox list --json` returns all data except those tags :) )
  • From the LlamaIndex side: Find out which of the output formats works better with the default LlamaIndex configuration.
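Until tags show up in the JSON output, one possible workaround is to read them straight from the collection's SQLite index. This is a sketch only: the table and column names below (`core_snapshot`, `core_tag`, `core_snapshot_tags`) match recent ArchiveBox (Django) schemas but are an assumption, so check `.schema` on your own `index.sqlite3` first.

```python
"""Workaround sketch: pull tags directly from ArchiveBox's SQLite index,
since `archivebox list --json` omits them. Table names are assumptions."""
import sqlite3


def tags_by_url(db_path):
    """Return {url: [sorted tag names]} from an ArchiveBox index database."""
    con = sqlite3.connect(db_path)
    rows = con.execute(
        """
        SELECT s.url, t.name
        FROM core_snapshot AS s
        JOIN core_snapshot_tags AS st ON st.snapshot_id = s.id
        JOIN core_tag AS t ON t.id = st.tag_id
        ORDER BY s.url, t.name
        """
    ).fetchall()
    con.close()
    tags = {}
    for url, name in rows:
        tags.setdefault(url, []).append(name)
    return tags
```

The resulting mapping could be merged into the loader's per-document metadata by URL, making tags filterable at the vector store level.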

@pirate commented on GitHub (Feb 12, 2024):

That sounds very cool @elsatch and I welcome any integrations with LLM tools for search/curation.

We can add the tags to the `--json` output easily if that would help.
As far as output formats, I recommend the readability + mercury outputs as they extract raw text well from discussion threads, news articles, social media, etc.

Image extraction should be easier soon as we're adding a new `gallery-dl` extractor in the next major release. We might also add an OCR extractor eventually to get text out of images & PDFs if that's helpful.


@elsatch commented on GitHub (Feb 12, 2024):

If you could add the tags to the json output, that would be great! It will create meaningful outputs.

I am considering adding the readability and singlehtml formats to check the performance of both options. Readability output looks really nice, but on long articles it might require splitting the file into fixed-length chunks. With singlehtml, well-formatted files could be split at the heading level, giving more meaningful text chunks.

LlamaIndex seems to be working on multimodal vision models too, so both options might converge.
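The two splitting strategies above could be prototyped with plain stdlib code before wiring them into LlamaIndex's node parsers. A minimal sketch (the heading regex is deliberately naive; a real pipeline would use an HTML parser):

```python
"""Sketch: fixed-length chunking for readability text vs. heading-based
splitting for singlehtml output."""
import re


def fixed_chunks(text, size=1000, overlap=100):
    """Split text into chunks of up to `size` chars, each overlapping the
    previous chunk by `overlap` chars (a common RAG chunking scheme)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def heading_chunks(html):
    """Split an HTML document immediately before each <h1>/<h2> tag,
    so each chunk starts with its own heading."""
    parts = re.split(r"(?i)(?=<h[12][ >])", html)
    return [p for p in parts if p.strip()]
```

The overlap in `fixed_chunks` helps a retrieved chunk keep enough surrounding context to stand on its own, which is why most RAG chunkers default to a non-zero overlap.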


@pirate commented on GitHub (Feb 22, 2024):

~~Leaving this here for future AI ideas: https://github.com/PrefectHQ/marvin~~ unnecessary, we have way more tools available in 2026+

~~Also [magika](https://opensource.googleblog.com/2024/02/magika-ai-powered-fast-and-efficient-file-type-identification.html) for content type detection.~~ nah, it's no good


@mapleshadow commented on GitHub (Jul 18, 2025):

up


@viraptor commented on GitHub (Feb 24, 2026):

This should be reopened, the linked PR does the first prerequisite for the requested feature, but doesn't implement the feature itself.
