mirror of https://github.com/ArchiveBox/ArchiveBox.git, synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #1139] Feature Request: Add AI-assisted summarization, tagging, search, and more using LLMs / RAG #2223
Originally created by @hbd on GitHub (Apr 18, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1139
What is the problem that your feature request solves
I'm fairly new to ArchiveBox, but I really like the simplicity and robustness of the tool. For a long time I have considered building a similar tool to ensure I can deliberately save and revisit content from the internet that is important to me. However, I think ArchiveBox provides everything I want and more.
With the advent of open-source LLMs, what does the community here think about a ChatGPT-like feature on top of the content in an ArchiveBox deployment, allowing a user to interact with the content in their archive? This could be useful for things like focused research, recalling content saved during a period of time ("Summarize the content I saved yesterday"), and probably a lot more.
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
Ideally this would be a tool on top of ArchiveBox that consumes the data saved in an archive to fine-tune an LLM and provides a chat interface for interacting with that data.
What hacks or alternative solutions have you tried to solve the problem?
Nothing specific to ArchiveBox :D
How badly do you want this new feature?
@pirate commented on GitHub (Apr 19, 2023):
I'm already thinking about some potential applications of LLMs in ArchiveBox 😉
Will post about my plans as they develop, but curious to hear about specific ideas from the community (beyond auto-summarization and auto-tagging) in the meantime...
@OleVik commented on GitHub (Aug 12, 2023):
Like @hbd, I am new to ArchiveBox, but a use case I regularly come across is traversing old copies of websites on Archive.org where content is, naturally, structured in child pages. That is, the topmost page has been archived many times across dates, but not the subsequent pages below it, and at best it becomes a slog to find some, if any, content that belongs or relates to the main page.
I would think an LLM could help operationalize strategies to find related archived content, to "fill in the gaps", across time and space. Content that existed under one domain in the early 2000s has sometimes later moved to other domains, or been updated, changed, removed, or lost, and finding sources that relate to one another can be quite an effort. Machine aid in digging deeper could yield a much higher degree of completion in archival efforts.
@viraptor commented on GitHub (Jan 12, 2024):
If you want to create something that works today, you can try deploying a separate RAG system (like https://github.com/weaviate/Verba - an example, not a recommendation). Then you can import the processed text, such as the readability result, and query it.
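As a toy illustration of the retrieval half of that RAG setup (a sketch only: real systems like Verba use trained embedding models and a vector store, not word counts), a minimal Python ranker over extracted readability texts might look like:

```python
import math
import re
from collections import Counter


def bow(text: str) -> Counter:
    # Toy stand-in for an embedding: a bag-of-words term-frequency vector.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, docs: dict[str, str], k: int = 3) -> list[str]:
    # Rank archived documents by similarity to the query; the top-k texts
    # would then be stuffed into the LLM prompt as context.
    q = bow(query)
    ranked = sorted(docs, key=lambda d: cosine(q, bow(docs[d])), reverse=True)
    return ranked[:k]
```

The `docs` mapping here would be snapshot IDs to their readability text; everything else about the archive layout is left to the importer.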
While having some native-to-ArchiveBox ability to process documents would be nice, with a small wrapper around `archivebox add` you can import the data into whatever other system you want. Maybe it would even be a good idea to see another, established system integrated into ArchiveBox's UI.
@elsatch commented on GitHub (Feb 11, 2024):
First of all, I am so happy to find this thread! I searched the website and the Zulip conversations and found no trace of LLM integration.
So here is my plan for ArchiveBox <-> LLM integration: create a LlamaIndex Loader for ArchiveBox, so I can query all contents seamlessly using one of the most popular RAG frameworks out there. This also ensures compatibility with most of the LLM ecosystem. Let me break it down.
LlamaIndex is a data framework for LLM applications, focused mostly on Retrieval-Augmented Generation (RAG) scenarios. It's available under the MIT license and built with Python and TypeScript support. It can work with cloud-based LLMs but also, most importantly, with locally hosted LLMs. Repo: https://github.com/run-llama/llama_index
LlamaHub is a website that collects all kinds of connectors for accessing third-party data. There are connectors for local files in PDF, doc, JSON, etc. formats, abstract connectors, and connectors for other platforms like Notion, Slack, GitHub, RSS, etc. The term used for these connectors is Loaders/Readers.
The resulting documents can be used with LlamaIndex or any of the other popular LLM frameworks like LangChain, Semantic Kernel, etc.
So building the loader would open the door to any kind of integration that uses ArchiveBox as a data backend.
LlamaIndex offers a streamlined method to import directories called SimpleDirectoryReader. We could point this to the archivebox folder, but it will get really noisy with all the replicas. When calling the function, we can filter by extension to get only the txt files, but then we will be missing the UI metadata.
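For comparison, here is a stdlib-only sketch of that filtering step without SimpleDirectoryReader. The archive/<timestamp>/readability/content.txt layout is an assumption based on ArchiveBox's snapshot-folder convention; adjust the path to your ArchiveBox version:

```python
from pathlib import Path


def collect_readability_texts(archive_root: str) -> dict[str, str]:
    """Collect plain-text extractor output from each snapshot folder.

    Assumes ArchiveBox's archive/<timestamp>/ layout, with readability
    output saved as readability/content.txt (hypothetical path; verify
    against your installation). Snapshots without that file are skipped,
    which avoids the noise from all the other replica formats.
    """
    texts = {}
    for snapshot in sorted(Path(archive_root).iterdir()):
        txt = snapshot / "readability" / "content.txt"
        if txt.is_file():
            texts[snapshot.name] = txt.read_text(errors="replace")
    return texts
```

The returned mapping (snapshot ID -> text) could then be wrapped into LlamaIndex Document objects by the loader, with the UI metadata merged in separately once it is exposed.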
Things I am still missing in order to develop the connector further:
@pirate commented on GitHub (Feb 12, 2024):
That sounds very cool @elsatch and I welcome any integrations with LLM tools for search/curation.
We can add the tags to the `--json` output easily if that would help. As far as output formats, I recommend the readability + mercury outputs, as they extract raw text well from discussion threads, news articles, social media, etc.
Image extraction should be easier soon, as we're adding a new `gallery-dl` extractor in the next major release. We might also add an OCR extractor eventually to get text out of images & PDFs if that's helpful.
@elsatch commented on GitHub (Feb 12, 2024):
If you could add the tags to the json output, that would be great! It will create meaningful outputs.
I am considering adding both the readability and singlehtml formats to compare the performance of the two options. Readability output looks really nice, but on long articles it might require splitting the file into fixed-length chunks. With singlehtml, well-formatted files could be split at the heading level, giving more meaningful text chunks.
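A minimal sketch of the fixed-length splitting mentioned above (character-based for simplicity; real pipelines usually chunk by tokens and try to respect sentence or heading boundaries):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split extracted article text into fixed-length, overlapping chunks.

    Overlap keeps sentences that straddle a boundary visible in both
    neighboring chunks, which helps retrieval quality.
    """
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    chunks = []
    # Stop before a trailing chunk that would be pure overlap of the
    # previous one; max(..., 1) still yields one chunk for short texts.
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks
```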
LlamaIndex seems to be working on multimodal vision models too, so both options might converge.
@pirate commented on GitHub (Feb 22, 2024):
Leaving this here for future AI ideas: https://github.com/PrefectHQ/marvin (unnecessary, we have way more tools available in 2026+). Also magika for content type detection (nah, it's no good).
@mapleshadow commented on GitHub (Jul 18, 2025):
up
@viraptor commented on GitHub (Feb 24, 2026):
This should be reopened: the linked PR implements the first prerequisite for the requested feature, but not the feature itself.