Mirror of https://github.com/ArchiveBox/ArchiveBox.git, synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #22] Add ability to do full-text search of archived text/markdown/html content #1527
Originally created by @ilvar on GitHub (Jun 11, 2017).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/22
As a user with lots of bookmarks, I want a search input to filter the list by text matches in URL, title, and extracted content.
http://elasticlunr.com/ can be used for simple client side full text search.
@pirate commented on GitHub (Jun 18, 2017):
This is definitely one of my goals, but I think `ag` is fast enough without involving the complexity of frontend offline search. (It also supports regex.) I just whipped this search script up; all that needs to be done now is to add a search field to the index.html which queries the server and highlights matching sites.
Usage: set `ARCHIVE_PATH` to your actual archive path, then query e.g. `http://127.0.0.1:8080/search?search=firefox`. `search.py`:
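A rough, pure-Python sketch of such a search endpoint (not the original script; the `ARCHIVE_PATH` setting and `?search=` query parameter come from the comment above, the file layout and helper names are assumptions):

```python
import os
import re
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

ARCHIVE_PATH = os.environ.get("ARCHIVE_PATH", "./archive")  # root dir of archived snapshots

def search_archive(pattern, root=None):
    """Return snapshot subdirectories whose text files match the given regex."""
    root = root or ARCHIVE_PATH
    regex = re.compile(pattern, re.IGNORECASE)
    hits = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith((".txt", ".html", ".md")):
                continue
            try:
                with open(os.path.join(dirpath, name), errors="ignore") as f:
                    if regex.search(f.read()):
                        hits.add(os.path.relpath(dirpath, root))
            except OSError:
                pass  # unreadable file: skip it rather than fail the search
    return sorted(hits)

class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Handles e.g. GET /search?search=firefox
        query = parse_qs(urlparse(self.path).query).get("search", [""])[0]
        body = "\n".join(search_archive(query)).encode() if query else b""
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

def main(port=8080):
    HTTPServer(("127.0.0.1", port), SearchHandler).serve_forever()
```

Scanning every file per request is O(archive size), which is exactly why the thread below moves toward a real index for larger archives.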
@pirate commented on GitHub (Jun 18, 2017):
Done: https://github.com/pirate/bookmark-archiver/pull/24
@pirate commented on GitHub (Sep 4, 2017):
@ilvar I spent a few hours this weekend implementing elasticlunr. It's an awesome piece of software, and I'll definitely end up using it in one of my projects eventually!
Unfortunately, it had a few deal-breaking problems:
If you want to take a crack at it, go ahead, maybe I missed some simple thing that could make it work out in the end.
Otherwise, I think I'm going to go the backend regex/grep route, since people are increasingly requesting backend features like a UI to add and organize new links. The backend will of course be optional; `archive.py` will still produce static html files, but if people want backend features they can run `server.py`, which provides a full UI.
[*] I tried indexing everything after `<body>` in the html files, but that led to a lot of wasted index storage for html tags, not to mention it broke stemming because `<b>word1 word2 word3` didn't reliably index `<b>` separately from `word1`. Without doing full XML parsing or other craziness, I don't see an easy way to do clean body text extraction.
@pirate commented on GitHub (Apr 23, 2019):
I'm going to reopen this as it will be feasible as soon as a solution for #69 is merged :)
It will be integrated directly into the new `archivebox server` feature that's already released in v0.4.0, and will likely use SQLite full-text indexing or RediSearch for fast searches through the entire archive.
@adamwolf commented on GitHub (Sep 18, 2020):
Hi @pirate! Do you have newer thoughts on how you'd like to see this implemented? (You had most recently mentioned SQLite full-text indexing or RediSearch in this issue.)
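For context on the SQLite option mentioned here: the FTS5 extension (bundled with Python's stdlib `sqlite3` in most builds) gives serverless full-text indexing, with the caveat raised in the next comment that the indexed text lives in the database. An illustrative sketch, not ArchiveBox's actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # a single file (or memory); no server process needed

# An FTS5 virtual table indexes every column for full-text search.
con.execute("CREATE VIRTUAL TABLE snapshots USING fts5(url, title, body)")
con.executemany(
    "INSERT INTO snapshots VALUES (?, ?, ?)",
    [
        ("https://example.com/a", "Firefox tips", "speed up firefox with these flags"),
        ("https://example.com/b", "Chrome notes", "an unrelated browser page"),
    ],
)

# MATCH searches the full-text index across all columns; bm25() ranks by relevance.
hits = con.execute(
    "SELECT url FROM snapshots WHERE snapshots MATCH ? ORDER BY bm25(snapshots)",
    ("firefox",),
).fetchall()
```

The trade-off the thread discusses is visible here: the `body` column (the full document text) is stored inside the database alongside the index.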
@pirate commented on GitHub (Sep 22, 2020):
I think there are a few possible approaches, I was personally leaning towards starting out by shelling out to something like
agorripgrepbecause it's blindingly simple and only requires a single binary dependency and no database management. That would let us search all the text-based outputs (singlefile, wget, DOM, readability, etc.) using regex or exact strings, and it would be reasonably fast up to the ~10GB mark depending on your disk speed. However, considering many users have large archives, this may just be a distraction from a proper search solution.SQLite and RediSearch both require adding the full document body to a central db, which will grow in size rapidly as the dataset increases, so I don't think either are suitable. (RediSearch also only supports English & Chinese, and is
x86_64-only at the moment, a painful drawback for our Raspi users.)I think the best solution going forward is something like https://github.com/valeriansaliou/sonic. It's not a document store index, so it doesn't actually store or return the full search-text internally, only document IDs, which is perfect for our use-case. It also supports tons of languages, is packaged nicely, and is relatively resource-efficient compared to a behemoth of complexity like ElasticSearch.
@cdvv7788 and @apkallum can you start thinking about how we'd add Sonic search support to our Django backend and UI?
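The property that makes Sonic appealing here, an index mapping search terms to document IDs without retaining the document bodies, can be shown with a toy inverted index (a hypothetical illustration, not Sonic's implementation):

```python
from collections import defaultdict

class InvertedIndex:
    """Maps term -> set of document IDs; the documents themselves live elsewhere."""

    def __init__(self):
        self.index = defaultdict(set)

    def push(self, doc_id, text):
        # Only tokens are kept; the body text is discarded after tokenization,
        # so the index stays small relative to the archived content.
        for term in text.lower().split():
            self.index[term].add(doc_id)

    def query(self, terms):
        # Return IDs of documents containing every query term (AND semantics).
        sets = [self.index.get(t, set()) for t in terms.lower().split()]
        return sorted(set.intersection(*sets)) if sets else []

idx = InvertedIndex()
idx.push("snap-001", "Firefox release notes")
idx.push("snap-002", "Chrome release notes")
hits = idx.query("release notes")  # IDs only; caller fetches snapshots by ID
```

The caller then resolves the returned IDs against its own store (in ArchiveBox's case, the snapshot directories), which is why this design avoids duplicating archive text in a central database.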
@cdvv7788 commented on GitHub (Sep 22, 2020):
Yes, I can give it a check. If it only stores a search index, that will be better than having all of the text in the database. I will let you know how the experiment goes.
@urbien commented on GitHub (Nov 15, 2020):
This feature would be a dream-come-true for me. My personal itch is to answer a question "where did I see something like this idea / notion recently"? I usually can also add when I saw it approximately, e.g. last month or 3-4 months ago. Google can't answer this question for me and no amount of bookmarking helps me.
Elasticsearch is overkill, but the underlying Lucene engine is just a library that will do the job without any server. But it is in Java, so startup is slow. A good alternative is https://github.com/tantivy-search/tantivy - fast, written in Rust, and with most of Lucene's powerful search capabilities.
@IvanVas commented on GitHub (Dec 17, 2020):
Thanks for implementing the full-text search. There is a problem running Sonic on ARM, though: https://github.com/valeriansaliou/sonic/issues/218
@pirate commented on GitHub (Dec 17, 2020):
Yeah, I suspect with the M1 they will implement it soon, but the fallback with ripgrep is quite good on smaller archives. Have you had a chance to try the `ripgrep` backend? It's quite modular, so if you have another alternative in mind for ARM we could add it as an alternative backend easily.
@IvanVas commented on GitHub (Dec 17, 2020):
I'm testing on a Nano Pi (an SBC similar to the Raspberry Pi).
ripgrep is also not installed in the latest docker image I have (linux, arm64).
@pirate commented on GitHub (Dec 17, 2020):
The latest docker image is v0.4.24, not v0.5.0 (the version with sonic and ripgrep). You must clone the branch locally and build the docker image to test it:
That version ^ should have both ripgrep and sonic available for testing.
@pirate commented on GitHub (Feb 1, 2021):
This was released in `v0.5.3` and improved in `v0.5.4`, courtesy of @jdcaballerov. Further improvements are still planned, but we can track those in separate tickets.
@valeriansaliou commented on GitHub (Nov 9, 2021):
Hello, following up there as this issue has been referenced in a Sonic issue. I'm the Sonic author, and it's now supported on ARM (it also supports Apple Silicon). MUSL static builds are not yet possible, but more traditional builds based on glibc work perfectly on all architectures now.
See the associated `v1.3.1` release: https://github.com/valeriansaliou/sonic/releases/tag/v1.3.1
@pirate commented on GitHub (Nov 12, 2021):
Thanks for following up @valeriansaliou! I'll bump the Sonic version in our next release.