[GH-ISSUE #22] Add ability to do full-text search of archived text/markdown/html content #1527

Closed
opened 2026-03-01 17:51:26 +03:00 by kerem · 15 comments

Originally created by @ilvar on GitHub (Jun 11, 2017).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/22

As a user with lots of bookmarks, I want to have a search input to filter list by text matches in URL, title and extracted content.

http://elasticlunr.com/ can be used for simple client-side full-text search.

@pirate commented on GitHub (Jun 18, 2017):

This is definitely one of my goals, but I think ag is fast enough without involving the complexity of frontend offline search. (It also supports regex.)
I just whipped this search script up; all that needs to be done now is to add a search field to the index.html that queries the server and highlights matching sites.

Usage:

  1. set ARCHIVE_PATH to your actual archive path
  2. use it on the CLI:
     ./search.py "train number \d+"                 # search with regex
     ./search.py --exact "[1] (citation needed)."   # search for an exact string
  3. use it as a REST service:
     ./search.py --server 8080

http://127.0.0.1:8080/search?search=firefox ->

/1409779227/www.mozilla.org/en-US/firefox/help/index.html
/1409779227.1/www.mozilla.org/en-US/firefox/customize/index.html
/1409779227.3/www.mozilla.org/media/img/firefox/horizon/stars.8398dac91f60.svg
/1497562974/www.ghacks.net/2009/07/23/firefox-bookmarks-archiver/index.html
/1497562974.2/duckduckgo.com/index.html?q=firefox+export+bookmarks&t=ffhp&ia=web.html

search.py:

import sys
from subprocess import run, PIPE, DEVNULL

ARCHIVE_PATH = '/Users/yourusername/Documents/bookmark-archiver/bookmarks/archive'

if run(['which', 'ag'], stdout=DEVNULL, stderr=DEVNULL).returncode:
    print("[X] Please install ag the silver searcher:\n\t apt install silversearcher-ag\n\t brew install the_silver_searcher")
    raise SystemExit(1)

def search_archive(pattern, regex=False):
    args = '-l' if regex else '-Ql'  # -l lists files whose contents match; -Q treats the pattern as a literal string
    ag = run(['ag', args, pattern, ARCHIVE_PATH], stdout=PIPE, stderr=PIPE, timeout=60)
    return (l.decode().replace(ARCHIVE_PATH, '') for l in ag.stdout.splitlines())


def server(port=8080):
    try:
        from flask import Flask
        from flask import request
    except ImportError:
        print('[X] Please install Flask to use the search server: pip install Flask')
        raise SystemExit(1)

    app = Flask('Bookmark Archive')

    @app.route("/search", methods=['GET'])
    def search():
        pattern = request.args.get('search', '')
        use_regex = request.args.get('regex', '')
        return '\n'.join(search_archive(pattern, use_regex))

    app.run(port=port)


if __name__ == '__main__':
    argc = len(sys.argv)
    if '--server' in sys.argv:
        port = int(sys.argv[2]) if argc > 2 else 8080
        server(port)
    else:
        pattern = sys.argv[2] if argc > 2 else sys.argv[1]
        verbatim = argc > 2  # assumes only possible argument is --exact

        matches = search_archive(pattern, regex=not verbatim)
        print('\n'.join(matches))

@pirate commented on GitHub (Jun 18, 2017):

Done: https://github.com/pirate/bookmark-archiver/pull/24


@pirate commented on GitHub (Sep 4, 2017):

@ilvar I spent a few hours this weekend implementing elasticlunr. It's an awesome piece of software, and I'll definitely end up using it in one of my projects eventually!

Unfortunately, it had a few deal-breaking problems:

  • it's difficult to extract clean, searchable text from the archived sites *
  • the indexes produced from <100 articles are in the 3-6 MB range, with sizes exceeding 15 MB for 200+ articles (the index alone, without full documents embedded)
  • past 200 articles, the indexes caused stack-size-exceeded errors when serializing & deserializing them

If you want to take a crack at it, go ahead; maybe I missed something simple that could make it work out in the end.

Otherwise, I think I'm going to go the backend regex/grep route, since people are increasingly requesting backend features like a UI to add and organize new links. The backend will of course be optional; archive.py will still produce static HTML files, but if people want backend features they can run server.py, which provides a full UI.

[*] I tried indexing everything after <body> in the HTML files, but that led to a lot of wasted index storage for HTML tags, not to mention it broke stemming because <b>word1 word2 word3 didn't reliably index <b> separately from word1. Without doing full XML parsing or other craziness, I don't see an easy way to do clean body text extraction.
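For what it's worth, a minimal sketch of clean body-text extraction is possible with only Python's stdlib html.parser, collecting text inside <body> while skipping <script>/<style> contents (the class and function names here are illustrative, not part of the project):

```python
from html.parser import HTMLParser

class BodyTextExtractor(HTMLParser):
    """Collect visible text inside <body>, skipping <script>/<style>/<noscript>."""
    SKIP = {'script', 'style', 'noscript'}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == 'body':
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == 'body':
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)

def extract_body_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    # Collapse whitespace so "<b>word1</b> word2" indexes as "word1 word2"
    return ' '.join(' '.join(parser.chunks).split())
```

Because the parser fires handle_data separately for each text node, tags like <b> never leak into the indexed tokens, which sidesteps the stemming problem described above.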


@pirate commented on GitHub (Apr 23, 2019):

I'm going to reopen this, as it will be feasible as soon as a solution for #69 is merged :)

It will be integrated directly into the new archivebox server feature that's already released in v0.4.0, and will likely use SQLite full-text indexing or RediSearch for fast searches through the entire archive.
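As a sketch of the SQLite option: Python's stdlib sqlite3 module can create an FTS5 virtual table in most SQLite builds. The table and column names below are illustrative, not ArchiveBox's actual schema:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# An FTS5 virtual table indexing snapshot text, keyed by snapshot timestamp
conn.execute('CREATE VIRTUAL TABLE snapshot_text USING fts5(timestamp, content)')
conn.executemany(
    'INSERT INTO snapshot_text VALUES (?, ?)',
    [('1409779227', 'Firefox help and customization guide'),
     ('1497562974', 'Exporting bookmarks from a browser')],
)

def search(query):
    """Return the timestamps of snapshots whose text matches the FTS query."""
    rows = conn.execute(
        'SELECT timestamp FROM snapshot_text WHERE snapshot_text MATCH ?',
        (query,))
    return [ts for (ts,) in rows]
```

The default unicode61 tokenizer is case-insensitive, so search('firefox') finds the "Firefox" document; the trade-off noted later in the thread is that the full document body must be inserted into the central database for indexing.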


@adamwolf commented on GitHub (Sep 18, 2020):

Hi @pirate! Do you have newer thoughts on how you'd like to see this implemented? (You had most recently mentioned SQLite full-text indexing or RediSearch in this issue).


@pirate commented on GitHub (Sep 22, 2020):

I think there are a few possible approaches, I was personally leaning towards starting out by shelling out to something like ag or ripgrep because it's blindingly simple and only requires a single binary dependency and no database management. That would let us search all the text-based outputs (singlefile, wget, DOM, readability, etc.) using regex or exact strings, and it would be reasonably fast up to the ~10GB mark depending on your disk speed. However, considering many users have large archives, this may just be a distraction from a proper search solution.
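The shell-out approach can be sketched in a few lines: ripgrep's --files-with-matches flag prints only the paths of files whose contents match, which is all the UI needs. The function names and paths below are illustrative:

```python
import subprocess

def build_search_cmd(pattern, archive_dir, exact=False):
    """Build a ripgrep invocation that prints only matching file paths."""
    cmd = ['rg', '--files-with-matches']
    if exact:
        cmd.append('--fixed-strings')  # treat the pattern as a literal string
    cmd += [pattern, archive_dir]
    return cmd

def search_snapshots(pattern, archive_dir, exact=False):
    """Return snapshot file paths whose extracted text matches the pattern."""
    # rg exits 1 when nothing matches, so don't use check=True here
    proc = subprocess.run(build_search_cmd(pattern, archive_dir, exact),
                          capture_output=True, text=True, timeout=60)
    return proc.stdout.splitlines()
```

The same two-mode split (regex vs. exact string) mirrors the ag-based search.py earlier in this thread; only the binary and flag spellings differ.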

SQLite and RediSearch both require adding the full document body to a central db, which will grow in size rapidly as the dataset increases, so I don't think either are suitable. (RediSearch also only supports English & Chinese, and is x86_64-only at the moment, a painful drawback for our Raspi users.)

I think the best solution going forward is something like https://github.com/valeriansaliou/sonic. It's not a document store index, so it doesn't actually store or return the full search-text internally, only document IDs, which is perfect for our use-case. It also supports tons of languages, is packaged nicely, and is relatively resource-efficient compared to a behemoth of complexity like ElasticSearch.
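To illustrate why an ID-only index stays small: conceptually, Sonic maps search terms to document IDs and never stores the document text itself. A toy in-memory version of that idea (not Sonic's actual protocol or API):

```python
from collections import defaultdict

class IdOnlyIndex:
    """Toy term -> doc-ID inverted index; stores no document text."""
    def __init__(self):
        self.terms = defaultdict(set)

    def ingest(self, doc_id, text):
        """Index a document's text; only its ID is retained in the index."""
        for term in text.lower().split():
            self.terms[term].add(doc_id)

    def query(self, word):
        # Returns only IDs; the caller fetches the snapshot from disk by ID
        return sorted(self.terms.get(word.lower(), set()))
```

Since each posting is just a short ID string, the index grows with vocabulary size rather than with total archive size, which is the property that makes this shape attractive for large archives.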

@cdvv7788 and @apkallum can you start thinking about how we'd add Sonic search support to our Django backend and UI?


@cdvv7788 commented on GitHub (Sep 22, 2020):

Yes, I can give it a check. If it only stores a search index, that will be better than having all of the text in the database. I will let you know how the experiment goes.


@urbien commented on GitHub (Nov 15, 2020):

This feature would be a dream-come-true for me. My personal itch is to answer a question "where did I see something like this idea / notion recently"? I usually can also add when I saw it approximately, e.g. last month or 3-4 months ago. Google can't answer this question for me and no amount of bookmarking helps me.

Elasticsearch is overkill, but the underlying Lucene engine is just a library that will do the job without any server. It's written in Java, though, so startup is slow. A good alternative is https://github.com/tantivy-search/tantivy - fast, written in Rust, and with most of Lucene's powerful search capabilities.


@IvanVas commented on GitHub (Dec 17, 2020):

Thanks for implementing the full-text search. There is a problem running the Sonic on Arm though: https://github.com/valeriansaliou/sonic/issues/218


@pirate commented on GitHub (Dec 17, 2020):

Yeah, I suspect with the M1 they will implement it soon, but the fallback with ripgrep is quite good on smaller archives. Have you had a chance to try the ripgrep backend? It's quite modular so if you have another alternative in mind for ARM we could add it as an alternative backend easily.


@IvanVas commented on GitHub (Dec 17, 2020):

> Yeah, I suspect with the M1 they will implement it soon, but the fallback with ripgrep is quite good on smaller archives. Have you had a chance to try the ripgrep backend? It's quite modular so if you have another alternative in mind for ARM we could add it as an alternative backend easily.

I'm testing on a Nano Pi (an SBC similar to the Raspberry Pi).
ripgrep is also not installed in the latest docker image I have (linux, arm64).


@pirate commented on GitHub (Dec 17, 2020):

The latest docker image is v0.4.24, not v0.5.0 (the version with sonic and ripgrep). You must clone the branch locally and build the docker image to test it:

git clone https://github.com/ArchiveBox/ArchiveBox
cd ArchiveBox
git checkout v0.5.0
docker build . -t archivebox

docker run -v $PWD:/data -it archivebox/archivebox version

That version ^ should have both ripgrep and sonic available for testing.


@pirate commented on GitHub (Feb 1, 2021):

This was released in v0.5.3 and improved in v0.5.4, courtesy of @jdcaballerov. Further improvements are still planned, but we can track those in separate tickets.


@valeriansaliou commented on GitHub (Nov 9, 2021):

Hello, following up here as this issue has been referenced in a Sonic issue. I'm the Sonic author, and Sonic is now supported on ARM (including Apple Silicon). MUSL static builds are not yet possible, but more traditional glibc-based builds work perfectly on all architectures now.

See associated v1.3.1 release: https://github.com/valeriansaliou/sonic/releases/tag/v1.3.1


@pirate commented on GitHub (Nov 12, 2021):

Thanks for following up @valeriansaliou! I'll bump the Sonic version in our next release.
