[GH-ISSUE #267] Can we have vector database along with tags? #178

Open
opened 2026-03-02 11:47:21 +03:00 by kerem · 21 comments
Owner

Originally created by @echo-saurav on GitHub (Jul 1, 2024).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/267

Automatically getting tags with Ollama is quite nice! But I think it would be even better if it also stored text vectors, so we could search by similar text, or add a filter that groups similar links and images together.
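To make the request concrete, here is a minimal sketch of what "search by similar text" means once every bookmark has a stored vector. The bookmark ids and three-dimensional vectors are made-up toy data (real embeddings come from a model and have hundreds of dimensions), and the function names are hypothetical, not part of the project.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('vector lengths differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // treat zero vectors as dissimilar
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k stored items most similar to the query vector.
function findSimilar(queryVector, items, k = 3) {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(queryVector, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Hypothetical toy data standing in for stored bookmark embeddings.
const bookmarks = [
  { id: 'rust-book', vector: [0.9, 0.1, 0.0] },
  { id: 'cooking-blog', vector: [0.0, 0.2, 0.9] },
  { id: 'go-tutorial', vector: [0.8, 0.3, 0.1] },
];

const top = findSimilar([1, 0, 0], bookmarks, 2);
console.log(top.map((r) => r.id));
```

A vector database does the same ranking, but with an index so it does not have to compare the query against every stored vector.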


@MohamedBassem commented on GitHub (Jul 1, 2024):

This is definitely planned and I actually have a prototype for it already :) Just trying to find a reasonable vector db without adding extra dependencies.


@echo-saurav commented on GitHub (Jul 1, 2024):

Awesome!
You could look at Weaviate (https://weaviate.io/). It supports both text and image vectors (I know it's not really lightweight, but I like it a lot because of how customizable it is).


@ieaves commented on GitHub (Jul 1, 2024):

You can actually do vector search in postgres pretty easily. Postgres would also have the side benefit of not requiring the database to be mounted into every container.
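As a rough illustration of the Postgres route, the sketch below builds a parameterized nearest-neighbour query. It assumes the pgvector extension (the comment does not name one), a hypothetical `bookmarks` table with an `embedding` vector column, and the `{ text, values }` query shape accepted by node-postgres. Nothing here reflects the project's actual schema.

```javascript
// Serialize a numeric array into pgvector's text literal, e.g. [1,2,3].
function toPgVector(values) {
  if (!values.every((v) => Number.isFinite(v))) {
    throw new Error('embedding must contain only finite numbers');
  }
  return `[${values.join(',')}]`;
}

// Build a parameterized similarity query.
// `<=>` is pgvector's cosine-distance operator; table and column names are assumptions.
function buildSimilarityQuery(embedding, limit = 10) {
  return {
    text: 'SELECT id, title FROM bookmarks ORDER BY embedding <=> $1::vector LIMIT $2',
    values: [toPgVector(embedding), limit],
  };
}

const pgQuery = buildSimilarityQuery([0.1, 0.2, 0.3], 5);
console.log(pgQuery);
```

The resulting object could be passed to a Postgres client as-is; the embedding itself would still have to come from a model such as the one already used for tagging.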


@MohamedBassem commented on GitHub (Jul 1, 2024):

The problem is that hoarder currently doesn't depend on postgres, so introducing postgres as a dependency now would be very disruptive. If I were starting hoarder from scratch, I'd have gone for postgres for everything (database, FTS, vector search, etc). But it's too late now, unfortunately.


@ieaves commented on GitHub (Jul 1, 2024):

Ahh okay, I'm not familiar with Drizzle, but perusing the docs made it look like a fairly simple drop-in replacement.


@MohamedBassem commented on GitHub (Jul 1, 2024):

it's less about the code changes and more about asking every existing user to add a new dependency and migrate their data.


@kamtschatka commented on GitHub (Jul 2, 2024):

I actually REALLY like the idea of moving to postgres.

  • We get rid of SQLite:
      • It is a toy db anyways. Every project is fated to run into SQLite limitations sooner or later for its bigger users. There are already requests to use OAuth, so I can really imagine that people have bigger plans
      • Postgres is a common database and people might already have one in their homelab (I certainly do)
  • We can get rid of Meilisearch:
      • It needs to be kept in sync with the db all the time
      • We already had issues where it got out of sync after some tag merging
  • We also have benefits for the search improvements I made. It is currently juggling a lot of stuff around in memory, since searching happens in Meilisearch and in the DB. That could be changed into optimized queries
  • We decouple the Worker from the Web app a little bit. For the database you no longer need to have the same location mounted on both apps (which already caused issues as well). It is way more obvious to configure the same database in the config. Granted, we STILL need the same directory mounted for writing the assets to the disk, but theoretically that could also be changed in the future to e.g. allow remote scrapers that are placed outside your network with only a network connection to the web app.

Yes, I understand that it is disruptive, but there is already a section in the release notes on what to keep an eye on and if we update the UI to show that you need to add new environment variables with a postgres db and we offer an automatic data migration (could be version 0.16.0 with not much else), we could keep the disruption low AND open up a whole lot of possibilities for us in the future.


@MohamedBassem commented on GitHub (Jul 7, 2024):

> It is a toy db anyways.

I pretty much disagree that sqlite is a toy database. Cloudflare's D1 database, for example, is built on top of sqlite. Other companies like fly.io and turso are also offering prod databases built on top of sqlite. Tailscale, for example, also embraced sqlite in prod. We're very far from approaching the limits of sqlite. It also fits us well because we don't need a client/server architecture, given that our deployments are usually on a single machine.

> We can get rid of Meilisearch:

Sqlite includes full text search btw (https://www.sqlite.org/fts5.html) and the extension is already enabled in our docker containers. I haven't given it a try, so I don't know how good it is compared to Meilisearch's. I haven't tried postgres' FTS either. So if getting rid of Meilisearch is a goal, there's a route to do it on sqlite as well.

Two limitations that I know about in sqlite's FTS (that I don't know if pg handles better):

  1. I remember searching for good packages in npm to interact with sqlite's FTS, but didn't find decent libraries.
  2. Sqlite's FTS doesn't support fuzzy search, at least natively.
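As a sketch of how the second limitation is sometimes softened, the helper below turns free-form user input into an FTS5 `MATCH` expression where every word is a quoted string and the last word is a prefix match. This is prefix matching, not true fuzzy search, and the function name and the `bookmarks_fts` table are hypothetical. Quoting each token (with embedded double quotes doubled) follows the FTS5 query syntax and keeps user input from being parsed as operators.

```javascript
// Turn free-form user input into an FTS5 MATCH expression.
// Each word becomes a quoted FTS5 string; the last word optionally gets a trailing `*`
// so that a partially typed word still matches (prefix search, not fuzzy search).
function toFts5Query(input, { prefix = true } = {}) {
  const tokens = input.trim().split(/\s+/).filter(Boolean);
  return tokens
    .map((token, i) => {
      const quoted = `"${token.replace(/"/g, '""')}"`;
      const isLast = i === tokens.length - 1;
      return prefix && isLast ? `${quoted}*` : quoted;
    })
    .join(' ');
}

// Hypothetical FTS5 table name; `rank` is FTS5's built-in relevance ordering.
const ftsSql = 'SELECT rowid FROM bookmarks_fts WHERE bookmarks_fts MATCH ? ORDER BY rank';
const ftsParam = toFts5Query('sqlite vec');
console.log(ftsSql, ftsParam);
```

An empty input produces an empty string, which FTS5 rejects, so a caller would need to skip the query in that case.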

> It is currently juggling a lot of stuff around in memory

This can be solved if we're to move to sqlite's FTS.

> For the database you no longer need to have the same location mounted on both apps

I've been actually thinking about going the route that immich went. Just merge the workers and web containers into one. That'll simplify the deployment a bit without sacrificing on anything. I initially went with separate container for the worker as the worker was the one spawning the chrome process and I didn't want this to be mixed with the web container. But now chrome is in its own container and we can probably just spin up the workers as a background job inside the web container.

> Yes, I understand that it is disruptive

This "IS" my biggest concern. It is very disruptive and we will lose some users because of that move. I'm for example, still stuck on old immich releases because I don't have time to go through all the recent breaking changes that they introduced. I want hoarder to just work for people, and regardless of how many bells we add to the UI, we're going to break some deployment with this migration, and I don't really want this to happen.

I understand that sometimes this is a cost we'll have to pay, but so far, I'm not seeing the strong justification to pay it just yet.

Another Route

There's another route we can take though. We can double down on sqlite:

  1. Merge web and workers container.
  2. Migrate away from Meilisearch to sqlite's FTS (if it's good enough).
  3. Migrate away from bullmq to a queue built on top of sqlite. We're not a high-QPS service anyways, so it shouldn't be that hard.
  4. There's a WIP sqlite vector search extension (https://github.com/asg017/sqlite-vec) that I've been keeping an eye on, and it seems like it recently got sponsors from mozilla, turso, fly, etc. We can adopt this for our vector database as well once it's mature.

@wbste commented on GitHub (Sep 29, 2024):


Yeah, I'm 100% on board with doubling down on sqlite. All your points above are valid, it's portable, and it supports FTS and similarity search with the extensions you highlighted... would love to add semantic (or adjustable-weight hybrid) search :)

Another option (which I stumbled on but never tried myself) is this zero-dependency sqlite vector search project: https://github.com/JarkkoPar/sqlite-ndvss. Of course you still need something to do the text-to-vector conversion, but the processing looks about as lightweight as it can be for sqlite!
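For readers unfamiliar with the term, "adjustable-weight hybrid search" can be sketched as a weighted blend of a keyword score and a semantic score per document. The function below is a generic illustration under the assumption that both scores are already normalized to the 0..1 range; it is not how any particular engine merges results, and all names and numbers are made up.

```javascript
// Blend keyword and semantic scores with an adjustable weight.
// semanticWeight = 0 ranks purely by keyword score, 1 purely by semantic score.
// Documents missing from one of the two result sets get a score of 0 on that side.
function hybridRank(keywordScores, semanticScores, semanticWeight = 0.5) {
  if (semanticWeight < 0 || semanticWeight > 1) {
    throw new RangeError('semanticWeight must be between 0 and 1');
  }
  const ids = new Set([...Object.keys(keywordScores), ...Object.keys(semanticScores)]);
  return [...ids]
    .map((id) => ({
      id,
      score:
        (1 - semanticWeight) * (keywordScores[id] ?? 0) +
        semanticWeight * (semanticScores[id] ?? 0),
    }))
    .sort((x, y) => y.score - x.score);
}

// Hypothetical scores for three bookmarks.
const keywordScores = { a: 0.9, b: 0.2 };
const semanticScores = { b: 0.8, c: 0.5 };

console.log(hybridRank(keywordScores, semanticScores, 0).map((r) => r.id));
console.log(hybridRank(keywordScores, semanticScores, 1).map((r) => r.id));
```

Exposing the weight as a setting lets users who mostly search for exact titles keep keyword-heavy ranking, while others lean on similarity.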


@huyz commented on GitHub (Oct 7, 2024):

My two cents: it seems too early in Hoarder history to worry about disruption. Most of us have been using Hoarder for just a few months?

https://star-history.com/#hoarder-app/hoarder&Date shows:
[screenshot of the star-history chart, taken 2024-10-07]

Looks to me like Hoarder is just starting to take off. If anything, now is the time to make big changes before it becomes impossible later when you're at 50K stars :)

And as @kamtschatka argued, everyone has PostgreSQL in their homelab, often several times over.


@airdogvan commented on GitHub (Oct 16, 2024):

It seems to me that, aside from any technical consideration about the merits of SQLite or other databases, your project won't really be seriously considered by most people looking for a stable long-term solution unless it uses a database engine that is recognized as reliable, such as Postgres, MySQL or MongoDB.

I may be wrong but I doubt it...


@MohamedBassem commented on GitHub (Oct 16, 2024):

I mentioned this before and I'll mention it again. I care about backward compatibility a lot and I'll not be breaking it unless there are extremely good reasons. So far, there are no good reasons and we're sticking with sqlite for the foreseeable future.

People take a project seriously for how stable it is and not for the database engine it's using.


@acelinkio commented on GitHub (Oct 26, 2024):

Doubling down on SQLite feels like a premature optimization. SQLite is growing in maturity and does have some promising use cases for ultra-low latency. However, that comes with a tradeoff: SQLite does not have the same tooling to reduce operational overhead for things like clustering, backup/restore, or scaling. Technically you can distribute SQLite and scale horizontally with some clever filesystems, but those approaches require specialized infrastructure pieces not typically hosted in homelabs.

You mentioned that if you were starting from scratch, Postgres would have been selected. I strongly encourage leaning into the architecture design you want instead of succumbing to the problems of getting there. I understand the desire not to introduce breaking changes, but major changes should be expected from any project in a v0 state. Create a migration plan and just keep charging forward. Or maybe support both SQLite and Postgres based deployments. Either way, I've seen way too many SQLite databases get corrupted inside of docker/kubernetes.

*edit small wording changes


@grapemix commented on GitHub (Oct 30, 2024):

  1. Since you already state "This app is under heavy development and it's far from stable.", I think users already expect backward-compatibility problems. So as long as your export and import features work correctly, don't worry about it. ;)

  2. Your target users are not regular non-technical users. People who choose to self-host your app are technical, and of course we do care about the db choice.

  3. I am not sure how SQLite's concurrency performs in an environment with multiple workers and multiple users, but it will be really painful if you decide to switch DBs at a late stage.

  4. In short, switching to Postgres doesn't necessarily mean hoarder loses users. It might attract other types of users.

    Since sqlite depends on a persistent volume, it needs a volume that can handle multiple readers and writers if multiple pods are involved, and that feature is only available in some persistent volume types such as CEPH/OpenEBS.

    Lots of people's homelabs already have a postgres operator installed, but not many people have installed CEPH or an alternative. If an admin wants to run multiple instances/pods/containers/workers of hoarder in a multi-machine environment (not hostPath), they can only self-host hoarder if it uses a postgres DB, unless they also install CEPH or an alternative.

  5. See https://github.com/tensorchord/pgvecto.rs/ if you want to take a look at postgres + vector support.

  6. Data integration, migration and monitoring are important to us. Trusted tools and technology like WALs, Grafana dashboards and Grafana alerts make us feel at home (I have alerts and a dashboard for each pg cluster). >.< We don't just install your app and forget it. Just like adopting a pet, we have to make sure it plays nice with the others.

  7. Even after switching to postgres, I think Meilisearch is still useful, because we can share Meilisearch instances and use Meilisearch to search for resources from other apps.

Nevertheless, if you think switching to Postgres will cost you too much time or put too much pressure on you, it is OK to stay on SQLite. It is not the end of the world.


@csanchez-jetdev commented on GitHub (Nov 5, 2024):

I noticed that Meilisearch (which is already used in the project) has recently added vector search capabilities (https://www.meilisearch.com/blog/introducing-vector-search) that could be leveraged here. This would allow us to add vector search without introducing new dependencies or changing the database architecture.

Meilisearch's vector search feature supports multiple embedders including Ollama (which is already used for tags) and OpenAI. We could configure it like this:

const configureVectorSearch = async () => {
  try {
    const response = await fetch('http://localhost:7700/indexes/bookmarks/settings', {
      method: 'PATCH',
      headers: {
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        embedders: {
          default: {
            source: 'ollama',
            model: 'llama3.1', // or another compatible model
            documentTemplate: 'A bookmark titled {{doc.title}} whose content starts with {{doc.content|truncatewords: 20}}'
          }
        }
      })
    });

    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }

    const data = await response.json();
    return data;
  } catch (error) {
    console.error('Error configuring vector search:', error);
    throw error;
  }
};

// Example usage with MeiliSearch client
import { MeiliSearch } from 'meilisearch';

const configureMeiliSearchVectors = async () => {
  const client = new MeiliSearch({
    host: 'http://localhost:7700',
    apiKey: 'your-api-key'
  });

  try {
    await client.index('bookmarks').updateSettings({
      embedders: {
        default: {
          source: 'ollama',
          model: 'llama3.1',
          documentTemplate: 'A bookmark titled {{doc.title}} whose content starts with {{doc.content|truncatewords: 20}}'
        }
      }
    });
  } catch (error) {
    console.error('Error configuring MeiliSearch vectors:', error);
    throw error;
  }
};

The MeiliSearch client version is probably preferable since it handles authentication and other details more cleanly.

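// Example search function using vector/hybrid search.
// The MeiliSearch client is passed in as an argument, because the `client`
// created inside configureMeiliSearchVectors is not in scope here.
const performHybridSearch = async (client, query, semanticRatio = 0.5) => {
  try {
    // The query string is passed as the first argument to search(),
    // so it does not need to be repeated as `q` in the parameters.
    const searchParams = {
      hybrid: {
        semanticRatio: semanticRatio,
        embedder: 'default'
      }
    };

    const results = await client.index('bookmarks').search(query, searchParams);
    return results;
  } catch (error) {
    console.error('Error performing hybrid search:', error);
    throw error;
  }
};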

Why use Meilisearch vector:

  • No new dependencies - uses existing Meilisearch integration
  • Works with the Ollama and OpenAI setup used for tags
  • Can be enabled as an experimental feature without breaking changes
  • Supports hybrid search (combining keyword + semantic search)

Would love to hear thoughts on this approach as it seems to align well with the project's goals while maintaining stability and avoiding major architectural changes.
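
To make the `documentTemplate` setting above more concrete: it controls the text that is sent to the embedder for each bookmark. The sketch below approximates that rendering in plain JavaScript, assuming Liquid's usual `truncatewords` behaviour (keep the first N words and append an ellipsis when text was cut). The helper names and the sample document are made up for illustration; Meilisearch renders the real template itself.

```javascript
// Approximates Liquid's `truncatewords` filter: keep the first `count`
// words and append an ellipsis only when something was cut off.
const truncateWords = (text, count) => {
  const words = String(text ?? '').split(/\s+/).filter(Boolean);
  if (words.length <= count) return words.join(' ');
  return `${words.slice(0, count).join(' ')}...`;
};

// Hand-written equivalent of the documentTemplate string used above.
// This is an illustration only, not code that Meilisearch runs.
const renderBookmarkTemplate = (doc) =>
  `A bookmark titled ${doc.title} whose content starts with ${truncateWords(doc.content, 20)}`;

// Hypothetical bookmark with 30 words of content.
const sample = {
  title: 'SQLite over NFS',
  content: Array.from({ length: 30 }, (_, i) => `word${i + 1}`).join(' ')
};

console.log(renderBookmarkTemplate(sample));
```

Only the title and the first 20 words of content would reach the embedder with this template, so a longer excerpt or extra fields (tags, notes) would need a different template.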


@hadleyrich commented on GitHub (Dec 22, 2024):

I get not wanting to add new dependencies, but having postgres as an option would be pretty nice for HA setups. Not having to have persistent volumes is a pretty big plus from my point of view.

I would much rather have a container I can spin up and down on any host and point to a postgres DB and S3 storage than to mess around with local volumes. HA is a big thing for me.

Besides that, I've only just come across Hoarder and deployed it this evening. It's a fantastic looking app so congrats on that and obviously it's yours to take whichever direction you want :)


@rafaribe commented on GitHub (Jan 8, 2025):

> I get not wanting to add new dependencies, but having postgres as an option would be pretty nice for HA setups. Not having to have persistent volumes is a pretty big plus from my point of view.
>
> I would much rather have a container I can spin up and down on any host and point to a postgres DB and S3 storage than to mess around with local volumes. HA is a big thing for me.
>
> Besides that, I've only just come across Hoarder and deployed it this evening. It's a fantastic looking app so congrats on that and obviously it's yours to take whichever direction you want :)

This. I already have all the infrastructure set up for postgres (backups tested, the works), and I believe most people with a homelab do as well.

@MohamedBassem Why don't you introduce an interface that abstracts the database away? That way people can use SQLite or Postgres or anything else that fits hoarder.
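
As a rough illustration of the abstraction suggested above, the sketch below defines a small storage contract and an in-memory adapter that satisfies it. Everything here is hypothetical: the method names (`saveBookmark`, `getBookmark`), the `assertStore` check, and the in-memory adapter are stand-ins, not Hoarder's actual data layer. A real SQLite or Postgres adapter would implement the same methods on top of its own driver.

```javascript
// Contract every backend must satisfy (method names are hypothetical).
const REQUIRED_METHODS = ['saveBookmark', 'getBookmark'];

// Fails fast if an adapter does not implement the whole contract.
const assertStore = (store) => {
  for (const name of REQUIRED_METHODS) {
    if (typeof store[name] !== 'function') {
      throw new TypeError(`store is missing method: ${name}`);
    }
  }
  return store;
};

// In-memory stand-in for a real SQLite- or Postgres-backed adapter.
const createMemoryStore = () => {
  const rows = new Map();
  return assertStore({
    async saveBookmark(bookmark) {
      rows.set(bookmark.id, { ...bookmark });
      return bookmark.id;
    },
    async getBookmark(id) {
      return rows.get(id) ?? null;
    }
  });
};

// Application code depends only on the contract, not on the engine.
const archiveLink = async (store, id, url) => {
  await store.saveBookmark({ id, url });
  return store.getBookmark(id);
};
```

With this shape, swapping engines means passing a different adapter to `archiveLink`; the calling code stays the same.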


@caquillo07 commented on GitHub (Mar 24, 2025):

Forcing SQLite is the only thing preventing us from deploying this at my place of work; not being able to deploy it properly as HA without external file systems is a huge drawback. Having a DB abstraction that lets you plug in any database is a common pattern; in fact, I'm sure drizzle supports this out of the box as well.

Would you be open to PRs to allow for this? Essentially SQLite would be the default, but if a new DB connection string is passed, it can use that one instead.


@rymurr commented on GitHub (Apr 24, 2025):

> I noticed that Meilisearch (which is already used in the project) has recently added vector search capabilities that could be leveraged here. This would allow us to add vector search without introducing new dependencies or changing the database architecture.

I configured this manually for my local install and it works great. I manually created the vector index w/ curl and added

hybrid: {
        semanticRatio: 0.8,
        embedder: 'default'
      }

to the search endpoint. I've found it works great!
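
For anyone wanting to try the same thing by hand, here is a minimal sketch of the search request with the `hybrid` block added, written against Meilisearch's REST search route. The `bookmarks` index name and `default` embedder are taken from the earlier example in this thread and are assumptions about the local setup; the function name is made up.

```javascript
// Builds the HTTP request for a hybrid search. The index name and the
// embedder name are assumptions; adjust them to match your own instance.
const buildHybridSearchRequest = (query, semanticRatio = 0.8) => {
  if (typeof semanticRatio !== 'number' || semanticRatio < 0 || semanticRatio > 1) {
    throw new RangeError('semanticRatio must be a number between 0 and 1');
  }
  return {
    path: '/indexes/bookmarks/search',
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      q: query,
      hybrid: { semanticRatio, embedder: 'default' }
    })
  };
};

// Example: inspect the request that would be sent for a query.
console.log(buildHybridSearchRequest('postgres backups'));
```

A `semanticRatio` of 0 leans entirely on keyword matching and 1 entirely on the embedder, so 0.8 weights results strongly toward semantic similarity.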


@Fmstrat commented on GitHub (Jun 26, 2025):

Why not do what Nextcloud does?

  • The default DB is SQLite
  • If you want to use Postgres, set a POSTGRES_CONN_STR ENV variable
  • In the documentation's docker-compose.yml, add the postgres instance and config for new users

This doesn't break legacy users, and allows for a better system going forward.

There are good arguments for how SQLite is better than it has been, but it's still a single file single connection DB. Locking will be an issue when using the CLI, for instance. Disk corruption can be more impactful, etc.
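
A minimal sketch of the opt-in selection described in the list above: SQLite stays the default, and Postgres is used only when `POSTGRES_CONN_STR` is set. The variable name comes from this comment; `SQLITE_PATH`, the default file path, and the shape of the returned object are hypothetical and not part of the project's configuration.

```javascript
// Picks a database backend from environment variables.
// POSTGRES_CONN_STR is the name proposed above; SQLITE_PATH and the
// default path are hypothetical placeholders.
const resolveDatabase = (env = process.env) => {
  const conn = (env.POSTGRES_CONN_STR ?? '').trim();
  if (conn === '') {
    // No connection string: existing installs keep using SQLite.
    return { driver: 'sqlite', file: env.SQLITE_PATH ?? './data/db.db' };
  }
  // Reject anything that is not a Postgres URL before trying to connect.
  const url = new URL(conn);
  if (!['postgres:', 'postgresql:'].includes(url.protocol)) {
    throw new Error(`unsupported scheme: ${url.protocol}`);
  }
  return { driver: 'postgres', connectionString: conn };
};

// Example: with no variables set, the default backend is returned.
console.log(resolveDatabase({}));
```

Because the default branch is untouched, users who never set the variable would see no change in behaviour.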


@tommyalatalo commented on GitHub (Jul 15, 2025):

There are many benefits with Postgres as have already been mentioned by others, and I support this feature as well.

I don't think it's a good idea to think about backwards compatibility in a project that is currently on v0.25.0; the road to 1.0.0 is expected to include changes of all kinds. The initial choice of sqlite was probably good to get the project off the ground, but if the goal is to have a reliable and scalable self-hosted service that can also support high availability setups, you need something like Postgres instead. Like you said yourself, look at projects like Immich: they made big changes on the way, because they were necessary, and they're all the better for it. Instead of getting bogged down by early decisions and tangled up in workarounds that would only delay the inevitable rewrite, they sorted things out and are in a much better place today.

I would suggest to draft a roadmap for how to get better database support in place, since that is what you said you would have done if starting over today. As other projects have done, you could default to using sqlite to offer the "easy" way, but still allow for configuring use of Postgres and other databases by using an interface for those who want/need something more reliable and/or are already using Postgres with backups etc already in place.

This is true in my case as well. I run a multi-node homelab with NFS storage to allow for easy relocation of workloads across the nodes, and I currently can't run karakeep because sqlite works very poorly over NFS. I already have a Postgres instance up, and S3 storage too, so if karakeep supported them I could run it as a very reliable and resilient application with high availability.
