[GH-ISSUE #703] [OSSHACK] Opt-In Semantic Search #207

Closed
opened 2026-02-26 18:45:52 +03:00 by kerem · 3 comments

Originally created by @Mythie on GitHub (Dec 1, 2023).
Original GitHub issue: https://github.com/documenso/documenso/issues/703

Improvement Description

Integrate an opt-in semantic search feature into Documenso, leveraging OpenAI embeddings. This feature aims to enhance document search capabilities by providing more contextually relevant results based on the semantic content of documents and the user's search query.

Semantic search will be available only to subscribed users, or to all users of a self-hosted instance where the self-hoster has enabled the feature flag.

Users will be able to opt in to semantic search by enabling the feature in their profile settings. The feature will be disabled by default, as users might not want to use it due to privacy concerns or other reasons.
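
The availability rules above could be expressed as a single check. This is only a sketch of one reading of the description: the `SearchContext` fields and the `canUseSemanticSearch` function are hypothetical names, not part of the Documenso codebase, and it assumes the subscription requirement applies to the hosted product while the feature flag governs self-hosted instances.

```typescript
// Hypothetical inputs; none of these names come from the Documenso codebase.
interface SearchContext {
  isSelfHosted: boolean;
  featureFlagEnabled: boolean; // assumed to be set by the self-hoster
  hasActiveSubscription: boolean; // assumed to apply to hosted plans only
  userOptedIn: boolean; // profile setting, false by default
}

function canUseSemanticSearch(ctx: SearchContext): boolean {
  // The user must always opt in explicitly; the default is off.
  if (!ctx.userOptedIn) return false;
  // Self-hosted: the instance-level feature flag decides for all users.
  if (ctx.isSelfHosted) return ctx.featureFlagEnabled;
  // Hosted: only subscribed users are eligible.
  return ctx.hasActiveSubscription;
}

const hostedSubscriber: SearchContext = {
  isSelfHosted: false,
  featureFlagEnabled: false,
  hasActiveSubscription: true,
  userOptedIn: true,
};

const selfHostedFlagOff: SearchContext = {
  isSelfHosted: true,
  featureFlagEnabled: false,
  hasActiveSubscription: false,
  userOptedIn: true,
};

console.log(canUseSemanticSearch(hostedSubscriber));
console.log(canUseSemanticSearch(selfHostedFlagOff));
```

Checking the opt-in first keeps the default-off guarantee independent of how the hosting and subscription rules evolve.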

Rationale

Semantic search represents a significant advancement over traditional keyword-based search by understanding the context and meaning behind the text. It can greatly improve user experience by providing more accurate and relevant search results.

Proposed Solution

  • Use the OpenAI Ada model to generate document embeddings.
  • Store the embeddings within our database under a new table using pgvector.
  • Use a background job to generate embeddings for all existing documents if a user opts in to semantic search.
  • Use a background job to generate embeddings for new documents as they are added to the database for an opted-in user.
  • Update the find documents method to use semantic search if a user has opted in.
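
To illustrate the ranking step behind the last bullet, here is a minimal in-memory sketch. It assumes embeddings have already been generated and stored; the `DocumentEmbedding` shape, the function names, and the toy 3-dimensional vectors are all hypothetical stand-ins (real embedding vectors have far more dimensions and would be read from the database rather than hard-coded).

```typescript
// Hypothetical row shape; in the proposal these would live in a pgvector-backed table.
interface DocumentEmbedding {
  documentId: number;
  title: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0; // treat a zero vector as unrelated to everything
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank documents by similarity to the query embedding, most similar first.
function semanticSearch(
  queryEmbedding: number[],
  rows: DocumentEmbedding[],
  limit: number,
): DocumentEmbedding[] {
  return rows
    .map((row) => ({ row, score: cosineSimilarity(queryEmbedding, row.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit)
    .map((entry) => entry.row);
}

// Toy vectors standing in for real model output.
const rows: DocumentEmbedding[] = [
  { documentId: 1, title: "Employment contract", embedding: [0.9, 0.1, 0.0] },
  { documentId: 2, title: "Office lease", embedding: [0.1, 0.9, 0.1] },
  { documentId: 3, title: "Job offer letter", embedding: [0.8, 0.2, 0.1] },
];

const results = semanticSearch([1, 0, 0], rows, 2);
console.log(results.map((r) => r.title));
```

With pgvector, the same ordering would normally be pushed into the database query instead of computed in application code; the extension provides a cosine-distance operator (`<=>`) for that purpose.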

Caveats

  • We would like to use Trigger.dev for background jobs, but we should also consider how we might implement this without a third party when self-hosting, if possible.
  • The entire feature should be completely opt-in for self-hosters since they have full control over their data. We do not want to force required third-party services on them.

Alternatives

  • Using a different embedding model, such as BERT or RoBERTa.
  • Using a different search engine, such as Elasticsearch or Algolia.

Additional Context

  • The implementation must ensure data privacy and security, considering the sensitive nature of the documents.
  • This is a deviation from our policy of no mandatory third-party services, so we must make this a progressive enhancement rather than a mandatory feature.

@Mythie commented on GitHub (Dec 2, 2023):

For OSSHack participants, we won't be assigning issues or following our usual guidelines; please start work if you want to take on the challenge.

The first good solution (read: to acceptable quality and conformance) wins!


@VishalMCF commented on GitHub (Dec 20, 2023):

Let's say we use our own job processing because of the data security issue. As per my analysis, we can do the following:
Whenever a user opts in to semantic search, it should trigger a job request (publish) that is sent to a message queue (since the queue is persistent). We can then have a consumer in the same application instance that listens for message events and executes the job. That way, we can run background processes without needing a periodic cron job. We should have a consumer group that includes all the pods in production so that each event is delivered to only one pod.
We also need to handle the failure scenario: if any failure happens, that document needs to be sent to the queue again. After processing, we can store the embeddings as you suggested; my approach focuses on how to avoid Trigger.dev.
Otherwise, we have to notify the user about the data security issue so that, before opting in, they can choose whether to go with Trigger.dev or skip the feature for now.
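
The publish, consume, and re-queue-on-failure flow described above can be sketched roughly as follows. This is a deliberately simplified, synchronous, in-memory stand-in: `RetryQueue`, `EmbeddingJob`, and the `maxAttempts` cap are hypothetical, and a real deployment would need a persistent queue shared across pods (with asynchronous handlers) rather than an array in process memory.

```typescript
// Hypothetical job payload: one document that needs an embedding.
interface EmbeddingJob {
  documentId: number;
  attempts: number;
}

type JobHandler = (job: EmbeddingJob) => void;

// Minimal in-memory queue: failed jobs are re-queued until maxAttempts is reached.
class RetryQueue {
  private pending: EmbeddingJob[] = [];
  readonly failed: EmbeddingJob[] = [];
  private readonly maxAttempts: number;

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  // Called when a user opts in or a new document is added.
  publish(documentId: number): void {
    this.pending.push({ documentId, attempts: 0 });
  }

  // Process jobs until the queue is empty; returns the document IDs that succeeded.
  drain(handler: JobHandler): number[] {
    const succeeded: number[] = [];
    let job = this.pending.shift();
    while (job !== undefined) {
      const attempt: EmbeddingJob = { ...job, attempts: job.attempts + 1 };
      try {
        handler(attempt);
        succeeded.push(attempt.documentId);
      } catch {
        if (attempt.attempts < this.maxAttempts) {
          this.pending.push(attempt); // send the document back to the queue
        } else {
          this.failed.push(attempt); // give up after maxAttempts tries
        }
      }
      job = this.pending.shift();
    }
    return succeeded;
  }
}

// Simulated handler: document 2 fails on its first attempt, document 3 always fails.
const handler: JobHandler = (job) => {
  if (job.documentId === 2 && job.attempts === 1) throw new Error("transient failure");
  if (job.documentId === 3) throw new Error("permanent failure");
};

const queue = new RetryQueue(3);
[1, 2, 3].forEach((id) => queue.publish(id));
const succeeded = queue.drain(handler);
console.log(succeeded, queue.failed.map((j) => j.documentId));
```

Capping the number of attempts is an addition not mentioned in the comment; without some limit, a document that can never be embedded would cycle through the queue indefinitely.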


@VishalMCF commented on GitHub (Dec 21, 2023):

Edit: The BullMQ library can be used for processing background tasks.
