[GH-ISSUE #180] reindex each time docker down and up #177

Closed
opened 2026-02-27 15:55:29 +03:00 by kerem · 10 comments
Owner

Originally created by @TomKirby on GitHub (Aug 14, 2018).
Original GitHub issue: https://github.com/RD17/ambar/issues/180

Hi,

We are seeing some odd behaviour: previously indexed directories are entirely recrawled and processed by the pipeline again whenever we run docker-compose down followed by docker-compose up.

The logs show the following:

2018-08-14 10:56:13.709: [pipeline] [verbose] [3] add task received for //crawl_diabetesandendocrinology1/DirectoryNameObsecured/FileNameObscured.doc

2018-08-14 10:56:13.734: [pipeline] [verbose] [3] meta found for //crawl_diabetesandendocrinology1/DirectoryNameObsecured/FileNameObscured.doc

This puts around 65,000 files into the `AMBAR_PIPELINE_QUEUE` queue and creates a large backlog for the pipeline, which has to process all of them.

My worry is that we have only mounted about 4 shares out of the proposed 20. Will each share need to be entirely reindexed each time the services start up?

It also appears that previously indexed files are no longer available in the search. Will they remain unavailable until the share has been reindexed through the pipeline?

kerem closed this issue 2026-02-27 15:55:29 +03:00

@sochix commented on GitHub (Aug 14, 2018):

@TomKirby yes, Ambar will reindex the shares each time it is restarted. This happens because some files may have changed during the restart.

> It also appears that previously indexed files are no longer available in the search...

That's not true. Check your install; maybe you removed Ambar's data directory?


@TomKirby commented on GitHub (Aug 14, 2018):

Will Ambar remove files when the crawler detects that a file has been removed? It is possible that the share was unmounted while Ambar was running, causing the files to appear to have been removed.


@sochix commented on GitHub (Aug 14, 2018):

> Will Ambar remove files when the crawler detects that a file has been removed?

Yes.

> It is possible that the share was unmounted while Ambar was running, causing the files to appear to have been removed.

No, it shouldn't.


@TomKirby commented on GitHub (Aug 14, 2018):

I will let the queue clear down and all the files reindex, and then check whether they are available through search.


@TomKirby commented on GitHub (Aug 16, 2018):

Hi,

I'm still having problems after the reindex.

I removed and re-added the share mapping, and then I stopped and restarted the crawler.

I can see in the RabbitMQ admin page (which I made accessible by mapping the port in docker-compose) that all the files go into the queue.

From the logs I can see that the files are being picked up from the queue by the pipeline and that the metadata is already found.

However, when I search for the file name, or for known content within the file, it does not appear.

Additionally, a search for `tags:crawl_dermatology` returns no records, although this DOES work for the other 2 crawler names/tags.
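To track the backlog described above without watching the admin page, the queue depth can also be read from the RabbitMQ management HTTP API. The following is only a sketch under assumptions: it assumes the management plugin is reachable on the mapped port (15672 is the RabbitMQ default), that the queue lives in the default vhost `/`, and that the `guest`/`guest` credentials are valid for your deployment, which they may not be. The helper names `queue_url` and `queue_depth` are hypothetical and not part of Ambar.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def queue_url(host, queue, vhost="/", port=15672):
    """Build the management API URL for a single queue.

    The vhost is URL-encoded, so the default vhost "/" becomes "%2F".
    """
    return (
        f"http://{host}:{port}/api/queues/"
        f"{quote(vhost, safe='')}/{quote(queue, safe='')}"
    )


def queue_depth(host, queue, user="guest", password="guest"):
    """Return the number of messages currently in the queue.

    The credentials here are placeholders (assumption); use the ones
    configured for your RabbitMQ container.
    """
    request = urllib.request.Request(queue_url(host, queue))
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["messages"]
```

Calling `queue_depth("localhost", "AMBAR_PIPELINE_QUEUE")` at intervals would show whether the pipeline is actually draining the queue.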


@sochix commented on GitHub (Aug 16, 2018):

@TomKirby this seems like a problem with your setup. Double-check your docker-compose file.


@stale[bot] commented on GitHub (Sep 10, 2018):

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


@open7c commented on GitHub (Sep 17, 2018):

Are there any plans to change the behaviour of reindexing every time the container is restarted?

On a test machine we have 30 GB of data. Reindexing takes one day, so new documents have to wait until the next day.

On a client's machine there are about 30 TB of data. With this mechanism, would the whole thing take 10 days? Just for reindexing, even if not a single byte of the data has changed?

It would be ideal if the crawler sent a list of all files with their update timestamps or fingerprints to a service that internally checks which files actually need reindexing and ignores all the others.
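The idea in the last paragraph can be sketched roughly as follows. This is an illustration of the proposal only, not Ambar's actual implementation: the `FileStamp` type, the `plan_reindex` function, and the choice of size plus modification time as the fingerprint are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class FileStamp:
    """What a crawler could report for each file it sees (hypothetical)."""
    path: str
    size: int
    mtime: float


def plan_reindex(
    crawled: Iterable[FileStamp],
    indexed: Dict[str, Tuple[int, float]],
) -> Tuple[List[str], List[str]]:
    """Compare crawled stamps against the stored (size, mtime) per path.

    Returns (paths to index, paths to remove). Files whose stamp matches
    the stored one are skipped entirely.
    """
    to_index: List[str] = []
    seen = set()
    for stamp in crawled:
        seen.add(stamp.path)
        # New files have no stored stamp; changed files have a different one.
        if indexed.get(stamp.path) != (stamp.size, stamp.mtime):
            to_index.append(stamp.path)
    # Anything indexed but no longer reported by the crawler was deleted.
    to_remove = sorted(path for path in indexed if path not in seen)
    return to_index, to_remove
```

With a check like this, a restart still costs one pass over the file listing, but only new or changed files reach the processing queue. A content hash could replace size and mtime where timestamps are unreliable, at the cost of reading every file.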


@sochix commented on GitHub (Sep 17, 2018):

@open7c Ambar doesn't reindex the share every time; it only recrawls the share. I think you have an error in your setup, or your images are outdated.

> It would be ideal if the crawler sent a list of all files with their update timestamps or fingerprints to a service that internally checks which files actually need reindexing and ignores all the others.

It's already implemented.


@maydo commented on GitHub (Sep 17, 2018):

I can confirm this: Ambar does not reindex, it recrawls the files.
With 800k files, recrawling takes about 24 hours on my end.

By the way, maybe you can optimize this. A similar system, Qsirch from QNAP, does not recrawl; it finds the files that have been changed or added. There are about 3 million files in Qsirch, and it is also based on Elasticsearch.
