mirror of
https://github.com/RD17/ambar.git
synced 2026-04-25 15:35:49 +03:00
[GH-ISSUE #180] reindex each time docker down and up #177
Originally created by @TomKirby on GitHub (Aug 14, 2018).
Original GitHub issue: https://github.com/RD17/ambar/issues/180
Hi,
We are seeing some odd behaviour: previously indexed directories are entirely recrawled and processed by the pipeline again whenever a docker-compose down followed by a docker-compose up is performed.
The logs show the following:
This is putting around 65,000 files in the queue AMBAR_PIPELINE_QUEUE and causing a large backlog on the pipeline, which needs to process all these files. My worry is that we have only mounted about 4 shares out of the proposed 20. Will each share need to be entirely reindexed each time the services start up?
It also appears that previously indexed files are no longer available in the search. Will these be unavailable until the share has been reindexed through the pipeline?
@sochix commented on GitHub (Aug 14, 2018):
@TomKirby yes, Ambar will reindex the shares each time it is restarted. This happens because some files may have changed during the restart.
That's not true. Check your install; maybe you removed Ambar's data directory?
@TomKirby commented on GitHub (Aug 14, 2018):
Will Ambar remove files from the index when the crawler detects that a file has been removed? It is possible that the share was unmounted while Ambar was running, causing the files to appear to have been removed.
@sochix commented on GitHub (Aug 14, 2018):
No, it shouldn't.
@TomKirby commented on GitHub (Aug 14, 2018):
I will allow the queue to clear and all the files to reindex, and then investigate whether they are available through search.
@TomKirby commented on GitHub (Aug 16, 2018):
Hi,
I'm continuing to have problems after the reindex.
I removed and re-added the share mapping, and then I stopped and restarted the crawler.
I can see in the RabbitMQ admin page (which I made accessible by mapping the port in docker-compose) that all the files go into the queue.
From the logs I can see that the files are being picked up from the queue by the pipeline and that the metadata is already found.
However, when I search for the file name, or for known content within the file, it does not appear.
Additionally, a search for tags:crawl_dermatology does not display any records. This DOES work, however, for the other 2 crawler names/tags.
@sochix commented on GitHub (Aug 16, 2018):
@TomKirby this seems like a problem with your setup; double-check your docker-compose file.
@stale[bot] commented on GitHub (Sep 10, 2018):
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@open7c commented on GitHub (Sep 17, 2018):
Are there any plans to change the behaviour of re-indexing every time the container gets restarted?
On a test machine we have 30GB of data; the reindexing takes one day, so new documents have to wait until the next day.
On a client's machine there are about 30TB of data. With this mechanism the whole thing would take 10 days? Just for re-indexing, even if not one byte of the data has changed?
It would be perfect if the crawler sent a list of all files with their update timestamps or fingerprints to a service that checks internally which files need explicit reindexing and ignores all others.
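The suggestion above can be sketched in a few lines. This is only a minimal illustration of the proposed idea, not Ambar's actual crawler or API: the function names `scan` and `plan_reindex` are hypothetical, and the fingerprint is assumed to be the file size plus modification time rather than a content hash.

```python
import os
from typing import Dict, List, Tuple

# Assumed fingerprint: (size in bytes, modification time in nanoseconds).
# A content hash would be more robust but requires reading every file.
Fingerprint = Tuple[int, int]


def scan(root: str) -> Dict[str, Fingerprint]:
    """Walk `root` and return {relative_path: fingerprint} for every file."""
    manifest: Dict[str, Fingerprint] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            stat = os.stat(full_path)
            rel_path = os.path.relpath(full_path, root)
            manifest[rel_path] = (stat.st_size, stat.st_mtime_ns)
    return manifest


def plan_reindex(
    previous: Dict[str, Fingerprint],
    current: Dict[str, Fingerprint],
) -> Tuple[List[str], List[str]]:
    """Compare two manifests.

    Returns (to_index, to_remove):
      to_index  - files that are new or whose fingerprint changed
      to_remove - files present in the previous manifest but now missing
    Files with an unchanged fingerprint appear in neither list.
    """
    to_index = sorted(
        path for path, fp in current.items() if previous.get(path) != fp
    )
    to_remove = sorted(path for path in previous if path not in current)
    return to_index, to_remove
```

With a manifest stored from the previous run, a restart would only need to call `scan` on each share and send the `to_index` list to the pipeline. An unmounted share would show up as a large `to_remove` list, so a real implementation would probably want to guard against acting on that blindly.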
@sochix commented on GitHub (Sep 17, 2018):
@open7c Ambar doesn't reindex the share every time; Ambar only recrawls the share. I think you have an error in your setup, or your images are outdated.
It's already implemented.
@maydo commented on GitHub (Sep 17, 2018):
I can confirm this: it does not reindex, Ambar only recrawls the files.
With 800k files, recrawling takes about 24h on my end.
By the way, maybe you can optimize this. A similar system, Qsirch from QNAP, does not recrawl; it finds files which have changed or been added. There are about 3 million files in Qsirch, and it is also based on ES.