mirror of
https://github.com/RD17/ambar.git
synced 2026-04-25 15:35:49 +03:00
[GH-ISSUE #205] Tag by folder #201
Originally created by @s1rk1t on GitHub (Dec 17, 2018).
Original GitHub issue: https://github.com/RD17/ambar/issues/205
I'm trying to follow what happened in Issue #175 but am unable to reproduce his results.
Here's my code:
def AutoTagAmbarFile(self, AmbarFile):
    self.SetOCRTag(AmbarFile)
    self.SetSourceIdTag(AmbarFile)
    self.SetArchiveTag(AmbarFile)
    self.SetImageTag(AmbarFile)
    self.SetFolderTag(AmbarFile)
Followed by this:
def SetFolderTag(self, AmbarFile):
    if('folderName' in AmbarFile['meta']['full_name']):
        self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'],
                               self.AUTO_TAG_TYPE, 'folderName')
I've tried altering a pre-existing tag, as the poster in Issue #175 did, but was unable to see any change after I rebuilt the Pipeline image, pulled the new image, and spun up a new instance of AMBAR. I've also tried clearing my browser cache, as that had caused issues in the past, but there was no change.
Is there somewhere else I need to change some code in order for the new tag to show up on the search page?
Thanks in advance for any help you can offer!
@sochix commented on GitHub (Dec 18, 2018):
Everything looks good.
Check in debug mode that your condition
if('folderName' in AmbarFile['meta']['full_name']):
works properly.
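One way to check the condition outside the pipeline is a small standalone sketch like the one below. The helper name should_tag_folder, the sample record, and its path are hypothetical and only mimic the fields used in the snippets above; this is not Ambar's own code.

```python
def should_tag_folder(ambar_file, folder_name):
    """Return True when folder_name occurs anywhere in the file's full_name."""
    return folder_name in ambar_file['meta']['full_name']


# Hypothetical record: only the keys used in the thread are modelled here.
sample = {
    'file_id': 'example-id',
    'meta': {'full_name': '//crawler/reports/2018/summary.pdf'},
}

# The check is a plain, case-sensitive substring test on the whole path.
print(should_tag_folder(sample, 'reports'))     # True
print(should_tag_folder(sample, 'folderName'))  # False
```

If the second call is the one that matches your code, the literal string 'folderName' is being searched for instead of the actual folder name.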
@s1rk1t commented on GitHub (Dec 18, 2018):
So I tried this:
def SetFolderNameTag(self, AmbarFile):
    fileString = AmbarFile['meta']['full_name']
    self.logger.LogMessage('verbose', '{0} is full_name'.format(fileString))
    if('folderName' in fileString):
        self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'], self.AUTO_TAG_TYPE, 'folderName')
and this:
def SetFolderNameTag(self, AmbarFile):
    fileString = AmbarFile['meta']['full_name']
    if('folderName' in fileString):
        self.logger.LogMessage('verbose', 'folderName is in {0}'.format(fileString))
        self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'], self.AUTO_TAG_TYPE, 'folderName')
but after using the 'sudo docker logs pipelineContainerID' command the output was this:
Dec 18, 2018 2:04:31 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
Dec 18, 2018 2:04:31 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Dec 18, 2018 2:04:31 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
Is that the correct command to view the proper log?
I ask because on line 96 of autotagging.py there is this statement:
self.logger.LogMessage('verbose', '{0} tag added to {1}'.format(Tag, FullName))
but I am not seeing any of that output in the log above.
EDIT:
So now (after using docker-compose down and reloading the images) I am getting this in the log (after the previously stated output):
2018-12-18 14:50:09.066066: [info] [0] started
2018-12-18 14:50:09.107822: [info] [0] connecting to Rabbit amqp://rabbit...
2018-12-18 14:50:09.204385: [info] [0] connected to Rabbit!
2018-12-18 14:50:09.220989: [info] [0] waiting for messages...
2018-12-18 14:51:09.128585: [verbose] [0] add task received for (then comes the full_name data)
2018-12-18 14:51:09.151687: [verbose] [0] meta found for (again, the full_name data)
This second two-line chunk repeats a bunch of times, presumably once for each time the new tag is supposed to be applied.
After grepping for the text ('meta found for') from that output, it looks like it's coming from the pipeline.py file, specifically lines 78 and 113.
Thanks again for your help!
@sochix commented on GitHub (Dec 18, 2018):
Did you crawl a file with 'folderName' in the path?
@s1rk1t commented on GitHub (Dec 18, 2018):
yes
@sochix commented on GitHub (Dec 18, 2018):
Can you please put the full path here as an example?
@s1rk1t commented on GitHub (Dec 18, 2018):
Sure,
//mycrawler/outerFolder/subFolder/testDocument.pdf
folder name is outerFolder
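For a path shaped like this one, a stricter alternative to the substring test is to compare whole path components. The sketch below is only an illustration: folder_components is a hypothetical helper, and treating the first component as the crawler name is an assumption based on this single example path.

```python
def folder_components(full_name):
    """Split a full_name into its folder names.

    Assumes (from the example path only) that the first component is the
    crawler name and the last one is the file name; both are dropped.
    """
    parts = [part for part in full_name.split('/') if part]
    return parts[1:-1]


path = '//mycrawler/outerFolder/subFolder/testDocument.pdf'
folders = folder_components(path)
print(folders)                   # ['outerFolder', 'subFolder']
print('outerFolder' in folders)  # True
print('outer' in folders)        # False: partial names do not match
```

Unlike a substring test on the whole path, this does not fire when the folder name merely appears inside a longer folder or file name.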
@sochix commented on GitHub (Dec 18, 2018):
So your code snippet is:
Am I right?
@s1rk1t commented on GitHub (Dec 18, 2018):
Yes, that looks right.
@sochix commented on GitHub (Dec 18, 2018):
Did you change the ambar pipeline image source in docker-compose file?
@s1rk1t commented on GitHub (Dec 18, 2018):
Yes
@sochix commented on GitHub (Dec 18, 2018):
Can you share your docker-compose file, please?
@s1rk1t commented on GitHub (Dec 18, 2018):
I think I may have figured it out. Once I ran docker's prune command, I was able to see a change in the tag (I had changed ocr to ocr-test, like the poster did in Issue #175). Rerunning it now to see if the new tags show up.
@s1rk1t commented on GitHub (Dec 18, 2018):
Yep, that did it. It's working as expected now.
Thanks so much for your help!
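For anyone landing here with the same symptom, the refresh sequence described in this thread can be sketched roughly as below. The thread does not say which prune command was used, so docker system prune is an assumption; the helper names are hypothetical, and the script only prints the commands unless dry_run is set to False.

```python
import subprocess


def refresh_commands(prune_command=('docker', 'system', 'prune', '-f')):
    """Return the command sequence: stop the stack, prune, pull, restart.

    The default prune command is an assumption; the thread only says
    "docker's prune command".
    """
    return [
        ['docker-compose', 'down'],
        list(prune_command),
        ['docker-compose', 'pull'],
        ['docker-compose', 'up', '-d'],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(' '.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


# Dry run: shows what would be executed without touching Docker.
run_all(refresh_commands())
```

Note that docker system prune removes stopped containers, unused networks, dangling images, and build cache, so check what else lives on the host before running it for real.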