[GH-ISSUE #205] Tag by folder #201

Closed
opened 2026-02-27 15:55:36 +03:00 by kerem · 13 comments
Owner

Originally created by @s1rk1t on GitHub (Dec 17, 2018).
Original GitHub issue: https://github.com/RD17/ambar/issues/205

I'm trying to follow what happened in Issue #175 but am unable to reproduce his results.

Here's my code:

def AutoTagAmbarFile(self, AmbarFile):
    self.SetOCRTag(AmbarFile)
    self.SetSourceIdTag(AmbarFile)
    self.SetArchiveTag(AmbarFile)
    self.SetImageTag(AmbarFile)
    self.SetFolderTag(AmbarFile)

Followed by this:

def SetFolderTag(self, AmbarFile):
    if('folderName' in AmbarFile['meta']['full_name']):
        self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'],
                               self.AUTO_TAG_TYPE, 'folderName')

I've tried altering a pre-existing tag, as the poster in Issue #175 did, but saw no change after I rebuilt the Pipeline image, pulled the new image, and spun up a new instance of AMBAR. I've also tried clearing my browser cache, since that had caused issues in the past, but there was no change.

Is there somewhere else I need to change some code in order for the new tag to show up on the search page?

Thanks in advance for any help you can offer!

kerem closed this issue 2026-02-27 15:55:36 +03:00

@sochix commented on GitHub (Dec 18, 2018):

Everything looks good.

Check in debug mode that your condition
if('folderName' in AmbarFile['meta']['full_name']):
works properly.
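
One way to check the condition outside the pipeline is a small standalone script. This is only a sketch: the dictionary below is a made-up stand-in for the object the pipeline passes in, and `folder_condition` is a hypothetical helper, not part of autotagging.py. Only the `['meta']['full_name']` lookup comes from the code in this thread.

```python
# Hypothetical stand-in for the file dict handed to the tagger.
# Only the 'meta' -> 'full_name' key path is taken from the thread;
# the values are invented for illustration.
ambar_file = {
    'file_id': 'example-id',
    'meta': {'full_name': '//crawler/folderName/doc.pdf'},
}


def folder_condition(ambar_file, folder_name):
    """Return True when folder_name occurs anywhere in full_name.

    This is a plain, case-sensitive substring test, the same check
    the 'in' expression in SetFolderTag performs.
    """
    return folder_name in ambar_file['meta']['full_name']


print(folder_condition(ambar_file, 'folderName'))  # True
print(folder_condition(ambar_file, 'foldername'))  # False (case differs)
```

Note that the test is case-sensitive and matches anywhere in the string, including the file name itself.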


@s1rk1t commented on GitHub (Dec 18, 2018):

So I tried this:

def SetFolderNameTag(self, AmbarFile):
    fileString = AmbarFile['meta']['full_name']
    self.logger.LogMessage('verbose', '{0} is full_name'.format(fileString))
    if('folderName' in fileString):
        self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'], self.AUTO_TAG_TYPE, 'folderName')

and this:

def SetFolderNameTag(self, AmbarFile):
    fileString = AmbarFile['meta']['full_name']
    if('folderName' in fileString):
        self.logger.LogMessage('verbose', 'folderName is in {0}'.format(fileString))
        self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'], self.AUTO_TAG_TYPE, 'folderName')

but after using the 'sudo docker logs pipelineContainerID' command the output was this:

Dec 18, 2018 2:04:31 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Dec 18, 2018 2:04:31 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Dec 18, 2018 2:04:31 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.

Is that the correct command to view the proper log?

I ask because on line 96 of autotagging.py there is this statement:

self.logger.LogMessage('verbose', '{0} tag added to {1}'.format(Tag, FullName))

but I am not seeing any of that output in the log above.

EDIT:

So now (after using docker-compose down and reloading the images) I am getting this in the log (after the previously stated output):

2018-12-18 14:50:09.066066: [info] [0] started
2018-12-18 14:50:09.107822: [info] [0] connecting to Rabbit amqp://rabbit...
2018-12-18 14:50:09.204385: [info] [0] connected to Rabbit!
2018-12-18 14:50:09.220989: [info] [0] waiting for messages...

2018-12-18 14:51:09.128585: [verbose] [0] add task received for (then comes the full_name data)
2018-12-18 14:51:09.151687: [verbose] [0] meta found for (again, the full_name data)

This second two-line chunk repeats many times, presumably once for each file the new tag is supposed to be applied to.

After grepping for the text in that output ('meta found for'), it looks like it's coming from the pipeline.py file, specifically lines 78 and 113.

Thanks again for your help!


@sochix commented on GitHub (Dec 18, 2018):

Did you crawl a file with 'folderName' in the path?


@s1rk1t commented on GitHub (Dec 18, 2018):

yes


@sochix commented on GitHub (Dec 18, 2018):

Can you please put the full path here as an example?


@s1rk1t commented on GitHub (Dec 18, 2018):

Sure,

//mycrawler/outerFolder/subFolder/testDocument.pdf

folder name is outerFolder
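
For a path shaped like the one above, the folder name could also be derived from `full_name` rather than hard-coded. The sketch below is an illustration only: `top_level_folder` is a hypothetical helper, and it assumes every `full_name` has the form `//<crawler>/<folder>/.../<file>`, which this thread shows for a single example.

```python
def top_level_folder(full_name):
    """Return the first folder under the crawler name, or None.

    Assumes full_name looks like '//<crawler>/<folder>/.../<file>'.
    """
    # Drop the empty strings produced by the leading '//'.
    parts = [p for p in full_name.split('/') if p]
    # parts[0] is the crawler name and parts[-1] is the file itself,
    # so at least three parts are needed for a folder to exist.
    if len(parts) < 3:
        return None
    return parts[1]


print(top_level_folder('//mycrawler/outerFolder/subFolder/testDocument.pdf'))
# outerFolder
```

A file sitting directly under the crawler root (for example `//mycrawler/doc.pdf`) yields `None`, so no folder tag would be added for it.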


@sochix commented on GitHub (Dec 18, 2018):

So your code snippet is:

def SetFolderNameTag(self, AmbarFile):
  fileString = AmbarFile['meta']['full_name']
  if('outerFolder' in fileString):
    self.logger.LogMessage('verbose', 'outerFolder is in {0}'.format(fileString))
    self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'], self.AUTO_TAG_TYPE, 'outerFolder')

Am I right?


@s1rk1t commented on GitHub (Dec 18, 2018):

Yes, that looks right.


@sochix commented on GitHub (Dec 18, 2018):

Did you change the ambar pipeline image source in the docker-compose file?


@s1rk1t commented on GitHub (Dec 18, 2018):

Yes


@sochix commented on GitHub (Dec 18, 2018):

Can you share your docker-compose file, please?


@s1rk1t commented on GitHub (Dec 18, 2018):

I think I may have figured it out. Once I ran docker's prune command I was able to see a change in the tag (I had changed ocr to ocr-test like the poster did in Issue #175). Rerunning it now to see if the new tags show up.
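
For anyone hitting the same stale-image symptom, the refresh sequence described in this thread can be sketched roughly as below. The exact prune command is not stated here, so `docker system prune -f` is an assumption, as is the order of the steps; `refresh_stack` is a hypothetical helper. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Assumed command sequence. The thread mentions "docker-compose down",
# pulling the new image, and "docker's prune command" without naming
# which one; 'docker system prune -f' is a guess.
REFRESH_COMMANDS = [
    ['docker-compose', 'down'],
    ['docker', 'system', 'prune', '-f'],
    ['docker-compose', 'pull'],
    ['docker-compose', 'up', '-d'],
]


def refresh_stack(dry_run=True):
    """Print each command, and execute it only when dry_run is False."""
    for cmd in REFRESH_COMMANDS:
        print(' '.join(cmd))
        if not dry_run:
            # check=True stops the sequence if a command fails.
            subprocess.run(cmd, check=True)
    return REFRESH_COMMANDS


refresh_stack()  # dry run: prints the commands without running them
```

Pruning deletes stopped containers and unused images, so review the printed commands before calling `refresh_stack(dry_run=False)`. On hosts where docker requires root, the commands would need `sudo`, as in the log command used earlier.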


@s1rk1t commented on GitHub (Dec 18, 2018):

Yep, that did it. It's working as expected now.

Thanks so much for your help!
