[GH-ISSUE #175] Adding custom tagging rules #173

Closed
opened 2026-02-27 15:55:28 +03:00 by kerem · 7 comments

Originally created by @musical10441 on GitHub (Jul 28, 2018).
Original GitHub issue: https://github.com/RD17/ambar/issues/175

Good afternoon -

I wanted to add a custom rule to detect US Social Security Numbers and tag the documents as "SSN". Here are the steps I took:

  1. docker-compose down on my existing install
  2. locate and update autotagging.py.
  3. I added the following:
def AutoTagAmbarFile(self, AmbarFile):
    self.SetOCRTag(AmbarFile)
    self.SetSourceIdTag(AmbarFile)
    self.SetArchiveTag(AmbarFile)
    self.SetImageTag(AmbarFile)
    self.SetSSNTag(AmbarFile)  # newly added call

as well as:

def SetSSNTag(self, AmbarFile):
    if PIIParser.MatchCC('234-23-1662'):  # for testing only
        self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'], self.AUTO_TAG_TYPE, 'cc')

  4. I re-ran docker-compose up -d
  5. I uploaded a file but the CC tag was not applied.
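
For reference, the SSN check itself could be sketched as below. This is only an illustration: `SSN_PATTERN` and `match_ssn` are hypothetical names and are not part of Ambar's PIIParser, and the pattern assumes the hyphenated AAA-GG-SSSS form.

```python
import re

# Assumed format: AAA-GG-SSSS with hyphens. Areas 000, 666 and 900-999,
# group 00, and serial 0000 are never issued, so they are excluded.
SSN_PATTERN = re.compile(r"\b(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b")


def match_ssn(text: str) -> bool:
    """Return True if the text contains at least one SSN-shaped value."""
    return SSN_PATTERN.search(text) is not None
```

A method such as SetSSNTag could then call `match_ssn` on the document text and apply an "SSN" tag when it returns True.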

My next step was to remove the instance again and modify the AutoTagging.py file again:

def SetArchiveTag(self, AmbarFile):
    if ContentTypeAnalyzer.IsArchive(AmbarFile['meta']['full_name']):
        self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'], self.AUTO_TAG_TYPE, 'archive-test')

Notice that in this case all I did was change the name of the archive tag. That way no new evaluation logic was involved, and I was only checking whether the new tag name would take effect.

I then uploaded a zip file containing an image document. The tag "archive" (not "archive-test") was added.

I would think that modifying an existing tag with a new name without changing any logic would work but it didn't. Is there another place I should be looking?

Thanks for your help and what a great tool you've created!


@sochix commented on GitHub (Jul 31, 2018):

Hi, did you create a new docker image with your changes? Or where did you edit autotagging.py?


@musical10441 commented on GitHub (Jul 31, 2018):

I made my changes in the ambar-master folder and then ran docker-compose up -d to create a new docker application.


@sochix commented on GitHub (Jul 31, 2018):

@musical10441 you need to build a new pipeline image with your changes and then edit your docker-compose file to reference the new pipeline image.


@musical10441 commented on GitHub (Aug 1, 2018):

Thank you. That got me much further. I'm now able to see the changes I made to the names of the default ocr and archive tags.

Now that I know the changes are taking effect, I am using the following code in the autotagger.py to call the regular expression.

def SetCreditCardTag(self, AmbarFile):
    if PIIParser.MatchCC(AmbarFile['content']['text']):
    # if PIIParser.MatchCC('4485003891627515'):
        self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'], self.AUTO_TAG_TYPE, 'cc')

My goal is to pass the contents of the document to the PIIParser and return true if there is a match in the document.

When I pass the hard coded value using if PIIParser.MatchCC('4485003891627515'): it works as expected, but when I try passing the document content using if PIIParser.MatchCC(AmbarFile['content']['text']): it does not return true. The document is a text file with only the credit card number (it's fake btw).

Am I correct in trying to pass (AmbarFile['content']['text']) or should I be passing something else?

Thanks in advance!
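
To illustrate the goal described above (scan a whole document and return true on any match), here is a minimal sketch. It does not show how PIIParser.MatchCC is actually implemented, which this thread never reveals. `CARD_CANDIDATE`, `luhn_valid` and `contains_card_number` are hypothetical names, and the sketch assumes card numbers appear as unbroken runs of 13 to 19 digits.

```python
import re

# Hypothetical pattern: a run of 13 to 19 digits that is not part of a longer run.
CARD_CANDIDATE = re.compile(r"(?<!\d)\d{13,19}(?!\d)")


def luhn_valid(number: str) -> bool:
    """Check a digit string against the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        digit = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def contains_card_number(text: str) -> bool:
    """Return True if any digit run in the text passes the Luhn check."""
    return any(luhn_valid(m.group()) for m in CARD_CANDIDATE.finditer(text))
```

Because the pattern is searched for anywhere in the text rather than matched against the whole string, surrounding words or a trailing newline in the file do not prevent a match.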


@sochix commented on GitHub (Aug 1, 2018):

Yes, everything is correct, I don't see any error. Can you please log the AmbarFile['content']['text'], and check what it contains?
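
One way to do that logging is sketched below, using an escaped preview so that hidden characters such as a trailing newline or carriage return become visible. `describe_text` and the logger name are hypothetical and are not part of the Ambar pipeline.

```python
import logging

logger = logging.getLogger("autotagging")  # hypothetical logger name


def describe_text(text: str) -> str:
    """Summarize text for debugging: its length plus an escaped preview."""
    return "len=%d preview=%r" % (len(text), text[:80])


def log_content(ambar_file: dict) -> None:
    """Log what the tagger actually receives as document content."""
    logger.info("content text: %s", describe_text(ambar_file["content"]["text"]))
```

The `%r` formatting prints the Python representation of the string, so a file containing only a card number followed by a Windows line ending shows up with `\r\n` at the end instead of looking like a clean value.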


@musical10441 commented on GitHub (Aug 1, 2018):

Logging the content text returns the content, so it is passing it properly. I will keep working on it.

By the way, building a new pipeline image gives errors in the log:
/envs/plarin-3.7.0a4/lib/python3.7/site-packages/pika/adapters/libev_connection.py", line 106
self.async = None

Changing the Requirements.txt to pika==0.12.0 resolves the issue.
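
The error is consistent with `async` having become a reserved keyword in Python 3.7, which makes a line like `self.async = None` a syntax error at import time. A small check that shows this, assuming it is run on Python 3.7 or newer (`compiles` is a hypothetical helper):

```python
def compiles(source: str) -> bool:
    """Return True if the source parses as valid Python on this interpreter."""
    try:
        compile(source, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False


# The line reported in the pika log, and the same line with a renamed attribute.
reserved_name = "self.async = None"
renamed = "self.async_ = None"

print(compiles(reserved_name), compiles(renamed))
```

On Python 3.7 and later the first snippet fails to compile while the renamed one parses, which is why a pika release that no longer uses `async` as an attribute name avoids the error.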


@musical10441 commented on GitHub (Aug 1, 2018):

It seems to be working now. I made sure to change the requirements.txt per my prior post before building the new pipeline image. I removed the python3 image and the pipeline image and then ran docker-compose again.

Thanks for your help!
