[GH-ISSUE #22] regular expressions for keyword filtering #17

Closed
opened 2026-02-28 01:22:46 +03:00 by kerem · 22 comments
Owner

Originally created by @swoertz on GitHub (May 29, 2012).
Original GitHub issue: https://github.com/ushahidi/SMSSync/issues/22

it would be nice if you could filter sms messages by regular expressions

Originally created by @swoertz on GitHub (May 29, 2012). Original GitHub issue: https://github.com/ushahidi/SMSSync/issues/22 it would be nice if you could filter sms messages by regular expressions
kerem closed this issue 2026-02-28 01:22:46 +03:00
Author
Owner

@eyedol commented on GitHub (Aug 26, 2012):

Don't have the bandwidth to implement this feature. I'm leaving it open. Any community member who is willing to implement this feature should let me know. I'll humbly merge any pull request that comes my way on that front.

<!-- gh-comment-id:8036703 --> @eyedol commented on GitHub (Aug 26, 2012): Don't have the bandwidth to implement this feature. I'm leaving it open. Any community member who is willing to implement this feature should let me know. I'll humbly merge any pull request that comes my way on that front.
Author
Owner

@olliebennett commented on GitHub (Dec 22, 2012):

I am playing with implementing this at the moment. My thoughts are:

  • Most users don't know (or care) what a Regular Expression is.
  • Adding necessary option(s) to differentiate between regular expressions and keywords would be clunky and confusing.

I propose an alternative, whereby entering regex(YOUR_REGEX_HERE) in the keywords field would cause the filtering to be processed via regular expressions rather than comma-separated keywords.

What are your thoughts on this? I will submit a pull request once I have this implemented and tested, so you can decide whether it's acceptable for inclusion! :)

Notes:

  • This alternative or "secret" method would of course need to be documented, but that would only require an extra line, and a link to a good Java RegEx reference.
  • We should test the user input for RegEx validity. I can't decide whether this is worth doing, or whether it'd bloat the code.
<!-- gh-comment-id:11632285 --> @olliebennett commented on GitHub (Dec 22, 2012): I am playing with implementing this at the moment. My thoughts are: - Most users don't know (or care) what a Regular Expression is. - Adding necessary option(s) to differentiate between regular expressions and keywords would be clunky and confusing. I propose an alternative, whereby entering `regex(YOUR_REGEX_HERE)` in the keywords field would cause the filtering to be processed via regular expressions rather than comma-separated keywords. What are your thoughts on this? I will submit a pull request once I have this implemented and tested, so you can decide whether it's acceptable for inclusion! :) Notes: - This alternative or "secret" method would of course need to be documented, but that would only require an extra line, and a link to a good Java RegEx reference. - We should test the user input for RegEx validity. I can't decide whether this is worth doing, or whether it'd bloat the code.
Author
Owner

@olliebennett commented on GitHub (Dec 22, 2012):

What I've got so far. All changes are within ProcessSMS.java, and there's nothing complicated.

Introduced dependency on java.utils.regex:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Changed name + function call:

Replaced...

String keywords[] = syncUrl.getKeywords().split(",");
if (filterByKeywords(messagesBody, keywords)) {

... with ...

String filterText = syncUrl.getKeywords();
if (matchesFilterText(messagesBody, filterText)) {

Then, changed the function itself by replacing...

public boolean filterByKeywords(String message, String[] keywords) {

... with ...

public boolean matchesFilterText(String message, String filterText) {

    // Handle Regular Expression, if specified by "regex(XYZ)".
    if (filterText.toLowerCase().startsWith("regex(") && filterText.toLowerCase().endsWith(")")) {
        String regEx = filterText.substring(6, filterText.length()-1);
        Pattern p = Pattern.compile(regEx); // TODO - test RegEx validity here?
        Matcher m = p.matcher(message);
        if (m.find()) {
           return true;
        }
        return false;
    }

    // Handle the filter text as comma-separated keywords
    String[] keywords = filterText.split(",");

Feel free to take this, as I don't have more time at the moment - it should work but it's not yet properly tested.

<!-- gh-comment-id:11633240 --> @olliebennett commented on GitHub (Dec 22, 2012): What I've got so far. All changes are within `ProcessSMS.java`, and there's nothing complicated. Introduced dependency on java.utils.regex: ``` java import java.util.regex.Matcher; import java.util.regex.Pattern; ``` Changed name + function call: Replaced... ``` java String keywords[] = syncUrl.getKeywords().split(","); if (filterByKeywords(messagesBody, keywords)) { ``` ... with ... ``` java String filterText = syncUrl.getKeywords(); if (matchesFilterText(messagesBody, filterText)) { ``` Then, changed the function itself by replacing... ``` java public boolean filterByKeywords(String message, String[] keywords) { ``` ... with ... ``` java public boolean matchesFilterText(String message, String filterText) { // Handle Regular Expression, if specified by "regex(XYZ)". if (filterText.toLowerCase().startsWith("regex(") && filterText.toLowerCase().endsWith(")")) { String regEx = filterText.substring(6, filterText.length()-1); Pattern p = Pattern.compile(regEx); // TODO - test RegEx validity here? Matcher m = p.matcher(message); if (m.find()) { return true; } return false; } // Handle the filter text as comma-separated keywords String[] keywords = filterText.split(","); ``` Feel free to take this, as I don't have more time at the moment - it should work but it's not yet properly tested.
Author
Owner

@swoertz commented on GitHub (Dec 22, 2012):

I would suggest to make a checkbox (regular expression - true, false) instead of implementing it as a hidden feature. Persons who are not familiar either not check the box or look up what it means. All the others can simple use the feature. Further I would also suggest to make a checkbox for case-sensitivity.

<!-- gh-comment-id:11637061 --> @swoertz commented on GitHub (Dec 22, 2012): I would suggest to make a checkbox (regular expression - true, false) instead of implementing it as a hidden feature. Persons who are not familiar either not check the box or look up what it means. All the others can simple use the feature. Further I would also suggest to make a checkbox for case-sensitivity.
Author
Owner

@olliebennett commented on GitHub (Dec 22, 2012):

Okay - thanks for sharing your thoughts. I agree this would be a nice addition, but my concerns remain about making the UI too cluttered, and adding complexity.

Have a look at a mock-up of these new options:

regex-ui-problem

I think you'll agree, it's getting a bit complicated, and there's no longer a nice way to give instructions on comma-separated keywords (currently, the keywords field says: "Enter keyword Eg. Fire, Crime, Help", which is clear and consise).

I'm thinking of a more useful position for the extra options. I think, at least, this feature would require a new full-page screen, that could be used when creating or editing sync URLs.

In my code above, I'm just being lazy, to be honest, and implementing the feature with the fewest changes possible! ;)

I do think your case-sensitivity option is important too. I had considered working this into my "secret" regex activation code - something like regex(YOUR_REGEX)params(CASE_INSENSITIVE) - but had ignored the idea as this just makes it very confusing. With the UI changes, it'd be a great idea. :)

Let's see what @eyedol etc. think?

<!-- gh-comment-id:11637364 --> @olliebennett commented on GitHub (Dec 22, 2012): Okay - thanks for sharing your thoughts. I agree this would be a nice addition, but my concerns remain about making the UI too cluttered, and adding complexity. Have a look at a mock-up of these new options: ![regex-ui-problem](https://f.cloud.github.com/assets/1068525/28497/07949a5a-4c3f-11e2-8a51-73f61cbd44fd.png) I think you'll agree, it's getting a bit complicated, and there's no longer a nice way to give instructions on comma-separated keywords (currently, the keywords field says: "Enter keyword Eg. Fire, Crime, Help", which is clear and consise). I'm thinking of a more useful position for the extra options. I think, at least, this feature would require a new full-page screen, that could be used when creating or editing sync URLs. In my code above, I'm just being lazy, to be honest, and implementing the feature with the fewest changes possible! ;) I do think your case-sensitivity option is important too. I had considered working this into my "secret" regex activation code - something like `regex(YOUR_REGEX)params(CASE_INSENSITIVE)` - but had ignored the idea as this just makes it very confusing. With the UI changes, it'd be a great idea. :) Let's see what @eyedol etc. think?
Author
Owner

@swoertz commented on GitHub (Dec 22, 2012):

An other possibility would be, making two input fields and a checkbox for each to (de)activate (grey out the textfields) the filter mechanisms.
Of course this makes the UI more complicated but in my oppinion the benefits of the additional functionality outweighs the increased complexitiy. It would also be possible to omit the checkboxes and check if the input fields are empty or not...

I would further process the message if one of the mechanisms matches (LOGICAL OR).

[x] key-word filtering
+----------------------+
| e.g. crime, fire...  |
+----------------------+
[x] regex filtering
+----------------------+
| [a-z].*              |
+----------------------+
<!-- gh-comment-id:11637531 --> @swoertz commented on GitHub (Dec 22, 2012): An other possibility would be, making two input fields and a checkbox for each to (de)activate (grey out the textfields) the filter mechanisms. Of course this makes the UI more complicated but in my oppinion the benefits of the additional functionality outweighs the increased complexitiy. It would also be possible to omit the checkboxes and check if the input fields are empty or not... I would further process the message if one of the mechanisms matches (LOGICAL OR). ``` [x] key-word filtering +----------------------+ | e.g. crime, fire... | +----------------------+ [x] regex filtering +----------------------+ | [a-z].* | +----------------------+ ```
Author
Owner

@eyedol commented on GitHub (Dec 22, 2012):

Perhaps we could use a label saying you can enter regex or CSV to filter for keywords instead of the checkboxes. I agree with @olliebennett the ui can be a bit messy when we add the checkboxes

<!-- gh-comment-id:11637568 --> @eyedol commented on GitHub (Dec 22, 2012): Perhaps we could use a label saying you can enter regex or CSV to filter for keywords instead of the checkboxes. I agree with @olliebennett the ui can be a bit messy when we add the checkboxes
Author
Owner

@swoertz commented on GitHub (Dec 22, 2012):

I agree with you, but we have to distinguish between keywords and regexs. The 3 ways to do that we mentioned so far are:

  • by a textual indicator "regex(...)"
  • by checkboxes
  • by different input fields
<!-- gh-comment-id:11637675 --> @swoertz commented on GitHub (Dec 22, 2012): I agree with you, but we have to distinguish between keywords and regexs. The 3 ways to do that we mentioned so far are: - by a textual indicator "regex(...)" - by checkboxes - by different input fields
Author
Owner

@olliebennett commented on GitHub (Dec 22, 2012):

Of those three options, I think @swoertz's ASCII layout above is best.

I think the lack of case sensitivity toggling is a fair UX compromise, but there are two options for treating the input:

  • Treat as case sensitive - users can then enter (k|K)(e|E)(w|W)(o|O)(r|R)(d|D) if required?!
  • Treat as case insensitive - which I believe most users would want anyway (on account of text input on mobiles often handling capitalization automatically).

Which do you prefer? My use case for RegEx, would be to case-insensitively match messages that start with a keyword, for instance ^keyword.*$.

@eyedol, is there a fool-proof way of detecting whether the list is a RegEx, or a comma-separated list of keywords?

<!-- gh-comment-id:11638458 --> @olliebennett commented on GitHub (Dec 22, 2012): Of those three options, I think @swoertz's ASCII layout above is best. I think the lack of case sensitivity toggling is a fair UX compromise, but there are two options for treating the input: - Treat as case sensitive - users can then enter `(k|K)(e|E)(w|W)(o|O)(r|R)(d|D)` if required?! - Treat as case insensitive - which I believe most users would want anyway (on account of text input on mobiles often handling capitalization automatically). Which do you prefer? My use case for RegEx, would be to case-insensitively match messages that start with a keyword, for instance `^keyword.*$`. @eyedol, is there a fool-proof way of detecting whether the list is a RegEx, or a comma-separated list of keywords?
Author
Owner

@swoertz commented on GitHub (Dec 22, 2012):

is there a fool-proof way of detecting whether the list is a RegEx, or a comma-separated list of keywords?

In general there is no way, I can think of. Because every keyword is a valid regex and vice-versa if I take the regex literally I can use it as keyword.
If we allow only [a-zA-Z]* for example for keywords we could detect the special characters like "^$*" and so on, but I think this is confusing.

<!-- gh-comment-id:11638709 --> @swoertz commented on GitHub (Dec 22, 2012): > is there a fool-proof way of detecting whether the list is a RegEx, or a comma-separated list of keywords? In general there is no way, I can think of. Because every keyword is a valid regex and vice-versa if I take the regex literally I can use it as keyword. If we allow only [a-zA-Z]\* for example for keywords we could detect the special characters like "^$*" and so on, but I think this is confusing.
Author
Owner

@eyedol commented on GitHub (Dec 23, 2012):

@olliebennett no, there is no fool-proof way of detecting whether the list is a RegEx or CSV. I think we should just make the keyword input filed process RegEx. So as a user when I enter, heat,sun,cold,yam as keywords, SMSSync should be able to match them and when I enter ^keyword.*$ it should also be able to match that as well.

<!-- gh-comment-id:11642873 --> @eyedol commented on GitHub (Dec 23, 2012): @olliebennett no, there is no fool-proof way of detecting whether the list is a RegEx or CSV. I think we should just make the keyword input filed process RegEx. So as a user when I enter, `heat,sun,cold,yam` as keywords, SMSSync should be able to match them and when I enter `^keyword.*$` it should also be able to match that as well.
Author
Owner

@olliebennett commented on GitHub (Dec 23, 2012):

Okay. I guess we could try use a RegEx of our own to detect whether the filter text was a valid CSV (and if not, treat as RegEx).

This (realistically) means enforcing some limits on the acceptable keywords used in the CSV list. I propose limiting these keywords to be alpha-numeric, with spaces permitted both surrounding the comma (test, keyword) and also within the keyword itself (test keyword,and another).

Are we happy to place these limits on the acceptable keywords? And what should the limits be? I think those I've listed are sufficient and reasonable.

See http://refiddle.com/by/olliebennett/csv-or-regex for a first attempt that seems to work

It would be nice to simply treat everything as a regex, but to use @eyedol's example of heat,sun,cold,yam, we'd need to convert this to (heat|sun|cold|yam) (or something like that) in order to make it have the same effect as the keywords would suggest.

If this is okay with you guys, the change should be trivial, and I'll update the code (and relevant documentation) and submit a pull request.

assuming RegEx in Java behaves in the same way as RegEx in JavaScript...

<!-- gh-comment-id:11646878 --> @olliebennett commented on GitHub (Dec 23, 2012): Okay. I guess we could try use a RegEx of our own to detect whether the filter text was a valid CSV (and if not, treat as RegEx). This (realistically) means enforcing some limits on the acceptable keywords used in the CSV list. I propose limiting these keywords to be alpha-numeric, with spaces permitted both surrounding the comma (`test, keyword`) and also within the keyword itself (`test keyword,and another`). Are we happy to place these limits on the acceptable keywords? And what should the limits be? I think those I've listed are sufficient and reasonable. See http://refiddle.com/by/olliebennett/csv-or-regex for a first attempt that seems to work<sup>&dagger;</sup> It would be nice to simply treat everything as a regex, but to use @eyedol's example of `heat,sun,cold,yam`, we'd need to convert this to `(heat|sun|cold|yam)` (or something like that) in order to make it have the same effect as the keywords would suggest. If this is okay with you guys, the change should be trivial, and I'll update the code (and relevant documentation) and submit a pull request. <sup>&dagger;</sup> assuming RegEx in Java behaves in the same way as RegEx in JavaScript...
Author
Owner

@swoertz commented on GitHub (Dec 23, 2012):

This is a nice solution, but would not work for e.g. hashtags as keywords (example: #foo, #bar). Of course if one wants to filter by these tags (#foo)|(#bar) would work. There are other examples beside hash-tags as well. How would you document this, so that the user can recoginze the difference between keywords and reg. expressions while using the app? Many applications which are able to work with regex and "normal" strings use a checkbox (e.g. search-function in text-editors).

<!-- gh-comment-id:11647533 --> @swoertz commented on GitHub (Dec 23, 2012): This is a nice solution, but would not work for e.g. hashtags as keywords (example: `#foo, #bar`). Of course if one wants to filter by these tags `(#foo)|(#bar)` would work. There are other examples beside hash-tags as well. How would you document this, so that the user can recoginze the difference between keywords and reg. expressions while using the app? Many applications which are able to work with regex and "normal" strings use a checkbox (e.g. search-function in text-editors).
Author
Owner

@olliebennett commented on GitHub (Dec 23, 2012):

This is a tricky one!

I agree - there's a problem of letting the user know that what they think are normal keywords are in fact going to be treated as a (probably useless or even invalid) RegEx.

Perhaps we could add a confirmation box that pops up when saving the Sync URL settings, which would only appear if a special character has been entered, saying:

You have entered special characters in the filter field. This will be will now be treated as a Regular Expression.
| OK |

The documentation would need to state that:

Keywords can only be made of alpha-numeric characters (and spaces). For any more advanced tags, you will need to use Regular Expressions.

There would also need to be a couple of common RegEx examples, such as "starts with X", or similar.

It's my opinion that in this case, creating extra options in the UI adds more confusion than it does value.

Another option - as an alternative to using regex(abc) - is to require the use of ^ and $ at the start and end, respectively (although that still has problems).

There's always the (inefficient) workaround of handling the RegEx matching on the server, and only taking any action if a match is found. If it doesn't match, simply return success: "true" and then both systems can move on. This of course removes the need for RegEx support altogether (but it's not a nice solution!).

at this stage, we could also check the validity of the RegEx.

<!-- gh-comment-id:11647934 --> @olliebennett commented on GitHub (Dec 23, 2012): This is a tricky one! I agree - there's a problem of letting the user know that what they think are normal keywords are in fact going to be treated as a (probably useless or even invalid) RegEx. Perhaps we could add a confirmation box that pops up when saving the Sync URL settings, which would only appear if a special character has been entered, saying<sup>&dagger;</sup>: > You have entered special characters in the filter field. This will be will now be treated as a Regular Expression. > | OK | The documentation would need to state that: > Keywords can only be made of alpha-numeric characters (and spaces). For any more advanced tags, you will need to use Regular Expressions. There would also need to be a couple of common RegEx examples, such as "starts with X", or similar. It's my opinion that in this case, creating extra options in the UI adds more confusion than it does value. Another option - as an alternative to using `regex(abc)` - is to require the use of `^` and `$` at the start and end, respectively (although that still has problems). There's always the (inefficient) workaround of handling the RegEx matching on the server, and only taking any action if a match is found. If it doesn't match, simply return `success: "true"` and then both systems can move on. This of course removes the need for RegEx support altogether (but it's not a nice solution!). <sup>&dagger;</sup> at this stage, we could also check the validity of the RegEx.
Author
Owner

@swoertz commented on GitHub (Dec 23, 2012):

There are so many combinations which are not alphanumeric, which would lead to the message "You have entered special characters...".

The question is what is more confusing, the checkbox or the info-message when one enters "special" characters. If one is not familiar with regex's the default value of the checkbox (FALSE) is just fine. Further I think most of the smssync users are developers anyway (or is that a wrong assumption?).

How should a user who can't handle regex's react on the info message?

<!-- gh-comment-id:11649722 --> @swoertz commented on GitHub (Dec 23, 2012): There are so many combinations which are not alphanumeric, which would lead to the message "You have entered special characters...". The question is what is more confusing, the checkbox or the info-message when one enters "special" characters. If one is not familiar with regex's the default value of the checkbox (FALSE) is just fine. Further I think most of the smssync users are developers anyway (or is that a wrong assumption?). How should a user who can't handle regex's react on the info message?
Author
Owner

@eyedol commented on GitHub (Feb 11, 2013):

@olliebennett what's the status of this? If you send me a pull request, I'm happy to merge after a code review

<!-- gh-comment-id:13399112 --> @eyedol commented on GitHub (Feb 11, 2013): @olliebennett what's the status of this? If you send me a pull request, I'm happy to merge after a code review
Author
Owner

@olliebennett commented on GitHub (Feb 11, 2013):

I'm actually using my own modified version with RegEx in place of keywords as standard. Adding the actual RegEx capability was trivial, however adding a configuration option proved more complicated (creating database info, UI elements, loading/saving of existing configuration etc.) so the version I use has no changes to the UI. :( Sorry

Seem to have lost the code, but can provide the apk if anyone needs - just email.

I'll have another go at implementing it sometime, perhaps.

<!-- gh-comment-id:13409041 --> @olliebennett commented on GitHub (Feb 11, 2013): I'm actually using my own modified version with RegEx in place of keywords as standard. Adding the actual RegEx capability was trivial, however adding a configuration option proved more complicated (creating database info, UI elements, loading/saving of existing configuration etc.) so the version I use has no changes to the UI. :( Sorry Seem to have lost the code, but can provide the apk if anyone needs - just email. I'll have another go at implementing it sometime, perhaps.
Author
Owner

@mandric commented on GitHub (Feb 12, 2013):

Maybe provide a branch if it's not too much hassle?

<!-- gh-comment-id:13415608 --> @mandric commented on GitHub (Feb 12, 2013): Maybe provide a branch if it's not too much hassle?
Author
Owner

@olliebennett commented on GitHub (Feb 24, 2013):

I've slightly changed the way this works for me, by first checking whether the filter text matches a keyword, and if not, checking if it matches when treated as a RegEx. I know there could be issues with this approach, but can't think of any that would interfere with the typical, simple use cases.

I'm not comfortable enough with this code to throw a pull request yet, but I'll do some more tests, and consider bulking it out a bit, if needed.

Check out my feature-regex branch and see what you think. If anyone's got any input, please reply here.

<!-- gh-comment-id:14002274 --> @olliebennett commented on GitHub (Feb 24, 2013): I've slightly changed the way this works for me, by first checking whether the filter text matches a keyword, and if not, checking if it matches when treated as a RegEx. I know there could be issues with this approach, but can't think of any that would interfere with the typical, simple use cases. I'm not comfortable enough with this code to throw a pull request yet, but I'll do some more tests, and consider bulking it out a bit, if needed. Check out my [feature-regex](https://github.com/olliebennett/SMSSync/tree/feature-regex) branch and see what you think. If anyone's got any input, please reply here.
Author
Owner

@eyedol commented on GitHub (Feb 25, 2013):

We could document on SMSSync website as, for simple filtering, use CSV as the keywords and for complex filtering, consider entering a valid RegEx in the filter by input field.

<!-- gh-comment-id:14028070 --> @eyedol commented on GitHub (Feb 25, 2013): We could document on SMSSync website as, for simple filtering, use CSV as the keywords and for complex filtering, consider entering a valid RegEx in the filter by input field.
Author
Owner

@olliebennett commented on GitHub (Feb 25, 2013):

Agreed. The only possible problem that I foresee is if someone sends a message containing the RegEx itself. This would match as a 'keyword'.

Do you think we should change the 'enter keywords eg fire, help' placeholder text to something mentioning regex too? Or just document it online?

<!-- gh-comment-id:14028366 --> @olliebennett commented on GitHub (Feb 25, 2013): Agreed. The only possible problem that I foresee is if someone sends a message containing the RegEx itself. This would match as a 'keyword'. Do you think we should change the 'enter keywords eg fire, help' placeholder text to something mentioning regex too? Or just document it online?
Author
Owner

@eyedol commented on GitHub (Feb 25, 2013):

Yeah, we could make the placeholder text mention RegEx too

<!-- gh-comment-id:14028433 --> @eyedol commented on GitHub (Feb 25, 2013): Yeah, we could make the placeholder text mention RegEx too
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
starred/SMSSync#17
No description provided.