Regex for matching text that does not contain a given word

Often we would like to see if a input string is present in the given text or not. While, regular expressions are not the best way to do this, it is useful to understand how we can achieve this using regular expressions. I will present here two different regular expressions which match the text that does not contain a given input word. Both use negative lookahead but behave differently on long input text. We read the text to match from a file, so that we can test our regex against arbitrarily long text. The word that should not occur is passed as the second argument as this is generally short in length.
This is the first regex

var fs = require('fs');
var inputStr = fs.readFileSync(process.argv[2])
                 .toString().replace('\n','');
//regex is /^((?!word).)*$/
var matcher = new RegExp('^((?!' + process.argv[3] + ').)*$');
console.log(matcher.test(inputStr));

download
and this is the second regex

var fs = require('fs');
var inputStr = fs.readFileSync(process.argv[2])
                 .toString().replace('\n','');
//regex is /^(?!.*word).*$/
var matcher = new RegExp('^(?!.*' + process.argv[3] + ').*$');
console.log(matcher.test(inputStr));

download (3)
First regular expression causes stack overflow error by some regex engines (tested with JavaScript/Java) when input text is long and does not contain the given word. It takes twice the number of steps to find the match when compared to the steps taken by second regex.

Advertisements

Simple for-loop startsWith and endsWith in JavaScript

After looking at all the answers posted at http://stackoverflow.com/questions/280634/endswith-in-javascript about implementing String endsWith in JavaScript, I was wondering why there is no simple for-loop implementation in that list of answers. While we all want a concise solution, for-loop endsWith is much more readable and is really really fast. Here is the link to jsperf

function endsWith(str, suffix) {
for (var suffixLength = suffix.length - 1, stringLength = str.length - 1;
suffixLength > -1; --suffixLength, --stringLength) {
if (str.charAt(stringLength) === suffix.charAt(suffixLength)) continue;
else return false;
}
return true;
}
endsWith('abcdefghijklm', 'lam') //false
endsWith('abcdefghijklm', 'klm') //true

While we are at it, here is the for-loop for startsWith

function startsWith(str, prefix) {
for (var index = 0; index < prefix.length; ++index) {
if (str.charAt(index) === prefix.charAt(index)) continue;
else return false;
}
return true;
}
startsWith('abcdefghijklm', 'abc') //true
startsWith('abcdefghijklm', 'def') //false

Trim a string using lookahead

I thought of implementing a simple trim operation using regular expression lookahead and came up with the following version for a start.

var str = "    sDAFSdasf afsd fads  sDAFSdasf afsd      ";
str = /(?!\s).*?(?=\s*$)/g.exec(str)[0];
//returns "sDAFSdasf afsd fads  sDAFSdasf afsd"

For comparing the performance of this lookahead, I picked up trim1 and trim11 from http://blog.stevenlevithan.com/archives/faster-trim-javascript. Here is the link to jsperf : javascriptlookaheadtrim Jsperf indicates that the trim using lookahead is quite fast on small strings and it is quite slower on larger strings. This is expected as this regex scans the entire string while other trim versions do not scan through the whole string.

Explanation of regex

(?!      # Assert that it is impossible to match the regex below starting at this position
   \s    # Match a single character that is a “whitespace character”
)
.        # Match any single character that is not a line break character
   *?    # Between zero and unlimited times, as few times as possible, expanding as needed
(?=      # Assert that the regex below can be matched, starting at this position
   \s    # Match a single character that is a “whitespace character”
      *  # Between zero and unlimited times, as many times as possible, giving back as needed
   $     # Assert position at the end of a line
)

Regular expression generator for matching the anagrams of a given string

This post is related to the answer I gave to the stackoverflow question Is it possible to generate a (compact) regular expression for an anagram of an arbitrary string?
The following javascript code will generate a regex that will match all the anagrams of a given input string. The regex length will increase linearly with the length of the input. It generates a regex which uses positive lookahead to match the anagram of the input string. The lookahead part of regex makes sure all the characters are present in the test input string ignoring their order and the matching part ensures that the length of the test input string is same as the length of the input string (for which regex is constructed).

function anagramRegexGenerator(input) {
	var lookaheadPart = '', matchingPart = '^';
	var positiveLookaheadPrefix='(?=', positiveLookaheadSuffix=')';
	var inputCharacterFrequencyMap = {}
	for ( var i = 0; i < input.length; i++ )
	{
	    !inputCharacterFrequencyMap[input[i]] 
                  ? inputCharacterFrequencyMap[input[i]] = 1
	          : ++inputCharacterFrequencyMap[input[i]];
	}
	for ( var j in inputCharacterFrequencyMap) {
	    lookaheadPart += positiveLookaheadPrefix;
	    for (var k = 0; k < inputCharacterFrequencyMap[j]; k++) {
	        lookaheadPart += '.*';
	        if (j == ' ') {
	            lookaheadPart += '\\s';
	        } else {
	            lookaheadPart += j;
	        }
	        matchingPart += '.';
	    }
	    lookaheadPart += positiveLookaheadSuffix;
	}
	matchingPart += '$';
	return lookaheadPart + matchingPart;
}

Sample input and output is the following

anagramRegexGenerator('aaadaaccc')
//generates the following string.
"(?=.*a.*a.*a.*a.*a)(?=.*d)(?=.*c.*c.*c)^.........$"
anagramRegexGenerator('abcdef ghij'); 
//generates the following string.
"(?=.*a)(?=.*b)(?=.*c)(?=.*d)(?=.*e)(?=.*f)(?=.*\s)(?=.*g)
(?=.*h)(?=.*i)(?=.*j)^...........$" 
//test run returns true
/(?=.*a)(?=.*b)(?=.*c)(?=.*d)(?=.*e)(?=.*f)(?=.*\s)(?=.*g)(?=.*h)
(?=.*i)(?=.*j)^...........$/.test('acdbefghij ')
//or using the RegExp object
//this returns true
new RegExp(anagramRegexGenerator('abcdef ghij')).test('acdbefghij ') 
//this returns false
new RegExp(anagramRegexGenerator('abcdef ghij')).test('acdbefghijj') 

Regular Expressions

In programming languages one generally comes across a situation where one needs to match input text against a pattern. I will quickly summarize the basic terminology involving the use of regular expressions.
1. Input text
2. Regular expression – this is also called the search pattern, there is a whole lot of syntax on how to construct this pattern which can be found here ( Regular Expression Reference)
3. Regular expression engine – this is the program that basically matches the regular expression occurring in the input text, and then it can do several operations on the matches that are found such as replacing it with some other text. So this can be thought of as consisting of two main components : a matcher and a replacer
4. Match – the text in the input text that complies exactly to the specification of the regular expression.
I will take a regex I recently modified (Original regex) to explain the concepts involved. This regex basically matches the url inside input text. I was working on showing tweets on a page and I needed to replace the plain URL text inside the tweet with html anchor tag pointing to the the plain URL text.
Here I am using javascript regex syntax to show how one can use this regex to convert text ‘This is the URL of current webpage http://wearmyhat.blogspot.com&#8217; to ‘This is the URL of current webpage http://wearmyhat.blogspot.com
I marked the modification I did in bold..this is a slight modification to avoid matching the ellipsis that occurs after the actual URL text. For example original regex was matching the ellipsis occuring after the following URL http://wearmyhat.blogspot.com&#8230; This was not really acceptable to my application.
Modified Regex (avoids consecutive ‘.’s in the url):

/\b(
     (?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)
     (?:(?:[^\s().]+[.]?)+|\((?:[^\s()]+|(?:\([^\s()]+\)))*\))+
     (?:\((?:[^\s()]+|(?:\([^\s()]+\)))*\)|[^\s`!()\[\]{};:'".,?«»“”‘’])
   )/gi

What I am going to do is take the main constructs that are used in this regex and explain them one by one.
1. \b Matches character between word and non-word character..in other words matches starting or ending character of a word.
2. () Capturing group [whatever is inside is captured by the matcher in entirety]. The above regex has only one outer capturing group with multiple non-capturing groups placed inside. These groups are the only groups that can be backreferenced.
3. (?:) Non-capturing group [whatever is inside is not captured by the matcher in entirety]. This also means that these groups are not available for backreferencing purpose. Any number of capturing groups inside a non-capturing group are treated as capturing groups by the Regex engine (which is actually non-intuitive).
4. [\w] matches a word character.
5. [^\s] matches a non-space character, [a-z] matches any character between a to z.
6. /{1,3} forward slash 1-3 times, \d{0,3} a digit 0-3 times, [a-z]{2-4} any character between a to z can occur 2-4 times.
7. My change : ([^\s().]+[.]?)+ corresponding group in original regex [^\s()]+
Here I force ‘.’ to occur only once at a time in the input text.
8. $1 This is needed to backreference a captured group which is needed to construct the anchor tag properly.
9. + means preceding expression can once or more, ? means preceding expression can occur 0 or 1 times, * means preceding expression can occur 0 or more times. These are called the quantifiers.
10. g and i in the end indicate the javascript syntax to match globally and in a case insensitive manner.
11. The second parameter of the replace function is the string that will replace the match found by the regex engine. We replace every url with an anchor tag pointing to that url.