Fasterj

|home |articles |cartoons |site map |contact us |
Tools: | GC log analysers| Multi-tenancy tools| Books| SizeOf| Thread analysers|

Our valued sponsors who help make this site possible
New Relic: Try free w/ production profiling and get a free shirt! 

Site24x7: Java Method-Level Tracing into Transactions @ $12/Month/JVM. Sign Up! 

Optimising Regular Expression Processing: page 2 of 2

jKool for DevOps
Light up your Apps & get a cool t-shirt

JProfiler
Get rid of your performance problems and memory leaks!


Java Performance Tuning, 2nd ed
The classic and most comprehensive book on tuning Java

Java Performance Tuning Newsletter
Your source of Java performance news. Subscribe now!
Enter email:


New Relic
New Relic: Try free w/ production profiling and get a free shirt!

Site24x7
Site24x7: Java Method-Level Tracing into Transactions @ $12/Month/JVM. Sign Up!


JProfiler
Get rid of your performance problems and memory leaks!

jKool for DevOps
Light up your Apps & get a cool t-shirt


Jack Shirazi looks at optimizing the performance of java.util.regex regular expressions. For those of you looking for just the summary tips, they are: Explicitly compile; reuse the Matcher; use find; strip leading and trailing .*; use indexOf to rule out non-matching lines; watch out for very long lines. Now there is no need to read further unless you are interested in the details.
Published September 2007, Author Jack Shirazi

Page 2 of 2
last page: Basic optimisations

Filter Out Non-Matching Lines Before Applying The Regex

The next optimisation is where we start entering the domain of the less obvious. It's a trick that I've used before in other areas (e.g. sorting): doing more work can end up being faster than doing less. In this case, a simple string search is always going to be more efficient than any regular expression comparison, and if the situation is right we can use that to speed things up. It's easiest to give an example for this: If your regular expression is something like GH.JK, i.e. [match a "GH" followed by any single character followed by a "JK"] then an efficient regular expression matcher should do a string search for "GH" as a first step. I don't know how efficient the java.util.regex implementation is (and in any case it can change across Java versions), but even if it is very efficient, it will have some significant overhead compared to a simple String.indexOf() comparison. So my optimisation suggestion here is to break out the constant parts of your regular expression (assuming there are some), and do an indexOf() against the strings using those partial parts. This efficiently filters out any lines that the regular expression is guaranteed to fail against, and hopefully should be faster.

So the obvious question is, how much faster? (Well, the obvious first question is "Does it work?" but clearly it does sometimes, or I wouldn't have included a section here suggesting you do this!). The code needs to change again, we have to add support for searching against constant fragments, reducing readability again and this time definitely introducing serious maintenance overhead even if you programmatically extract those constant fragments from the regular expression:

Pattern p = Pattern.compile(toMatch);
String toMatchFragment1 = ...
String toMatchFragment2 = ...
Matcher m = p.matcher(testString);
for( ... )
  testString = ...
  m.reset(testString);
  if ( testString.indexOf(toMatchFragment2) != -1 && testString.indexOf(toMatchFragment2) != -1 && m.matches() )
  ...

(Note for full optimisation you would do better by putting the Matcher.reset() call after the indexOf() calls, i.e. if (...indexOf()...) {m.reset(...); if(m.matches()){...}}, but I haven't done that in the fragment above or in my timed test, to keep the changes slightly simpler and more obvious). So after that, let's answer that question, how much faster is this? Again, this depends very strongly on the data and regex's used. In my test I tried a worst case data set where every line was passed through to the regex, and best case data set where no lines were passed through to the regex. Obviously for the worst case scenario the time should be worse compared to just using the regex with no indexOf (since the indexOf takes time but filters nothing), and in my tests I saw a decrease in performance of 5%. But for the best case scenario, where we never have to apply the regular expression match even once, the loop was an order of magnitude faster! That is a serious improvement, and well worth testing for your applications if you have a performance problem applying regular expressions in your apps. I'd add that in my actual applications, as opposed to this benchmark, I do really see good speedups, including up to an order of magnitude faster when a significant number of the lines being matched against can be filtered out to avoid the regex match.

Eliminate Unnecessary Patterns And Use Find

This next optimisation is really the result of seeing maintained code evolve to end up inefficient. The matches() methods assume that you are matching the whole string from end to end, so if you ony care about matching a fragment of the string, it is easy to put .* at the beginning and end of your pattern to use the matches() method. This is especially the case if you start out by using String.matches() and then the code evolves to explicitly use Matcher.matches() - the regular expression can stay with those superfluous "match anything" .* at the beginning and end. But Matcher has a find() method specifically for matching patterns within strings, and obviously this should be used. No guesses for figuring out that find() against a regex is always going to be more efficient than matches() using the corresponding .*regex.* . If this is applicable to your regular expression usage, you could see a good performance gain from this. My tests showed the find() method taking in the region of 25% to 50% of the time taken by the matches() call for this optimisation - a two- to four-fold speedup. But do note the caveat given in the next section.

Be Careful Of Long Lines

A colleague of mine, Douglas Donaldson, alerted me to this last issue. It turns out that the Matcher.find() method can have really bad performance when failing to match against a very long line (several thousand characters) if the regular expression has a .* as the beginning of the expression. Of course, if you are using find(), you shouldn't have a .* at the beginning of the expression in the first place, but in the case where this inefficiency was identified, the regexs were user defined ones passed in to the application, and no developer thought about stripping leading .* from passed in regexs (why would you?), while it is quite easy to prepend a .* while constructing a regex if you are thinking in terms of the whole line ("let's see now, I'll start at the beginning of the line. First, I can match any characters until I get to a blah, so that is .*blah, then ..."). In this case the inefficiency causes regex matching times of several orders of magnitude worse than without the .* - performance sufficiently bad that it can take several seconds to parse one single long line!

This is likely to be from some backtracking inefficiency within the regex implementation, and it is perfectly likely that other pathological inefficiencies are in any regular expression library implementation - it a complex algorithm and can easily contain inefficiencies.

Conclusion

The methods in the String class are perfectly adequate for one-off uses of regular expressions, but can be inefficient for repeated application of a regular expression. As usual, you should write initially for readability and maintenance, and only start tuning when you find tuning is required. The tuning techniques detailed in this article are straightforward to apply, so you don't have to construct your application in any special way beforehand to have it ready for tuning - just profile as usual to find your bottleneck, then tune.

Percentage improvement from various optimisations:

Optimisationspeed compared to using String.matches()
String.matches()100%
Compiling once only75%
+ Reusing Matchers70%
+ find() without surrounding .* 30%
+ Filter using indexOf(), all lines filtered5%

Contact Me

Page 2 of 2
last page: Basic optimisations


Last Updated: 2017-05-01
Copyright © 2007-2017 Fasterj.com. All Rights Reserved.
All trademarks and registered trademarks appearing on Fasterj.com are the property of their respective owners.
URL: http://www.fasterj.com/articles/regex2.shtml
RSS Feed: http://www.JavaPerformanceTuning.com/newsletters.rss
Trouble with this page? Please contact us