|
|
|
Jack Shirazi looks at optimizing the performance of java.util.regex regular expressions. For those of you looking for just the summary tips, they are: Explicitly compile; reuse the Matcher; use find; strip leading and trailing .*; use indexOf to rule out non-matching lines; watch out for very long lines. Now there is no need to read further unless you are interested in the details.Published September 2007, Author Jack Shirazi
Page 1 of 2
next page: Filtering, find(), and the conclusion
Being able to use regular expressions in many of my applications is pretty important to me, without them I'd have to use complicated parsers - I used to often write custom built ones - or use ugly not easily readable code that breaks up text bit by bit using lots of indexOf() and substring() calls. (Or even break out to Perl when it's a program that I really want in Java). But it's been a long time since I've needed to resort to any of those alternatives: Java has supported regular expressions as part of the core language since Java 1.4 and even before that there were several very good third party Java regular expression libraries (which I don't consider in this article).
In the course of using the java.util.regex classes, I've come across a few inefficiencies that I've successfully worked around, and I'll cover these in this article. First off is a well known optimization, to avoid recompiling the regular expression. It's really a no brainer, but not always as obvious as you might think - if you use the String methods like String.matches() or String.split(), then every time you call the method it compiles the regular expression. That's because the String methods are convenience shortcuts for using java.util.regex Pattern and Matcher objects, and so you don't get the efficiencies of using the Pattern and Matcher objects directly. For example, this statement:
boolean result = string.matches(regex);
executes the equivalent of:
Pattern p = Pattern.compile(regex); Matcher m = p.matcher(string); boolean result = m.matches( );
If the regular expression is to be reapplied to multiple strings, a better solution would be to call the Pattern.compile( ) method only once, but that option is not available if you use the shortcut String.matches(String regex) method. Similarly calling split() in a loop as you run through a file means that every loop iteration causes a recompile of the regular expression - and that is not a cheap thing to do.
It's worth getting an idea of the overhead of recompiling, so I compared looping on
for( ... ) testString = ... if (testString.matches(toMatch)) ...
with
Pattern p = Pattern.compile(toMatch); for( ... ) testString = ... Matcher m = p.matcher(testString); if (m.matches()) ...
This test showed the String.matches() loop takes about a third longer. Of course I did the usual things to ensure that my microbenchmark was not misrespresented by HotSpot profiling, JIT compilation, and caching, but this test will be highly dependent on the data being used, both the regular expressions and the strings being matched. Nevertheless, I would expect this optimisation to always be neutral or positive, that is there should never be a performance disadvantage to coding this way (there is, of course, a readability disadvantage, as it uses extra lines of code).
Checking the garbage collection showed that it also didn't distort the microbenchmark, but was interesting, showing that the recompilations in String.matches() caused much more garbage collection - four times as much - and lead straight on to my next rather obvious optimisation, detailed in the next section.
Not that many people seem to notice that the Matcher class has a reset() method. This allows you to reuse a Matcher object for matches against multiple different strings. Reusing a Matcher object doesn't save much time, but in the case of a loop it is one less object to create in each loop iteration, and correspondingly one less object to be garbage collected per loop iteration. The previous test loop needs a slight recoding to use Matcher.reset(), and leads to this code:
Pattern p = Pattern.compile(toMatch); Matcher m = p.matcher(testString); for( ... ) testString = ... m.reset(testString); if (m.matches()) ...
Once again, this pattern of coding should always be neutral or better for performance (at the cost of being slightly less readable yet again). In my tests for my data set I saw about a 10% peformance improvement.
Page 1 of 2
next page: Filtering, find(), and the conclusion