Friday, December 2, 2011

Regular Expressions Rock!

A colleague at work recently cobbled together the following regular expression, but wanted help to better understand it.

^(?:(?!(core\.windows\.net)).)*$

This regex is intended to match anything that is not a Windows Azure Storage URL. It essentially looks for strings that do not contain "core.windows.net". Here's a breakdown of the regex from the inside out:

  1. \.: Character match. Matches '.'. The '\' character escapes the proceeding character.
  2. core\.windows\.net: Word match. Matches the text "core.windows.net".
  3. (core\.windows\.net): Text grouping. Treats the "core.windows.net" text as a group.
  4. (?!(core\.windows\.net)): Negated look-ahead assertion. From the point of evaluation, fails the match if the proceeding text is "core.windows.net".
  5. (?:(?!(core\.windows\.net)).): A non-capturing match-all. The "(?:" means that that the contents of the grouping, which would normally be made available for retrieval after evaluation, are not captured for retrieval. The '.' matches any character, unlike the escaped "\.", which matches a '.' character.
  6. (?:(?!(core\.windows\.net))*: Repeat the evaluation any number of times.
  7. ^(?:(?!(core\.windows\.net)).)*$: Do not match if any character from the start to the end of a line is followed by "core.windows.net". '^' denotes the start of a line and '$' denotes the end of a line.
The key to understanding this is to really understand how the negated look-ahead works. Here's an excellent reference for regexes. The reference is specifically for perl, but the concepts are nearly the same across all systems that support regexes. A good question is why we can't just write something that fails if a body of text contains the target text? The reason is that regular expressions do not support such a simple negated expression, so what is shown here achieves that functionality through alternative means.

This regex is interesting because it requires a strong understanding of how regexes work versus a simple regex like ^abc$ (which matches lines containing exactly the text "abc"). To really understand it, one must understand the implications of the behavior of the negated look-ahead, which does not "consume" text. Likewise, the non-capturing grouping is about memory efficiency rather than having anything to do with the correctness of the expression.

Regular expressions are a very powerful tool for matching and retrieving data from text. They are usually concise, replacing lots of custom parsing with one clever expression. They are also easy to test in the same way the custom parsing would be tested. That is, pass them various texts and check for the expected output.

Thinking about the problem above actually leads to a great interview question. "Write a function to evaluate a set of strings to determine whether they contain <some complex pattern>?" An example complex pattern is valid email addresses. Clearly, it's possible to code up some custom parsing or devise a regex. Either way, the candidate will need to do some abstract thinking, identify edge cases, test for correctness, and explain why their solution works They should also be able to do this relatively quickly. If the candidate evaluates the trade-offs of using the different methods (regexes are not always simpler), it shows practicality in their approach to problem solving. Familiarity with regexes is also good in that it shows that they are knowledgeable about technologies that are not often taught in schools.