Solving the Regex of Madness - and a few thought about snarky answers on StackOverflow

Galimatias

Jan 31, 2019 • 3 min read

This infamous answer on StackOverflow have reached meme status. The question is about finding opening tags in XHTML using a regular expression, and the answer makes it abundantly clear that this is not possible - and you will surely go mad if you try.

The answer is hilarious, and is one of the most upvoted answers ever. Unfortunately it is also wrong. It is quite possible and not even that difficult:

(  
   # match all tags in XHTML but capture only opening tags
    <!-- .*? --> # comment   
  | <!\[CDATA\[ .*? \]\]> # CData section
  | <!DOCTYPE ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* > 
  | <\? .*? \?> # xml declaration or processing instruction  
  | < \w+ ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* /> # self-closing tag  
  | < (?<tag> \w+ ) ( "" [^""]* "" | ' [^']* ' | [^>/'""] )* > # opening tag - captured  
  | </ \w+ \s* > # end tag  
)

Granted, this is more than a one-liner, but if you know how to read regular expressions, it is straightforward to understand.

Not only can the task be solved with a regular expression - regular expressions are basically the only practical way to solve the problem. Which is why none of the clever answers actually suggest another way to solve the problem.

So why does many of the answers claim it is impossible? (And even feel justified in berating and ridiculing the asker for even attempting such an obviously foolish task? Seriously, read some of the answers. Thy are dripping with condescension.)

One of the answers attempts this impressive sounding explanation: "the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), you can't possibly make this work"

It all comes down to a widespread misunderstanding about what is possible and not possible with regular expressions.

Regular expressions uses character patterns to match substrings in a string. These patterns are pretty flexible and powerful, but they have one important limitation: They cannot be recursive.

Lets say you have this string "( ( ) )" and you want to find matching pairs of parentheses. You cannot write a regular expression which match "( ( ) )", but not "( ( )". Well you can if you only need to match to a specific nesting depth, but you cannot write a regular expression which would match to any depth. Because the pattern between the start and end parenthesis is same as the whole pattern.

(If you have ever wondered why many languages doesn't allow nested comments, this is the explanation. Comments can then easily be stripped with a single regular expression. With nested comments it would be more complex.)

For XHTML, this means you can identify individual syntax units like comments, start tags, end tags etc. But you cannot match start tags to end tags using a regular expression alone, since elements can be nested arbitrarily deep.

Of course you can pretty easily solve the matching by combining the regex matching with a counter or stack. But then you are not using only a regular expression, you have ventured into parsing.

Parsing means to translate a string of characters into some kind of tree, where the nodes represent meaningful units in the input. In the DOM for example, the nodes represent elements, attributes, text-nodes and similar from the document.

Parsing typically uses (at least) two steps: Tokenization which uses regular expressions to splits the input string into a sequence of syntax elements (numbers, quoted strings, keywords, brackets etc.), and parsing which takes this sequence of tokens and builds a tree structure from it, using a stack or some form of recursion to match recursive structures.

A parser often discards insignificant information from the syntax. For example the difference between an empty element <x></x> and a self-closing tag <x/> is not semantically significant, so the DOM does not represent that information. The lower-level SAX-parser API which many XHTML parsers are based on does not expose this distinction either.

Therefore you cannot just use a parser to solve the problem in the question. (At least not a standard DOM parser - there may be some XHTML parsers available that do expose this, but none of the answers provide any examples.)

In general StackOverflow is a great resource, but this is one instance where the voting system means memes and cargo cult beliefs win out over legitimate and helpful answers, and this helps spread misinformation and confusion.

Sign up for more like this.