Dev Watch

Blog archive

Malformed Stack Overflow Post Chokes Regex, Crashes Site

A few lessons to be learned here: Watch your backtracking regular expressions; always be ready for the most bizarre edge cases; and, for goodness' sake, don't post questions that contain 20,000 whitespaces.

On Wednesday, the popular programming Q&A site that helps millions of developers with their coding problems was brought down for 34 minutes. According to a follow-up postmortem post explaining the outage, "The direct cause was a malformed post that caused one of our regular expressions to consume high CPU on our Web servers."

The high CPU load was caused by a regex -- designed to trim unicode spaces from each end of lines -- chewing through some 20,000 whitespaces looking for a pattern match and, eventually finding none, backtracking to the next character to be examined and repeating the process ... again and again and again (or something like that; I never did grok regexes, but the real explanation is included two paragraphs down).

Making matters worse, the post was included on the home page list, so it was called again and again and .... Even worse: Stack Overflow's load balancer checks the home page to gauge system health and removed servers from the processing mix, bringing down the whole site.

XXX
[Click on image for larger view.] The Code Containing the 'Happy Sound' Line (with whitespaces removed) (source: Stack Overflow)

Here's the meat of the technical details included in the postmortem:

The regular expression was: ^[\s\u200c]+|[\s\u200c]+$ Which is intended to trim unicode space from start and end of a line. A simplified version of the Regex that exposes the same issue would be \s+$ which to a human looks easy ("all the spaces at the end of the string"), but which means quite some work for a simple backtracking Regex engine. The malformed post contained roughly 20,000 consecutive characters of whitespace on a comment line that started with -- play happy sound for player to enjoy. For us, the sound was not happy.

If the string to be matched against contains 20,000 space characters in a row, but not at the end, then the Regex engine will start at the first space, check that it belongs to the \s character class, move to the second space, make the same check, etc. After the 20,000th space, there is a different character, but the Regex engine expected a space or the end of the string. Realizing it cannot match like this it backtracks, and tries matching \s+$ starting from the second space, checking 19,999 characters. The match fails again, and it backtracks to start at the third space, etc.

So the Regex engine has to perform a "character belongs to a certain character class" check (plus some additional things) 20,000+19,999+19,998+…+3+2+1 = 199,990,000 times, and that takes a while. This is not classic catastrophic backtracking (talk on backtracking) (performance is O(n²), not exponential, in length), but it was enough.

Stack Overflow didn't provide many further details about the post in question, but a Reddit reader identified it as "Join tiles in Corona SDK into one word for a Breakout game grid?" The 20,000 whitespaces have been removed.

What remains in question is why would a post include a line with 20,000 leading whitespaces to begin with. And wouldn't that just result in the appearance of totally white page or something, anyway?

Another question (actually posed on Stack Overflow after the incident): Why does Stack Overflow use a backtracking regex implementation?

And, of more importance to Stack Overflow questioner "nguyentrungkien" -- who literally brought the house down -- how exactly does one go about joining tiles in the Corona SDK into one word for a Breakout game grid?

Poor nguyentrungkien only received two answers to his question, totaling only four upvotes, and neither answer was marked correct. You Lua experts should feel free to weigh in.

And you shouldn't worry about crashing the site like nguyentrungkien did. "This regular expression has been replaced with a substring function," the postmortem concluded.

Have you ever brought down a site through some bizarre means? I once did it with a bad database call. How about you? Comment here or drop me a line.

Posted by David Ramel on July 22, 2016