
JavaScript: Unicode letters (in RegExps?)

As you probably know already, browsers' regexp engines don't "know" Unicode.  The \w specifier seems to be equivalent to [a-zA-Z0-9_], which is far from sufficient to match word characters in most languages.  After some googling I found XRegExp, a pretty cool library that extends the basic JS RegExp object.  It adds some useful magic and, surprise, Unicode support is available as a plugin.
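To make the \w limitation concrete, here is a minimal check you can paste into a console.  The sample strings are mine, picked only for illustration:

```javascript
// \w in a plain JS regexp (no flags) only covers [a-zA-Z0-9_].
var asciiWord = /^\w+$/;

var results = {
  plain: asciiWord.test("hello_42"),  // ASCII letters, digits, underscore
  accented: asciiWord.test("café"),   // "é" is outside [a-zA-Z0-9_]
  cyrillic: asciiWord.test("слово")   // no ASCII letters at all
};
```

Only the first test passes; the accented and Cyrillic words are rejected even though they are perfectly good words.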

Nice going, but since I need this in a very time-intensive operation, it turned out to be rather slow.  (As you will see below, it's quite possible that XRegExp isn't to blame for the slowness.)

So I thought I'd copy the bits that I need from the source code, and use them with standard JS regexps, rather than XRegExp.  If you look at the source code, it has a huge line for an "L" property (it's lowercase in the code) in a hash which defines the Unicode ranges for characters that are letters.  Later in the code, a for loop constructs the actual regexp from those ranges.
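Roughly, the idea looks like the sketch below.  Note the assumptions: LETTER_RANGES is a tiny, hand-picked subset written in a space-separated format I made up for readability, and buildLetterRegExp is a hypothetical helper name.  Neither is XRegExp's actual data or code.

```javascript
// Hypothetical, heavily truncated range table (NOT the real XRegExp data).
// Each entry is a 4-digit hex code point or a "start-end" pair.
var LETTER_RANGES =
  "0041-005A 0061-007A 00AA 00B5 00BA 00C0-00D6 00D8-00F6 " +
  "00F8-02C1 0400-0481 4E00-9FCB";

// Turn the table into a character class such as [\u0041-\u005A\u0061-...].
function buildLetterRegExp(ranges) {
  var parts = ranges.split(" ");
  var cls = "";
  for (var i = 0; i < parts.length; i++) {
    var ends = parts[i].split("-");
    cls += "\\u" + ends[0];
    if (ends.length === 2) {
      cls += "-\\u" + ends[1];
    }
  }
  return new RegExp("[" + cls + "]");
}

var letterRe = buildLetterRegExp(LETTER_RANGES);
```

With this subset, letterRe.test("é") and letterRe.test("ж") are true while letterRe.test("5") is false.  The real table is far longer, which is why the resulting regexp is so big.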

Firefox trouble

(Update: in case it matters, I'm using Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.5pre) Gecko/20091009 Ubuntu/9.04 (jaunty) Shiretoko/3.5.5pre)

OK, I took that line, implemented the regexp and all was good and I went on doing other interesting things.  What I noticed about an hour later is that it worked slower and slower, to the point where my application was becoming unusable.

Profile, hack & debug, change stuff, profile again — and finally, I think I figured out who's guilty: there's something fishy with Firefox's regexp engine!  This regexp works slower each time you run it.

Here's a simple test case which you can run (needs Firebug for timing the execution).  Just load this page, and watch the Firebug console, note the time displayed, then refresh.

My results (fresh browser instance):

  • first request: 63ms
  • 10th request: 314ms
  • 20th request: 1063ms

and the time keeps going up.  No memory seems to be leaked, though it's hard to be sure about this.

Note that the slowness isn't triggered by refreshing the page; if, without a page refresh, you happen to call that regexp a lot of times, then it will still be slower and slower.
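A rough way to watch for this without reloading is to time the same regexp object over several rounds.  This is only a sketch: timeRegExp is a hypothetical helper, and the short regexp here is a stand-in for the huge Unicode one from the test page.

```javascript
// Run re against single characters of text, return elapsed ms and match count.
function timeRegExp(re, text, iterations) {
  var start = new Date().getTime();
  var matches = 0;
  for (var i = 0; i < iterations; i++) {
    if (re.test(text.charAt(i % text.length))) {
      matches++;
    }
  }
  return { ms: new Date().getTime() - start, matches: matches };
}

// Stand-in regexp; reuse the SAME object every round.
var re = /[a-zA-Z\u00C0-\u00D6\u00D8-\u00F6]/;
var sample = "The quick brown fox, 123 times.";

var timings = [];
for (var round = 0; round < 5; round++) {
  timings.push(timeRegExp(re, sample, 10000).ms);
}
```

If the numbers in timings climb steadily from round to round, the engine is getting slower with use rather than with page loads.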

Solution

The solution I finally implemented can't be used inside a regexp, but it happens to work well for me.  My key problem was "how to determine if a character is a word character or not, Unicode letters included".  Here it is, simplicity itself:

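// A character counts as a letter if changing its case changes it.
// Caseless letters (e.g. CJK ideographs) are reported as non-letters.
function isUnicodeLetter(c) {
  return c.toUpperCase() !== c.toLowerCase();
}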

Incredibly fast, does the job and doesn't get tired. ;-)  Of course, I added some more checks to determine if it's a digit too, since I want digits to be treated as "part of a word".
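For illustration, the digit check could be combined with it along these lines.  This is a sketch under my own assumptions, not the exact code from my application: isDigit and isWordChar are hypothetical names, and only ASCII digits are treated as digits.

```javascript
// Repeated here so the snippet runs on its own.
function isUnicodeLetter(c) {
  return c.toUpperCase() !== c.toLowerCase();
}

// Assumption: only ASCII digits 0-9 count as digits.
function isDigit(c) {
  return c >= "0" && c <= "9";
}

// A "word character" is a cased letter or a digit.
function isWordChar(c) {
  return isUnicodeLetter(c) || isDigit(c);
}
```

So isWordChar("é") and isWordChar("7") are true, while isWordChar(" ") and isWordChar("-") are false.  Keep in mind that letters without case, such as "中", also come out false with this approach.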

    Comments

    • By: Steven Levithan | Nov 05 (08:49) 2009 | RE: JavaScript: Unicode letters (in RegExps?)

      That's a weird problem you're seeing in Firefox (and indeed, I can easily reproduce it using your test page). Have you reported it on Mozilla's bug site?

      For the record, your isUnicodeLetter function is not equivalent to your regex, as there are thousands of letters without case (Lo), modifier letters (Lm), and titlecase letters (Lt) that the function will not work for (consider all Chinese, Japanese, and Korean ideographic characters, just for starters).

      • By: mishoo | Nov 05 (11:31) 2009 | RE[2]: JavaScript: Unicode letters (in RegExps?)

        Reported now: https://bugzilla.mozilla.org/show_bug.cgi?id=526724

        Indeed, looks like my hack doesn't cover all possibilities...  In any case, it's the only usable one for now: that regexp takes seconds once it has run a few dozen times. :-(

    • By: website melbourne | Dec 30 (08:43) 2010 | RE[3]: JavaScript: Unicode letters (in RegExps?)

      I'm doing an English/Spanish site with ASP.NET using some client side validation with Regular Expressions.

      I wanted to write a single Regular Expression for most large text fields:

      ^[\w\d\s-'.,&#@:?!()$\/]+$

      Notice that I'm using \w and \d for WORD characters and DIGITS respectively. I was assuming that JavaScript would allow "áÁéÉíÍóÓúÚñÑüÜ" and characters like it when a browser is configured for Spanish, but it seems only to care about A-Za-z.

      I wanted to avoid using A-Za-z as it's so English-focused.

      What's the i18n "right thing to do" when using Regula

    • By: Southampton Internet Marketing | May 24 (14:41) 2011 | RE[4]: JavaScript: Unicode letters (in RegExps?)

      it worked slower and slower, to the point where my application was becoming unusable.

    • By: india domain registration | Jan 04 (07:04) 2012 | RE[5]: JavaScript: Unicode letters (in RegExps?)

      Thank you very much for the post; it's very informative and helped a lot.

    Page info
    Created: 2009/10/20 15:27
    Modified: 2009/10/21 00:37
    Author: Mihai Bazon
    Comments: 5
    Tags: browsers, firefox, javascript, programming