JavaScript: Unicode letters (in RegExps?)

As you probably know already, browsers' regexp engines don't "know" Unicode.  The \w specifier is equivalent to [a-zA-Z0-9_], which is far from sufficient to match word characters in most languages.  After some googling I found XRegExp, a pretty cool library that extends the basic JS RegExp object.  It adds some useful magic and, surprise, Unicode support is available as a plugin.

Nice, but I need this in a very time-critical operation, and there it was rather slow.  (As you will see below, it's quite possible that XRegExp isn't to blame for the slowness.)

So I thought I'd copy the bits I need from the source code and use them with standard JS regexps rather than XRegExp.  If you look at the source, a hash has a huge line for an "L" property (it's lowercase in the code) that defines the Unicode ranges for characters that are letters.  Later in the code, a for loop constructs the actual regexp from those ranges.
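
The idea of turning a table of ranges into a character class can be sketched roughly as follows.  This is a minimal illustration, not XRegExp's actual code: the LETTER_RANGES table is a tiny hand-picked subset of the letter ranges, and the names hex4 and buildLetterClass are made up for this example.

```javascript
// Hypothetical, heavily abbreviated table: NOT the real "L" data from XRegExp.
// Each pair is [first, last] code unit of a run of letters.
var LETTER_RANGES = [
  [0x0041, 0x005A], // A-Z
  [0x0061, 0x007A], // a-z
  [0x00C0, 0x00D6], // Latin-1 letters, first run
  [0x00D8, 0x00F6], // Latin-1 letters, second run
  [0x0410, 0x044F], // basic Cyrillic
  [0x4E00, 0x9FA5]  // CJK unified ideographs
];

// Format a code unit as a \uXXXX escape for use in regexp source text.
function hex4(n) {
  var s = n.toString(16);
  while (s.length < 4) s = "0" + s;
  return "\\u" + s;
}

// Join all ranges into a single character class such as [\u0041-\u005a...].
function buildLetterClass(ranges) {
  var parts = [];
  for (var i = 0; i < ranges.length; i++) {
    var lo = ranges[i][0], hi = ranges[i][1];
    parts.push(lo === hi ? hex4(lo) : hex4(lo) + "-" + hex4(hi));
  }
  return "[" + parts.join("") + "]";
}

var LETTER_CLASS = buildLetterClass(LETTER_RANGES);
// Matches strings made up only of letters from the table above.
var wordRe = new RegExp("^" + LETTER_CLASS + "+$");
```

With this table, wordRe.test("na\u00EFve") is true, while wordRe.test("abc1") is false because digits are not in the class.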

Firefox trouble

(Update: in case it matters, I'm using Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.5pre) Gecko/20091009 Ubuntu/9.04 (jaunty) Shiretoko/3.5.5pre)

OK, I took that line, implemented the regexp, all was good, and I went on doing other interesting things.  About an hour later I noticed that it worked slower and slower, to the point where my application was becoming unusable.

Profile, hack & debug, change stuff, profile again — and finally, I think I figured out who's guilty: there's something fishy with Firefox's regexp engine!  This regexp works slower each time you run it.

Here's a simple test case which you can run (needs Firebug for timing the execution).  Just load this page, and watch the Firebug console, note the time displayed, then refresh.

My results (fresh browser instance):

  • first request: 63ms
  • 10th request: 314ms
  • 20th request: 1063ms

and the time keeps going up.  No memory seems to be leaked, though it's hard to be sure about this.

Note that the slowness isn't triggered by refreshing the page; if you call that regexp many times without a page refresh, it still gets slower and slower.
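
For anyone who wants to repeat the measurement without Firebug, a rough timing loop might look like the one below.  It is a sketch, not the test page mentioned above: the function name timeBatches, the sample regexp, and the batch sizes are all assumptions, and Date-based timing is coarse.

```javascript
// Run the regexp runsPerBatch times per batch and record each batch's duration in ms.
function timeBatches(re, text, batches, runsPerBatch) {
  var timings = [];
  for (var b = 0; b < batches; b++) {
    var start = new Date().getTime();
    for (var i = 0; i < runsPerBatch; i++) {
      re.test(text);
    }
    timings.push(new Date().getTime() - start);
  }
  return timings;
}

// Hypothetical sample: a small letter class standing in for the full Unicode one.
var sampleRe = /[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u0410-\u044F]+/;
var timings = timeBatches(sampleRe, "some text with \u00FCnicode in it", 5, 1000);
```

If the engine degrades as described, later entries in timings would be noticeably larger than earlier ones; on an engine without the problem they should stay roughly flat.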

Solution

While the solution I finally implemented isn't a general replacement for the regexp, it happens to work well for me.  My key problem was "how to determine if a character is a word character or not, Unicode letters included".  Here it is, simplicity itself:

// Treat a character as a letter if upper- and lowercasing it give different results.
function isUnicodeLetter(c) {
  return c.toUpperCase() != c.toLowerCase();
}

Incredibly fast, does the job and doesn't get tired. ;-)  Of course, I added some more checks to determine if it's a digit too, since I want digits to be treated as "part of a word".
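
The extra digit check isn't shown here; one plausible shape for it is sketched below.  The names isDigit and isWordChar are hypothetical, and the digit test assumes ASCII 0-9 only.  Note that the case trick returns false for letters that have no separate upper and lower case forms, such as CJK ideographs.

```javascript
// Letter test from the post: true when the two case conversions differ.
function isUnicodeLetter(c) {
  return c.toUpperCase() != c.toLowerCase();
}

// Assumption: only ASCII digits count as digits.
function isDigit(c) {
  return c >= "0" && c <= "9";
}

// Hypothetical combined check: letters and digits are "part of a word".
function isWordChar(c) {
  return isUnicodeLetter(c) || isDigit(c);
}
```

For example, isWordChar("\u00E9") and isWordChar("7") are true, while isWordChar(" ") and isWordChar("\u4E2D") (a caseless CJK character) are false.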

    Comments

    • By: Steven Levithan, Nov 05 (08:49) 2009. RE: JavaScript: Unicode letters (in RegExps?)

      That's a weird problem you're seeing in Firefox (and indeed, I can easily reproduce it using your test page). Have you reported it on Mozilla's bug site?

      For the record, your isUnicodeLetter function is not equivalent to your regex, as there are thousands of letters without case (Lo), modifier letters (Lm), and titlecase letters (Lt) that the function will not work for (consider all Chinese, Japanese, and Korean ideographic characters, just for starters).

    Page info
    Created: 2009/10/20 15:27
    Modified: 2009/10/21 00:37
    Author: Mihai Bazon
    Comments: 2
    Tags: browsers, firefox, javascript, programming