Wednesday, June 14, 2006

Unicode in Ruby, Unicode in JRuby?

The great unicode debate has started up again on the ruby-talk mailing list, and for once I'm not actively avoiding it. We have been asked many times by Java folks why JRuby doesn't suport unicode, and there's really been no good answer other than "that's the way Ruby does it".

For the record, Ruby does support utf8, but not multibyte. Internally, it usually assumes strings are byte vectors, though there are libraries and tricks you can usually use to make things work. The array of libraries and tricks, however, is baffling to me.

So here, now, I open up this question to the Java, JRuby, and Ruby communities in general as I did on the ruby-talk and rails-core mailing lists:

Every time these unicode discussions come up my head spins like a top. You should see it.

We JRubyists have headaches from the unicode question too. Since JRuby is currently 1.8-compatible, we do not have what most call *native* unicode support. This is primarily because we do not wish to create an incompatible version of Ruby or build in support for unicode now that would conflict with Ruby 2.0 in the future. It is, however, embarrassing to say that although we run on top of Java, which has arguably pretty good unicode support, we don't support unicode. Perhaps you can see our conundrum.

I am no unicode expert. I know that Java uses UTF16 strings internally, converted to/from the current platform's encoding of choice by default. It also supports converting those UTF16 strings into just about every encoding out there, just by telling it to do so. Java supports the Unicode specification version 3.0. So Unicode is not a problem for Java.

We would love to be able to support unicode in JRuby, but there's always that nagging question of what it should look like and what would mesh well with the Ruby community at large. With the underlying platform already rich with unicode support, it would not take much effort to modify JRuby. So then there's a simple question:

What form would you, the Ruby [or JRuby or Java] users, want unicode to take? Is there a specific library that you feel encompasses a reasonable implementation of unicode support, e.g. icu4r? Should the support be transparent, e.g. no longer treat or assume strings are byte vectors? JRuby, because we use Java's String, is already using UTF16 strings exclusively...however there's no way to get at them through core Ruby APIs. What would be the most comfortable way to support unicode now, considering where Ruby may go in the future?


So there it is, blogosphere. Unicode support is all but certain for JRuby; its usefulness as a JVM language depends on it. What should that support look like?