International Components for Unicode 3.0 Released

Language designers should find this good news encouraging. ICU is very capable, and too many languages still lack Unicode support; with ICU around, there is no good excuse for that. ICU carries a liberal X-style open source license, suitable for both GPL and proprietary work.

ICU comes in three flavors: C, Java, and Java Native Interface. If you care about Java, also consider the independent Managing Gigabytes for Java project and related papers.

Some pessimism from WINE

This interview offers insights from a programmer who wants to switch away from ICU. He likes FriBidi.

Strings

People don't often realize that string literals are among the fundamental abstractions provided by most programming languages.

When we were young we were taught that "ehud" is a 4-byte array (of some kind). Obviously this isn't the case in Haskell, but most programmers coming from C/Pascal/Ada expect this literal to be (a) unambiguous and (b) to contain 4 byte-wide elements (possibly with a nul terminator).

These days things are much more complicated, what with character sets, bidi, etc., but most languages don't really take this into account. Even if they support some weak form of "wide chars", string literals are still what they used to be.

Sure, educating programmers about Unicode etc. is important, but isn't it about time programming language support for these things was enhanced, so that strings would again be primitive data types and string literals would work "correctly"?

Re: Strings

That is the reason for this alert about ICU. Language designers can obtain direct Unicode support via ICU.

Designers should adopt Unicode strings as intrinsic language primitives. Strings should not be the same primitives as arrays, though certain semantics may overlap.

I'm glad you mentioned the wchar_t nonsense, because it is a common fallacy that wchar_t equals Unicode support. At best it is partial UTF-16 support with a big leak. Real UTF-16 characters can occupy 16 or 32 bits (surrogate pairs), and have endian issues besides, so wchar_t can't even cover UTF-16. Aside from that, Windows wchar_t is a different size from Linux wchar_t. Some applications use wchar_t and just pretend at Unicode. If you ask me, the best thing that could happen to Unicode is for wchar_t to disappear.
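
A minimal sketch of both leaks, in plain C (no ICU needed; the wchar_t size printed depends on your platform, and U+1D11E is just one example of a character beyond the Basic Multilingual Plane):

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* Typically 2 bytes on Windows, 4 on Linux/glibc: the same
           source quietly means different things on each platform. */
        printf("sizeof(wchar_t) = %u\n", (unsigned)sizeof(wchar_t));

        /* U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so
           UTF-16 needs *two* 16-bit code units (a surrogate pair). */
        unsigned short clef_utf16[] = { 0xD834, 0xDD1E };
        printf("UTF-16 code units for U+1D11E: %u\n",
               (unsigned)(sizeof clef_utf16 / sizeof clef_utf16[0]));
        return 0;
    }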

For the performance minded, there is an idea for a Unicode string implementation that preserves the speed and storage advantages of C arrays. The ICU team gave it a lot of flak, but agreed it could work. (If you bother reading it, beware the mistaken use of "code point" for "code unit": a code unit is a fixed-size unit of the encoding form, 8, 16, or 32 bits, while a code point is the full Unicode character, encoded by one or more units; details here.)
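
The distinction is easy to demonstrate with ICU4C itself, where u_strlen counts 16-bit code units and u_countChar32 counts code points. A minimal sketch:

    #include <stdio.h>
    #include <unicode/ustring.h>

    int main(void)
    {
        /* One code point, U+1D11E, stored as a surrogate pair. */
        static const UChar clef[] = { 0xD834, 0xDD1E, 0 };

        printf("code units:  %d\n", (int)u_strlen(clef));          /* 2 */
        printf("code points: %d\n", (int)u_countChar32(clef, -1)); /* 1 */
        return 0;
    }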

Languages should use UTF-16 internally as the optimal encoding, allowing programmers to define literals in any encoding, and providing conversions as needed or where specified.

Beyond Unicode, I advocate "memory string literals" allowing declarations of raw memory in binary, octal, or hex. Embedded programmers will know why. Even C makes it hard to define a large block of hex: you have to break it up into a C array, which is awkward. How about b"101010101111111000000" or x"ABCD12340000FFFF11115", or equivalently x"ABCD 1234 0000 FFFF 1111 5"? (Yes, the string is not word-aligned, so the language must pad as appropriate.)
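
For comparison, here is the awkward C idiom such a literal would replace. Note that the trailing nibble must be padded by hand; I arbitrarily pad it to 0x50 here, which is precisely the decision the language ought to make for you:

    #include <stdint.h>

    /* x"ABCD 1234 0000 FFFF 1111 5", spelled out the hard way. */
    static const uint8_t blob[] = {
        0xAB, 0xCD, 0x12, 0x34,
        0x00, 0x00, 0xFF, 0xFF,
        0x11, 0x11, 0x50        /* trailing nibble, hand-padded */
    };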

Language expressiveness

I am more interested in the language expressiveness angle. You can make it easy to enter unicode data, but that's not enough. Suppose I say S:=S + "אהוד". Since the string literal is in an RTL language, how should the language do the concatenation? Same goes for sorting etc.

You can provide library routines that support multiple options. It's much more challenging to decide on the "best" way to handle issues like this and provide built-in language support, with appropriate syntax, etc.

Not a syntax issue

Since the string literal is in an RTL language, how should the language do the concatenation?

Of course it should concatenate on the left, as usual. The semantics of + should not depend on how you express the constant (your Hebrew string here). The semantics of the expression should be the same, for example, if you had written "\uXXXX\uXXXX" or hebrewString() or whatever.

I understand the issue in question here, but it is a type issue and not a syntactic issue. If you really want a language-specific (or rather, writing direction-specific) string concatenation operator, then you ought to have a language-specific string type, to tie it together with.

Your point is actually rather interesting for me because it is the most intuitive example I have seen of the commutativity iso: one wants strings in Right-To-Left languages to be isomorphic via commutativity, and not via identity, to ones in Left-To-Right languages, and this cannot be determined just by looking at the type structure. I think I may use this in a paper...

(BTW, Ehud, as is mentioned in one of the pages cited above, almost everyone will misread RTL as "Run-Time Library"... it took me a couple of seconds to figure it out as well. The page suggests using "BiDi" instead, although that doesn't work here.)

Of course

I understand the issue in question here, but it is a type issue and not a syntactic issue.

Of course. Never meant to say it's a syntactic issue.

Re: Not a syntax issue

Frank:
[T]he issue in question here...is a type issue and not a syntactic issue. If you really want a language-specific...string concatenation operator, then you ought to have a language-specific string type, to tie it together with.

I agree; Frank echoes my point. Since we want strings with different semantics than arrays, we need a new type. The semantics of strings should obey Unicode+BiDi rules. The semantics of the plus operator should obey Unicode+BiDi rules. In this particular case, I would inspect the BiDi rules for user input and implement them in the compiler. Still, if we get a paper out of Frank over this, then it was well worth the post! (That's a compliment, Frank.)

As Ehud says, we tend to think of C idioms as if Moses brought them down the mountain: string == array, byte == char, ordering == LTR == memory order, etc. Strings are among the worst problems in C/C++, and were so long before Unicode or BiDi arrived. There is enough difficulty in getting strings right, and enough importance attached to the data type itself, to justify native string types and operators. I'm afraid that not all language designers believe this statement. Some do, but consider it too much work. That's where ICU can help.

Hm, maybe if there were a distinct syntax for RTL string literals...

One could adopt a trick like the hex data string I showed: pick a prefix to designate strict RTL, and perhaps another to designate true BiDi strings containing both LTR and RTL runs. It's not just directionality but encoding at issue too, so languages need ways to enter UTF-8 literals, UTF-32 literals, and so on.

Actually, string literals in source code are bad. Doesn't that rule follow the prohibition against GOTO? There ought to be support, but perhaps the harder i18n stuff belongs in resource files anyway.

Unicode Encodes Semantics Not Appearance

A very nice and moderately detailed article by Richard Gillam drives the point home.

ICU offers a short discussion of its low-level string storage.

A problem at all?

Is this really a problem at all? In RTL languages the first character of a word is still the first character in memory. It's just drawn from the right side.

But I'm possibly missing something, as I assume you have some experience with this.

The problem

In RTL languages the first character of a word is still the first character in memory.

How do you know how the PL chooses to represent a string? It can represent it in any order it likes.

Anyway, it can't determine the writing direction of a string just by looking at the characters used in it. There are explicit direction-changing "virtual characters" in Unicode (the directional marks and embedding controls), but how is it to know the direction at the start of a string? Perhaps Unicode says the default is LTR, say, but do you really want to force every user to insert an RTL marker at the beginning of a Hebrew string literal?
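
To see how heavy that burden would be, here is what a Hebrew literal with a hand-inserted direction mark looks like as raw UChar data in C with ICU (U+200F is the RIGHT-TO-LEFT MARK; the letters spell "אהוד"):

    #include <unicode/utypes.h>

    /* U+200F RIGHT-TO-LEFT MARK, then alef he vav dalet, then NUL. */
    static const UChar ehud_rtl[] = {
        0x200F,
        0x05D0, 0x05D4, 0x05D5, 0x05D3,
        0
    };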

Hm, maybe if there were a distinct syntax for RTL string literals...

A distinct syntax for RTL string literals

if there were a distinct syntax for RTL string literals...
Like back ticks instead of quotes? :-)
I might agree if we were limited to ASCII, but hey, the source of our language must support Unicode not only inside string literals, right?

Alternatively, what's wrong with prefixes or suffixes denoting text direction in the same way as prefixes in C and Java denote base for numeric literals (like 0x34)?

The problem is display, not concatenation.

Anyway, it can't determine the writing direction of a string just by looking at the characters used in it.
Unicode defines the directionality of characters, so actually it can. There are lots of details, such as numerals having only weak directionality and spaces being directionally neutral.

It can represent it in any order it likes.
True, but Unicode requires that a string have a first-to-last logical order, so he's right to question whether there's really a problem. In fact, Ehud's original question of how to concatenate an RTL string is not a problem -- since it is the right-hand operand it is appended after the original characters in logical order.

But there is a problem and that is how to display the string. The LTR part of the string comes before the RTL part. When the two parts are displayed together, which part is on the left and which on the right? That's what the directional formatting codes are for.

How to solve the display problem.

The display problem could be solved by inheriting the direction from the context the program is running in, which solves not only the order, but also the alignment.

If you output the string in an English environment, the English part is aligned with the left side of the display (or window), and the Hebrew part comes to the right of that. If you output the string in a Hebrew environment, "first" means "to the right", so you get the Latin part aligned with the right side of the display, and the Hebrew part to the left of that.
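
ICU already implements this split between logical storage and contextual display. A sketch with ICU4C's BiDi API, where the base direction is supplied by the environment; the paraLevel value (0 or 1) is the only thing that changes between an English and a Hebrew context (the mixed English-plus-Hebrew literal is just an example):

    #include <stdio.h>
    #include <unicode/ubidi.h>

    int main(void)
    {
        /* "abc " followed by Hebrew alef he vav dalet, in logical order. */
        static const UChar logical[] = {
            0x0061, 0x0062, 0x0063, 0x0020,
            0x05D0, 0x05D4, 0x05D5, 0x05D3, 0
        };
        UChar visual[32];
        UErrorCode status = U_ZERO_ERROR;

        /* Base direction inherited from the environment:
           0 = an LTR context, 1 = an RTL context (Hebrew desktop). */
        UBiDiLevel paraLevel = 0;

        UBiDi *bidi = ubidi_open();
        ubidi_setPara(bidi, logical, -1, paraLevel, NULL, &status);
        int32_t len = ubidi_writeReordered(bidi, visual, 32,
                                           UBIDI_DO_MIRRORING, &status);
        ubidi_close(bidi);

        if (U_SUCCESS(status))
            printf("visual order computed: %d code units\n", (int)len);
        return 0;
    }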

Is this enough

When you say print(s), s being of a string type, how should the character data be encoded for printing (or for writing to a file)? We used to assume ASCII, and this was enough unless you had the "pleasure" of working with EBCDIC. These days the answer isn't so clear cut. Should the language require an extra parameter? Should strings carry information about their encoding? And so on. All these seem like questions language designers should answer.
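
One possible shape for an answer, sketched with ICU4C: keep strings as UTF-16 internally and name the target encoding explicitly at the I/O boundary (UTF-8 here, via u_strToUTF8):

    #include <stdio.h>
    #include <unicode/ustring.h>

    int main(void)
    {
        /* "אהוד" as UTF-16 code units: alef he vav dalet. */
        static const UChar s[] = { 0x05D0, 0x05D4, 0x05D5, 0x05D3, 0 };
        char utf8[64];
        int32_t utf8Len = 0;
        UErrorCode status = U_ZERO_ERROR;

        /* The output encoding is an explicit choice, not an assumption. */
        u_strToUTF8(utf8, (int32_t)sizeof utf8, &utf8Len, s, -1, &status);
        if (U_SUCCESS(status))
            fwrite(utf8, 1, (size_t)utf8Len, stdout);
        return 0;
    }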

Re: Is this enough

The intrinsic string type knows its encoding, BiDi issues, size, etc. The representation of such strings on screen is a text editor problem, not a string type problem. Does that answer work for you?

RTL Overload

I must be in even worse shape: I always parse "RTL" as "Register Transfer Language" first, then "Run Time Library." Only when reminded of it am I able to read it as "Right-To-Left."

Pike Strings

Pike deserves honorable mention for its interesting string support. The string data type is an opaque intrinsic type with (semi-)consistent semantics no matter the contents. What makes it interesting is the optimized memory footprint. It can mix ASCII with Unicode characters.

Whether Pike strings offer O(1) character access or O(N) is unknown to me (cf. the ICU proposal); a look at the CVS would tell. Strings are immutable and shared, though assignment still works as one expects, with a copy. From a user's standpoint these strings seem very friendly.
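
For the curious, the memory optimization is easy to imagine even without reading Pike's source. Here is a C sketch of width-tagged storage (my guess at the idea, not Pike's actual code) that keeps indexing O(1):

    #include <stdint.h>
    #include <stddef.h>

    /* Width-tagged string storage: each string uses the narrowest
       element width that holds its widest character, so pure ASCII
       costs one byte per character while full Unicode still fits. */
    typedef struct {
        uint8_t shift;  /* 0: 8-bit, 1: 16-bit, 2: 32-bit elements */
        size_t  len;    /* length in characters, not bytes */
        void   *data;   /* elements of (1 << shift) bytes each */
    } pstring;

    /* Indexing stays O(1) whatever the element width. */
    static uint32_t pstring_index(const pstring *s, size_t i)
    {
        switch (s->shift) {
        case 0:  return ((const uint8_t  *)s->data)[i];
        case 1:  return ((const uint16_t *)s->data)[i];
        default: return ((const uint32_t *)s->data)[i];
        }
    }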

Byte-shuffling C lovers who require data transparency should remember that it's possible to convert string data to and from opaque types. Furthermore, one can implement strings in any language, even C, since most languages (including Pike) offer C hooks. In other words, if you really like wchar_t pointers, you can have them. Just don't ask the rest of the world to suffer.

A language can and should do what it wants under the hood to present consistent and clean semantics. Every language should include an intrinsic string type. There remains little excuse to pretend that a wchar_t pointer is realistic Unicode support. Strings belong in the language, not in libraries. The reason is evident even in Pike's own documentation: "There may however be some operations, for example certain methods in certain modules, that cannot handle wide strings but that work with 8-bit strings." Library authors should not need to worry about encodings. Today's languages sport hash tables, complex numbers, and other "fancy" data types as built-ins. It's time to put strings on an equal footing.