Parser Generators Supporting Astral Characters

It's important to me that the programming language I'm working on has full Unicode support, which includes supporting the 'astral' characters from planes 1 through 16.

I'm not very interested in parsing, so I would really prefer to use a parser generator instead of writing a parser myself. However, the parser generators I've looked at only support plane 0, the Basic Multilingual Plane, despite claiming to have "full" Unicode support.

As a side note, they often don't even warn you about this when specifying astral characters in the grammar. They happily accept code points over 0xFFFF and output broken code. The generated code doesn't even check for this stuff at runtime, but rather just behaves unexpectedly. For example, when allowing [0x010000..0x10FFFF] characters in identifiers, SableCC output code that parsed my entire input file as a single gigantic identifier.

So I'm wondering, what popular parser generators support astral characters? Or if there aren't any, what unpopular ones? Am I not going to be able to use a preexisting parser generator to get what I want?

PLT Scheme + any Scheme parser generator

PLT Scheme says it supports characters up to U+10FFFF; a brief test indicates it works fine with the SMP (Supplementary Multilingual Plane). So any sane, pure-Scheme parsing system (like this one?) ought to work fine.

By Jay Kominek at Thu, 2008-07-10 22:45

I was about to post the same

I was about to post the same suggestion. Of course, if you're a sane person and use s-expressions, the whole issue of writing a parser can be resolved over a cup of coffee.

By John Nowak at Fri, 2008-07-11 03:12

You mean any pure-MzScheme parsing system...

Unfortunately, Scheme's cross-implementation portability isn't very good, though hopefully the situation will prove to be *much* better with R⁶RS.

Many if not most R⁵RS implementations don't have Unicode support. Unicode is part of R⁶RS, however, and PLT, Larceny, Ikarus, and Chez are all working on support for the new standard.

By Leon P Smith at Fri, 2008-07-11 09:20

UTF-8

I'm not sure what is meant by "full Unicode support" here, but what about using an ordinary 8-bit clean lexer/parser to process UTF-8? At least, it should be sufficient if allowing Unicode characters in identifiers and literals is all that is needed. Nothing special is needed to handle characters above U+FFFF.
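As a rough illustration of the byte-level approach, here is a minimal Python sketch. It assumes identifiers consist of ASCII letters, digits, underscore, and any non-ASCII character; `scan_identifier` and `ASCII_IDENT` are hypothetical names. Every byte of a multi-byte UTF-8 sequence has its high bit set, so a scanner that accepts bytes 0x80 and above passes astral characters through without ever decoding them.

```python
import string

# ASCII bytes allowed in identifiers (an assumption made for this sketch).
ASCII_IDENT = frozenset((string.ascii_letters + string.digits + "_").encode("ascii"))

def scan_identifier(data: bytes, pos: int = 0) -> bytes:
    """Return the run of identifier bytes starting at pos."""
    end = pos
    # Any byte >= 0x80 belongs to a multi-byte UTF-8 sequence and is accepted as-is.
    while end < len(data) and (data[end] >= 0x80 or data[end] in ASCII_IDENT):
        end += 1
    return data[pos:end]

source = "x\U00010400y = 1".encode("utf-8")  # U+10400 lies outside the BMP
print(scan_identifier(source).decode("utf-8"))
```

Note that this scanner does not check that the bytes form valid UTF-8; that check would have to happen in a later phase.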

If ranges of characters have grammatical significance (as in "type names are written in cuneiform, module names in futhark runes"), this could take some lexer persuasion; otherwise, remaining validity checking could be deferred to a later phase.
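The "lexer persuasion" for character ranges could look something like the following Python sketch, which expresses two code point ranges as patterns over UTF-8 bytes. The block boundaries used (Cuneiform U+12000..U+123FF, Runic U+16A0..U+16FF) and the name `classify` are assumptions chosen to match the example in the text; a lexer generator would need the equivalent byte-level rules.

```python
import re

# U+12000..U+123FF (Cuneiform block) encodes as F0 92 80..8F 80..BF.
CUNEIFORM = re.compile(rb"(?:\xF0\x92[\x80-\x8F][\x80-\xBF])+")
# U+16A0..U+16FF (Runic block) encodes as E1 9A A0..BF or E1 9B 80..BF.
RUNIC = re.compile(rb"(?:\xE1\x9A[\xA0-\xBF]|\xE1\x9B[\x80-\xBF])+")

def classify(name: bytes) -> str:
    """Classify a UTF-8 encoded name by the script it is written in."""
    if CUNEIFORM.fullmatch(name):
        return "type name"
    if RUNIC.fullmatch(name):
        return "module name"
    return "other"

print(classify("\U00012000\U00012001".encode("utf-8")))
print(classify("\u16A0\u16B1".encode("utf-8")))
```

Ranges that do not line up with UTF-8 byte boundaries need more alternatives per pattern, which is where this approach starts to get unwieldy.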

Normalisation may or may not be required, but it could be done by code feeding the lexer.

By Mattias Engdeg at Tue, 2008-07-15 10:27

If you find such a beastie,

If you find such a beastie, let me know. We tried to do it this way with flex in the first BitC compiler, and it was a pretty bad mess. We actually did manage to build a working regexp for UTF-8 encoded characters, but you lose the ability to do identifier recognition as intended. Once you decide to cut over to a stop list, you quickly conclude that the lexer generator isn't buying you much. Automatically generated lexers are horribly slow in any case.

The saving grace is that Unicode issues arise only in identifiers, and writing a UTF-8 byte->char decoder isn't that hard. You need a table to give you the appropriate Unicode character classes, and a stop list for keywords, and you're basically done.
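The three pieces mentioned above (decoder, character-class table, stop list) can be sketched roughly as follows in Python. This is not the BitC lexer; `decode_utf8`, `read_word`, the keyword set, and the choice of general categories for identifier characters are all assumptions made for illustration, and Python's `unicodedata` module stands in for the character-class table.

```python
import unicodedata

KEYWORDS = frozenset({"if", "let", "lambda"})  # hypothetical stop list
ID_START = frozenset({"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"})  # assumed classes
ID_CONTINUE = ID_START | {"Mn", "Mc", "Nd", "Pc"}

def decode_utf8(data: bytes, pos: int):
    """Decode one code point at pos; return (code_point, next_pos)."""
    b0 = data[pos]
    if b0 < 0x80:
        return b0, pos + 1
    if 0xC2 <= b0 <= 0xDF:
        extra, cp, minimum = 1, b0 & 0x1F, 0x80
    elif 0xE0 <= b0 <= 0xEF:
        extra, cp, minimum = 2, b0 & 0x0F, 0x800
    elif 0xF0 <= b0 <= 0xF4:
        extra, cp, minimum = 3, b0 & 0x07, 0x10000
    else:
        raise ValueError(f"invalid lead byte at offset {pos}")
    tail = data[pos + 1 : pos + 1 + extra]
    if len(tail) != extra or any(b & 0xC0 != 0x80 for b in tail):
        raise ValueError(f"bad continuation bytes at offset {pos}")
    for b in tail:
        cp = (cp << 6) | (b & 0x3F)
    # Reject overlong forms, out-of-range values, and surrogates.
    if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"invalid code point at offset {pos}")
    return cp, pos + 1 + extra

def read_word(data: bytes, pos: int):
    """Read a word at pos; return (kind, text, next_pos), or None if none starts here."""
    chars = []
    while pos < len(data):
        cp, nxt = decode_utf8(data, pos)
        allowed = ID_CONTINUE if chars else ID_START
        if unicodedata.category(chr(cp)) not in allowed:
            break
        chars.append(chr(cp))
        pos = nxt
    if not chars:
        return None
    text = "".join(chars)
    # The stop list decides whether the word is a keyword or an identifier.
    return ("keyword" if text in KEYWORDS else "identifier", text, pos)

source = "let x\U00010400 = 1".encode("utf-8")
print(read_word(source, 0))
print(read_word(source, 4))
```

Because the decoder yields whole code points, an astral character such as U+10400 is handled exactly like any other letter.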

There is no compelling reason for the language input processor to handle normalization. Anything other than Normalization Form C is so rare that you can require use of an external tool. The language's I/O library may or may not want to consider a more complete handling of Unicode, but the likelihood that you are going to do that better than the ICU team has already done closely approximates nil.
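If the input processor only rejects input that is not in Normalization Form C and leaves the actual normalizing to an external tool, the check can be quite small. A minimal Python sketch, assuming the source has already been decoded to a string (`check_nfc` is a hypothetical name, and the Unicode version applied is whatever the Python build ships with):

```python
import unicodedata

def check_nfc(text: str) -> None:
    """Raise ValueError unless text is already in Normalization Form C."""
    if unicodedata.normalize("NFC", text) != text:
        raise ValueError("input is not in Normalization Form C")

check_nfc("caf\u00e9")  # precomposed e-acute: accepted
try:
    check_nfc("cafe\u0301")  # 'e' followed by a combining acute accent
except ValueError as err:
    print("rejected:", err)
```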

By shap at Tue, 2008-07-22 02:33

We had to deal with this in BitC

Short answer: the parser doesn't care. The issue is the lexer. The best answer is to hand-write your lexer (this is the right answer whether or not you want to handle UTF-8). If you are writing in C/C++, most of the actual Unicode work is done for you by libicu, which is excellent. You can look at the BitC lexer as an example. See SexprLexer.cxx. Don't despair if S-expressions aren't your thing. That lexer has been used in minor variations (keyword updates, revisions to syntactic tokens and revisions to floating point lexing) in something like 8 speciality languages now. If you are writing in something other than C/C++, you'll need to haul down the Unicode character map data and spin a character map for use by your lexer.
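For the "spin a character map" step, one possible shape is sketched below in Python. It assumes the data comes in the semicolon-separated layout of the Unicode Character Database file UnicodeData.txt (code point in field 0, name in field 1, general category in field 2, with large ranges given as a `First>`/`Last>` pair of lines). The few sample lines are for illustration only, and `build_table` and `category` are hypothetical names.

```python
import bisect

# A handful of lines in UnicodeData.txt layout (illustrative sample, not the full file).
SAMPLE = """\
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
10400;DESERET CAPITAL LETTER LONG I;Lu;0;L;;;;;N;;;;10428;
20000;<CJK Ideograph Extension B, First>;Lo;0;L;;;;;N;;;;;
2A6D6;<CJK Ideograph Extension B, Last>;Lo;0;L;;;;;N;;;;;
"""

def build_table(lines):
    """Build a sorted list of (first, last, category) ranges."""
    ranges, first = [], None
    for line in lines:
        if not line.strip():
            continue
        fields = line.split(";")
        cp, name, cat = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            first = cp                      # remember where the range starts
        elif name.endswith(", Last>"):
            ranges.append((first, cp, cat))
            first = None
        else:
            ranges.append((cp, cp, cat))
    ranges.sort()
    return ranges

def category(table, cp: int) -> str:
    """Look up the general category of cp; 'Cn' if it is not in the table."""
    # Find the last range whose start is <= cp.
    i = bisect.bisect_right(table, (cp, 0x110000, "")) - 1
    if i >= 0 and table[i][0] <= cp <= table[i][1]:
        return table[i][2]
    return "Cn"

table = build_table(SAMPLE.splitlines())
print(category(table, 0x10400), category(table, 0x20001))
```

A real table would also merge adjacent code points of the same category into single ranges to keep the lookup structure small.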

Don't imagine that Unicode support will be limited to the lexer/parser. If you take it in as input, then sooner or later you will need to emit it in a diagnostic. If the compiler host system doesn't have Unicode support, it can get rather hairy. More generally, you'll need to look carefully at how best to make most of the string handling in your compiler UTF-8 clean.
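One small example of what "UTF-8 clean" means for diagnostics: a column number computed from a byte offset is off as soon as a multi-byte character appears earlier on the line. A minimal Python sketch, assuming the lexer tracks byte offsets and that the offset falls on a character boundary (`column_of` is a hypothetical helper):

```python
def column_of(line: bytes, byte_offset: int) -> int:
    """Return the 1-based column of byte_offset, counted in code points."""
    # Decoding the prefix counts characters rather than bytes.
    return len(line[:byte_offset].decode("utf-8")) + 1

line = "x\U00010400 = ?".encode("utf-8")
offset = line.index(b"?")
print("byte offset:", offset, "column:", column_of(line, offset))
```

Counting code points is itself a simplification; combining marks and wide characters would need further handling for the column to match what an editor displays.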

Oh. And don't forget to specify a normalization for legal input, so that sorting works right. In practice, use of anything other than Normalization Form C is so rare as to be non-existent. Here is how we stated it in the BitC spec:

Input units of compilation are defined to use the Unicode character set as defined in version 4.1.0 of the Unicode standard [13]. Input units must be encoded using the UTF-8 encoding and Normalization Form C. All keywords and syntactically significant punctuation fall within the 7-bit US-ASCII subset, and the language provides for 7-bit US-ASCII encodable "escapes" that can be used to express the full Unicode character code space in character and string literals.

Tokens are terminated by white space if not otherwise terminated. For purposes of input processing, the characters space (U+0020), tab (U+0009), carriage return (U+000D), and linefeed (U+000A) are considered to be white space.

Input lines are terminated by a linefeed character (U+000A), a carriage return (U+000D) or by the two character sequence consisting of a carriage return followed by a line feed. This is primarily significant for comment processing and diagnostic purposes, as the rest of the language treats linefeeds as white space without further significance.
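The white space and line termination rules quoted above can be sketched in Python as follows. This is an illustration of the rules as stated, not code from the BitC compiler; `split_lines` and `is_white_space` are hypothetical names. Python's built-in `str.splitlines` is deliberately avoided because it also splits on characters the quoted text does not list.

```python
import re

# CR LF must be tried first so the pair counts as a single terminator.
LINE_END = re.compile(r"\r\n|\r|\n")
WHITE_SPACE = frozenset("\u0020\u0009\u000D\u000A")  # space, tab, CR, LF

def split_lines(text: str):
    """Split text at LF, CR, or CR LF line terminators."""
    return LINE_END.split(text)

def is_white_space(ch: str) -> bool:
    """True only for the four white space characters listed in the spec text."""
    return ch in WHITE_SPACE

print(split_lines("a\r\nb\rc\nd"))
print(is_white_space("\t"), is_white_space("\u00A0"))
```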

By shap at Tue, 2008-07-22 02:28

Lambda the Ultimate
