BitC is back

As I get ready to leave Microsoft, I'm once again thinking about BitC. I
want to get the implementation and the language definition to a point of
usability, and this seems like a good time to examine some of the things
that I think, in hindsight, were mistakes or might warrant re-examination.
Most of these issues are mundane practical things. A few of them are deeper
design choices/issues.

I didn't really look at BitC until the project went dormant. Now that I have, I think it's great. Every language should have a BitC-like implementation layer.

Looks like front-page

Looks like front-page material to me...

Unfortunately, this may be slightly premature.

In the worst case, resumption of active work on BitC may be delayed until July 16. I think this delay is unlikely, but it's hard to know for sure.

[Edit] It's now confirmed that I really am clear of Microsoft at close of business today, so my caution was unwarranted.

Unicode in BitC

Looks like discussions about BitC 0.20 are already in full swing ;)

I read the one about Unicode and characters with interest. I especially like Kevin Reid's contribution:

It is possible-but-weird to handle such things as uneven widths and
invalid substring indexes by defining the high-level interfaces such
that *numeric* indexes are never seen by most programmers; see Taylor
Campbell's Scheme work on this idea. It seems reasonable to me, but I
haven't actually done any work within the system.

The starting premise as I recall it is essentially that even if we
always work in 32-bit units, that isn't what user-programmers actually
want -- consider combining characters. Rather, the primitives should
be iterating over strings in selectable units (grapheme cluster,
scalar value, utf-N code point, whatever) and parsing.
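To make the "selectable units" idea concrete, here is a small sketch (mine, in Python, with hypothetical helper names, not anything from Kevin Reid's actual work): the same string viewed as scalar values, UTF-8 code units, and UTF-16 code units. Grapheme-cluster iteration would need a segmentation library on top of this.

```python
def scalar_values(s):
    """Iterate over Unicode scalar values (one int per code point)."""
    for ch in s:
        yield ord(ch)

def utf8_code_units(s):
    """Iterate over UTF-8 code units (bytes)."""
    for b in s.encode("utf-8"):
        yield b

def utf16_code_units(s):
    """Iterate over UTF-16 code units (16-bit values)."""
    data = s.encode("utf-16-le")
    for i in range(0, len(data), 2):
        yield int.from_bytes(data[i:i + 2], "little")

s = "e\u0301"  # 'e' followed by a combining acute accent
print(list(scalar_values(s)))  # two scalar values, but one user-visible character
```

The point of the premise survives even in this toy form: the three views disagree about the "length" of the same text, so no single numeric index is the obviously right one.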

In my opinion (this is my plan for dodo), a string may use different encodings; if you look at the string as an array, it is an array of code points. A code point may happen to represent a character (e.g. in ASCII), but it can also represent part of a character (modifiers, UTF-8, UTF-16 surrogates) or several characters (ligatures).

Then, for me, a single character is in fact a Unicode string. Hopefully that will allow dodo to handle most of the world's scripts. There are functions specialised for each string encoding to extract characters from a string and work with them. If the programmer is not satisfied with this, the string can always be viewed as an array of code points.
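A rough sketch of "a character is a Unicode string" (my own Python approximation, not dodo's actual implementation): group a base code point with any combining marks that follow it, so each extracted "character" is itself a small string of code points.

```python
import unicodedata

def characters(s):
    """Group a string into 'characters': a base code point plus any
    following combining marks. A crude approximation of grapheme
    clusters; real segmentation (UAX #29) handles many more cases."""
    cluster = ""
    for ch in s:
        if cluster and unicodedata.combining(ch):
            cluster += ch  # attach the modifier to the current character
        else:
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster

# 'e' + combining acute, then 'a': two characters, three code points
print(list(characters("e\u0301a")))
```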

I think that is a good idea,

I think that is a good idea, but the interface should be independent of the string's encoding. What probably makes sense is that a string is a list of characters, and a character is a list of code points. A code point can be encoded any number of ways: UTF-8, UTF-16, UTF-32. This probably requires a lot of implementation trickery to be efficient, and certainly some interfaces will need rethinking. Finding the nth character is likely to be linear time for UTF-8, so programmers will need to rely on the library to do things instead of coding them themselves. Although that is probably close to what you are saying.
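To see why finding the nth character is linear for UTF-8, here is a sketch (mine, in Python; `nth_char_utf8` is a hypothetical name) that scans the raw bytes: every byte except continuation bytes (`0b10xxxxxx`) starts a new scalar value, so locating position n means walking past all the variable-width encodings before it.

```python
def nth_char_utf8(data: bytes, n: int) -> str:
    """Return the nth Unicode scalar value in UTF-8 bytes (0-based).

    A linear scan: count lead bytes; any byte not of the form
    0b10xxxxxx starts a new scalar value."""
    count = -1
    start = None
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:        # lead byte: a new scalar value begins
            count += 1
            if count == n:
                start = i           # remember where character n starts
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is not None:
        return data[start:].decode("utf-8")
    raise IndexError(n)
```

There is no way to jump straight to byte offset of character n without extra metadata, which is exactly the objection raised below about random access.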


Your suggestion has its merits (convenience), but I would argue that array random access is not the right abstraction if finding the nth character is linear time depending on the encoding. Imagine what would happen to a good text search algorithm when it is fed a UTF-8 string.

With additional metadata and buffering you could get it to work, but the "array of codes" version will always beat it in performance.

I think accessing the nth character should use an iterator abstraction, with convenience functions that let you write code without resorting to loops all the time.
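The convenience functions could be as thin as this (a Python sketch of my own; the names are made up): the caller passes any character iterator, and the library hides the loop.

```python
import itertools

def char_at(chars, n):
    """nth element of a character iterator (0-based), in linear time."""
    return next(itertools.islice(chars, n, n + 1))

def take(chars, n):
    """The first n characters, as a list."""
    return list(itertools.islice(chars, n))

# Usage: works over any iterator of characters, whatever the encoding
print(char_at(iter("hello"), 1))
print(take(iter("hello"), 3))
```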

Small nit

In Unicode terms, code points are encoded as code units. So if the string indexing operation can return part of a code point, what it returns is a code unit.
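A quick Python illustration of the nit (my example, not from the thread): a single code point outside the Basic Multilingual Plane is one code point but two UTF-16 code units, so indexing into the UTF-16 form can land inside a surrogate pair.

```python
s = "\U0001F600"                      # one code point outside the BMP
utf16 = s.encode("utf-16-le")
code_units = [int.from_bytes(utf16[i:i + 2], "little")
              for i in range(0, len(utf16), 2)]
print(len(s))                         # 1 code point
print(len(code_units))                # 2 UTF-16 code units (a surrogate pair)
```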

My bad

I should have stuck with "codes"