BitC is back

As I get ready to leave Microsoft, I'm once again thinking about BitC. I
want to get the implementation and the language definition to a point of
usability, and this seems like a good time to examine some of the things
that I think, in hindsight, were mistakes or might warrant re-examination.
Most of these issues are mundane practical things. A few of them are deeper
design choices/issues.

I didn't really look at BitC until the project went dormant. Now that I have, I think it's great. Every language should have a BitC-like implementation layer.

Looks like front-page

Looks like front-page material to me...

By Leon P Smith at Thu, 2010-03-18 13:57

Unfortunately, this may be slightly premature.

In the worst case, resumption of active work on BitC may be delayed until July 16. I think that this delay is unlikely, but it's hard to know for sure.

[Edit] It's now confirmed that I really am clear of Microsoft at close of business today, so my caution was unwarranted.

By shap at Fri, 2010-03-19 08:28

Unicode in BitC

Looks like discussions about BitC 0.20 are already in full swing ;)

I read with interest the one about Unicode and characters. I especially like Kevin Reid's contribution:

It is possible-but-weird to handle such things as uneven widths and
invalid substring indexes by defining the high-level interfaces such
that *numeric* indexes are never seen by most programmers; see Taylor
Campbell's Scheme work on this idea. It seems reasonable to me, but I
haven't actually done any work within the system.

The starting premise as I recall it is essentially that even if we
always work in 32-bit units, that isn't what user-programmers actually
want -- consider combining characters. Rather, the primitives should
be iterating over strings in selectable units (grapheme cluster,
scalar value, utf-N code point, whatever) and parsing.

In my opinion (and this is my plan for dodo), a string may use different encodings. If you look at the string as an array, it is an array of code points; a code point may happen to represent a character (e.g., in ASCII), but it can also represent part of a character (modifiers, UTF-8, UTF-16 surrogates) or several characters (ligatures).

Then, for me a single character is in fact a Unicode string. Hopefully that should allow dodo to handle most scripts in the world. There are functions specialised for each string encoding to extract characters from a string and work with them. If the programmer is not satisfied with this, the string can always be seen as an array of code points.
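To make the two views concrete, here is a minimal Python sketch (not dodo code). The function names `code_points` and `characters` are made up for illustration, and the grouping rule is a deliberate simplification: a "character" is taken to be a base code point plus any combining marks that follow it, which is much less than full Unicode grapheme segmentation.

```python
import unicodedata


def code_points(s):
    """View the string as a flat array of integer code points."""
    return [ord(ch) for ch in s]


def characters(s):
    """Split s into 'characters', each of which is itself a short string.

    Simplifying assumption: a character is a base code point followed by
    zero or more combining marks. Real grapheme clusters have more rules.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new character
    return clusters


text = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

print(len(code_points(text)))   # 5 code points
print(len(characters(text)))    # 4 characters; the last one is 'e' + U+0301
```

The last "character" is a two-code-point string, which is the sense in which a single character can be treated as a Unicode string.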

By Denis Bredelet -jido at Fri, 2010-03-19 23:16

I think that is a good idea,

I think that is a good idea, but I think the interface should be independent of the string's encoding. What probably makes sense is that a string is a list of characters, and a character is a list of codepoints. A codepoint can be encoded in any number of ways: UTF-8, UTF-16, UTF-32. This probably requires a lot of implementation trickery to be efficient, and certainly some interfaces will need rethinking. Finding the nth character is likely to be linear time for UTF-8, so programmers will need to rely on the library to do things instead of coding them themselves. Although that is probably close to what you are saying.
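As a rough illustration of the linear-time point, the Python sketch below finds where the nth codepoint starts in a UTF-8 byte string. The helper name `utf8_offset_of_nth` is hypothetical, and the code assumes the input is already valid UTF-8. Because each codepoint takes one to four bytes, there is no way to jump straight to the answer; the bytes have to be walked from the start.

```python
def utf8_offset_of_nth(data: bytes, n: int) -> int:
    """Return the byte offset at which the n-th codepoint (0-based) starts.

    Assumes data is valid UTF-8. Continuation bytes look like 10xxxxxx,
    so any byte that does not match that pattern starts a new codepoint.
    """
    seen = 0
    for offset, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:   # lead byte or plain ASCII
            if seen == n:
                return offset
            seen += 1
    raise IndexError("fewer than n + 1 codepoints in data")


data = "na\u00efve caf\u00e9".encode("utf-8")   # 10 codepoints, 12 bytes

print(utf8_offset_of_nth(data, 3))   # 4, because U+00EF took two bytes
print(utf8_offset_of_nth(data, 9))   # 10
```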

By Watson Ladd at Sat, 2010-03-20 20:32

Divergence

Your suggestion has its merits (convenience), but I would argue that array random access is not the right abstraction if "Finding the nth character is likely to be linear" depending on encoding. Imagine what would happen to a good text search algorithm when it is fed with a UTF-8 string.

With additional metadata and buffering you could get it to work, but the "array of codes" version will always beat it in performance.

I think accessing the nth character should use an iterator abstraction, with convenience functions that let you write code without resorting to loops all the time.
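One possible shape for this, sketched in Python rather than in dodo or BitC: a lazy iterator over the codes of a UTF-8 string, plus a couple of convenience functions built on top of it so that callers rarely write the loop themselves. The names `scalars`, `nth` and `count_scalars` are invented for this example, and the decoder assumes valid UTF-8 and does no error checking.

```python
from itertools import islice


def scalars(data: bytes):
    """Lazily yield code points (as ints) decoded from UTF-8 bytes.

    Assumes valid UTF-8; malformed input is not detected.
    """
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            length, value = 1, lead            # ASCII
        elif lead < 0xE0:
            length, value = 2, lead & 0x1F     # 110xxxxx
        elif lead < 0xF0:
            length, value = 3, lead & 0x0F     # 1110xxxx
        else:
            length, value = 4, lead & 0x07     # 11110xxx
        for cont in data[i + 1:i + length]:
            value = (value << 6) | (cont & 0x3F)
        yield value
        i += length


def nth(iterable, n, default=None):
    """Return the n-th item (0-based) of an iterator, or default."""
    return next(islice(iterable, n, None), default)


def count_scalars(data: bytes) -> int:
    """Count code points without building a list."""
    return sum(1 for _ in scalars(data))


data = "a\u00e9\u20ac\U0001F600".encode("utf-8")

print([hex(cp) for cp in scalars(data)])   # ['0x61', '0xe9', '0x20ac', '0x1f600']
print(hex(nth(scalars(data), 2)))          # 0x20ac
```

The cost of `nth` is still linear, but it is now visible in the interface: the caller asks an iterator to advance rather than indexing something that looks like an array.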

By Denis Bredelet -jido at Tue, 2010-03-23 00:25

Small nit

In Unicode terms, code points are encoded as code units. So if the string indexing operation can return part of a code point, what it returns is a code unit.
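A small Python sketch of the distinction, with `code_units` as a made-up helper name: the same single code point becomes a different number of code units depending on the encoding form.

```python
def code_units(s: str, encoding: str):
    """Return the code units of s in the given encoding form, as ints.

    Only the three encoding names in the table below are supported.
    The "-le" codecs are used so that no byte order mark is added.
    """
    width = {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}[encoding]
    raw = s.encode(encoding)
    return [int.from_bytes(raw[i:i + width], "little")
            for i in range(0, len(raw), width)]


s = "\U0001F600"   # one code point, U+1F600

print([hex(u) for u in code_units(s, "utf-8")])      # ['0xf0', '0x9f', '0x98', '0x80']
print([hex(u) for u in code_units(s, "utf-16-le")])  # ['0xd83d', '0xde00']
print([hex(u) for u in code_units(s, "utf-32-le")])  # ['0x1f600']
```

Indexing into the UTF-16 form at position 1 would return 0xDE00, which is a code unit (a low surrogate) and not a code point on its own.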

By shap at Mon, 2010-03-22 04:08

My bad

I should have stuck with "codes".

By Denis Bredelet -jido at Tue, 2010-03-23 00:13

Lambda the Ultimate
