The Grafwegen Hi Compiler named Hic release 0.1

Just a short notice. I started a web-site around my personal pet hobby project, the Hi language, at:

      www.hi-language.org

You can follow my latest musings there and get a prerelease of the slow, not fully complete, bootstrap compiler: the 'Grafwegen Hi Compiler named Hic, version 0.1, Not Even Wrong'.

The bootstrap compiler is released as compiled C sources. The sources of the compiler itself are not available at the moment since I don't know how to license it, and well, it's in an embarrassing state at the moment. Really.


Nice Syntax

Sweet, the syntax does look remarkably clean. I especially like the variable bindings and the lambdas. Although I am curious, do you fully support Erlang style variable bindings? For example:

   if X < 10 -> 
        Y = 10;
      true -> 
        Y = X
   end

No

Hi is a mostly pure ML variant with a `cleaned' simple syntax and a type system inspired by Miranda/Gofer/Haskell/Clean-like languages.

Actually, I can see that would be useful when Y is used after the if block, but no, I'm not sure it's worth implementing.

[The main motivation is to arrive at a simple tool which works well with/from C. I.e., some day bind to Lapack/Blas, unit test C programs functionally, write small games with it, stuff like that.]

Actually, I decided against it; another, equivalent way of stating this in Hi would be:

y = [ x ? x < 10 -> 10
    | x          -> x ] z

Which is, hmm, good enough.
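For readers unfamiliar with the Hi bracket notation: the form above is an anonymous function with guarded alternatives, immediately applied to z. A rough Python analogue (purely illustrative, the names are made up) might be:

```python
# Rough Python analogue of the Hi guarded lambda
#   y = [ x ? x < 10 -> 10 | x -> x ] z
# The bracket form is an anonymous function with guarded
# alternatives; applying it to z binds y to the result.
def guarded(x):
    if x < 10:        # first alternative: x ? x < 10 -> 10
        return 10
    return x          # catch-all alternative: x -> x

z = 42
y = guarded(z)
print(y)  # 42
```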

Erlang syntax

Well, in Erlang you can also write things such as:

   if X < 10 -> 
        Y = 10,
        Z = apple;
      true -> 
        Y = X,
        Z = orange
   end

and use both Y and Z after the if. It's one of the nice things about Erlang syntax, in my opinion. :)

Yeah, looks convenient

But, as it stands now, it'll need to wait until my compiler hits version 1.0, at which time I'll personally make sure to the 'Requests for Enhancements Committee' that it's the enhancement labeled #1 ;)

[ Mind you, you can write

(y,z) = [ x ? x < 10 -> (10, apple)
        | x          -> (x, orange) ] z

at the moment too, so, hmm. ]

I greatly prefer this tuple

I greatly prefer this tuple binding syntax, FWIW

No tuples in the example

Erlang is a single assignment language, and there are no tuples involved with the example. Indeed, variable binding is one of the bright spots of Erlang syntax compared to most functional languages.

I am curious why you "greatly prefer" the Hi syntax though. To me it would depend somewhat on the situation: for simple examples I'd probably prefer the Hi syntax, whereas for more complicated examples the Erlang syntax becomes much more preferable, in my opinion.

Hmm. I just saw a binding

Hmm. I just saw a binding that looked something like good old pattern LHS binding, in this case a tuple pattern, found in several other languages. Expressions such as the following:

  -- nice "destructuring bind" style on a tuple return value
  let (x,y) = (abs (get-x()), abs (get-y())) in 
     -- just some cruft to fill in the example
     make-line ((x,y),(y,x));

Did I miss the status of the "(x,y) = ..." LHS in marco's original example?

S.

Uh?

No, matching the LHS to an RHS has been supported from the start of this language.

No doubt. But is this

No doubt. But is this expression alternative syntax a tuple (e.g. constructor) pattern bind or some other kind of full bore "multiple-value-bind" (to use Lisp'ish terminology, sorry)?

The LHS may contain an

The LHS may contain an arbitrary constructor (or in general, a pattern). I have no idea what a "multiple-value-bind" is, though.

[ It gives some runtime unsafety in case the RHS generates another record of the type matched against, but, I thought one could live with that. Given some quite simple but sufficient optimization, not there yet, it shouldn't be more costly than Erlang's direct style of binding multiple values. Regarding the optimization, you need to be able to lift the match into the guarded expression and partially evaluate that, that's not a big deal. ]

Numeric woes

I am part of the group that's dissatisfied with the numeric part of Haskell's Prelude (see also), and I would like to see special attention to that. I didn't see a link to the actual prelude, but I did notice the following sentence in Examples:

An example from the system prelude, the 'Num' interface.

interface Num = \a => #a where (
def plus: a -> a -> a
def min: a -> a -> a
def mul: a -> a -> a
def div: a -> a -> a
def rem: a -> a -> a
)


Although this does avoid the question of why, for example, one needs a signum function for arbitrary numbers, it takes the (to me) unfortunate stance of requiring Numbers to support division and extraction of remainders. What, for example, should one make of rem Float Float? I suppose that one could declare rem to be always 0 in a field; but then what should one use for rem Matrix2x2 Matrix2x2? (Naturally, I also find div problematic. In short, I'm arguing for Num to be quite similar to what one might call CommutativeRing, with specialisations adapted to more familiar rings left to sub-interfaces.)
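The finer-grained hierarchy being argued for here can be sketched as follows (in Python, purely to illustrate the granularity; the interface names and the Int wrapper are hypothetical, not part of the Hi prelude):

```python
# Sketch: split a monolithic Num interface into algebraically
# honest pieces. A commutative ring has plus/min/mul only;
# div/rem live in a sub-interface that integers satisfy but
# 2x2 matrices and floats need not. All names are hypothetical.
from abc import ABC, abstractmethod

class CommutativeRing(ABC):
    @abstractmethod
    def plus(self, other): ...
    @abstractmethod
    def min(self, other): ...   # subtraction, as named in the Hi prelude
    @abstractmethod
    def mul(self, other): ...

class EuclideanDomain(CommutativeRing):
    # div/rem belong here: Int qualifies, Matrix2x2 would stop
    # at CommutativeRing and never be asked for a remainder.
    @abstractmethod
    def div(self, other): ...
    @abstractmethod
    def rem(self, other): ...

class Int(EuclideanDomain):
    def __init__(self, n): self.n = n
    def plus(self, o): return Int(self.n + o.n)
    def min(self, o):  return Int(self.n - o.n)
    def mul(self, o):  return Int(self.n * o.n)
    def div(self, o):  return Int(self.n // o.n)
    def rem(self, o):  return Int(self.n % o.n)

print(Int(7).rem(Int(3)).n)  # 1
```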

Also note that the function that you call the ‘faculty’ function in your example section is actually the factorial function.

Noted

Fixed factorial. With regard to your other remark:

That touches on extensible interfaces, which the language doesn't support at the moment, and on what they would mean for runtime performance. You actually ask for a lot. ;)

But, it's a good case. I'll give it some thought whether to add extensible interfaces to the language, just to be able to derive interfaces for numbers from their algebraic characterization.

[ Thought about it. It goes in. ]

… and a pony!

Would that all language developers were this responsive. :-) I'm not quite sure why my question requires extensible interfaces—presumably a class can implement more than one interface, so it's just a question of offering sufficiently fine granularity? (I guess verbosity quickly becomes a concern in this case.)

I also wonder whether you have a reason (perhaps just history?) for the choice of having an ordering function take values in the integers, rather than in something like Haskell's {LT, EQ, GT}. Although I understand (I think) why C and C-derived languages go the negative, zero, positive route instead, I've always enjoyed Haskell's design choice, and it seems a natural fit for a Haskell-alike such as you are proposing. (I don't know what ML does here.)

Just C/Speed

Extensible interfaces? Yeah, you guessed right; it's the only manner in which I can see that ordering value types by algebraic structure wouldn't become a mess fast. I am not sure you wouldn't end up 'overspecifying' everything anyway. But, I can't really have a language and not support something which most programmers will think is a 'must-have' in a language with interfaces.

The integer: Historically, C-wise, it's just easiest to take the difference of two numbers as the result of a compare; that's one machine instruction. But, since I want to bind to C a lot, where comparison returns an int, I just don't want the extra hassle of a conversion in between. Though, the Haskell solution might actually be faster in the long run: a match is cheaper than an (integer) comparison to 0.

But, nothing is set in stone yet, and it'll take me a few years before I get everything right, I think.

[ Mind you. Doing stuff as well, or as badly, as Haskell isn't really the issue. Having a language which is easy, and binds to/compiles to/is invokable from C easily, is. But, grmbl, performance... An optimized Ackermann -no method invocation- now runs at a normalized 4.5 secs on this benchmark. ]

Danger, Will Robinson

Historically, C-wise, it's just the easiest to take the difference of two numbers as the result of a compare, that's one machine instruction

This can go badly wrong. If we have n-bit two's complement integers, and the difference between x and y is 2^(n-1) or greater, then x - y will have the wrong sign for your purposes.
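The failure mode is easy to demonstrate by simulating 32-bit two's-complement subtraction (a sketch; Python's own ints are arbitrary precision, so the wraparound is applied by hand):

```python
# Simulate 32-bit two's-complement subtraction to show how the
# "compare by taking the difference" trick misreports the sign
# when the true difference is >= 2^31.
def sub32(x, y):
    d = (x - y) & 0xFFFFFFFF              # wrap to 32 bits
    return d - 0x100000000 if d >= 0x80000000 else d

# Small difference: the sign is correct.
print(sub32(3, 10))                       # -7, correctly negative

# True difference of 4e9 >= 2^31: the sign flips.
x, y = 2_000_000_000, -2_000_000_000
print(sub32(x, y))                        # negative, although x > y
```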

I wonder why there are not

I wonder why there are not more languages which do the 'right thing' and raise an exception in case of integer overflow?

As far as I know, Ada is the only mainstream language which does this.
Perhaps because MIPS is the only CPU I know of where this is nearly free, i.e. which has a trap-on-integer-overflow feature?

Guess it's more living with glitches

If a minor calculation in your OpenGL game goes wrong, maybe a triangle just shoots off to the wrong corner. But since that only happens once every gazillion frames, no one cares, just a glitch.

If your billion dollar rocket falls out of the sky, or you hit the wrong target, that's an entirely different deal.

But what should be the default behaviour?

Sure, in a few cases you prefer to have an incorrect answer instead of an error exception which is more difficult to handle, so the language should support this also, but should it be the *default* behaviour?

In my opinion, NO!

70fps should be the default

Unless you have good hardware support, you're going to pay for it somewhere.

[ Take it with a grain of salt. After bootstrapping my compiler I am now very performance aware. ;) ]

Syntactic encoding

I think renox is referring to syntactic encoding of this decision, and not about performance, which could/should be orthogonal to the encoding decision.

For example, by default classes in C# are public. Since "class foo" means "public class foo", all C# code by default is going to be public. That is the wrong syntactic default. I am unaware of how syntactic defaults could grossly affect performance, unless the syntax requires the compiler to make use of semantic actions and to consult the entire compile history to determine the meaning function of a lexeme.

You could just as easily support syntax literals for both operations and generate code that obeys the semantics for each literal. The question is would it slow down your compiler significantly, e.g. requiring extra backpatching, etc.

As for being "very performance aware", I would recommend coming up with a good benchmark for your compiler that exercises various tests and optimizations you plan to perform, even if you don't perform them. This means each stress test run for a given build of your language will have a baseline and a current performance mark. For example, look at the benchmarks for Sather 1.1 (which are pretty naive, but a start).

For example, by default

For example, by default classes in C# are public.

Actually, such classes are "internal" by default. Maybe you're thinking of Java?

I meant sealed

I always mix up which one I think C# got wrong, with Java. I could've also confused it with the default value for the x:ClassModifier XAML MarkupExtension (public).

Classes should be sealed and internal by default. This allows for a better versioning story, better performance (no virtual method overhead), and better encapsulation (hidden state is not accessible other than through the protocol the class itself defines). This advice is, of course, directly opposed to Cwalina and Abrams' .NET Framework Design Guidelines; see Limiting Extensibility by Sealing Classes. However, Jeffrey Richter agrees with me. Though, I originally formed my opinion years ago while reading papers on component-oriented programming.

The FDG rationale is mostly B.S., and the reasons I provide are much stronger than the weak "you can't customize sealed types, and any default that limits extensibility is very bad" strawman argument Abrams and Cwalina put forth. Moreover, sealed by default better encourages Scenario-Driven Design they espouse, since it requires the framework designers to think through the scenarios for why the class exists. I always feel like strangling Microsoft employees when I read ASP.NET code that violates this principle, but then I realize that their core guidelines encourage them to do such things, and so realize I have nothing to hold them accountable with.

* In Java classes are package-private by default.


I wonder why there are not

I wonder why there are not more languages which do the 'right thing' and raise an exception in case of integer overflow?

As far as I know, Ada is the only mainstream language which does this..

As far as I know, every memory safe language does this for native ints. Only unsafe languages like C/C++ don't bother.

A horse is a horse, of course

From Wikipedia:

[some call] memory-safe language (or just safe language) to describe languages that do not allow undefined operations to occur

I doubt all Pascal dialects go as far as to trigger on overflow. But, I have no idea either.

Having said that, um, it might be a good idea for my language to have two types of preludes, or runtimes. One that checks, and one that doesn't, on basic operations. But that's I guess for the future.

Having said that, um, it

Having said that, um, it might be a good idea for my language to have two types of preludes, or runtimes. One that checks, and one that doesn't, on basic operations.

I think it's better to export checked and unchecked operations, i.e. + is checked, and +! is unchecked. This way you let the developer specify the desired semantics.

Unsigned types generally wrap by default, and signed generally throw an overflow exception.
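The proposed +/+! split can be sketched as follows (Python, fixed at 32 bits purely for illustration; the function names are made up stand-ins for the two operators):

```python
# Sketch of checked vs. unchecked 32-bit signed addition,
# mirroring the proposed + (checked) / +! (unchecked) pair.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def add_checked(x, y):
    # the "+" semantics: raise on overflow
    r = x + y
    if not (INT32_MIN <= r <= INT32_MAX):
        raise OverflowError(f"{x} + {y} overflows int32")
    return r

def add_unchecked(x, y):
    # the "+!" semantics: silently wrap modulo 2^32
    r = (x + y) & 0xFFFFFFFF
    return r - 2**32 if r > INT32_MAX else r

print(add_unchecked(INT32_MAX, 1))   # -2147483648 (silent wrap)
# add_checked(INT32_MAX, 1) would raise OverflowError instead
```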

I think it's better to

I think it's better to export checked and unchecked operations, i.e. + is checked, and +! is unchecked. This way you let the developer specify the desired semantics.

I'd prefer that developers need to explicitly use modulo arithmetic, or otherwise explicitly use a modulo-arithmetic commutative semiring. I don't see much reason for modern languages to explicitly support primitive integral types, except perhaps to assert that certain strict modulo integral types (those modulo 2^N for certain values of N) will be heavily optimized.

Default integer type and arithmetic should be arbitrary precision.

Numerical Exceptions

I guess the lesson from C is that you actually don't want to throw an exception on a numerical calculation in the sense that when, and if, an exception occurs it's already too late. You should have checked the preconditions of the calculation stricter.

That's why I would opt for two semantics. One for testing, or those who want Ada-like security, and one for production ready code.

Integer Thoughts

First thoughts are that in most programs, integers are used as various sorts of indexes, positions (as in files), small values used to implement character routines (toupper, compare, etc.), and so on and so forth. I believe there are plenty of source code analyses that back this up, but my addled brain fully recalls so much less than I read.

Using integers for "computing potentially large results" is an important but secondary and mostly specialized use of integers. Like, um, prime numbers just for example.

Another, often ignored but *very* important, use of integers in "computing potentially large integer values" is fixed point arithmetic, as in financial calculations, RDBMS fixed decimal point column types, etc.

I'd favor 1) an unchecked machine integer type by default that satisfies the vast (unchecked) uses of int values with no extra overhead, and then 2) another arbitrary precision integer type (might combine some kind of tagged fixnum representation auto-promoted to a BigInt) that also supports an optional scale factor for arbitrary precision fixed point arithmetic as well.

Just these two types would cover a lot of bases, problem domain wise.
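The arbitrary-precision-with-scale-factor type described in point 2 is close to what Python's decimal module already provides, which makes for a concrete point of comparison:

```python
# Fixed-point style financial arithmetic on top of arbitrary
# precision decimals: no binary floating-point rounding surprises.
from decimal import Decimal, ROUND_HALF_UP

price = Decimal("19.99")
total = price * 3                          # exact: 59.97
print(total)

# Binary floats get even simple sums subtly wrong:
print(0.1 + 0.2)                           # 0.30000000000000004
print(Decimal("0.1") + Decimal("0.2"))     # 0.3

# Round to cents explicitly, playing the role of a fixed
# scale factor on the underlying big integer.
tax = (total * Decimal("0.07")).quantize(
    Decimal("0.01"), rounding=ROUND_HALF_UP)
print(tax)                                 # 4.20
```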

S.

(My daddy schooled me over the years on mainframe BCD arithmetic and the general importance of accuracy in financial calcs; and watching him sit down and read/debug raw hex core dumps like I would read a newspaper will always make him "officially magical" in my eyes ;-) For ia32 at least, the BCD routines are supposed to be extremely slow, and fixed decimal scaled arithmetic seems a better option for many reasons).

Premature optimization?

I'd call that premature optimization. If I recall correctly, the MLton people said that overflow checks cost just 1-2% performance if done right on contemporary machines. I take the added safety by default any time. Like in SML, you can always have a separate wrap-around type for the few cases where you really need to squeeze out those last percents.

Heck, I view bounds checking

Heck, I view bounds checking (seriously!) as strictly a -debug compile time option :-) However "safety and virtue filled" a language design, I personally do not think languages should be friendly to *any* programmer error in a non-debug production compile of a program.

On integer overflow, I read an interesting little blog post on a purely C-language (GCC) overflow detection scheme that adds "only" something like ~6 ia32 machine instructions to an integer add. But then (_ * 6) instructions for every integer add will drive some folks freakin' crazy :-)

Different language design trade offs certainly serve different constituencies and their problem domains.

Personally, I'm seeking how closely "high level" ML'ish inspired constructs can inter-operate nicely with what some prominent former Lisper (Gabriel? Moon? dunno) termed our current "C-Machines" (as opposed to Lisp machines, in his context).

My goal is in making the "abstraction barrier" between a relatively high level, strict ML'ish language (just to start) and our "C-Machines" (1) as thin as possible, and (2) as simple and comprehensible as possible to the developer using such a "high level" language. And goal #2 means no "sufficiently clever compilers."

But these are just an amateur hobbyist's goals. I may well be spitting into the wind ;-)

Scott

Masochism?

[[ I personally do not think languages should be friendly to *any* programmer error in a non-debug production compile of a program. ]]

And any justification for your (weird IMHO) view?
You think that the current situation is obviously the best possible?!

As for the rest: the 6 instructions for every add was probably because it was done in C, otherwise in assembly language for 'stupid' CPUs, I think that it's only a (normally not-taken) branch and for the MIPS, it's free (when there is no overflow obviously).

To speak plainly, because an

To speak plainly, because an array index bounds error is a failure of algorithmic logic in the application (or sub program). If language designers and/or implementers really want to waste precious time in a Quixotic quest to compensate for stupid programmers, well, then heck - whatever turns them on.

S.

p.s. I also "came up" in the shrink wrap era. A nice little stack trace dump in response to a programmer error, in that context, does ~0% to ameliorate the crazy costs of shipping out a stack of 8 floppy disks to all of ones customers with a bug fix.

As for today's server side, web centric, "duct tape scripting" kids struggling to master the fine arts of database query and HTTP based file transfer (really just glorified boring data processing shops), where perhaps a stack trace dump might do some good - well, while I made some good money in that market, it is, um, nearly totally without any technical interest for me at all.

It's not about punishing bad programmers

Heck, I view bounds checking (seriously!) as strictly a -debug compile time option :-) However "safety and virtue filled" a language design, I personally do not think languages should be friendly to *any* programmer error in a non-debug production compile of a program.

In times where security holes due to buffer overflows cost billions, isn't this a strange thing to say?

Very good point

I'm enough of a dinosaur that these exploits don't come so readily to mind. OTOH, I think we can ask whether checks on buffer reads at the periphery, the "world-touching" parts of a system/application, should necessarily infect every little tight loop within the core of the application.

Might there not be some elegant way to segregate the two different "quality criteria", whether automagically or at the programmer's discretion (the latter being my personal preference)?

In any case, thanks for the great counter example.

S.

p.s. FWIW, just this morning I was making some notes along these lines, segregating "safe" and polymorphic arrays/vectors (my "vectors" are growable and mutable) from a separate set of strictly monomorphic arrays of various numeric types that a developer might use when performance is of the essence (say, a graphics edge enhancement routine). New ideas for me, and still trying to make the distinction elegant and not just "kitchen sink" design. I'll see....

Good Points

But the aim is still something where TCO of the code should be very low.

Wrapping exception support around mathematical operators can be done trivially in the gluing code. But, I'll stick with C semantics for now until the compiler hits version 0.3 or something, and costs of exception support become more apparent.

(It's more where my own personal energy budget goes than anything else.)

Uhm, Java ?

I think that Java is considered a memory safe language, but 'native int' does wrap around silently.

C# can

If you place code inside a checked block or expression in C#, integer overflow will raise an exception.

Very Good

That might save me a long term bug somewhere.

[ Mind you, taking the difference works very well for characters ;) ]

How about the good old x86

How about the good old x86 style 'cmp' instruction, which is also one machine instruction and, I believe, off the cuff, avoids the sub/difference approach?

Sounds like a better explanation

Than the one I gave ;)

By the way

The release holds compiled C sources, the uncompiled runtime, and a few examples. The prelude, in a rather raw form because anything just enough to get to a bootstrap went in, is included in the examples section.

I think I'll throw in all utility libraries in the next release, or release that separately, not sure.

I might release the whole compiler at one point, but that's a trade-off in between how to get it to a publishable state and whether I can attract other developers to work on it. As it is now, I work on it myself.

The speed of the compiler is now, taken from one example: definitely faster than JScript, Perl, Python; on par with or better than trivial Scheme and Lisp dialects; closing in on Lua. It will never reach OCaml or GHC speed due to choices of data representation (everything is boxed).

Looking for Co-Developers?

I did some quantitative analysis of the bootstrap and the resulting binaries. All is not well; I have come to the conclusion that I need to rewrite a number of the internal routines and some of the data representation in order to get it into an acceptable state.

The biggest problems are: A) performance, the compiler allocates about 1GB of space for a self-compile which is mostly character data, and B) internal routines, the compiler is programmed out purely functionally (mostly) due to a lack of imperative constructs. This means it is programmed out as a number of transformations on the AST using functional data structures, which is slow.

Now, I am a bit weary of doing it all by myself. There are only so many years one wants to program on it alone, so I am looking for co-developers.

Having said that, the language as is I think still is great, and I think it could have a great future.

So, anyone on LtU interested? Or, anyone any idea how to get me some friends to work on this project? ;)

[ I am thinking also in using a Wiki, or something similar to co-develop code. Anyone have any experience with that? ]

ML compiler, mutation trick?

the compiler is programmed out purely functionally (mostly) due to a lack of imperative constructs. This means it is programmed out as a number of transformations on the AST using functional data structures, which is slow.

I was going to provide a link to a paper that described a simple addition of mutation to the compilation process that provides a good speedup and memory use reduction. The paper described an ML compiler, but I can't find it at the moment. Anyone recognize this description?

Adding Ref?

I know it would be trivial to add ref records to the compiler by including an "unsafeUpdateFieldInRecord" primitive. Is that what you were after?

Addendum

Actually, I wouldn't mind if I could find people who would just be interested in glancing over (some of) the sources. I would just like to bounce ideas around with people who can tell me where in an algorithm I did something silly.

[ I can promise one thing. You'll find a delightfully simple and orthogonal compiler. ;) ]

Open Source

I am stuck with the thing, so I am open sourcing it within a week or so. For those who are interested in compiler internals, and like to read a clean-room implementation of a language similar to ML/Haskell, you can grab a copy of a print of the source here. The static semantical analysis is a bit rough around the edges and unfinished, rest is pretty much ok.

Where to host?

I am open sourcing the Hi to C compiler code. I use git and tarballs myself to keep track of the different versions. Does anyone here know where to host git based resources?

I am looking at github at the moment, I am wondering where to reach the most people.

Gitorious

gitorious is my online host of choice for git repositories. It's less widely used than github, and less polished, but at least it's open source itself.

Hmm...

Thanks, but I think I want a top tier hosting party since, well, there are a lot of providers so a number of them will die off in the coming years. So, hmm, github, google code with git over svn, or I guess good old sourceforge?

Guess I need requirements like wikis, email lists, and syntax highlighting goodies.

Not Sourceforge.

I hate Sourceforge; the interface has gone downhill, the ads are annoying, and it seems slow. I prefer Github myself, though I haven't used Google Code extensively. There are more than a few projects that use Github for the repo and Google Code to post releases and whatnot.

Implementation Reference

I briefly hacked an implementation reference together which can be found here.

I didn't spell-check it, and it can't be read as an academic paper since it doesn't expose the various transforms or the semantical phase with code or formulae, but one might find it interesting anyway.

[ I guess that wouldn't do so I removed the most obvious spelling errors. ]

[ Gawd, if you're interested in the formal background of the evaluator, please, here. ]

Type Inferencing for the Hi Language

I gave up on the idea that the current Hic compiler will ever become a fast compiler. At the same time, my fingers itch to rewrite the compiler in C, but -alas- that makes little sense if the static semantics of the language are not fixed yet. Moreover, programming in Hi actually is a very nice experience, and I don't really feel like giving that up for C.

So, I decided to throw something unexpected into the ring. An open invitation to help me getting the static semantics right - this primarily concerns the type system.

Is there anyone who feels like working on, or getting right, the semantics of a simple language? I am looking for people to bounce ideas off of...

Note that a prototype HM style inferencer is already implemented; I wanna know how to get to a provably correct one.