Languages and Hardware...

I guess this may be more of a compiler issue than a language design issue, but with the recent excitement over Cell processors maybe people will have some insights.

Basically, I'm interested in processor/hardware support for higher level language concepts. In particular I've been looking for what exactly made Lisp Machines so much better for Lisp than normal hardware (so far I've not really found much), but it's not just Lisp machines I'm interested in. I've been hearing about type-safe assembly languages, and I wonder what it would take to have hardware support to help with things like multithreading (besides the obvious "just use multiple processors" :) ) and garbage collection. I guess Transmeta's processors had some interesting possibilities too, but I've not heard of them being exploited.

I know that some VMs (like Parrot and .NET) help out with these things, and I've heard that the JVM has been implemented in hardware, but has anyone had much experience with hardware support like this? Are these VMs really appropriate for hardware support, or is there yet-another-way which would be better? Partially I just don't know what keywords to look for.

Lisp Machine

is a tagged architecture, so there is hardware support to distinguish what kind of value a cell is holding. I'd very much like this feature in mainstream processors.

Other interesting features are hardware GC support, instruction emulation (also found in today's processors), and microcode (ditto).

Re: Lisp Machine

One big advantage of hardware tag bits is that the effective/usable word size matches conventional machines and languages. One of the most annoying things about using most Lisps and Schemes and MLs is that the integer size is 31 or 30 bits. This makes porting of algorithms and interfacing to libraries much more difficult than necessary. Bignums can be used in some cases, but this comes with more baggage.

Hardware tag bits add complexity and perhaps reduce performance overall (CISC versus RISC argument). Software tagging is known to impose little performance overhead. Simply making a 36 or 72 bit RISC machine would provide many of the advantages of hardware tags.

Write barriers can be incorporated into the paging hardware... nothing special, as long as the OS makes it available to user programs.

31 bit ints

One of the most annoying things about using most Lisps and Schemes and MLs is that the integer size is 31 or 30 bits.

I'm curious as to why they only use 31 bits. All processors I'm familiar with have hardware flags which are set on overflow and carry and work with unsigned and 2's complement numbers, so I wonder what the real reason is.

High bits

The reason for the smaller-sized ints is that it was usually a simple memory-saving trick. The implementations could, say, use an "int" either as a pointer or to store a number in it. They could use the most significant bit as a flag indicating whether it was a pointer or an integer. (This, coincidentally, halved the addressable memory, too, which wasn't really an issue when available memory wasn't anywhere close to the 2^31 bytes = 2 GB barrier.)

You could use further bits to represent more data types. If the next-highest bit was set, it might indicate that the value was an integer (obviously limited now to 30 bits.) If not set, it might indicate that the value represented a character. The next bit down could indicate that it was a boolean.

If the number was bigger than 30 bits, the implementation could convert it to a BigNum-type object. Slower and bigger, of course.

As memory becomes cheaper, this trick can be considered less critical, but it was quite an effective trade-off, at the expense of limiting the range of your (small, fast) numbers and the size of addressable memory. There are more space-efficient ways of doing the same thing, especially for characters, but they're generally a bit more complex to implement.
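
For concreteness, here's a minimal C sketch of the high-bit scheme described above, assuming a 32-bit word; which bit value means "integer" is an arbitrary choice here, and all the names are made up for illustration:

    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical 32-bit tagging scheme: the most significant bit says
     * whether a word holds a pointer or an immediate 31-bit integer. */
    typedef uint32_t value;

    #define TAG_MASK    0x80000000u   /* top bit */
    #define TAG_FIXNUM  0x80000000u   /* top bit set   => integer */
    #define TAG_POINTER 0x00000000u   /* top bit clear => pointer */

    static value make_fixnum(int32_t n) {
        /* only 31 bits of payload fit under the tag */
        assert(n >= -(1 << 30) && n < (1 << 30));
        return TAG_FIXNUM | ((uint32_t)n & ~TAG_MASK);
    }

    static int is_fixnum(value v) { return (v & TAG_MASK) == TAG_FIXNUM; }

    static int32_t fixnum_value(value v) {
        uint32_t payload = v & ~TAG_MASK;  /* 31-bit payload */
        if (payload & 0x40000000u)         /* negative: restore the sign bit */
            payload |= TAG_MASK;
        return (int32_t)payload;           /* two's complement assumed */
    }

Further tag bits for characters, booleans, and so on extend the same idea, at the cost of more payload bits.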

Dynamic Languages

So is that mostly for dynamic languages like Lisp? I would have thought a statically typed language (like ML) would know at compile time where in memory it was placing pointers vs. integers vs. characters, and wouldn't have to rely on run-time tags to determine types.

Maybe it helps the GC to track

Maybe it helps the GC to track references (mostly uneducated guess).

Yes, tags in ML are for GC

Yes, GC is the main reason for tagged integers in ML. Here's what Appel says in his book "Compiling with Continuations":

This distinction is (almost) entirely for the benefit of the garbage collector, which must know which objects are pointers that need to be traversed. [...] It is inconvenient to have these tag bits. They prevent users from calculating with full 32-bit integers, and the tag bits must be stripped off operands and added to results. Can we do without the tags? In fact, for statically typed languages such as ML it is possible to garbage collect without any tags, by giving the garbage collector a "road map" of the compile-time types [*]; but this has its own complications and inconveniences.

The [*] references Appel's paper Runtime Tags Aren't Necessary [which goes into plenty of detail].
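
To give a flavour of what such a "road map" might look like, here is a minimal sketch (not Appel's actual scheme, which also has to reconstruct types at polymorphic call sites; all names here are made up): the compiler emits, for each record type, a table of which fields hold pointers, and the collector traces objects using those tables instead of per-word tags.

    #include <stddef.h>

    /* Per-type "road map": the byte offsets of the pointer fields. */
    struct type_map {
        size_t size;                /* object size in bytes     */
        size_t num_ptr_fields;      /* number of pointer fields */
        const size_t *ptr_offsets;  /* their byte offsets       */
    };

    /* Example: a cons cell whose car and cdr are both pointers. */
    struct cons { void *car; void *cdr; };

    static const size_t cons_ptr_offsets[] = {
        offsetof(struct cons, car),
        offsetof(struct cons, cdr),
    };
    static const struct type_map cons_map =
        { sizeof(struct cons), 2, cons_ptr_offsets };

    /* The tracer needs only the object's address and its map. */
    static void trace_object(void *obj, const struct type_map *map,
                             void (*visit)(void **slot)) {
        for (size_t i = 0; i < map->num_ptr_fields; i++)
            visit((void **)((char *)obj + map->ptr_offsets[i]));
    }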

Hardware support for GC?

Is there a better way for the hardware to provide support for GC other than adding special tag bits? Can the hardware help with the type "road-map"?

Re: Hardware support for GC

Hardware can certainly help. This was touched on above, by "e":
Hardware tag bits add complexity and perhaps reduce performance overall (CISC versus RISC argument). Software tagging is known to impose little performance overhead. Simply making a 36 or 72 bit RISC machine would provide many of the advantages of hardware tags.

E.g. if you add special purpose bits to words at the hardware level, it makes the CPU less general-purpose. BTW, have you seen Steele & Sussman's Design of a LISP-based microprocessor? (ACM link.) IIRC, that has some of these sorts of features.

Of course, you may not care, but if you have any commercial plans for such a CPU, having it be tuned for particular languages may not be very compelling in the market, especially when you need to compete with the big boys making ultra-fast general-purpose chips. It seems as though in hardware these days, special-purpose is reserved for applications where it provides a large advantage, like DSP chips (and even those can be faked with FPGAs), and everything else tends more and more towards exploiting very general purpose programmable hardware.

Thanks

for the link. I think that's the explanation I've been looking for.

My plans are mostly for personal benefit. I don't expect my design will be ultra-fast, although it should be general purpose. As to general purpose stuff, I consider hardware support for GC to be in the same boat as hardware for floating point math: maybe it's less general purpose, but why on earth would you not want it?

I actually want hardware support for multithreading as well. In fact, the whole thing is going to be done on FPGA's, so I'm glad you mention it.

My plans are to write about the experience of trying to build a complete computer system from the ground up. I can think of worse things to blog about :). If it works out I may try to turn it into a book (more for the glory than the money...).

Erlang processor

It seems as though in hardware these days, special-purpose is reserved for applications where it provides a large advantage, like DSP chips (and even those can be faked with FPGAs), and everything else tends more and more towards exploiting very general purpose programmable hardware.

Ericsson was getting something like a 30x speedup with their FPGA-based Erlang CPU. That is, a sub-20MHz FPGA-based processor was 30x faster than the latest Erlang emulator running on a 500MHz UltraSparc (IIRC). The big win was that the FPGA only used a couple of watts, whereas the traditional CPU ran quite hot.

Per clock cycle, not per wall clock

That is, a sub-20MHz FPGA-based processor was 30x faster than the latest Erlang emulator running on a 500MHz UltraSparc (IIRC).

I believe (changed the link from file:C:... to http://..., thanks Anton!) the speed-up was measured per clock cycle, which means that the 20MHz FPGA and the 500MHz emulator ran at approximately the same absolute speed.

Link

I'm sure I would enjoy browsing around your local hard disk if I could, but... ;)

Was that link intended to be to this presentation?

Can I Browse Your Collection of Research Papers?

I've had this same thought so many times that I finally started working on a solution. Fermat's Last Margin is a combination of ImageMagick, a local filesystem wiki and the darcs source control system.

You create a wiki page named something like ErlangProcessor and write a macro call like ResearchPaper(url=http://www.erlang.se/euc/00/processor.ppt) into the page. The macro downloads the document, and feeds it into ImageMagick, which creates per-page images that show up in their own /PageNNN subpage along with a text box next to each image.

Only the wikipages are saved into darcs, so redistribution of copyrighted material is not an issue. When someone else grabs a copy of your read-only darcs-wiki repository, they can click on the macros to download the documents and generate page images for themselves.

Darcs handles shared development across multiple read-only repositories nicely. That means you don't need a central server, and you don't need to worry about spam. You just update your repository from the people you know, and they update from the people they know...

This idea started when several people, starting with Andrew Bromage, dropped by the #haskell channel and had the same objections to the Haskell CoMonad IO paper. I realized I wanted to read the notes in the margins of their printed research papers. The idea behind the name is that this is the last margin Fermat would have ever needed.

This is nearly ready for an alpha release, but I keep getting distracted...


--Shae Erisson - ScannedInAvian.com

Looks interesting

...but I have some problems with reading papers as raster graphics - is it feasible to present them in some vector format, say SVG? Also, commenting on physical units (the original paper's pages) is not always convenient; sometimes I want to comment on a figure, a formula, a section, or a chapter... I'm not sure whether this partitioning should be automated or manual, so that just the interesting parts are promoted to a separate subunit. Anyway, this would require very good OCR (unless you happen to get the source of the paper)...

But keep going, the idea is great!

I'm sure I would enjoy browsing

I'm sure I would enjoy browsing around your local hard disk if I could, but... ;)

You would not find much material there that isn't accessible on the Internet, except tons of Java sources :)

Was that link intended to be to this presentation?

Exactly, thanks!

... or low bits

The low bit(s) can also be used, because pointers are typically aligned to a power of two (greater than one). This leaves the sign bit alone, which makes for easier integer arithmetic and comparison.

Hardware JVM

The picoJava II implements almost the entire JVM instruction set in hardware. It does so with a microcode memory that says how to break up the JVM instructions into micro operations, much like a conventional CISC design such as the Pentium. While the JVM is a stack machine, the micro architecture of the picoJava II is not.

In CISC designs, of course, each ISA instruction takes several micro operations, so in theory they take several clock cycles. But this also opens the possibility for some optimisation, like folding of ISA instructions and a longer pipeline.

With custom chips becoming fast and cheap to produce, experimenting with high-level language hardware interpreters might become fashionable. One could produce direct hardware support for almost every task (like the workings of your cellphone) just as easily as software support.

Although this might be a lot faster, you lose flexibility. I think I'd rather have my abstraction levels implemented in software, probably for the same reasons a lot of people feel that RISC is superior to CISC.

Why? Why not?

I'm interested in this for a few reasons. The first reason is that it's been a dream of mine for many years to make my own computer (from the ground up), and so I want to do something interesting with the processor.

The second reason is that my experience with VHDL seems to suggest that there is a lot of low-hanging fruit for language designers to make a better Hardware Description Language (and what better way to test out said HDL than with a big project... like a processor :) ).

And I guess if nothing else, my interest in obscure non-mainstream microprocessor architectures is not so different from my interest in obscure non-mainstream programming languages (which I'm sure no one around here can sympathize with...).

Natural progression

Is this the natural progression of a programmer?

  1. Write your own applications
  2. Write your own language
  3. Write your own operating system
  4. Design your own computer architecture.

I'm currently transitioning to phase 2 on my way to complete control over my environment... :-)

You forgot

5. Sublimate to become a celestial being of pure energy

and possibly

6. Profit???

Another point

4.5. Write your own numerical library.

I think it was Knuth that said something to the effect that you're not a real software engineer until you've developed your own numerical library. He said it was a sort of coming-of-age activity.

Almost all numerical libraries out there have some very serious limitations, and there's always the desire to make one that works right.

As you write a numerical library, you find that modern processors could have much better support for common operations that you need to perform on arbitrary-precision numbers. With a handful of well-chosen instructions, processors could perform mathematics on large numbers (or arbitrary-precision floating-point numbers) an order of magnitude faster. Say, operations to perform the bit-twiddling of FFT multiplication in hardware, or the like.
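
As a concrete (and hypothetical) example of the kind of inner loop where such instructions help: schoolbook addition of two arbitrary-precision numbers stored as arrays of 32-bit limbs. A hardware add-with-carry does each step in one instruction; in portable C the carry has to be recomputed.

    #include <stddef.h>
    #include <stdint.h>

    /* Add two n-limb numbers (least significant limb first) into r.
     * Returns the final carry out of the top limb. */
    static uint32_t bignum_add(uint32_t *r, const uint32_t *a,
                               const uint32_t *b, size_t n) {
        uint32_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t sum = (uint64_t)a[i] + b[i] + carry;
            r[i]  = (uint32_t)sum;         /* low 32 bits           */
            carry = (uint32_t)(sum >> 32); /* high bit is the carry */
        }
        return carry;
    }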

There's also a huge amount of improvements that could be made in processors to allow them to change their circuitry on-the-fly.

These processors would essentially be similar to large field-programmable gate arrays which can perform more complicated operations at full hardware speed. Part of the "software" fed to it would be a new hardware description used to perform certain costly operations. This, of course, takes clever compilers, and clever programmers.

For many cases, a cleverer algorithm in pure software can far outweigh the benefit of a hardware implementation, so the need for this may still be a ways down the road. Very interesting stuff, though.

Oh, and:

4.6. Write your own learning algorithms, whether it be neural networks, genetic programming, rule-based, or whatever.

4.6.1 Develop mad hubris that your learning algorithm will learn everything, until you realize how limited it really is and how hard it is just to find effective mutation rates, or momentum terms, or the like.

An FPGA in the CPU

Part of the "software" fed to it would be a new hardware description used to perform certain costly operations. This, of course, takes clever compilers, and clever programmers.

Oooh...think multitasking. How fast can FPGAs be reprogrammed? Context switches could get really expensive, if the OS gives user-space access to this capability.

Context Switch

How fast can FPGAs be reprogrammed?

That was my question, too. The main article that I saw on this was in Scientific American about 4 or 5 years ago, and, if memory serves me correctly, the authors claimed that the reprogramming could be done very quickly with the class of electrically-controlled gates they were using. (They weren't, say, burning fusible links like you tend to do with FPGAs.) It was far less than I would have expected.

Still, I think that it would be rather costly, too, but primary processing could still go on with a normal processor. Or part of the FPGA could get allocated to different tasks by priority... it would certainly force changes to the OS and compiler.

Thou shalt not burn

They weren't, say, burning fusible links like you tend to do with FPGAs.
Disclaimer: I am not an EE.

AFAIK, the configuration of an FPGA (unlike that of a smaller CPLD) is usually stored in SRAM, from which follow two important properties:

  1. It is lost on power-off
  2. It is relatively fast to change
I am not sure whether COTS FPGAs can literally reconfigure themselves, or whether they must go off-chip for that, but even the latter option is on the order of tens of megabytes per second. Of course, this is not cheap enough to do on every context switch, but it sounds like a viable option for JITting hot spots. Hm, but locating these hot spots will require some degree of reflection... Cell matrices may be better suited for that...

To make this a bit more PLT-related: are there any good calculi for representing FPGAs? Can a denotational semantics of, say, Scheme be used to automatically configure a chip? If not, what is missing?

Burning depends...

Disclaimer, I'm not a working EE :)

You've got it basically right. FPGAs are made from SRAM, so they can be programmed rather fast, but most of them are programmed through a JTAG port, which loads them serially, so speed depends on the part. CPLDs are sometimes called FPGAs (by people lumping it all together; they can do the same things, but very different design philosophies created CPLDs), but CPLDs don't always involve burning, since many use EEPROMs.

Actually, not only can FPGAs be reconfigured fast, many environments (like Altera's) will let you convert an array of logic elements into a block of RAM. You could build a circuit with two FPGAs, where each FPGA is wired to the reconfiguration pins of the other, and they could mutually reconfigure themselves. Of course, there are lots of issues about the state of each device and timing hazards that would get involved (which is something a semantics would have to deal with, perhaps the hardest part actually).

I'm still learning about semantics and calculi, so I don't have an answer for you there, but I can relate what little I know.

A logic element on an FPGA can generate any 4-input logic function (it uses a 16-bit RAM as a lookup table), and in addition can be made into a simple state machine by a flip-flop tied to its output.
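
In other words, the 16-bit RAM is just a truth table indexed by the four inputs; a rough C model of a single logic element's combinational part (names made up for illustration) would be:

    #include <stdint.h>

    /* Bit i of `config` is the output for input combination i. */
    static int lut4(uint16_t config, int a, int b, int c, int d) {
        int index = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2) | ((d & 1) << 3);
        return (config >> index) & 1;
    }

    /* Example: a 4-input AND gate outputs 1 only for index 15,
     * so its configuration sets only bit 15: config = 0x8000. */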

These logic elements are tied together by a regular grid, and so the routers at the connection points have to be configured as well.

There are several interesting problems to be solved in configuring an FPGA: minimizing the needed logic functions, to minimize the logic elements needed, and using your interconnects most efficiently, since it's not a complete graph connecting the logic elements. On top of that, there are timing and fanout issues to deal with as well (tying a single logic element's output to many elements far away on the chip will take longer to propagate). These are similar to problems you would have if you were designing a digital device; the FPGAs solve a lot of them, but they still exist.

Now, typically you can do some minor tweaks to your "program" to get it to fit on the FPGA, but it is an interesting problem.

As to the languages, VHDL is what is typically used, and it's based on Ada, but VHDL has a lot of rough edges, and what better features it does have sometimes run into problems because not all of VHDL is synthesizable (i.e. the compiler can turn it into a configuration for an FPGA). This is partially because VHDL is not meant just for FPGAs; it is also used as the first stage of design in producing ASICs.

I haven't investigated some of the other languages mentioned in other parts of this discussion yet, but I know that at least some HDLs are just meant for modelling and can't actually configure an FPGA. Confluence actually compiles to synthesizable VHDL, so it's pretty handy, and it seems to be heavily influenced by modern language design principles. Some people have mentioned HDLs based on Haskell; I don't know if those can configure an FPGA, but I would imagine they have some good formal basis, considering that they used Haskell.

Of course, part of the issue with an FPGA is that all the logic elements operate in parallel, and so you have to be aware of that. They mostly work as synchronous parts, but various levels of asynchronicity are becoming important, because the clocking circuitry needed to keep the clocks synchronized is the stumbling block to faster processors right now. Of course, figuring out how to design totally asynchronous digital logic with memory is currently an "interesting problem".

Most EEs (me included) were taught to do synchronous designs using finite state machines to deal with memory. It's a decent method for designing stuff; of course, writing a finite state machine in VHDL is very long and tedious and looks nothing like the state machine it's supposed to be implementing (which is why I wrote my first VHDL code generator, to take a transition list for a state machine and create the VHDL for it).
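
For what it's worth, here's a rough sketch (in C rather than VHDL, with made-up names) of the kind of transition list such a generator might consume; the generator's job is then just to expand the table into the corresponding process and case statement.

    #include <stddef.h>

    enum state { IDLE, RUNNING, DONE };

    struct transition {
        enum state from;
        int input;        /* input condition */
        enum state to;
    };

    /* The compact description of the machine. */
    static const struct transition table[] = {
        { IDLE,    1, RUNNING },
        { RUNNING, 0, RUNNING },
        { RUNNING, 1, DONE    },
        { DONE,    0, IDLE    },
    };

    /* Reference behaviour: one synchronous step. */
    static enum state step(enum state s, int input) {
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (table[i].from == s && table[i].input == input)
                return table[i].to;
        return s;  /* no matching transition: hold state */
    }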

Well, I didn't answer your question and I've shown my ignorance, so my work here is done.

That was a pleasure!

Well, I didn't answer your question and I've shown my ignorance, so my work here is done.
I would say you made me curious about HDLs and FPGAs. My current spare time hobby is learning about concurrency, and in my childhood I played a bit with a soldering iron, so why not combine these two interests :-)

Thanks for the topic!

Not really...

Only if you're arrogant. Way back before college, I actually thought computer science was easy, so I wanted something hard, and I did Electrical Engineering because I figured designing the computer would be more interesting.

Many years and beatings later, I discovered I didn't know quite what I thought I did, and that I liked designing hardware too, but I really did want to program, so I started working on a Masters in Mathematics...

Really, I'm not sure what I'm on :), but the logic behind all these decisions should not be a natural progression for anyone. As a side note, part of this question is because, as part of one of my side projects, I'm planning on doing "5. All of the above." with the kicker of writing the Hardware Description Language for doing #4.

So, um, stay at #2, it's hard enough to get that part right :).

Me too moment

I followed pretty much the same logic(!): before college I had done quite a bit of assembly and C programming and I was feeling confident that my time would be best spent if I were to pursue electrical engineering instead of computer science.

Then I went further and got an MS in Computer Engineering. Years later I wish I had studied more maths and computer science...

By this reasoning, I predict that a few years from now, I will wish that I studied more databases, analog circuits or even medical science...

It only gets worse from here...

At this point I have problems distinguishing computer science as a distinct field, more just a kind of blurry spot between Electrical Engineering and Mathematics. Not applied mathematics mind you, but the pure, crazy formal, mathematical logic sort of stuff :).

Although, now that you mention the analog circuits, you might want to check out the free book Designing Analog Chips by Hans Camenzind, the guy who designed the 555 timer.

Hardware Design Language

Matt Estes: The second reason is that my experience with VHDL seems to suggest that there is a lot of low-hanging fruit for language designers to make a better Hardware Description Language (and what better way to test out said HDL than with a big project... like a processor :) ).

You may wish to take a look at Confluence.

Other HDLs

You might also want to check out other hardware description languages like LavaHDL and RubyHDL.

More, More!

At Chalmers they have their own Lava, which is developed in parallel with Xilinx's. They also have a next-generation Lava called Wired which can express really funky stuff like non-functional properties.

The more the merrier :)

I've not messed with Confluence yet, but they certainly seemed to be going with the right ideas. They also seemed to have a similar reaction to VHDL :). I didn't know about RubyHDL. I had seen Lava before, but the download link wasn't working at the time, but you have a different link than the one I was using :).

One of the only things I enjoyed about VHDL (using the Altera+Max version) was the ability to graphically build components via schematic entry, and then reuse those components in VHDL components, which are then reused in a schematic. Doing the top level as a block diagram schematic was always appealing, since it was both the executable code and a very revealing description.

I was also interested in playing with optimization and verification kinds of problems in HDL as well as producing synthesizable output(usually on an FPGA).

Anyway, thanks for the links.

Lava doesn't seem to be available

I was interested in looking at Lava but I don't think the download link has ever worked. http://www.archive.org doesn't have anything from the past which might be usable either.

We have however recently started experimenting with Confluence and it looks very promising.

Occam

Google for the Occam programming language (I don't have the link right now). It was designed to support massively concurrent programs and could run on dedicated processors.

Occam

It was the language for Transputers, wasn't it?

Yes

http://vl.fmnet.info/occam/

Introduction to Occam

The online book Introduction to the Programming Language Occam mentioned in a previous LtU thread might be a better place to start.

Hydra

There is also Hydra, an HDL in Haskell. It was quite interesting, but sadly under-publicised.

Stack Computers

You may be interested in stack computers which should support languages similar to Lisp and Prolog very well, not to forget Forth and other concatenative languages. This probably includes languages like Oz and functional languages in general.

Have a look at Philip Koopman's "Stack Computers: The new wave" which is still relevant:

http://www.ece.cmu.edu/~koopman/stack_computers/index.html

Especially take a look at section 7.2: Language Choice

http://www.ece.cmu.edu/~koopman/stack_computers/sec7_2.html

It would be cool to be able to buy a small, simple, yet powerful computer that boots in less than one second in Lisp, ColorForth, Erlang-something or SqueakOS... You name it!

He!

Re: Stack Computers

They seem not to have succeeded. I'm not sure how alive they are in the embedded market, where they were claimed to have more significant advantages. The main problem for the general-purpose CPU market may well be summed up in the name of a project implementing Forth on stock CPUs: RAFTS (RISCs are faster than stack machines).

However, if the claims about low energy consumption and transistor counts still hold today, then those benefits may thoroughly outweigh speed issues when applied to Amorphous Computing.

Stacks and other alternative models.

The stack computer stuff can still be very relevant today. My perspective on it (I've skimmed the Koopman book) is that its number one advantage (if done right) is that it is a way more structured approach to handling memory than is typical. Also, if it were done right, multiple stacks could be allocated in a virtual memory style that would keep them from crashing into each other, and these are all interesting ideas to think about.

Implementation-wise it's nice too (even if you have a few registers): the chips are easy to design, and the compilers are easy to write, especially if your design has multiple stacks.

In the end, even if it's not useful, it can still serve as a reminder that the current "standard" architectural decisions for CPUs are by no means the only ones (although I would suspect many EEs are even _less_ exposed to a variety of languages, and especially machine architectures, than CS people are, and so would be less likely to consider doing something else with a new CPU design. Heck, my senior design class forced us to do a standard RISC machine, discussed no other options, and the teacher discouraged any experimentation with architecture at all).

Do object-oriented languages need special hardware support?

There's an excellent paper by Hölzle & Ungar, available at citeseer and at UCSB, in which the authors go into a detailed comparison of the code generated by a C compiler and that generated by their Self system. Quoting from the abstract:

Our measurements show that SELF programs are more similar to C programs than are C++ programs, even though SELF is much more radically object-oriented than C++ and thus should differ much more from C.



Furthermore, the benefit of tagged arithmetic instructions in the SPARC architecture (originally motivated by Smalltalk and Lisp implementations) appears to be small. Also, special hardware could hardly reduce message dispatch overhead since dispatch sequences are already very short. Two generic hardware features, instruction cache size and data cache write policy, have a much greater impact on performance.

SPARC Tagged Arithmetic Support

SPARC processors have support for tagged arithmetic operations. This was added to help support Lisp- and Smalltalk-like systems.

From:

http://compilers.iecc.com/comparch/article/91-04-082

> Does anyone know what TAGGED DATA instructions are useful for and how to
> use them? Tagged data is assumed to be 30 bits wide followed by two bits
> set to zero. The SPARC allows add and subtract instructions on tagged data.
>
> [Most likely it's for immediate integers in a Lisp-like system that uses
> tagged pointers, but I hope someone who actually knows will tell us. -John]

Yes, the tagged arithmetic instructions were put in the SPARC architecture
for Lucid Common Lisp. If the low-order two bits of a Lisp object
reference are zero, it is a 30-bit immediate fixnum. If some of those
bits are non-zero, it may be a pointer to a floating point number or a
bignum (arbitrary-precision integer). Generic arithmetic is generally
optimized for the fixnum case, since the overwhelming majority of
arithmetic is performed on small integers. On many machines + is compiled
inline as

Test low order two bits of first operand.
If nonzero, use general case. (Operand could be a float or bignum.)
Test low order two bits of second operand.
If nonzero, use general case. (Operand could be a float or bignum.)
Add two operands.
If overflow, use general case. (Result is a bignum).

On the SPARC this is done as one instruction (TADDCC) followed by a
conditional branch rarely taken.
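
For reference, the six-step software sequence above looks roughly like this in C (a sketch only: __builtin_add_overflow is a GCC/Clang extension, and generic_add stands in for the float/bignum slow path). TADDCC collapses the tag checks and the add into one instruction, leaving just the rarely taken branch.

    #include <stdint.h>
    #include <stdlib.h>

    /* A fixnum n is stored as n << 2, so its low two bits are zero. */
    typedef int32_t lispval;

    /* Slow path: floats, bignums, ... (stubbed out in this sketch). */
    static lispval generic_add(lispval a, lispval b) {
        (void)a; (void)b;
        abort();
    }

    static lispval lisp_add(lispval a, lispval b) {
        lispval sum;
        if ((a & 3) != 0) return generic_add(a, b);  /* a is not a fixnum */
        if ((b & 3) != 0) return generic_add(a, b);  /* b is not a fixnum */
        if (__builtin_add_overflow(a, b, &sum))      /* result won't fit  */
            return generic_add(a, b);                /* -> bignum         */
        return sum;  /* (m << 2) + (n << 2) == (m + n) << 2: still tagged */
    }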