Hardware Acceleration of Matrix Multiplication on a Xilinx FPGA

via Jeff Newbern in the discussion forum comes the writeup from the winning entry in the MEMOCODE 2007 contest:

This year, the first MEMOCODE hardware/software codesign contest posed the following problem: optimize matrix-matrix multiplication in such a way that it is split between the FPGA and PowerPC on a Xilinx Virtex IIPro 30. In this paper we discuss our solution, which we implemented on a Xilinx XUP development board with 256 MB of DRAM. The design was done by the five authors over a span of approximately 3 weeks, though of the 15 possible man-weeks, about 9 were actually spent working on this problem. All hardware design was done using Bluespec SystemVerilog (BSV), with the exception of an imported Verilog multiplication unit, necessary only due to the limitations of the Xilinx FPGA toolflow optimizations.

Wow! This is much cooler than the kinds of programs I write :-)

This MIT/Bluespec team won the first contest at MEMOCODE 2007 and the second in MEMOCODE 2008. The results for 2008 are very interesting: The Bluespec team had the highest performance by an order of magnitude, the next fastest program was written by one guy in straight C without using the FPGA, and the rest were mostly in a hybrid of C and (V?)HDL.

Does this make Bluespec the programming tool of choice for discriminating hardware hackers?

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

MEMOCODE 2008 Entry

The MIT team has made their code and a paper describing their architecture for the winning entry in the 2008 contest available at opencores.com. You can download the paper here, but it requires you to be registered with OpenCores. The paper will also be published in the MEMOCODE 2008 proceedings.

The most interesting part of their architecture (IMO) is a merge sort pipeline which uses different scheduling parameters at each level of the merge tree to maximize utilization. The area savings allows the tree to be deeper, which reduces the number of passes required over the data to complete the sorting.

OpenCores cheatsheet

You don't need to register to download the code, you can do a checkout from anonymous CVS directly:

cvs -d :pserver:anonymous@cvs.opencores.org:/cvsroot/anonymous login
(just press enter)
cvs -d :pserver:anonymous@cvs.opencores.org:/cvsroot/anonymous -z9 co cryptosorter

Perhaps the authors would post a copy of the paper on the web directly? That would save a lot of LtU readers registering on OpenCores just to casually read a paper.

The authors have made the

The authors have made the paper available here.

Of course

Of course, Bluespec is the programming tool of choice for discriminating hardware hackers. Now if they had just kept the Haskell syntax. :)

Parallel vs concurrent

MEMOCODE seems to have a high-performance computing / parallel algorithms focus. I'm thinking that hardware / FPGAs are at least equally interesting for real-time concurrent systems. For example, I've been hacking on an NXP LPC2129 microprocessor which is a single chip running many hard-realtime applications: ARM CPU core, 4x network controllers, 2x serial port controllers, 2x timers, pulse-width modulation, etc. I now start to see that each of these features is simply a program expressed as a circuit and combined into a chip like so many arguments to ld. Hardware is just software!? (Well, the digital bits.)

I'd be interested to see Bluespec code for e.g. an ethernet controller btw.

This sounds interesting with respect to the Reinventing Programming project. Why not rewrite the hardware too while you've got Emacs open? :-)

In some respects hardware is

In some respects hardware is just like software. But to make good hardware you have to have a good performance model. Just writing software and expecting it to turn into good hardware is a pipe dream.