Synchronic Computation II

May I draw members' attention to an updated version of an earlier attempt to bypass the iteration, data structure, and deadlock issues associated with conventional dataflow models. As far as I am aware, no pure λ-calculus simulation environment with viable time and space complexity characteristics has ever been constructed for a high level programming language such as Haskell. That difficulty lends support to the view that alternative models should be considered, in which software is not fully abstracted from hardware. An 8-page PDF summary of the material may be accessed here.

Abstract

Space is a dataflow-oriented language that exploits the massive parallelism available in a formal model of computation called the Synchronic A-Ram, and provides a framework for developing general-purpose parallel environments for FPGA and reconfigurable architectures. Space is an example of an interlanguage: a circuit-oriented, textual environment developed to address shortcomings of conventional tree languages for representing dataflow, allowing complex syntax trees to be collapsed into a simplified form. Space expresses variable-grained MIMD parallelism, is modular, strictly typed, and deterministic. Barring operations associated with memory allocation and compilation, modules cannot access global variables, and are referentially transparent. Modules exhibit a small, sequential state transition system, aiding verification. Space deals with communication, scheduling, and resource contention issues in parallel computing by resolving them explicitly in an incremental manner, module by module, whilst ascending the ladder of abstraction. An environment has been developed, incorporating a simulator and a compiler that transform Space programs into Synchronic A-Ram machine code consisting of only three bit-level instructions and a marking instruction. Space and the Synchronic A-Ram do not exhibit the iteration, data structure, and deadlock related issues associated with conventional dataflow models. The implementation of high level (parallel) computation on a simple formal model supplies a missing link in computer science, and points to architectures and environments that are less susceptible to the contention and programmability issues associated with multithreading on processor networks.

Example

Where do the 759 cycles in the example come from? If this is a vector of independent adders, shouldn't it take 32 cycles or less? It would be helpful if you could explain what these cycles are and how they relate to the intermediate steps in the computation.

Synchronic A-Ram machine cycles

The Synchronic A-Ram model is so fine grained that even a 1-bit full adder occupies 25 registers and takes 7 cycles to complete. The adder32 module, which appears as a submodule of the bigaddition module in Figure 1, is an Earth implementation of a 32-bit adder, whose code appears on page 266 of a larger report.
An adder32 module is space efficient whilst being relatively time inefficient: it occupies only 138 registers, by rewriting a full adder's code during runtime, but takes a maximum of 736 cycles to complete. (A 32-bit carry adder would be larger, but could be made to complete in about 80 cycles.) The bigaddition module performs 65536 32-bit additions simultaneously in 759 cycles, and takes a little longer to complete than a single adder32 because the module's large number of simultaneous copies and submodule activations needs a few more cycles to process.
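As a rough cross-check, the quoted figures are mutually consistent under a simple cost model. The per-bit cost and the activation overhead below are inferred from the quoted totals rather than taken from the report, so this is only a back-of-envelope sketch in Python:

BITS = 32
ADDER32_TOTAL = 736                      # quoted worst case for one adder32
PER_BIT = ADDER32_TOTAL // BITS          # 23 cycles per bit: assumed serialized cost, incl. code rewriting
BIGADD_OVERHEAD = 759 - ADDER32_TOTAL    # 23 extra cycles: assumed copy/activation cost

print(BITS * PER_BIT)                    # 736, the quoted adder32 worst case
print(BITS * PER_BIT + BIGADD_OVERHEAD)  # 759, the quoted bigaddition total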

Space

I've had a look through your report and I think that I can see what you are doing. But unless I've missed something (which is quite likely, given the size of your report), I can't see why you are doing it. To clarify: I can understand the justification for distributing the computation throughout the memory, so that each memory location becomes a simplified processor. But as you have defined it, I cannot see any form of locality, which is essential to making claims about scalability and verifiability.

Your A-RAM contains four instructions: the two forms of write, the conditional skip and the jump are clearly universal (as I can unfold any mapping over a finite number of bits into a decision tree for each output bit, and thus build comparators for the jumps). But each read and write can access any location - there is no concept of locality within the list of registers that constrains which locations an instruction in a register can access. So the interconnect must be fully connected between registers, must synchronise on each cycle, and must terminate with an error on write collisions. This doesn't seem to be an improvement on a standard shared-memory multi-threaded machine.
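To make the objection concrete, here is a minimal Python sketch of a lock-step machine of roughly this shape. The instruction semantics are guessed from the discussion in this thread, not taken from the Synchronic A-Ram definition, so treat it purely as an illustration of global addressing plus per-cycle synchronisation and write-collision errors:

class WriteCollision(Exception):
    pass

def step(mem, threads, program):
    """Advance every active thread by one instruction in the same global cycle."""
    writes = {}
    next_threads = []
    for pc in threads:
        op, arg = program[pc]
        nxt = pc + 1
        if op in ('wr0', 'wr1'):                 # a write may target ANY register bit
            if arg in writes:
                raise WriteCollision(f"two writes to bit {arg} in one cycle")
            writes[arg] = 1 if op == 'wr1' else 0
        elif op == 'skip':                       # skip the next instruction if the addressed bit is set
            if mem[arg]:
                nxt = pc + 2
        elif op == 'jump':
            nxt = arg
        next_threads.append(nxt)
    for addr, bit in writes.items():             # all writes land together at the end of the cycle
        mem[addr] = bit
    return next_threads

# Two threads setting different bits in the same cycle; a shared target would raise WriteCollision.
mem = [0, 0, 0, 0]
program = {0: ('wr1', 2), 1: ('wr1', 3)}
print(step(mem, [0, 1], program), mem)           # [1, 2] [0, 0, 1, 1]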

Have you seen this recent work that aims at a similar area, but restricts the communication topology between the cells?

Locality and Asynchronism

My overall position is informed by (i) not having to restrict the communication topology between cells due to wiring considerations, because of the prospect of intra-chip wave based interconnects, and (ii) considering the synchronisation of large systems to be feasible and affordable, because of opportunities afforded by optoelectronic clocking (see D.A.B. Miller’s work). Thanks for the interesting RALA link, I can see why you mentioned it -- they seem to be presenting an asynchronous version of an FPGA, with some added control features taken from dataflow models. I am a bit puzzled as to why they wish to introduce asynchrony at such a primitive level -- designing asynchronous digital circuits is more fraught because of timing problems in the feedback path -- perhaps their system is meant to deal with that, but I couldn’t see it.
They state in the paper that “physical dynamics are inherently distributed, parallel, asynchronous, and scalable”. That might be true, but it does not follow that explicit asynchrony and non-determinism (agent-based parallelism) are fundamental to parallel programming. These features introduce state explosion and make the avoidance of deadlock difficult. The aim in my approach is to develop a synchronous, deterministic parallel programming environment, and to leave asynchronous and non-deterministic features that improve efficiency and scalability to the compiler and architecture. Programming is virtually synchronous, but the physical system may be asynchronous and hence scalable.
As you point out, the state transformation function for the A-Ram as presented defines a fully synchronised machine that has a constant propagation delay of one cycle (or, as you might say, the model is non-local). I didn't follow what you meant by the link between locality and verifiability. My intention was to abstract away propagation delay in order to provide an extremely simple parallel semantics for developing a programming interlanguage, incorporating the resource allocative element, and for easing the avoidance of deadlock. The interlanguage is helped in achieving this by the following (a small sketch of the first point appears after the list):
- directly sharing subexpressions in data structures and dataflow on a syntactic level;
- tagging data transfers with soft machine connections expressed at the preceding level of abstraction;
- tagging operations with soft machine resources expressed at the preceding level of abstraction.
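The first point can be illustrated generically. The Python sketch below is not the Space/interstring notation itself; it only shows what collapsing a syntax tree into a DAG by sharing identical subexpressions looks like:

def share(expr, pool=None):
    """Hash-cons nested tuples so that equal subexpressions become a single object."""
    if pool is None:
        pool = {}
    if not isinstance(expr, tuple):
        return expr
    node = tuple(share(e, pool) for e in expr)
    return pool.setdefault(node, node)

# (a + b) * (a + b): as a tree the sum appears twice; after sharing it is one node.
tree = ('mul', ('add', 'a', 'b'), ('add', 'a', 'b'))
dag = share(tree)
print(dag[1] is dag[2])   # True -- the common subexpression is shared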

Verifiability

It took me a few minutes to work out what I meant by verifiability, so it obviously wasn't very clear :-).

When you describe the verifiability of the system, I'm interpreting what you've said as meaning that you assume the underlying interconnect possesses certain properties. Based on those properties you can then verify the execution of the program on the processing elements above that interconnect. You've made these assumptions very explicit, so the fact that your model is based on them doesn't detract from it.

I'm more used to seeing the description of the interconnect as part of the specification being verified - so that the models of the processor and the communications are combined to allow verification of the complete system. In this situation the definition of the properties of the interconnect can lead to difficulties in the verification.

For the interconnect, the verification challenge is to show that information is propagated correctly within the clock cycle. As far as I understand, the goal of the verification would be to prove that the device can operate at some particular clock frequency. If you use a global synchronous clock, then the achievable frequency would depend on the size of the device. It seems that this makes verification of correct operation differ for each size of device, and thus the problem is harder. Model-checking really isn't my area, though, and I know there was some work done on containing the state space explosion for parameterised families of problems in the mid-90s.

I'm unfamiliar with Miller's work and optoelectronics in general, but it seems that the basic problem with a global synchronous clock would apply: as you increase the number of elements, the achievable clock frequency decreases. There are two components - the speed of information propagation is bounded, and if you use a fully connected topology then you need multiplexers / de-multiplexers at each node, so there will be an O(lg n) signal propagation delay in the tree.

The advantage of restricting the communication topology is that the achievable clock frequency is O(1), whereas a global synchronous topology imposes a limit on clock frequency of O(1/lg n). This is a simplification that ignores constraints from bounds on information propagation.
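A small Python sketch with made-up constants (the delays are purely illustrative, not measured figures) shows the shape of that scaling argument:

import math

def achievable_frequency(n, base_delay=1.0, mux_stage_delay=0.1):
    """Relative clock frequency for n elements under the two topologies."""
    fully_connected = 1.0 / (base_delay + mux_stage_delay * math.log2(n))  # pays a lg n mux/demux tree
    local_topology = 1.0 / base_delay                                      # constant-depth neighbourhood
    return fully_connected, local_topology

for n in (2**10, 2**20, 2**30):
    print(n, [round(f, 3) for f in achievable_frequency(n)])   # fully connected degrades as O(1/lg n)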

The current focus on asynchronous designs stems from the fact that they are easier to build and less complex to design. As you point out, this pushes some of the issues into software verification. The local-topology / complete-connection distinction is a slightly different (though related) debate. It's simpler to design software for a completely connected topology, but a device that meets those constraints is harder to design and build. It's much more complex to reason directly about information propagation in a compiler, but the benefit is that the device is cheaper to build. These issues have cropped up for Tilera, as their architecture exposes the path delays to the compiler, and there was a similar issue in one of Clearspeed's architectures.

Synchronism and Wave Based Communication

The current focus on asynchronous designs arises out of several factors, not least of which is that asynchronism is part of the zeitgeist. There is a quasi-political consideration: because individuals, corporations, and sovereign nation states have a reasonable desire to keep their internal workings private, and not subject to centralised (read: synchronised) control, this aspiration should be reflected in the design of networks of computational devices. Freedom is an absolutely legitimate concern, but it is not obvious why it has any relevance to the purely technical issues relating to the foundations and implementation of high performance computing. Agent-based concurrency has dominated theoretical studies for decades without resolving the parallel computing crisis -- there is a need to consider other approaches.

Large computer architectures can be clocked if there is sufficient investment in synchronisation mechanisms. The establishment of a global clock does not require signals to travel across the entire radius of an environment, or through wires. Optoelectronic technology [1] enables very short cycle times. In [2] Miller states, “It is likely possible to retain absolute timing accuracy in the delivery of optical clock signals of ~10-100 picoseconds over a computer room (tens of meters) without any special technology.” If that's true, it should, for example, be feasible to synchronise millions of 3 GHz processors. I don't think architectures for synchronic computation would need to synchronise over anything like that kind of area to be competitive.
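As a back-of-envelope check of that claim, the quoted delivery uncertainty is a fraction of a 3 GHz clock period:

period_ps = 1e12 / 3e9                            # one 3 GHz cycle is ~333 picoseconds
for skew_ps in (10, 100):                         # Miller's quoted delivery uncertainty range
    print(skew_ps, round(skew_ps / period_ps, 2))  # ~0.03 and ~0.3 of a cycle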

In a physical implementation, communication times between storage elements described in Space modules may vary, but all communications would succeed (barring hardware failure), because the programming model is set up to be Exclusive Write. Verification from that point of view would not be necessary. I would be grateful for any links describing Tilera's and Clearspeed's issues, for future comparison.

In a wave-based, fully connected scheme for n elements, there is potential for O(n) area complexity. Wave-based information may be transmitted by electromagnetic, spintronic, or possibly some other form of radiation. Digital information is encoded through the modulation of a wave's amplitude or phase, transmitted through the wave medium by an emitter, and received by a detector and decoded. A multitude of links can be realised in a wave medium or free space, potentially without crosstalk or interference, in at least two ways.

In the first scheme, processing elements are laid out in a plane, and an element incorporates at least one pair of a fixed detector and a pointable emitter or reflector, to aim at most one other element by bouncing a transmission through free space off a reflecting mirror or deflecting layer. Propagation times may vary depending on the total flight distance between points. A single emitter is point to point, and does not support broadcast mode.

In the second scheme, exclusively employing the technique of wavelength division multiplexing, each processing element is designated its own specific wavelength interval, and is equipped with a single wave emitter/detector mechanism, where the wavelength configuration of at least the detector must be adjustable/tunable. The plane of processing elements is sandwiched with a wave medium. Propagation times for messages again may vary, depending on distances between points. Each element may omnidirectionally broadcast on its own emitting wavelength, through the medium, to a plurality of receiving elements, provided the receiving elements' detectors are tuned to the emitter's wavelength.

In both cases, there is no requirement for multiplexers/demultiplexers with non-linear space and time complexity, merely for an element to have a tunable or pointable receiver/emitter.
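A minimal Python sketch of the second, wavelength-multiplexed scheme (an abstract model only; the names and structure are mine, not drawn from the report) shows how broadcast falls out of detector tuning without any mux/demux tree:

class Element:
    def __init__(self, ident):
        self.ident = ident
        self.emit_wavelength = ident     # each element owns one emitting wavelength interval
        self.tuned_to = None             # detector tuning, adjustable at runtime

def receivers(sender, elements):
    """Elements whose detectors are currently tuned to the sender's emitting wavelength."""
    return [e.ident for e in elements
            if e is not sender and e.tuned_to == sender.emit_wavelength]

elements = [Element(i) for i in range(4)]
elements[1].tuned_to = 0                 # elements 1 and 3 listen to element 0's wavelength
elements[3].tuned_to = 0
print(receivers(elements[0], elements))  # [1, 3]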

[1] Aparna Bhatnagar et al., “Receiverless detection schemes for optical clock distribution,” in Quantum Sensing and Nanophotonic Devices, vol. 5359 (San Jose, CA, USA: SPIE, 2004), 352-359.
[2] D.A.B. Miller, “Rationale and challenges for optical interconnects to electronic chips,” Proceedings of the IEEE 88, no. 6 (2000), Section C.1, p. 734.

Linking proofs with programs

In addition, I was trying to generate interest in the question of whether it matters that the λ-calculus and the Turing Machine are not suited to executing high level processes. I argue in chapter 9 of the arxiv link that neither can feasibly support random access memory or the parallel composition of modules, which are needed for the practical simulation of high level computation in terms of primitive steps. In the case of the λ-calculus, the problem partly derives from adopting, as the syntax for the formalism, a non-spatial, context-free grammar with high structural variability, in which any branch of the syntax tree may be indefinitely long. Unfortunately, graph rewriting is not a feasible solution, because pattern matching is NP-complete.

The Curry-Howard isomorphism is a means of relating proofs in various conventional tree-based logics to terms in various extensions of the conventional tree-based λ-calculus. Can one assume that the difficulty of expressing high level programs, and consequently high level proofs, in the λ-calculus does not place a limit on the practical utility of the Curry-Howard isomorphism?

I would like to argue that type theory's overall agenda is constrained by its focus on standard tree-based logics and calculi. A future paper will describe how simulation-efficient compute models in the α-Ram family can provide a platform for spatialising logic and mathematics, and thereby provide an alternative means of realising mathematical argument as a form of computation. Complex structures, procedures, and proofs could be brought into computational life as abstract data structures, virtual algorithms, and interstring-based logic programs defined in terms of a primitive formal model. The intention is to allow the use of only enough standard set theory and informal logic for the definition of a low level, generic, synchronic model of computation. The Synchronic B-Ram defined in 3.2.2 of the linked report fulfils this requirement. It is proposed that the following steps are then taken:

(i) The construction of a native, high level interlanguage programming environment, able to support virtual programs and abstract data types.
(ii) The definition of a prenexed and interstring-based form of predicate logic called interlogic, together with a deterministic parallel deduction algorithm that is executable on the model.
(iii) Finally, the recasting of the rest of mathematical discourse concerning structures derived from computable algebras, as interstring-based reasoning about virtual programs and abstract data types.