Compiler framework, insight?


New here, so apologies if I missed the point of this place. I'm a hobbyist interested in languages in general. I've been researching the concepts behind parsers and compilers for a few years, and I'm working on a few projects associated with that research: one related to parser generation (very incomplete; the LL(*) attempt will likely use a pre-calculated look-ahead table, which I suspect is similar to PEG logic, though I'd have to research that thoroughly to be sure), and another related to a generalized high-level abstract syntax tree that provides an abstract typing model with a Common Language Infrastructure (.NET) implementation and an intermediate code implementation.

The focus of this topic is the latter project, the compiler framework. The end goal of this project is to provide a near C#-level representation of a code document object model (not to be confused with Microsoft's sad CodeDOM framework) for static languages (aptly named the Static Language Framework, or SLF for short).

*end introduction*

I've been searching for a while for a place that can give me some realistic feedback about a project I've been working on called Abstraction, a Static Language Framework. I hope to enable everything you'd find in a high-level static language like C#: classes, structs, lambdas, LINQ, extension methods, iterators, and so on. On top of this, I want to add a feature to the CLI's definition of generic type parameters: duck typing.

As of now, I've mostly constructed the high-level intermediate code representation, a unified typing model, and a majority of the expression/statement representations. Right now I'm laying the groundwork for the linker, which will be irksome because the available namespaces in scope can change from method to method, by design, to accommodate a situation that's likely been encountered by a few C# users: extension method overlap. Below is a list of what I hope to pull off with this project. Mind you, it's focused on a document object model that describes code and then compiles it, so there's no language-specific logic tying it to VB or C# exclusively (a reason for that below).

  1. Multiple File Assemblies
  2. Duck Typing (Through type-parameters)
    1. Parametered Constructor Constraints
    2. Property Constraints
    3. Method Constraints
    4. Indexer Constraints
    5. Event Constraints
    • This functionality will work by creating a public but non-editor-browsable interface on which 'bridges' can be constructed. Call sites which use the duck-typed generics will generate a bridge for the types accordingly. Prior to use of the generic, the compiler will check the bridge for a given closure of the generic parameters in use and inject the compiler-generated implementation of the interface, which will be a basic member mapping.

      The information associated with the constraints will be embedded in a per-type-parameter private empty struct, which will be linked to the type-parameter through an attribute that points to that struct. I know of no better way, since constraint signatures which refer to other type-parameters can't be embedded in attributes alone: attributes can't legally contain typeof(TParam).

  3. Lambda Expressions
    • From my understanding of lambda expressions, closures are created out of the given scope of the lambda; based upon scope nesting, the locals of the method are hoisted into a generated class, with the lambda expression becoming a method within that class. The internal references to the locals within the method are redirected to a singleton instance (relative to the method call) of the locals associated with the scope.
  4. Language Integrated Query
    • LINQ is syntactic sugar over a series of extension methods which resolve based upon the active extensions in scope. Once the query is rewritten, the identity binding is based on the methods that match the given sequence of method calls. IQueryable<T> resolves despite also being IEnumerable<T> because it sits higher in the interface inheritance chain.
  5. Iterators and Asynchronous methods
    • Iterators and asynchronous methods work, in an abstract sense, by taking a given method and chopping it up into sections based on a predefined state-change signaling symbol of some kind. In iterators it's yield *; in asynchronous methods it's await.

      In the case of asynchronous methods, it gets a bit trickier, primarily because await is allowed within expressions, whereas yield * is only allowed at the start of a statement. My guess is this will require a bit more in-depth analysis at the expression level, adding private locals to the generated state machine to hoist the previously calculated elements of the, potentially pending, method call.

      Example: F(G(), await N()). A local would need to be created to hoist the result of the call to G to ensure that the execution order is preserved. An exception can be made for fields, since there is no chance of a state change occurring: there is no way to capture gets from fields.
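To illustrate the lowering described in item 5, here is a hand-written sketch of the kind of state machine a compiler would generate for a trivial two-element iterator. This is a simplified illustration, not the project's actual output; the real machinery also handles threading, disposal, and fresh-enumerator semantics.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// The iterator body "yield return 1; yield return 2;" chopped into
// sections at each yield, with a state field recording the resume point.
class TwoValues : IEnumerator<int>, IEnumerable<int>
{
    private int state;                          // which chop point we're at
    public int Current { get; private set; }

    public bool MoveNext()
    {
        switch (state)
        {
            case 0: Current = 1; state = 1; return true; // first yield
            case 1: Current = 2; state = 2; return true; // second yield
            default: return false;                       // finished
        }
    }

    public IEnumerator<int> GetEnumerator() { return this; }
    IEnumerator IEnumerable.GetEnumerator() { return this; }
    object IEnumerator.Current { get { return Current; } }
    public void Reset() { state = 0; }
    public void Dispose() { }
}
```

A foreach over new TwoValues() yields 1 and then 2; each MoveNext call runs exactly one section of the original method body.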

The last four main items above are the major points I wanted to list, because they're the items I'm most likely to have misconceptions about, and they're the toughest of the bunch, from what I can tell.

The reason there are no language-specific semantics implied within this framework is a secondary goal: code generation. To ensure that when you write code for a specific language you get the proper result when encoding expressions, language-specific binary expressions are available (to enforce C# operator precedence, you'd use the C# variants of those expressions).

I've already done some previous testing with things like constant folding, short-circuit logic, and the like. I'm hopeful that the folks here can provide some much-needed insight, and perhaps language suggestions (since most people here are interested in language design or languages in general, the suggestions should stay realistic).



What kind of things would your tool be used for?

Re: Goal

The goal of the project is to enable a framework for those who are interested in writing their own language for the .NET Common Language Infrastructure. The reason for the abstraction layer over the type system is to enable potential other targets in the future (like Java, or my own infrastructure if I learn enough to do so).

The main reason I'm writing it is that I see a clear lack of such a framework within the CLR. Even when compiler-as-a-service hits some future version of the framework, if I'm right, it won't give others the complete access that the VB and C# compilers themselves will have.

One of the other main focuses is the simplicity it hopes to bring. Reflection.Emit is fairly complex for someone who's never coded using an instruction stack as a means of representing intent. Here's a simple example:

var testAssembly = IntermediateGateway.CreateAssembly("WindowsFormsTest");
testAssembly.CompilationContext.OutputType = AssemblyOutputType.WinFormsApplication;

//Define the main dialog.
var mainDialog = testAssembly.DefaultNamespace.Classes.Add("MainDialog");
mainDialog.BaseType = typeof(Form).GetTypeReference();
mainDialog.AccessLevel = AccessLevelModifiers.Internal;
//Add the designer partial file to the main dialog.
var mainDialogDesigner = mainDialog.Parts.Add();

//Defines the components of the main dialog.
var mdComponents = mainDialogDesigner.Fields.Add(new TypedName("components", typeof(IContainer).GetTypeReference()));
var mdDispose = mainDialogDesigner.Methods.Add("Dispose", new TypedNameSeries() { { "disposing", CommonTypeRefs.Boolean } });
var mdDisposing = mdDispose.Parameters["disposing"];

//if (disposing && this.components != null)
var disposeCondition = mdDispose.If(mdDisposing.LogicalAnd(mdComponents.InequalTo(IntermediateGateway.NullValue)));
//    this.components.Dispose();
disposeCondition.Call(mdComponents.GetReference(), "Dispose");
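For contrast, here is roughly what just the short-circuit check from that example looks like when expressed through Reflection.Emit's stack-based API. This is a minimal sketch using a DynamicMethod rather than a full assembly, but the opcode-level style is the same:

```csharp
using System;
using System.Reflection.Emit;

// Hand-emitting "disposing && components != null" as raw stack opcodes.
var method = new DynamicMethod("And", typeof(bool),
    new[] { typeof(bool), typeof(object) });
var il = method.GetILGenerator();
var done = il.DefineLabel();
il.Emit(OpCodes.Ldarg_0);          // push 'disposing'
il.Emit(OpCodes.Dup);              // keep a copy for the short-circuit result
il.Emit(OpCodes.Brfalse_S, done);  // if false, the copy (false) is the answer
il.Emit(OpCodes.Pop);              // discard the copy; evaluate the right side
il.Emit(OpCodes.Ldarg_1);          // push 'components'
il.Emit(OpCodes.Ldnull);
il.Emit(OpCodes.Cgt_Un);           // components != null
il.MarkLabel(done);
il.Emit(OpCodes.Ret);
var and = (Func<bool, object, bool>)method.CreateDelegate(
    typeof(Func<bool, object, bool>));
Console.WriteLine(and(true, new object())); // True
```

You have to track the evaluation stack and branch labels yourself; the higher-level DOM above hides all of that behind If/Call.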

Essentially, it has to be easy to structure if you're using it for code generation; I think it's equally important for it to be easy to use if you're using it for a language you're writing.

It's one thing to write a parser, but something else entirely to write a compiler. I know this from first-hand experience: even once you've come that far and it works, it's still a load of work, and you're never done if you're constantly updating the language. The more basic language features I can provide to someone interested in trying their own hand at language design, the better.

It's basically a framework aimed at what any good framework should do: make a common set of features available to more people. People who either can't implement those features themselves due to time or experience, or who want to focus on testing new ideas and want the standard fare of features readily available so they have less work to do up front.

What about the DLR?

They added some very nice expression/statement tree APIs in .NET 4.0 that went along with the DLR work. The only thing you can't do easily yet is define classes/interfaces without hacking through emit (still possible, but annoying), or screwing with assembly references. Maybe this is what you are working on, in that case it might be better to interoperate with the DLR or even incorporate their code base.

Re: What about the DLR?

The Dynamic Language Runtime is focused towards a different crowd.

The DLR is primarily focused on the concept of dynamic dispatch, which is geared towards scripting languages in general; this project is aimed primarily at static languages.

Since I plan on allowing language users to utilize the concept of dynamic static typing, the DLR will be an area of research once that gets underway; however, I will likely not heavily rely on it beyond that feature. I'm a firm believer that static typing has its advantages and do not want to focus on a typeless (or single-type) dynamic dispatching system.


Further, from what I've seen of System.Linq.Expressions (the main expression/statement API I've looked at so far), it's incredibly verbose and irksome to construct statements with. Such a synthesis method isn't really acceptable for a code generator; while CodeDOM is equally verbose, it's the verbosity and the resulting end-user code-maintenance headache that I'm trying to avoid here.
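To give a concrete sense of that verbosity, here is what building even a trivial lambda looks like through the real System.Linq.Expressions API:

```csharp
using System;
using System.Linq.Expressions;

// Building (x, y) => x * y + 1 node by node.
var x = Expression.Parameter(typeof(int), "x");
var y = Expression.Parameter(typeof(int), "y");
var body = Expression.Add(
    Expression.Multiply(x, y),
    Expression.Constant(1));
var lambda = Expression.Lambda<Func<int, int, int>>(body, x, y);

// Compile the tree to a delegate and invoke it.
Func<int, int, int> f = lambda.Compile();
Console.WriteLine(f(3, 4)); // 13
```

Every operator and constant is an explicit factory call; for statement-heavy code the tree-building quickly dwarfs the code being described.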

I have found

I have found System.Linq.Expressions to be more than adequate for its use in Bling, which is statically typed. I think the only way to really do better is through some kind of DSL. It's definitely much better than Emit.

Re: Bling?

Bling, is it the UI toolkit from CodePlex?

Also, regarding what you're referring to: can you give a use case from your end of the spectrum, such as the actual expressions you're synthesizing through System.Linq.Expressions?

Yes and yes :) I'm

Yes and yes :)

I'm synthesizing lots of expressions in Bling, starting from simple mathematical functions to statement-heavy boilerplate code that does WPF data-binding or installs and maintains a DirectX shader. I even have a generator for a physics engine that runs superfast because there is no abstraction or garbage created after code generation.

There are a couple of cases in Bling where I generate actual class implementations using Emit, but I use the System.Linq.Expressions to generate the method bodies of these classes and use a static field hack to connect the two.

Re: Bling


From what I can tell most of the work you perform is during run-time, is it not? If such is the case then you're likely using the right tool for the job. The primary focus for the project I'm writing is for users looking for a traditional compilation pipeline accessible through C# in a library. There is a fairly substantial performance hit in the system I'm constructing due to the way it intends to operate.

Since it wraps the standard type system, any access to its associated metadata is penalized through a double caching process. I only load the types associated with a given assembly upon access; however, in order to do so accurately, the cache for the assembly is loaded in full, while the individual wrapper instances are only created when a specific type is called for.

For instance, if you ask for System.String through the linker core during resolution, the assembly mscorlib is profiled and every type from that assembly is retrieved in order to break down the namespaces within the assembly. The wrappers are then only procured as needed. The primary function of the wrappers is to maintain a consistently high-level presentation and avoid dozens of properties on types that are largely situation-specific based on the kind of type, be it delegates, classes, and so on.
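The two-level scheme described above can be sketched as follows. The class and member names here are hypothetical, chosen only to illustrate the eager metadata load versus the on-demand wrapper creation:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical high-level view of a type; created only on demand.
class TypeWrapper
{
    public string Name { get; private set; }
    public TypeWrapper(string name) { Name = name; }
}

// Hypothetical per-assembly cache: the metadata (type names) is loaded
// in full up front, but each wrapper is procured lazily and memoized.
class AssemblyCache
{
    private readonly HashSet<string> metadata;                 // eager
    private readonly Dictionary<string, TypeWrapper> wrappers  // lazy
        = new Dictionary<string, TypeWrapper>();

    public AssemblyCache(IEnumerable<string> typeNames)
    {
        metadata = new HashSet<string>(typeNames); // full profile pass
    }

    public TypeWrapper Resolve(string name)
    {
        if (!metadata.Contains(name))
            throw new ArgumentException("Unknown type: " + name);
        TypeWrapper wrapper;
        if (!wrappers.TryGetValue(name, out wrapper))
        {
            wrapper = new TypeWrapper(name); // created only when asked for
            wrappers[name] = wrapper;
        }
        return wrapper;
    }
}
```

Repeated resolutions of the same name return the same wrapper instance, so the double-caching cost is paid once per assembly plus once per distinct type actually touched.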

The alternative to this approach is reading the meta-data from the assembly myself, like Mono.Cecil does; however, such an effort is long off because I want to get a proof of concept finished first before I worry about optimizing further.

Granted, if you're constructing code with the types known up front, the penalties would be largely mitigated; from a compiler's perspective, however, that is unlikely. For a user like you, were the solution to meet your needs, it would likely work fine.

The one thing that would be required, though, is an ngen of the framework's assemblies on the target user's machine. Due to the heavy use of generics, JIT of the library takes 2.6+ seconds on a Core i7 Extreme Edition processor, and the abstract type system library (which defines the go-between for intermediate and compiled types) takes ten minutes to ngen on my system. Once I finish the project I'll look into the cause of the issue, and perhaps post about it through Microsoft Connect.

Yep, I'm doing code gen

Yep, I'm doing code gen dynamically which avoids type mirror issues (generated code is in the same type-space as the generator) and even allows the generated code to refer to values in the generator space.

I would like to do what you are doing someday; the dynamic approach is not entirely appropriate, since it can be slow, and sometimes we want to work with separate type spaces.

Here is a link you might be interested in. Ignore the job part, but look at the content of the post.

Re: Roslyn

Yeah, I applied for it, no word in a few months, probably not qualified.

The key point to note from that description is the 'immutable model'. The focus of the project I'm working on is creating a mutable model that is gradually lowered into an immutable form as the compiler proceeds through its resolution and rewrites.

I've been working on this project since before Microsoft even introduced the concept of a compiler as a service. I can't imagine what kind of functionality will be exposed to those interested in writing a language of their own. Given that the description mentions their two major .NET languages, something tells me it'll be somewhat closed in certain areas of functionality; however, only time will tell.

The fact that they introduced a pseudo C#->VB.NET translator tells me they are definitely working on a system very similar to mine; however, that doesn't stop me from doing this, because I do it for fun.

Send me your resume

Send me your resume; I can forward it to someone close to the project, perhaps for a better look.

I definitely understand why you'd want to do this project, it sounds like a great idea.

Re: Send me your resume

Alright, just remember, though, that I'm an autodidact: I've never worked as a professional programmer. Because this is a research project, my resume probably won't look quite as polished as the average candidate's might.

Just giving you a heads up.

Very interesting

I think these are very interesting ideas, but they also promise to demand great effort from you (or from us), though that should by no means discourage you/us.

"Who dares wins" as they say ;)

Now, anyone can feel free to see this as a shameless plug, and my apologies in advance if it is indeed so, per the usual LtU forum practice observed so far, but I can't help relating your effort and thinking to my own...

and since your OP does contain "..., insight?", here's what I believe is relevant:

* In my understanding, you're much concerned with doing considerably better than what we have today, to somewhat "unleash" the potential of the tooling (tools, components, libraries, ...) for implementing languages, when one is willing to break the problem down so that the languages share only one main thing in common:

the target executable/runtime platform, independently of their semantic/syntactic aspects and "intents" (from those languages' implementors' and/or users' perspective);

* then, if I'm not misled, here's what I eventually figured (mostly out of intuition, but also just considering what even the biggest software tool vendors and makers have to offer us today... still pretty remote, in capabilities, from the type of goals you and I have):

1. it's likely (imho) that, whether we like it or not, we'll have to use the same good old, well-known abstraction lever that CS/PLT has used for decades on other topics: making as many of the orthogonal dimensions of the object of discussion (here: languages, with semantics, syntax, and tooling interop) "first class" as possible... that is, making these language dimensions first-class citizens, reified somehow and somewhere in your framework's artefacts, for at least one implementation of it (be it the reference one or not); hence my first and second plugs;

2. but also, "sadly", (1) may turn out to be a necessary but not sufficient endeavour if we also hold the other, most desirable (imho) goal of keeping the scaling of what we will have devised in (1) under meaningful control. I mean, it's surely great if you manage to achieve a much easier composition of tools and tool chains to implement whatever language one might come up with for the chosen target platform, but you still ignore, a priori, the problem of being able to represent, as easily, these languages' and tool executions' formal properties, locally or globally, which is essentially one of the most important preconditions for having, if one wishes, proof-carrying (or at least proof-friendly) code enabled at a large and useful enough scale...

There, then, I'd say one of our first preliminary "homework" assignments should be to have some kind of high-level (but not "too high", either) formal framework ready to use, through which we can (2.a) explicitly relate the language implementations produced with our original intents as language inventors, and with the contextual purposes (unpredictably) discovered as language users, whenever we want(*), and (2.b) describe the bounds of these innovations' scope, for whatever milestone in time we decide to speak about w.r.t. their development.

These are (1)'s outcomes, and my assumption in this context is that we would then surely be very interested in having them be, say, "manageable, rationalized" in an organized fashion, by the already well-known and well-understood results about computation, transformations of software, rewriting, etc. (again, for the same point in time).

Well, these are bounds one could hope to manage, anyway, if only at a smaller order of magnitude (number of LOCs in input/output artefacts, or number of combinations/relations considered between the notations, etc.), since the idea of abstracted, simplified models, PoCs, and other genres of the prototyping/validation stage is already in broad use, not only in CS/PLT, but also to a fair extent (well, when one's lucky enough ;) in today's common S.E. practices.

Hence, a third, and last, plug (to a tangent LtU thread, this time).

I'd love to get a chance to discuss this further with you, if you're interested(?)

My 2 cts here, anyway.


(*) In my speculations there, I have been trying "hard" (lack of time to dedicate to it, mostly) to investigate the idea of finding or devising, at the "languages dimension", something (ok, plug #4) with capabilities analogous (for language implementations made within such a would-be framework of yours/mine) to those of .NET metadata, which lets managed components transmit their supplier/client and client/supplier intents and purposes within or between layers. You know, the ones typically but not exclusively used in loosely-coupled architectures, subsystems, and contracts (e.g., as various as those used to ease the use of IoC patterns in designer tools' GUIs, or to describe web-service endpoints, or as markers in the various serialization formats, be they used over the www or not, and so forth).

If I understand correctly,

If I understand correctly, doesn't Mono.Cecil already provide a lot of this infrastructure?

Re: If I understand correctly,

From what I can tell, no, the focus isn't the same. Cecil does appear to be focused on creating new assemblies from scratch; however, it does so through a low-level API that re-exposes the instruction-based stack opcodes as well as the low-level type infrastructure.

The project I'm working on is aimed at generating assemblies, in the new assembly space, at a higher level of abstraction. Instead of viewing the code through a stack and creating your logical flow of code yourself, it presents the code through higher-order constructs like a condition statement or a loop statement: the typical stuff that most language developers write themselves anyway.

I also want to bring more complex language features to the plate, such as Language Integrated Query, lambdas, and so on: stuff that doesn't seem possible in Cecil (though of course I only had a cursory look at Cecil's source to ascertain its project goals).