A simple interchange format for syntax trees of any language

A few months ago I wrote an article on CodeProject detailing my plans for my parser generator and the abstraction it will build upon, something called Loyc trees and LES, hoping to get feedback on my ideas.

I didn't get a single comment, though. I think now that I was just talking to the wrong crowd, and that perhaps the LTU will have richer things to say. Basically LES is an interchange format for syntax trees ("XML for code", I like to say, but since the syntax is concise and intuitive-ish, it's more like "YAML for code"), and Loyc is a project for converting code between programming languages and making source code easier to process by mere mortals. (Plus, I'm developing a programming language called Enhanced C# and would welcome advice about making a hygienic macro system.)

So, I hope you guys will find my idea interesting and offer your feedback. The most up-to-date summary of my ideas are on the Loyc wiki:

Loyc trees: https://github.com/qwertie/LoycCore/wiki/Loyc-trees
LES: https://github.com/qwertie/LoycCore/wiki/Loyc-Expression-Syntax

And if anybody's interested in a parser generator for C#, LLLPG is getting close to finished now.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Loyc

I forgot to add that this is all part of a broader idea called the Language of Your Choice (Loyc) project, which is all about transforming source code between different languages (with a focus on popular statically-typed languages), multi-language library programming, and manipulating or analyzing source code. I'm not sure yet how to translate between languages in general, and advice on that is also welcome. Needless to say, developers are welcome too since this is a bit of a large project for one person to do in his free time.

I think you should start by

I think you should start by writing a concise definition of what a "Loyc tree" is.

You may want to be concise on this, trying to write such a definition within two or three sentences.

While writing such a definition I'd suggest trying to get straight to the point, avoiding opinionated sentences such as "I hate XML" and the like, that just confuse the reader.

Finally I'd suggest reading about Abstract Syntax Trees and about the Abstract Syntax Tree Metamodel by the OMG, and writing a discussion between the advantages and disadvantages of "Loyc trees" over those.

How's this?

A Loyc tree is a list of attributes (which are Loyc trees), plus either a literal, an identifier or a call (which consists of a call target and a list of arguments, which are Loyc trees). Loyc trees also contain position information (location within a source file). In other words, a Loyc tree is a data structure with these fields:

1. Range: tuple of (source file name, integer position, integer length)
2. Attrs: a list of attributes
3. One of: a literal (Value), an identifier (Name), or a call (Target, Args). Since I implemented it in C#, which has no ADTs, the properties Value, Name, Target and Args exist at all times and there's a Kind property to distinguish what a particular node is.)

Better? I'll have a look at that "AST metamodel".

ASTM versus Loyc trees/LES/etc

I puzzled over the OMG AST metamodel specification and related specifications (XMI, MOF) for awhile and googled around until I finally found an actual real-life example of an XMI file (which was a trivial diagram that some enterprise software package outputs into a relatively large and unweildy XML file).

After this research (and I notice that googling for OMG terminology doesn't usually turn up tutorials) I can conclude that the main difference between Loyc (including LES and Loyc trees as well as future standards like SIL) and OMG's ASTM/XMI/MOF/etc is that Loyc standards will be accessible to all software developers, while OMG standards are completely impenetrable, i.e. unless you've studied OMG standards for a long time, there's no way you can understand them, and it is unlikely that an ordinary developer could figure out how to process (programmatically input and output) ASTM-compliant XML documents in XMI format. It seems to me that ASTM is intended to be used by people who are paid to work full time writing "enterprise" UML tools. My impression is that it is not meant for "normal" researchers and developers.

In contrast, Loyc standards are meant to be understandable and directly usable by all developers, paid or hobbyist, PhD or BSc or mere high-school grads. While ASTM embraces a complicated abstraction based on XML, LES is designed to avoid the syntactic noise of XML, so users can write valid LES code easily and directly. While OMG features companies as members, with Loyc I hope to build a community of individuals (any tips on how to do that?)

I don't know how ASTM is used in practice, but it seems like UML is at the center of everything OMG does. So I assume ASTM would typically tie in to UML diagrams somehow, while Loyc is mainly focused on processing everyday code in ordinary programming languages.

My feedback

I've had a quick look at the first link. I have to say I didn't see the problem you were trying to solve but this will probably be because I don't have the same problems as your target audience. I've always found first-order logical formulae and s-expressions to be representationally good enough for my needs.

That aside, here's my quick feedback:

What's a method? You write "a general representation of syntax trees for any language" then talk about methods. Is this only for OO languages?

The section "Loyc trees versus s-expressions" didn't really say anything other than your Loyc trees are s-expressions with constraints. What can you represent with a Loyc tree that you cannot represent with an s-expression?

What happens if the language supports "#(1,2)" as a term that can be "executed", how would you represent that?

Your literals are architecture specific - you use the term "double". What about languages with arbitrary precision floating-point numbers? You also made some architecture assumptions later on with your use of "int32". How will you generalise all this?

Since I write a lot of C#,

Since I write a lot of C#, where the word "function" is never used, I've come to use "method" as a synonym for "function". Perhaps it would be better for a general audience to use "function" instead. No, Loyc trees are not OO-specific.

Okay, I'll try to explain Loyc trees again; let me know if I managed to make this explanation much clearer.

An S-expression is either a literal, a symbol (identifier) or a list of S-expressions. meanwhile, a Loyc tree is a list of attributes (which are Loyc trees), plus either a literal, an identifier or a call (which consists of a call target and a list of arguments, which are Loyc trees). Loyc trees also contain position information (location within a source file).

The purpose of Loyc trees is to represent non-LISP languages, especially popular languages, in a way that more intuitively matches the language being represented (compared to s-expressions). For example, most popular languages explicitly separate the call target from the arguments, so that's what a Loyc tree does. And for more complicated structures like

  // C#
  [STAThread]
  public static void Main(string[] x) { ... }

I believe the Loyc tree representation and LES notation will be easier to work with than an S-expression representation would be. In this case I chose the following representation for this function definition:

  [STAThread, #public, #static]
  #def(#void, Main, (#var(#of(@`#[]`, #string), x),), { ... });

where (foo,) represents a tuple of one item and #def(ReturnType, Name, Args, Body) represents a method. The attributes are attached to the root node of this expression (the call to #def).

Loyc trees are also much superior to some XML representations of PL code that I've seen, both in terms of code size and ease of processing.

The attribute list can be used not just for actual "attributes" (a concept that several modern languages support), but other optional items on declarations, such as "public" and "static", and decorative code elements such as comments (e.g. #trivia_SLComment for a single-line comment), which are called "trivia".

If popularized, Loyc trees will provide a "standard" basis for programming language transformations and conversions. There are a number of uses I have in mind; one is a LISP-style macro system (standard preprocessor) that could be tacked on to several existing programming languages, yet the macro system would integrate in a fairly seamless way in each language (in contrast with the very clunky way that the C preprocessor operates, for example). The macro system would allow a person to easily write a single macro that works in multiple languages, and my parser generator (LLLPG) is an example of that; yeah, right now it's limited to C#, but since LLLPG manipulates Loyc trees, it wouldn't be hard to support other languages in the future.

Currently LLLPG would probably need some code to support other languages, e.g. it would have to be specifically programmed to replace "goto" statements with some alternative in a language like Java that doesn't support goto. But if I can figure out how to do PL conversions more generally, LLLPG could target multiple languages without actually doing any special work to support each one, since the conversion modules would handle any minor semantic differences.

And the point is, this multi-language targetting wouldn't be limited to LLLPG or my own tools; anyone would be able to use it, easily.

I like to compare the Loyc project to LLVM. What LLVM does for back-ends is what I want the Loyc toolchain to do for front-ends. Once all the pieces are in place, creating a new language or a new DSL in an existing language will just be a matter of writing a semantic analysis module for your language. You don't have to do a complete semantic analysis, because many of the pieces will already have been written and you can re-use them. Just as LLVM/JVM bytecode/CLR CIL frees you from writing a back-end, Loyc will free you from writing most of the front end, too. You don't even have to write a parser! Since LES is designed to be language agnostic, you can just use LES at first, and write a parser sometime later (it'll be straightforward to convert any existing LES code to the grammar you design).

The Loyc tree #(1, 2) is just a data structure (with Target being the identifier "#" and two literal arguments, 1 and 2). It's no different than XML like <Octothorpe><IntLiteral>1</IntLiteral><IntLiteral>2</IntLiteral></Octothorpe>; you can process it in whatever way you want. If you want to "execute" #(1, 2), you have to pick some semantics for what that means, then convert the tree to some representation that is closer to what you want (maybe another Loyc tree, maybe LLVM bitcode).

EDIT: although a Loyc tree is "just" a data structure, I also wish to define standards (conventions) for what they mean and how they are structured (these conventions will be the foundation of something called SIL, Standard Imperative Language). In the case of #(1, 2) the current convention is that # means "run a list of expressions/statements and get the result of the last one". Therefore #(1, 2) == 2.

That's interesting, I've never heard of arbitrary-precision floating point before. Honestly as far as I have heard, all floating-point is fixed-precision (I have heard of arbitrary-precision rational numbers though).

Certainly I'm aware of arbitrary-size integers, though, and I'm thinking probably the Loyc project would define a standard name for such a type, e.g. we could call it #int. Or perhaps we'd use #bigint or #long (which, I think, is the Python name for it) for arbitrary size and #int would represent the "preferred" integer type for a particular language. Yes, currently LES is limited to fixed-size integers, but that will change as the project gains contributors who can help plan out these details. Loyc is intended to be a community project, I just have to figure out how to attract developers.

As a person who understands

As a person who understands parsing, may I ask U for Ur opinion on context sensitive approach on parsing texts?

Also, how did U solve left recursion?

At the end, I have to admit I am onto similar mission of supporting many programming languages in one product, so expect some competition from me in the future. Yet, I plan to use just javascript and html/php interface with enhanced Earley parser (100x faster than my current propriatery shift-reduce method, also supports all CFG grammars).

I don't consider myself an

I don't consider myself an expert in parsing. My parser generator is designed for one specific approach to parsing: LL(k). By writing the parser generator I learned more of the nuances of LL(k) parsing, but I didn't study the other approaches to parsing in a lot of detail. Since LLLPG allows you to insert arbitrary actions into the grammar and to use custom expressions to make branching decisions, it can allow context-sensitive behavior. It is not designed specifically for that*, but there are situations where being context-aware can make parsing easier. I'm not really familiar with the Chomskian formal definition of context-sensitive parsing, but it is sometimes useful and/or necessary in practice to going outside the boundaries of context-free grammars. I could give examples, but I'm not entirely sure what you'd like to know so I'll stop here (plus, I'm not entirely sure that I actually know what I'm talking about). Context-free grammars are better-studied and have a nice theoretical simplicity to them, so I certainly wouldn't "go out of my way" to do things in a context-sensitive way.

* actually, just looking briefly at Wikipedia's definition of context-sensitive grammars, it looks to me like LLLPG's syntactic predicate (zero-width assertion) feature lets LLLPG parse some subset of the context-sensitive languages.

I didn't solve left recursion. LL(k) doesn't allow it, and I never really felt a need for it.

"Earley parser"? You learn something new every day. I know of PEGs, LR(k), LL(*), DFAs, and Pratt Parsers, but not Earley parsers. I'll have to look at those sometime.

What are you planning? If you are trying to do something like what I'm doing, it would make more sense to join forces than compete.

Actually I plan to make

Actually I plan to make Internet cloud OS (html/php/mySQL underlaying code) with scripting support where all programmings would be done by scripting. Users would be able to extend OS with new languages like Python, Perl and probably Haskell.

It would be like little Windows, or Linux that would run inside browser, nested in a cloud.

This is some promotial material.

Synth

Hmmmm, I am finding your 'promotional' document difficult to understand. You call "synth" a "universal language", but you don't say in what sense it is meant to be universal. It looks like it is mostly just a data format like XML, but it seems to have some semantics mixed into it.

I can see that English is not your first language, and most people are not naturally very good at teaching. But an indie developer, working alone, who wants to build a community, must do a very good job at communicating. There are a wide variety of ways to accomplish things today, so you need to do a good job communicating what your goals are, and be able to argue why your ideas are better than the other ideas that everyone else is using.

So, what problem is "synth" intended to solve? How is it better than XML, YAML, JSON, Python, C#, or whatever it is intended to replace? You must explain and demonstrate the advantages of your language over the popular alternatives (and since my own project is not popular, I would say there are other ingredients for success which I myself have not yet discovered). You must also use correct terminology that people will understand. For example you use the word "classes", but they don't sound like OO classes; and there is a sentence "Synth, if extended with representation of dynamics can describe general programs as programming language." dynamics? describe programs as programming language? This has no meaning to me.

You suggest a programming community should exist that shares earnings and gives some earnings to the poor. That sounds great! But you didn't say who would pay the community, or why they would pay the community. New business models are difficult to develop (that is why I work for free).

You've also talked about a "cloud OS", but I don't know what that means. Do you really think you can create an entire OS, a competitor to Linux or Windows or OpenStack? and if so, why? Or if it is not an OS, what is it?

David, thanks for feedback

David, thanks for Ur feedback :)

Yeah, please excuse my pidgin English, that is really not my mother language.

So, what problem is "Synth" intended to solve?

Well, it is not better than any other programming language, it is just different. In fact, it uses a different programming paradigm which is neither imperative nor functional, but (what I call) "event driven".

Let's say we have any structure for storing data, i.e. XML. Now, we may track what data is being changed from input devices (built-in structure that represents states from input devices). When data is changed, an user defined event is fired upon which other data is automatically changed. Finally, changed data canonically reflects on output devices (another built-in structure).

Just replace XML structure with a little bit prettier syntax (like this one), stitch input and output devices to hardware and U get "Synth"

To concure Windows or Linux? It is a little bit different market I am aiming to. Right now there exists different cloud sites (I hope U know what cloud means, if not try to google it). Those sites usually have fixed set of applications and they are mine real competitors. What I'm trying to offer is a cloud system with scripting support. Similarities with regular OS-es are in a way that to use Synth, U have to have existing web hosting upon which U would install Synth runtime. After that, U boot into Synth from browser and do similar things U would do with Win or Linux.

As first applications that run on Synth I plan to offer cloud CASE tool with cloud office (real database cutie I have in mind, concures to MS Office database). Also I think that a few 3rd party scripting languages like Python implementation, or Perl would attract programmers.

I'm counting on donations from users that would be satisfied with the offer.

Again, tx for interest and critics on promotial material, feedback is precious to me.

AST or CST?

Please excuse my confusion -- I'm not an expert -- but isn't the Date parse tree example in your link more of a concrete than an abstract syntax tree?

My reason for asking is the presence of the "/"s in the parse tree.

I don't know, I think they

I don't know, I think they call it that way, god knows why. I got that from Wikipedia.

How will code be translated from one language to another?

That seems to be the major problem that this is intended to solve.

If I have some code that uses language-specific features, how faithful to the original semantics will the translation be?

For example, let's say I futz with Python meta-classes. Or I write some Java class hierarchy where there's both overloading and overriding of the same method name.

How does/will the translation work? What are some of the difficulties you've experienced implementing it?

Yes, here's the plan.

Yes, here's the plan.

Work on language conversion will proceed in stages (and very slowly, I might add, as long as no one is helping me).

The first stage is to create "surface syntax" converters which will do a poor job of preserving semantics, and in general will produce syntactically correct, but semantically wrong code. This part of the project is mainly about creating parsers that make Loyc trees and printers that print them, with the actual conversion process being kind of an afterthought. Surface syntax converters are already useful, as people often need to port code between languages. A mechanism for pattern-matching "macros" could help people remove some of the manual labor involved.

The second stage will be to focus on a parameterized "Standard Imperative Language" in which the "parameters" represent semantic and feature variations between common languages. Then the essence of "language conversion" becomes "altering the code to compensate for the removal (or addition) of a parameter".

Certain features of certain languages are difficult (and sometimes impossible) to faithfully translate to other languages; therefore we will focus at first on the most common features that are easiest to translate. One of the products of stage 2 should be a specification for some standard sublanguage that can be faithfully translated to several languages, but is powerful enough for many purposes.

Concurrently, I hope the community (whose materialization I still await) will help develop a multi-language standard library. This library will mainly contain common data structures: strings, fixed-size arrays, dynamic arrays, linked lists, hashtables, etc. Using this library, you can write "rich" code based on these data structures, that supports translation to several languages. The data types for a language X will be designed to interoperate reasonably well with the "standard" data types for language X.

Stage three is ????... but stage 4 is profit!

I certainly don't know how to accomplish all of this yet. "We" (hopefully it won't be "I" forever) will figure it out as we go along.

I would contrast my project with Ć. My ultimate goal is to allow rich programming, such that it is not much more difficult to write multi-language code as single-language code. I certainly intend to offer some form of automatic memory management, as well as functional and higher-order coding styles like list.filter(x => x > 0).map(x => x * 2), and standard facilities like file access. So while Ć is currently designed as a lowest-common-denominator language, I want to provide a language that is, in some ways, more powerful than the conversion targets (e.g. because LISP-style macros are a compile-time feature, they can operate in the initial language, without any support in the target language(s)).