Ideas for a PL-oriented honors thesis

I'm going to be a college senior next year and I want to do an honors thesis related to programming languages. I've been scratching my head for topics but though I have an interest in PL, I don't know enough about the field to be able to tell what the outstanding problems are that are the right size for an undergrad.

One of the ideas I've had is to design and implement a programming language, but I'm not sure if this is a good use of my time. It would probably be only a cousin of SmallTalk/JavaScript and I don't think I'd have the time (1 year, part time) to build substantial libraries or features that would set it apart from the crop.

I also think that VMs are an increasingly important field (especially with the rise of multicore) but I know little about this field. My programming languages was theoretical and interpreter based and my own readings have been along the same lines. I have some experience with program analysis (I worked on blended analysis for Java last summer) again, I don't know enough about the field to know what would be a bite-sized problem to tackle.

I do intend to ask my professors for suggestions and I'm thinking about some non-PL topics as well, but any input you guys have would be appreciated as well. I realize that this isn't strictly a PL-oriented question, so feel free to point me to somewhere else to ask my questions.

Thanks for reading through.

Basu

Don't

The time taken in industry -Phillips- to implement a DSL is one year. And that by experienced programmers. Take a language and extend it somewhat is the best you can do in your timeframe given.

By marco at Wed, 2010-04-14 20:35 | login or register to post comments

The time taken to build

The time taken to build something in a big company is sometimes more than the time it takes a smart undergrad. You can absolutely implement something like Javascript in less than a year.

By Jules Jacobs at Wed, 2010-04-14 22:14 | login or register to post comments

Well...

Depending on how badly they need the tool, and how hard they conform to specs, and how hard they want guarantees that _it_ _will_ _work_, it will pretty much always take more time in industry.

Let's be honest, 98% of all languages build will never make it past industrial tests, and 1.5% that is somewhat accepted during it's lifetime, is fixed after that.

But granted, a javascript interpreter in something like C, yeah, about what, 3-6 months for an experienced programmer?

[ Reminds me a bit of Helium, yet another Haskell. Pretty ok-ishly implemented, but has been running now for 5-years-or-some by what should be considered top notch scientists. Doesn't perform, unsure if it even self-compiles? ]

By marco at Wed, 2010-04-14 22:18 | login or register to post comments

Helium

What do you mean when you say that Helium doesn't perform? Helium is designed for teaching Haskell, so if you are referring to run-time speed then I don't see that as a valid complaint. Similarly, why should Helium need to compile itself when it is intended to teach the basics of Haskell?

By neilbrown at Thu, 2010-04-15 15:27 | login or register to post comments

I know that..

It's just an example of how difficult it is to write a compiler. And honestly, I don't think that the designers of it expected that it would be that slow. Heck, I compile to C, I never expected the amount of bits I would arrive at, i.e. the binary is about 20x the size of what I wanted.

I just get the feeling sometimes that people think that writing a compiler is easy. Yes, patching something together in flex/bison is easy. No, writing anything good which scales and works satisfactory is not.

[And as a side note. Helium is used in teaching, but also as a research vehicle. Part of it is about attribute grammars, part of it is about delivering good error messages. What is the lesson here: You might want both, but it doesn't scale? Of course you would like it to self-compile.]

[There's always the possibility that they fixed the performance issues by now, btw. ]

By marco at Fri, 2010-04-16 03:38 | login or register to post comments

But granted, a javascript

But granted, a javascript interpreter in something like C, yeah, about what, 3-6 months for an experienced programmer?

No way, especially given that we now have redexes for the language. Creating a JIT for it is another matter.. but, even then, it'd make more sense to modify one of the passes of an existing system, so it still shouldn't be bad. Either way, still just a matter of months.

Doing much of this on ones own without a background would probably be a magnitude or two harder, but that's the point of having an adviser.

By lmeyerov at Fri, 2010-04-16 05:12 | login or register to post comments

I don't see your point?

What I think I see is a Javascript interpreter in Scheme. So, if you would take that as a starting point and copy that verbatim to C, maybe.

But you're not going to copy it verbatim, since C has other idioms than Scheme. So there will be some rethinking and restructuring done. Which means: reading, two weeks. Patching together lexical+syntactical analysis, a week. Defining the AST, a week. Coming up with an environment and evaluator which performs and scales somewhat and implementing that in C, at least a month. Testing it, removing bugs, polishing the result, a few more weeks. At least three months, if all goes well.

Yes, you can do a lisp/scheme interpreter in a few weeks, heck for a simplified core a day even. But no, you cannot write a full-fledged fast scheme compiler in under a year (unless you copy it). That would make PLT pretty senseless otherwise. And since a Javascript interpreter is somewhere in between, I hold my ground. 3-6 months. ;)

Advisors can point you in a direction, but unless you're in Glasgow, Berkeley, Moscow, New Jersey, or even, what-do-I-know Nijmegen, or any institute, or rather, have a gifted been-through-the-mill doesn't-tell-you-everything advisor, which has actual experience with implementing compilers, you're mostly in the dark, and so is your advisor.

[ When I say Scheme I mean the basic Lisp/Scheme lambda 'core' which defines it, not all extensions and research decorations. ]

By marco at Fri, 2010-04-16 05:59 | login or register to post comments

What I think I see is a

What I think I see is a Javascript interpreter in Scheme. So, if you would take that as a starting point and copy that verbatim to C, maybe.

What you see is redexes, which happen to be embedded in Scheme. This isn't some ad-hoc informal language or a made up semantics formalism: this is a traditional operational semantics, that, as demonstrated, can be mechanically converted into an executable system (among other nifty things). I might think differently if such a spec wasn't available: without it, there's too much of a risk of hitting an edge case mid-implementation that requires a big rewrite.

I agree about a fast system taking more time, but that's a different issue: given this, a core JavaScript C-based interpreter shouldn't take more than a month, if even. If you're interested in speed, most of the infrastructure should be irrelevant as the project is likely possible as a modification of existing open source systems.

Advisors can point you in a direction, but unless you're in Glasgow, Berkeley, Moscow, New Jersey, or even, what-do-I-know Nijmegen, or any institute, or rather, have a gifted been-through-the-mill doesn't-tell-you-everything advisor, which has actual experience with implementing compilers, you're mostly in the dark, and so is your advisor.

I'm not sure what you're getting at here. Basic PL implementation like this is 1st or 2nd year material at good schools and the linked document dots the i's for a real language. Almost any Ph.D. in PL who is now a professor should be able to advise a simple project like this. Heck, many systems professors should be able to. To make it a research project is a different issue.

By lmeyerov at Fri, 2010-04-16 08:52 | login or register to post comments

Those who teach, can't

Having adapted and written a few compilers, my comment is on the fact that writing a compiler isn't something you can learn out of a book. Of course, you can copy a design, and that will help you patch something together fast. Make a Pascal, change the syntax, is something which comes to mind. But the design space is big, and the number of mistakes you can make is huge.

Unless you've been through the experience of building and adapting a compiler implementing some new ideas, there's no way that most people can even possibly imagine how to do it well, or how much time it takes to build something which works when doing something which is just a bit out of the ordinary.

I've been at a university which is known for writing compilers and compiler tools, not sure how many are out there. I learned half by just listening at their comments. GCC is fast, for example, because they (mostly) don't allow algorithms which have quadratic complexity. Don't think I've read that anywhere in a book.

It looks easy when you read a book, it isn't. And if you copy a design from a book, you're clueless about the amount of lessons which have gotten into that book.

And actually, that is pretty predictable. Since, by and large, when looking at compiler internals, there doesn't seem to be one silver bullet of getting it right. Just look at nano-transformations, thats a pretty old idea. I forgot which one, but there have been compilers which compile in 500 passes (if I remember correctly, that was because of reasons of having limited memory, and an abundance of tape). It ain't strange that such a compiler cannot perform.

[ Btw, I've tried to implement a compiler using redexes in Ocaml. End result: couldn't self compile since Ocaml runs out of stack space easily. Where's the last time you read something like that in a book? ]

By marco at Fri, 2010-04-16 16:51 | login or register to post comments

Having adapted and written a

Having adapted and written a few compilers, my comment is on the fact that writing a compiler isn't something you can learn out of a book.

I never stated this. My comment was about how long it would take to write a C-based JavaScript interpreter. Not about writing something as fast as GCC (or, more precisely, on par with a modern JavaScript JIT) from scratch -- for that, I advocated modifying a system where most of the infrastructure is already in place (and an alternative is to target a much smaller language, though I would still suggest reusing as many tools as possible, e.g., LLVM).

The purpose of this is a vehicle for undergraduate research. It need not be some commercial doodad with stringent quality requirements: the requirement is 1) being able to develop something quickly to enable subsequent experimentation and 2) for the student to learn about the process (research, compilers, etc.). Before Arjun did his work, I would not have advocated a large language like JavaScript (and I still don't see full merit to it, as opposed to modifying Tracemonkey or doing a Scheme DSL), but now that he has, targeting JavaScript shouldn't be as big of a deal.

I think we're talking past each other, in part because it seems that you keep shifting the goal (of course a month project will have drawbacks) while I have a somewhat fixed interpretation of the requirements of such an exercise (that the 1 month project just bootstraps the next steps so the first month must be planned in terms of the necessary bootstrapping).

By lmeyerov at Fri, 2010-04-16 22:30 | login or register to post comments

No Shifting

We just have different opinions on what should be done, and how much time that takes.

By marco at Sat, 2010-04-17 11:19 | login or register to post comments

Totally doable, go for it!

I wrote the core of Flapjax (runtime+bindings+basic opts) in maybe a month or two and someone else did the original compiler into in a few weeks. I'm working with someone on a system for generating parallel browser code atm and most of our time is on algorithm research. The actual code side of things (frontend, code emission, etc.) should be rapid on a PL research project or, assuming a modicum of coding ability, you're not actually spending the time on research.

As for a topic.. I suggest working with someone who knows about these things and picking a domain that you either know about or want to know about. There's a lot out there that can use basic linguistic support!

To throw one idea, getting a serverside+tab+window+offline+security data sharing abstraction (and basic optimization, e.g., caching) for JavaScript might be a fun one. It's an obvious problem with a lot of hackish non-principled solutions, begging for PL support now that the problem domain is more clear. However.. again, I'd go by problem domain you're interested in (as it'll motivate good research) rather than the idea of a stranger.

By lmeyerov at Fri, 2010-04-16 02:25 | login or register to post comments

At least use powerful tools

See: Tock- A Compiler fOr Concurrent Languages, written in Haskell. (Don't expect using this tool with help you learn Haskell, rather, implementing a language is the best way to know the implemented language, not the implementing language -- this "discovery" shocked me when I designed my first language.)

If you want a good undergrad text on programming language design, Turbak and Gifford's Design Concepts in Programming Languages [1] [2]. You really can't go wrong with this book holding your hand.

As for the amount of time, it varies by experience and the quality of tools as well as knowledge of the tools. For example, ONE of the authors of VMkit was able to create their own VM in ONE month.

If you want a cool topic, go with applying PLT to render farms. The stuff they seem to be doing in distributed networkable graphics applications these days seems hackish - there are all kinds of problems I see described that can be solved simply by avoiding them. If I were to go back to school, this is where I would try to make an impact, anyway.

By Z-Bo at Wed, 2010-04-14 20:57 | login or register to post comments

I agree, but...

But still, those are two research associates writing 20kLoC in Haskell and associated tools. So, with an average 2kLoC per month per person, and the fact they can bump ideas at each other, ... 6-12 months?

[Oops, they were undergraduates. Guess some of them did well ;) ]

[Just saw his talk. Yeah, looks smart, didn't like his solution. I'll give him 4-6 months ;) ]

By marco at Wed, 2010-04-14 22:10 | login or register to post comments

On the matter of Tock...

(I'm the author of that linked blog post about Tock)

We were both PhD students at the time of writing Tock (and have both since become RAs, as it happens). I probably worked full-time on Tock for effectively a year, and Adam for perhaps similar, so two years to produce a compiler for an existing fully-featured language (and a partial implementation of a new language) with two backends, lots of safety checks (and a couple of new generic programming techniques). About a third of that 20k is unit tests, BTW. 2kloc per month per person sounds an over-estimate in Haskell -- Haskell is a very dense language!

Having done all this, here's my advice for the original poster: designing and implementing a new language in a year will be interesting (and implementing it definitely helps with the design!), but is unlikely to produce much novel, since you'll be spending most of your time on tools. Writing a compiler is great fun, and if that's what you want to do then find an existing small language and write a compiler for it. If you're interested in niche language design, an EDSL might be interesting (Haskell is well-suited for this kind of thing).

I actually set out on my thesis designing a new programming language, but once I found Haskell I kept wanting to add more Haskell features into my own language, until I ended up just adding the features I wanted into Haskell instead (as a library/EDSL).

By neilbrown at Thu, 2010-04-15 10:02 | login or register to post comments

Cool, thanks for that answer.

Yeah, the 2kLoC is just an estimate of what you can do. I know I write about 2kLoC a month, if I know what I am doing. But also, I started writing a compiler totally from scratch rethinking every step in it, as a hobby project/learning experience, and ended up at writing 11kLoC in five years. (With half year pauses in between.)

Since you're generating code, and a lot of code is supposedly very similar, and the language is untyped, known, maybe easily implemented, I wasn't too sure about the numbers. I had a good bachelor student write a (half-baked) C dialect compiler in half a year (in a group of four other students), so, you never can tell I guess.

As a side note, I wish more people would share these statistics.

[ The totally from scratch was after studying several language implementations, and wondering what the/my right way of implementing would look like. ]

By marco at Fri, 2010-04-16 03:49 | login or register to post comments

On Turbak and Gifford

Turbak and Gifford is probably a bit dense for most undergrads. Something like Scott's Programming Language Pragmatics makes for a higher level, if less rigorous, coverage of the same concepts.

By Joel Redman at Thu, 2010-04-15 21:51 | login or register to post comments

+1 for Turbak and Gifford

Speaking as an entirely self-taught programmer, I have to disagree that Design Concepts in Programming Languages is too dense for undergrads - at least in the conceptual sense, since physically it's a beast ;)

I think as long as the reader has had some minimal exposure to a lisp or ML family language, the appendix is all the preparation necessary. It's got to be one of the clearest and most well-written PL books I've read. Take that opinion with a grain of salt, since I'm only ~1/3 of the way through it cover-to-cover, but have skimmed other chapters.

By cody koeninger at Fri, 2010-04-16 22:23 | login or register to post comments

More interesting things to do than implement JavaScript

Personally, I recommend implementing some new'ish language with maybe just ONE or TWO interesting new features on top of some existing Scheme or Common Lisp - preferably, one that compiles, I suppose, such as SBCL or Larceny or CCL or whatever.

First, for an undergraduate project, wrestling with syntax options and associated abstract syntax is largely a waste of time - it's mostly boring grunt work. A CL or Scheme with their macro facilities will provide happy syntax constraints and an abstract syntax suitable for most of the commonplace aspects of the research language. You can mostly focus on your unique contributions to the "new" language.

Second, for an undergraduate project, reinventing the run-time system - calling conventions, GC, primops, likely an FFI needed to implement the few unique language features - is a grand waste of time. A good CL or Scheme will give you this run-time-system ready made, and provide metalinguistic abstraction facilities to implement your language and it's unique features/semantics/etc. on top of the underlying CL or Scheme language and run time.

This is in the spirit in which Erlang was originally implemented on top of a Prolog implementation - certainly a significant contribution to recent PL design and likely PLT as well in most folks' opinions.

Just implementing a run-of-the-mill language is mostly boring and mostly just educates the implementer about relatively boring run time system design issues and trade offs - UNLESS, of course, some aspect of run time system design is the FOCUS of your purported innovative undergraduate research.

In any case, have fun!

Just my 2 cents - S.

By scottmcl at Sat, 2010-04-17 14:10 | login or register to post comments

Lambda the Ultimate

User login

Navigation