## Lexical structure of scripting languages

hi all,
I'm a newbie here but I assure you I'm not a troll. I was reading about Pluvo and then it struck me...

so you are bored and want to write a new language. what do you do at first? plan a look and feel. well, different people come from different backgrounds and they don't like the same look and feel. but then, it basically means you write a lexer, a parser, and then finally the interpreter. too much work. and it's not reusable by any other bored programmer. it's fun doing it again, of course. :)

I was thinking, what about having a rough lexical standard for interpreted languages? the point is that with clear specification everyone can write a lexer/parser and then extend it. hopefully, the base will already be there.

of course, different languages are different...
but I think some guidelines are absolutely necessary. not for the expert programmers, they can cope with anything, but the objective is to reduce the learning time for students.
so here are some imaginary rules. each idea is independent of the others. if you like them, that would probably convince you that it is indeed useful to do something like this.

• source codes are XML files. there are four different types of text: documentation, annotations, code and comments. an IDE is a must so that full advantage of this formulation can be taken. the basic principle is that source code itself is valuable data. something like:
<source language = "imaginary" target = "2.1">
<documentation>
The classical "Hello World" program
</documentation>
<code>
<annotation visibility = "public" />
<comment>the exact syntax is not relevant here</comment>
say "Hello World"
</code>
</source>


too verbose for text editors, but not so for an IDE. parsing XML is easy. displaying different things in different colors is.. well.. depends on you. but text editors like gedit, emacs, jEdit, Notepad++, know what to do.
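As a rough illustration of the "parsing XML is easy" claim, here is a minimal Python sketch that pulls the four kinds of text out of the example above using the standard library's `xml.etree.ElementTree`. The function name `split_source` and the shape of the returned dictionary are made up for this illustration; the post does not specify any API, and the element names are simply taken from the example.

```python
import xml.etree.ElementTree as ET

# The example document from the post, with quotes tightened for clarity.
SOURCE = """\
<source language="imaginary" target="2.1">
<documentation>
The classical "Hello World" program
</documentation>
<code>
<annotation visibility="public" />
<comment>the exact syntax is not relevant here</comment>
say "Hello World"
</code>
</source>
"""

def split_source(xml_text):
    """Separate documentation, annotations, comments and code (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    code = root.find("code")
    # Program text is whatever sits directly inside <code> or trails a child element.
    pieces = [code.text or ""] + [child.tail or "" for child in code]
    return {
        "language": root.get("language"),
        "documentation": (root.findtext("documentation") or "").strip(),
        "annotations": [dict(a.attrib) for a in code.iter("annotation")],
        "comments": [c.text or "" for c in code.iter("comment")],
        "code": "".join(pieces).strip(),
    }

parts = split_source(SOURCE)
print(parts["code"])
```

Note that this only separates the four kinds of text. The code itself is still an opaque string that would need its own lexer and parser.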

• the encoding is always Unicode. this basically means you can use mathematical symbols as part of your language. advantage? well, it's been like 40 years or so of ASCII based programming, right?
• there should be an all-pervasive naming convention. either camelCase or if the language permits, hyphen-separated-words. but not both.
• long names are preferred. so no more MVar or mapM_ or __init__ and scoping rules apply. Monad.map looks pretty (over mapM). scaring off uni students is not going to do any good.
• the languages should have a clear-cut way of communicating with the outside world. the names exported should not contain any eccentricities (should be pure ([alpha][alpha|num]*)).
• for variable names (or function names), same name with different annotations should not have different meaning (I don't hate Perl, but I think @list and $list having completely different meanings is not healthy).
• there will be no way to write two different statements on the same line (like ; separated ones in C). block structure dictated by indentation.

this suggestion brings up the controversial idea of 'good taste' again. I think once someone implements a very good library for dealing with one particular standard like these, and that library is hopefully very well documented, people will accept the standard just the way it is. maybe they will demand minor tweaks here and there, but not more than that. psychology is a really really mysterious thing.

things like these... the problem with interpreted languages is that they are very concise, in my humble opinion, too concise. of course, diving any deeper will cause so much difference in opinions that I should probably stop here.

### Even I disagree with that

c'mon... there's no point whatsoever designing a new language if it does not have new semantics. but should the syntax be necessarily new every time? how much of the semantics do you think is reflected in the syntactic eccentricities like $ for scalar, @ for list, % for dictionary? my observation so far is that the closer the syntax is to mathematics the better.

the XML suggestion is only to bring an end to the raw text programming scheme. not requiring an IDE is good. but in my humble opinion requiring an IDE is even better. better productivity for sure, but I think better code manipulation definitely helps. compare gedit with Eclipse for example. or Notepad with emacs.

there might be features of the programming language itself that the standard could specify. like support for (anonymous) tuple type. don't you think having a specification for tuple representation would be nice? Python has a tuple syntax that non-Python programmers should find very confusing. especially C programmers.
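The post does not say which parts of Python's tuple syntax it finds confusing, so the following is only a guess at the usual suspects, written as a small runnable sketch. The variable names are arbitrary.

```python
# Parentheses alone do not make a tuple; the comma does.
not_a_tuple = (1)       # just the integer 1
one_tuple = (1,)        # a tuple with one element
bare_tuple = 1, 2       # also a tuple, no parentheses needed

# In C, the expression (1, 2) uses the comma operator and evaluates to 2.
# In Python the same text builds a pair.
pair = (1, 2)

# The same syntax also works as a pattern on the left-hand side.
first, second = pair

print(type(not_a_tuple).__name__, type(one_tuple).__name__, pair)
```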

### I disagree

ALL that matters is syntax. C is just syntactic sugar for assembly which is just syntactic sugar for binary which is just syntactic sugar for electrical impulses.

### This isn't true, or there

This isn't true, or there wouldn't be more than one valid way to translate at each stage.

You forgot about semantics. C++ may translate to assembly, for example, but assembly has no concept of exceptions, so obviously there's a semantics difference between the two.

### Syntactic sugar?

C doesn't include assembly as its subset, so it's not a syntactic sugar for it.

Even if it did, the translation would have to be sufficiently simple and local to be called syntactic sugar.

### but the objective is to

but the objective is to reduce the learning time for students.

I can't see why this should be an objective.
(Taken to the extreme, every student could reduce their learning time to 0 by deciding not to study in the first place.)

On the contrary, I believe that a good professor will not let his students use tools like lex(1) or libraries for regular expressions, for example, forcing them to write a lexical analyzer themselves. Of course, this will depend on what the subject of the lessons is.

### Scripting languages are not for writing compilers

by students I didn't really mean university students *blushes*

the learners of the language... scripting languages come and go. they are designed for specific purposes. why should they have so many gotchas?

the less the surprise, the better.

my point is, I don't see syntactical differences to be that important. therefore, there should be no harm partially eliminating that difference.

### my point is, I don't see

my point is, I don't see syntactical differences to be that important. therefore, there should be no harm partially eliminating that difference.

I disagree. How do you know that?
Let me give an example: in Haskell you can write the following:

swap :: (a,b) -> (b,a)
swap (a,b) = (b,a)

Note how the syntactic construct (id,id) is used 2 times as syntax for types, one time as syntax for a pattern matching a tuple of the aforementioned type and one time as syntax of an expression, that constructs a tuple.

I for my part find this very elegant. And now, let's think of a translation into something you proposed:

{annotation name="swap"}
{type}{tuple n="2"}{type}{tyvar name="a"/}{/type}{type}{tyvar name="b"/}{/type}{/tuple}{/type}
{/annotation}

(I spare us the rest. I think my intention is clear enough.)

Besides, XML or the like will not save you from new syntax whenever new concepts are introduced. For example, suppose in 1960 there were Fortran, COBOL and ALGOL, and someone like you had invented XML in order to unify the different syntaxes, and succeeded. However, as soon as there was a new concept like ML or Haskell patterns, we would have to adapt the XML syntax, too. Just because XML does have its own surface syntax does not mean we have solved all syntactic problems once and for all.

### Near-perfect fit?

S-expression based languages like Scheme and Lisp fit many of your criteria pretty well, and do indeed make good scripting languages. Have you looked at them? XML is not a very practical source syntax for programs, but Lisp/Scheme S-expressions are roughly equivalent in terms of their tree structure (allowing them to directly represent program syntax trees), and much more lightweight than XML. Unicode support, a consistent naming convention (hyphen-separated), tendency to long names, and communicating with the outside world are all well-addressed.
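To make the "roughly equivalent tree structure" point concrete, here is a small Python sketch that renders an XML tree as a parenthesized S-expression string. The function name `to_sexpr` and the way attributes are written as `(name "value")` pairs are choices made for this illustration only. It is not SXML or any other standard mapping, and the input is a small tree modelled on the tuple-type example from this thread.

```python
import xml.etree.ElementTree as ET

def to_sexpr(elem):
    """Render an XML element as an S-expression string (illustrative mapping only)."""
    parts = [elem.tag]
    # Attributes become (name "value") pairs right after the tag.
    for name, value in elem.attrib.items():
        parts.append('(%s "%s")' % (name, value))
    # Non-blank text directly inside the element becomes a string atom.
    if elem.text and elem.text.strip():
        parts.append('"%s"' % elem.text.strip())
    for child in elem:
        parts.append(to_sexpr(child))
        # Text trailing a child still belongs to the parent.
        if child.tail and child.tail.strip():
            parts.append('"%s"' % child.tail.strip())
    return "(" + " ".join(parts) + ")"

xml_text = '<type><tuple n="2"><tyvar name="a"/><tyvar name="b"/></tuple></type>'
sexpr = to_sexpr(ET.fromstring(xml_text))
print(sexpr)
```

Both forms describe the same tree. The S-expression drops the closing tag names, which is where most of the saving in typing comes from.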

### I refuse to conform to prog. lang. design guidelines I dislike

XML is a horrible choice for a programming language syntax: way too verbose and redundant. All programming languages I've seen which claimed to use XML actually used an inconsistent mixture of XML and traditional syntaxes, and thus combined disadvantages of both. Their abstract syntax trees can't be usefully processed by generic XML tools, and the XML parts are awful to type by hand.

I personally dislike S-expressions too. They are too verbose, only a little better than XML, and reading code without interactive highlighting of matching parentheses is a pain.

I hate it when a language forces me to put certain things on separate lines (or in separate files). It should be the responsibility of programmers to break up code into lines in sensible places. Long sequences of short elements look better when several elements are put on a single line.

The amount of abbreviations in typical names in Haskell or Python is fine for me. Java and Common Lisp are too verbose. K is too compact.

### Look before you leap.

so you are bored and want to write a new language. what do you do at first? plan a look and feel.

No.

### there have been posts here

there have been posts here in the past about using xml as an intermediate representation. quite recently, one mentioning a compiler that was implemented as a series of transformations on xml documents (iirc). try searching back for xml, ast and the like if you're interested.

### Debatable

Everything on your wishlist is debatable. I get the feeling that most language designers are perfectionists who wouldn't be willing to compromise the grammar of their language just because there's a generic parser available.