LtU Forum

Basic building blocks of a programming language

I've tried to understand category theory, but while I can't say I've got the mathematical skills to understand it, I think I kind of grasp what it intends to do. It is basically a way of building bigger things out of smaller things as flexibly as possible.

Designing my language I get inspired by this thought, and started to think along those lines. This means trying to realize what are the basic building blocks from which to build larger blocks, i.e. from where do I start?

In an attempt to figure this out, I thought about what languages typically tries to do. What I found so far are:

1. Calculate data, i.e. from data produce some other data, i.e. functions
2. Organize data into hierarchy (lists, structs, maps, let's call these types just to have some name)
3. Find patterns in data

The idea here is to make these elements as versatile as possible, so they can be combined without restriction as long as it is reasonable. Allow functions to generate other functions, hierarchies and/or patterns. Allow hierarchies to contain functions, other hierarchies and/or patterns. Allow patterns to describe functions, hierarchies and/or other patterns.

First of all, do you agree that these three could be used as basic building blocks, or is there something missing or wrong?

Secondly, the pattern. I can see patterns used in two that seems like distinctly different ways. One is that you write a template like code, which you could see as a patter
n. You insert a few values and out comes a type, a function, or something. Another way of using it would be to say this is an expected pattern with some variables in it, th
en apply data to it, and if it matches the pattern you get the value of the variables the pattern contained.

Perhaps those two cases of patterns should be named differently? Any thoughts on this?

My article on state machines and DSL evolution

I've been unhappy with with state machines, activity diagrams, and BPMN as tools for modelling processes for long time. In the article linked below I've used the same principles as were used for moving from flat languages (Asm, Fortran 66, BASIC, etc) to structured programming to create a behavior model that is some kind of behavior equivalent of Martin Fowler state machine model.

https://dzone.com/articles/evolving-domain-specific-languages

(Teaser) the end result for the sample is the following:

LOOP {
    ESCAPE doorOpened {
        DO lockPanel, unlockDoor
        WAIT doorClosed
        ALL {
            WAIT lightOn
        } AND {
            WAIT drawOpened
        }
        DO lockDoor, unlockPanel
        WAIT panelClosed
    }
}

Do we need exactly two binding constructs?

I'm recently thinking about the relation between objects and lexical environment. Objects, especially under a prototype-based object system, looks suspiciously similar to a lexical environment:

slot-ref <-> environment-ref
slot-set! <-> environment-set!
add-slot! <-> environment-define
parent(s)-of/delegate(s)-of <-> environment-parent(s)

However, one problem remains in the way of unifying those 2 binding constructs completely: the slot-scoping problem. If I simply use symbols (like the case for environment) to designate slots, there's nothing to prevent two authors to come up with the same name (say 'x) and they can clash, especially in the presence of multiple delegation/inheritance. Therefore I figure I must use slot objects rather than symbols to designate slots:

(within-environment user-1
   (define x (make-slot)))
(within-environment user-2
   (define x (make-slot))
   (make-object x 1 user-1:x 2))

and... now environments bind symbols to slot objects, and objects bind slot objects to values.

This all looks fine, except that it makes me itch that I need to have two almost identical constructs, and I can't unify them into one! Are those two binding constructs exactly what we need, no more, no less?

Cicada language -- a new dependently typed language

Cicada language is a
dependently typed programming language and a
interactive theorem prover.

The aim of cicada project is to help people understand that developing software and developing mathematics are increasingly the same kind of activity, and people who practices these developments, can learn from each other, and help each other in very good ways.

Homepage at: cicada-lang.org

CUE: An open-source data validation language

There are two core aspects of CUE that make it different from the usual programming or configuration languages:

- Types are values
- Values (and thus types) are ordered into a lattice

These properties are relevant almost to everything that makes CUE what it is. They simplify the language, as many concepts that are distinct in other languages fold together. The resulting order independence simplifies reasoning about values for both humans and machines.

It also forces formal rigor on the language, such as defining exactly what it means to be optional, a default value, or null. Making sure all values fit in a value lattice leaves no wiggle room.

The application of set-theoretic types is getting more and more mainstream as they turn out to be very effective in (partially) typing dynamic languages (i.e. Typescript). Other popular examples are Scala(3) and Kotlin, together with other (less mainstream) examples such as Ceylon, Typed Racket and the Avail programming language.

I've dabbled in Avail a few years ago, and was very impressed by Avail's type driven compiler, but not so much by its (extreme) DSL support. In contrast to Avail, I believe CUE is much more pragmatic while not allowing any 'tactical' compromises.

CUE has its origins at Google: the ability to compose CUE code/specifications is really useful at the scale that Google operates. The creator of CUE - Marcel van Lohuizen - has recently quit his job at Google to become a full-time developer of CUE. I think Marcel has created something really interesting, so I understand his decision!

Shen Standard Library

The documentation can be viewed on https://shenlanguage.org/StLib/stlib.html. Download on https://shenlanguage.org/download.html.

Mark

Trojan Source: Unicode Bidi Algorithm abuses in source code

A recent chunk of research caught my eye.

The skinny is that the unicode bidi algorithm can be used to do all kinds of mischief against the correspondence between the visual display of code and the semantic effect of it. To the extent that the manipulation of code semantics is almost but not quite arbitrary. It's exceedingly easy to use this technique to hide trojan-horse type vulnerabiity in almost any language that accepts unicode source.

My own analysis of this is that this comes about because the 'tree' or 'stack' of bidi contexts is not congruent to the 'tree' or 'stack' of semantic source code contexts.

In short no bidi context can extend beyond the bounds of the most specific syntactic context in which it was created. And that includes 'trivial' subtrees like comments and string constants. As the compiler exits a clause, it must report an error if the bidi state that existed when it entered that clause has not been restored.

And even within the subclauses, keywords and structure must begin and end with the same bidi context. For example if the language has an 'if' subclause that's

'if ('+predicate +') then' + consequent + 'else' + alternate 

Then the predicate, consequent, and alternate clauses can push/pop bidi states within themselves, but keywords and parens must all begin and end in the exact same bidi state. And that bidi state - the one that existed before the parser started reading the subclause - must be restored when the parser leaves the subclause.

Right now most unicode-accepting programming systems are treating bidi state as irrelevant and bidi override control characters as whitespace. This means that code that looks exactly the same in an editor can be read in different sequence and have vastly different semantics.

Forcing bidi level to remain congruent to syntax level through the program means that a program that displays in a particular way either has a single valid semantics according to its sequence order, or the parts you see are in some non-obvious sequence order and it is therefore a syntax error.

The Book of Shen 4th edition is now available online.

The Book of Shen 4th edition (TBoS) is now available online. This is part of an open science
initiative which will place over 1000 pages of computer science online. I'd like to thank
our sponsors whose support has made this possible.

TBoS is based around the S series of kernels, of which the latest S31 went open source.

Relevant Links

Home page
S Series Kernel
Open Science
TBoS Online
TBoS Amazon

Mark

Egel 0.1 (beta)

Egel is an untyped eager combinator scripting language. I recently decided to go beta, the thing is now ready for shipping; the language seems stable but libraries are either undeveloped or subject to change. I am now actively looking for users and feedback.

It should compile on most unixes and macos.

Cheers, Marco

We've got big balls ... of mud ...

The most popular and enduring architecture for all large projects is the one called "Big Ball Of Mud."

It's what happens when requirements grow in unexpected directions that cut directly across what were originally conceived as architectural boundaries, and met with adhoc code that connects parts of the program that the original architecture would never have allowed to touch each other. Then the whole thing is maintained for ten years, in the face of steadily expanding and frequently changing use cases and problems and by a host of people many of whom don't really know what's going on in the rest of the program, then gets bought by a competitor who starts reimplementing the whole thing in a different language and with a "proper architecture" but runs out of budget for that so they only manage to reimplement about a third of it, then link the stuff in the new language together with the stuff in the original language (and libraries in a third, not that anyone's looking at libraries at this point) and hack together the logical mismatches between "old" and "new" code by running perl scripts on the database. .....

And it just goes on like that - year after year after year.

Now imagine it's 2037 or whatever and you have a system where this process has taken the architecture to its logical conclusion. It's not just a big ball of mud any more, it's a planet of mud. There is no solid ground anywhere in any direction. Nothing like any kind of rational architecture (except Big Ball Of Mud architecture) has emerged.

What can a programmer who is responsible for some small part of this giant mudball - say, one file of code - do? In the 'thinking globally and acting locally' sense, because obviously one programmer on a whole planet of mud is in no position to impose new architecture on the whole and doesn't even have the information needed to decide a new architecture for the whole.

What modest steps can one programmer take to smooth out, and perhaps dry a little, their particular little bit of mud? Ideally, they should be low-effort steps that don't require all that much reaching into all the other bits of mud. And ideally, if all the individual programmers all over the mud planet did them in the course of maintaining their little bits of mud, it should eventually result in the simplification of the whole insofar as possible, and the emergence of some kind of rational architecture.

It might be an architecture that no one of the programmers would have designed. It might be an architectural paradigm that humans haven't even discovered yet. It might not be particularly elegant. But after a lot of iterations by people working all over the program, what small, feasible, local actions in a big ball of mud produce a program with more tractable organization?

Are there any metrics that a monitoring system can auto-generate that could give the programmer a "score" for whether they're improving or degrading the process of creating some kind of code organization?

XML feed