LtU Forum

Trojan Source: Unicode Bidi Algorithm abuses in source code

A recent chunk of research caught my eye.

The skinny is that the unicode bidi algorithm can be used to do all kinds of mischief against the correspondence between the visual display of code and the semantic effect of it. To the extent that the manipulation of code semantics is almost but not quite arbitrary. It's exceedingly easy to use this technique to hide trojan-horse type vulnerabiity in almost any language that accepts unicode source.

My own analysis of this is that this comes about because the 'tree' or 'stack' of bidi contexts is not congruent to the 'tree' or 'stack' of semantic source code contexts.

In short no bidi context can extend beyond the bounds of the most specific syntactic context in which it was created. And that includes 'trivial' subtrees like comments and string constants. As the compiler exits a clause, it must report an error if the bidi state that existed when it entered that clause has not been restored.

And even within the subclauses, keywords and structure must begin and end with the same bidi context. For example if the language has an 'if' subclause that's

'if ('+predicate +') then' + consequent + 'else' + alternate 

Then the predicate, consequent, and alternate clauses can push/pop bidi states within themselves, but keywords and parens must all begin and end in the exact same bidi state. And that bidi state - the one that existed before the parser started reading the subclause - must be restored when the parser leaves the subclause.

Right now most unicode-accepting programming systems are treating bidi state as irrelevant and bidi override control characters as whitespace. This means that code that looks exactly the same in an editor can be read in different sequence and have vastly different semantics.

Forcing bidi level to remain congruent to syntax level through the program means that a program that displays in a particular way either has a single valid semantics according to its sequence order, or the parts you see are in some non-obvious sequence order and it is therefore a syntax error.

The Book of Shen 4th edition is now available online.

The Book of Shen 4th edition (TBoS) is now available online. This is part of an open science
initiative which will place over 1000 pages of computer science online. I'd like to thank
our sponsors whose support has made this possible.

TBoS is based around the S series of kernels, of which the latest S31 went open source.

Relevant Links

Home page
S Series Kernel
Open Science
TBoS Online
TBoS Amazon


Egel 0.1 (beta)

Egel is an untyped eager combinator scripting language. I recently decided to go beta, the thing is now ready for shipping; the language seems stable but libraries are either undeveloped or subject to change. I am now actively looking for users and feedback.

It should compile on most unixes and macos.

Cheers, Marco

We've got big balls ... of mud ...

The most popular and enduring architecture for all large projects is the one called "Big Ball Of Mud."

It's what happens when requirements grow in unexpected directions that cut directly across what were originally conceived as architectural boundaries, and met with adhoc code that connects parts of the program that the original architecture would never have allowed to touch each other. Then the whole thing is maintained for ten years, in the face of steadily expanding and frequently changing use cases and problems and by a host of people many of whom don't really know what's going on in the rest of the program, then gets bought by a competitor who starts reimplementing the whole thing in a different language and with a "proper architecture" but runs out of budget for that so they only manage to reimplement about a third of it, then link the stuff in the new language together with the stuff in the original language (and libraries in a third, not that anyone's looking at libraries at this point) and hack together the logical mismatches between "old" and "new" code by running perl scripts on the database. .....

And it just goes on like that - year after year after year.

Now imagine it's 2037 or whatever and you have a system where this process has taken the architecture to its logical conclusion. It's not just a big ball of mud any more, it's a planet of mud. There is no solid ground anywhere in any direction. Nothing like any kind of rational architecture (except Big Ball Of Mud architecture) has emerged.

What can a programmer who is responsible for some small part of this giant mudball - say, one file of code - do? In the 'thinking globally and acting locally' sense, because obviously one programmer on a whole planet of mud is in no position to impose new architecture on the whole and doesn't even have the information needed to decide a new architecture for the whole.

What modest steps can one programmer take to smooth out, and perhaps dry a little, their particular little bit of mud? Ideally, they should be low-effort steps that don't require all that much reaching into all the other bits of mud. And ideally, if all the individual programmers all over the mud planet did them in the course of maintaining their little bits of mud, it should eventually result in the simplification of the whole insofar as possible, and the emergence of some kind of rational architecture.

It might be an architecture that no one of the programmers would have designed. It might be an architectural paradigm that humans haven't even discovered yet. It might not be particularly elegant. But after a lot of iterations by people working all over the program, what small, feasible, local actions in a big ball of mud produce a program with more tractable organization?

Are there any metrics that a monitoring system can auto-generate that could give the programmer a "score" for whether they're improving or degrading the process of creating some kind of code organization?

Bespoke: live graphical language for granular synthesis

This post is a slight provocation in that the inventor of Bespoke ("A modular DAW...") does not describe or promote his work as a programming language.

In Bespoke, programmer/composer/performers construct live dataflow graphs. Edges are each one of just a few types: MIDI signals (an interchange standard for real-time control of electronic instruments), digitized audio signals, and control signals.

In contrast to edge types, nodes in the graph vary without bound, each transforming inputs to outputs in specialized ways. Nodes have internal state and can expand the amount of state.

Composites of nodes can be regarded as themselves nodes (conceptually and practically).

Because programmers are meant to be able develop and modify programs while they are running, the types of data flowing on edges are necessarily manifest types. Type checking (trivially matching the types of data sources and syncs) happens on the fly as code is composed. One can't draw a "wrongly typed" program in this system -- midi outputs go to midi inputs and so forth.

The separation of types such as audible, control, and MIDI signals arises solely due to the kinds of physical equipment used to receive or output such signals. E.g., if you try to play midi wire signals over a speaker, there's little if anything to hear. One can easily imagine a slight generalization of bespoke in which, at core, all packets are untyped.

Packet transmission is necessarily noisy in the sense that the language provides no guarantees about packet delivery. For example, an audio signal might arrive at a different sampling rate and width than it was transmitted with. Packets may be dropped or spurious ones inserted. As a practical matter, stronger guarantees about packet transmission can be promised by implementations.

In the bespoke we see, there is a global clock and packets are timestamped though they may be transmitted and received before their timestemp time. This aspect of packets is specific to the music-making domain though it presumably has wider applicability. One can easily imagine a bespoke core that lacks that kind of synchronization other than that transmissions temporally precede receptions.

Do names and symbols really imply semantics? If so what to do about it?

Some languages, like APL, are written in Martian script. Their proponents insist that a set of characters humans have never seen before in their lives have intrinsic, unambiguous semantic meanings which can be inferred directly from their shapes.

I don't think that is true. Not to humans, anyway. Martians probably see differently.

Some languages have strong requirements about the semantics of operators, based on the familiar semantics and relationships of those operators. For example there was a beta release of Pascal 7 (from Borland, around 1995 I think) that allowed people to overload operators, but warned that the compiler would simplify expressions according to the precedence rules the language set for them and the distributive, associative, etc. properties implied by those operators before any definition overloads were looked up. If an operation was commutative, the compiler was allowed to reorder its arguments arbitrarily. If distributive expressions like (a*b)+(a*c) would be reduced to a*(b+c) before looking up the operator overloads. If you defined any two relational operators, the compiler would automatically define the rest using the identity axioms. You were not allowed to define three or more relational operators. Etc.

This strong assumption that the semantics of the overloads must follow, in most respects, the semantics of the operators whose names they were using, really upset a lot of people who wanted to use '+' to concatenate strings or wanted to use '<=' and '=>' (which is how Pascal spelled 'equal-or-greater' as some kind of redirection operators. A lot of it (but not all of it) got changed between beta and release.

I was never really convinced that changing it was the right thing. It seemed perfectly reasonable to me that these familiar symbols should be constrained to have the familiar semantic properties that we infer when we're looking at an expression built of them. I thought people who wanted string-concatenation operator or a redirection operators should be using different names for their operators, like '$+' or '|>'. Sadly this was impossible as there was no way to define new operator symbols. This deficiency remains true in almost all languages that allow operator overloading.

In Scheme there is a strong assumption/tradition that any variable whose name ends in the character '?' will be bound to a procedure that returns a boolean value. The same idea in common lisp is associated with the trailing character 'p'.

The tradition in scheme goes on that any variable whose name ends in '!' is bound to a procedure that has a side effect. Many schemes will produce a warning if any such variable is ever bound to anything else. And some consider it an error if a side effect is produced by any procedure not so named, because the name is considered to be an important warning to the programmer that a side effect is possible.

Scheme used to have 'indeterminate digits' in numbers, written '#', so for example 123# denoted 'one thousand two hundred and thirty-something,' an inexact integer. This got scrapped once it became clear that implementors had no interest in keeping track of how many significant figures of decimal accuracy a calculation represented. They were only using 'inexact' to mean 'ieee754 floating-point hardware representation' and were openly hostile to the notion that anyone might hope for it to mean anything more. Many regarded it as a contradiction in terms that something could be both 'integer?' and 'inexact?' And I think in the current scheme standard it may in fact be a contradiction in terms. But getting rid of it meant abandoning the possibility of tracking significant figures of accuracy. So that character used to have a semantic meaning and purpose, but nobody valued that purpose.

Early basics had 'type sigils' so that the programmer (and the interpreter) could know the type of a variable just by looking at the name. $foo always referred to a string, .foo always referred to a floating-point number, #foo always referred to a binary value, etc.

People made an awful lot of fun of those type sigils in old basics. Until the same people turned around and started using Hungarian notation to keep track of the types of their variables in C++. Because keeping track of the type was HARD, and they wanted to be able to see what type something was by looking at it. So they defined their pszName and hwndInteractionWindow and their dwIDnumber and so on and didn't think about basic's type sigils at all because this, after all, was something different.

And after all these examples of naming semantics. How much semantics is it reasonable to expect to be able to infer from syntax? How much is it reasonable for the compiler to enforce, based on the name alone? And what relationship does programmers liking it have to helping them produce good code?

Monads vs Applicative Functors.

This article ( got me thinking about the structural problems with monads, namely that `bind: m a -> (a -> m b) -> m b` is opaque in its second argument. One thing that I have become increasingly concerned with over the past few years is the separation of data and code, and this article made it clear to me that monads have a problem in this respect, and at least in the case of parsers result in mixing implementation details with the application of the parser. By combining an approach focused on the initial algebra for parsers, and implementing combinators as an applicative functor (and not a monad), the "application" of the parser is reduced to a static data structure, and the `map` functions only generate output.

Using continuation parsing to encode the existential types from the GADT (see and function overloading, I was able to port the applicative parser to TypeScript, which I have made available here and as a Deno module here I also added a couple of useful helper functions based on Parsimmon's `seq` and and `seqMap` which use a little conditional-type magic to make working without operator overloading easier. The result is we can write a simple parser in TypeScript (for floats) like this:

const digit = OneOf('0123456789');
const float1 = seqMap((a, b) => [...a, b], many1(digit), OneOf('.'));
const float2 = seqMap((a, b) => [...a, ...b], float1, many1(digit));
const float3 = seqMap((a, b, c) => [...a, b, ...c], choice(float2, many1(digit)), OneOf('e'), many1(digit));
export const float: Parser<number> = FMap(a => parseFloat(a.join('')), choice(float3, float2, float1));

Monads are of course useful, and things aren't really so black and white, but it seems interesting to consider what can be gained from the separation of data and code possible if we limit ourselves to applicative functors. The benefits include being able to have a complete implementation for the parser initial algebra in 50ish lines of code, that you can build your parser application from the basic 9 combinators and applications can remain the same whilst you can swap different implementations (with backtracking, error reporting, greater efficiency etc).

It also seems interesting to consider the initial algebra for parsers in the same way we consider Codd's five primitive operators for relational algebra. The ones used in the article are Fail/Empty/OneOf/Return/Forget/Map/Product/Either/Fix, is this minimal? There are clearly alternative formulations, for example `OneOf` isn't the only possibility, but it has the advantage that it allows the set of accepted symbols to be generated for a parser application.

In the context of compiler architecture, it seems interesting that each pass of the compiler can be implemented as a recursive descent of the abstract syntax tree with case matching on node type. With this approach the parser also becomes a recursive descent with case matching over the syntax definition of the language, which seems an elegant solution.

Thinking about programming languages themselves, as we can thread state through an applicative functor, and we can assign to scoped variables (map), it seems interesting to think about what a functional language that uses applicative functors to manage state and IO would look like? Obviously existing programming languages can implement applicative functors, in the same way you can write object-oriented code in procedural languages, but they don't focus on it. Likewise there is nothing stopping someone writing monadic code, but you would not want to encourage it. What would a language designed to encourage the programmer to think in an applicative, initial-algebraic way look like?

CFP: PLOS '21: 11th Workshop on Programming Languages and Operating Systems

Those of you who apply advanced PL ideas to operating systems might be interested in the upcoming PLOS 2021 workshop. See the abbreviated CFP below, and please consider submitting a short paper!

Thanks! ---



11th Workshop on Programming Languages and Operating Systems (PLOS 2021)

October 25, 2021

An Online Workshop
Sponsored by ACM SIGOPS
In conjunction with SOSP 2021

Paper submission deadline: August 6, 2021

Historically, operating system development and programming language development went hand-in-hand. Today, although the systems community at large retains an iron grip on C, many people continue to explore novel approaches to OS construction based on new programming language ideas. This workshop will bring together researchers and developers from the programming language and operating system domains to discuss recent work at the intersection of these fields. It will be a platform for discussing new visions, challenges, experiences, problems, and solutions arising from the application of advanced programming and software engineering concepts to operating systems construction, and vice versa.

Please visit the Web site for more info:

What is a type?

There's nominal typing.

There's structural typing.

But what if an object's type is simply the aggregate of its behavior?

I'm curious if there are other type systems that work in this same manner. Comments and criticism welcome.

Counterexamples in Type Systems

The "counterexamples" here are programs that go wrong in ways that should be impossible: corrupt memory in Rust, produce a ClassCastException in cast-free Java, segfault in Haskell, and so on. This book is a collection of such counterexamples, each with some explanation of what went wrong and references to the languages or systems in which the problem occurred.

XML feed