Trojan Source: Unicode Bidi Algorithm abuses in source code

A recent chunk of research caught my eye.

The skinny is that the unicode bidi algorithm can be used to do all kinds of mischief against the correspondence between the visual display of code and the semantic effect of it. To the extent that the manipulation of code semantics is almost but not quite arbitrary. It's exceedingly easy to use this technique to hide trojan-horse type vulnerabiity in almost any language that accepts unicode source.

My own analysis of this is that this comes about because the 'tree' or 'stack' of bidi contexts is not congruent to the 'tree' or 'stack' of semantic source code contexts.

In short no bidi context can extend beyond the bounds of the most specific syntactic context in which it was created. And that includes 'trivial' subtrees like comments and string constants. As the compiler exits a clause, it must report an error if the bidi state that existed when it entered that clause has not been restored.

And even within the subclauses, keywords and structure must begin and end with the same bidi context. For example if the language has an 'if' subclause that's

'if ('+predicate +') then' + consequent + 'else' + alternate 

Then the predicate, consequent, and alternate clauses can push/pop bidi states within themselves, but keywords and parens must all begin and end in the exact same bidi state. And that bidi state - the one that existed before the parser started reading the subclause - must be restored when the parser leaves the subclause.

Right now most unicode-accepting programming systems are treating bidi state as irrelevant and bidi override control characters as whitespace. This means that code that looks exactly the same in an editor can be read in different sequence and have vastly different semantics.

Forcing bidi level to remain congruent to syntax level through the program means that a program that displays in a particular way either has a single valid semantics according to its sequence order, or the parts you see are in some non-obvious sequence order and it is therefore a syntax error.