LtU Forum

Programming languages with full-Unicode syntax and identifiers are surprisingly hard to do well.

I am working on a programming language. I want to fully integrate Unicode - preferably the NFC/NFD repertoire, with "canonical" decompositions only.

At the same time I don't want the bidi algorithm to be used to display code in a deceptive order on the page. But in order to achieve this, I have to require LTR control characters in the program text after every RTL character where the following character is bidi "neutral" or "weak." Is that a mode I can set in any programming editor in wide use, or do I have to implement my own editor? Adding the LTR controls in a separate step (with a sed script or something) means there's an extra step before I can see, while editing, the same version of the code the compiler will be seeing.
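
As a minimal sketch of that insertion rule - assuming, as a deliberate simplification, that the Hebrew and Arabic blocks stand in for "RTL" and that ASCII digits, punctuation, and whitespace stand in for the "weak" and "neutral" bidi classes - such a pass might look like this in Rust. A real tool would consult the Unicode Bidi_Class property rather than these range checks:

    const LRM: char = '\u{200E}'; // LEFT-TO-RIGHT MARK

    // Deliberate simplification: treat the Hebrew and Arabic blocks as "RTL".
    fn is_rtl(c: char) -> bool {
        ('\u{0590}'..='\u{05FF}').contains(&c) || ('\u{0600}'..='\u{06FF}').contains(&c)
    }

    // Rough stand-in for the bidi "weak"/"neutral" classes.
    fn is_weak_or_neutral(c: char) -> bool {
        c.is_ascii_digit() || c.is_ascii_punctuation() || c.is_whitespace()
    }

    // Insert an LRM after every RTL character whose successor is weak or
    // neutral, so the following run is not pulled into the RTL segment.
    fn insert_ltr_marks(src: &str) -> String {
        let chars: Vec<char> = src.chars().collect();
        let mut out = String::with_capacity(src.len());
        for (i, &c) in chars.iter().enumerate() {
            out.push(c);
            if is_rtl(c) {
                if let Some(&next) = chars.get(i + 1) {
                    if is_weak_or_neutral(next) {
                        out.push(LRM);
                    }
                }
            }
        }
        out
    }

    fn main() {
        // A line containing one Hebrew letter (alef) followed by neutrals.
        let line = "x = \u{05D0} + 1;";
        let fixed = insert_ltr_marks(line);
        assert!(fixed.contains(LRM));
        println!("{}", fixed.escape_unicode());
    }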

At the same time I don't want "lookalike" characters used to display deceptive identifiers. Nobody can tell by looking whether 'a' is Latin or Cyrillic, or whether 'A' is Latin or Greek, and I don't want programmers tearing their hair out trying to understand why the variable they thought they just initialized is holding some value set by malevolent code somewhere out of sight, or why a perfectly innocent "Jane_Doe" keeps getting blamed for the fraudulent transactions of someone else whose name appears to be spelled exactly the same. The most straightforward precaution here is to ban identifiers that contain alphabetic characters from more than one script, but it seems a lot like using a sledgehammer to kill cockroaches. A less restrictive rule would allow mixing scripts but not if you use any letters which are confusable between those scripts - for example you could mix Latin and Cyrillic if you do it without using any character that looks like "a" (or other characters that could be either) or you could mix Latin and Greek if you do it without using any character that looks like "A" (or "B", or "X", or other characters that could be either). But this makes the identifier syntax rules complicated to check and hard to easily express.

These are just two of the *MANY* issues that need to be addressed to allow a fully Unicode-enabled programming language that's not a security or usability disaster.

I used to hate Unicode a lot more than I still do. These days I recognize it as a massive hairball, but I'm not really angry about it any more; it's just one more piece of legacy design that clearly was NOT intended for the kind of use I want to make of it. So it's massively difficult to use, leaky, flabby, illogical, promotes insecurity worse than a language without array bounds checking, etc., but I guess I've just come to accept it, and I'm finally getting around to trying to overcome the problems and do something worthwhile with it anyway.

Egel v0.1.8 (beta) released - do syntax

Another Egel release. New in this patch: do syntax for applications, and some UTF bugs squashed.

Egel is a small functional scripting language. What sets it apart is that at any time during execution the program, or any program fragment, forms a directed acyclic graph. That makes concurrency, serialization, and shipping of values more trivial than in most other languages, and I want to exploit that at some point to make mobile code a breeze to implement. Shoot any combinator anywhere!

Meanwhile, the Egel language supports only the bare minimum as a front end to a graph rewriting semantics. At the same time, I sometimes implement a little syntactic sugar to make programming more pleasant.

The latest addition is do-syntax. Like in many functional languages, you can set up application chains. For example, 0 |> ((+) 1) |> ((*) 2) will reduce to 2.

I found a slight need to facilitate this kind of programming, so I added do: a do expression abbreviates a chain. For example, def f = do ((+) 1) |> ((*) 2) is sugar for def f = [ X -> 2 * (X + 1) ]. I have found this very convenient while programming solutions to last year's Advent of Code.
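
For readers who don't know Egel, here is a rough analogue in Rust (not Egel; the pipe helper below is only an illustration): |> is just reversed application, and the do form builds the composed function up front.

    // A rough analogue in Rust (not Egel): |> is just reversed application,
    // and the do form builds the composed function up front.
    fn pipe<A, B>(x: A, f: impl Fn(A) -> B) -> B {
        f(x)
    }

    fn main() {
        // 0 |> ((+) 1) |> ((*) 2) reduces to 2
        let chained = pipe(pipe(0, |x| x + 1), |x| x * 2);

        // def f = do ((+) 1) |> ((*) 2) is sugar for [ X -> 2 * (X + 1) ]
        let f = |x: i32| (x + 1) * 2;

        assert_eq!(chained, 2);
        assert_eq!(f(0), 2);
    }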

Of course, this isn't Haskell's monadic do syntax, but as we all know, applicatives are as good as monads. Or are they?

Ann: The Logic Lab

The Logic Lab is a high-level DSL for specifying logics. There is a series of YouTube mini-lectures on this technology, detailed on the Logic Lab home page: https://shenlanguage.org/Logiclab.html. You can also access these talks at the following addresses.

History of the Logic Lab https://www.youtube.com/watch?v=VqnVWLFiEII
Specifying Logics https://www.youtube.com/watch?v=AU8WCkS9n1U
Conducting Proofs https://www.youtube.com/watch?v=1WXMMk2xUSM&t=1s

The Logic Lab page also includes a download link for the program. This application needs the Shen standard library. There is a link on the Logic Lab home page to a 'batteries included' binary version of Shen under Windows (SBCL platform) with the standard library compiled inside it.

As my first talk makes clear, this is a reissue of an old technology, published by me in the 90s, which has resided in the CMU repository for Artificial Intelligence for nearly 30 years and has been reissued under Shen.

Mark

LinearLisp: a proof of concept in full linearization for AST-walking interpreters

Or, how to walk an AST with (almost) no references on the heap and how it pays off:

https://github.com/ysharplanguage/LinearLisp

Implementation: C# / .NET 6.

Comes with basic microbenchmarks for some contenders: Microsoft's PowerShell, Microsoft's JScript, and Python 3.10+.

Rewrite.js – a minimalist s-expr based term rewriting system

Rewrite.js is believed to be a Turing-complete, s-expression based term rewriting system. It may be used as a curiosity computing platform, formula application system, proof checker, problem solver, and pretty much anywhere any kind of computation is required, as long as its slower performance on intensive computations doesn't exceed the limits of user patience.

Rewrite.js is designed around only one kind of rule: substitution rules. Being such a minimalist creation, the complete rewrite.js implementation fits in a bit more than 400 lines of JavaScript.

To get a feeling for how rewrite.js looks, here is math-expression rewriting code in rewrite.js:

    (
        (
            REWRITE
            (
                MATCH
                (VAR <a>)
                (RULE (READ  <a> + <a>) (WRITE 2 * <a>))
            )
            (
                MATCH
                (VAR <a>)
                (RULE (READ  <a> * <a>) (WRITE <a> ^ 2))
            )
        )

        (x + x) * (x + x)
    )

The above example evaluates to:

    ((2 * x) ^ 2)
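
To make the mechanism concrete, here is a minimal sketch - in Rust, and emphatically not the rewrite.js implementation - of the one operation such a system needs: matching a pattern with variables like <a> against a term and substituting the bindings into a template.

    use std::collections::HashMap;

    // Terms are s-expressions; pattern variables are atoms written like "<a>".
    #[derive(Clone, Debug, PartialEq)]
    enum Term {
        Atom(String),
        List(Vec<Term>),
    }

    fn atom(s: &str) -> Term {
        Term::Atom(s.to_string())
    }

    fn is_var(s: &str) -> bool {
        s.starts_with('<') && s.ends_with('>')
    }

    // Try to match `pattern` against `term`, accumulating variable bindings.
    fn matches(pattern: &Term, term: &Term, binds: &mut HashMap<String, Term>) -> bool {
        match (pattern, term) {
            (Term::Atom(p), _) if is_var(p) => {
                if let Some(bound) = binds.get(p).cloned() {
                    bound == *term // variable already bound: occurrences must agree
                } else {
                    binds.insert(p.clone(), term.clone());
                    true
                }
            }
            (Term::Atom(p), Term::Atom(t)) => p == t,
            (Term::List(ps), Term::List(ts)) if ps.len() == ts.len() => {
                ps.iter().zip(ts).all(|(p, t)| matches(p, t, binds))
            }
            _ => false,
        }
    }

    // Replace bound variables in `template` with whatever they matched.
    fn substitute(template: &Term, binds: &HashMap<String, Term>) -> Term {
        match template {
            Term::Atom(a) => binds.get(a).cloned().unwrap_or_else(|| template.clone()),
            Term::List(items) => Term::List(items.iter().map(|t| substitute(t, binds)).collect()),
        }
    }

    fn main() {
        // Rule: (READ <a> + <a>) -> (WRITE 2 * <a>), applied to the term (x + x).
        let read = Term::List(vec![atom("<a>"), atom("+"), atom("<a>")]);
        let write = Term::List(vec![atom("2"), atom("*"), atom("<a>")]);
        let input = Term::List(vec![atom("x"), atom("+"), atom("x")]);

        let mut binds = HashMap::new();
        if matches(&read, &input, &mut binds) {
            let result = substitute(&write, &binds);
            println!("{:?}", result); // List([Atom("2"), Atom("*"), Atom("x")])
        }
    }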

Rewrite.js is hosted at https://github.com/contrast-zone/rewrite.js with a convenient online playground.

Aside from criticism, I'm particularly interested in ideas for possible uses of rewrite.js.

JIT: it's complicated

What Exactly Do We Mean by JIT Warmup? From 2016, but still interesting, I think.

Weird Results. Many benchmarks don't warm up under the classic model. New goal: try to understand why we see "weird" results.

Pico Scheme (A purely functional subset of scheme)

The Pico Scheme specification just came out. It is a purely functional subset of Scheme (R7RS-small) that I edited.

Pico Scheme 1.0 release:

https://github.com/jrincayc/r7rs-pico-spec

Deallocation patterns and linear types (e.g. Rust)

I've been [much] belatedly poking at Rust - particularly the move/borrow model - and I was then thinking about implications for allocation and deallocation performance. Though I haven't looked seriously, a couple of things occurred to me that I haven't seen people talking about. Nothing in any way profound here, but perhaps worth considering for those of you designing or implementing new languages, and it appears to me that this applies to many languages using linear types. Rust's reference-counted pointers fall outside of what I'm going to describe.

I was pondering lifespans in Rust, thinking that there are probably cases where deallocations in Rust occur later (for naive lifespan reasons) than we might like - at least according to the simple description of lifespans in the specification, which is conservative. Also that the rules for object deallocation have the consequence that the cost of function return becomes hard to understand because it is difficult for a programmer to have good intuitions about deallocation overheads associated with the return. This got me to pondering deallocation and heap fragmentation to try to understand whether that's an actual problem, and then to wonder how many of Rust's allocations can be stack allocated and deallocated in the style of alloca(). So far as I can tell, the answer is most of them. I may have this wrong, but here's the intuition:

  1. In a given function, we can usually determine statically where an allocated item (more precisely: the associated region) becomes unreachable. This, rather than reverse lexical order, is the point where the region created by the allocation actually dies. Where we can determine this point (and move semantics means we mostly can), we can sort allocations in "reverse region exit" order (last-out-first). This may not be the same as the allocation order, which is the point of interest here (see the sketch after this list).
  2. For objects that have statically known size, this means that we can craft an arrangement of space on the local stack at compile time, such that the statically sized portion of the next-to-die region is always the last (most recent) chunk of storage on the stack.
  3. For regions whose exit order we cannot statically determine (do these exist in Rust?), we place them conservatively as if they will exit last. [This can be improved, but it is correct.] These objects can still have their destructors run in the expected order according to the Rust specification; what's being deferred (so far) is the release of the space.
  4. Some care is needed when a function returns a result in the same region as one (or more) of its arguments. We can (and must) assume that region is still live, but we cannot necessarily tell whether the content of the region has grown - an effect system can mitigate this, and there are other reasons to want an "allocation may happen when you call this" effect.
  5. Where all else fails, we resort to heap allocation, but the reference can still be placed according to order of region death.
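
As promised above, here is a minimal sketch of the observation in point 1 - ordinary Rust rather than anything a compiler emits: because the point where a value is moved away is statically known, the region allocated first can be seen to die before the region allocated second.

    // A minimal sketch (ordinary Rust, not compiler output): the move of
    // `big` is statically visible, so the region allocated first can be
    // seen to die before the region allocated second.
    fn consume(v: Box<[u8; 1024]>) -> u8 {
        v[0]
    }

    fn main() {
        let big = Box::new([0u8; 1024]); // allocated first
        let small = Box::new([7u8; 64]); // allocated second

        // `big` is moved here; its region is dead from this point on, so its
        // space could be reclaimed before `small`'s -- the reverse of the
        // allocation order.
        let n = consume(big);

        // `small` stays live until the end of the function.
        println!("{} {}", n, small[0]);
    }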

If I have this right, region deallocation for statically sized objects is reduced to running destructors (in Rust: drop functions), handling any heap deallocations required to retire references, and then adjusting the stack pointer.

There are practical concerns - many operating systems rely on naive stack-overflow tests that would need to be worked around - but that's neither new nor hard.

The handling of "uncertain death" regions can be improved. Even if we don't know the exact order of region death, we can generally determine some place in the region exit ordering where we know conservatively that a region has become unreachable.

It seems to me that we end up with three kinds of allocation areas:

  1. The stack, where deallocations are simply a stack pointer adjustment.
  2. The "linear object heap", where reachability is still governed in linear type fashion, but the dynamic nature of the allocations may cause objects in different regions to be interleaved. That is: the organization of the linear object heap does not obey a stack discipline.
  3. The "general object heap", which can be reference counted or garbage collected (or both - to eliminate dangling cycles.

The absence of reference counting (or analogous general object heap allocation mechanisms) in a program tells us that we do not need to bind the collector into the binary.

For anybody who has thought carefully about implementing linear types, I suspect all of this is pretty obvious. I hadn't gotten quite that far in my thinking about heap management in BitC, though I had certainly been thinking about compiler-driven stack allocation.
