Why do we need modules at all?

Post by Joe Armstrong of Erlang fame. Leader:

Why do we need modules at all? This is a brain-dump-stream-of-consciousness-thing. I've been thinking about this for a while. I'm proposing a slightly different way of programming here. The basic idea is:

  • do away with modules
  • all functions have unique distinct names
  • all functions have (lots of) meta data
  • all functions go into a global (searchable) Key-value database
  • we need letrec
  • contribution to open source can be as simple as contributing a single function
  • there are no "open source projects" - only "the open source Key-Value database of all functions"
  • Content is peer reviewed

Why does Erlang have modules? There's a good and bad side to modules. Good: Provides a unit of compilation, a unit of code distribution, a unit of code replacement. Bad: It's very difficult to decide which module to put an individual function in. Breaks encapsulation (see later).

conflation

There are two conflated ideas there (judging by the summary on LtU).

One idea is to use individual functions as the highest unit of modularity, rather than have a distinct "module" (aggregate of functions) concept.

The other idea is that the highest unit of modularity should have globally unique names and meta-data suitable to allow the construction and use of a global, decentralized, repository of "all code".

Each is an interesting idea in its own right.

I believe this is because

I believe this is because modules are often conflated with namespaces :). But ya, I'm all for global namespaces.

Functions are used in groups.

Most functions are methods on some data structure and don't make sense without the other methods on that structure.

For example, the minimum necessary to implement a stack are the two functions "pop" and "push". So it sort of doesn't make much sense for those functions to get separated; they are joined by the idea of the stack.

I've envisioned something like a code assistant, but when I thought of it it was in terms of autocompleting sets: The code assistant would look at what you had just typed in the IDE and say something like

That matches the enqueue method from the foozel library; would you like to import the matching dequeue, queue-constructor and queue-destructor methods?

Although obviously as you get into the "long tail" of data structures you'd be looking at method sets on objects a lot more esoteric or specialized than simple stacks and queues.

"functions" vs "functionality"

You have conflated functions and functionality. A single function could cleanly provide both "push" and "pop". This is supposedly one of the big strengths of functional abstraction.
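To make that concrete, here is a minimal sketch of one function that provides both operations, written in Python purely for illustration. The names `make_stack` and the `"push"`/`"pop"` message strings are my own assumptions, not anything from the original post.

```python
def make_stack():
    """Return a single function that provides both push and pop."""
    items = []  # private state, reachable only through the closure below

    def stack(op, value=None):
        # One function, two pieces of functionality, selected by a message.
        if op == "push":
            items.append(value)
            return None
        if op == "pop":
            if not items:
                raise IndexError("pop from empty stack")
            return items.pop()
        raise ValueError(f"unknown operation: {op!r}")

    return stack


s = make_stack()
s("push", 1)
s("push", 2)
top = s("pop")  # the most recently pushed value
```

Whether this counts as "one function" or as a module in disguise is, I suppose, exactly the point under dispute.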

'Strewth, one function

'Strewth, one function certainly could provide both push and pop. But I don't think you're going to get to the one function that provides both kinds of functionality (or one function that provides all the methods on some object class) if what you're doing is indexing code that people have already written.

Because the normal coder, whose work is what you're talking about collecting and indexing, also conflates functions and functionality; it's the standard model with which people are working, and they won't understand or correctly use the tool if it uses a different model.

Ein Reich, ein Volk, ...

The reason we don't want a single database of all functions (not even a distributed one) is the issue of control. Projects exist because project leaders need to have, and have, authority about what goes in and what does not. It's the same reason we have a federated Web rather than a single highly replicated Xanadu database.

The database of functions

The database of functions could be federated in a content-addressed way. Identify a function by a hash of its name + creator, plus a hash of the contents to identify which version we want. A module/project specification is a set of such function ids. Individual servers can choose to cache whatever subset of the function space that they want. Whenever you install a project you locally cache all the required functions.

This has some nice features. Functions can belong to more than one module/project. Storage is federated but there is a single global namespace. Including the creator in the function hash makes collisions less likely. Given a project name, you can download all the code from any cache and know that it hasn't been maliciously modified.
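A rough sketch of that scheme, assuming SHA-256 and a cache modelled as a plain dictionary. The helper names (`function_id`, `fetch_project`) and the sample function are hypothetical; a real system would need a lot more than this.

```python
import hashlib


def sha256_hex(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def function_id(name, creator, source):
    """Return (identity hash, version hash) for one function."""
    # The "\x00" separator keeps ("ab", "c") and ("a", "bc") from colliding.
    identity = sha256_hex(name + "\x00" + creator)
    version = sha256_hex(source)
    return identity, version


def fetch_project(spec, cache):
    """Resolve a project spec (a set of function ids) against an untrusted cache.

    `cache` maps function id -> source text. Every returned source is
    re-hashed and compared with the version hash named in the spec.
    """
    resolved = {}
    for identity, version in spec:
        source = cache[(identity, version)]
        if sha256_hex(source) != version:
            raise ValueError("cache returned modified source for " + identity)
        resolved[(identity, version)] = source
    return resolved


# Hypothetical example data.
push_src = "def push(stack, x): return stack + [x]"
push_id = function_id("push", "alice", push_src)
spec = {push_id}
honest_cache = {push_id: push_src}
tampered_cache = {push_id: push_src + "  # something nasty"}
```

`fetch_project(spec, honest_cache)` returns the source, while the tampered cache raises. Note that this only checks the code against the spec; the spec itself still has to come from somewhere you trust.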

zooko

To agree with and add to what you say, jamii:

I took the summary's reference to "lots of meta-data" to imply that the database of functions would be addressed primarily by meta-data. Similarly to looking up a journal article by a bibcite record.

So there is long real-life precedent of cooperative societies being able to manage a decentralized, federated namespace. Curators of content get to develop their own ways (which could include checksums, as you say), to select trustworthy entries to their local views of the database.

Manageable, useful, decentralized, distributed global namespaces with human friendly names are far from impossible (contra zooko) if we relax the implicit constraint that there be a unique authority that can perfectly say which names are valid.

Cooperative societies don't need and are an alternative to "unique authorities" in the first place so... zooko is much ado about not much (from a non totalitarian fascist perspective, at least).

make it so

seems like some kind soul somewhere should just hurry up and put up a web service enabling that to get off the ground, then :-)

We likely will in the next

We likely will in the next year or two.

Previously on LtU:Open

Previously on LtU:

Also, this seems related to Edwards' "names are too valuable to waste on talking to compilers." Also "let google resolve your symbols" (I just made that up).

Wiki-based programming

I'm pursuing this route of conflating functions with modules in my Awelon Object language. Rather than modules, I have a dictionary of words, each modeling functions. For the last month, I've been implementing a wiki-based IDE and software platform, which will eventually serve as an "open source key-value database of all functions".

(I've been thinking about wiki-based IDEs for years, but never had all the pieces until recently.)

But I'm aiming to support federation - lots of wikis all pulling and pull-requesting, merging entire histories and etymologies for every cherry-picked word. But this isn't like a bunch of projects each tucked in its own git repository. Rather, each wiki has all the words, tweaked and extended for the current project, with some useful parts eventually pushed back to popular public wikis. In the very long run, we may accrue many gigabytes of common code. (This isn't a problem. Gigabytes are cheap, and will only get cheaper.)

A highly accessible quantity is a very useful quality.

Since the wikis also serve as software platforms, the different installations may have very different active services and state.

Type Classes and Algorithms

I realise this is looking at it slightly differently, but there is also the question of parametric modules and functors. Personally I prefer Type-Classes for this anyway, as with modules you get all sorts of interdependencies, for example a container with an ordering function requires the container and ordering function to have the same 'value' type parameter, so it requires specifying the type parameters both in the interface definition and the call-site. With type-classes you can infer the required constraints from the implementation and search for matching instances.

The other aspect is data-hiding. I prefer modules to classes because it avoids the need for things like C++'s 'friend' functions and classes. However JavaScript uses functions/closures for this and it seems to work quite well. The only thing I would miss are associated types, which you could add to ordinary records. You could use type-families instead, which separates the concerns better.

So maybe type-classes, type-families and functions mean you don't need modules?

One repository for all functions sounds a bit like the Wolfram language.

I would rather see a site for curating generic algorithms, where the best implementations can be collected and improved on over time, and I could pull implementations into my code. This would only really need to be something like GitHub with a peer review panel selecting which patches/improvements get adopted. What stops it is that there is no decent generic language in widespread use (C++ with concepts is probably the closest). The language would have to allow both functional and imperative implementations to represent the complete spectrum of algorithms.

Don't type classes require a

Don't type classes require a global namespace anyways? I mean, type class member names must be unique.

local type classes

There are ways to do local type classes. Oleg Kiselyov has emulated them under the name implicit configurations, and IIRC something along those lines has been proposed for Haskell a few times. At the moment, in Haskell, the limited scope of typeclasses runs into an orphans problem where a type might have different typeclass behavior in different module scopes.

For AO, I do have some ideas for inferring code using type information, but I'd like to shift that entirely to the development environment and edit-time behavior.

Different behaviour = different type

For me this is quite simple, if you want something to behave differently it has to have a different type. This is quite easy to work with if you have type level functions (type-families) and use phantom-types.

If you move from generics to inferring code at edit time, there would seem to be no way to have a repository of "best" algorithm implementations that people could share and work on with defined type-requirements on the parameters. For example a GCD algorithm that will accept any type that is a EuclideanSemimodule:

gcd :: EuclideanSemimodule t s => t -> t -> t

where t and s are types conforming to the interface definition of a EuclideanSemimodule, including functions, type-functions, and axioms.

generic programming

Typeclasses aren't essential for generic programming. Neither parametric polymorphism nor existential types require typeclasses. Your 'EuclideanSemimodule' could just as easily be a first-class record containing a few functions, rather than inferred from type.
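As a sketch of the record-passing style, here it is in Python rather than Haskell. The `Euclidean` record and its two fields are a hypothetical, cut-down stand-in for the EuclideanSemimodule interface (no axioms, no type functions), just enough for Euclid's algorithm.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Euclidean(Generic[T]):
    """Hypothetical record holding only the operations gcd needs."""
    is_zero: Callable[[T], bool]
    remainder: Callable[[T, T], T]


def gcd(ops, a, b):
    """Euclid's algorithm, with the operations passed in as a first-class record."""
    while not ops.is_zero(b):
        a, b = b, ops.remainder(a, b)
    return a


# One possible record for Python ints.
int_ops = Euclidean(is_zero=lambda n: n == 0, remainder=lambda a, b: a % b)
result = gcd(int_ops, 12, 18)
```

The algorithm is just as generic as the typeclass version; the difference is that the caller hands over `int_ops` explicitly.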

As I understand it, using types to help infer programs at edit time is almost completely orthogonal to generic programming. There's no reason we couldn't infer generically useful programs.

if you want something to behave differently it has to have a different type

When I want something to behave differently, I give it a different value. E.g. `3` has a different behavior than `4`. `succ` has a different behavior than `square`. Of course, in many cases, it might be convenient to have types with the same precision as values. Especially when the values are statically computable.

My understanding is that types are typically more coarse grained than behavior, e.g. by no means does every function of type `Int→Int` have the same behavior. I might think of a type, instead, as a constraint on behavior. OTOH, my way of thinking doesn't seem very compatible with the notion of 'principal type'.

First Class Records.

Yes, of course it could be a first class record, but a record as a value has to be passed to a function. This is the bit about modules I don't like. Consider trying to implement some of Stepanov's "Elements of Programming" in Ada, just showing the signatures:

generic
    with package Integer_Package is new Signature_Integer (<>);
    with package Container_Package is new Signature_Container (
        Element_Type => Integer_Package.Integer_Type,
        Index_Type => Integer_Package.Integer_Type,
        others => <>
    );
    use Integer_Package, Container_Package;
procedure Container_Test;

To call this you have to write:

package Integer_Vector is new Vector (
    Element_Type => Integer,
    Integer_Package => Ada_Integer.Signature
);

procedure Integer_Vector_Test is new Container_Test (
   Integer_Package => Ada_Integer.Signature,
   Container_Package => Integer_Vector.Signature
);

As far as I can tell you would get the same proliferation of (redundant) module constraints using records.

What I think makes type classes better is that by being parametric on type, you don't need to pass additional arguments to the function. You only need the initial type parameter and the type-class constraints can all be inferred.
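A very loose Python approximation of what I mean, assuming a registry keyed by type. The names `register_euclidean` and `_INSTANCES` are invented for this sketch, and the lookup here happens at run time, whereas a type-class constraint would be resolved by the compiler.

```python
_INSTANCES = {}  # hypothetical registry: type -> (is_zero, remainder)


def register_euclidean(tp, is_zero, remainder):
    """Register the operations for a type; at most one instance per type."""
    if tp in _INSTANCES:
        raise TypeError(f"duplicate instance for {tp.__name__}")
    _INSTANCES[tp] = (is_zero, remainder)


def gcd(a, b):
    """Euclid's algorithm; the operations are found from the argument's type."""
    is_zero, remainder = _INSTANCES[type(a)]
    while not is_zero(b):
        a, b = b, remainder(a, b)
    return a


register_euclidean(int, lambda n: n == 0, lambda a, b: a % b)
result = gcd(12, 18)  # no operations record at the call site
```

The call site only mentions the values; the operations are selected from their type.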

Yes...

...because each type can have at most one instance for each class (in order to make the meaning of a program unambiguous), all type classes and type constructor declarations have to be globally visible.

This really sucks, but AFAICT there's no way around it.

It's not ambiguity

The global coherence condition imposed on Haskell is due to usage patterns that assume it, such as storing balanced trees that rely on the global Ord instance. Breaking this global unicity would break some idioms (and create new ones) but I don't think it would make the meaning of programs ambiguous. The dynamic semantics of a program certainly depends on the elaboration strategy, but you could guarantee that in each scope of implicit declarations there is a canonical elaboration chosen.
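To illustrate the usage pattern in question, here is a small Python sketch in which a sorted list stands in for a balanced tree and the comparison is passed explicitly, so that two different "instances" can be mixed. All names are made up for the example.

```python
def _lower_bound(items, x, less):
    """Index of the first position whose element is not less than x."""
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi) // 2
        if less(items[mid], x):
            lo = mid + 1
        else:
            hi = mid
    return lo


def insert_sorted(items, x, less):
    """Insert x, keeping items sorted according to `less`."""
    items.insert(_lower_bound(items, x, less), x)


def contains(items, x, less):
    """Binary search; only correct if items was built with the same `less`."""
    i = _lower_bound(items, x, less)
    return i < len(items) and items[i] == x


ascending = lambda a, b: a < b
descending = lambda a, b: a > b

s = []
for n in [3, 1, 2]:
    insert_sorted(s, n, ascending)

found_same_order = contains(s, 1, ascending)    # built and searched consistently
found_other_order = contains(s, 1, descending)  # searched with a different ordering
```

The element is in the structure both times, but the second lookup misses it. Nothing is ambiguous about what the program does; it is the idiom of relying on a single ordering that breaks.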

For example, I think that a language feature (replace Ord Int with ... in ...) that would *locally* replace an existing type-class instance by another would not make the semantics ambiguous. It preserves (non-)solvability of type-class constraints and coherence of resolution. I find this coarse-grained feature rather inelegant (to be compared to the use of "swap" to manipulate arrays in linear languages) and I'm not actually suggesting it, but it would be one example of non-ambiguous extension.

The problem...

...with giving up global coherence is that typically you lose substitution for the external language. Since the external language is what programmers actually see, I think this is a rather risky tradeoff. For example, with your suggestion we can break even ML's relatively weak substitution principle (values for variables) pretty badly. In this example, I'll use let implicit int = e in ... to mean installing a dictionary for the int type, and ?int to denote the type-directed lookup.

   let implicit int = 5 in 
   let f : unit -> int = fun () -> ?int in 
   replace implicit int = 3 in 
   f

A straightforward elaboration of this term should result in a program which evaluates to fun () -> 5, but if you substitute for f before elaborating, you get fun () -> 3.

Global coherence is a bad idea

ML already has a substitution problem analogous to the one you mention:

let x = 5 in 
let f : unit -> int = fun () -> x in
let x = 3 in
f (* textual substitution of f doesn't work *)

This doesn't seem to confuse many people in principle (though it can be a source of errors). The situation can be similar with implicit parameters.
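The same effect can be reproduced in Python, with the caveat that assignment within a single Python scope rebinds rather than shadows, so this sketch uses a nested function to get a genuinely new binding. The function names are just for illustration.

```python
def original():
    x = 5
    f = lambda: x  # f closes over this x

    def inner():
        x = 3      # a new local x that shadows the outer one
        return f() # f still sees the outer x

    return inner()


def textually_substituted():
    x = 5

    def inner():
        x = 3
        # The body of f pasted in place of f: it now captures the inner x.
        return (lambda: x)()

    return inner()
```

`original()` gives 5 and `textually_substituted()` gives 3, which is the ordinary capture problem that capture-avoiding substitution exists to prevent.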

There's a technical difference

It's possible to argue that there's a technical difference between the two cases. With shadowing, you can prove that substitution preserves values (with capture-avoiding substitution) for the external language; with implicit parameters and without coherence, you can only prove that for the internal language.

In other words, by inspecting this snippet, you can see it depends on x (so you know you're dealing with an open term), but in many cases you couldn't see uses of implicit parameters without looking at the types of the used functions (which might even not be written in the source).

In yet other words, to deal with shadowing, you "just" need to understand capture-avoiding substitution correctly; to deal with incoherence, you need to (mentally?) re-add the code which you avoided writing in the first place.

In practice, even shadowing is tricky enough that at least some programming communities (Java programmers, for one) actively discourage it where it can be avoided.

In fact, the only things we can discuss objectively are the technical properties of the different possibilities. If we wanted to figure out scientifically which properties matter for PL usability, we'd need to actually measure the latter (which might or might not be possible nowadays). Right now, this is akin to Aristotle arguing on what should happen when you drop an apple (that is, philosophy), not to somebody actually dropping an apple and looking at what happens (that is, physics).
Philosophy is worthwhile as long as a question is outside the realm of science, but the distinction is often worth emphasizing. (I should probably turn this point into a blog post to do it justice).

See "Modular type classes"

but you could guarantee that in each scope of implicit declarations there is a canonical elaboration chosen.

IIRC, "Modular type classes", from Dreyer, Harper, Chakravarty and Keller gives this guarantee, but they explicitly exclude a "replacing" facility. Neel, did your remarks already consider this paper?
(I know Derek was your supervisor, but that doesn't answer the question — not all supervisors teach their students about all of their previous research :-)).

In fact, the paper discusses this issue (with a similar but less artificial example) in Sec. 3.1; they propose specifying signatures explicitly (to decide in which scope instances are resolved), but reject this approach because of the overhead. In the end, they simply forbid nested using declarations.

Here's a few other comments on this solution.

Back to coherence of Ord and Set/Map (I've never seen a second example), how does this paper deal with it?
Since it encodes type classes on top of ML modules, I imagine a dictionary based on one Ord instance would be incompatible with a dictionary based on another instance (that's what you get anyway if you give up instance inference and encode the thing with ML modules). I haven't fully checked whether this is the case in their work.

The design of their "overload" facility reminded me of Agda: in both cases, you can take a function and turn a parameter into an implicit one (in Agda's case, it's called "instance argument", since implicit arguments are a different thing, inferred by unification).
With dependent types, encoding coherency in types where needed is straightforward, so many use cases for coherency can be dealt with (I can't claim "all").

MTC has essentially the same restriction as Haskell...

Here's a quote from section 3.1 of the paper (page 5 in the long version):

Instead we propose that the using declaration be confined to an outer (or top-level) layer that consists only of module declarations, whose signatures are typically specified in any case. All core-level terms appear in the inner layer, where type inference proceeds without restriction, but no using clauses are allowed. Thus, the set of permissible instances is fixed in any inner context, but may vary across outer contexts.

This is basically the same global/top-level namespace approach that Haskell uses.

I must be missing something

I must be missing something. I'm referring to that exact proposal, which allows different instances to be visible in different parts of one program, unlike Haskell as described both here and in that paper. The introduction of that paper sets out to lift the Haskell restriction that "there can be at most one instance of a type class at any particular type". Are you implying that in your opinion the paper fails to solve that problem?

(Your comment happens to be true for GHC Haskell with inner contexts being files, which is not what's discussed in the paper. GHC does *NOT* require global uniqueness of instances, but this thread discussion seems to focus on Haskell with global (program-wide) uniqueness. The whole Haskell-vs-GHC-Haskell distinction is a mess, so that most people refrain from using orphan instances to get GHC to enforce global uniqueness anyway, and there's a GHC bug about this debate.

An intelligent analysis of this mess, for those who didn't see it already, is in Edward Yang's blog).

Closures for modularity

Keean:

JavaScript uses functions/closures for this and it seems to work quite well.

Not well enough, which is why ECMAScript 2015 (a.k.a. ES6) is adding modules (and classes) to JavaScript.

For Whom?

Not well enough for whom? It works well enough for me. There seems to be a habit of adding unneeded features to languages that spoil the aesthetic because some people do not understand how to program in that idiom in the first place. Adding classes to JS effectively turns it into Python. I don't think we need two Pythons, but the JS approach to functions and prototypal inheritance was enough of a difference to make it interesting. If JS did not have the "bad parts" it would actually be quite a nice dynamic functional language.

Stable identities

JS uses both "Closures as Modules" and "Objects as Namespaces". Both ideas, while simple on the surface, are actually woefully insufficient for the same reason: They do not introduce stable identities.

There's no way to reload a module in a JavaScript application because the typical "import" idiom aliases mutable, anonymous objects and discards names willy-nilly: `var f = require('m');` Attempting to reload module 'm' would not update 'f'. This same problem applies if you're using CommonJS or the traditional `(function(){...})()` pattern. This means that constantly reloading the browser (or restarting your service) is your only available development strategy.
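The aliasing problem is not specific to JavaScript. Here is a small Python analogy, assuming a hypothetical module table called `registry` and a `load` function that stands in for both the first import and a later reload.

```python
registry = {}  # hypothetical module table: module name -> {function name: function}


def load(name, functions):
    """Install (or reinstall) a module as a fresh object in the table."""
    registry[name] = dict(functions)


load("m", {"greet": lambda: "v1"})

# Like `var f = require('m')`: grabs the current object and forgets its name.
alias = registry["m"]["greet"]


def stable():
    # Goes through the name on every call, so it follows reloads.
    return registry["m"]["greet"]()


load("m", {"greet": lambda: "v2"})  # "reload" module m

after_reload_alias = alias()
after_reload_stable = stable()
```

After the reload, `alias()` still runs the old code while `stable()` picks up the new one. That indirection through a stable name is the thing I'm saying the closure and object idioms throw away.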

With all this said, I don't think the ES6 proposals/designs even begin to address this problem. A true tragedy for a dynamic language.

Is this even a desirable feature?

I don't see anything addressing the problems you are describing. You seem to be describing dynamic code reloading? Not really something offered in many languages, so not really a must-have feature, so "woefully insufficient" seems to be misstating things. Not suitable for enabling the desired future feature of dynamic code reloading might be more accurate.

I don't really see this as a problem. If I want to update my code I release a new HTML5 app with a new manifest. People only get to run version 1.0, or version 1.1. Any hybrid with some other combination of functions would be unsupported. It's just going to play havoc with the release cycle and support.

Widely used in Smalltalk & most Lisps

The impact is most apparent in dynamically scoped contexts: such as top-levels in Lisps (not Schemes) or pervasively in Emacs. Further, most OOP languages have a flavor of this too: Edit & Continue in C#/Java toolchains depend on the fact that classes have stable identities. Escaping closures escape the reach of this powerful development and debugging tool. I wasn't making a statement about versioning releases.

Continuous integration and continuous deployment.

I don't really understand. In development, new code goes into unit-test before integration testing. Using continuous deployment, after each code check-in the server is completely torn down, and rebuilt from scratch, including fresh OS image, dependency installation and app install, which then runs all the unit-tests. Continuous integration and continuous deployment are a huge improvement in development methodologies, and ensure deployment is not an after-thought as it often is. Continuous integration ensures unit-tests are written as the code is developed.

I just don't find restarting services to be a problem, so this seems to be solving a problem I don't have. I don't find refreshing the page a problem with JavaScript.

huh?

CI and CD have nothing to do with what I'm talking about. At all. I'm talking about changing a program, while it's running, while you are working on it, so as to minimize iteration time to diagnose and correct an issue. It saves an inordinate amount of time recreating, and re-re-creating debugging state.

If you've never done that, then I strongly suggest trying Edit & Continue in C#/Java or evaluating a def form in Clojure. It's a totally different way of working... one that genuinely requires language support.

I've never been able to get

I've never been able to get edit and continue to work very well in C#.

Unit tests are a better solution.

Why would I need it? I understand the code I write, so I don't need to run it to see what it does. I would normally write unit tests before writing the code itself (or at the same time). So what I want to do is edit code then run the unit-tests.

It seems to be solving the same problem as unit-tests (actually regression testing as part of unit tests), namely reproducing the error states, only not as well, since you don't get the ability to repeatedly recreate the error state over multiple machines and debugging sessions.

In fact it sounds like a bad idea because it will discourage people from writing unit tests.

What about edit code and run

What about edit code and run unit tests at the same time? Could be fun.

Interesting

Okay, so I could buy the idea of having a live unit test dashboard that shows whether all the unit tests pass (We use Jenkins/Hudson).

The problem with live editing is that I have to change from one sensible state to another passing through garbage in between. For example I want to copy and paste a line. After the copy I do not expect the code to work, so any errors are just distracting noise (slowing the editor down and taking up valuable screen space - I like code to get as much space as possible).

So this takes me back to edit, and press a button to run the unit tests when I am ready.

Live syntax highlighting is definitely useful, but I find autocomplete gets in my way. I would rather have a wide screen monitor with two side-by-side panels, one with a web-browser for language and library docs, and one for the code.

Dealing with transient

Dealing with transient errors is one of the keys to making feedback useful. For one thing, you know not to re-type if you have a syntax error, and not to re-execute a node with a type or syntax error in it. There are many ways to keep the noise down in presentation (though it takes a lot of UX design). The bigger opportunity, however, is debugging in the moment that you are editing, seeing runtime data as you need it in a format that can be consumed easily while editing.

Most statically typed kids under 30 can't live without auto-completion just like they can't live without GPS in their cars (which personally drives me bananas). There are still people doing it the old-fashioned way (e.g. reading/memorizing docs), but this is probably not the future. The dynamically typed kids are clamoring for languages like Dart and TypeScript just so they can have auto completion.

auto complete no substitute for docs.

The docs contain all sorts of useful information about how to use a function. I don't understand how someone thinks they can remember all the details of a function's behaviour when they can't even remember the function name. Are iterators invalidated when you delete an item from a collection of type X? Typing is not the limiting factor in software development, debugging is, therefore the autocompleters are optimising the wrong part of the process.

Autocompletion has nothing

Autocompletion has nothing to do with typing (well, we thought it did in 1997, but quickly found out otherwise), it just helps in discovery and recall. You'd be surprised how many people have replaced docs with auto completion and stack overflow.

Emplace or push?

Auto complete will not help decide whether to use emplace_back or push_back in C++, it will not help write deadlock free threading code, it won't help make sure loop ranges are not off by one.

The problem with autocomplete is that it is based on spelling, whereas it should be based on semantics. I should be indicating I want to insert an element into a container, and it should offer me all the methods to do that, not just the ones starting with 'ins'.
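Something like the following toy sketch is what I have in mind. The catalogue lists real Python `list` method names, but the "effects" tags are metadata I have invented for the example; a real tool would need them supplied by the library author or inferred somehow.

```python
# Hypothetical metadata: each entry tags a method with what it does.
CATALOG = [
    {"name": "insert", "effects": {"add-element"}},
    {"name": "append", "effects": {"add-element"}},
    {"name": "extend", "effects": {"add-element"}},
    {"name": "pop",    "effects": {"remove-element"}},
    {"name": "remove", "effects": {"remove-element"}},
    {"name": "index",  "effects": {"find-element"}},
]


def complete_by_prefix(prefix):
    """Spelling-based completion: what most editors offer."""
    return sorted(e["name"] for e in CATALOG if e["name"].startswith(prefix))


def complete_by_effect(effect):
    """Semantics-based completion: everything that achieves the stated intent."""
    return sorted(e["name"] for e in CATALOG if effect in e["effects"])


by_spelling = complete_by_prefix("ins")
by_meaning = complete_by_effect("add-element")
```

Typing 'ins' finds only `insert`, while asking for "add-element" also surfaces `append` and `extend`.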

Standard libraries also provide many algorithms. I think people should spend more time studying the methods provided in the standard libraries; then they would write better code using the generic functions provided rather than writing a lot of code they don't need to. They just won't discover this sort of thing with autocomplete.

Language is a wonderful

Language is a wonderful thing: you can attach names to arbitrary concepts without regard to truth. Those who think we should be talking purely in terms of semantics and truth forget how much language has given us in being able to talk about things we don't know completely.

Most coding that goes on never approaches the complexity where a directed search through a code completion menu isn't effective. For everything else, there is stack overflow. Also, if you can't write deadlock free threading code, the docs ain't gonna help you.

Given the way programmers currently operate, there is necessarily a bias away from complex generic functions that are difficult to pick out in a list. You can see it between languages that support good code completion (C#) and those that don't (Haskell). It is also why tooling should be considered in language design: designing for the older emacs/vim crowd is quite different from designing for younger IDE/tool addicts.

Code Completion for Generic Programming

I think this is interesting. I have posted some thoughts about how a generic-oriented language might be designed to be compatible with code completion in a new topic.

tooling in C# vs Haskell

I think it is unfair to draw comparison about language design from the respective tooling of C# and Haskell, given the orders-of-magnitude difference in effort invested in tooling by both communities.

Furthermore, it might be fundamentally harder to provide good tooling to languages that favor abstraction, but this could be a problem to solve once and for all, and it's hard predicting what the quality of the result would be. I understand that your remark is rooted in *current* practices, but that should not be a fixation for long-term design.

My personal bet is that language designs that favor expressing specifications, including type systems, are key to ultimately-good tooling, but of course that remains to be exemplified (and lighter specification approaches such as Racket's dynamic contracts may result in equally or more pleasant results than rich static typing).

The investment issue is not

The investment issue is not a big deal: code completion is not hard to implement, and beyond the basics, returns diminish quite quickly (e.g. code fix suggestions).

It is a wide open question whether abstraction heavy languages adversely affect usability and toolability. But maybe it's not even the right question, as I see the key being in more holistic design of language and tooling. Designing the type system hoping some good tooling will come isn't good enough, you have to take your target tool requirements into account when designing your type system (design your language and tools at the same time). This happens de facto in C# given the experiences since Visual Studio 97 (the first one with IntelliSense).

If you know what it does, why run the tests?

I don't need to run it to see what it does

What a weird thought. First, if you know what the code does and don't need to run it, why have tests at all? Surely you've got to run the code at some point to validate it. Second, I find it really hard to believe that you've never written code just to see what happens. There are many domains in which that sort of coding is the default: Like games or UI, where you want to tweak some code and *feel* the results.

That said, I'm one of the weirdos who both prefers dynamically typed languages and greatly discounts the value of unit tests. When I do want/have tests, I much prefer to run them manually than have them run automatically. So maybe our styles and experiences are just so different that we're past the point of hoping to understand each other.

Tweak code, feel the results.

Maybe for educational tools, but there can never be any rigour with that kind of approach.

Test-driven-development is actually popular with the dynamic language crowd, especially the Pythonistas.

As for knowing what it does, I guess I mean I know what it's supposed to do, but subtle errors like off-by-one still catch me out.

Unit tests are how you validate code. Just running the program is unlikely to sufficiently test the corner cases of functions and/or modules. I agree with manually running tests - I write a code block and test with unit tests (which are actually the main program at this point). You never write a large program in one go, so all unit tests are doing is capturing all the ad-hoc testing you do as you build a program in a repeatable way. They actually save you time.

Some unit tests are drop-in.

Sometimes you know a particular desired property of a system you're building, and you can just apply a "standard" unit test specified across fuzz inputs to enforce a particular property.

F'rexample if I ever overload the addition operator with any function that is not commutative and associative in cases where it doesn't signal an error, that's a mistake.

So it's nice, IMO, if I can just attach unit tests to the addition operator and have them autorun immediately whenever I overload that symbol with anything. If I have to think about doing it, I might forget.
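A minimal sketch of what such a drop-in test could look like, in Python. The helper name `check_addition_laws`, the `make_value` generator, the trial count and the integer range are all assumptions made for illustration; a real system would hook this to the overload itself and would also have to skip inputs on which the operator signals an error.

```python
import random


def check_addition_laws(add, make_value, trials=200, seed=0):
    """Return the law violations found for `add` on random inputs.

    `add` is the candidate overload and `make_value` draws one random
    operand from a random.Random instance. Both names are hypothetical.
    """
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        a, b, c = make_value(rng), make_value(rng), make_value(rng)
        if add(a, b) != add(b, a):
            failures.append(("commutativity", a, b))
        if add(add(a, b), c) != add(a, add(b, c)):
            failures.append(("associativity", a, b, c))
    return failures


def small_int(rng):
    """Fuzz input generator: a small signed integer."""
    return rng.randint(-1000, 1000)


# Ordinary integer addition satisfies both laws on every sampled input.
int_failures = check_addition_laws(lambda x, y: x + y, small_int)

# Subtraction masquerading as "+" is reported as a violation.
sub_failures = check_addition_laws(lambda x, y: x - y, small_int)
```

Passing fuzz inputs is evidence, not proof: the check only samples the input space, which is the gap the type-level axiom discussed in the reply is meant to close.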

Commutative unit tests

I agree that + should always be commutative, and in this case there can be an axiom in the type system which requires it. However, I don't magically dump the whole definition into the editor in one paste, so I don't want an error flagged as soon as I type:

x + y =

Before I have finished typing what I already have in mind. Having a type rule that forces all definitions of + to be commutative prevents it being forgotten.

Er...

With all due respect to Joe Armstrong, but that's not exactly the most well-informed post I've ever read. Especially the encapsulation argument, which is, like, as backwards as the 'let' syntax he proposes.

The second sentence of the

The second sentence of the post admits this is a "brain-dump-stream-of-consciousness-thing".

Local instances: challenges and very old solutions

Local instances and local type classes are a very old idea -- in fact, they were proposed and even implemented in the very first paper that introduced `parametric overloading' (type classes, in modern lingo), by Stefan Kaes. The following message posted on Haskell Cafe and its follow-ups describe the challenges and the old solutions, by Kaes, and by Duggan and Ophel.

https://www.haskell.org/pipermail/haskell-cafe/2014-October/116291.html

Still need units of compilation

.. and also dependencies. We presently rely on lexical order of presentation within a unit and dependency-ordered aggregation across modules. We can do away with modules, but we still need a way to resolve symbols in dependency order.
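To make "resolve symbols in dependency order" concrete without modules, here is a minimal Python sketch over a flat table of functions. The function names and the `calls` table are invented for illustration; the only library call used is the standard `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency table: each function name maps to the names it calls.
calls = {
    "main": {"parse", "render"},
    "parse": {"tokenize"},
    "render": {"escape"},
    "tokenize": set(),
    "escape": set(),
}


def resolution_order(deps):
    """Return function names so that every callee precedes its callers."""
    return list(TopologicalSorter(deps).static_order())


order = resolution_order(calls)
```

Mutually recursive functions form a cycle, and `static_order()` raises `graphlib.CycleError` for them, so a real system would have to resolve such groups together as a unit.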

Prediction: modules would soon get re-invented under a new name to address this.

There's also the consideration that some forms of module-like notions encapsulate real-world (social) boundaries. E.g. there is an association between a package and its maintainers. However awkward this may be for proper encapsulation in your view, it's a useful reflection of the units of replacement in deployed programs.

The domain name system

The domain name system was invented to deal with real world operational boundaries and it is hierarchical -- no global key-value store! I suspect there are more functions in the world than machines when you include all the dusty decks and versions.... And then there are issues of privacy, confidentiality etc. FooMatics.com may not care to tell the world it is working on wrist mountable 3d printers for example.

units of compilation

Functions make fine units of compilation, especially if you can suggest separate compilation for some functions and not others (e.g. via metadata), with the non-separately-compiled functions mostly being inlined.

It is not clear to me how this would constitute a 'reinvention' of modules.

It seems to me that social boundaries are often aligned to match module boundaries because that's what the language affords, rather than serving as an accurate reflection of developer intentions.

The social aspect is important for programming in the large

It sounds really fun for 4-5 people to play around in such an environment--sort of an MMO for programmers. However, at scale, having a "project" is helpful for human beings trying to develop and share artifacts with each other.

For example, projects are an interface boundary. You can change things inside a project any way you like, so long as the external users of the project are unaffected. If everything is just a sea of functions, then nobody is ever in a good position to change 10 of the functions in tandem as part of making a larger change.

Similarly, projects are an important granularity for making releases and for upgrading to new versions of third-party code. It would be miserable to try and upgrade your Erlang compiler by upgrading each function individually. Some other layer is needed; I would say two layers: in addition to projects, a higher level of distributions.

Projects are also an organizational boundary. Project committers are generally given wide leeway to make changes to the code inside a project. Non-committers always have to find a committer to shepherd their changes through. The behaviors around these group membership activities are wired into us quite deeply.

the social aspect of very large programs

Why should programming languages or environments encourage the development of very large programs? Can't a case be made that this is bad for society? Small associations of people are dis-empowered by very large programs. Very large programs are most often associated with anti-democratic power monopolies.

Against those considerations I don't see any reason to believe very large programs are technically necessary for just about anything useful.

Corollary to these beliefs: Even the scale of modern web browsers is a complete disaster for humanity.

Yet somehow -- and it's nothing personal, lexspoon, because this is very widespread -- somehow programmers are immediately drawn to think about how to build huge, "universal" software systems.

C.f. John Cowan: Ein Reich, ein Volk, ... in different guise.

You can change things inside a project any way you like, so long as the external users of the project are unaffected.

Why is it important to create such monopolized power "at scale"?

Pascal lacked the time to

Pascal lacked the time to write shorter letters.
Our society didn't have time yet to write shorter programs (for most domains).

For one notable attempt for shorter programs, see http://lambda-the-ultimate.org/node/4436. I very much applaud Kay's and your goal, but he points out that the appropriate timescale for that is much longer than we imagine.

At the same time, some other sciences are starting to give up the idea that you can simplify things enough to make them tractable. I'd say lots of physics is already past that point (thinking of chaos theory) and one of the motivations used for "selling" homotopy type theory in mathematics is enabling scalable large-scale collaborative developments of proofs. Some mathematicians discuss explicitly the old expectations that proofs could be compressed by developing the right theories, and the idea that maybe that's just not always possible.
Also, as much as I love maths and its struggle for simple theories (such as category theory — I didn't say easy!), it's not obvious that the results empower society — we don't have yet adequate education. Having to be educated to enter elites seems better than most other barriers for elites (think religion- or nobility-dominated societies), but it still excludes large swaths of population.

Outside of mathematics and physics, I'm not sure the expectation for simplicity ever existed (is medicine or psychology simple? Don't think so.)

More concretely, could we conceive implementing (a simulator of) the human body with a small program? What about a human mind? Let's stick to an accurate simulator, and count the Kolmogorov complexity (including therefore data, such as DNA). Right now, this seems the realm of science-fiction.

What about simulating a human society? The complexity seems similarly high. Many programs simulate (or implement) parts of human society (say, bank accounts and associated laws), and need to deal with the attendant complexity. The complexity of laws is also a factor of technocracy, as you say, but even that seems an open challenge to me.

I'll agree that many programs are unnecessarily large because of accidental complexity, but there seems to be a push for programs whose essential complexity is high, and turning essential complexity into accidental one is research (arguably, it's mathematics).

Engineering ethics.

I got stuck on this point:

Our society didn't have time yet to write shorter programs

That sounds like a serious problem!

What with the industrial revolution and everything since, we can very obviously all be much wealthier than we are with less "busy-ness". Our political economy resists taking advantage of that development, for various reasons.

As programmers, we have choices to encourage or discourage more construction of computing systems that are so large they are anti-democratic.

The reflexive choice of many programmers is to enthusiastically throw themselves at big "scaling up" problems because these are supposedly hard and you can be famous and/or get lots of money for working on them. I'm not so sure that's a good reason to help the people paying for such work, though. After all, the ones that pay are the ones who are amassing massive degrees of anti-democratic power over billions of people.

We used to call people who uncritically worked on whatever the hot, probably big paying stuff was: "tools". It was not praise. There was a message there.

Fix research

I can buy your point. (But I debated whether shorter programs are possible). But if writing shorter programs is research (as I argued), we need to expand research on this. If research is a Ponzi scheme nowadays (that's a gross oversimplification, but let me cut out the appropriate nuances and point you to http://www.phdcomics.com/comics/archive.php?comicid=1144), it can't expand to handle the needed scale. Can you fix that? *EDIT*: the fix doesn't have to be research, if you can make the activity profitable or change the underlying economic forces.

To put it differently: I hope I'll manage to avoid stereotypical "enterprise software" (which is antidemocratic) without founding my own company, but I don't think that most people have that luxury.

Also, we have >= 10 k years dealing with large human organizations, but we didn't really learn to make those simple, did we? That's painfully apparent to a Sicilian who moved to Germany (like me).

gzip

shorter programs simply expand at runtime to be larger programs a la k. :-)

(i'm just foolin', i'm not saying it is inherently wrong.)

Capitalism

It's a trade-off: if you spend the time to do something elegantly (as simple and concise as possible), your competitors can steamroll over you with an uglier but more timely solution. And if you don't build that big/ugly system, someone else will.

In research it's the same, and even if you are working in a niche field, you might want to fail fast and move on anyways (and believe me, I'm no tool...look at my citation counts).

Nature does pretty well with non-elegant solutions that are only pruned via natural selection. It is folly to think that human organizations aren't a part of that.

not about capitalism or elegance

The question before us, as I understand it, is more like "Should programming systems and environments strive to support so-called programming-in-the-large?" I think this question is largely orthogonal to the chaotic dynamics of capitalist competition except that if programming systems encourage sprawl and very large LOCs of crazily inter-dependent code then the agentless dynamic of capitalist competition seems likely to discover the most destructive uses for these practices.

You seem to assume that corner cutting is important to capitalist competition (OK, that's plausible) and that somehow, for some reason, non-human-scale, anti-democratic code bases are the necessary winners in capitalist competition.

What do you mean

What do you mean "anti-democratic"? Pure democracy is typically described as "mob rule" with people collectively following their own self interests where order only emerges and simplicity is quite elusive. Simplicity only comes from the authoritarian but consistent and enlightened rule of the so-called philosopher king, which almost never works in practice, leads to stagnation, and is anyway boring (see Singapore).

At the end of the day, it boils down to philosophy and your vision of an ideal world. Many think if programmers just knew more math and programmed in Haskell, that somehow human progress would advance. But it could just as easily stagnate if "worse is better" was thrown under a bus. Likewise, we love to hate our horrible large code bases, but given the time and resources available, would a better result could have been achieved using more careful practices?

"would a better result could

"would a better result could have been achieved using more careful practices?"
Probably not with existing languages?

"What do you mean "anti-democratic"? "

Little-d "democratic".

I mean that if you have a big messy sprawl program -- eye-crossingly inter-dependent code and a huge lines-of-code count -- it requires something like a persistent bureaucracy to monitor and manage.

Individuals and small associations of people are at the mercy of such bureaucracies.

Some examples: No small group could realistically "fork" Mozilla software without retaining a dependency on the upstream bureaucracy of that project. (And that example is probably at the low end of how bad sprawl can get.)

All users of Mozilla software are also at the mercy of that bureaucracy.

Also: Long before the sprawl problem became a painful feature of the so-called "open source" world it was already recognized as a source of vulnerability and pain for capitalist competitors.

Think, if you will, on the abstract but familiar problem wherein some division has domain over a sprawly piece of software essential to the business, and actors within that division then have the leverage to restrict the company's freedom to evolve the software for business needs. That is, such groups are able to usurp a certain degree of the firm's executive powers.

It is a foolish industrialist, so to speak, who gives his chief engineer such a thorough monopoly over the technical infrastructure of production that that fellow can then effectively take over or crush the firm by walking away.

So sprawl software is software that, by virtue of being sprawl software, creates its own "management problem". Sprawl software creates, out of thin air, actual living breathing instances of new patterns of domination and subjugation (on a massive scale).

And the pattern is persistent: when bureaucracies do manage to compete over the same sprawly program, usually (empirically) it is winner-take-all before very long. Sustained competition almost never happens.

Yet there seems to be this odd, reflexive assumption that programming languages should tend to aid and abet the creation of sprawl. People will discuss at length the minutiae of language features in support of sprawl (e.g. modules) without revisiting the question of whether or not this is a sensible and ethical priority.

But it could just as easily stagnate if "worse is better" was thrown under a bus.

You're really addressing a straw-man of your own devising here. Who said anything about the question of "worse is better" besides you?

Darwinian PL design

The root cause of the problem seems to be that whenever you have a subcommunity maintaining a support system (such as a system of regulations, or a body of law, or a code base), the support system will evolve toward being impossible for outsiders to understand, by Darwinian evolution: survival of the support system is promoted by entrenchment of the maintaining subcommunity, and entrenchment of the maintaining subcommunity is promoted by obfuscation of the support system. What features of a programming language (in the broad sense, of course) do you think will tend to encourage evolution in a different direction?

succession vs accretion

You pose an interesting question, but I think that there's a dual (to abuse a term) question as well.

The Darwinian model does not seem to apply to individual code bases in the way that it applies to a larger body of software, systems, and communities. Evolution assumes a succession of new organisms produced from earlier generations that die off. Meanwhile, individual code bases tend to accrete code like barnacles or strata.

So the dual question is: "What features of a programming language do you think will tend to encourage thinner layers to be built upon a stronger foundation?"

And I may reframe your question as "What features of a programming language *ecosystem* do you think will tend to encourage evolution in a different direction?"

Programming system evolution

Well, in the classical biological case evolution involves a bunch of discrete individual organisms each with a highly stable genome; under those circumstances evolution takes a bunch of generations. I think we'll need to rethink some of that model if we're ever to understand memetic evolution, which is, at least in some ways, a good deal fuzzier. I don't think it's safe to assume a code base can't "evolve".

I agree there's some sort of qualitative difference between development of a single code base and development amongst the population of code bases over a longer time. The two modes of development are... I'll go with "complementary". It's certainly relevant to understand what properties of a PL foster desirable features in the development of a code base in that language; but if the features we "desire" in that one case have negative survival value at the larger scale, we would only create a code base that we like but that will fail in wider-scale competition.

What about "one program does one job?"

I think that some classic PL environments might be worth thinking about here.

There was (at one time, probably not so much any more) a semi-classical aesthetic for programming in UNIX systems. The idea was that any particular program ought to do exactly one job and do it well. So, for example, if you were writing a mail transport agent, absolutely everything you did in it ought to be about mail transport. UI was a different task, for some other program that was a mail user agent.

However we may consider this now, the separation of jobs did serve to limit interdependencies. That is, the same MTA works fine on an Xwindows system where nobody ever sees a shell prompt and also works fine on a shell system where no GUI has ever been. This is pretty unthinkable in a "modern" mail system; because a modern mail program depends on a particular substrate for UI, and if it doesn't have the substrate it needs for its UI it won't run, period - so any code it contains for mail transport becomes unusable.

We keep tying together larger monolithic agglomerations of code that interact with more subsystems - but as we do we tacitly consign all of that code to death, the minute ANY of those subsystems is replaced or substantially changed. So the larger/more complex something gets, the shorter its life. This is one of the forces that creates a point of diminishing returns in scale.

Multiprocess is C's module system, we have simpler

I used to like that aesthetic (my canonical citation is http://www.catb.org/esr/writings/taoup/html/). But IMHO its useful features boil down to:

  1. modularize software with the only robust modularity mechanism of C;
  2. hide internal state (not OS state) even better than ML modules (the "no shared heap" mantra means "have pure interfaces");
  3. impose a huge cost on the interface complexity (one needs to parse the output of other tools), forcing you to simplify the interface.

We have improved a lot on 1., we are working on 2. (Haskell is one approach), so we might want to aim for simple interfaces directly.

(I'm probably missing some cool properties, but I'll extend my approach to those.)

features discouraging sprawl

What features of a programming language (in the broad sense, of course) do you think will tend to encourage evolution in a different direction?

"In a broad sense":

Programming systems can maintain a strict classification of composition techniques into those which represent a tight coupling, and those which represent a loose coupling.

For example, we could decide that function application always describes a tight coupling. Or if two components share memory, that is always a tight coupling.

On the other hand, pipelining outputs and inputs is a loose coupling.

Armed with that classification of composition techniques, we can meaningfully talk about the transitive closures of tightly coupled components. If A is tightly coupled to B which is tightly coupled to C, together they are a tightly coupled assembly of components A, B, and C.
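As a minimal sketch, the assemblies can be computed mechanically once every coupling has been classified. The Python below uses a small union-find; the function name `tight_assemblies` and the component names beyond A, B and C are invented for illustration, and it assumes tight coupling is symmetric, which the comment above does not state explicitly.

```python
def tight_assemblies(components, tight_links):
    """Group components into assemblies connected by tight couplings.

    `tight_links` is an iterable of (a, b) pairs. Loose couplings are
    simply left out, so they never merge two assemblies.
    """
    parent = {c: c for c in components}

    def find(c):
        # Follow parent pointers to the representative, halving the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in tight_links:
        parent[find(a)] = find(b)

    groups = {}
    for c in components:
        groups.setdefault(find(c), set()).add(c)
    return sorted(sorted(g) for g in groups.values())


components = ["A", "B", "C", "D", "E"]
tight = [("A", "B"), ("B", "C")]  # e.g. function application, shared memory
# D and E are connected only by a pipeline (loose), so no link is listed.
assemblies = tight_assemblies(components, tight)
# assemblies == [["A", "B", "C"], ["D"], ["E"]]
```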

Here's an odd thought:

Programming systems can restrict the total source code size associated with the transitive closure of tightly coupled components.

As a slightly tongue-in-cheek example: The language C' might be exactly like C except that the totality of source code that goes into a final executable is limited to no more than 10,000 lines of code; no more than 3,000 globally defined identifiers; no more than 3,000 functions; and so on. (I am ignoring the problem of operating systems that allow multiple processes to share memory by other means.)

[Aside: as a beneficial side effect, the strict limits on the size of C' programs would discourage anyone from inventing C'++.]

A programming system can guarantee, in other words, that a person can (probably) sit down and pretty quickly read the entire source for any tight coupling of components. Not only can the graph of tightly coupled source be automatically identified ... it is also quite restricted in size.
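A minimal sketch of how a build tool might enforce such a limit, in Python. The per-component line counts and the helper name `over_budget` are made up for illustration; only the 10,000-line figure comes from the C' example.

```python
LOC_LIMIT = 10_000  # the tongue-in-cheek C' limit on lines of code


def over_budget(assemblies, loc, limit=LOC_LIMIT):
    """Return (members, total_loc) for every assembly exceeding `limit`."""
    report = []
    for members in assemblies:
        total = sum(loc[m] for m in members)
        if total > limit:
            report.append((sorted(members), total))
    return report


# Hypothetical line counts per component.
loc = {"A": 4_000, "B": 3_500, "C": 3_000, "D": 9_000, "E": 800}

# Tightly coupled assemblies, however they were identified.
assemblies = [{"A", "B", "C"}, {"D"}, {"E"}]

violations = over_budget(assemblies, loc)
# violations == [(["A", "B", "C"], 10500)]
```

Each component is under the limit on its own here; it is the tightly coupled assembly A, B, C that exceeds it, which is the case the rule is meant to catch.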

What about higher level, loosely coupled compositions? These may be considered in two aspects:

A high level composition has "internally" some source code that may be tightly coupled among itself. The lines of a shell script, for example, all share the same namespace of variables and so let's call that a tight coupling among all those lines of code.

Meanwhile a high level composition like a shell script specifies how other components (like subprocesses) are to be separately, loosely composed.

By implication there is some limit to how the loosely coupled components may be reified in the more intimate, tightly coupled part of the code defining the high level composition.

The shell script example again can help make this concrete: The programs shell scripts compose are reified indirectly as process numbers ("%2"). Sub-processes are not first class values in shell programs. Instead, they are referred to by symbolic names.

It would seem that programming systems for high-level, loosely coupled composition must assume some form of ephemeral environment that is responsible for resolving the symbolic names of loosely coupled components and actually effecting their creation, destruction, and connection.

In shell scripts, the ephemeral environment is implemented by the run-time system in /bin/sh along with the unix kernel. Certainly there are other possible ways.

That ephemeral environment, generally speaking, has to be stateful. The high-level component calls upon the ephemeral environment to create new loosely coupled components, connect them, later destroy them, and so on.

Armed with that understanding, we can now consider the set of all components which are connected by a transitive closure of shared ephemeral environments.

A programming system can again sharply restrict the total source code size of components within the transitive closure of components sharing an ephemeral environment.

Thus, high level compositions are again guaranteed to have short source code, suitable for a person to sit down and read through.

Think of the source code hierarchy here:

Systems are divided at the lowest level into tightly coupled components limited by C' to maybe 10K lines of code each.

At the next level, let's call it shell', at most 10K lines of shell script may control at most 300 lower level programs.

A total system with just those two levels can already comprise a total of 3,010,000 lines of code but at the same time it breaks down as:

10K LOC controlling 3,000,000 LOC. That 3M LOC broken up into 300 relatively isolated components, each with no more than 10,000 LOC.
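The arithmetic generalizes to any number of layers. A minimal Python sketch, assuming every layer reuses the same 10K-line limit and the same fan-out of 300 (the comment only fixes those numbers for the first two levels, and the function name `max_total_loc` is invented here):

```python
def max_total_loc(unit_limit, fanout, levels):
    """Upper bound on total LOC for a hierarchy with `levels` layers.

    Level 1 is a single tightly coupled component of at most `unit_limit`
    lines. Each higher level adds `unit_limit` lines of glue controlling
    at most `fanout` systems of the level below.
    """
    total = unit_limit
    for _ in range(levels - 1):
        total = unit_limit + fanout * total
    return total


two_level = max_total_loc(10_000, 300, 2)    # 3,010,000
three_level = max_total_loc(10_000, 300, 3)  # 903,010,000
```

The bound grows by roughly the fan-out at each layer, while the amount of source any one reader has to hold in their head stays at 10K lines.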

If we want to add another layer to the hierarchy of high level composition we can, for any reason.

This leaves hanging questions about the ephemeral environments: What are the useful varieties of loose-coupling mechanisms? How are they organized into ephemeral environments?

I would guess there are very many distinct designs for ephemeral environments and high level composition, including very many different ways of combining these. Perhaps it is an area where many programming languages will eventually be needed -- but each only meant to be used for small programs, 10K lines or fewer. :-)

decomposition

If I understand correctly, you're essentially suggesting our languages or systems must not only enable composition, but favor a subset that also affords decomposition - ability to pick out small, isolated pieces that we can comprehend (and perhaps reuse) individually.

I believe that decomposition is very important.

Indeed, it has greatly influenced my designs. E.g. my word definitions are acyclic to simplify understanding individual words. Imperative code runs in logically isolated transactions, to prevent entangling of state with control flow. State resources are externalized into a filesystem-like abstraction, albeit one that is capability secure. Long running behaviors are modeled using reactive code, representing agents or rules that continuously observe states for some resources and influence states for others.

I imagine most "sprawl" would arise from having lots of simple rules interacting indirectly. Even if each rule is understood, the whole system's behavior may be surprising. To mitigate this, I favor use of laws, e.g. causal commutativity, spatial idempotence, conservation of duration, capability security. I also keep histories, and recommend development of stable models.

decomposition yes, but

Yes I'm saying decomposition is important but also more specific things, I hope.

For example, the way you describe your systems I think you are implying that an unbounded amount of source code can share state through a single file system abstraction.

In my view, the interconnection via the file system is one of two things. It might be thought of as a tight coupling between components sharing the file system. Or the file system might be thought of as the ephemeral environment that manages a composition of subsumed components.

Either way, the total amount of source code sharing a single file system should be strictly limited. E.g., no more than about 10K LOC sharing a single file system. (Of course, other code might access that same file system, but only through a loose coupling to those at most 10K LOC.)

I think there is also an issue (from the way I look at things) with capabilities:

On the one hand, capabilities must be unforgeable.

On the other hand, I have the sense that "loose coupling" should be defined to say: anything that can be shared via a loose coupling can be forged.

If capabilities can only be shared by tightly coupled components then the total LOC of source that use a given capability is limited to, e.g., 10K lines.

shared state

I'm not convinced of the 'strict limits' argument.

But there is value in some intermediate code, like your proposed 10k, that can protect invariants in a filesystem. Conventional files are byte strings and can't really protect themselves; any agent could save some corrupted data. If files are replaced by purely functional objects, e.g. `data O v = O { update:v → O v; query: v → v }` they'd essentially embed that 10k on a per resource basis. This makes it feasible for more agents to use the resource, since it can protect its own abstractions. (The system I'm developing does this.)
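A minimal Python sketch of the same shape as the `O v` record, to show how such an object can guard its own invariant. The account example, the name `make_account` and the non-negative-balance rule are assumptions chosen for illustration; they are not taken from the system being described.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

V = TypeVar("V")


@dataclass(frozen=True)
class O(Generic[V]):
    """Immutable object: `update` yields a new object, `query` reads it."""
    update: Callable[[V], "O[V]"]
    query: Callable[[V], V]


def make_account(balance: int) -> O[int]:
    """Hypothetical resource whose invariant is a non-negative balance."""
    if balance < 0:
        raise ValueError("balance would become negative")

    def update(delta: int) -> O[int]:
        # The invariant is re-checked when the successor object is built.
        return make_account(balance + delta)

    def query(amount: int) -> int:
        # How much of `amount` could be withdrawn right now.
        return min(amount, balance)

    return O(update=update, query=query)


opened = make_account(10)
after = opened.update(-3)   # a new object; `opened` is unchanged
```

No agent can store a corrupted state: `opened.update(-11)` raises `ValueError` instead of producing an object, and the balance itself is reachable only through the two closures.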

I don't believe that sharing capabilities implies tight coupling, e.g. in case of revocable capabilities. I also don't believe that all forms of shared state involve tight coupling, e.g. tuple spaces and blackboard systems.

A different direction

Everything you are talking about has less to do with PL than with system design.

Have you played with plan9? The "glue" language there is rc (much like sh but simpler). Each process has its own namespace which can be extended at run time. Each server provides a little namespace that can be tacked on to a client's namespace. "Loose coupling" is achieved either by unix style pipes or by a network protocol called 9p, a file access protocol that connects clients with servers. Using 9p you can even mount devices attached to another plan9 system.

I think a plan9 like system with Scheme as its shell language would go much further. Where you want tight coupling, you compile/optimize the heck out. Where you want flexibility you fling about s-exprs. I think such a system should be able to scale efficiently 3-6 orders of magnitude.

IMHO the biggest problem staring us in the face for the past 40 years is distributed/concurrent systems. PLs have not done very much to truly advance the state there because we have been largely distracted by the sequential programming paradigms of functional programming, object oriented programming, logic programming, array programming, database languages and so on. Concurrency is usually an afterthought. Only a handful of languages seem to have even attempted to deal with concurrency & non-determinism head on. Do we have anything newer than actors & CSP? (Go and Erlang fit in the same model). Or maybe PLs don't matter so much. Maybe we need to see what sort of computing structures scale up and work in practice.

Isn't Go the successor to

Isn't Go the successor to PL9?

Plan9 is an OS

The Plan9 kernel is implemented in C and so are many of the standard programs that come with it (the rest are rc scripts). Plan9 has a library called libthread that provided threads and channels much like in Go but not as easy to use (or type safe).

Go's concurrency model is much like CSP. There were earlier concurrent languages from Bell Labs -- alef, newsqueak and limbo -- that also had a similar concurrency model.
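For readers unfamiliar with the model, here is a minimal Python sketch of CSP-style communication: independent threads that share nothing and talk only over channels. It is not Go, libthread, or any of the Bell Labs languages; `queue.Queue(maxsize=1)` merely approximates a channel, and the `DONE` sentinel and the function names are assumptions of this sketch.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a stream


def producer(out_ch, items):
    """Send each item down the channel, then signal completion."""
    for item in items:
        out_ch.put(item)  # blocks while the channel is full
    out_ch.put(DONE)


def squarer(in_ch, out_ch):
    """Receive values, square them, and pass them on."""
    while (item := in_ch.get()) is not DONE:
        out_ch.put(item * item)
    out_ch.put(DONE)


def run_pipeline(items):
    a, b = queue.Queue(maxsize=1), queue.Queue(maxsize=1)
    threads = [
        threading.Thread(target=producer, args=(a, items)),
        threading.Thread(target=squarer, args=(a, b)),
    ]
    for t in threads:
        t.start()
    results = []
    while (item := b.get()) is not DONE:
        results.append(item)
    for t in threads:
        t.join()
    return results


squares = run_pipeline([1, 2, 3, 4])
# squares == [1, 4, 9, 16]
```

The stages never touch each other's state, so in the vocabulary used earlier in the thread they are loosely coupled even though they live in one process.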

Ya, sorry. I got confused.

Ya, sorry. I got confused. Anyways, I would assume that Go represents the collective aggregate of Rob Pike's experience and ideas on how things should work.

system design vs. pl (also universalism)

It seems shocking to me to say:

Everything you are talking about has less to do with PL than with system design.

If programming language theory is not about system design then I don't know what it is.

Also, I want to clarify something about programming by constructive subsumption rather than by specializing abstractions. The abstraction approach is a perpetual quest for universal abstractions. The subsumption approach maintains an assumption that universals don't exist.

Two cases in point:

If someone says something like "Oh, you want something like Plan 9 where everything is glued together by rc and a file system interface," then I think they have missed my point.

The "tell" that they missed my point is the word "everything".

They are thinking of a computational universe of many small tools, perhaps, with an emphasis on loose coupling -- but somehow all subordinate to a couple of universal abstractions (rc and the file system). They are thinking that to solve a problem one starts with those unifying abstractions and tries to specialize a solution: perhaps a file system with special semantics and a specific list of new shell tools.

The (idealized) subsumption approach would never contemplate a question like "Should a file system interface be used as part of the universal communication paradigm for all loosely coupled components?"

On the contrary, the subsumption approach would only ever consider a specific problem, P. P is some specific need for an actual deployed computing system. And then the subsumption approach might ask "Well, we have these file system tools. Are they handy for solving P?"

The abstractionist will think that if the file system abstraction is not quite right for P, and problem P can't be altered to avoid the mismatch, then the next step must be to improve the file system abstraction. The subsumptionist will be more likely to look at entirely different tools, gradually accumulating over time an ad hoc assembly of actually solved problems.

PLT is very central to the polarization of abstract and subsumptionist programming:

Should programming languages strive to facilitate the development of universal abstractions in a unified framework?

Or should programming language theory concentrate on producing small tools to give subsumptionist-style system-builders greater degrees of freedom?

Haskell or YACC, for example?

I mention this, Bakul, because you said:

IMHO the biggest problem staring us in the face for the past 40 years is distributed/concurrent systems.

A strict subsumptionist would have to scoff at the idea of trying to treat a vague abstraction like "distributed/concurrent systems" as a problem to be solved. PLT has no subsumptionist contribution to make to this problem because from a subsumptionist perspective the problem is an illusion -- a wish for a perfect, universal abstraction.

Vast advances have been made in distributed/concurrent systems in practice in the past 40 years. Along with this practice has come theoretical analysis of specific practices.

The biggest advances have contributed nearly nothing to the quest for universal abstractions. I am thinking most specifically of the modern use of MAP/REDUCE-style programming.

As practiced by Google, modern MAP/REDUCE emerged as a conceptually easy subsumption of easily assembled data centers running more or less off-the-shelf operating systems. Subsequent related PLT work has been aimed at constructing and deploying specific tools to make these bits even easier to glue together.

Would anything be added to the MAP/REDUCE world if it were recast in a universalist abstract way? For example, if someone polished up an actor system and argued or demonstrated how the existing MAP/REDUCE stack could be expressed elegantly, top to bottom, with reference to a single mathematical abstraction?

Sometimes, yes, thinking through abstractions like that can lead to insights but mostly, no: the real progress is already done by the subsumption work and there is very little to add by retroactively trying to find a more abstract way to re-express that work.

Very much yes to this:

Or maybe PLs don't matter so much. Maybe we need to see what sort of computing structures scale up and work in practice.

And for that, you need real, ad hoc practical problems driving the work at every stage.

might be over-reacting

I appreciate what you say.

Tho I don't know that, "IMHO the biggest problem staring us in the face for the past 40 years is distributed/concurrent systems." guarantees that Bakul was insisting on some abstractionist approach.

If Field of Study X requires 10x the person-years to study and conquer than Field of Study Y, then Field of Study X is the biggest problem staring us in the face. At least, until we've conquered some vast majority of it.

Distinguishing vague from specific

On the contrary, the subsumption approach would only ever consider a specific problem, P. P is some specific need for an actual deployed computing system. And then the subsumption approach might ask "Well, we have these file system tools. Are they handy for solving P?"

A strict subsumptionist would have to scoff at the idea of trying to treat a vague abstraction like "distributed/concurrent systems" as a problem to be solved.

I sympathize with your appeal toward solving specific problems, but I don't think I agree with whatever method you're using to sift out the specific problems from the vague ones.

A deployable computing system is a rather universal tool already: You can deploy it anywhere and any number of times you want, run it at any point in time, and give it to anyone to use. I think this is usually much more universal than the specific needs demand.

On the other hand, if you're going to include deployable computing systems, it seems odd to draw the line at systems that are only deployed at a single physical location and only do one thing at a time. These are specific limitations of existing programming languages, and a language user may feel a specific need to do away with them. The efforts towards "distributed/concurrent systems" languages are addressing this need.

I guess I was just saying

I guess I was just saying that (IMHO) there is no programming language silver bullet that'll fix the sprawl problem (or even help all that much with it beyond what we already have). As the Plan9 example shows, a system-level vision and architecture can produce a simpler design using existing languages. I mention it rather than, say, any Lisp-based system because it is what I know more about. Now if you let loose 10,000 programmers on a Plan9-based system on some grand project, will it suffer the same sprawl fate? I wouldn't be surprised if it did, but I think at least it stands a better chance of resisting it (if such a project gets strong tech. leadership).

And searching for abstractions is not the same thing as searching for "a universal abstraction". The issue for me is really if we can make do with fewer abstractions (or rather design principles) for a system of some given complexity. So I see what works in practice and see if there are any repeating patterns that can be packaged up in new abstractions or design principles that may help in future. Or if things don't work as well as they should, I might think about why and see if some insight emerges.

I said PLs have not helped much with the concurrency problem because we are still using mutexes and semaphores and condition variables and message passing and threads for concurrent programming. None of these are integrated in any sense in any mainstream PL. [At least Per Brinch Hansen tried to, in his Concurrent Pascal, but that was 40 years ago.] Many concurrency related algorithms are also very old by now. We still have to deal with deadlocks and livelocks. We are still at sea on how exactly to debug when things go wrong in a distributed system (which is what the Internet is) -- these are the kinds of real problems I have had to deal with, so it is not just an abstract moan. The Internet's "glue" is its client/server and other protocols. We still specify these in prose (usually English), bash out some code to implement them, and then debug them by trial and error. We have no formal tools to find (security related or other) bugs in these protocol specs.

To me a programming language's two main functions are a) to provide a notation for expressing computing or logic structures of interest and b) to be understandable or intelligible to a reasonably knowledgeable person. PLs seem to be doing mostly fine with b) but not so with a) when it comes to concurrent or distributed systems.

Your Subsumptionist vs Universalist dichotomy doesn't resonate with me.

systems, sprawls, and PL

"An operating system is a collection of things that don't fit into a language. There shouldn't be one." -- Dan Ingalls

System level visions and architectures are a common aspect of language design. Check out flow-based programming, for example. There is very little difference between designing Plan9 OS vs. designing a language around Plan9's abstraction of processes and state.

Of course, not all languages have a good systems abstraction. Forth can be used as an OS, but AFAICT was never designed to scale beyond one programmer or user. C makes a pretty bad OS, but could potentially be used as one with a Tiny C Compiler.

It seems to me that the poor relationship between language and OS today is the source of a massive amount of accidental complexity. Addressing this really could become a silver bullet. There are a lot of problems worth addressing in PL. Systems problems (security, distribution, deployment, etc.) are certainly among them.

The issue for me is really if we can make do with fewer abstractions (or rather design principles) for a system of some given complexity.

AFAICT, Thomas Lord's conception of components as bit-stream processors is already a fine example of a 'universal abstraction' into which he's 'specializing' everything. I think the important point is that this is an inherently modular abstraction, loosely coupled, easily reusable in many different environments, and easily composed.

I think we can find a small set of simple 'universal' abstractions that share this property of loose coupling, yet adequately cover every use-case we're likely to encounter. Besides stream processing, I might borrow a bit from collections-oriented programming languages, and a little from modal and substructural logic.

PLs have not helped much with the concurrency problem because we are still using mutexes and semaphores and condition variables and message passing and threads for concurrent programming.

That's perhaps true of mainstream PLs. But there are PLs that shift everything to message passing, or have some sort of concurrent dataflow or CSP model built into them, or that use uniqueness types, or that support software transactional memory.

b) to be understandable or intelligible to a reasonably knowledgeable person. PLs seem to be doing mostly fine with b)

Really? I feel there's more often an illusion of comprehension that rapidly breaks down the moment I'm the person hunting down a bug in a large program.

Reading code doesn't scale. And we aren't very good at it. Yet, it seems to be the primary comprehension technique for many mainstream languages.

IMO, the languages that provide the best comprehension are those that help me control and isolate effects and relationships without reading much code, e.g. through the type system, capability security model, immutability by default, confluence, idempotence, or certain forms of tooling (e.g. graphing a function). And very few PLs are optimized for this goal.

Glitch

The next version of Glitch that I'm working on is completely multi-threaded without reader locks (writer locks are still needed, but these never acquire other locks, so there is never any deadlock). The key is that replay is always available when something is read too early (like a transaction) and that the dependency graph is built for change propagation before the read occurs (we never have to wait to do something, we just have to be prepared to do it again). This is based on Jefferson's virtual time (Time Warp) mechanism, which originally targeted distributed simulations.

programming-in-the-large

It seems to me the best way to "support" programming in the large is to address the many problems of scale (in terms of LOC and number of programmers): design languages that resist entanglement of code, that simplify extension and decomposition, that enable integration of heterogeneous systems, that encourage decentralization of authority.

To me, the question isn't "should we support programming-in-the-large" but rather "how do we create a system whose behavior remains efficient, predictable, and maintainable as it is organically extended by thousands of programmers and users?" My best answer to this question involves laws - of locality, induction, conservation - not so different from those we experience in the physical world.

Deduplication is key to Programming In The Large. But HOW?

As far as I can tell the biggest problem we need to solve to support programming in the large is about isolating abstractions from substrate implementation. Which it's not at all obvious how to do. Let me explain.

We've made careers and a profession of implementing abstractions in terms of lower-level abstractions, but what we get from that process aren't "clean" abstractions. They're abstractions that depend on some particular lower-level substrate and set of operations, and we cannot use those abstractions anywhere that particular lower-level substrate isn't available.

We have been trying to scale our abstractions by just importing everything that any of the building blocks we used depended on -- and everything that any of those depended on -- and everything that any of those depended on, etc. Most of this code is redundant, in that the differences are mainly in terms of calling conventions, framework fitting, surface details, and packaging rather than fundamentals. And the size of this transitive closure of inclusions is exponential in terms of the number of layers of abstraction we're using.

In all this exponential size of inclusions, there are also an exponential number of bugs. Bugs that, if and when they get fixed, will be fixed in one or a few places only out of all the places where our abstractions actually use the kind of functionality that they are bugs in, depending on our particular set of import and inclusion paths.

An abstraction ought to be a mathematically clean pattern, not something that depends on having a particular implementation or substrate. We should be able to build our abstractions however we want, but then "re-base" them on different substrates, cleanly and automatically, eliminating layers and deduplicating/unifying everything down to the level of the base API calls.

When we put together a new abstraction built in terms of abstractions from seven different projects built in five different type systems on three different virtual-machines in three different paradigms (but all relying on the same half-dozen basic API calls) the ideal solution is that we ought to get our abstractions in terms of *ONE* set of lower-level abstractions, depending on *ONE* virtual machine and type system, or compiled directly to an executable, with human-maintainable code that will allow future maintainers to treat it as though it had been written that way to start with. But, I have no idea how to do that.

clean substrates and dirty abstractions

Today we tend to build useful abstractions above insecure, unsafe, unportable, fragile, highly stateful, and highly efficient substrates. This causes a lot of problems for us, but at least we get our problems at a very high speed.

Rather than trying to isolate 'clean' abstractions from 'dirty' substrates, I would propose flipping the two. Develop a clean, simple substrate or bedrock abstraction with a lot of nice non-functional properties for maintainability, optimizability, portability, concurrency, distribution, security, and so on. (Awelon Bytecode is my own effort at such a substrate.) Make it highly usable and adequately efficient.

Build your abstractions on this substrate. A useful subset of these abstractions will be clean. Many will be specific to a problem or use-case. There is no reason to assume abstractions should be mathematically clean. Why shouldn't art and music and game rules and so on all be available abstractions? But abstractions should at least have a clean foundation to build upon, such that they can be safely and easily integrated with other abstractions.

An abstraction ought to be a mathematically clean pattern, not something that depends on having a particular implementation or substrate. We should be able to build our abstractions however we want, but then "re-base" them on different substrates, cleanly and automatically, eliminating layers and deduplicating/unifying everything down to the level of the base API calls.

Aside: The idea of 'isolating' your substrate might sound nice, but all you can ever do towards that goal is abstract away from the substrate while making some assumptions about it. If you just take those assumptions as axioms and give those axioms 8-bit names, you've got yourself a bytecode. All that remains is to make it a good one. A good, consistent substrate is much more valuable than would be an abstract substrate that varies in assumptions from one abstraction to another.

No, 'build on something cleaner' is just more of the same.

It would in fact be nice if we had all agreed on a clean substrate or bedrock abstraction and then built everything in terms of it and never discovered that we were wrong or that a different substrate might ever be better for some purposes, and then never had to put together software that was originally built in terms of different substrates because we didn't imagine it would ever be brought together. But that was never going to actually happen, and will never actually happen.

In the first place it requires progress to stop. I mean, seriously, do you suppose we can just 'jump to the end' and all agree on something so good that there's no more progress to be made, ever, in terms of appropriate structures on which to build? I don't. Absolutely every one of those sets of slightly-incompatible abstractions we're struggling with, every element of the transitive closure of inclusions that is the current problem, was an attempt to create a superior set of abstractions on which to build; the fact that we're now bogged down under the weight of a thousand different visions of infrastructure excellence should tell you something about our ability to agree on visions of excellence in infrastructure.

In the second place you're essentially advocating 'pitch everything that's ever been written and start over' which is an approach that usually doesn't scale. In fact it would take so long that yet more visions of infrastructure excellence would be certain to arise before we are even a tenth of the way finished.

In the third place, real projects don't usually have any choice about what substrates and paradigms they have to build things on top of; we get the steaming pile handed us because this is the set of buzzword technologies we have to be compatible with, the set of standard applications we're doing a mashup for/of, the set of servers and clients that we have to fit into, the previous version of our own software, etc. No matter how good a new substrate is, we cannot possibly abandon the code that's written in terms of the substrates of all the crap we have to be compatible with.

A capability-secure bytecode, for example, could be beautiful and perfect, and I would still probably wind up cursing its name, because supporting it won't mean I get to stop supporting anything else. All the other crap it isn't compatible with (or is barely compatible with) is going to have to go on working, and the cap-secure parts will probably get in my way because I'll be replacing things that duplicated or forged authority and passed those authorities on to other sections of the code.

The only way we make progress here, is if we get a handle on what our abstractions mean in a way that makes it possible to replace them seamlessly - to transform something built in terms of one set into something built in terms of a different set.

clean foundation never seriously tried

How many capability secure (or effect-typed), mostly purely functional, substructurally typed bytecodes do you know?

I'm a bit more optimistic about this. I don't believe there's any need to "all agree". It's a lot more work to create nice new abstractions than it is to transcribe them into a new language, so getting a new idea into good shape is feasible even without asking everyone to stop what they're doing.

And if we discover something better? Well, we should have some people work on it, and translate as much existing code to it as feasible. I imagine this will be a lot easier if the existing code is already above a clean substrate that can be cross-compiled.

you're essentially advocating 'pitch everything that's ever been written and start over' [..] it would take so long that yet more visions of infrastructure excellence would be certain to arise before we are even a tenth of the way finished

Not at all. I'm not suggesting we bootstrap from scratch. Just avoid direct dependencies.

real projects don't usually have any choice about what substrates and paradigms they have to build things on top of

There is never a new language or platform that is a viable choice for a 'real project'. Until it is.

The only way we make progress here, is if we get a handle on what our abstractions mean in a way that makes it possible to replace them seamlessly - to transform something built in terms of one set into something built in terms of a different set.

That's what compilers do, isn't it?

Actually more like a 'decompiler.'

That's what compilers do, isn't it?

That is rather a profound observation. The system we need has broad features in common with a compiler.

But unlike most compilers, it needs to raise rather than lower the level of abstraction. A standard compiler takes a medium-high level of abstraction such as a high-level programming language, and lowers the level of abstraction until it can spit out opcodes and system calls.

To make this work we need to go the other way; we need to be able to read some low-level representation (possibly as low-level as opcodes and system calls) and express what it says cleanly and clearly in terms of a chosen paradigm or set of higher-level abstractions.

If we can do that, we can cut out diverse and redundant middle layers of abstraction, breaking the multiplicity of dependences, and replace them with a common set of abstractions.

cross-decompiler

Interesting. I have my doubts that a decompiler can reliably raise the level of abstractions (as opposed to changing how they're expressed). But I suppose even an unreliable design might help extract algorithms buried in other languages and accelerate development of new languages.

intractable

I suspect decompiling is intractable, simply because it would be the kind of optimisation that requires a breadth-first search of all possible algorithm combinations to find the optimum (whether that's fastest, highest level etc.). Whilst it may be possible for toy problems, it seems unlikely to be successful for real software. In order to be solvable there would need to be some kind of Platonic order to algorithms, like chemistry's periodic table. Naively it would seem that any structured theory of algorithms is going to be harder to crack than the primality of numbers.

Another way to look at it is compilers throw away information, in a similar way to multiplication. Decompiling would then seem analogous to factorisation.

Elements of Programming

This is what generic programming is trying to do. The important thing appears to be dependency injection. Stepanov explains it well, algorithms rely on properties, some algorithms require random-access iterators, some just forward iterators. These requirements are fundamental to the algorithm and form a classification of algorithms like rings and groups do in number theory. You can then choose the most efficient algorithm based on what lower level abstractions provide. There should be no dependency on the lower level implementation, just on the generic interfaces.
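As a rough illustration of choosing an algorithm by the properties the lower level provides, here is a minimal Python sketch. It is only an analogy: `nth` and `squares` are hypothetical names, and Python's `collections.abc.Sequence` and `Iterable` stand in for Stepanov's random-access and forward iterator concepts.

```python
from collections.abc import Iterable, Sequence
from itertools import islice


def nth(items: Iterable, n: int):
    """Return the n-th element (0-based), relying only on generic interfaces.

    A Sequence plays the role of a random-access concept: indexing is direct.
    A plain Iterable plays the role of a forward iterator: we step n times.
    """
    if n < 0:
        raise IndexError("n must be non-negative")
    if isinstance(items, Sequence):
        return items[n]  # random access is available, so use it
    # Forward-only fallback: skip n elements, then take the next one.
    for value in islice(items, n, n + 1):
        return value
    raise IndexError("iterable has fewer than n + 1 elements")


def squares():
    """A forward-only source: a generator offers no indexing."""
    i = 0
    while True:
        yield i * i
        i += 1
```

The caller writes `nth(xs, 4)` in both cases; which strategy runs depends on what `xs` can do, not on what `xs` is.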

It seems obvious that these generic interfaces should be global as they express timeless, locationless properties. Algorithms implemented in terms of these properties should probably be global as well. Generic containers also global, but there would seem to be no reason programs (and subprograms) should be global. There are clear differences between generic algorithms and helper functions created to break up longer tasks and structure code. Not every function is ready to be generic. You cannot just write the correct generic algorithm with no domain experience. The reality is you write many similar non-generic functions before you understand the common pattern and can attempt a generic version. We don't need one global namespace cluttered with these non-generic functions. Often programming looks like this:

Specific functions -> Generic algorithms -> Generic interfaces

You need many similar generic algorithms before the right form of a generic interface becomes apparent. The idea that all functions can live in a global namespace seems to ignore the fact that we are always implementing new knowledge domains, programming will never be 'finished' and programming in the large needs to support the new frontiers as well as the old world.

Clutter?

We don't need one global namespace cluttered with these non-generic functions. [..] The idea that all functions can live in a global namespace seems to ignore the fact that we are always implementing new knowledge domains

A global namespace is enormous. There's no reason to assume 'clutter' will ever be a problem.

Even for problem specific 'inner' words, there is much value in exposing them in a global space. They are available for refactoring, and for global searches for common patterns to find new generic words. They are available for decomposition, to create variations on use-case specific words without re-implementing all the details. They are available for extension, to create new APIs to common data structures to meet needs the original author did not anticipate.

This 'best practice' of hiding tools we create for the little problems (utilities, algorithms, business logic, application logic, etc.) has been terrible for us. It creates a lot of pointless rework from one project to another. Even from one function to another, in case of sophisticated functions defined in `let` or `where` clauses.

That said, it isn't a bad idea to emulate a local namespace, e.g. via appending a project or topic name and an IDE option to minimize suffixes in common with whatever word you're defining. This will prevent people from accidentally using or colliding with your use-case specific word, yet still allow it to be refactored into a more common word when people find themselves reinventing it.

(If 'rename' is available as a global refactoring across tons of projects, a lot of issues with choosing names 'wisely' or in anticipation of a future 'generic' version will just evaporate, being replaced by more organic and social forces.)

GUIDs

The thing about a global namespace is that it doesn't have to be based on programmer comprehensible names. Rather, GUIDs could be used that are resolved from programmer used names in the IDE; resolution now becomes a tool concern and can leverage context, type information, search, and so on.

The namespace is then as cluttered as the web...but we are ok with that because no one has to remember URLs anyways.

No clear benefit from GUIDs

Use of GUIDs for names is, of course, a viable possibility. We can do it. But should we?

I've not been convinced that this particular indirection adds any real value. Tools don't much benefit. They generally don't care whether a function word looks like `foo` vs. `guid:base16goesHere`. If you want to associate attributes with words, e.g. an icon to render with, that works just as well either way. I do believe we can benefit greatly from search, context, type information, and so on. But none of that actually benefits from favoring GUIDs. There aren't any security benefits. Humans certainly don't benefit: the system under the hood becomes more opaque, their experiences much more localized, less accessible for teaching and learning and sharing.

A common argument in favor of GUIDs is cheap rename. But how much value does this have, really? Any good system will already have a reverse lookup for purposes like reactivity and zero-button testing. From a given function, you can find every function that uses it, so most of the rename job is already done. Renames will also, naturally, become less frequent as a word becomes more popular. Cheap rename isn't a problem.
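To make the reverse-lookup point concrete, here is a small Python sketch. It assumes a toy codebase where each word maps to the list of words its body uses; `definitions`, `build_reverse_index` and `rename` are hypothetical names, not part of any real tool.

```python
from collections import defaultdict

# Hypothetical codebase: each word maps to the words used in its body.
definitions = {
    "avg": ["sum", "len", "div"],
    "report": ["avg", "fmt"],
    "fmt": [],
}


def build_reverse_index(defs):
    """Map each word to the set of words whose definitions use it."""
    users = defaultdict(set)
    for name, body in defs.items():
        for used in body:
            users[used].add(name)
    return users


def rename(defs, old, new):
    """Rename `old` to `new`, rewriting only the definitions that use it."""
    if new in defs:
        raise ValueError(f"{new!r} is already defined")
    users = build_reverse_index(defs)
    renamed = dict(defs)
    renamed[new] = renamed.pop(old)
    for user in users[old]:
        # A self-recursive word now lives under its new name.
        key = new if user == old else user
        renamed[key] = [new if w == old else w for w in renamed[key]]
    return renamed
```

With the index in hand, `rename(definitions, "avg", "mean")` touches `report` and nothing else.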

I think there may be some use for GUIDs, mostly for artificially created functions, or when importing a codebase and avoiding collisions, or as a temporary name for an anonymous function. But in this case, we can always use `foo.base16goesHere` or similar, keeping some human parts in the name, and renaming the words we use directly.

My point was to detach

My point was to detach naming completely as a compilation concern; where GUIDs are basically names that can't go wrong for the compiler. A function, procedure, or class (and so on) then has a bunch of names that the programmer can use to find it...and once found the compiler only sees a GUID, the GUID is persisted in the source representation that is processed by the compiler. But without GUIDs, you are at the mercy of the compiler: today foo resolves to make tomato soup, tomorrow it resolves to launch the nuclear missile.
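A minimal Python sketch of that split, under the assumption that an IDE-side index maps ambiguous human names to GUIDs while the persisted source holds only the GUID. The `Registry` class and its methods are hypothetical, made up for illustration.

```python
import uuid


class Registry:
    """Hypothetical index: many human names -> GUIDs, one GUID -> one definition."""

    def __init__(self):
        self.by_name = {}  # name -> list of GUIDs (names may be ambiguous)
        self.by_guid = {}  # GUID -> definition

    def publish(self, names, definition):
        guid = uuid.uuid4()
        self.by_guid[guid] = definition
        for name in names:
            self.by_name.setdefault(name, []).append(guid)
        return guid

    def candidates(self, name):
        """What the IDE offers the programmer; a human resolves ambiguity."""
        return list(self.by_name.get(name, []))

    def resolve(self, guid):
        """What the compiler sees: an exact reference that cannot drift."""
        return self.by_guid[guid]


registry = Registry()
soup = registry.publish(["foo", "make_soup"], "make tomato soup")
# The source representation persists the GUID chosen today.
stored_reference = registry.candidates("foo")[0]
# Tomorrow someone else publishes a different `foo`.
registry.publish(["foo"], "launch the missile")
```

After the second `publish`, the name `foo` has two candidates, but `stored_reference` still resolves to the soup.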

There are alternatives; I mean, wikipedia simply requires unique names like "John Smith (explorer)" or "John M. C. Smith". But I think using GUIDs to talk to the compiler and ambiguous names to talk to programmers (where ambiguity can be dealt with manually) is a more elegant solution.

Cryptographic hashes vs GUIDs

It has been mentioned already, but I think that using a function's cryptographic hash (SHA256 or whatever) is much better than using its GUID.

Then, recursively, functions that refer to other functions need to refer to them via their cryptographic hashes. That way, compound functions (modules?) turn into a kind of Merkle DAG (of functions).
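A minimal sketch of that Merkle DAG in Python, assuming a function is identified by the SHA-256 of its source text plus the hashes of the functions it refers to. The `function_hash` helper and the toy source strings are hypothetical, and this is not how Enchilada or any other system actually encodes things.

```python
import hashlib


def function_hash(body, deps):
    """Hash a function body together with the hashes of what it refers to.

    body: source text of the function.
    deps: mapping of referenced name -> hash of the referenced function.
    """
    h = hashlib.sha256(body.encode("utf-8"))
    for name in sorted(deps):  # sorted so the result is order-independent
        h.update(b"\0" + name.encode("utf-8") + b"\0" + deps[name].encode("ascii"))
    return h.hexdigest()


add_v1 = function_hash("a b -> a + b", {})
double_v1 = function_hash("x -> add(x, x)", {"add": add_v1})
quad_v1 = function_hash("x -> double(double(x))", {"double": double_v1})

# Editing the leaf gives it a new identity, and so do all its dependents,
# even though the text of `double` and `quad` did not change.
add_v2 = function_hash("a b -> b + a", {})
double_v2 = function_hash("x -> add(x, x)", {"add": add_v2})
quad_v2 = function_hash("x -> double(double(x))", {"double": double_v2})
```

Note that a function can only be hashed after everything it refers to has been hashed, which is why cycles are a problem.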

But deep cyclic function references are problematic using this scheme, because you cannot calculate the cryptographic hash of a cyclic structure (there are ways, but they do not scale).

Fortunately, this problem can be solved by writing functions in a continuation style, in combination with fixpoint combinators. Alternatively, a term rewriting language can be used to the same effect.
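For the simplest case, self-recursion, the fixpoint idea can be sketched in Python with a strict fixpoint combinator (often called Z). The names `fix` and `factorial_body` are just for illustration, and mutual recursion would need more machinery than this.

```python
def fix(f):
    """Strict (call-by-value) fixpoint combinator, often called Z."""
    return (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))


# The body receives "itself" as the parameter `rec` instead of by name,
# so its text contains no reference to its own identity (or hash).
factorial_body = lambda rec: lambda n: 1 if n == 0 else n * rec(n - 1)

factorial = fix(factorial_body)
```

Since `factorial_body` never mentions `factorial`, its hash can be computed without already knowing it.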

My research language Enchilada is such a term rewriting language, in which every piece of data, code and continuation is referenced through cryptographic hashes (which can be leveraged on top of a DHT).

Link layer

Use of crypto hashes seems useful for linking code. I use it in Awelon Bytecode. But they don't seem very useful at the level of names for human-maintained code. The identities would keep changing.

deep cyclic function references are problematic using this scheme, because you cannot calculate the cryptographic hash of a cyclic structure

Anonymous recursion in terms of fixpoint combinators will avoid this issue.

Indeed, I'm not suggesting

Indeed, I'm not suggesting using crypto hashes directly. Enchilada's internal use of hashes is similar to how Git works, so 'versioning' is cheap.

Anonymous recursion in terms of fixpoint combinators will avoid this issue.

Yeah, that's what I've said in my post.

I don't see a need for

I don't see a need for hashes, I mean, it is useful if the exact same function is formulated twice by accident, but this is very unlikely to happen.

Security

There is one advantage: security.

With crypto hashes, anyone can validate a function reference, by recalculating its crypto hash (recursively). That way, you can know for sure that a function has not been tampered with.
With GUIDs it is less easy to do so.
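A sketch of that recursive validation in Python, assuming a content-addressed store that maps each hash to a body and the hashes it references. `content_hash`, `put` and `verify` are hypothetical helpers, not the scheme Git or BitTorrent actually use.

```python
import hashlib


def content_hash(body, dep_hashes):
    """SHA-256 over the body followed by the raw bytes of each dependency hash."""
    h = hashlib.sha256(body.encode("utf-8"))
    for dep in dep_hashes:
        h.update(bytes.fromhex(dep))
    return h.hexdigest()


def put(store, body, dep_hashes=()):
    """Store a function under its own content hash and return that hash."""
    key = content_hash(body, dep_hashes)
    store[key] = (body, tuple(dep_hashes))
    return key


def verify(store, key):
    """Recompute the hash of `key` and, recursively, of everything it references."""
    if key not in store:
        return False
    body, deps = store[key]
    if content_hash(body, deps) != key:
        return False  # the stored content no longer matches its name
    return all(verify(store, dep) for dep in deps)


store = {}
add = put(store, "a b -> a + b")
double = put(store, "x -> add(x, x)", [add])
```

If someone swaps the body stored under `add`, `verify(store, double)` fails, because the recomputed hash no longer matches the reference.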

Both Git and BitTorrent use a similar scheme to validate data.

tied to implementation

Tying the call to the implementation seems the opposite of what you want to do. How would you fix bugs? I guess there is the assumption that this is an interpreted language? In a compiled language the object code just includes a jump, and there is no point in checking a hash as the malicious coder can just overwrite the hash check with NOPs.

Hashes are fine in Git though. I don't want my source repo tied to my language either.

Implementation details

For absolute security, tying the hash of the implementation to the call site is exactly what you want.
Fixing a bug in function 'A' (hash X) would just generate a new function A (hash Y). Recursively, you also need to generate new versions of the functions that depend on A (its callers).

This can be made to work for an interpreted language (see Enchilada). But for a compiled language, not so easy (unless there is some kind of hardware support that checks the crypto hash of code during execution).

I agree all this is currently not practical, but I want to research the implications of such a model.
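
To make the versioning consequence concrete, here is a small sketch under stated assumptions: function bodies are plain strings, dependencies are acyclic, and the helper names `ref_of` and `build` are hypothetical. It shows that changing A gives new references to everything that transitively calls A and to nothing else.

```python
import hashlib


def ref_of(body, dep_refs):
    """Reference = hash of the body plus the references of what it calls."""
    data = "\n".join([body, *dep_refs]).encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:12]


def build(sources, deps):
    """Return {name: ref}, hashing each function after the ones it calls."""
    refs = {}

    def visit(name):
        if name not in refs:
            refs[name] = ref_of(sources[name], [visit(d) for d in deps[name]])
        return refs[name]

    for name in sources:
        visit(name)
    return refs


deps = {"A": [], "B": ["A"], "C": ["B"], "D": []}
v1 = build({"A": "x + 1", "B": "A(x) * 2", "C": "B(x) - 3", "D": "x"}, deps)
v2 = build({"A": "x + 2", "B": "A(x) * 2", "C": "B(x) - 3", "D": "x"}, deps)

# Only A and its (transitive) callers receive new references.
changed = sorted(name for name in v1 if v1[name] != v2[name])
print(changed)  # ['A', 'B', 'C']
```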

Research

It certainly seems an interesting research topic. I think I prefer the idea of certified (signed) code though, it gives you security guarantees without being tied to the implementation. I can see it being used as a kind of 'tripwire' to detect malicious code changes. A runtime hash can be compared against a signed certified hash for each function, so you don't throw the whole program away but can isolate any unverified code.
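
A sketch of such a tripwire, with loud assumptions: the "signature" here is an HMAC from Python's standard library, used only as a stand-in for a real asymmetric signature scheme, and `SIGNING_KEY`, `certify` and `unverified` are hypothetical names. With an HMAC the checker must hold the secret key, which a real deployment would avoid.

```python
import hashlib
import hmac

SIGNING_KEY = b"demo-key"  # hypothetical; a real system would use a private key


def code_hash(source):
    return hashlib.sha256(source.encode("utf-8")).digest()


def certify(name, source):
    """Issue a certificate binding a function name to the hash of its code."""
    message = name.encode("utf-8") + b"\0" + code_hash(source)
    return hmac.new(SIGNING_KEY, message, hashlib.sha256).digest()


def unverified(loaded, certificates):
    """Return the names whose current code does not match its certificate."""
    bad = []
    for name, source in loaded.items():
        expected = certificates.get(name, b"")
        if not hmac.compare_digest(certify(name, source), expected):
            bad.append(name)
    return sorted(bad)


released = {"parse": "def parse(s): ...", "render": "def render(t): ..."}
certificates = {name: certify(name, src) for name, src in released.items()}

# One function is altered after release; only that one is flagged.
running = dict(released, render="def render(t): exfiltrate(t)")
print(unverified(running, certificates))  # ['render']
```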

Hacker News

This post on Hacker News pretty much says what I was saying, but much better.

Version Control + Metadata Index + Build Tool

I think this kind of thing would be great in a build tool. An online metadata rich database coupled with source code version control (so like a source code indexed github) with code verification would be great.

If the indexing tool could read metadata out of markup in local source files too, and metadata markup in local source files ended up in the index that would work for me too.

I don't think this has anything to do with the language itself or modules. I think this could be language agnostic by preprocessing the metadata out of the files like Knuth's CWEB.

I wonder how many people would be interested in this kind of language agnostic system that could be used with existing file-based languages?

API assumptions and valid state

I think being able to have functions that can only be called internally in a module is an important part of defining an API. Calling the wrong function could break the abstraction or leave it in an invalid state.

secure abstractions

While I like the idea of secure abstractions, I'm not very comfortable with abstractions that are secured based on a third-class concept like where the functions are defined. Capability security, value sealers, closures, etc. offer better options.

Closures for data hiding.

I find those concepts interesting, and I certainly like capabilities for runtime security. Closures however seem to be the same thing and I would be happy with closure based modularity. You can have a function containing local functions, and return an API via a record from the function getting both data and function hiding. How is this different though?
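
For concreteness, a minimal Python sketch of that closure-based style, using the stack example from earlier in the thread. The names `make_stack` and `check_nonempty` are illustrative. Note the assumption: Python closures can still be inspected through `__closure__`, so the hiding here is by convention rather than enforced.

```python
from types import SimpleNamespace


def make_stack():
    items = []  # hidden state: reachable only through the closures below

    def check_nonempty():  # hidden helper: not part of the returned API
        if not items:
            raise IndexError("pop from empty stack")

    def push(x):
        items.append(x)

    def pop():
        check_nonempty()
        return items.pop()

    # The "record" that forms the public API.
    return SimpleNamespace(push=push, pop=pop)


s = make_stack()
s.push(1)
s.push(2)
print(s.pop())  # 2
```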

Securing via closures is an

Securing via closures is an instance of securing based on where a function is defined, ie. lexical scope. As long as "where a function is defined" is simply sugar over such a palatable and safe mechanism, I don't see the problem.

Security properties of

Security properties of closures are based on the interface it provides and the ability to control where it is granted - i.e. it's a form of capability security. Where a closure is defined determines only which information or capabilities it is constructed with, i.e. what it might be expected to secure. First class functions would serve at least as well for this purpose.

I don't disagree with

I don't disagree with that...we are only talking about top-level namespaces where sharing could possibly occur. Given a function whose scope is limited by encapsulation, there is no reason to propagate it to a top level namespace.

instead of programming in the large

Let me first say what I mean by "sprawl":

Large systems are built out of small components that are somehow composed into the larger thing.

Conventionally this might be multiple functions combined into a module, or multiple modules combined into a program (and so on).

Composed components can be compared by how tight or loose is their coupling.

For example, if two modules contain mutually recursive functions and share state in idiosyncratic ways this is a tighter coupling than if two programs are composed in a unix pipeline.

Also: Large compositions (many components, many connections between components) can be compared by the regularity or the complexity of the composition.

For example, if a large program is comprised of many modules, and there are ad hoc groupings of modules that are all mutually interconnected, that's a complex composition compared to a simple shell script that combines several programs using pipelines and files in a more regular way.

"Sprawl" is the problem of large programs characterized by overly-complex, overly-tight composition.

p.s. I am not arguing that we should all use short shell scripts and pipelines more. Those are just recognizable examples of loose coupling and more regular composition.

---------------

That's what I mean by sprawl. How does this stand in relation to programming language theory and design?

In programming language theory there's a multi-perspective recognition that, abstractly, "composition is composition". Unix pipelines and module composition may look very different, but they both conform (deep down) to the same mathematical models. All composition is "the same thing" in this abstract view.

So there is a temptation to wonder if that universality of composition shouldn't be reflected more or less directly in programming languages.

This is the instinct that might say "Well I have lambda and this module thing other people talk about sounds useful. Let me see how I can implement modules as syntactic sugar over lambda!" Or it might say "All named code entities should exist in a single planetary-wide name-space over which composition operators can act."

On the one hand, "scaling" a language in such ways certainly does make it easier to write sprawl programs.

For example, both a planetary-scale code naming system and conventional module systems simplify creating bureaucratic solutions to the problem of "name collisions" in sprawled code bases.

On the other hand, "scaling" a language in such ways begs the question of whether large scale, regular and loosely coupled compositions actually need solutions to the problems modules and planetary name-spaces aim to solve.

-----------------

Perhaps large programs should grow from small ones not by the addition of code so much as by the subsumption of smaller programs into larger compositions that are qualitatively different in nature.

In a program built by subsumption, all of the subsumed smaller components are of necessity only ever loosely coupled. No mechanism for their tight coupling need be offered. (And if the code describing a composition-by-subsumption is not itself sprawl code, the overall composition is more likely to be regular rather than complex.)

When programs grow by subsumption, the seemingly hard problem of name-space management may simply disappear. No component is very large and so there is no problem with name collisions.

Nothing is black and white and I'm sure people can argue that subsumption approaches to composition are perfectly possible within the disciplines of modules or planetary name-spaces. At a certain nihilistic level of mathematical abstraction all distinctions disappear into a shimmering lambda in the sky, or something.

But that kind of abstraction neglects to pay attention to the social impact of programming language design.

The hypothesis that large programs should be built by subsumption suggests that programming languages don't need to "scale" at all.

Coupling is in the act of writing programs

In a program built by subsumption, all of the subsumed smaller components are of necessity only ever loosely coupled. No mechanism for their tight coupling need be offered. (And if the code describing a composition-by-subsumption is not itself sprawl code, the overall composition is more likely to be regular rather than complex.)

When programs grow by subsumption, the seemingly hard problem of name-space management may simply disappear. No component is very large and so there is no problem with name collisions.

I don't exactly understand what you mean by "subsumption," but it sounds like you're advocating a style of program composition where the pieces are interchangeable (loose coupling). Doesn't that mean every component's surface area needs crufty layers of content negotiation? Doesn't it mean small utility functions effectively can't be shared, since the complexity of asking for that function will rival the complexity of implementing it oneself? I don't see this reducing the size of components at all. (Which isn't to say I see any reason components should be small!)

For questions like "whether or not [sprawl] is a sensible and ethical priority," my opinion is that the very concept of sharing code (much less black-box programs) is too simplistic. It's not realistic to assume zero coupling between the code user and the code author. If those two people choose to restrict their negotiations to a single document of code, they're going to have problems. The problems aren't always severe or even noticeable for the direct participants, but I feel ethically driven to look for programming technology that avoids or reimagines the author-and-document style of programming.

I might very much appreciate a subsumption architecture approach: Don't set out to write code (a symbolic representation of the world), and instead cut to the chase... somehow. Does this tie in with what you're saying about "subsumption"?

Brooks' use of subsumption

Brooks' use of subsumption is quite different than standard PL usage, with the former having to do more with reflexes and run-time negotiation, and the latter being much more of a static property. I wrote a paper that used the subsumption architecture (in the Kodu sense), and I like it a lot, but I don't know how to apply it beyond games:

Coding at the speed of touch

Subsumption

I'm also having difficulty envisioning what you mean by subsumption.

Also, I don't really see how grouping small programs into larger composites would reduce need for named abstractions, i.e. such that you can use the same smaller programs in many different larger composites, even if we keep the elements loosely coupled.

there's a multi-perspective recognition that, abstractly, "composition is composition". [..] All composition is "the same thing" in this abstract view.

Well, I certainly feel there are important qualitative differences between algebraic composition (common in FP) and more ad-hoc composition (common in OOP). I am interested in the properties of what you're calling subsumption.

So far, I only understand that 'algebraic closure' is not a property of subsumption, i.e. that the composition is 'qualitatively different' in some important way from its constituent elements.

meaning of "subsumption" & "loosely coupled"

"Subsumption" can be described by (informal) inductive construction:

* Base: There is some set of primitive "components".

Every component has a fixed, finite set of interfaces.

* Induction: New components are formed by making linkages between the interfaces of earlier constructed components.

Example:

A `grep' process and a `wc' process might be viewed as "primitive components". A new component is formed by linking the standard output of grep to the standard input of wc.

* About Loose Couplings

An interface is a "loose coupling" if it satisfies these two conditions:

1. The interface is serializable as a bidirectional stream of bits: input and output. (In principle, even if not practically).

2. The interface is subject to the constraint that no run-time guarantees exist about what bits may be sent as input to an interface.

Inputs to loose coupling interfaces are essentially type-less streams of bits.

* Subsumption Induction

If a component is formed by linking earlier constructed components using only loose couplings, then the new component is a "subsumption" of the earlier constructed components.

* Reflective Induction

An "ephemeral environment" is a particular kind of loosely coupled component.

An ephemeral environment is special because it can instantiate new components at run-time: either new instances of primitive components or new subsumptions of earlier constructed components.

The inputs to an ephemeral environment are interpreted as commands about what new components to create (a la /bin/sh).

Client components of an ephemeral environment are called its "control program".

New components created by an ephemeral environment are called its "child components".

An ephemeral environment can form links between the interfaces of its child components.

An ephemeral environment may be able to instantiate new ephemeral environments (as child components).
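
A toy sketch of an ephemeral environment, assuming byte strings instead of bit streams and a made-up command syntax (`name | name | ...`). `PRIMITIVES` and `environment` are hypothetical names. The table of component names is local to this one environment, and the command input is treated as untyped: anything may arrive and must be checked.

```python
# Primitive components known to this environment only.
PRIMITIVES = {
    "upper": lambda data: data.upper(),
    "reverse": lambda data: data[::-1],
    "double": lambda data: data + data,
}


def environment(command: bytes):
    """Interpret a command stream such as b"upper | reverse" and return the
    child component built by linking the named primitives in order."""
    stages = []
    for word in command.decode("ascii", errors="replace").split("|"):
        name = word.strip()
        if name not in PRIMITIVES:  # untyped input: anything may arrive
            raise ValueError(f"unknown component: {name!r}")
        stages.append(PRIMITIVES[name])

    def child(data: bytes) -> bytes:
        for stage in stages:
            data = stage(data)
        return data

    return child


pipeline = environment(b"upper | reverse")
print(pipeline(b"abc"))  # b'CBA'
```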

* Semantic Induction

A loosely coupled component's semantics can be described by a function that translates input bitstreams to output bitstreams.

When a new component is formed by subsumption, the semantics of the new component are a straightforward composition of the semantic functions of the old components.

The semantics of a running system decompose into bitstream functions in a way that exactly mirrors the run-time construction of components in that system.
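
The grep/wc example can be sketched as plain function composition. This is an approximation under stated assumptions: streams are whole byte strings rather than incremental bit streams, links are one-directional, and `grep`, `wc_lines` and `link` are illustrative names, not the real Unix tools.

```python
def grep(pattern: bytes):
    """Component: keep only the lines that contain `pattern`."""
    def run(stream: bytes) -> bytes:
        lines = stream.splitlines(keepends=True)
        return b"".join(line for line in lines if pattern in line)
    return run


def wc_lines(stream: bytes) -> bytes:
    """Component: count newline bytes, roughly like `wc -l`."""
    return str(stream.count(b"\n")).encode("ascii") + b"\n"


def link(upstream, downstream):
    """Subsumption of two components: the semantics of the result is the
    composition of the two bytes-to-bytes functions."""
    return lambda stream: downstream(upstream(stream))


count_errors = link(grep(b"error"), wc_lines)
print(count_errors(b"ok\nerror: disk\nok\nerror: net\n"))  # b'2\n'
```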

* Source Limits

Without loss of generality a programming system can restrict the source code sizes of primitive components.

A system can also limit the number of components in an ephemeral environment's control program.

A system can also limit the number of primitive components an ephemeral environment is able to instantiate.

The value of such restrictions is the way that they coincide with the schematic nature of the semantics:

A system semantically decomposes into a hierarchy of bitstream functions. Source limits ensure that the definition of each function in the hierarchy is a human-friendly quantity of code.

* No Need for Global Names

The scope of each identifier in source code is limited to the source code of a primitive component.

If the source size limits are low enough then it is practical to use code libraries by copying source out of the library, renaming things as necessary, rather than linking to the library by name.

There is no need for a global namespace of any kind.

Rodney Brooks's subsumption

Rodney Brooks's subsumption architecture for robotics is similar but is more dynamic in its realization. Basically, a higher priority behavior subsumes a lower priority one for any resource in contention, like the movement actuator.

For PL subsumption, namespaces are not really involved, even when subsumption is nominal via explicit is-a relations.

No namespace

The claim that there's no need for a global namespace is too strong. By adopting an "everything is a bitstream" model, you drive a lot of semantics underground, but you certainly haven't gotten rid of it. To continue with your running example of unix tools, one can certainly view them as loosely coupled components dealing with pure bitstreams, but this ignores a rich world of conventions. You may claim that this is irrelevant, but when some tools view 0x0a as a line separator, and some expect 0x0d0a, you will surely change your tune. And what is this if not a global namespace?

You may say this doesn't constitute coupling, but isn't it really that you've just hidden it?

Personally, I view this tension as an essential and fundamental tradeoff that can only be managed, not eliminated. I broadly agree with your aesthetics of software design, but I think one should be careful to temper one's claims...

bitstream types

The claim that there's no need for a global namespace is too strong. By adopting an "everything is a bitstream" model, you drive a lot of semantics underground

I'm not sure what it might mean to "drive a lot of semantics underground".

If you say there is some implicit, hidden need for global names can you please point to an element of the construction that needs them?

I think you are observing that sometimes typeless interfaces are usefully characterized as supporting one or more protocols. For example, typeless bytestreams may connect FTP clients and servers but the utility of connecting them arises if each tolerantly receives and strictly transmits FTP protocol according to some RFC. Even the choice of line endings (0x0a vs 0x0d0a) could be regarded as a very simple protocol.

It seems to me that types of this sort are extensionally defined subsets of typeless I/O.

If you want to verify a schematic by ensuring that connected components use compatible protocols it seems to me that each component can usefully be accompanied by a copy of the specification of its purported extensional type, along with a proof that its outputs conform to the purported protocols if the inputs do as well.

When two components are joined in a subsumption their protocols can be compared and checked for extensional compatibility.

None of that requires a global namespace, though. Just as we copied library code into our small source texts rather than link to them by name, so too we copy in protocol definitions rather than link to them by name.
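
The line-ending case can be made concrete with a small sketch. The function names are hypothetical and the streams are whole byte strings. It shows two receivers that assume different line-ending protocols, plus a filter that receives tolerantly (LF or CRLF) and transmits strictly (LF only).

```python
def count_lines_lf(stream: bytes) -> int:
    """Receiver that assumes the LF protocol: every line ends in 0x0a."""
    return len(stream.split(b"\n")) - 1


def count_lines_crlf(stream: bytes) -> int:
    """Receiver that assumes the CRLF protocol: every line ends in 0x0d0a."""
    return len(stream.split(b"\r\n")) - 1


def normalize(stream: bytes) -> bytes:
    """Tolerant receive (LF or CRLF), strict transmit (LF only)."""
    return stream.replace(b"\r\n", b"\n")


unix_text = b"one\ntwo\n"
dos_text = b"one\r\ntwo\r\n"

# The link itself works, but the protocols do not match: no lines are seen.
print(count_lines_crlf(unix_text))  # 0

# After normalization the LF receiver sees both lines.
print(count_lines_lf(normalize(dos_text)))  # 2
```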

Talking past one another

If you say there is some implicit, hidden need for global names can you please point to an element of the construction that needs them?

Yes, what is the name, in bytes, of "line separator"? That is an element of a global namespace, in my view.

It seems to me that there's hardly any difference between a complex interface composed of single-purpose functions (tighter coupling), and a simpler interface composed of a single function which implements a complex protocol. You can say "I've reduced coupling, because the whole interface is just this one function which accepts arbitrary bits as input", but then you immediately need to hand me the protocol spec alongside so that I can figure out what I can actually do with the function. It's in that sense that semantics is being driven underground, from "the interface" into "the protocol."

Anyway, this is a tangential point and I don't think it's worth arguing any further.

Name that line separator....

Yes, what is the name, in bytes, of "line separator"? That is an element of a global namespace, in my view.

Nothing in the inductive construction of components needs any such name.

In other areas of life people can give those line-terminator bytes any name they like. For example in protocol description documents. (See below.)

When constructing components as described, those names have no use.

It seems to me that there's hardly any difference between a complex interface composed of single-purpose functions (tighter coupling), and a simpler interface composed of a single function which implements a complex protocol.

I think that's mostly but not entirely right.

Think of it this way: A loose coupling between two components is equivalent to a possibly-noisy discrete communications channel (a la Claude Shannon).

Tightly coupled linkages can not be modeled as such channels. For example, if two components reliably share some memory, that linkage between them can not be rigorously modeled as a possibly-noisy discrete channel. The two components that share memory are tightly coupled.

You can say "I've reduced coupling, because the whole interface is just this one function which accepts arbitrary bits as input", but then you immediately need to hand me the protocol spec alongside so that I can figure out what I can actually do with the function.

Sure. It's helpful to accompany code with documentation that describes what the code does. You can annotate code with protocol information, formally or just in documentation.

The accompanying documentation or annotation doesn't alter the semantics of the code.

It's in that sense that semantics is being driven underground, from "the interface" into "the protocol."

If you prefer, you can write the components you create in a language that includes "protocol annotations" which are machine checked. That will help you write components that are "strict in what they transmit".

Your components must at the same time assume only typeless inputs. They must be "tolerant in what they receive".

When you publish source for your components it will still be size-limited into human-friendly chunks. A person can sit down and examine your "10K LOC" to see the entirety of your protocol implementations.

And nothing at run-time guarantees that your inputs and outputs will be linked to other components using the same protocols you expect. Subsumptions can ignore your protocol annotations.

The namespace wasn't literal, but it's real.

I think maybe you didn't see what he was saying about the global namespace.

Applications that communicate via lines of text have to agree on the format for the text. If one of them "names" the line separator anything other than 0x0a (or U+000A if you prefer) then the others won't be able to understand it correctly, and/or it won't be able to understand them.

Ultimately, in order for two (or more) pieces of software to be coupled at all, there has to be some form of communication. If the communication is by making variables and procedure calls available to each other, that's a namespace. If they communicate by making raw binary data visible to each other, they still have to agree on how significant concepts such as end-of-line are represented, and that too is arguably a namespace, although the "names" as such are fairly implicit due to the nearly universal adoption of a uniform text representation.

re: subsumption

interface is serializable as a bidirectional stream of bits

Why 'bit streams' and not some other basic unit of information? Would it make a significant difference to your idea of subsumption if it were a stream of unicode characters? Or a singular, generic value-object in the sense of JSON or YAML? Or a time-varying FRP signal?

I'm going to assume the answer is 'no', that you just chose bit streams for this particular informal example.

no run-time guarantees exist about what bits may be sent as input to an interface

It isn't clear to me that adding a type system to, say, unix pipes would have any significant impact on their loosely coupled nature. What makes you believe this is an essential property for loose coupling?

I suppose if it's an issue of 'named types' (whose definitions might later change, inducing a form of coupling), that we might want to stick with structural typing.

inputs to an ephemeral environment are interpreted as commands about what new components to create

If I understand correctly, a VM would be an example of an ephemeral environment, and individual bytecodes could be interpreted as commands. Especially convenient for a streamable bytecode.

Or, at a higher level, a Forth interpreter could be considered one ephemeral environment, accepting a stream of definitions and command words or sentences. Conveniently, a subset of Forth excluding recursion is also comprehensible as a subsumption model.

Source limits ensure that the definition of each function in the hierarchy is a human-friendly quantity of code.

In Forth-like languages, typical words have definitions of fewer than fifteen words. If you enforced a source-code limit of ten words, you could still write any Forth program, but you'd force a certain amount of factoring larger programs into smaller words. OTOH, you'd probably annoy some people whose function could be written easily in eleven or twelve words. If you raised the limit to thirty words, the annoyance factor would be minimized, but thirty word sentences in Forth can be difficult to comprehend.

It seems to me that, with your 10k limit and a more verbose language like C, you're aiming for something like those thirty word sentences... a balance where programs are large enough to rarely annoy but small enough to still be comprehensible with effort.
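
A word-count limit of the kind discussed above could be checked mechanically. The following is a rough sketch, assuming Forth source that contains only whitespace-separated tokens and `: name ... ;` definitions (no comments, strings or nested definitions). `long_definitions` is a hypothetical helper, not part of any Forth system.

```python
def long_definitions(source: str, limit: int):
    """Return {word: length} for colon definitions longer than `limit` words."""
    tokens = source.split()
    too_long = {}
    i = 0
    while i < len(tokens):
        if tokens[i] == ":" and i + 1 < len(tokens):
            # The definition runs up to the next ";" (or to the end of input).
            end = tokens.index(";", i) if ";" in tokens[i:] else len(tokens)
            name, body = tokens[i + 1], tokens[i + 2:end]
            if len(body) > limit:
                too_long[name] = len(body)
            i = end
        i += 1
    return too_long


source = """
: square dup * ;
: cube dup dup * * ;
: poly dup dup * swap 3 * + 7 + ;
"""
print(long_definitions(source, 5))  # {'poly': 9}
```

Raising the limit to 10 accepts all three definitions. Lowering it to 3 would also flag `cube`, which is the "eleven or twelve words" annoyance in miniature.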

The system you described earlier gave me the impression of being contradictory to this goal. What was it? Probably exemplified in this sentence: "If we want to add another layer to the hierarchy of high level composition we can, for any reason." In particular, you mentioned adding levels at the top, for "high level" composition, as opposed to adding levels at the middle or bottom, for "low level" decomposition and refactoring.

Naturally, if we're limited to 10k per object at a given level, and we can only add levels at the top, this creates pressure to use all 10k near the bottom, copying and pasting and replicating tons of logic from one low-level component to another.

It might be that I misinterpreted you, but you still seem to be saying "there is no need for a global namespace", and you even suggest copying source into low-level components.

scope of each identifier in source code is limited to the source code of a primitive component [..] no need for a global namespace of any kind

Oh? What about non-primitive components? Do those not identify other components, primitive and otherwise, that are defined externally to source code?

In Forth, it would be impractical (and maybe even impossible) to factor a large word into multiple smaller words, without having a namespace of words. In your own examples, it seems similarly impractical to refactor logic into fine grained, loosely coupled components without developing a namespace of components.

Is there any reason that the space-of-component-names shouldn't be global? Seems to me we only need to prevent cycles to keep a clean hierarchy.

practical to use code libraries by copying source out of the library, renaming things as necessary

It seems to me that, in your system, the most important 'library' would be the component library. And the most important 'names' would be the component names.

re: re: subsumption

Why 'bit streams' and not some other basic unit of information? Would it make a significant difference to your idea of subsumption if it were a stream of unicode characters? ... JSON or YAML? Or a time-varying FRP signal?

My concept of "streams" between loosely coupled components is based on physical realism:

A stream between loosely coupled components is a possibly-noisy discrete communication channel that transmits elements drawn from an arbitrary finite alphabet.

A bitstream uses the alphabet {0, 1}.

A stream of unicode codepoints might use the alphabet {0 .. 2^21 - 1}.

"JSON values" is not a finite set (is it)? The constraint of physical realism means that JSON values could not be used as the alphabet for a stream.

It isn't clear to me that adding a type system to unix pipes would have any significant impact on their loosely coupled nature.

In a reply to Matt Hellige just a few minutes ago I explained how types of a sort can be added to loose couplings. I'll briefly restate it and add something:

The source for a component can be accompanied by a proof that its output streams conform to certain extensional types.

For example, a source might come with proofs that its output conforms to various grammars.

Programming systems can assist with, check, and compare such proofs in any way that happens to be convenient.

The addition or removal of such proofs does not change the semantics of any component. The semantics strictly assume untyped streams of input and place no type restrictions on output produced.

Or, at a higher level, a Forth interpreter together with a massive database of words could be considered one ephemeral environment, accepting a stream of words or sentences.

I think you have the right idea here except for one key detail: the "massive database of words".

An ephemeral environment is a component. If that component had a "massive database of FORTH words" then the total size of source code defining the semantics of the component would itself be "massive".

The limits on source code sizes to human-friendly scales prevents this.

A FORTH interpreter, accessed exclusively over serial lines (the loose couplings), with a small database of FORTH words could be viewed as an ephemeral environment.

It seems to me that, with your 10k limit and a more verbose language like C, you're aiming for a balance where programs are large enough to rarely annoy but small enough to still be comprehensible.

That is exactly right.

The system you described earlier gave me the impression of being contradictory to this goal. What was it? Probably exemplified in this sentence: "If we want to add another layer to the hierarchy of high level composition we can, for any reason." In particular, you mentioned adding levels at the top, for "high level" composition, as opposed to adding levels at the bottom, for "low level" decomposition and refactoring.

Consider size limits as an example:

Size limits on ephemeral environments may make it too confining to try to build a complete system within a single ephemeral environment. We can always get around that by composing ephemeral environments into subsumptions within "higher level" ephemeral environments.

We can also add "higher level" ephemeral environments for any reason, not just to work around size limits.

Naturally, if we're limited to 10k per object at a given level, and we can only add levels at the top, this creates pressure to use all 10k near the bottom, replicating lots of logic from one component to another.

Yes, I think that replication of small pieces of code is a good thing.

It might be that I misinterpreted you, but you still seem to be saying "there is no need for a global namespace".

Since code sharing is done by copying of small snippets, alpha-renaming as-needed is practical.

In Forth, it would be impractical (and maybe even impossible) to factor a large word into multiple smaller words, without having a namespace of words.

But those namespaces and the amount of code in their scope can be quite small. Naming collisions are not a problem and are easy to fix by hand if they happen.

In your own examples, it seems similarly impractical to refactor logic into fine grained, loosely coupled components without developing a namespace of components.

The scope of those namespaces is always limited to one ephemeral environment, or one primitive component.

The total size of the source code defining one of those namespaces is always human scale. These namespaces are small.

No elaborate global naming scheme (like a "module system") is needed.

Is there any reason that the space-of-component-names shouldn't be global?

That would make the semantics of ephemeral environments unconstructable.

There are benefits for making it global, in the sense that a component can identify any other component, perhaps with a limitation that cycles are prevented (to keep a clean hierarchy).

Circular linkings of loosely coupled component inputs and outputs can be useful ("feedback"). There is no reason to prevent it.

Also, when you said "a component can identify any other component" I don't know what you mean to include in the scope of "any other component".

It seems to me that, in your system, the most important 'library' would be the component library.

Components in source code form are natural "units" of source code publication.

Code publications can be accompanied by meta-data and curated into "libraries".

There is no reason why libraries need to be broken up into "modules". Libraries can even contain name collisions.

Materials re-used from libraries are re-used by copying small amounts of code, alpha-renaming as-needed.
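
As an illustration of copy-and-rename reuse, here is a sketch that uses Python's own `ast` module to rename a function in a copied snippet. The helper names `Rename` and `copy_with_renames` and the sample snippet are assumptions. It is deliberately naive: it renames function definitions and name references everywhere in the snippet and does no scope analysis, which is only reasonable because the snippet is small.

```python
import ast


class Rename(ast.NodeTransformer):
    """Rename function definitions and name references per `mapping`."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)  # also rename uses inside the body
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node


def copy_with_renames(snippet, mapping):
    tree = Rename(mapping).visit(ast.parse(snippet))
    return ast.unparse(tree)  # ast.unparse requires Python 3.9 or later


library_snippet = """
def depth(tree):
    return 0 if tree is None else 1 + max(depth(tree[0]), depth(tree[1]))
"""

# The local code already uses the name `depth`, so the copy is renamed.
local_copy = copy_with_renames(library_snippet, {"depth": "tree_depth"})
print(local_copy)
```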

replication of small pieces

replication of small pieces of code is a good thing

I'm rather neutral towards repetition in code (if I can't think of a good name for a common sequence of words, I'll just replicate it). But I think your estimation of "small pieces" is optimistic here.

I wouldn't be surprised to see about 8k of that 10k, on average, used for replication of protocols and data structures. I wouldn't be surprised if that 8k was obscure and minified, to make it as tight as possible to fit your artificial limits.

When you create pressure (in this case to squeeze in more at the bottom) it never applies evenly to all users.

The addition or removal of such proofs does not change the semantics of any component.

I do feel that pluggable types / linters are a very nice direction to go, keeping them separate from the program semantics.

Those namespaces and the amount of code in their scope can be quite small. Naming collisions are not a problem and are easy to fix by hand if they happen.

The latter sentence is also true of large, shared namespaces, Ward's wiki for example. Keeping namespaces small is perhaps useful for ensuring easy comprehension of the whole thing, if that's your goal. But collisions are not a scaling concern.

scope of those namespaces is always limited to one ephemeral environment, or one primitive component.

I suppose that, if I wanted to cheat your constraints, I could create a trie of ephemeral environments to represent an arbitrarily deep namespace. Each call would take a bit stream representing which word to call and the argument, and would return some intermediate results and a list of more words to call.

Naturally, neither this approach nor minified replicated code would be particularly amenable to your goals of making things more comprehensible to users.

Circular linkings of loosely coupled component inputs and outputs can be useful ("feedback"). There is no reason to prevent it.

I was speaking of circular dependencies, which would be a violation of hierarchy. Circular linking is a separate issue. (A fixpoint combinator is an example that demonstrates circular linking without circular dependencies.)
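To make the fixpoint remark concrete, here is a minimal sketch in Python (my choice of language, not anything from the systems under discussion). The names `Z`, `fact_step`, and `factorial` are hypothetical. Neither `Z` nor `fact_step` refers to itself or to the other by name, so the dependency hierarchy stays acyclic; the feedback loop exists only in how the pieces are wired together.

```python
def Z(f):
    """Strict fixpoint combinator: wires f's output back into its own input."""
    return (lambda x: f(lambda v: x(x)(v)))(lambda x: f(lambda v: x(x)(v)))


def fact_step(self_ref):
    """One step of factorial. The recursive call goes through the parameter,
    so this definition depends on no named function at all."""
    return lambda n: 1 if n == 0 else n * self_ref(n - 1)


# The circular link is created here, at wiring time.
factorial = Z(fact_step)

print(factorial(5))  # 120
```

The only dependencies are `factorial -> Z` and `factorial -> fact_step`, which form a tree.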

Components in source code form are natural "units" of source code publication. Code publications can be accompanied by meta-data and curated into "libraries". There is no reason why libraries need to be broken up into "modules".

Uh, by every generic definition of 'module' I know, your components ARE modules.

Materials re-used from libraries are re-used by copying small amounts of code, alpha-renaming as-needed.

Or by copying whole components, apparently, since every instance of ephemeral environment `bash` will apparently need its own copy of `grep` and so on according to your "local namespace only" rule. And probably by copying very large amounts of code, minified, for working with complex sensor and actuator protocols, or HTTP with its dozen headers, and so on.

I'm curious what makes you so confident in the "small amounts of code" claim.

small pieces

I suppose that, if I wanted to cheat your constraints, I could create a trie of ephemeral environments to represent an arbitrarily deep namespace. Each call would take a bit stream representing which word to call and the argument, and would return some intermediate results and a list of more words to call.

Sure, something like that should work. (Not a "cheat".)

Uh, by every generic definition of 'module' I know, your components ARE modules.

Abstractly, fine. Still, components don't get involved with names in ways that create any of the issues Joe Armstrong wrote about.

Also, I think programming with components is stylistically different from programming with modules.

With modules, programs proceed from abstractions to specializations. For example a module definition declares an abstract interface; each implementation specializes the abstraction.

Module client code also proceeds from abstract to specialized. Client code first picks an abstract module interface and then specializes by choosing from among the implementations.

At every step, module programmers look for a ready-made abstraction of the problem they are trying to solve, and then to pick from existing specializations of that abstraction.

Programming by subsumption goes in the opposite direction: from the specialized and concrete to the more abstract. New subsumption programs are always and only ever built by taking earlier programs and composing them in new, loosely-coupled "schematics".

At every step, subsumption programmers look for ready-made pre-built lower level components that can be quickly assembled into a custom solution to the problem at hand.

As a concrete example: Suppose that a module programmer and a component programmer are each working on a compiler and each need some kind of "symbol table" part.

A module programmer might boast that he found an abstract "symbol table" data type in one of his modules, and that with just a few quick subclasses (or template instantiations, or whatever) he can specialize it for his compiler.

A component programmer on the other hand might boast that he has a big library of basic components providing stacks, queues, simple hash tables, and so on; and that from those more basic components it's always easy to make a custom symbol table with just a little bit of glue code -- no abstract "symbol table" data type needed.
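As a rough illustration of the "little bit of glue code" (written in Python rather than in a subsumption language, and with every name hypothetical), a scoped symbol table can be assembled from nothing more than a stack and some hash tables:

```python
class ScopedSymbolTable:
    """Glue code only: a stack (list) whose elements are hash tables (dicts)."""

    def __init__(self):
        self._scopes = [{}]          # the bottom entry is the global scope

    def enter_scope(self):
        self._scopes.append({})      # push a fresh hash table

    def exit_scope(self):
        self._scopes.pop()           # discard the innermost scope

    def define(self, name, info):
        self._scopes[-1][name] = info

    def lookup(self, name):
        # Search from the innermost scope outwards.
        for scope in reversed(self._scopes):
            if name in scope:
                return scope[name]
        return None


table = ScopedSymbolTable()
table.define("x", "int")
table.enter_scope()
table.define("x", "float")           # shadows the outer x
inner = table.lookup("x")            # "float"
table.exit_scope()
outer = table.lookup("x")            # "int"
```

This is only a sketch of the style; a real compiler would want more than strings as symbol information.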

I'm curious what makes you so confident in the "small amounts of code" claim.

Somewhat confident, not "so" confident. I've done some experiments programming "component style" by hand-translation of a subsumption language into C. I'm working on implementing the language itself, now.

modules and components

My impression: your intuition for 'module' has a very ML-ish flavor. This makes sense. ML modules have influenced a lot of other languages. Stream-processing components are modular in every sense that matters, but are a different flavor of module.

I do favor what you describe as programming with components, of building from specializations upwards in a first-order style. I wonder where parametric and higher order abstractions of components would fit in your view. Parametric abstraction would certainly regain some of the ML-ish flavor, but the resulting components would still be largely independent.

OTOH, I don't see much benefit to your proposal to copy and paste code and components from a library. I think the ability to name modules would excellently augment this style of programming, making it more productive and maintainable. Naming can logically copy the named component.

something like that should work. (Not a "cheat".)

A trie of ephemeral environments that, given keys, return values, could represent any 'massive database', e.g. of Forth words. I'm curious how this wouldn't be a violation of the intention/spirit of your constraints.

social aspects

A flat namespace has excellent potential to be effective for a social programming experience, much as wikis are effective.

Users, of course, would still have "projects" that they're working on. Wikis have already proven that people can develop many independent projects in a shared namespace. The big difference is that a wiki doesn't build artificial walls that hinder collaborative efforts between projects.

If anything, today's convention of treating projects as walled gardens severely hinders the social aspects of programming.

If you find a bug in someone else's code that you use, you're forced to leap through administrative hoops to fix it. When you want to fix your own bugs or APIs, you can't simply go fix the existing uses by other people, so you must be defensive about backwards compatibility. When you make a change in code someone else is using, you can't immediately see how their unit tests break. Cross-project refactoring based on common patterns is infeasible. To bring a function into your code requires reuse-discouraging management of dependency files, versions, and boiler-plate imports. Words mean different things in different projects, so you're effectively asked to learn a new language every project, and context-switch between projects, placing a greater mental burden on developers.

nobody is ever in a good position to change 10 of the functions in tandem as part of making a larger change

Just take a hit of ACID.

projects are an important granularity for making releases and for upgrading to new versions of third-party code

AFAICT, many problems surrounding 'making releases' and 'upgrading to new versions of third-party code' are emergent from treating projects as walled gardens, not inherent to software development.

I do see utility in developing experimental variations of functions without destroying the old versions. But I don't see any difficulty with developing under a copy-on-write idiom (or IDE mode), keeping the old version available while you develop a new one in parallel.

Projects are also an organizational boundary.

People will plan, organize, and collaborate on projects with or without interference from the language designer. The question, then, is what sort of projects the language affords.

The walled gardens we use today don't afford cross-cutting collaborative projects that make lots of small fixes or improvements scattered across applications. Nor, due to the overheads of dependency management and import boiler plate, do our languages currently afford creating and distributing lots of small, convenient utility functions. Nor lots of small, specialized widgets, nor clip-art sized procedural generation, and so on.

behaviors around these group membership activities are wired into us quite deeply

All the more reason to not build it into the language design. We'll manage well enough at constructing social barriers without help.

There is a good argument for secure, DVCS-style copies of the function database. Assholes do exist and will shit/spam/troll/hate/flame everything if given an opportunity. It only takes 0.1% of people behaving like assholes to cause problems for everyone else, and most people are assholes some of the time. Nonetheless, how we secure the code can be largely orthogonal to the project model. We can still treat it as a global store of functions, even if it is curated.

People used to say the same

People used to say the same things about websites/encyclopedias, then Ward's Wiki came along. But you might be right: look at how Wikipedia has organically grown in-communities and boundaries to protect content.

Still, the initial boost that a global namespace would provide could be very substantial, and we might be surprised where organic boundaries arise.

Emacs?

Isn't he describing Emacs? Or any language with dynamic scope?

I hate to say it, but that

I hate to say it, but that is the first thing that came to my mind as well. As much as I love Emacs for its wealth of extensions, I've always found the overly crowded namespaces to be a nuisance, especially for my own extensions, which I don't want to assign long names...

Plain Paper User Interface

I think programs need to be readable, and that this is more important than ease of writing. Code gets read many more times than written. As such code should have a notation that can be read in the way humans read naturally. Code can be read top down or bottom up, but I think it is important it can be displayed statically as in a book or a research paper. I think programming is an activity that should be possible with just a plain text editor (or pencil and paper), so having functions stored in a database seems like a bad idea to me.

How about a function metadata markup and a grep-like command to search programming code. That would be more my style.

I also like having code in files, and sometimes I have different versions of code that are incompatible with each other in different files. I don't want all these in the same database.

How about something more like Ada's build system where you don't need a make file, it finds the code and dependencies for you. You could have a compiler markup at the head of the file, with metadata about the program, this markup could even be applied to functions.

In this way I could copy a function with normal copy and paste, and annotate each copy with different metadata, indicating a different program target, and have the compiler incrementally build two versions of the program, each using a different version of the function for testing and benchmarking. All from a plain text file.
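A minimal sketch of what such markup and a grep-like search might look like, in Python. The `# @meta key=value` comment syntax, the function names, and the sample source are all assumptions made up for illustration; no existing tool or compiler is implied.

```python
import re

META = re.compile(r"^#\s*@meta\s+(.*)$")   # hypothetical markup line
DEF = re.compile(r"^def\s+(\w+)")          # start of a function definition


def index_functions(text):
    """Return [(function_name, metadata_dict)] for annotated plain text."""
    entries, pending = [], {}
    for line in text.splitlines():
        meta = META.match(line)
        if meta:
            # Each "key=value" token becomes one metadata entry.
            pending.update(pair.split("=", 1) for pair in meta.group(1).split())
            continue
        definition = DEF.match(line)
        if definition:
            entries.append((definition.group(1), pending))
            pending = {}
    return entries


def search(entries, **wanted):
    """Names of functions whose metadata matches every requested key."""
    return [name for name, meta in entries
            if all(meta.get(key) == value for key, value in wanted.items())]


SOURCE = """\
# @meta target=release purpose=sort
def sort_v1(xs):
    return sorted(xs)

# @meta target=bench purpose=sort
def sort_v2(xs):
    return sorted(xs, reverse=False)
"""

index = index_functions(SOURCE)
bench_only = search(index, target="bench")    # ['sort_v2']
all_sorts = search(index, purpose="sort")     # ['sort_v1', 'sort_v2']
```

The two copies of the function live side by side in one text file, and a build tool could select one per target by running the same kind of query.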

Programming with Pen and Paper

Rather than "programs need to be readable", I'd say "programs need to be comprehensible". But there are plenty of things we can comprehend without reading them. Hammers, for example. I've never once read a hammer, but I understand hammers. It seems feasible to me that we can create systems that treat code more like tools and materials.

The belief that "code is read many more times than written" seems to be a myth without a strong statistical foundation. I'd love to see some empirical numbers. There's at least one person who has argued the opposite. AFAICT, reading code is not a very good way to understand code; most people aren't very good at simulating code in their head (and they shouldn't have to!), and it doesn't scale. Nor will reading code be an effective technique if we ever have a lot of mobile code, self-modifying code, genetic programming, programming by example, and so on.

programming is an activity that should be possible with just a plain text editor (or pencil and paper), so having functions stored in a database seems like a bad idea to me.

I do like the idea of programming with pen and paper, and maybe some AR glasses. That isn't necessarily in conflict with use of a database. I also like the idea of programming through Twitter. That seems to be simpler with a database.

AFAICT, the main attraction to programming with a conventional text editor is discomfort with change.

How about maths.

Do you feel the same about maths? I find reading code is the best way to understand it, just like reading a mathematical equation. In fact some algorithms are over 2000 years old and obviously predate computers. Good programmers will always have to understand (simulate in their heads) algorithms. Self-modifying code was always a bad idea, although I guess I am treating this as something distinct from JIT compilers.

Coming from an electronics background where components have long existed, you end up having to build mathematical models of component behaviour to be able to reliably handle complex systems. Having a language for design like VHDL was a huge improvement over graphical layout and wiring.

Tweeting a program reminds me of the Tetris game written in one line of BBC BASIC (255 characters).

http://survex.com/~olly/rheolism/

I don't like losing my code into some database. I want to be able to print it out, or view a literate program on a tablet. I think metadata annotations can give the best of both worlds and could be used with a document indexing engine like elastic-search to give you your database like functionality (it just indexes the metadata leaving the function definitions in the text files). Where is the problem with this approach?

math too

Do you feel the same about maths? I find reading code is the best way to understand it, just like reading a mathematical equation.

I find graphing maths, rendering behaviors, playing with variables a more effective way to understand them. And teach them (I was a TA and tutor for a few years). I believe most people are better with such an approach, and I'm not alone in this belief [1][2].

I'm curious: how did you determine "the best way [for you] to understand it"? Have you made any sort of rigorous, measured comparisons? Or is this just something you believe without a strong argument?

Self modifying code was always a bad idea

Arbitrary reflection and pervasive mutability are bad ideas. But the idea of self-modifying code - e.g. a function that computes the 'next' function to use - isn't a bad idea.
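A small Python sketch of that benign form, with hypothetical names: each function does its work and then returns the function to use next. Nothing is mutated in place and nothing is inspected reflectively.

```python
def red(log):
    log.append("red")
    return green          # the 'next' function is an ordinary return value


def green(log):
    log.append("green")
    return yellow


def yellow(log):
    log.append("yellow")
    return red


def run(start, steps):
    """Drive the behaviour for a fixed number of steps and record what ran."""
    log, current = [], start
    for _ in range(steps):
        current = current(log)
    return log


print(run(red, 4))  # ['red', 'green', 'yellow', 'red']
```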

I don't like losing my code into some database. I want to be able to print it out, or view a literate program on a tablet.

Keeping information in a database won't prevent you from printing it out or curling up with a tablet for some dry reading, if that's what you really want. Indeed, a database should offer great flexibility in the construction of documents and reports.

Literate programming seems an orthogonal concern. If you want a language where every function is also a self-describing document, perhaps even interactive (in the style of iPython notebook or spreadsheets), that's entirely compatible with keeping functions in a global database.

I'm not a strong believer in literate programming, but I do believe having some modes of development based around iPython notebook or Mathematica or spreadsheets could be very useful for learning and explaining code. I already have sketched designs for them.

The identity of a program

I really don't seem to like the idea of putting code in a database. It reminds me of the Windows Registry (as opposed to text config files in Linux).

I want to be able to edit program text in the editor of my choice; cut, copy, and paste code, grep it etc. I want to choose the tools I like, not have it dictated by the language.

For example I never got on with Epigram due to the requirement to write it in the interactive editor, but other dependently typed languages like Twelf are fine.

"I want to be able to edit

"I want to be able to edit program text in the editor of my choice; cut, copy, and paste code, grep it etc. I want to chose the tools I like, not have it dictated by the language. "

You can only do this because lots of different people write text editors* and not many write database editors. This is an argument that has more to do with tools, standards, community, etc. than languages.

* Text editors are interactive editors. What singles them out from other interactive editors is the popularity of the file format.

Simplicity and openness

It's the simplicity and openness of the file format that singles them out. I don't want my code disappearing inside a binary blob in some database.

Referring back to electronics hardware design, that has had all the databases etc for ages, but VHDL (a plain text language) was so much better than the previous methods, and has enabled electronics at a scale not possible using the 'visual' and database methods (namely system-on-a-chip level integration).

"Its the simplicity and

"Its the simplicity and openness of the file format that singles them out."
The simplicity and openness* you are talking about are illusions or features of the metaphors provided by the editors, not the storage format.

There is no reason, for example, why an interface with a text metaphor couldn't be provided on top of a database storage format, or why a db-like view couldn't be formed over a text file - in fact, modern IDEs do all kinds of funky things (that provide non-flat views over flat text) with their understanding of the program structure.

* unless you are using "open" as in an open standard, which goes back to my original statement about popularity, standards, etc.

db-like view over text file

I'm fine with this, I even suggested it myself and asked how many people would be interested or accept this as a solution (with thoughts of having a go at a language agnostic metadata indexing tool, that could be combined with version control and a new build system).

tooling

I don't want my code disappearing inside a binary blob in some database.

I would expect for a database system containing functions to make exporting your data to another format quite easy, and recovering from it too. Not that file systems are any better than databases about losing information. As far as "simple and open formats" go, HTTP GET should be comparable to anything a filesystem offers.

using the 'visual' and database methods

The issue of using a database to support a global namespace and easier tooling is almost completely orthogonal to whether the language is 'visual' vs. 'plain text'.

I want to choose the tools I like, not have it dictated by the language.

I would like to easily create development tools within my language, e.g. as web-applications extending a wiki-based development environment and software platform. Developing fine-grained tools is difficult to do with file and text editor based development environments.

For example, it takes a lot of extra work to even create a simple "rename this function" tool if you have to search arbitrary files, distinguish the same name in different namespaces, and manipulate code embedded deep within arbitrary functions in the file without disturbing the printed format.
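To show the kind of extra work involved, here is a sketch that uses Python source and Python's standard `tokenize` module as a stand-in for whatever language is being edited. The helper `rename_identifier` and the sample source are hypothetical. A purely textual substitution also rewrites the name inside a string literal; the token-aware version does not. Even this version ignores scopes and namespaces, which is the harder part of the problem.

```python
import io
import re
import tokenize


def rename_identifier(source, old, new):
    """Rename NAME tokens equal to `old`, leaving strings and comments alone."""
    lines = source.splitlines(keepends=True)
    hits = [tok for tok in tokenize.generate_tokens(io.StringIO(source).readline)
            if tok.type == tokenize.NAME and tok.string == old]
    # Edit right to left so earlier column offsets on a line stay valid.
    for tok in reversed(hits):
        row, col = tok.start
        line = lines[row - 1]
        lines[row - 1] = line[:col] + new + line[col + len(old):]
    return "".join(lines)


SOURCE = (
    "def xyz(n):\n"
    "    return n\n"
    "\n"
    "print('call xyz(3) gives', xyz(3))\n"
)

naive = re.sub(r"\bxyz\(", "abc(", SOURCE)     # also edits the string literal
safe = rename_identifier(SOURCE, "xyz", "abc")  # only edits real identifiers
```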

If you want to have incremental zero-button testing, and rerun a test whenever a function it depends on is modified, you'll end up implementing most of a compiler and linker to track dependencies as well as very ad-hoc functions to watch the filesystem without polling it. (Or maybe you'll just give up and poll.)
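The dependency-tracking half of that can be sketched in a few lines of Python, assuming the dependency graph is already available as data (which is exactly what is expensive to recover from arbitrary files). The function names, the `test_` naming convention, and the sample graph are hypothetical.

```python
from collections import defaultdict, deque


def reverse_edges(deps):
    """Turn 'name -> names it uses' into 'name -> names that use it'."""
    rev = defaultdict(set)
    for name, uses in deps.items():
        for used in uses:
            rev[used].add(name)
    return rev


def affected_tests(deps, changed):
    """All tests that transitively depend on the changed function."""
    rev = reverse_edges(deps)
    seen, queue = {changed}, deque([changed])
    while queue:
        current = queue.popleft()
        for dependant in rev[current]:
            if dependant not in seen:
                seen.add(dependant)
                queue.append(dependant)
    return sorted(name for name in seen if name.startswith("test_"))


DEPS = {
    "parse": {"lex"},
    "evaluate": {"parse"},
    "format_output": set(),
    "test_parse": {"parse"},
    "test_evaluate": {"evaluate"},
    "test_format": {"format_output"},
}

print(affected_tests(DEPS, "lex"))            # ['test_evaluate', 'test_parse']
print(affected_tests(DEPS, "format_output"))  # ['test_format']
```

With functions stored individually, the `DEPS` table can be maintained on every write; with files, it has to be reconstructed by parsing.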

Metadata and Decision Fatigue

My solution for function renaming is usually :%s/xyz(/abc(/. There is a new tool specifically for code, called 'ack', which looks like a good replacement for grep.

What I suggested above was adding machine readable metadata to files which could identify your program code, so you don't have to search all files (the file name extension also helps here, but filesystem metadata would be a better modern solution). Some kind of 'doctype' header in the file would do though.

As I pointed out above, zero button testing seems a mistake, as it will spend time (and my mental energy) on testing stuff that I already know is broken. One button testing seems optimal to me.

You can only make a finite number of decisions per day (Decision Fatigue), and the more you make the harder it gets, until you start putting things off until the next day (or take a cat nap). So seeing the zero button error feedback and deciding if you need to do anything about it is consuming multiple decisions, which just carrying out a planned edit does not.

decision fatigue vs. simulation fatigue

First, I disagree with your argument that zero-button testing would be a significant source of decision fatigue. If you expect something to break mid-edit, then confirmation of this certainly won't dissuade you from your edit. OTOH, if something breaks and you didn't expect it, that's something to legitimately concern you, because your mental model did not account for it.

Second, you seem to be assuming that predicting and mentally modeling which tests will break ahead of time - i.e. simulating results in your head, understanding not only the code you write but that of every other programmer whose code interacts with yours - would somehow be less fatiguing than receiving this information through your senses and making decisions in reaction to it.

(Keean Schupke as a car designer): "Let's put big blinders on the windows, and just let the driver press one button when they want to see. This will save the driver from all sorts of distracting information and fatiguing decisions. I understand the paths I drive, so I don't need to see them to know where I'm going. Just mentally model everything, and you're good to go."

Anyhow, zero button testing, and accurate, safe renaming (no 'oops' from matching words in different strings or contexts, nor from targeting a word that already exists in one of these contexts) are just a couple tools among many. The reason I'm moving from a file system to a database is that I find it too difficult to support fine-grained, reactive tools via the file system. I want to lower the barrier for tooling.

Driving is continuous, Code is discrete

I don't really understand your model of code editing. I don't just sit down and type random keys; an edit is always planned. I want to rename variable x to y, or lift this function out of a loop, or write a recursive iteration from N to 0. As such I know before I start that the function is meaningless after I start such an edit and before I complete it. This requires no thinking, nor modelling; it is simple common sense.

The problem with the car driving analogy is that driving is a continuous activity, you observe your road position and input a proportional correcting signal. Code is discrete, the error cannot be measured, and the correcting signal is discontinuous. It seems like confusing a discrete state space with a linear space.

I have nothing against tools for accurate renaming, ideally they should integrate with the editor of my choice, or be simple command line tools. I am not saying there is no use for the kind of thing you are developing, just that I don't think it would work for me, although I think similar tools could be developed that would work for me.

Chris Hancock would say

Chris Hancock would say coding today is like hitting a target with a bow and arrow, while what we really want is a water hose. There is an Olympic sport for archery because it's so hard, but nothing for water hosing.

Coding can be continuous.

Code is discrete, the error cannot be measured, and the correcting signal is discontinuous.

Code is not always discrete in this sense.

Many domains have a high natural level of continuity: rendering, music, user interface, robotics actuation, sensor processing, multi-agent systems. Many others can readily be expressed in ways that support high levels of continuity: incremental computation, collections-oriented programming, stream processing, soft constraints, probabilistic models.

Live coding, and live programming, can be much like driving a car - providing continuous feedback. If you've never experienced it or seen it done, then you're missing out.

Continuity doesn't imply the absence of a goal or a plan. You "don't just sit down and type random keys" when you're coding; similarly, you don't just sit down and twiddle random levers and wheels when you're driving.

I don't think it would work for me

I'm sure that's what many conservative assembly-language programmers said about structured programming. Doesn't make it true.

similar tools could be developed for my way of working, and may also work for you?

Possibly. But a filesystem is already a simplified database, and it seems wasteful to spend my efforts designing around its limitations and weaknesses.

I think a more interesting approach would be the converse: to provide specialized filesystem-like views of a database through FUSE. But, outside of specialized use-cases, I'm not sure leveraging existing tools would be sufficient advantage.

Put AST in database.

What about putting the AST in the database? All the clever tools can work off that, and you can implement a clever fuzzy nonlinear parser that allows for edits mid-line and always predicts valid syntax. That way as you edit you see only possible syntactically valid completions of the statement, which change as you keep typing (I know I don't like this, but it's not for me). Now you can have a plain text parser (and printout) that reads the code and metadata from/to a file. Now my workflow could be: clear the database, read all the source code (and metadata) in these files into the database, then run these analysis tools (which should have the possibility of being run from the command line for scripting, not just from a GUI). I think people forget (good) programmers like programming, so tend to prefer writing a build automation script/program to clicking buttons.

AST is separate issue

Using a database to provide a global namespace for functions is not the same as using it to provide structured editing. You seem to be conflating the two.

For my current purposes, my AST is simple (a Forth-like language), and my language is simple (to compile to bytecode, just inline everything), and I mostly want simple tools. I just store the functions as binaries, albeit after validating that they parse and are fully defined. Parsing is cheap, especially after you're storing individual functions separately and thus don't need to deal with large files. If I ever want a clever tool, I won't need to store an AST in the database to do it.

For me, the value of the database comes from easy accessibility of lots of words without import/export boiler-plate. The ability to hyperlink words. The ability to easily arrange a word and the words it depends upon on-screen for debugging and editing. The ability to atomically update many words together. The ability to react continuously and in a fine-grained manner to changes.

storing text not structured data seems a mistake.

Storing text in a database, when databases work best with structured data, seems a mistake.

Let's keep the text plain, and the data structured. I don't understand why you would not want structured data in the DB. You also only want valid code in the DB. Cheap parsing can be used to dynamically update the AST in the database as someone edits it. In the DB functions can have a unique identity, and names just become metadata.

Wouldn't it be better to focus on making it easy to arrange code (AST) and linking code, not words? The words are best seen as a way of serialising and deserialising the code.

No mistake

The schema to create for a database depends on which queries you need to make efficient. To say "databases work best with structured data" irrespective of which data you're proposing to structure is idealism, not practical engineering.

Which queries concern me? Essentially, the following:

  • given a function name, look up its definition
  • given a function name and a timestamp, obtain a consistent snapshot of its definition at that time (useful for DVCS-like features)
  • given a function name, find names of all functions that depend on it
  • given a function name, find auxiliary resources such as computed type information or optimized bytecode
  • given part of a function name, find close matches

I haven't found any requirement that benefits from treating the AST itself as a database. What I need is more about relationships between functions or cached computations on functions, or operations at the level of function names.
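For concreteness, here is a minimal sketch of how several of the queries listed above could be served by a plain relational schema, using Python's standard `sqlite3` module. The table names, column names, helper functions, and sample words are assumptions for illustration, not my actual implementation. It stores function bodies as opaque text, tracks dependencies only for the latest definition, and leaves out auxiliary resources.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE definitions (
        name TEXT, defined_at INTEGER, body TEXT,
        PRIMARY KEY (name, defined_at)
    );
    CREATE TABLE depends_on (name TEXT, uses TEXT);
""")


def define(name, body, uses, at):
    """Record a new version of a function and its current dependencies."""
    db.execute("INSERT INTO definitions VALUES (?, ?, ?)", (name, at, body))
    db.execute("DELETE FROM depends_on WHERE name = ?", (name,))
    db.executemany("INSERT INTO depends_on VALUES (?, ?)",
                   [(name, used) for used in uses])


def lookup(name, at=None):
    """Latest definition, or the definition in effect at timestamp `at`."""
    row = db.execute(
        "SELECT body FROM definitions"
        " WHERE name = ? AND (? IS NULL OR defined_at <= ?)"
        " ORDER BY defined_at DESC LIMIT 1", (name, at, at)).fetchone()
    return row[0] if row else None


def dependants(name):
    """Names of all functions that use `name` directly."""
    rows = db.execute("SELECT name FROM depends_on WHERE uses = ?", (name,))
    return sorted(row[0] for row in rows)


def close_matches(fragment):
    """Crude substring match on function names."""
    rows = db.execute("SELECT DISTINCT name FROM definitions WHERE name LIKE ?",
                      (f"%{fragment}%",))
    return sorted(row[0] for row in rows)


define("square", "dup *", ["dup", "*"], at=1)
define("sum-of-squares", "square swap square +", ["square", "swap", "+"], at=2)
define("square", "dup dup * swap drop", ["dup", "*", "swap", "drop"], at=3)
```

Here `lookup("square", at=2)` returns the older body `dup *`, and `dependants("square")` returns `['sum-of-squares']`. None of these queries needs to look inside a function body.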

Wouldn't it be better to focus on making it easy to arrange code (AST) and linking code, not words? The words are best seen as a way of serialising and deserialising the code.

It's already easy to arrange and link code. If you want to pipe the output of foo as input to bar, the code for that is `foo bar`. Most functions are in the range of five to fifteen words, just one or two lines of code, so code is "easy to arrange" by virtue of being very small.
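A toy Python sketch of why juxtaposition is enough in a Forth-like setting. The primitive set, the two sample definitions, and the helper names are hypothetical and are not the real language. A definition is just a short sequence of words, "compiling" is inlining, and placing one word after another pipes the stack from the first to the second.

```python
# Hypothetical primitives: each maps a stack (list) to a new stack.
PRIMITIVES = {
    "dup":  lambda s: s + [s[-1]],
    "swap": lambda s: s[:-2] + [s[-1], s[-2]],
    "*":    lambda s: s[:-2] + [s[-2] * s[-1]],
    "+":    lambda s: s[:-2] + [s[-2] + s[-1]],
}

# Hypothetical user definitions, each only a few words long.
DEFINITIONS = {
    "square": "dup *",
    "sum-of-squares": "square swap square +",
}


def inline(code):
    """Expand defined words recursively until only primitives and literals remain."""
    out = []
    for word in code.split():
        if word in DEFINITIONS:
            out.extend(inline(DEFINITIONS[word]))
        else:
            out.append(word)
    return out


def run(code, stack=None):
    """Evaluate words left to right; anything that is not a primitive is an integer literal."""
    stack = list(stack or [])
    for word in inline(code):
        stack = PRIMITIVES[word](stack) if word in PRIMITIVES else stack + [int(word)]
    return stack


print(inline("sum-of-squares"))      # ['dup', '*', 'swap', 'dup', '*', '+']
print(run("3 4 sum-of-squares"))     # [25]
```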

You seem to be assuming that coding in a database would be basically the same as coding in a filesystem, except in the database. But this isn't quite the case. Several structural properties of programming languages emerge from our efforts to achieve productivity in a filesystem + plain text-editor environment. Here's a sample:

  • We're pressured to use imports, exports, namespaces. This is because text editors don't offer an effective way to tweak rendering of global names, e.g. to minimize common suffixes or prefixes or create a little legend in the corner. So use of namespaces and scopes are necessary to keep names from growing too large and noisy.
  • We're pressured to put lots of functions together in the same file. This is in part because it would be very annoying to 'open file' and find only one or two lines of code to edit. In part, we don't like repeating import/export boiler plate in too many files.
  • Dependencies become extremely coarse-grained. When we import a resource, we don't get the dependencies for a specific function, but rather for ALL the functions in that file, and transitively. This makes us ask questions like: redundancy vs. dependencies, which is worse? and discourages fine-grained reuse or sharing of code.
  • Programming in a text editor doesn't afford rich 'literal' types. E.g. we can use text literals (because text editors are good at displaying and editing text), but not graphs or diagrams or tables and spreadsheets or music. This forces developers to use lots of ad-hoc external tools for things other than text, which in turn hinders procedural generation, hinders refactoring of transforms and pipelines into simple functions, and hinders debugging.

I'm sure the impact is much bigger than just the four points I listed above. But my own point is that the environment you're comfortable with has consequences that might be invisible due to exposure and familiarity.

When we shift away from text editors to a database, there is no need to carry the old ways of doing things. Our functions can be very small. Bulky, difficult-to-arrange ASTs can be avoided. We can have fine-grained dependencies and fine-grained sharing, without big concerns about dependencies. We can support literals other than numbers and text, and thus mitigate requirements for external file resources.
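As a rough illustration of fine-grained dependencies, the following Python sketch assumes a hypothetical function database in which each entry records its body and the names it uses directly. The entries and the `transitive_deps` helper are made up for the example; primitive words such as `dup` or `sqrt` are assumed to be built in and are not tracked.

```python
# Hypothetical function database: name -> (body, direct dependencies).
FUNCTIONS = {
    "square": ("dup *", set()),
    "sum-of-squares": ("square swap square +", {"square"}),
    "hypot": ("sum-of-squares sqrt", {"sum-of-squares"}),
    "unrelated-helper": ("drop", set()),
}


def transitive_deps(name, db=FUNCTIONS):
    """Return every function that `name` needs, directly or indirectly."""
    seen = set()
    pending = [name]
    while pending:
        current = pending.pop()
        for dep in db[current][1]:
            if dep not in seen:
                seen.add(dep)
                pending.append(dep)
    return seen
```

Asking for `transitive_deps("hypot")` pulls in only `sum-of-squares` and `square`; `unrelated-helper` is never dragged along, which is the contrast with importing a whole file.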

+1

hear hear. it has always bugged the heck out of me that we have so many things that get in the way of us wanting to actually break things up into more files. (imports, makefiles, having to open/close/find them in our editors, etc.)

Text not the only way

I don't see how those benefits are exclusively tied to having source code in a database, and I don't think your criticisms of storing the AST in the database make sense. The AST is the language, the text representation is just one possible serialisation of that AST. For example whether you display an expression as prefix, infix, or postfix it is the same expression.

All the queries you list could work equally well on an AST database, and I agree they all look useful. I like all the features you list, apart from the 'rich' literal types. I don't think I really want 'data' like images or music in the source code. Perhaps the code could contain a URL for the data and the editor could insert it inline? The problem is the database is going to become a runtime requirement to fetch the rich literals, and it will either be a toy database (to keep it small), or SQL is going to become a dependency for installing the language.

Still, whilst I may disagree on the details, I do think it's a very interesting area. You have convinced me that it's something I should consider, even if I have slightly different ideas about how it could be done.

One thing that occurs to me is that, as I like logic languages, and Prolog/Datalog lend themselves well to a database-based implementation, do you have any thoughts about how this might apply to logic languages?

intentional software

whatever happened to that, anyway?

How to determine the intent

I don't get how you are supposed to express the programmer's intent; it just seems a complete non-starter. A bit like the mythic system whereby a person talks to a computer in human language, and it writes the program for you. It all seems a bit like the stories about Djinni, where you wish for the software, but due to the inexactness of your specification you never get exactly what you thought you were asking for.

"Intentional Software" == brand

I believe he was referring to https://en.wikipedia.org/wiki/Intentional_Software

Intentional Programming

The AST is the language?

AST is the language, the text representation is just one possible serialisation of that AST. For example whether you display an expression as prefix, infix, or postfix it is the same expression.

I don't really agree with conflating language with AST. To me, the most important parts of the language are:

  • the composition model, combining two subprograms into a larger program
  • the security model, how to control which subprograms can access which data and resources
  • the failure model, how we resist, understand, and recover from partial failure
  • the ecosystem, how easy it is to evolve and integrate lots of heterogeneous data models with diverse update models
  • the invariants and inductive properties, which support equational reasoning, safe refactoring, and optimizations
  • the economy, how easy it is to create subprograms, share them with other developers and projects, reuse them in a new context, contract subprograms out piecemeal and test them individually. I like the idea of programming in a culture approaching singularity

To me, syntax is important especially insofar as it impacts entanglement or structural invariants.

But focusing inwards on the individual expressions, aiming to support both `4 5 +` and `4 + 5` as having the same meanings at the user's whim, does not seem to address any of the problems I consider important.

I don't think your criticisms of storing the AST in the database make sense [..] whether you display an expression as prefix, infix, or postfix it is the same expression [..] the text representation is just one possible serialisation of that AST

In Forth, `4 5 +`, `4 + 5`, and `+ 4 5` are three very different programs, all valid. The only sensible AST is pretty much the same as the program, just a simple list of words and numbers, albeit distinguishing numbers from words. And we could do that with a simple lexer, too.
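A minimal sketch of such a lexer, written in Python for illustration (the token tags `num` and `word` are names chosen for this example, not part of any Forth standard), shows that the three programs really are three different word lists:

```python
def lex(source):
    """Split a Forth-like program into ('num', int) and ('word', str) tokens."""
    tokens = []
    for text in source.split():
        try:
            tokens.append(("num", int(text)))
        except ValueError:
            # Anything that does not parse as an integer is treated as a word.
            tokens.append(("word", text))
    return tokens
```

The result for `4 5 +` is the list of two numbers followed by the word `+`, while `4 + 5` and `+ 4 5` produce the same tokens in a different order, so there is no shared tree to recover.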

My criticism of storing the AST was simply that I can't think of any advantage of breaking the function body up into a sophisticated database schema. All I can see is the pain of putting it back together again when queried for the program.

So, I ask: what concrete, practical, non-idealistic advantage would you obtain storing a Forth word definition (or a Forth-like language) as an AST?

In some version of the

In some version of the universe, it would be wonderful if someone could store the AST for a given expression in (say) COBOL, and others working with the database could view it in C or R or Lisp or Eiffel or ML or Pascal or Go, blissfully unaware that the original expression was written in COBOL.

Being able to 'query' the expression in any of a dozen different languages would be IMO worth the pain of figuring out how to build expressions from an AST. Unfortunately that is at the very least hard and requires much work that has not been done.

In a less-pipedream scenario, with the AST rather than the program text stored, we could identify redundant copies of the AST differing by minor things such as variable renaming and sequence of commutative operations, as part of an effort to support deduplication.
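A rough sketch of that deduplication idea in Python, under several assumptions: the AST is a nested tuple of `("num", n)`, `("var", name)`, and `("op", symbol, left, right)` nodes, only `+` and `*` are treated as commutative, and parameters are renamed by position. The node shapes and the `fingerprint` helper are invented for this example, and the normal form is a heuristic that will miss many equivalences (it does not reassociate, for instance).

```python
import hashlib

COMMUTATIVE = {"+", "*"}


def canonical(node, positions):
    """Rewrite a node into a normal form that ignores parameter names."""
    kind = node[0]
    if kind == "num":
        return node
    if kind == "var":
        # Bound parameters become positional; free names are kept as-is.
        if node[1] in positions:
            return ("param", positions[node[1]])
        return node
    _, op, left, right = node
    left, right = canonical(left, positions), canonical(right, positions)
    # Put operands of commutative operators into a fixed order.
    if op in COMMUTATIVE and right < left:
        left, right = right, left
    return ("op", op, left, right)


def fingerprint(params, body):
    """Hash of the normal form; equal hashes suggest duplicate functions."""
    positions = {name: index for index, name in enumerate(params)}
    tree = canonical(body, positions)
    text = repr((len(params), tree))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```

With this, a function of `(x, y)` computing `x + 2 * y` and a function of `(a, b)` computing `b * 2 + a` get the same fingerprint, while `x - y` and `y - x` stay distinct.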

expressions from AST easy

I don't understand what is hard about building expressions from the AST, and I have done this for AST display. You even get back the same expression order and association, but with optimal bracketing (redundant brackets are removed).
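For readers who have not done this, a minimal Python sketch of the usual technique follows. It is not the code referred to above; the tuple-based AST, the `PRECEDENCE` table, and the `show` function are assumptions made for the example. A subexpression is bracketed only when its operator binds less tightly than its parent, or when it is a right operand at the same precedence, which preserves the original association.

```python
# Hypothetical precedence table for four left-associative operators.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def show(node, parent_prec=0, is_right=False):
    """Render an (op, left, right) tree as infix text with minimal brackets."""
    if not isinstance(node, tuple):
        return str(node)  # leaf: a number or a variable name
    op, left, right = node
    prec = PRECEDENCE[op]
    text = f"{show(left, prec)} {op} {show(right, prec, True)}"
    # Brackets are needed if we bind looser than the parent, or if we are
    # a right operand at equal precedence (keeps the tree's association).
    needs_brackets = prec < parent_prec or (is_right and prec == parent_prec)
    return f"({text})" if needs_brackets else text
```

For example, the tree for "add 1 and 2, then multiply by 3" prints with brackets around the sum, while the tree for "multiply 1 by 2, then add 3" prints with none.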

I have code that reads a source file, parses into AST, runs type inference, and then outputs the source code annotated with type information generated back from the AST.

I have written tools that read one language to an AST and then transform one AST to a different AST then output the code in a different language.

Well, yes the simple useless case is easy.

Getting back the *SAME* expression is easy. Likewise the expression with some redundant information removed (although in languages with complicated or large sets of operator-precedence rules "redundant" parens or braces are sometimes valuable as documentation for human readability, so I'm not sure I like removing them automatically....)

It doesn't become useful to have stored the AST instead of the text unless you can get back usefully *DIFFERENT* expressions. And the particular example of that I had in mind was building the appropriate text to represent the "same" computation in different languages.

It would be awesome if you could select the programming language as part of your "view" on the database. If you can't, I'm not entirely sure that a database is better than a text file.

AST seems more fundamental to language semantics than text rep

I guess I don't agree with conflating the language with the text representation. The compiler works from the AST, so you can always load and store this from a representation like JSON. I have written several proto languages which have no parser and no written representation. I have also written DSLs where the syntax is determined by the host language, then ported the DSL from one language to another (for example Haskell to JavaScript), so the DSLs are the same language despite differing syntax.

re: ASTs

AST seems more fundamental to language semantics than text rep

In Forth-like and Lisp-like languages, the difference seems marginal.

you can always load and store this from a representation like JSON

I think this would make sense if parsing JSON was easier than parsing my language, i.e. such that every tool operating on the database could just include a JSON parser and some knowledge of how I lay it out as JSON.

But, as is, it just adds a dependency on JSON.

For one specific language

It seems you are confusing the properties of one specific language with the properties of languages in general.

For lisp the text rep is the AST. My understanding is that they were originally going to have a function syntax, and s-expressions were just a quick hack to read and dump the AST, but I suppose it was easier to get used to, than it was to write the planned parser. In any case I see this as supporting my point about the AST being more fundamental.

JSON was just an example, maybe s-expressions are easier.

specific language

I believe I referred to my specific, Forth-like language quite a few posts back, and several times hence. If you did not catch that, we have been talking past each other.

I consider concrete syntax to be more fundamental. To even represent an AST requires a concrete syntax, e.g. JSON or something else. Relatedly, I also consider computation as a mechanical process, operating on real-world physical representations.

In memory

I did catch you referring to a Forth-like language, but your points about the database and ASTs seemed more general. If what you are saying is that for some non-Forth-like language storing the AST in the DB might make sense, then I might agree that storing text makes sense in the specific case you are talking about, if only because everything in Forth is a 'word' (unless it supports sets?).

I still have problems with your view of ASTs. Obviously they don't require JSON or anything else; they exist as data in the computer's memory linked by pointers. I consider computation an abstract process operating on memory and bits.

I think however I have been over generalising, as I think I get the idea with regards to Forth. I wouldn't want to do C++ like that though.

Abstract syntax

AST seems more fundamental to language semantics than text rep

In Forth-like and Lisp-like languages, the difference seems marginal.

For the price of dumbing down either the language structure, or the concrete syntax, or both.

It seems like this subthread is confusing concrete AST representation with abstract syntax as a concept. The latter certainly is more fundamental to a language than its concrete syntax (or a concrete AST representation, which -- really -- is just another form of concrete syntax).

Rich literal types

the database is going to become a runtime requirement to fetch the rich literals

Calling home to a database would miss out on many advantages of treating rich literals as part of the language, such as the ability to perform partial evaluation optimizations, or make rich values more accessible in mobile code, or copy-paste-modify and version-control resources much as we do with text.

Also, in purely functional or capability secure languages, rich literals let us treat values as values. Phoning home would damage this property, since now we'd need to deal with partial failures, logging and anonymity concerns, and other side-effects.

I'm interested in the possibility of using rich literals as an approach to DSLs, i.e. with many properties of external DSLs but some advantages of embedding. Again, I think I'd miss this if I was forced to use URLs.

SQL is going to become a dependency for installing the language

Of course. If a language is designed around a database, it is essential that we install the appropriate database technology to install the language.

Realistically, we'll want to fill that database with a bunch of existing content, e.g. importing functions from an open source repository to get started, so we don't spend all our time reinventing old wheels under new names. As far as 'rich literal types' go, this means we might get easy access to lots of example games and worlds and creatures and sound-effects and animations and icons. Why not? Seems awesome to me. Batteries included.

These days, a 4TB HDD is only $200. We should be taking advantage of cheap space to provide all the tools people don't believe they'll need until they discover they want them.

But it isn't as though every programmer should need to install his or her own database. A single server could cross-compile objects on demand. Of course, a single global server is also a bad idea. I think something in between is appropriate, e.g. representing different 'distributions' of the language with different curators, and companies creating their own copies to help control IP.

I like logic languages, and Prolog/Datalog lend themselves well to a database based implementation, do you have any thoughts about how this might apply to logic languages?

Use of a database for a global namespace seems mostly orthogonal to whether the language is functional or logic programming. But it might be useful to have easy access to tables to fill out certain propositions.

Thick vs Thin Language

Maybe we are just looking at two different application areas. I can see a use for systems like Matlab and Mathematica, for non-programmers, or people whose primary interest is another discipline and want to use a computer as a tool, and get answers as easily as possible. I don't really like the dependence on the cloud, but I can see a better Mathematica as a reasonable goal. I don't however think that this is suitable for all programming, just as I don't see people rushing to swap to Mathematica from Java and C++. Maybe in the future there will be no "programmers" and no computer scientists. However someone has to write the database that runs The Language, and that obviously cannot have a dependency on itself.

Maybe I am thinking more of the languages that will be used to build the components of The Language like databases, operating systems for each of the separate networked compute servers, firmware for the network routers.

On the other hand maybe The Language will be capable of compiling code to a small stand-alone binary, although it doesn't seem your focus right now.

Prolog and Datalog

With Prolog the language already is a clause database, and you run queries on that database. So I guess these languages already do what you want? (Let's ignore the difference between unification and assignment for now). A Prolog database entry consists of a head (the "function" name and parameters that is the DB key) and a body (a list of "functions" to call sequentially from the DB).

Edit: Is Prolog to trees what Forth is to stacks?

Eve is very closely related

Eve is very closely related to datalog. We store the code in the same database as all the runtime state. It makes live-coding easier since all the tools are written as incrementally-maintained views over the code using the language itself. It also accidentally allows for self-modifying or self-optimising code eg regenerating the query plan based on current statistics can be implemented from inside the language itself.

The current schema is:

view: view
field: field, view, ix
query: query, view, ix
constantConstraint: query, field, value
functionConstraint: constraint, query, field, code
functionConstraintInput: constraint, field, variable
viewConstraint: constraint, query, sourceView, isNegated
viewConstraintBinding: constraint, field, sourceField
aggregateConstraint: constraint, query, field, sourceView, code
aggregateConstraintBinding: constraint, field, sourceField
aggregateConstraintSolverInput: constraint, field, variable
aggregateConstraintAggregateInput: constraint, sourceField, variable
isInput: view
isCheck: view

That captures the core language. The fields 'code' and 'variable' are places where javascript is currently used for scalar computation and for specifying aggregate operations. Replacing javascript will probably inflate the schema a fair bit.

My comment earlier in the thread on content-addressing is our rough plan for dealing with versioning, packaging and distribution. We probably won't have a chance to implement that until mid 2015 though.

I think you meant live

I think you meant live programming: live feedback for programmers, vs. live modifications to a program running in real time (without time travel benefits).

Both analysis as-you-type

Both analysis as-you-type and execution as-you-type become much easier with the ability to write views over the code. It's not shown in that demo, but since the current state of the program is just a view over the inputs we also get retroactive changes to state as described in Out of the Tarpit. For example, in the TodoMVC demo if you remove the rule that says that pressing the delete button deletes the todo, all the deleted todos will reappear.
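The retroactive behaviour can be sketched in a few lines of Python. This is not how Eve is implemented; the event log, the two rule functions, and `current_todos` are hypothetical names for the example. The point is only that when visible state is recomputed from the inputs and the current rules, dropping a rule changes the past as well as the future.

```python
# Hypothetical input log: the only data that is actually stored.
EVENTS = [
    ("add", "buy milk"),
    ("add", "write post"),
    ("delete", "buy milk"),
]


def rule_add(todos, event):
    kind, text = event
    return todos | {text} if kind == "add" else todos


def rule_delete(todos, event):
    kind, text = event
    return todos - {text} if kind == "delete" else todos


def current_todos(events, rules):
    """Recompute the visible todos from scratch; no derived state is kept."""
    todos = frozenset()
    for event in events:
        for rule in rules:
            todos = rule(todos, event)
    return todos
```

Evaluating with both rules shows only the surviving todo; evaluating with `rule_delete` removed makes the deleted one reappear, because the delete event is still in the log but nothing interprets it any more.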

Ya, I get that.

Ya, I get that. But your goals aren't to enable modification to a program running live in the field, or to support live musical performances. I just think that separating the terms for the two different experiences is important (look at toplap.org for examples of live coding, which I think is not really what Eve is about).

Ah, I see what you're

Ah, I see what you're getting at. Although now I wonder whether we could support live music...

Actually, I like the live

Actually, I like the live coding/cyber-physical computing story, and perhaps might look at it someday. But better debugging tools are cool also, and probably more desired by the community.

Although, thinking about it

Although, thinking about it more, a good fraction of the work that is done with tools like Excel is less like programming, where the code artifact is the output, and more like live coding, where the process of manipulating the system is the output. Things like exploring and manipulating a scientific model or testing account projections for different budgets.

Yes, good point. Though in

Yes, good point. Though in that case, you still have plenty of time to think and reflect, there is no pressure as there would be in a live performance; you are still basically "debugging" your data. My view of live coding, especially after reading Sorensen's paper, is that you are interacting with the world in real time through your program...sort of like decking or rigging in the shadowrun sense. There is some pressure there to perform efficiently because the problems are being solved in the "live."

But when you are manipulating a spreadsheet, you are often offline: changes to the spreadsheet don't immediately go to the factory floor. You still get to try things out and take them back if they don't work.

Database == isolation and (usually bad) reimplementations

I get the advantages of having code in a database. They are substantial. However, I want to make sure that you also get the disadvantages, the things that have resulted in all database-based languages to date becoming niche players at best, so you can attempt to address them.

Substantially, it comes down to living in isolation. A database based system means that you lose all the tools that work directly on files, and have to re-create the tools. It also means that you can't easily work with different languages that use the traditional tools.

I just counted off languages I regularly interact with, and came to ten. In the last couple of weeks, for every one of those, I've either written code in, had coworkers send me code for review in, or had to search through in our common search tool. Specifically, in the last two weeks, I've written code in seven of them, directly reviewed code before submit in one more, and searched through code in the remaining two.

All of those work through text files. They have the same interfaces for editing (usually vim for me, but it depends on which language I'm working with), the same interface for reviews, the same interface for backup, the same interface for version control, automated builds available, automated tests available, trivially available low quality search indexing (and with a plugin excellent quality search indexing), and easily available tools for doing various forms of manipulation in the form of awk, perl, sed, etc.

If you go with a database-based language, you have to replace all of those. And, until it is common, anybody who wants to casually use your language has to learn not just how to edit a basic program, but a whole host of related tools.

If you managed to get all languages and the basic computer interface to switch to your new database-based system, this wouldn't be a problem - it would just be an alternative way of accessing data, and people would learn that instead of the file system and text file based system. However, this hasn't succeeded in the past, and there is *very* large resistance against it.

All of this isn't necessarily a problem if your language provides large enough benefits and is separate enough from "normal programming" that it doesn't get subjected to the same constraints. Spreadsheets, for instance, have many of these problems, and still see very significant use - possibly more than all other programming put together. I suspect Awelon Blue might fit in the space of "not quite regular programming and provides large enough benefits", but I want to make sure you don't convince anybody that "databases for code is the future" without much deeper thought. They provide some compelling advantages, but also a host of disadvantages to go with those advantages.

Addressing Disadvantages

I don't believe that working within a database implies isolation. Rather, we're shifting away from the filesystem and into the web.

Also, I suspect a lot of similar predecessors you might be thinking about aren't really the same thing, i.e. tending towards "database per project" instead of "one global wiki-like database or common dictionary that is occasionally cloned for IP or security purposes". And favoring specialized development environments instead of the widely familiar browsers.

Consider:

  • Instead of e-mailing files, you'd send a hyperlink. As an added bonus, there wouldn't be a lot of setup involved if you want to try running the code rather than just reading it.
  • Version control deserves consideration. But I think it's much easier to implement above a database. CSCW, too - learning about potential conflicts in real time rather than trying to merge after-the-fact.
  • Loss of awk and sed would go unnoticed by most programmers.
  • Search? Well, Google has your back. It might take some effort to do better than that for raw text, but you could certainly focus on type-oriented search (Hoogle style).
  • Emacs or Vim could be significant losses to someone who has mastered them. Sadly, most programmers never touch more than a small fraction of their features. And, for many who did, the modes are a distant memory after years of various IDEs and text-editing in browsers.

That said, the internal reprogrammability or tailorability of emacs is certainly a feature worthy of emulation. In context of a language hosted in a database and accessed largely through web services, I might suggest the ability to easily add new web applications in the form of CGI-like functions. I.e. just define a new function (indicated by naming convention or metadata) to introduce a new web-app. I'm assuming the language has nice security properties.

Given internal reprogrammability, I suspect someone would find a way to install emacs- or vim-inspired editing modes. Emacs finds a way. :)

Filesystem based tooling is full of productivity pitfalls that tend to catch programmers over and over again during their careers: install and configure environments every time we change computers, learning new build systems and package managers and profiling tools and debugging tools for every new language, managing dependencies and debugging ad-hoc version configurations, voodoo bugs that disappear with make clean or magic configurations, learning the particular file organizations and conventions for each new project instead of using words learned while developing earlier projects.

A new language oriented around a global database of functions will, as you mention, require learning new tools. But so does learning a new language in a filesystem. And, long term, I think we can avoid a few of the aforementioned pitfalls, enabling words to be learned once and used from project to project.

I agree that "*very* large resistance" exists. I think a lot of that is emotional, rather than based on rational extrapolations of likely impacts on productivity. But that doesn't make it any less real. A new approach will need to prove itself.

The problem with a database is that you have to know which one.

Knowing that something is a database doesn't tell you enough to know how to open and manipulate it. You have to know which database. Because each and every one of them has to be a special snowflake with its own storage paradigm. Further, they can't even bear to leave the storage paradigm alone from version to version; they change stuff arbitrarily rendering every third-party tool that ever existed, except their own, obsolete.

That means tools cannot accumulate over decades the way they have with text files, because EVERY goddamn tool has to be rewritten EVERY goddamn time someone changes the database format, or for EVERY different kind of database.

Databases will not be on a par with text until there is a universally accepted database format that everybody's software works with.

Conversely, there's a good case that if you think you need a database it's because there is some functionality that ought to be built directly into the filesystem and isn't.

molehills and mountains of tools

The problems you're describing are molehills, not mountains. They are problems we're well experienced with and prepared to handle. You could replace 'database' in your rant with any number of other ideas (file formats, data structures, operating systems, web services, etc.) and have it apply almost as well.

It hardly even matters whether you use a particular database. You can always encapsulate or extend it with a standard web services API.

tools cannot accumulate over decades

Tools will accumulate.

Consider web applications. We essentially download a tool and install it into our browser every time we interact with a web service. This has a lot of advantages for both security and maintenance. We don't need to leave things broken and wait for someone else to "rewrite EVERY tool EVERY time" we make major changes.

Applications can accumulate over decades, but they do so internally - as part of the services - rather than externally, via someone else's package manager or web store.

This approach can be augmented a lot:

  • Internal reprogrammability for a global key-value store could make it easy for anyone to add the tools they want and maintain them, modulo the will of the curators.
  • Adding and maintaining tools could be conflated with adding and maintaining functions in the store, which would be convenient because it allows reuse of the testing, debugging, version control, etc. models used for developing every other function. It could also improve portability.
  • Use of content-addressing by secure hash instead of URLs could completely eliminate cache invalidation and greatly improve efficient reuse of tools and web applications across different systems.
  • A language with better staging and control of side-effects compared to JavaScript could enable a much greater level of partial evaluation, optimization, and compilation; together with caching, this essentially gives us separate compilation.
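The content-addressing bullet can be illustrated with a small Python sketch. The `ContentStore` class is hypothetical and in-memory only; SHA-256 is chosen here as one example of a secure hash, not something the proposal specifies. Because the key is derived from the content, a cached entry can never be stale, and a copy fetched from anywhere can be checked against its own address.

```python
import hashlib


class ContentStore:
    """Hypothetical in-memory store keyed by the SHA-256 of each value."""

    def __init__(self):
        self._blobs = {}

    def put(self, source):
        """Store source text and return its content address."""
        data = source.encode("utf-8")
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data  # storing the same content twice is a no-op
        return key

    def get(self, key):
        """Fetch by address, verifying the content against the key."""
        data = self._blobs[key]
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError("content does not match its address")
        return data.decode("utf-8")
```

Putting the same definition twice yields the same key, and any change to the text yields a different key, so "updating" a tool means publishing a new address rather than invalidating an old one.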

By opening creation of tools to the users of the service, or implementing them as functions, we get many benefits of third-party tools while avoiding many of the disadvantages.

if you think you need a database it's because there is some functionality that ought to be built directly into the filesystem and isn't

I wouldn't mind if filesystems introduced a lot of the features that I might want from a database. However, to leverage nice new features like ACID transactions, consistent historical views, multiple views, or working with very fine-grained files would likely require the same retooling that you're objecting to in the first place.

[edited for coherence]

No. Just ... no.

If you think downloading and installing a tool every time we use a web interface has any security advantages, I suggest you subscribe to any security list and listen for a while to the reports of failures, invasions, and betrayals perpetrated through such "tools".

There is a reason we run NoScript on our browsers.

Nor am I willing to have tools accumulate "internally" to a service and therefore trust a single service (and network connectivity) for all management of code. Any single service will eventually die. If it is profit motivated it will do so with the absolute minimum possible warning to its customers. I really do like being able to pull out my laptop and work on code even when I am at hotels and so forth where there is never any trustworthy network connection to use.

security and networks

Downloading tools as needed has many security advantages. It supports a very fine-grained notion of installation, very precise authorities that are difficult to provide when installing applications externally and integrating them. It also avoids many security maintenance issues, and the need to trust third-party software distribution services.

Hand-written JavaScript isn't very adept at expressing secure abstractions/models, I grant. But JavaScript can be used as a secure compilation target. And there have been some efforts, such as Caja and work by Mark Miller, to make secure scripting more easily accessible.

Refusing scripts in favor of OS-layer executable tools isn't doing you any favors for security.

Nor am I willing to have tools accumulate "internally" to a service and therefore trust a single service (and network connectivity) for all management of code.

A network connection isn't essential. Joe Armstrong's proposed global function store would work just as well with LDAP, or DVCS-like copies of the database for everyone. And there are other options beyond those, such as content addressing, distributed hash tables, proof-of-work systems... but I favor DVCS-like approaches.

The important aspect is that everyone shares more or less the same repository of functions, all of them readily accessible, just like every English speaker shares more or less the same dictionary, and thus can learn ideas and words then easily reuse them from project to project.

We're aiming to avoid today's scenario where every collection of functions is specific to a project (or even a small niche within a project), where words like 'main' must be relearned in every new context, where developers must carefully install and maintain coarse-grained 'packages' of functions and make painful decisions between redundancy vs. dependencies.

I've already mentioned a few times that a group should be able to clone the service, get a private copy of the global database for various reasons, and continue to maintain push-pull DVCS-like relationships. But I also don't doubt we'll have some common repositories that accumulate a wide variety of tools, and those tools will quickly become globally accessible.

If the problem is a distrusted network connection, rather than the absence of one, there are other projects to address that. For example, if TLS isn't enough, look into Tor. Of course, if you're just working on a public repo, even a distrusted connection should be fine.

These are not molehills.

If someone sends a file of text, it is absolutely clear what it is and how to use it. No questions about what software it's associated with need to be asked or answered.

I know of no database format that is universally known and handled by every kind of database software. Further, I am convinced that there will never, ever be one, not until the sun dies.

You know what the big thing in database technology is now? Getting them to read and make sense of text. So people can send text files back and forth to update databases. Well, and so they can have robots read your mail in order to gather information they intend to use to sell you crap you don't need, but that's just a follow-on effect, right?

Even if you get a code database working, it will never be used, until it can be used by people who only work on text files and can leave it up to people who want database functionality to feed textfiles into their tools.

If someone sends a file of

If someone sends a file of text, it is absolutely clear what it is and how to use it.

Well, that really depends, right? If I get a file of Chinese text, it really isn't absolutely clear how I should read it. The world isn't completely ASCII. And god forbid if it is RTF or PDF or DOCX, which I'm more likely to get than an old-fashioned TXT file.

Getting them to read and make sense of text.

Are you confusing data mining with databases?

Even if you get a code database working, it will never be used, until it can be used by people who only work on text files and can leave it up to people who want database functionality to feed textfiles into their tools.

Even if you can get a newfangled car to work, it will never be used until it can fit a horse in it for people who will only ride horses.

Unproblems

First, I believe your claim about text is an untruth. Even when I recognize the language involved with text - e.g. C, with a bunch of #include directives and externs - it isn't always clear to me how to set up the environment to effectively use it. If I don't recognize the language, e.g. if I were looking at a cabal file or troff markup for the first time, it would be much less clear.

Second, you don't send a database. You might send a hyperlink. People know how to use hyperlinks. The 'universal database format' is an irrelevant consideration. Instead, we develop common protocols and adapters, such as HTTP, ODBC, RSYNC. We might also have a few de-facto common import/export formats.

Third, while I'm sure there are many people who feel as strongly as you do about 'working with text files', I doubt it's a majority of people. Wikis, Twitter, and Gmail didn't become popular based on their ability to easily interact with emacs, awk, and grep.

Fourth, don't worry. FUSE has made it relatively easy to create filesystem-to-database adapters. People have mounted Amazon S3 buckets, Google Drive, torrents, and many other things. If a database language becomes popular, I wouldn't be surprised to see FUSE adapters developed by the subset of people who really want to work with text files and filesystem tools.

a file by another name?

You might send a hyperlink. People know how to use hyperlinks.

Are you saying a file can be had on demand by hyperlink? And that therefore a file view still exists? If so, then an abstraction of file system exists. So a file model is present, but with another protocol.

Or are you saying a hyperlink offers a mysterious and unspecified service, unwilling to cough up a file? I kind of like the idea of writing an Eliza-like interface imitating HAL that refuses to confirm or deny that a file exists. Have you considered upgrading your subscription to database services?

not a file, unless you stretch the term

Google Docs, wiki pages, Drupal blogs with forms... these aren't things I'd consider 'files' under a conventional understanding of the file and filesystem metaphor. A hyperlink could just as easily lead you to an interactive, living object or application as lead to an object that can be treated as a file.

Whether a database is willing to 'cough up' a file is almost a separate issue. My WordPress blog allows me to download an archive of all my posts, but I wouldn't normally interact with my blog by modifying the archive file and importing it.

was thinking of optional decomposition into files

(I have files on the brain because of something I've been sorting, involving role of namespace via VFS in agent-based systems ... and generalizations of local domain sockets, etc.)

as easily lead you to an interactive, living object or application

Ah, I tend to adopt the perspective of automated tools initiating or responding to availability of data, and expect basic tools to be literal, dumb, and batch oriented. In other words, there's no way a responsive user interface can afford such tools any value, even if a human being would enjoy the experience.

So a hyperlink can afford a multivalent experience for an interacting person, or else afford a narrow channel for a downloading tool. We know that's so, just from browsing the web, but the interesting question is whether both options ought to be possible. A user interface experience seems harder to deliver, while data access seems easier. So if a database approach to code did not afford a data access channel for tools, it would seem odd in a way requiring explanations like 1) your code is held hostage, or 2) the db's view of the world is far too complex for mere mortals to understand, so there's no linear text presentation possible. The second sounds like a lie or laziness though.

A third reason might be: you should stop thinking of code as having separate parts that tools can analyze. But it sounds suspiciously like a mix of (1) and (2).

channels for tools

I wouldn't expect a database to hold content hostage. That's the opposite of the purpose of a database. Even if I were completely unmoved by an argument to support filesystem utilities, I would want narrow channels for replicating, cloning, synchronizing, subscribing for changes, etc.

As I've mentioned a few times, I like 'internal reprogrammability'. In context, a subset of a database of functions might (via metadata or naming convention) define CGI-like web applications that can receive HTTP requests and reflect on the database. With internal reprogrammability, we can create new ad-hoc multivalent experiences for humans and narrow channels for external tools. Trying to remove just one of these possibilities would be rather silly.
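As a rough illustration of that idea, here is a sketch in Python, assuming a naming convention in which any function stored under a `web.` prefix is treated as a CGI-like application. The convention, the `dispatch` helper, and the example entries are all hypothetical. The database is just a dict here, and a real system would sit behind an actual HTTP server.

```python
def dispatch(db, path, query):
    """Route a request path such as '/hello' to the function named 'web.hello'."""
    name = "web." + path.strip("/").replace("/", ".")
    handler = db.get(name)
    if handler is None:
        return 404, "no such application: " + name
    # Each handler receives the database itself, so it can reflect on it.
    return 200, handler(db, query)


def list_functions(db, query):
    # A narrow channel for external tools: plain text, one name per line.
    prefix = query.get("prefix", "")
    return "\n".join(sorted(n for n in db if n.startswith(prefix)))


def hello(db, query):
    # A small human-facing application living in the same database.
    return "hello, " + query.get("name", "world")


db = {
    "math.inc": lambda x: x + 1,   # ordinary function, not exposed as an app
    "web.hello": hello,
    "web.functions": list_functions,
}

status, body = dispatch(db, "/hello", {"name": "Joe"})
```

Adding a new view for people or a new export channel for tools is then just a matter of defining another function under the agreed prefix.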

But, much like Emacs grows into its own operating system, I would expect internal reprogrammability to gradually usurp almost every role you might imagine for external tools.

Languages and Operating Systems

I think there is a reason why languages and operating systems are separate things. I don't want my language to lock me into one operating system.

This seems perfect for vendor lock-in. If you can only program language X on platform Y, this increases the cost of maintaining versions of a program that run on multiple operating systems.

Even if your project is open-source, other OS vendors will adopt it, but with subtle differences that make interoperability hard (for example, the saga of Java and Microsoft). How can you avoid repeating history?

portable languages

Under an assumption that the language and the OS are the same thing, your complaint reduces to "You can only program language X on language X", which seems a ridiculous condition to complain about. Only when languages and operating systems are separate things does your complaint seem meaningful. Maybe you only need the separate OS because you have it?

In any case, language X may itself be portable. This is a matter of language design. A language that makes fewer assumptions about its environment or isolates interaction with its environment is more portable, easier to adapt, re-implement, and cross-compile for a new environment.

Under the assumption that the language is an OS, the target of a port wouldn't be an OS per se, but rather a "machine" - virtual or abstract or otherwise. This would be similar to how OS kernels are ported to many different machines. Almost independently, virtual and abstract machines may be portable to different operating systems. For example, the JVM is portable, but Java itself was not designed for easy portability to other VMs.

By designing a portable language, we can avoid repeating history. I would recommend the object capability model as a very effective basis for isolating interactions with the machine.

And, naturally, if the language is to be a good OS, then the language needs to be effective for historically OS-level concerns such as concurrency, security, and process control. The object capability model also helps with security.

Emacs and Specifications.

All sounds pretty reasonable, and I liked the idea of the JVM when it first appeared. Perhaps I have problems with this concept because I don't like emacs? How can you prevent your language being another emacs that is only used for writing emacs 'tools' and getting sidelined by external progress? For example, I am sure emacs made sense on text terminals, but it seems silly on a modern GUI. No commercial applications were really developed in emacs (for example MS Word, SQL databases, etc.). Whereas 'C' is still used widely and many applications are still written in 'C'.

Somehow I think a good language needs a specification and more than one implementation, how do you plan on specifying the database? For example, I have no problem with the way Prolog has an implied clause database, as the semantics are clear and expressed as part of the language specification.

not emacs

Emacs is a text editor. That foundation has greatly shaped its evolution. In the sense that it is an operating system, it is very much a single-user operating system meant for a human user. It is not the sort of operating system you'd want to control a robot, or an MMORPG.

If you start with a multi-user web service as a foundation, e.g. a programmable wiki, your destination will be very different. Such a system may very well be suitable for running an MMORPG, but maybe not for controlling a robot except at a high level - e.g. planning and routing for swarms of robots.

If your language itself is focused on distributed and parallel programming, then the initial platform - be it a web service or a text editor - has a weaker impact on the destination. It becomes a staging area, a launching platform for a distributed computation that might operate on hundreds of cloud processors (perhaps via unikernels) and even have little bits of code running in many browsers and on lots of little robots. You can already see systems moving in this direction, e.g. with Opa or Heroku or Wolfram Language.

My own language is designed for the latter sort of distributed computation. And Erlang would also be pretty good at it.

Emacs simply isn't a very effective, efficient, or scalable system for a lot of the things we would like to operate. We can do a lot better if we try.

a good language needs a specification and more than one implementation, how do you plan on specifying the database?

To clarify, with regard to the OP topic and what I've been discussing, the database is not part of the language, any more than a filesystem is part of C or Java. At most, the language is designed so it's more comfortable to develop using a database, much like C is designed to be comfortably developed using plain text editors and a filesystem.

Please don't conflate the concepts of "database in the language" vs. "language in the database". AFAICT, they're entirely orthogonal.

Database is (not) the language.

So there won't be a way of querying a collection of functions from the database and executing them? This would seem to allow some interesting possibilities, but as you say is orthogonal to the database storage.

Am I right in assuming I would be free to make a version of your language which consists of a compiler that reads plain text and outputs machine code, and this would work fine? In which case you are only discussing the tooling for a language. If you can provide an import/export function that lets me get at my code, one that resolves the dependencies from the database as part of the export and puts all the needed function definitions in a text file, then that would solve the lock-in issue.
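For what it's worth, such an export could be sketched roughly as follows. This is Python rather than the language under discussion, and it assumes a hypothetical database layout in which each uniquely named function is stored as its source text plus a list of the names it depends on.

```python
def export(db, roots):
    """Return one text blob holding `roots` and everything they depend on.

    Dependencies are emitted before the functions that use them.
    """
    ordered, seen = [], set()

    def visit(name):
        if name in seen:
            return  # already emitted, or part of a cycle (mutual recursion)
        seen.add(name)
        source, deps = db[name]
        for dep in deps:
            visit(dep)
        ordered.append(source)

    for root in roots:
        visit(root)
    return "\n\n".join(ordered) + "\n"


# Hypothetical database contents: name -> (source text, dependency names).
db = {
    "square": ("def square(x):\n    return x * x", []),
    "sum_squares": (
        "def sum_squares(xs):\n    return sum(square(x) for x in xs)",
        ["square"],
    ),
    "unused": ("def unused():\n    return None", []),
}

text = export(db, ["sum_squares"])
```

The resulting `text` contains `square` followed by `sum_squares` and nothing else, so it can be handed to an ordinary plain text compiler with no reference back to the database.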

avoiding coupling

I wouldn't want most applications from my language tethered to the database. Exceptions do exist, such as developing an IDE or REPL or debugger or typechecker. So, I would provide capabilities to access the database of functions to just the subset of applications that truly need it to do their job.
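A small sketch of what that could look like, using Python as a stand-in. All names here (`FunctionDB`, `LookupOnly`, `repl_eval`, `word_count`) are hypothetical. Python does not enforce capability discipline, so this only illustrates the shape of the idea: an application can touch the database only if it was handed a reference to it.

```python
class FunctionDB:
    """Stand-in for the database of functions."""

    def __init__(self):
        self._functions = {}

    def define(self, name, fn):
        self._functions[name] = fn

    def lookup(self, name):
        return self._functions[name]


class LookupOnly:
    """Attenuated capability: exposes lookup but not define."""

    def __init__(self, db):
        self.lookup = db.lookup


def word_count(text):
    # An ordinary application: it is given no database reference at all,
    # so it is not tethered to the database.
    return len(text.split())


def repl_eval(db_cap, name, *args):
    # A REPL-like tool: it needs the database to do its job, so it is
    # passed a capability explicitly.
    return db_cap.lookup(name)(*args)


db = FunctionDB()
db.define("double", lambda x: 2 * x)

cap = LookupOnly(db)
result = repl_eval(cap, "double", 21)
```

In a language built on the object capability model, the attenuation would be enforced rather than a matter of convention as it is here.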

You could write a plain text compiler for my language. Indeed, that's how it currently is tooled. I even have limited support for file-level imports in my very simple '.ao' file format. But I'm encountering a lot of hurdles from the filesystem.

I wouldn't dismiss any discussion as "only tooling". Tooling is an important aspect of language design. Which tools we envision for our language will have a deep impact on syntax and semantics.