Why do we need modules at all?

Post by Joe Armstrong of Erlang fame. Leader:

Why do we need modules at all? This is a brain-dump-stream-of-consciousness-thing. I've been thinking about this for a while. I'm proposing a slightly different way of programming here. The basic idea is:

  • do away with modules
  • all functions have unique distinct names
  • all functions have (lots of) meta data
  • all functions go into a global (searchable) Key-value database
  • we need letrec
  • contribution to open source can be as simple as contributing a single function
  • there are no "open source projects" - only "the open source Key-Value database of all functions"
  • Content is peer reviewed
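
As a rough illustration of the proposal above, here is a minimal sketch of a flat, searchable key-value store of functions with metadata. The names `FunctionEntry`, `FunctionStore`, `publish`, and `search`, and the metadata keys, are all hypothetical and not taken from Armstrong's post; a real system would also need persistence, versioning, and review workflow.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionEntry:
    """One record in the hypothetical global function database."""
    name: str      # globally unique key
    source: str    # the function's source text, stored as an opaque string
    metadata: dict = field(default_factory=dict)  # free-form, searchable


class FunctionStore:
    """A flat key-value store: unique name -> entry. There are no modules."""

    def __init__(self):
        self._entries = {}

    def publish(self, entry):
        # Every name must be unique across the whole store.
        if entry.name in self._entries:
            raise KeyError(f"name already taken: {entry.name}")
        self._entries[entry.name] = entry

    def lookup(self, name):
        return self._entries[name]

    def search(self, **criteria):
        # Names whose metadata matches every given key/value pair.
        return sorted(
            e.name for e in self._entries.values()
            if all(e.metadata.get(k) == v for k, v in criteria.items())
        )


store = FunctionStore()
store.publish(FunctionEntry(
    "list_reverse", "<source text of list_reverse>",
    {"author": "alice", "topic": "lists", "reviewed": True}))
store.publish(FunctionEntry(
    "string_trim", "<source text of string_trim>",
    {"author": "bob", "topic": "strings", "reviewed": False}))

print(store.search(topic="lists"))     # ['list_reverse']
print(store.search(reviewed=False))    # ['string_trim']
```

In this sketch, contributing to the database is a single `publish` call for a single function, and the `reviewed` flag stands in for peer review.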

Why does Erlang have modules? There's a good and a bad side to modules. Good: Provides a unit of compilation, a unit of code distribution, and a unit of code replacement. Bad: It's very difficult to decide which module to put an individual function in. Breaks encapsulation (see later).

text and tools vs binary and optimization

(Incidentally, I think your Forth style word database approach is very viable, though components might need finer granularity in-the-small with less redundancy in markup showing how things chunk to a user.)

I thought this was too on-target to pass up. Other folks might be amused to see Rob Pike recap some of this perspective, in his Sept 2001 The Good, the Bad, and the Ugly: The Unix(tm) Legacy, where the slide on page 18 notes that files are "the centerpiece of the tool approach", adding:

Drifting back towards typed, binary files (Ugly):
    cat . doesn't work any more (readdir vs. read).
    Databases to hold code.
    HTTP/HTML interfaces to system services.

Other slides are good too, but here I enjoyed mention of code in databases.

I'll write up something about file systems in another discussion after the idea gels more, because it hasn't finished changing on me yet. A long part is about the role of uncoupling via normalized flat tree indexes. But one part seems settled now: an idea that NS via FS (namespace via file system) is about advertising a palette of available bits of information, including means of contact if so desired. Where messages seem more a push model, a namespace you can browse is more a pull model, which is efficient when you intend to ignore most available things. (State of the world outside an agent, especially history and environment variables, should not be shoved down its throat; but it's still nice to have on offer.)

One fancy idea I want to imitate from Plan9 is local namespace versions, so advertising can be scoped as needed, including sandboxes when necessary. More importantly, less visibility reduces coupling when fewer parties need agree on names.

I like low-coupling generic text interface for control plane effects using few resources, reserving higher coupling for optimized data plane mechanisms.

Tooling

Sorry, that came across more dismissively than I intended. What I meant was that if the tooling was not essential, and it could be programmed without it, a lot of the objections are not relevant. You could split the design into a language specification, and a tooling design. I guess the clever bit (which I like) is to adapt the language design to the best tooling designs available, and to try and develop better tooling.

Different notion of isolation than I was trying to convey

What I was trying to convey is that you are isolating the way of doing development.

If I adopt a new text based language, that's "just" installing a compiler/interpreter and writing new text, using the same tools for everything else.

If I'm going to start doing development in something that's database based, that pulls it all away from the rest of my tools and workflow. It means I can't just try this out for one small utility in my world, and say "If this don't work out, I'll just keep the compiler/interpreter around, and it will be a small hassle to maintain one more language." I can't say "I'll use my standard generic tools and slowly learn new ones that are more appropriate for that language".

I have to go all in. I have to learn a new editor (possibly setting up some kind of editor that sort of simulates the one I'm used to). I have to go with new version control, new access control, new backup systems, new ways of searching, new ways of doing global fixes, new ways of building, new ways of sending references to people I work with, and so on and so forth. And these costs also come for every maintenance programmer in the future.

This makes the cost of testing the waters very high. It also adds a continual cost if I'm using several languages (and I am) - I need to learn and maintain knowledge of one way of working with the DB based language, and another way of working with everything else.

It also means that version control can't take snapshots of the world. With my normal text based systems, I can have "version control in state X" (at a label or date) represent everything there is to represent about the system. Configuration files, templates, build instructions - it all goes in my single text-based version control system.

With a database based system, I lose all that.

If you want to compete with a text based system, the extra value you get out of your database has to exceed all the value from the above, including adjusted risk perception - where people will attribute more risk to an unknown system than to a known system, and to a system out of their control than to a system in their control.

On the individual points:

  • Emailing files is something I hardly ever do; however, I occasionally email entire projects, and I have review setups etc. for dealing with files. These include web interfaces for them. While these could be replaced, all of that represents tens of programmer-years of work.
  • Version control is certainly easier to implement over a database. However, that is irrelevant in terms of the major complexity for it - version control tends to use some kind of database anyway, and whether the thing that gets stored in it is a whole file or a function makes very little difference.
  • Loss of sed/awk would go unnoticed by most programmers, but that's not important. There are, in my experience, three groups of programmers in the world. There are bad programmers. They use a random assortment of tools, but are not opinion leaders or deciders and are fairly irrelevant in technology adoption. There are good programmers that are IDE users. They won't care about command line tools. There are good programmers that are separate tool users. These will tend to use emacs, vim or some other programmer's text editor, and a bunch of other tools to manipulate source code. They will care about their tools, a lot, and generally include sed/awk and similar. Both kinds of good programmers are often opinion leaders; which fraction of IDE vs text editor users you get depends a lot on what kind of languages the programmers are used to. And the people who matter for getting your language adopted and your tools written are these good programmers - as opinion leaders for adoption, as programmers for getting your tools written.
  • Search. While it is flattering that you think we have your back, I unfortunately think that's not true for source code search. There are several problems. One is that you want a guaranteed comprehensive search over your codebase, and web search doesn't offer that. Another is that you want to be up to date - and we don't offer that, either, for a variety of reasons. And a last one is that you want to search through private code - both code that's not checked in and code that's not public.
  • Plugins. They work, but not fully, and are a fair amount of work to write. It took many years before Eclipse got a free, well-functioning binding for vi keymaps, and while emacs had several, they never felt like they worked well (it's approaching 15 years since I last started emacs, so it might have improved since.)

I don't think you can overcome all these obstacles. Instead, I think the useful thing to do is go around them. Make Awelon (or whatever language) so compelling for some use cases that it is worthwhile to use it for those, even with all these drawbacks.

Small projects

I can't just try this out for one small utility in my world, and say "If this don't work out, I'll just keep the compiler/interpreter around, and it will be a small hassle to maintain one more language."

Umm, that's simply untrue. You can create one application or service or utility, cross-compile for your system, and then decide that you'd prefer to do the rest of your work in another language. There might be a couple extra steps involved, e.g. import/export to fit your version control preferences, but probably not as much as you're imagining. This is something you can do.

adds a continual cost if I'm using several languages

You'll always have continual costs for using several languages, and for every language you add. More package managers, packages, versions, and configuration management. More redundant maintenance for file formats, protocols, and communication. More debugging modes and idioms. More languages for a maintenance programmer or a newcomer to your project to install, configure, and learn.

The "hassle to maintain one more language" is more involved than "just keep the compiler/interpreter around".

I grant the overhead will be somewhat higher for adding your first database-organized language to the mix, using new ways to track or import/export versions, editing code without emacs or vim (unless you really insist). But the question is how much higher? Is the cost the same as adding two file-based languages? Could you argue for three?

version control can't take snapshots of the world. With my normal [file] based systems, I can have "version control in state X" (at a label or date) represent everything there is to represent about the system.

From a database, you could capture a snapshot of some set of functions and all its transitive dependencies, e.g. via query or an export system. There are ways to address this concern.
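
A minimal sketch of such a snapshot query, assuming the database can report, for each function name, the names it calls. The `deps` table and the function names in it are made up for illustration.

```python
def snapshot(deps, roots):
    """Return the roots plus everything they transitively depend on."""
    seen = set()
    stack = list(roots)
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        # Follow outgoing edges: the functions this one calls.
        stack.extend(deps.get(name, ()))
    return sorted(seen)


# Hypothetical dependency data: function name -> names it calls.
deps = {
    "render_page": ["escape_html", "format_date"],
    "format_date": ["pad_left"],
    "escape_html": [],
    "pad_left": [],
    "unused_helper": ["pad_left"],
}

print(snapshot(deps, ["render_page"]))
# ['escape_html', 'format_date', 'pad_left', 'render_page']
```

The result is the set of names an export tool would write out; `unused_helper` is left behind because nothing reachable from the root calls it.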

If you want to compete with a [file] based system, the extra value you get out of your database has to exceed all the value from the above

I think it's fine that a language system primarily competes based on its utility for developing and maintaining applications and systems, rather than on some implicit argument of being a right tool in a polyglot's toolkit for small pieces of a larger project.

This does hinder the "get your toes wet" adoption model you've been assuming. But there are other strategies for language adoption - e.g. being very good at one niche and spreading outwards from there. Or developing some nice, extensible applications, where extensions are developed in a particular language.

Search. While it is flattering that you think we have your back, I unfortunately think that's not true for source code search.

I would implement a few specific searches that are useful for refactoring and code completions and such. But I'm not going to bother with all the little nuances of indexing for full text search. Even for a private repo, I might install one of Google's private search engine products. Is there a reason you'd recommend against that?

Plugins. They work, but not fully

I didn't mention plugins. Neal Ford presents a nice argument for how internal reprogrammability is very different from plugins. Martin Fowler approved.

I don't think you can overcome all these obstacles.

The tooling available today does offer some adoption advantages to file-based organizations. But I think this advantage is smaller than you've been assuming, and is accompanied by more than a few disadvantages. Any specific issues (e.g. wanting a snapshot of your project for external version control or import/export purposes) are often readily addressed.

Even if we adapted an existing language like Erlang, I think organizing around a database of functions should turn into an advantage in the mid to long term, i.e. as the database grows large enough and quantity becomes a quality.

For adoption of a new language (such as Awelon), of course, it's important to have compelling use-cases regardless of organization.

Different hassle tolerance

Umm, that's simply untrue. You can create one application or service or utility, cross-compile for your system, and then decide that you'd prefer to do the rest of your work in another language. There might be a couple extra steps involved, e.g. import/export to fit your version control preferences, but probably not as much as you're imagining. This is something you can do.

I do not consider that a reasonable workflow; in practice, I consider it something that I can't do inside the constraints of a reasonable workflow, unless the language allows export to and import from files in such a way that it works as if the files were the native format (for reviews, builds, editing by maintainers). If it does, then yes, sure, but then the database becomes a minor implementation detail.

I grant the overhead will be somewhat higher for adding your first database-organized language to the mix, using new ways to track or import/export versions, editing code without emacs or vim (unless you really insist). But the question is how much higher? Is the cost the same as adding two file-based languages? Could you argue for three?

It totally depends on the languages. I can argue anywhere from one and a bit to at least twenty, depending on which languages in particular and what they or the database based language is used for. For many of the systems I have worked with, adding another file based language is very close to zero cost as long as the code in the language doesn't have to be changed. It's adding one package specification to the overall build system ("We need to install compiler X from the standard distribution packaging"), adding one rule to the standard build system to be able to build files using that, and off you go. Just write the files. The setup is just ten minutes. The tools are for the most part the same - you only change what language you write in.

Also, the overhead isn't just for the first database-organized language - it is for each and every database-organized language.

From a database, you could capture a snapshot of some set of functions and all its transitive dependencies, e.g. via query or an export system. There are ways to address this concern.

This would have to be done every single check in, and be readable in itself, and have readable differences. I think this is a much larger impedance mismatch than you seem to believe.

I think it's fine that a language system primarily competes based on its utility for developing and maintaining applications and systems, rather than on some implicit argument of being a right tool in a polyglot's toolkit for small pieces of a larger project.

So do I. For general purpose languages, I don't agree with "right tool for that particular piece". I just think you're going to lose a large part of adoption that happens by people first trying out the language on something small before adopting it for something large, and that you need to replace this with something else if you're going to get adoption - niches or using as app extension languages are fine examples.

I would implement a few specific searches that are useful for refactoring and code completions and such. But I'm not going to bother with all the little nuances of indexing for full text search. Even for a private repo, I might install one of Google's private search engine products. Is there a reason you'd recommend against that?

The ranking for source code is going to be abysmal, and I don't think the cost of buying a GSA is going to be worthwhile for that use case. (If there are other products we sell for this case, I'm not aware of them.)

Plugins vs reprogrammability - there's no difference as far as editor adaptation goes. There's a significant difference for lots of other cases, but not this one.

The tooling available today does offer some adoption advantages to file-based organizations. But I think this advantage is smaller than you've been assuming, and is accompanied by more than a few disadvantages. Any specific issues (e.g. wanting a snapshot of your project for external version control or import/export purposes) are often readily addressed.

The thing is: It is necessary to address each of these things for an adopter. More or less individually. This means that if somebody is going to start using this, then they either need to set up stuff to address each aspect of this, or they need to forgo the things. I think in practice, the cost of setting up ways of reclaiming these things is going to be enough hassle that they'll usually not be set up - so effectively, going with a database based language means forgoing that integration, and instead going with the database-language native versions of all of this.

You could prove me wrong. That would be great :-)

mode of adoption

I do not consider that a reasonable workflow

That's okay. Your choice to use seven languages is already far beyond my hassle tolerance. But it is possible. Perhaps adding a database language to your workflow is beyond your hassle tolerance. But it, too, is possible.

I was only objecting to your earlier hyperbole, where you said "I can't" rather than "I won't".

For many of the systems I have worked with, adding another file based language is very close to zero cost as long as the code in the language doesn't have to be changed. It's adding one package specification to the overall build system

You seem to be describing a best-case scenario, where (a) you don't use any libraries from the new language or suffer dependency management and version-maintenance thereof, (b) you get the code right the first time and thus never need to maintain it, (c) newcomers and maintainers for your project will either know the language already or never feel the need to look under the hood and understand what code in this language is doing.

We should compare more typical cases.

This would have to be done every single check in

Things you have to do 'every single time' are the easiest to automate. A one-line `wget` script to scrape a version before check-in should be sufficient, and could be automated in many version control systems.

you're going to lose a large part of adoption that happens by people first trying out the language on something small before adopting it for something large

I think "try it out for something small before adopting it for something large" will be common. It just won't happen the same way you've been imagining.

You've been assuming a particular mode of adoption that involves trying a language the first time by integrating code with an existing polyglot filesystem project.

But I believe most adopters will try a database-organized language the first time through a publicly accessible web service, creating small but independently useful artifacts (web apps, android apps, generative art, interactive fiction, ROS services, command line apps, etc.) and cross compiling them as needed. The web service avoids a lot of the barriers for a first trial, e.g. no need to download a compiler or set up an environment. And the ability to develop independently useful applications is both rewarding and mitigates concerns about being tethered to the database.

Later, a developer who is familiar with the database-organized language (or who has heard how great it is and is tempted to try learning) might feel concern regarding how difficult it is to bring some software components into their existing polyglot filesystem-organized project. And even here multiple routes are available. E.g. rather than your preference of tightly coupling all versions at the source layer, one might treat the code from the database as a separate package or service and manage versions at that layer.

I don't see much need to optimize for the particular mode of adoption you've assumed. Making integration feasible for a determined developer without too much hassle should be sufficient to address common concerns about whether the language can be used with an existing project.

[edited for focus]

proving stuff

You could prove me wrong. That would be great :-)

I'd love to. :)

Sadly, my Awelon project doesn't control enough variables to 'prove' much of anything with regards to organization of code.

AO's Forth-like syntax will undoubtedly be unpopular with a large crowd of curly-brace or parentheses zealots. Requiring all recursion to be anonymous through explicit fixpoints will likely be unpopular with the Forth crowd. The type system is largely experimental - language designed to make linting and inference and partial evaluation easy, but lacking any sort of type declarations - and will likely alarm fans of both dynamic and static types. AO is a language only a hipster will love - unless and until it really takes off. Which isn't impossible, since there are some benefits to offset the rejection of convention.
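
For readers unfamiliar with anonymous recursion through an explicit fixpoint, here is a small sketch in Python (not AO syntax, and not taken from the Awelon project). The helper name `fix` is an assumption; it is the usual strict-language fixpoint combinator, which hands a function a reference to itself so that it never needs a name to recurse on.

```python
def fix(f):
    """Strict fixpoint combinator: fix(f) behaves like f(fix(f))."""
    # The inner lambda delays x(x) until the recursive call is made,
    # which keeps evaluation from looping forever in a strict language.
    return (lambda x: f(lambda *args: x(x)(*args)))(
        lambda x: f(lambda *args: x(x)(*args))
    )


# The function body never mentions its own name; `self` is supplied by fix.
factorial = fix(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))

print(factorial(5))  # 120
```

The binding to `factorial` at the end is only for convenience when calling it; the recursion itself goes through `self`, not through that name.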

It will be difficult to isolate causes for success or failure with respect to adoption. So we're mostly left with theory, like the argument we're having now.

Simplicity and openness

are only as simple and open as the layer underneath the API your chosen OS is exposing. Your "simple, open format" file exists as (probably a number of) sequences of bytes on some form of storage; which sequences may or may not have static locations during the life of the file. They might be encrypted, striped across multiple devices, distributed, version controlled, whatever.

All of this behind a simple "open()"

Your files are already binary blobs in what is most probably an opaque proprietary database. You're just used to the query language.

formats and database hostages

I'll say more about this in a new discussion I'll start to comment on remarks by Ray Dillinger and Thomas Lord about scale, duplication, alpha-renaming, and topics related to ephemeral environment stuff.

Keean Schupke: It's the simplicity and openness of the file format that singles them out. I don't want my code disappearing inside a binary blob in some database.

You can copy simple text file formats to and from a database, where something more complex happens. But it needn't be much more complex, if the database affords a view of everything it knows as if encoded in text files. For example, I was going to persist to a tarball, consisting of text files from a virtual file system serialized in tar format. If you untar, it's all text files. (Useful as sanity check.) The "files" consisting of mainly annotation, metainfo, and indexing might be a little weird looking though. You can atomically transact tarballs in append-only mode by grouping related changes inside begin and end markers. (Applying them atomically requires the tar tool used honor transaction marks. Lucky thing writing tar readers and writers is easy.)
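
A rough sketch of the tarball idea using Python's standard `tarfile` module. The marker naming (`txn-N.begin`, `txn-N.end`) and the helper names are assumptions made for illustration, and for brevity the archive is written in one pass to an in-memory buffer rather than appended to on disk. The reader plays the part of a tar tool that honors transaction marks: it only applies files that sit between a begin marker and its matching end marker.

```python
import io
import tarfile


def add_text(tar, name, text):
    """Append one text 'file' to the archive."""
    data = text.encode("utf-8")
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def committed_files(raw):
    """Replay the archive, keeping only files from complete transactions."""
    state, pending, in_txn = {}, {}, False
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r") as tar:
        for member in tar.getmembers():
            if member.name.endswith(".begin"):
                pending, in_txn = {}, True
            elif member.name.endswith(".end"):
                if in_txn:
                    state.update(pending)  # apply the whole group at once
                pending, in_txn = {}, False
            elif in_txn:
                text = tar.extractfile(member).read().decode("utf-8")
                pending[member.name] = text
    return state


buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    add_text(tar, "txn-1.begin", "")
    add_text(tar, "src/hello.txt", "hello\n")
    add_text(tar, "meta/hello.txt", "author: example\n")
    add_text(tar, "txn-1.end", "")
    add_text(tar, "txn-2.begin", "")
    add_text(tar, "src/half_written.txt", "oops\n")
    # txn-2 has no end marker, as if the writer stopped mid-transaction.

print(sorted(committed_files(buf.getvalue())))
# ['meta/hello.txt', 'src/hello.txt']
```

An ordinary `tar` tool would still unpack every entry as a plain text file, including the half-written one, which is the sanity-check property described above.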

Simplicity and openness are only as simple and open as the layer underneath the API your chosen OS is exposing. Your "simple, open format" file exists as (probably a number of) sequences of bytes on some form of storage; which sequences may or may not have static locations during the life of the file.

I think it's the model of simplicity afforded by file systems that Schupke likes, despite the fact that actual file system implementations in an OS get complex in detail. The content model itself can be simple, as exemplified by portable tar utilities and a tarball format understood in most places. Having storage trivially open to writing tools to test hypotheses can help a dev maintain an evidence-based focus due to transparency.

(In the 90's it enraged some Apple users when OpenDoc embedded simple text files as blobs inside the Bento file format, which had originally been designed to index compact disk data collections, and had a very naive write model that merely appended and indexed. That OpenDoc documents could not be opened with a plain text editor, when the only content was a text document, drove a couple vocal people berserk.)

What about entwined implementation, optimization and apis?

From an external point of view, a module presents a public api - that's one purpose of it.

But from the inside it has the purpose of hiding the internal implementation and all of the hidden machinery of that implementation.

Also, in a dynamic language like Scheme one needs modules to allow optimization. A module declares to the compiler that the functions and data definitions of that module will not be altered from outside it - so if nothing IN the module alters function or data definitions the compiler knows that it's safe to allow inlining optimizations.

I've seen it mentioned in articles that the property of which code refers to each other (think of a graph) is how you decide what has to be included in a single module. Functions, objects, apis that use each other's non-public details have to be in the same module.

Just as objects are useful for gathering related functions into mini-apis, modules are useful for gathering related apis for the convenience of users and gathering intertwined implementations for the sake of implementation hiding and optimization.

To hide design decisions that are likely to change

Parnas answered this question in 1972 in the paper that introduced information hiding, On the Criteria To Be Used in Decomposing Systems into Modules.

My summary: code is going to change, and people using it are better off relying on stable interfaces (design decisions that are unlikely to change) rather than implementations (design decisions that are likely to change, and are therefore hidden inside modules). But read the paper yourself if you haven't already, it's worth it!

The basic problem with doing away with modules and putting everything at the top level is that there is no longer any way to distinguish what is stable from what is likely to change. Since managing software evolution is recognized as one of the biggest problems in software engineering, this is a big deal.

Reading Armstrong's post, he

Reading Armstrong's post, he is primarily concerned with namespace management, not "modules." It is just that the two concepts are often mashed together when they are really independent concerns.

So ya, I agree. But I also put forth the crazy idea that namespace management can be detached from module decomposition and information hiding, and in that case a flat namespace works to our advantage in navigating a universe of lots of stuff that we could use in our program.

Searching for reusable code is good

Providing sophisticated search facilities for reusable code has a lot of benefits--that seems to be a theme both of Armstrong's post and also (I think) your paper on escaping the maze of twisty classes. I guess I don't see what that has to do with namespace management, though. Can't we build tools that find reusable code equally well whether it is a flat namespace or a hierarchical one?

To me the barriers are (A) enabling the tools to find whatever is out there (GitHub and similar sites help), (B) creative search and selection techniques, like the ideas from your paper, and (C) making it easier to integrate and reuse the code you find (package managers could maybe help here).

Whether your compiler or a

Whether your compiler or a tool is used to resolve names to symbols...they are still manipulating a namespace. The proposals set forth here really are about shaping this namespace to make finding things (and collaborating) easier.

Come to think of it, modules are more about access control. Today, if I access something I shouldn't be able to, the compiler provides a cryptic, unhelpful error, "name not found", whereas it should really say "you aren't allowed to use that." Since we have conflated namespaces and modules, we often get the former rather than the latter.

Access control = information hiding

"Name not found" arguably is the right error. Encapsulation a.k.a. information hiding is the sanest form of access control by far. Outside code shouldn't even be aware of the existence of private details, let alone be restricted by it (this is an issue with C++ and some of its OO friends, for example, where private stuff pollutes the public namespace).

It is a crappy error from a

It is a crappy error from a usability perspective, but definitions of correctness in PL rarely consider humans. It personally bites me more than a few times a day (access restriction being specified too strictly); at least I want to know for my own code.

Another example: code completion menus should show choices that are semantically invalid (with obvious indicators of such), because you might have made an error earlier that an invalid choice later would help you recognize and fix. Masking information is just inhumane.

One can give a good human error message

One can have the compiler give a more understandable (by humans) error message while still strongly enforcing information hiding. I agree with Sean that the tool should say "sorry, you can't access that, it's private" (as opposed to "name not found"). As long as the access is strictly prohibited, what's the harm?

I agree that one shouldn't let private things pollute the public namespace, but this can be consistent with good error messages. Inaccessible things still aren't a part of the public namespace, the compiler just checks the relevant private spaces to help with giving a good error message (and for no other purpose).
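
A minimal sketch of that idea, with a made-up `Module` structure and made-up exception names. Resolution only ever consults public tables; private tables are consulted solely to choose the wording of the error once resolution has already failed.

```python
class NameNotFound(Exception):
    pass


class AccessDenied(Exception):
    pass


class Module:
    def __init__(self, name, public, private):
        self.name = name
        self.public = dict(public)
        self.private = dict(private)


def resolve(modules, name):
    # Normal resolution: only public names are visible.
    for module in modules:
        if name in module.public:
            return module.public[name]
    # Diagnostics only: look in private tables to explain the failure.
    for module in modules:
        if name in module.private:
            raise AccessDenied(
                f"'{name}' exists but is private to module '{module.name}'")
    raise NameNotFound(f"name not found: '{name}'")


queue = Module("queue", public={"push": "<push impl>"},
               private={"rebalance": "<rebalance impl>"})

print(resolve([queue], "push"))  # <push impl>
for attempt in ("rebalance", "frobnicate"):
    try:
        resolve([queue], attempt)
    except (AccessDenied, NameNotFound) as err:
        print(type(err).__name__, "-", err)
```

Access is refused in both failing cases; only the message differs, so the private name never becomes usable from outside.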

Moreover, for assertions or

Moreover, for assertions or logging, why not allow access?

It is harder to write a compiler that gives good error messages.

There are algorithms that reduce dependency tracking, syntax, namespace resolution, parsing, etc, down to a minimal set of essentials that are sufficient to express all the constraints in a way amenable to recognizing and representing a correct program.

Those essentials are not the terms in which human beings think.

To give good error messages, you have to make the compiler use a process similar to the process humans model in their minds, so that it can explain in terms of *THAT* process what is wrong when something doesn't fit.

And this usually makes the compiler "messy" or involves "inefficient" use of code - and is both more complex and structurally different from the code that we would write simply to handle the correct case.

You can't add good error message generation to code that's efficiently designed to handle the correct case. It is something you have to alter your design for - and arguably can require you to even use a slightly "worse" design in terms of efficiency and speed.

I would further argue that

I would further argue that conflating module with name scope gets us into this mess in the first place, efficient or inefficient compiler notwithstanding. Compilers just shouldn't be doing name resolution.

Depends on why you're hiding it

GET /catdb/cat/calico HTTP/1.1
Host: example.com
---
200 OK
---
GET /catdb/dog/husky HTTP/1.1
Host: example.com
---
403 FORBIDDEN
---
POST /query HTTP/1.1
Host: stackoverflow.com
How do I get the dog data from example.com/catdb???

If the response is 404 NOT FOUND the principal will not so readily infer that dog data is available.
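
The trade-off can be sketched as a single policy switch. In standard HTTP the "prohibited" style of response is 403 Forbidden, and a server is allowed to answer 404 instead when it wants to hide that a resource exists. The function and parameter names below are hypothetical.

```python
def status_for(path, resources, restricted, hide_existence=True):
    """Pick a status code for a request from an unprivileged principal."""
    if path not in resources:
        return 404
    if path in restricted:
        # 403 confirms that the resource exists; 404 does not.
        return 404 if hide_existence else 403
    return 200


resources = {"/catdb/cat/calico", "/catdb/dog/husky"}
restricted = {"/catdb/dog/husky"}

print(status_for("/catdb/cat/calico", resources, restricted))  # 200
print(status_for("/catdb/dog/husky", resources, restricted))   # 404
print(status_for("/catdb/dog/husky", resources, restricted,
                 hide_existence=False))                        # 403
```

With `hide_existence=True`, a restricted path is indistinguishable from one that was never there, which is the "name not found" behavior argued for earlier in the thread.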

to distinguish what is stable

The basic problem with doing away with modules and putting everything at the top level is that there is no longer any way to distinguish what is stable from what is likely to change.

I think this isn't a big deal. First, the pieces that are 'likely to change' will just use naming conventions or documentation that make this clear to prospective users. Second, if everything is at the top level in a shared dictionary, we can easily find all the dependencies when time comes to make a change, and even edit them if necessary. This greatly mitigates the need for foresight.
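
As a sketch of "find all the dependencies when time comes to make a change", assuming the shared dictionary records which names each function calls (the data below is invented), the set of functions affected by a change is a reverse reachability query:

```python
from collections import defaultdict


def dependents_of(deps, target):
    """Return every function that directly or indirectly uses `target`."""
    # Invert the call table: callee -> set of direct callers.
    callers = defaultdict(set)
    for name, callees in deps.items():
        for callee in callees:
            callers[callee].add(name)

    affected, frontier = set(), [target]
    while frontier:
        for caller in callers[frontier.pop()]:
            if caller not in affected:
                affected.add(caller)
                frontier.append(caller)
    return sorted(affected)


# Hypothetical dictionary contents: function name -> names it calls.
deps = {
    "render_page": ["escape_html", "format_date"],
    "format_date": ["pad_left"],
    "escape_html": [],
    "pad_left": [],
    "unused_helper": ["pad_left"],
}

print(dependents_of(deps, "pad_left"))
# ['format_date', 'render_page', 'unused_helper']
```

The query can only report callers that are actually stored in the dictionary; code kept elsewhere is invisible to it.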

These days, we use a lot of ad-hoc and awful mechanisms involving major and minor version numbers and so on to deal with instability in module interfaces. Programmers are continuously saddled with concerns for "backwards compatibility" and "forwards compatibility" and predictions they aren't very good at making but are punished for getting wrong. It seems to me this entire 'implementation hiding' approach to reusable code is fundamentally broken.

There is still a role for modules and modular interfaces, i.e. to resist accidental coupling between different subprograms. But, for these roles, it seems useful to model modules within the language, i.e. to fully separate modularity from code-reuse, such that there are no 'inner' vs. 'outer' functions, just functions that might or might not be useful for constructing a module.

On the contrary, it's fundamental to SE

Designing for change is a fundamental principle of software engineering, which has been shown to be effective over decades of practical experience and by a huge body of research literature. It's not going to be blown away by a few surface-level arguments.

Naming conventions simply don't work at scale. If you have a platform--like Android, Eclipse, or Windows--and people can use "private" things, they will. As the platform developer, you will then have to remain bug-compatible with all those private things (this happened to various versions of Windows, and it caused *huge* headaches at Microsoft). It's unthinkable that platform developers in industry would even consider an approach based on naming conventions. They want to protect their own ability to evolve the implementation of the platform.

Finding and fixing dependencies doesn't work when there are hundreds or thousands of things that depend on you--especially when many of them are not in the public repository. Again, this doesn't scale up. At best, it works on small codebases maintained by a single small team, that no-one else depends on.

Platforms like the ones I mentioned earlier do do pretty well with compatibility. Mostly, new versions don't break old code. There are of course a few exceptions--but these are far easier to deal with than if people were depending on all sorts of private things. Information hiding isn't perfect, but it works at scale better than any alternative we have.

I don't know what you mean by fully separating modularity from code re-use. If someone is reusing parts of your code that are unstable, they are coupled to it, and their code will break if you change your code. The issues are unavoidably tied together.

Finding and fixing

Finding and fixing dependencies doesn't work when there are hundreds or thousands of things that depend on you--especially when many of them are not in the public repository. Again, this doesn't scale up. At best, it works on small codebases maintained by a single small team, that no-one else depends on.

Surely you've heard how some random Googler can go change protocolbufs and then find problems in and fix the entire Google codebase afterwards before the change is even accepted?

This isn't really a small team at all, but is accomplished with amazing tools* (a global source code repository, really neat building and auto testing tools...). Imagine that kind of workflow being brought to a wiki-scale community? Ya, it is not going to revolutionize closed source corporate-style SE, but definitely at Google scale, and it could bring about something we lacked before (really open libraries).

Refs:

https://mike-bland.com/2011/12/02/coding-and-testing-at-google-2006-vs-2011.html

https://mike-bland.com/2012/10/01/tools.html

No longer

Surely you've heard how some random Googler can go change protocolbufs and then find problems in and fix the entire Google codebase afterwards before the change is even accepted?

Which was eventually discovered to cause more problems than it solves, and is being severely policed these days. Essentially, only special teams are allowed to do global changes. Usually, to get rid of legacy dependencies. And that tends to take several months and PSAs.

Great!

Great! Microsoft has much better odds now. I mean, Google was able to move so fast before with that tech and culture. I'm glad Google also has PHBs who can put a stop to such nonsense.

At scale

Well, at a certain scale, that approach no longer works like this, but rather like this or this.

Having great tools and a global code repository helps

Having great tools and a global code repository helps enormously in the *rare* circumstances where you have to change a public API. It's still a huge headache, as Andreas points out, so you want to make it as rare as possible.

The fundamental issue is that coordination is costly, and building large things (or small things that depend on each other) requires some coordination. We have APIs that hide information to minimize that coordination. We have tools to reduce the cost of the coordination that we can't eliminate.

My point is that the tools

My point is that the tools get better, can get better still, and are mostly limited only by the limits of our imagination. A code wiki is a grand experiment that should happen eventually. By putting everyone in the same namespace, with the ability to change anything and the responsibility to fix everything (+ some tools and language design to help out), will chaos result, or would some order appear?

Reading about the history of the original c2 wiki is quite interesting. Ward started out with a very simple design by today's Wikipedia standards, and it worked quite well. There were problems, but those were fixable through cultural and tooling evolution.

ordering?

Is there not a potential ordering to the kind of changes we make? Some that are currently amenable to automatic push/update. Lots that aren't, but might be in the future if smart people keep working on it? I for sure would rather be in a world where I can break my API and Do It Right (until I see later the next Right Way To Do It, of course) and give out some migration scripts to lessen how mad my clients/consumers will be.

There are different trade

There are different trade offs to starting big vs. starting small. If you had a global repository with all the code in it, you could just go ahead and fix everything you broke. If code was considered not worth fixing, then it could simply be expunged from the repository.

Of course, that is pretty radical...but from that point you could learn lessons that could be applied to more corporate dev (consider how wikis inspired SharePoint, as a bad example).

what about the already extant copies?

I assume I can bud off my own local running copy of stuff? So then when you Change Everything (And Fix It All Up, Too), how the heck am I going to catch up? Won't there be zillions of incompatible images? Like what might happen in Smalltalk/Squeak lands? Would it come with data migrations? Or is there some Erlangy TCO thing that lets people hot swap? But sooner or later there's going to be some change that's too qualitative to swallow in one go, I gotta assume one has to assume.

I am not trying to say NAY, I am trying to understand the vision :-)

Well, you don't have extant

Well, you don't have extant copies...everything could be in the wiki. We don't worry about editing offline copies of Wikipedia, after all.

over-sell and under-deliver

oh. i thought this was going to be Smalltalk-esque, where we had running images that we wanted to upgrade. if this whole thing is just a big version of github then that's kind of a big meh to me.

It's like github where there

It's like github where there is just one big clone of everything (well, no clones, no different versions). It is the exact opposite of revision control actually (since everyone is always at head).

capice

sounds like a worthy experiment at the very least :-)

repositories as dialects and flavors

Rather than one central 'Wikipedia', I imagine these repositories would be more like the many Linux distributions. Choose your flavor or dialect of the language: RedHat, Ubuntu, Suse, Cent, Steam, etc.: mostly similar and familiar, but tweaked and curated by people with different visions for the color of the bike shed. In addition, there will be private repos hosted by companies that wish to protect their intellectual property. But there will be a lot of give and take between repositories.

Assuming a private copy of a repository, you should be able to pull updates to a public repository in a DVCS style. (Conversely, you should be able to offer pull requests or push cherry-picked proposed updates back to your origin wiki.)

Naturally, a pull isn't going to cause problems for the 99.9% of the content in your copy that you haven't touched and might not even be aware exists. But if you have modified content and not pushed it back, there may be a conflict to resolve. And if you have added content to your copy of the repository, it may break because dependencies change, and you may need to deal with name collisions. There would be several general strategies to deal with collisions: replace your version, replace theirs, rename your version, rename theirs. In some languages, attempting to merge updates may be appropriate. Basically, this is a DVCS operation.
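The four collision strategies can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a repository is modeled as a dict from function name to source text, the function `pull` and the `_local` / `_upstream` suffixes are hypothetical, any difference in text counts as a collision (a real tool would compare against a common ancestor), and callers are not rewritten after a rename.

```python
def pull(local, upstream, strategy):
    """Merge upstream definitions into a copy of local.

    local, upstream: dict mapping function name -> source text.
    strategy: 'keep_mine', 'take_theirs', 'rename_mine' or 'rename_theirs'.
    """
    merged = dict(local)
    for name, theirs in upstream.items():
        mine = local.get(name)
        if mine is None or mine == theirs:
            merged[name] = theirs              # no collision
        elif strategy == "keep_mine":          # "replace theirs"
            pass
        elif strategy == "take_theirs":        # "replace your version"
            merged[name] = theirs
        elif strategy == "rename_mine":
            merged[name + "_local"] = mine
            merged[name] = theirs
        elif strategy == "rename_theirs":
            merged[name + "_upstream"] = theirs
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return merged
```

Content you never touched passes straight through the first branch, which matches the expectation that most of a pull is uneventful.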

You should have good automated tools to find where things break - linter, typechecker, unit testing, property testing, fuzz testing, etc. Such tools would be a major part of developing a global repository. If some of your content breaks, you may need to fix that; fortunately, you should also have plenty of examples of how similar problems were fixed in the original repository.

The issue of Smalltalk style 'incompatible images' isn't relevant here, since functions are not stateful objects. Think of multiple repositories more like DVCS clones. Pulling an update from the origin is logically the same as creating a new copy of the origin and figuring out how to patch your changes onto it.

(Note: If your repo is public, a person pushing updates might try to make it easy for you in the form of a proposed patch. Or a person who pays attention to two repos might decide to push a change they like into yours. This won't always happen, of course, but if supported it could happen frequently enough to ease burden on curators.)

Maintaining your own repo would be an investment of time, energy, money. So, a lot of people would just use one of the more popular public repositories. This wouldn't be a bad thing.

version managing functions

What about version tracking functions, so that changes don't affect released production software? You could then generate a dependency graph of which products use the modified function, and ask their release manager to test and approve the new function (or batch these requests for efficiency). Meanwhile new code can use the new version. Also makes rollback easier.
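Roughly what I have in mind, as a Python sketch. Everything here is hypothetical (the class `VersionedStore`, its method names, and the idea that a product "pins" an integer version are assumptions for illustration only):

```python
from collections import defaultdict

class VersionedStore:
    """Toy store: every publish adds a version; products pin the one they ship."""

    def __init__(self):
        self.versions = defaultdict(list)  # name -> bodies; list index = version
        self.pins = defaultdict(dict)      # product -> {name: pinned version}

    def publish(self, name, body):
        self.versions[name].append(body)
        return len(self.versions[name]) - 1

    def pin(self, product, name, version):
        self.pins[product][name] = version

    def lookup(self, product, name):
        """Released products keep seeing the version they pinned."""
        return self.versions[name][self.pins[product][name]]

    def needs_review(self, name):
        """Products still pinned to an older version of `name`."""
        latest = len(self.versions[name]) - 1
        return sorted(p for p, pinned in self.pins.items()
                      if name in pinned and pinned[name] < latest)
```

Rollback is then just re-pinning a product to an earlier version number, and `needs_review` gives the list of release managers to ask.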

atomic update and history

In the implementation I'm developing, I'm addressing some of those concerns. I'm using a logarithmic history, such that old versions of a function remain available. Developers can operate in sessions/workspaces to support atomic updates. I plan for some ability to mark functions stable or frozen, such that any changes that would modify their bytecode will need to go through a special approval process.

Changes won't impact released software in any case. When code is compiled, there is no runtime dependency on the repository. I suppose if you wanted to tag a specific snapshot of a repository or function as important-do-not-destroy (so the logarithmic history doesn't delete versions that are externally important), that could be supported without too much difficulty. This would make it easy for project managers to protect versions they consider important.
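For readers unfamiliar with the idea, here is one way a logarithmic history with protected snapshots could be sketched in Python. This is an illustrative assumption, not the scheme used in the implementation described above: versions are plain integers, at most one version is kept per power-of-two age bucket, and the names `prune` and `history_after` are made up.

```python
def prune(kept, pinned=frozenset()):
    """Drop versions so that at most one survives per power-of-two age bucket.

    kept: sorted list of version numbers, newest last.
    pinned: versions tagged do-not-destroy; they always survive.
    """
    newest = kept[-1]
    survivors = {newest} | (set(pinned) & set(kept))
    seen_buckets = set()
    for v in kept:                      # oldest first: the oldest in a bucket wins
        age = newest - v
        if age == 0:
            continue
        bucket = age.bit_length() - 1   # floor(log2(age))
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)
            survivors.add(v)
    return sorted(survivors)

def history_after(n, pinned=frozenset()):
    """Versions remaining after n successive commits, pruning after each one."""
    kept = []
    for v in range(n):
        kept = prune(kept + [v], pinned)
    return kept
```

After nine commits this keeps versions 0, 4, 6, 7 and 8: recent history is dense and older history thins out, while a pinned version stays no matter how old it gets.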

I thought about supporting branching, but I decided that wasn't a feature I really want to enable on a path of low resistance. So sessions are more like short-lived, miniature branches for a single user (or small group, if the user shares the session caps), while true branches require cloning a repository (which isn't too difficult, but should eventually involve several gigabytes of history). OTOH, I do want good support for cherry picking content and sharing histories/etymology between repos.

question the fundamentals

To clarify, I haven't suggested that we shouldn't "design for change". Rather, I believe that a global namespace of functions would be a more effective "design for change" than the module-based packaging and implementation hiding we conventionally use today.

But that aside, I think most "SE fundamentals" are very dubious. Structured programming, for example: don't need it if you have substructural types. A lot of software engineering ideas work just as well if you take the advice and do the exact opposite. Something to do with duality, I think.

From Sturgeon's law, 90% of SE fundamentals are crap. Question them.

Naming conventions simply don't work at scale.

They work fine. Murphy's law also applies, of course: people will do anything they can do. But naming conventions still work. The job of naming conventions isn't to prevent people from using private names, just to make them think twice or thrice about it.

Also, as I mentioned earlier, I think abstractions should be secured by first-class security concepts (such as capability security, value sealing) rather than third-class concepts like where a function is defined. The idea of 'private data' should be completely separate from the functions in the codebase. If you've done this, then the remaining reasons for 'private functions' are because they're either unstable or thought unsuitable for reuse.
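As an illustration of value sealing, here is a small sealer/unsealer pair in Python. Treat it as a sketch of the pattern only: `make_brand`, `seal` and `unseal` are hypothetical names, and Python cannot actually enforce the boundary (closures can be inspected), so a language with real capability discipline is assumed for the security claim to hold.

```python
def make_brand():
    """Return a (seal, unseal) pair; only the matching unseal opens a sealed box."""
    class Box:
        __slots__ = ()      # the box itself carries no readable payload

    contents = {}           # id(box) -> (box, value), private to this closure

    def seal(value):
        box = Box()
        contents[id(box)] = (box, value)   # holding the box keeps its id unique
        return box

    def unseal(box):
        entry = contents.get(id(box))
        if entry is None or entry[0] is not box:
            raise ValueError("box was not sealed by this brand")
        return entry[1]

    return seal, unseal
```

The point is that who may read the data depends on who holds `unseal`, not on which module or scope a function happens to be defined in. Any function in a global codebase can pass the box around without being able to open it.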

Finding and fixing dependencies doesn't work when there are hundreds or thousands of things that depend on you

An advantage of a global repository is that you'll have ready access to data like "Oh, a thousand things depend on this. Maybe it would be better to start a parallel project rather than destructively modify this one." Meanwhile, the developer whose abstractions are used in just a few other places can feel confident to make breaking changes and just repair the clients.

Data driven decisions are much nicer than trying to predict the future or manage ad-hoc version numbers in a package dependency system.
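A toy Python version of that query, assuming the repository is just a dictionary from function name to source text. The contents of `REPO`, the helper names and the threshold are hypothetical, and matching names textually is a crude stand-in for real name resolution.

```python
import re

# Hypothetical repository: function name -> source text of its definition.
REPO = {
    "area":      "def area(w, h): return mul(w, h)",
    "mul":       "def mul(a, b): return a * b",
    "perimeter": "def perimeter(w, h): return mul(2, add(w, h))",
    "add":       "def add(a, b): return a + b",
}

def dependents(target, repo=REPO):
    """Names of definitions whose source mentions `target` as a whole word."""
    word = re.compile(r"\b" + re.escape(target) + r"\b")
    return sorted(n for n, src in repo.items()
                  if n != target and word.search(src))

def advice(target, repo=REPO, threshold=1000):
    """Choose between editing in place and starting a parallel project."""
    count = len(dependents(target, repo))
    return "fork" if count >= threshold else "edit and repair clients"
```

Here `dependents("mul")` lists `area` and `perimeter`, so the author of `mul` knows exactly which clients to repair before making a breaking change.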

Whether a thousand is too many to fix depends on how sophisticated the change, and how much help or automation is available. I think, in many cases, even a thousand changes is viable.

especially when many [things that depend on you] are not in the public repository

We only need consistency within each repository. We aren't installing 'packages' of functions that might exist in a combinatorial number of present and future version configurations. We don't need to worry about breaking things we cannot test or see.

If a group wants to pull updates into a private repository, they'll need to make any local changes to achieve consistency in their own repo. They'll have some advantages when doing so: they'll have examples for how you fixed everything else, and they won't need to go through a package maintainer.

Of course, if people get tired of doing their own maintenance on private repositories, they can always choose to open source and share it.

I don't know what you mean by fully separating modularity from code re-use.

In a good language, you can model objects or abstractions that are inherently loosely coupled, modular, and hide information. Examples include process automata, stream processors, pure functions, scene-graphs, mobile ambients. Modularity, then, arises from abstracting first-class modular systems.

It is true that code in the module may break if you change some unstable code that it depends upon. So we can have coupling between a module and the codebase. But, unless the module abstraction itself was broken (and a type checker or linter should find that) this is relatively weak coupling. Information hiding will still be intact, for example, and the module would remain weakly coupled to other modules in the same configuration.

There is a big difference!

Most people have likely moved on from this strand of the discussion but....

There is very little difference between designing Plan9 OS vs. designing a language around Plan9's abstraction of processes and state.

There is a big difference. An OS simply provides a programming interface (API) to create/access/manipulate a set of resources. And it manages these resources (real or virtual). They can be accessed from pretty much any language that can access the API. In fact, the system call interface has evolved from the model of extending the processor's machine instruction set (System/360 used SVC, TOPS-10 used UUO -- unimplemented user operations etc.). Such traps then put the machine in a higher privileged state where it had access to state not accessible from the "user mode". A bit like running a microcoded instruction. And there is certainly a big difference between designing a processor's instruction set and a language around it!

It seems to me that the poor relationship between language and OS today is the source of a massive amount of accidental complexity. Addressing this really could become a silver bullet. There are a lot of problems worth addressing in PL. Systems problems (security, distribution, deployment, etc.) are certainly among them.

I see this differently. We are stuck with the Tower of Babel of programming languages. This is because there is no universal language that will work well in very different environments for totally diverse purposes. The needs of a high speed TCP stack are very different from the needs of an AI system or of a computer algebra system or of a Mars rover control system or of an industrial PLC system. Perhaps C came the closest to a universal language but we are well aware of problems with it. Similarly there is no universal OS. Maybe things would've been somewhat better if Intel had succeeded with iAPX 432, their all singing all dancing, object oriented, capability based processor, instead of x86 but they didn't. And even if they did, there'd still be a lot of places where it wouldn't have been a good fit.

I think it is less a question of which PL or which OS that is better but of which system architecture is better. A "wider spectrum" system architecture will allow one to build systems for a wider set of end uses. Once you have that it will serve well with a variety of PLs. [I have a partial implementation of a Scheme interface to plan9 which I found easier to use than rc or C]. PLs should certainly do a better job of dealing with security etc. but it is not going to be enough. Unfortunately languages such as E are still not used much.

IMO, the languages that provide the best comprehension are those that help me control and isolate effects and relationships without reading much code, e.g. through the type system, capability security model, immutability by default, confluence, idempotence, or certain forms of tooling (e.g. graphing a function). And very few PLs are optimized for this goal.

I don't disagree with some of the goals but I don't think it is going to make much of a difference. A loosely coupled distributed system can be a very complex beast and visualizing various flows, debugging problems spanning more than one node, measuring usage & growth & anomalous traffic patterns, systemic or emergent problems etc. etc. -- I think the interesting work here can be done in a number of different programming languages and there is precious little you can do from a PL design perspective that will make much of a difference. It is all duct tape and tooling and more duct tape : )

Similarities

Your first paragraph, rephrased a bit to emphasize my point about similarity:

"A language provides programming abstractions (Types, Objects) to create/access/manipulate a set of resources. And it manages these resources (real or virtual). They can be accessed from pretty much any embedded language interpreter or DSL that can access the abstraction. In fact, the language itself can be understood as building upon an even lower-level language understood by the CPU. That CPU language even has some badly designed, modal security properties!"

There have been instruction sets designed with many nice features that are conventionally considered language level. The B5000 machine is an especially interesting example, and is indicated to have inspired the capability security concept (Robust Composition, ch24).

Forth is a fine example of a language whose low level words essentially become an instruction set. Other languages, such as Scheme, build above a small set of primitive operations designed almost independently of the underlying processor.

My Awelon project languages take inspiration from Forth and Scheme. Awelon bytecode is my small set of primitive operations. Awelon object language builds upon these in a transparently thin layer; words can easily be treated as higher-level instructions. The vast majority of my language design effort was oriented around deciding which properties I want (capability security, easy serialization and distribution, parallelism, streaming, easy support for genetic programming, etc.) and designing a bytecode with those properties.

Modern CPU instruction sets may have been designed by people who are more philosophically aligned with Bjarne Stroustrup than Guy Steele or Chuck Moore or Mark Miller. But, if you ever try to build your own instruction set, you'll quickly find that language design is certainly involved.

(I'll separate my response to your other points.)

The distinction I see

The distinction I see is that a language provides basic nuts and bolts or glue and basic materials, while an OS provides useful subassemblies. Another analogy: a programming language is like the visual language of schematics: it provides basic shapes that can be attached in a certain way and also a way to connect to previously defined circuits. An OS in effect provides a set of such more complicated but very useful circuits. More generally, so does any code library.

It obviously makes sense to try to best express properties we deem important (such as security, parallelism, modularization etc.) in a language but that is like providing better quality glue or nuts and bolts that don't crack under severe stress or stronger material. It may even make sense to make it really easy to put together some pretty useful prebuilt pieces (such as APL does for arrays). But by nature a system is a much more complex beast. If you try to put all that complexity in a language it will not be as easy to use or flexible.

What I am trying to say is that we need far more systems research, not just PL research. And we need both to influence each other and that doesn't seem to be happening much. In my view the "sprawl" everyone complains about is because we don't seem to be as clear eyed about system design as we are (or seem to be) about language design. The "poor relationship between OS and languages" that you talked about earlier is a symptom of this.

DeRemer and Kron's "programming-in-the-large vs programming-in-the-small" paper is almost 40 years old but it seems we haven't made much progress on the first part in their title. We have of course built very large systems but most are inefficient, buggy and incomprehensible dinosaurs of systems and every bug fix, extension, new feature increases their entropy and confusion. And I just don't think we are going to fix that with further PL research.

questionable generalizations

Your description of languages and operating systems is not unsuitable for many languages and uses thereof in the mainstream. But if you spend some time looking outside of the mainstream - e.g. into flow-based programming, REBOL/Red, Wolfram Language, ToonTalk, Croquet Project, Morphic, ColorForth - you can find plenty of examples that fall outside your generalizations, where a language provides a more holistic environment or libraries of useful subassemblies.

The "distinction you see" is neither universal nor fundamental.

by nature a system is a much more complex beast. If you try to put all that complexity in a language it will not be as easy to use or flexible

Which are more "usable and flexible" today: the popular operating systems, or the C/C++ languages from which they're constructed? What makes you believe that supporting assembly would hurt usability or flexibility?

universal languages and system architecture

there is no universal language that will work well in very different environments for totally diverse purposes [TCP stacks, AI, robot control systems, PLC]

I think there are many languages that would work well for all these scenarios, including but not limited to Java, Erlang, Smalltalk, Clojure. I'll clarify this point. Such languages might work well, but they won't always work optimally. Further, they'll work optimally for performance in a great many scenarios where we wouldn't expect them to do so, but might take a hit syntactically or with regards to accessible abstractions. This works as follows:

  • An interesting feature I've seen most frequently in Smalltalk bootstraps, and more recently in JavaScript (e.g. with the "use asm" annotation), is to identify a subset of the language that can be compiled to very efficient machine code, and to recognize or annotate subprograms or modules that are encoded in just this subset of the language. Using this technique, it is quite feasible to leverage a high-level language to describe low-level operations, or even code that can be shifted to a GPGPU or FPGA.
  • Similarly, I've also seen high-level languages like Haskell and Coq directly model DSLs or abstractions and types for OpenGL pipelines, GPGPU computing, modeling doubles with all their quirks. The Khronos group has worked on WebCL and WebGL embeddings for high-performance JavaScript.

Even C has been tweaked with such techniques, e.g. annotated C for MPI parallelism or restricted C for GPGPU computing.
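The "recognize a compilable subset" step can be sketched in Python with the standard `ast` module. The particular subset chosen here (arithmetic over arguments and constants) and the name `in_fast_subset` are assumptions for illustration; a real system would go on to hand accepted functions to a native code generator, which this sketch does not do.

```python
import ast

# Assumed "fast subset": a function that only does arithmetic on its
# arguments and on constants. Anything else (calls, loops, ...) is rejected.
ALLOWED = (ast.Module, ast.FunctionDef, ast.arguments, ast.arg, ast.Return,
           ast.BinOp, ast.UnaryOp, ast.Name, ast.Constant, ast.Load,
           ast.Add, ast.Sub, ast.Mult, ast.USub)

def in_fast_subset(source):
    """True if every syntax node in `source` belongs to the allowed subset."""
    tree = ast.parse(source)
    return all(isinstance(node, ALLOWED) for node in ast.walk(tree))
```

For example, `in_fast_subset("def f(x, y):\n    return x * y + 1")` is true, while a body that calls `print` is rejected because `ast.Call` is not in the subset. The check is purely syntactic, which is what makes it cheap enough to run over a whole codebase.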

Based on these historical and ongoing observations, I believe the performance aspect is pretty well covered. Performance will take care of itself, eventually, and won't even need to be hacked-in if we have enough language design foresight to support annotations, easy partial evaluation and other optimizations, and an easily extensible compiler.

The reason high level languages today don't make good operating systems has little to do with performance or suitability to specific problems. Rather, it is because they fail to handle various cross-cutting concerns, common to all problem domains, that are historically the province of operating systems: persistence, concurrency, security, process control, software installation and update.

That is, most languages are badly designed when evaluated as operating systems. Similarly, most operating systems are badly designed when evaluated as languages. Much accidental complexity ensues.

less a question of which PL or which OS that is better but of which system architecture is better

I agree. Assuming that language should not be separated from the OS, then the whole question of PL design reduces to system architecture design.

Studying object capability model taught me that "separation of concerns" isn't always good, that not all concerns can be or should be separated; a tight coupling of designation and authority can greatly simplify our systems and our reasoning, avoid accidental complexity, result in more expressive systems.

Separation of language design from system architecture design is a mistake of similar magnitude. And we're paying for it. And most people won't recognize this without the benefit of hindsight. It is too easy, given normal human biases, to take "what is" as "the natural way of things", despite the fact that there is nothing natural about programming languages or operating systems.

A "wider spectrum" system architecture will allow one to build systems for a wider set of end uses. Once you have that it will serve well with a variety of PLs.

We don't benefit from a variety of general-purpose PLs. This isn't an intrinsically good property that should be targeted during design of a system. Indeed, I feel judging a system based on how many languages it supports is similar to judging quality of software based on how many lines of code it has.

there is precious little you can do from a PL design perspective that will make much of a difference. It is all duct tape and tooling and more duct tape : )

That's a very self-fulfilling position.

As you might expect, I disagree. : )

Perhaps you should look into languages designed to support duct tape and tooling. They call themselves "orchestration languages", such as Orc. My RDP paradigm has some similarities with these orchestration languages, and is able to represent services, frameworks, and overlay networks with the same first-class abstractions that OO paradigm languages manage for individual objects or FP for individual functions. Resources/tools/capabilities and reactive duct tape are primitive abstractions.