Open wiki-like code repository

Here is a crazy thought experiment:

Take your favorite programming language and create a project in your favorite IDE (say C# and Visual Studio, but Scala and Eclipse or Haskell and Emacs would also work). Now imagine that this project is shared by you and a thousand of your closest programming friends. Everyone can edit the project, and contributions are pushed to you in real time without any vetting. The idea is to collaborate on building a library in an organic way, without a centralized maintainer/gatekeeper/project manager. Instead, you and everyone else review code changes made by other people, revert when vandalism occurs, and ensure that standards are met by people you don't know very well and who may just be casual contributors. Yes, like a wiki.

Now, here are some of the questions that I can think of:

  • Could a useful/usable library ever result from large-scale de-centralized collaboration?
  • If someone added functionality to the library that would be useful to me, how could I be made aware of that functionality? Likewise, how could I add functionality so that others could find it (answer might be language specific).
  • Can community review be enough to ensure quality and, say, security; e.g., that someone doesn't insert a virus into the codebase? Also, that contributed code is not copied from somewhere else with an incompatible license?
  • Would certain programming languages work better than others in a code wiki? For example, would strong static typing hinder massive collaboration because it requires too much pre-planning, or help because it ensures some consistency between contributions?
  • If the library is continuously changing (no static releases), how would it be feasible to take a dependency on the library?

Not really expecting answers, but thoughts and other questions.

David Barbour on c2

Although I think I had the same idea at roughly the same time as David, I did not post it to the Internet. My friend Luke Breuer has similar ideas and has posted them to the Internet, but I can't find the links (Luke's goals are more in the vein of a Chandler-like personal information management system, but with some collaborative, knowledge-sharing twists).

David's ideas are on the c2 wiki. Use Google search with the inurl: scoping operator to find David writing up a storm of ideas.

Suffice it to say, David has already highlighted here on LtU many times why you need automated code distribution in the environment for this to work. It's an essential requirement, and just making things a wiki doesn't solve this fundamental problem.

David and I have danced around similar issues here on LtU in the past - for example, when he asked me what I meant by "auditing machine" I clarified it concisely. To be clear, the auditing machine idea is not new. It was invented by the group who designed the Vesta Software Configuration Management system. Vesta did not call it an auditing machine; that is just a term I coined to describe the general idea behind Vesta's three main principles. BTW, the Springer-Verlag book describing Vesta is easily one of the best software architecture books you will ever read. The people behind the project are a who's who of great programmers, too: off the top of my head, Paul McJones and John Ellis (who once won the ACM SIGPLAN award for Best Doctoral Dissertation for his thesis on optimizing VLIW compilers). Butler Lampson also contributed design wisdom, but was not credited as an implementor.

Interesting challenge such a scenario would be for a PL I guess

Now, sorry if I disappoint, but... regarding any of the PLs I know or can think of, and given the way it's stated, I'm at a total loss to give an answer to those points (1) to (5) that would be useful to you anyhow :(

You indeed don't put many constraints on the input, while you're quite demanding about the outcome's "requirements"! Which, anyway, doesn't mean I find it irrelevant or uninteresting. It is very interesting, actually, IMO, but just too much for me for now...

No doubt others will be more inspired, though. I'm already curious where and how. :)

But hey, I can also relate quite a bit, nevertheless, especially to your (1), (2) and (4), where I am indeed currently occupied with addressing very similar dimensions of the overall problem, under an analogous feature set you could appeal to.

But in my case, it's kind of "easier", if you only dare to see it that way(*), as it involves the usage context of a modeling-to-artifact tool chain, with an emphasis on DSMLs, instead of a given PL with its interested community and available tooling, as in your contemplation.

(*) (and, yes, yes, I believe one can do so to a large extent, there)

Related Discussion

You might recall our discussion regarding 'influence of cognitive models', in which I express the position that social (aggregate) models of program development are more important than cognitive models. This is based in part on the idea that you are a stranger to your own code after even a moderate period of time.

I've been pursuing the basic idea of wiki-based development, but my goals have been a bit broader - to support full development on the 'wiki', including testing and debugging and integration of code. Feel free to peruse WikiIde [c2], QedWiki [c2], and links you might follow from there.

Some of those motivations apply even to the far more limited case you describe. For example, I would greatly like to see better support for cross-project refactoring (i.e. where a library developer can freely fix most projects that use the library after an API change). I would like a flatter, more global namespace to better support sharing of code among projects. I would like automated unit and integration testing whenever a dependency changes, and nice big alerts telling everyone what your changes broke and therefore needs fixing.

For security and safe automated testing, I would strongly suggest favoring a capability language as the basis. This would also help for automatic distribution and for using the wiki as an IDE (i.e. where one can execute and debug persistent projects on a remote 'wiki cloud' to which your local code-browser is just an extension).

My thoughts on your questions:

Could a useful/usable library ever result from large-scale de-centralized collaboration? Yes, though vandalism will need to be controlled. Of course, if one is to achieve "large-scale" collaboration, the language will need to do much to support composition. Most languages reach limits of composition long before resource exhaustion due to issues with concurrency management (e.g. deadlock risk), authority management, privacy issues, policy injection, partial failure recovery, cyclic dependencies under manifest typing, performance risk, and so on.

If someone added functionality to the library that would be useful to me, how could I be made aware of that functionality? Likewise, how could I add functionality so that others could find it (answer might be language specific).

A language may be designed to automatically leverage new functionality, via goal-based composition where one develops 'strategies' to achieve 'goals' and the runtime (or compiler) chooses strategies based on their properties and degree of specialization. A lesser form of this would be to develop with multi-methods as the default. A new multi-method or strategy could immediately be leveraged if it applies and is the best match for the inputs and context.
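
To make the multi-method point concrete, here is a minimal sketch (Python, with made-up names; not any particular language's actual dispatch mechanism) of best-match dispatch, where a newly contributed, more specialized implementation is picked up by existing call sites without any changes on their side:

    _registry = {}  # method name -> list of (type signature, implementation)

    def defmethod(name, *sig):
        """Contributors register an implementation of `name` for the types in `sig`."""
        def register(fn):
            _registry.setdefault(name, []).append((sig, fn))
            return fn
        return register

    def call(name, *args):
        """Dispatch to the most specific applicable implementation."""
        applicable = [(sig, fn) for sig, fn in _registry.get(name, [])
                      if len(sig) == len(args)
                      and all(isinstance(a, t) for a, t in zip(args, sig))]
        if not applicable:
            raise TypeError("no applicable method for " + name)
        # "More specific" = every parameter type is a subclass of the current best.
        best_sig, best_fn = applicable[0]
        for sig, fn in applicable[1:]:
            if all(issubclass(t, b) for t, b in zip(sig, best_sig)):
                best_sig, best_fn = sig, fn
        return best_fn(*args)

    @defmethod("render", object)
    def render_any(x):
        return repr(x)

    # Someone else later contributes a more specialized method; existing
    # call sites pick it up immediately, with no changes on their side.
    @defmethod("render", dict)
    def render_dict(d):
        return ", ".join("%s=%s" % (k, v) for k, v in d.items())

    print(call("render", 42))         # falls back to render_any
    print(call("render", {"a": 1}))   # now uses the newer render_dict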

More traditionally, an integrated indexed 'help' system that can reference documentation throughout the wiki codebase, support for twitter/RSS/e-mail subscriptions to changes and breakages and new users of code, and maintaining dedicated advice and example documentation pages (how do I XYZ?) would all help with this. Even better, the wiki could integrate IRC and collaborative editing so that people may teach others how to use a somewhat complicated API, remotely.

In general, you'll be unable to keep up with everything, and new programmers and new projects will tend to reinvent things as an important part of a learning and exploratory programming process. Reinvention is important! So embrace the fact that some concepts will be reinvented over and over, and focus on providing better support for 'refactoring' these repeated reinventions and supporting developers in enhancing documentation in order to better encourage reuse. (The ability to reference different versions of the same page would be quite useful for documenting development process.)

Can community review be enough to ensure quality and, say, security; e.g., that someone doesn't insert a virus into the codebase? No. The language needs to support security (protection of authority and privacy). The language needs to support static analysis and automated testing. Interested parties will be able to maintain quality to some degree, but it would require fascist edit policies if the overhead is to be kept reasonable. Reading code takes time, and finding the significance of a small or subtle edit can take more effort than even interested parties are willing to put forth, and you can't really trust per-edit documentation. Community review must be the last line of defense, not the first.

Also, that contributed code is not copied from somewhere else with an incompatible license? You can't protect against this technologically, so you must do it socially, e.g. via ensuring edits are trackable to vested online identities (users must be able to lose something if they violate the rules).

Would certain programming languages work better than others in a code wiki? Yes. A language with good security, safety, and automated testing facilities can go a long way toward improving stability. It should also be easy to link pages via names, and easy to refactor (i.e. to find all users of a given function).

A language that effectively integrates with live systems would also be useful, since that form of separate compilation meets the natural boundaries of the code visible on the wiki. Rather than separate compilation of 'libraries', one uses separate services. To make separate services really efficient, support for automatic distribution of code and related properties (persistence, self-healing) would be very useful because you could pull parts of remote services over to the local host and vice versa.

Would strong static typing hinder massive collaboration because it requires too much pre-planning, or help because it ensures some consistency between contributions?

I would think static typing would help, at least so long as you can continue to run with incorrectly typed 'dead code' and have good defaults. But manifest typing would be a major hindrance for composition. Nominative typing is problematic if one is going the 'live services' route. If you're going to have static typing, ideally you'd want something like static duck typing.
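
For what "static duck typing" might look like in practice, here is a minimal sketch using Python's typing.Protocol (a stand-in for whatever the wiki's language would actually provide; the class names are made up):

    from typing import Protocol

    class Renderable(Protocol):
        def render(self) -> str: ...

    def publish(page: Renderable) -> None:
        print(page.render())

    # A contributor's class never names (or even knows about) Renderable;
    # a structural checker such as mypy accepts it because the shape matches,
    # so independently written pages compose without shared pre-planning.
    class WikiPage:
        def __init__(self, body: str) -> None:
            self.body = body

        def render(self) -> str:
            return self.body

    publish(WikiPage("hello, wiki"))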

If you're going to have hundreds of programmers dipping their fingers into the pie, you'll need as much support for consistency as you can possibly get. That includes automated analyses of all sorts.

If the library is continuously changing (no static releases), how would it be feasible to take a dependency on the library?

To hold under intermediate changes, I would suggest that the wiki track the last working version of dependencies from any given page (where 'working' means 'passes all static analyses and automated tests'). This would encourage development of tests, and would allow coders to continue programming even as dependencies are being refactored. The wiki could raise warnings when the most recent version of a page is not in use.
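
A rough sketch of that "pin to the last working version" policy, assuming a hypothetical in-memory wiki store and a pass/fail test hook:

    class Wiki:
        def __init__(self):
            self.versions = {}   # page name -> list of page bodies (version history)
            self.last_good = {}  # page name -> index of the last version passing tests

        def commit(self, name, body, passes_tests):
            self.versions.setdefault(name, []).append(body)
            if passes_tests:
                self.last_good[name] = len(self.versions[name]) - 1

        def resolve(self, name):
            """Dependents build against the last working version, not the newest edit."""
            good = self.last_good.get(name)
            if good is None:
                raise LookupError(name + " has no version that passes its tests yet")
            if good != len(self.versions[name]) - 1:
                print("warning: " + name + " head is newer than the last working version")
            return self.versions[name][good]

    wiki = Wiki()
    wiki.commit("json/parse", "v1 source", passes_tests=True)
    wiki.commit("json/parse", "v2 source (refactor in progress)", passes_tests=False)
    print(wiki.resolve("json/parse"))   # "v1 source", plus a warning about the newer head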

For long term security and quality assurance, the wiki should support PKI-signature approval of versions of pages. Users or organizations could then mark certain signatories as trusted, thus creating a web of trust above the wiki. A new project would compile (or interpret) against trusted versions of code pages. This approach has the natural advantage of encouraging multiple code reviews (you'd need to go in, review, and sign any dependencies yourself if your trusted dependencies have inconsistent versioning properties) and would create an economy for professional code reviewers. Further, this could help produce 'vested' identities that would resist copyright violations.
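
A sketch of how that web of trust might be resolved; real PKI signatures are stood in for here by a simple record of (signer, version hash), so this only illustrates the resolution logic, not the cryptography:

    import hashlib

    def version_hash(body):
        return hashlib.sha256(body.encode()).hexdigest()

    approvals = {}   # version hash -> set of identities who reviewed and signed it

    def sign(signer, body):
        approvals.setdefault(version_hash(body), set()).add(signer)

    def trusted_version(history, trusted_signers):
        """Pick the newest version of a page approved by someone we trust."""
        for body in reversed(history):
            if approvals.get(version_hash(body), set()) & trusted_signers:
                return body
        return None   # nothing trustworthy yet: go review and sign a version yourself

    history = ["v1 of page", "v2 of page"]
    sign("alice", "v1 of page")
    print(trusted_version(history, {"alice", "bob"}))   # -> "v1 of page"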

You might recall our

You might recall our discussion regarding 'influence of cognitive models', in which I express the position that social (aggregate) models of program development are more important than cognitive models. This is based in part on the idea that you are a stranger to your own code after even a moderate period of time.

Basically, I want a software development lifecycle harness that has the same properties as an automated truth maintenance system - this includes forcing queuing of events such as check-ins. While a massively collaborative system (such as Etherpad) can allow users to apply operational transforms to a document in real time, macro operations that affect the actual SDLC of a project must be fenced with quality gates.

I've been pursuing the basic idea of wiki-based development, but my goals have been a bit broader - to support full development on the 'wiki', including testing and debugging and integration of code.

I think my goal simply stated is to obsolete the concept of a Software Development Kit. That would be truly open source at a fine-grained level, allowing modular composition of programs, frameworks, product lines, factories, etc. Build your own appliance - like an Arduino or Bug Labs. I'm obviously extremely closed-minded and visionary about this, though, and don't expect you to agree.

I did my Google due

I did my Google due diligence, but my thought was simpler than a new different wiki programming system with new technology and such. Basically, I'm wondering if it would be useful to just take the existing project model, and allow everyone to share the project. Then add some review on top of that, and...not much else.

Or to put it in Ward Cunningham's words: what is "the simplest online programming system that could possibly work?"

Have you seen github or

Have you seen github or gitorious?

The Git Object Model for version control is what you want.

However, I reject 'simplest' here as a good solution. Experience tells me what I want from a programming system.

Have you seen github or

Have you seen github or gitorious?

Ah, but this seems only slightly better than SourceForge or CodePlex. They still have separate projects, which seems kind of annoying to me. To get scale, I think one "true" project would be needed, like an encyclopedia of all the code that you could need if you were programming in Java or C#. Or maybe I'm just being crazy...

However, I reject 'simplest' here as a good solution. Experience tells me what I want from a programming system.

could also be read as "Experience tells me what I want from an online database" - but then consider how the simplest online database that could possibly work has been wildly successful. Then again, experience tends to bias us, which is why our scientific innovation careers are mostly over by the time we hit 30.

That wasn't meant to be inflammatory. I'm just trying to keep my mind as wide open as possible, since I'm not sure what my own pre-conceived biases will cause me to miss.

Distribution != Federation

What you're complaining about is that its dynamic federation mechanism is tightly coupled to its dynamic distribution mechanism. Nobody owns a 'project' that everyone commits to on these sites. Instead, everyone effectively owns what Microsoft calls a 'Virtual Build Lab'. Not having a complete software development lifecycle harness is the raison d'être for things like a VBL and the SNAP (Shiny New Automation Process) systems used at Microsoft. [Edit: Reverse integration and forward integration then map directly to branching and merging generic operations.]

To get scale, I think one "true" project would be needed

No, to get scale, modularity and composition are needed. As systems grow larger and larger, how you construct your arches will be the dominating factor. People just don't matter when it comes to scaling software, unless you let them by allowing them to substitute bureaucracy for process to disguise their incompetence. That's why you need rigorous quality gates and well-defined interfaces.

If the maintainer of the "golden tree" containing the "golden bits", such as Andrew Morton's kernel branch or Torvalds's kernel branch, doesn't like your patch, then you can still distribute your patch. The idea of one "true" project goes very much against this. -- I can provide examples of how projects in the large tend to sacrifice goals that can be beneficial for a small community for the sake of overall simplicity: the AllocStream Facility was removed from gcc because it provided no benefit to Fortran users at the time, and it was argued that this could best be shoved into a kernel or elsewhere.

The 'magic trick' part is code refactoring. This is a very tough problem to solve, especially efficiently, and especially for update-in-place for live code. Only one scheme I know of has ever been suggested, and it was in 1992! NReversal of Fortune: The Thermodynamics of Garbage Collection, by Henry Baker. Since then, people have cited this paper, but in the narrower context of time-travel debugging, which is more specific than the general ideas Baker proposed.

Besides, for any given project, if the object model is not designed well, then it will be essentially impossible to know if your encyclopedia is 'finished' so that you can 'start writing a paper' based on your encyclopedia as the authority.

Trivial example:

If an Object is not responsible for knowing its own validation, then the definition of validation will be scattered across the system, and you now have to use search-based software engineering tools to find this code. What happens if there are two functions that both take a data type and perform some 'DoValidation' on it? What just actually happened to your program? By placing data and methods together, you solve this problem trivially. You can't make bad code into good code!
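
A small sketch of the difference (Python, with invented names); the point is only where the rule lives, not the rule itself:

    class Invoice:
        def __init__(self, amount, currency):
            self.amount = amount
            self.currency = currency

        def validate(self):
            """The invoice itself knows what makes it valid: one place to look."""
            if self.amount <= 0:
                raise ValueError("amount must be positive")
            if self.currency not in {"USD", "EUR"}:
                raise ValueError("unsupported currency: " + self.currency)

    # Contrast: two free functions elsewhere in the codebase that both claim to
    # validate an invoice, and quietly disagree -- the scattering described above.
    def do_validation_a(invoice):
        return invoice.amount > 0

    def do_validation_b(invoice):
        return invoice.currency == "USD"   # silently stricter than do_validation_a

    Invoice(10, "EUR").validate()          # the authoritative definition of "valid"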

No, to get scale, modularly

No, to get scale, modularity and composition are needed. As systems grow larger and larger, how you construct your arches will be the dominating factor. People just don't matter when it comes to scaling software, unless you let them by allowing them to substitute bureaucracy for process to disguise their incompetence. That's why you need rigorous quality gates and well-defined interfaces.

Is it that people don't scale, or process doesn't scale with more people? I think, according to the way things are right now, you are definitely right. But if people could consume and contribute code in smaller chunks, something that existing module systems don't really handle very well, what changes?

The AllocStream Facility was removed from gcc because it provided no benefit to Fortran users at the time, and it was argued that this could best be shoved into a kernel or elsewhere.

The wiki way would be to just include everything that was semi-sensible and not harmful, and have a way of filtering out the noise.

The 'magic trick' part is code refactoring. This is a very tough problem to solve, especially efficiently, and especially for update-in-place for live code.

I would think this paper is related, but that is a nice link at any rate. I'm not sure if code refactoring is the real problem, or at least...the problem is refactoring code while other people depend on it. But in my experience, this is a hard problem, and the best way to deal with it is to break everything and let everyone clean up the mess...or...since everything is going into the wiki project anyways, just clean up all the code that you break when you make a change.

I've thought about this a lot also; maybe the object model should be fixed by the language somehow... (sort of like COM)

The wiki way would be to

The wiki way would be to just include everything that was semi-sensible and not harmful, and have a way of filtering out the noise.

By default, this does not take into account metaprogramming models including supercompilation. I wish I had a really good paper to link here, but I don't. This is an underdeveloped area of PLT, at least in my own brain if not in the world at large.

Good question

But if people could consume and contribute code in smaller chunks, something that existing module systems don't really handle very well, what changes?

I missed this question. It was a very good question.

The answer here is you are referring to something like pluggable types and Jigsaw.

First, not all languages may have good enough module system support.

Second, many languages have to interface with systems with their own bizarre module hosting conventions. Your language runtime Host may not be as expressive as your source language.

For example, we use SQL CLR Integration Services a lot here at work in order to extend SQL Server in cute ways that save us countless man-hours. I think we were one of the first software shops to really push the use of this technology, and what we use it for is well beyond the scope of what I've seen published in trade press sites like SqlServerCentral.com. I also know from experience working with JVM integration with Oracle that nobody does the sort of stuff we do (we're not geniuses, just insane). But part of how we empower this creativity is working around SQL CLR's bizarre deployment model, where you have to conform to the TSQL procedural programming interface model. Sometimes this requires changing the source code to instrument it with compiler conditional directives for side-by-side deployment. Other times it means dynamically reshaping your assembly prior to deployment. Either way, our library assets for what we deploy to SQL Server (and transparently re-use the code in other environments!) result in less than simple source code files. Is it modular? Does it conform to best practices in module systems? I'd say so. Is the source text a mess in the sense that it has multiple sets of dependencies? Sort of.

Do people matter at all in scaling up this process? No.

Can we scale down this process? We've scaled it down as much as possible, already. We've done so through analyzing modularity and composition needs.

PageRank as the mechanism for identifying a golden tree

I think there are some cool ideas being pitched right now in the ICSE realm of computer science, such as search-based software engineering.

I think information retrieval and use of statistics can be applied to tell you interesting things about a social network. If one person is maintaining a branch with a lot of reverse integrations from others and performing forward integrations, what does that tell you?

The cool part is you don't even need to directly integrate from that branch. You can cache it locally, and do the operation from your local disk to the site. It would be slower, but you would first get to test it locally in your own private test environment. -- Would you really trust Google or Microsoft or just whoever to do quality gates on your code? Bottom line: these features fall out from the 'simplicity' of the model.

That wasn't meant to be inflammatory. I'm just trying to keep my mind as wide open as possible, since I'm not sure what my own pre-conceived biases will cause me to miss.

My approach to not missing anything is to simply consume everything, Human Hidden Markov model style. I just pretty much cherrypick ideas, and have very, very, very few good ideas of my own. My rule of thumb is I have one good idea every 3 months, and that's it.

WikiWiki wouldn't work if

to put it in Ward Cunningham's words: what is "the simplest online programming system that could possibly work?"

WikiWiki wouldn't work if a subtle mispeling of a word in a linked page could break a program or redirect you to malicious code that spins up your HDD and leaks secrets or loads a virus.

WikiWiki really doesn't work when it comes to handling vandals, spammers, stubborn ignorami, and people with awful taste in jokes.

A programming system can't so easily ignore these problems.

WikiWiki wouldn't work if a

WikiWiki wouldn't work if a subtle mispeling of a word in a linked page could break a program or redirect you to malicious code that spins up your HDD and leaks secrets or loads a virus.

Sandboxing would mostly work: treat code like you would a Silverlight application and we'd probably be OK. Any code that wants to run outside of the sandbox... well, no, you can still do a lot in a sandbox.

WikiWiki really doesn't work when it comes to handling vandals, spammers, stubborn ignorami, and people with awful taste in jokes.

I'm confused; it seems like WikiWiki has worked fairly well in these areas, at least better than one would expect given its simple model. Wikipedia seems to work even better - a bit more editorial control but still fairly decentralized and successful (though it might be hitting its breaking point).

Counterexamples

The Los Angeles Times tried a wiki of their own, with hilarious results... L.A. Times shuts reader-editorial Web site

Nice, from the linked

Nice, from the linked material:

"Plenty of skeptics are predicting embarrassment; like an arthritic old lady who takes to the dance floor, they say, the Los Angeles Times is more likely to break a hip than to be hip. We acknowledge that possibility. Nevertheless, we proceed."

Anyways, security and vandalism are problems that definitely need solutions, but the solutions don't have to be complicated.

I think you just lack sys admin mindset (or the skillset)

What David and I are telling you are things a sys admin or network architect is paranoid about, but the average developer is not. Put simply, the average developer is far removed from the consequences of his actions, especially system exploits. By contrast, 'sys admins' literally carry in their job title responsibility for whole systems. Paranoia is healthy.

It might be best for you to read some basic posts on admin'ing, from the perspective of an admin. My friend Jeff has an excellent admin blog you could learn a lot from.

See: SaaS: Who needs release management?

Linux vs. Solaris packaging: it’s a philosophical thing

On revision control workflows

These posts help discuss things without telling you to go buy a book on ITIL, SCM or whatever. When you are talking about a project with 1,000 contributors, you're talking huge corporate sponsorships. You need good change control and testing practices. You can either design that into the system's workflow or not. Your choice. If you choose not to, somebody will roll his own on top of the solution. And then somebody from another company or project will roll his own, and so on.

Finally, vulnerability researchers are busy enough. They don't need people making their lives harder by producing code sharing models that make it easy to sneak in problems. Just look at the innocent problem in Debian recently with Valgrind complaining about SSL taking the address of an unused memory location.

Ah, all of this stuff was in

Ah, all of this stuff was in vogue 10 years ago. I understand it, but I don't think it is universally applicable.

Anyways, there are two mind sets:

There is the trust model where you try to ensure code doesn't do evil stuff so we can give it privileges. Of course, "evil" is human-defined, so most of the automatic methods focus on ensuring the underlying models are safe and sound. Perfectly safe code could still be evil by reading a file and transmitting it to the net. Replace code with person: we trust people by ensuring they aren't evil (run a criminal background check, make sure they have a reflection), then we give them privileges so they can get a job done (e.g., check code into a repository). We review people and their work to make sure they really aren't evil or don't become evil.

Most of the current world works on the above trust model.

The second model is to assume code and people can be evil and live with the consequences. Pragmatically, we can only tell if something is evil by its behavior and acts, so make sure these are observable (one of Ward's core wiki principles) as well as undoable. In the real world, we also have legal consequences for doing evil things, and that works pretty well. We automatically trust people to walk around and even carry knives or guns that can kill people (in the States), and it mostly works out (though sometimes it doesn't). This is much more scalable than, say, the DPRK's police state, where you need explicit trust to do almost anything outside of your normal routine.

I believe not much practical work has been done on examining the second model beyond what we have seen happen on wikis. We can look at the wiki as a fascinating experiment, and all the negative stuff that has occurred (e.g., spam bombs) are interesting events, especially when looking at the solutions (bots to detect and delete spam). Now, my only question is: is it worthwhile to perform the experiment on code?

is it worthwhile?

is it worthwhile to perform the experiment on code?

Sure.

But to forget the lessons from dealing with mere, unempowered text would be foolish... perhaps gravely so.

And to abandon the lessons taught by revision control systems would also be problematic. In a concurrent editing environment, individual programmers need checkins - especially of components interacting with their own code - to be relatively consistent, atomic, and stable.

So long as you handle the relevant issues, the experiment could continue.

Alternatively, one could keep it to a small scale, rather than making it an "open" wiki. Existing systems, such as Smalltalk and the Croquet Project, already achieve similar collaborative editing properties. The smaller scale is suitable for a teaching environment.

Alternatively, one could

Alternatively, one could keep it to a small scale, rather than making it an "open" wiki. Existing systems, such as Smalltalk and the Croquet Project, already achieve similar collaborative editing properties. The smaller scale is suitable for a teaching environment.

I don't think the problems are very difficult to solve if the scale is kept small (it's easy to trust people if you know where they live). On the other hand, large scale is needed so that the system can feed on itself: people come for content that they can find and contribute content that someone else will use. It's difficult to achieve that if the community is small.

Anyways, there are two mind

Anyways, there are two mind sets:

No, there are NOT two mind sets. -- You either have a feature that solves a problem, or you don't. Or you have a feature in search of a problem. Which is it? You have to consider your users.

Jim Salem, the architect of Quicken's QuickBase online database service, for example, designed an application packaging system so that clients did in fact have both SaaS and release management. IMHO, SaaS without release management isn't SaaS. It's simply a hack, pushed on companies because they don't have IT people who know better.

Businesses, like open source developers, should have the right to revoke a "changeset", where the changeset is the deployment of an "upgraded" service they don't want at this moment. Vendors, in turn, should be able to define SLAs with businesses for how long an old service stays up after it's been replaced by an upgrade. [1]

This has NOTHING to do with trust! This has everything to do with "Margin of Safety". David has already dealt with your trust rebuttal, so I won't really layer on top of his excellent feedback. He just missed this aspect. -- All the posts I linked have absolutely nothing to do with trust and everything to do with change control management best practices, so that you can define a margin of safety for business critical operations. (We would argue all code is untrusted initially, and we deny-by-default, then grant-and-forward authority; there is no such thing as revoking in o-caps, because we only care about authority, not permissions, so it is technically a trust problem, but only under scrutiny.)

A good SLA gives you a factor of 3 margin of safety from the time a change can reasonably be made until the change must be made.

Finally, I'm not sure what to do with statements like these:

Ah, all of this stuff was in vogue 10 years ago. I understand it, but I don't think it is universally applicable.

Ok... I'll agree. But... would you agree it is applicable to a project looking to scale to 1,000 contributors? Otherwise, what are you even spending the rest of the time arguing about? Instead of weird civil engineering analogies to DPRK (which are always strange to make in the context of software engineering, since SE and CE have virtually nothing in common), provide real user stories and then explain what your protection graph will look like to address them. Then argue that your protection graph is all the users really need. Once you make that argument, the conversation is done, because you'll know what you are building, how and why.

One last nit: object capability is sometimes referred to as "o-caps". Rather than just read the two papers David presents, I also recommend reading more of Mark Miller's stuff. I like Paradigm Lost, Paradigm Regained, and Paradigm Reconsidered. Fun fact: Mark worked on Project Xanadu for Ted Nelson prior to going to graduate school. Xanadu is/was supposed to be a rich hypermedia environment, not much unlike a wiki. Fun fact #2: Douglas Crockford, JavaScript evangelist at Yahoo, also worked on an o-caps Internet-scale collaborative platform in the '80s.

[1] Alex St. John, the father of DirectX, is now working on WildTangent, a company that sells "web drivers" for video games. If you've bought a Dell PC recently, it is installed on your Dell by default. The idea behind the web driver is exactly change control management (well, that, and revisiting Project Chrome ideas he started when working at Microsoft as a "DirectX over the Internet" initiative).

there is no such thing as

there is no such thing as revoking in o-caps, because we only care about authority, not permissions

This is one of those 'Capability Myths' covered in 'Capability Myths Demolished' and as a pattern: Revocable Capabilities.

There are cases where using revocable caps is sensible. For example, you might want to grant to a cell-phone application (perhaps some mapping software) the permission to read your GPS location, but only while you are actively viewing that application... i.e. you deny (revoke) the capability to access your location when the application window is hidden or closed.
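
A minimal sketch of that revocable-capability (caretaker) pattern, with hypothetical names: the application is handed a forwarder, not the GPS object itself, so the grantor can cut off access later.

    class Revoked(Exception):
        pass

    def make_revocable(target):
        """Return (proxy, revoke). The proxy forwards calls until revoke() is called."""
        cell = {"target": target}

        class Forwarder:
            def __getattr__(self, name):
                if cell["target"] is None:
                    raise Revoked("this capability has been revoked")
                return getattr(cell["target"], name)

        def revoke():
            cell["target"] = None

        return Forwarder(), revoke

    class Gps:
        def read_location(self):
            return (44.98, -93.27)

    gps_cap, revoke_gps = make_revocable(Gps())
    print(gps_cap.read_location())   # works while the mapping app is in view
    revoke_gps()                     # e.g. the application window is hidden
    try:
        gps_cap.read_location()
    except Revoked as e:
        print("access denied:", e)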

more of Mark Miller's stuff. I like Paradigm Lost, Paradigm Regained, and Paradigm Reconsidered

Mark marks Paradigm Regained as 'superseded by Robust Composition', which I had referenced earlier.

I especially like Nick Szabo's stuff on contracts and e-trade, which leans heavily on capabilities.

Robust Composition is a

Robust Composition is a longer read, and I generally read more than one paper on a subject when I read about something. Superseded by != obsoleted by. [By the same analogy, would you tell somebody to read Gilad Bracha's Jigsaw Ph.D. thesis without first reading his newer stuff with Newspeak? Jigsaw hasn't been a story here on LtU, btw...]

The key thing in my books is the Waterken protocol over ERTP.

I really like the sequence of ideas across the three Milton allusions. I only linked Regained because it had a Google TechTalk, and I like offering people the chance to watch the "computer science Discovery Channel" (Google TechTalks, Channel 9, Parleys, InfoQ, etc.). I am still surprised Robust Composition was never published as a monograph.

No, there are NOT two mind

No, there are NOT two mind sets. -- You either have a feature that solves a problem, or you don't. Or you have a feature in search of a problem. Which is it? You have to consider your users.

Sorry, I should have said "strategies for solving" problems, in that you can consider reactive and proactive strategies for dealing with security problems, depending on what your users need.

This has NOTHING to do with trust! This has everything to do with "Margin of Safety". David has already dealt with your trust rebuttal, so I won't really layer on top of his excellent feedback. He just missed this aspect. -- All the posts I linked have absolutely nothing to do with trust and everything to do with change control management best practices, so that you can define a margin of safety for business critical operations.

Perhaps you are arguing for the "one true solution"; I'm only interested in the implications of an aggressively open solution: can it serve users in some capacity, and how can it serve? So in some respects, we are probably just talking past each other.

Instead of weird civil engineering analogies to DPRK...

Again, sorry, I was using an acronym that might not be so widely used in the West: DPRK = Democratic People's Republic of Korea, aka North Korea, which is an extreme control state. The opposite of that might be Somalia, which lacks a central government right now (law is instead an emergent community concern, or maybe someone's gun). This is just a bad analogy to convey the difference between proactive and reactive security.

provide real user stories and then explain what your protection graph will look like to address them. Then argue that your protection graph is all the users really need. Once you make that argument, the conversation is done, because you'll know what you are building, how and why.

That is engineering, while what I'm referring to is more research. The point is not to evaluate existing real user stories, but to think about what user stories could be if X were true. The problem is "massive collaborative programming is too hard", and then we ask "what if we had an open access code repository?" Of course this is counter-intuitive to the notion of safety and security, but it is not a complete wash for all possible use cases.

Good conversation, though. It seems like this conversation has gotten bogged down in the obvious security/safety problems, while there are lots of other challenges to consider, such as how we find things.

[comment mostly

[comment mostly deleted]

Looking at the WebKit environment (Apple, Google, open source, mobile + embedded systems, academic) and the social structures built around it is interesting (including the ad-hoc day-to-day parts and the automated but single-entity ones). While ocaps etc. discussed in this thread are great and might even be necessary in the long run, I think there are higher-level structural problems: automating and reifying the code review structure, sharing quality controls between groups of reviewers, new models for escalating 0-day rollbacks and feature additions, statistics tracking, the role and sharing of automatic tests/patchers/scrubbers/alerts, etc. We're seeing an increasing amount of this stuff within one organization and/or academia, but nothing scalable across boundaries.

The above is for a few large groups which can therefore invest a lot in review. Small groups can't (the lead might even disappear!) and must therefore rely on another group.

Finally.. relating to LtU: again, something like ocaps seems premature. Yes, it's a nice hammer, but focusing on language primitives and models seems premature as there are social and structural questions for scaling. A systems hat, rather than a PL one, seems more appropriate. Once we kind of know what we want and the basic structure for how to do it, then asking PL questions seems more appropriate.

Systems and Languages

focusing on language primitives and models seems premature as there are social and structural questions for scaling. A systems hat, rather than a PL one, seems more appropriate

I do not believe that 'systems' vs. 'languages' is a beneficial distinction; the two blur considerably the moment you start considering persistence and distribution features.

I do agree that social and structural considerations must be considered, but I don't believe they are a separable concern. The language and its runtime have a significant influence on which sorts of social and structural issues will exist and need tending.

My own conclusions were reached starting with consideration for how a high performance programming language system could reach Internet scale with decentralized programming, integration, and maintenance - critically, across administrative and trust boundaries.

Among these conclusions is that we need to diminish the relevance of code reviews. If that is achieved, then mistakes, exploratory programming, and even malign intent can be tolerated, which opens programming to a wider audience with far less vetting.

And, to diminish the relevance of code reviews, I believe it important to change how services are distributed and shared.

Currently, services are often distributed as libraries or applications. This requires giving code a lot of power to 'implement' the service atop that power. Even for non-local services, one must give a lot of local power (such as arbitrary TCP/IP access) in order to 'integrate' a non-local service. As a result, one must review code in order to vet it, ensure it doesn't make mistakes or do malign things with that power. This doesn't scale past shallow dependencies, and also doesn't effectively handle independent upgrade to these 'services'. One might call this the 'libraries as services' approach (or fallacy).

A promising alternative: services aren't implemented as libraries; instead, services are essentially implemented in distinct code-bases, but can communicate between one another using a common system of values and names. Developers (via their IDEs) have bookmarks or a small database of external, opaque services, and the potential to browse public and private registries for access to more. Power largely flows from these external services, rather than to them. For a language with distribution features, there is no need to grant excessive 'local' power to implement the integration with a non-local service. Libraries still combine services to produce new ones, but tend to do so in a shallow manner and use fine-grained and well-defined capabilities, thus keeping security-reviews far more localized. Instead of using a generic 'main' function, 'main' is parameterized based on the services and configuration variables it needs. A few of these 'link-time' type-safety checks are performed to ensure the services will integrate as expected. Services may be upgraded or extended during runtime, and multiple versions of any given service may exist concurrently.
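
A small sketch of the parameterized 'main' idea, with hypothetical service interfaces: the environment, not the code page, decides which concrete services to bind, and could rebind or upgrade them without touching this page.

    from typing import Protocol

    class Storage(Protocol):
        def put(self, key: str, value: str) -> None: ...
        def get(self, key: str) -> str: ...

    class Mailer(Protocol):
        def send(self, to: str, body: str) -> None: ...

    def main(storage: Storage, mailer: Mailer, admin_address: str) -> None:
        """All power flows in through the parameters; nothing is reached ambiently."""
        storage.put("last-run", "ok")
        mailer.send(admin_address, "service ran: " + storage.get("last-run"))

    # Stand-ins the environment might bind at "link time" after checking
    # that they satisfy the declared interfaces.
    class InMemoryStorage:
        def __init__(self):
            self.data = {}
        def put(self, key, value):
            self.data[key] = value
        def get(self, key):
            return self.data[key]

    class PrintMailer:
        def send(self, to, body):
            print("to " + to + ": " + body)

    main(InMemoryStorage(), PrintMailer(), "admin@example.org")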

But 'libraries as services' has a major advantage over the above regarding performance and disruption tolerance.

In order to achieve performance and disruption tolerance in the alternative architecture, the host communication protocol is also capable of distributing objects (via replication, construction, or migration). This results in parts of these initially 'external' services being distributed to the client, and vice versa. (Ideally, a procedure might even integrate two independent services yet leave no persistent trace on the IDE's local host.) After said distribution and a minimal safety analysis, the resulting object graphs may be compiled together - inlined, optimized. An upgraded service might replace certain elements, forcing a recompile in any replicated components.

The reason I favor ocaps as a "nice hammer" is to support these sorts of desirable social and structural changes to service distribution and service integration. That is, I didn't start by looking at ocaps and think "oh, that's neat!"; instead, I discovered what Mark Miller had written on ocaps while searching for a solution to the scaling problem.

Actually, ocaps (along with the associated capability and rights amplification patterns and related features) only solve about half the security issues... You might say there are two basic cases: (a) untrusted services running on a machine you administer, (b) trusted services running on a machine that you do not administer. In both cases, you are concerned about leaking authority and sensitive information. Ocaps solve (a) by guaranteeing that untrusted services running locally gain no more authority or data than they would if running remotely across an ideal network. A solution for (b) in the face of aggressive automated distribution requires something more akin to dataflow-secrecy analysis to help split the code.

Anyhow, attempting to distinguish PL and Systems issues is, I think, a big mistake. The two blur together far too easily, and have a great deal of influence upon one another. Consider the language to be the system or at least to be a major component of it, as did the developers of Smalltalk and Lisp and Oberon and Limbo. I feel that is much wiser.

I do not believe that

I do not believe that 'systems' vs. 'languages' is a beneficial distinction; the two blur considerably the moment you start considering persistence and distribution features.

Systems and PL have always been intertwined, especially in research. I got my start in a systems group and have seen many papers published in systems conferences like OSDI. Smalltalk was always a good example of a language that blurred the barrier, but a lot of systems research was also done in Java (e.g., Java processes).

...

Among these conclusions is that we need to diminish the relevance of code reviews. If that is achieved, then mistakes, exploratory programming, and even malign intent can be tolerated, which opens programming to a wider audience with far less vetting.

We are definitely in agreement here, but actually, I have no evidence because I don't know of any experiments previously performed. We need to perform experiments.

The reason I favor ocaps as a "nice hammer" is to support these sorts of desirable social and structural changes to service distribution and service integration. That is, I didn't start by looking at ocaps and think "oh, that's neat!"; instead, I discovered what Mark Miller had written on ocaps while searching for a solution to the scaling problem.

But the experiment must be done, and ocaps must be compared to other hammers...actually, there are many hammers, secure information flow, malicious code detection and counters (rollback), sandboxing, limited scope...ocaps seems interesting, I still need to learn more about it.

Experiment? How?

I disagree with your emphasis on the relevance of 'experiment' (as opposed to principled or iterative 'design') in systems research, especially with regards to elements that interact with humans.

For experiments to be valuable in learning how features influence behavior, they must be performed in controlled circumstances. One must control for unknown variables, and limit the number of variables being tweaked at once. In the case of systems research, that means controlling for human factors, especially in terms of prior experience and resulting biases. And doing so is impractical.

I do agree with the idea of iterative design - prototyping a concept, fielding it, getting feedback, making updates, etc. But that is not scientific. If a design succeeds, you cannot isolate the factors that caused success. If a design fails, you cannot isolate the factors that caused failure. Even distinguishing between issues of marketing, product, documentation, and distribution process can be difficult.

Indeed, if you point out something like CapDesk as an "experiment" demonstrating value of ocaps, others will be very quick to bellow that this demonstrates little about 'ocaps' but rather about 'design' - that the same security features could be achieved with more traditional approaches, given (hand-waving) design effort. Other people might call it a failure because CapDesk hasn't proven it can implement every Windows app. Yet others would call it a failure because it isn't popular.

Confirmation bias is too easy a trap. What can experiments prove to a skeptic who is prepared to wield the "There Is Nothing Perl Cannot Do!" Turing equivalence argument? I'll admit that experiments are useful as marketing gimmicks. You can also use them as existential proofs (that you can achieve XYZ under so and so limitations).

I'll stick to logic, math, speculation, and design.

Speculation includes principles and priorities based upon existing observations. We do not need to create specific experiments to speculate and refine concepts; we need only to observe what is around us (one might call this a 'natural experiment'), and write up plenty of use cases and user stories. If you're willing to accept 'natural experiments' then ocaps already have plenty out there.

RE: many hammers

there are many hammers: secure information flow, malicious code detection and counters (rollback), sandboxing, limited scope

Secure information flow is orthogonal to ocaps - indeed, I mentioned it above as relevant to limiting any aggressive automated distribution policies, to ensure you aren't accidentally flowing sensitive data or capabilities through hosts under an untrusted administration.

Rollback of certain resources - such as code repositories - is, of course, an expected feature. But it doesn't offer security: at best, it mitigates insecurity, and even then does so only for human administrated persistent resources. It won't help anyplace you'd want 'secure information flow'.

I aim to widely support transactions, but the only 'security' properties transactions offer are that they aren't locks. Locks are really bad for security: a malicious bit of code can lock a resource and 'forget' to let go, or simply deadlock, thus providing a denial-of-service attack. Unlike locks, transactions can be safely aborted to make way for a high-priority event.

I would consider the use of sandboxes for security to be a "mostly failed" experiment. Security is defined in part by a liveness principle (you can do what you're authorized to do) and in part by a safety principle (you cannot do that which you're not authorized to do). Sandboxes are inherently an attempt to balance the two, and always get it 'wrong' (because security isn't about 'balance'). Some attempts are simply more wrong than others. Even if you get a reasonable approximation to correctness (which is extremely tricky for network resources), sandboxes serve as a poor basis for scalable development and composition of services. You should look into the history of JavaScript for some lessons learned by fire...

Malicious code detection isn't realistic. Code is too easy to obfuscate, and a programmer with malicious intent is likely to make the extra effort to do so. Even if code were not obfuscated, it isn't as though 'malice' is something that can be defined in computing terms. And even if aspects of malice are defined in computing terms, Rice's Theorem suggests that their detection generalizes to the halting problem.

Instead of 'malicious code detection', focus on static reasoning about specific, useful properties of the software system. Static safety and secure data-flow analysis can prevent many sorts of potential forms of error (and, accordingly, any form of malice that might have leveraged that error). Effect typing makes it easy to eliminate problems from certain regions of code. Patterns for secure, robust composition can reduce the power of a malicious bit of software to the point a rare attempt at abuse is negligible.

Speculatively, once you start including economics, contracts, and e-trade into the common services and libraries, even resource 'abuse' can be reinterpreted as resource 'use'. I'd really like to see how people take advantage of such features. But, as noted earlier, I don't believe this is someplace one could create controlled "experiments"... any attempt to do so will run away, quickly escaping control of its creator, and will further taint most future attempts at 'experiments'.

But the experiment must be

But the experiment must be done, and ocaps must be compared to other hammers...actually, there are many hammers, secure information flow, malicious code detection and counters (rollback), sandboxing, limited scope...ocaps seems interesting, I still need to learn more about it.

There have been some experiments actually. See the DarpaBrowser audit, and Ka-Ping Yee's report on secure interaction design. There's some newer stuff too, for the Joe-E and Waterken systems, but I'd have to dig them up.

Still, ocap languages have properties which are not achievable with other systems, like sandboxing. Since you're obviously familiar with the lambda calculus, ocaps is basically the pure lambda calculus augmented with mutable references, but no global side-effecting functions (including no global mutable static state); anything bound at top-level must be transitively immutable and side-effect free.

Any side-effecting operations can thus only be obtained by parameter passing. See also How Emily Tamed the Caml, for Marc Stiegler's earlier effort to define a capability-secure subset of OCaml.
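
To illustrate the "effects only via parameter passing" discipline in miniature (Python rather than E or Emily, and with made-up file names): the consuming function receives a narrow read capability instead of reaching for ambient authority like open().

    def make_reader(path):
        """A capability: closes over one file and grants read access to it, nothing else."""
        def read():
            with open(path, "r") as f:
                return f.read()
        return read

    def word_count(read_cap):
        # This code can only read what it was handed; it cannot name other
        # files, open sockets, or stash data in global mutable state.
        return len(read_cap().split())

    with open("notes.txt", "w") as f:          # set up a file for the example
        f.write("three little words")

    notes_reader = make_reader("notes.txt")    # authority granted by the caller
    print(word_count(notes_reader))            # -> 3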

Thank You

I don't think anybody understands what a big deal this is, especially for practitioners.

Barbara Liskov in her Turing Award acceptance speech at OOPSLA basically called PLT a dead topic in computer science, and said harnessing the Internet as one big computer is the big topic in computing now. I half agree. I cannot envision any solution to "one big fat CPU in the sky" without PLT. PLT is the best tool we have for developing formal methods to analyze, design and implement such solutions.

Also interesting to me is the beneficial way those two topics and HCI can intersect. For example, if you do not have permission to use a remote service but have a permission to use a local service, then a user interface should not present you with two radio buttons, one for each, and especially not select the remote service by default. This should be automatically reconfigurable. Once you understand how a system works by guaranteeing invariants, and buy into the argument that any modern GUI is a dynamically distributed, dynamically federated application, you begin to see the question "How would you design a code wiki or wiki IDE?" differently.

To quote Leo Meyerovich, these ideas seem to be "tantalizingly close" to reality.

The reason I favor ocaps as

The reason I favor ocaps as a "nice hammer" is to support these sorts of desirable social and structural changes to service distribution and service integration. That is, I didn't start by looking at ocaps and think "oh, that's neat!"; instead, I discovered what Mark Miller had written on ocaps while searching for a solution to the scaling problem.

Me too.

A solution for (b) in the face of aggressive automated distribution requires something more akin to dataflow-secrecy analysis to help split the code.

This requires something to the effect of Master Metadata Management. Dataflow secrecy should therefore also be used to ensure -- and explain -- partial views of data. For example, the State of Minnesota requires that you not reveal an employee's birthdate, but you can reveal their birth day (i.e., to celebrate it with a company social event). This requires exposing a partial view of the month and day, but not the year.
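
A tiny sketch of that partial view, with made-up field names:

    from datetime import date

    def birthday_view(record):
        """Attenuated view of an employee record: month and day only, never the year."""
        birthdate = record["birthdate"]
        return {"name": record["name"],
                "birthday": (birthdate.month, birthdate.day)}

    employee = {"name": "Pat", "birthdate": date(1975, 6, 14)}
    print(birthday_view(employee))   # {'name': 'Pat', 'birthday': (6, 14)}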

Another idea

WikiWiki wouldn't work if a subtle mispeling of a word in a linked page could break a program or redirect you to malicious code that spins up your HDD and leaks secrets or loads a virus.

Here is an alternative to sandboxing. When someone goes and vandalizes a wiki, wiki community administration can quickly revert their changes, ban their IP addresses, and perhaps restrict access during the attack (no anonymous or new user contributions, captchas, and so on).

Perhaps we can treat the machine that code executes on as a wiki also. Delete some files? Ok, we'll revert them. Leak some secrets? Well, the code can only access the machine wiki, and what is public there. A "Recent changes" log exists for the machine, and we can see what permanent changes wiki code makes, revert the changes if necessary, and...then go and punish the code responsible by editing it.
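
A toy sketch of that "machine as wiki" idea, using an in-memory stand-in for machine state: every change is journaled to a recent-changes log and can be reverted.

    class WikiMachine:
        def __init__(self):
            self.files = {}
            self.changes = []   # the machine's "Recent changes" page

        def write(self, author, path, new_content):
            self.changes.append({"author": author, "path": path,
                                 "old": self.files.get(path)})
            self.files[path] = new_content

        def recent_changes(self):
            return list(reversed(self.changes))

        def revert(self, change):
            """Undo a logged change: restore the prior content, or delete if it was new."""
            if change["old"] is None:
                self.files.pop(change["path"], None)
            else:
                self.files[change["path"]] = change["old"]

    m = WikiMachine()
    m.write("friendly-bot", "/etc/motd", "welcome")
    m.write("vandal", "/etc/motd", "pwned")
    m.revert(m.recent_changes()[0])   # the community spots the vandalism and reverts
    print(m.files["/etc/motd"])       # back to "welcome"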

When someone goes and

When someone goes and vandalizes a wiki, wiki community administration can quickly revert their changes, ban their IP addresses, [...]

Layering violations like this are generally a bad idea. I had a lesson in this recently when my dynamic IP changed and I suddenly had all sorts of OpenDNS filtering policies placed on my connection.

Even as a temporary measure during an "attack", it will only deter the non-technically inclined. Changing my IP is as simple as loading up my router's web page and clicking a button (or resetting my modem).

Sandboxes

Sandboxes don't work all that well. Used alone, either code is trapped in a sandbox unable to do anything interesting (such as interact with databases), or the code will have enough ambient authority that any malicious element can cause trouble (such as siphon sensitive information from a database into some vandal's private mirror).

Besides, sandboxes aren't especially simple in concept or implementation.

Sandboxes require that you anticipate and decide, effectively at the time of language development, which ambient powers the code shall be granted. For example, if one is to build a GUI from a sandbox, the GUI must be built to properly interact with sandboxed code without risk of leaking authority.

A capability design allows one to introduce a great deal more power to the code because it also provides better control of how that power is distributed to modules. This is better for security, which is defined in part by a liveness principle (a service is not 'secure' if you cannot access it when you are authorized to access it). Capability languages are quite simple in concept and implementation, and also perform well under composition, concurrency, distribution, and automated testing... but you'd need to reject most off-the-shelf languages that are popular today, or at least impose some feature restrictions on them.
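
As a small illustration of that "better control" (a hedged Haskell sketch; DbCap and ReadCap are invented for this example, not any real database API): rather than sandboxing code that holds ambient database authority, you hand it an attenuated capability that can only read one table.

  type Table = String
  type Row   = [String]

  -- A full capability: query and update any table.
  data DbCap = DbCap
    { query  :: Table -> IO [Row]
    , update :: Table -> Row -> IO ()
    }

  -- An attenuated capability: read-only access to a single table.
  newtype ReadCap = ReadCap { readRows :: IO [Row] }

  attenuate :: DbCap -> Table -> ReadCap
  attenuate db t = ReadCap (query db t)

  -- Untrusted code is written against the narrow capability; it has no way
  -- to name other tables, let alone write to them or siphon the rest.
  untrustedReport :: ReadCap -> IO Int
  untrustedReport cap = length <$> readRows cap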

WikiWiki has worked fairly well in these areas

WikiWiki, with support from a few bots provided by wikizens, is batting about two in four when it comes to spammers, vandals, stubborn ignorami, and people with an awful taste in jokes. Only the most obvious and localized problems can be easily cleaned up by community oversight. I didn't even mention sophistry, but WikiWiki handles that quite poorly, too.

but you'd need to reject

but you'd need to reject most off-the-shelf languages that are popular today, or at least impose some feature restrictions on them.

Any links? There is an article on object-based capabilities and capability security, which seem to be completely different things. I'm all for building new languages. More specifically, can capability be separated from code trust if the language is sufficiently restricted? In this case, code is acting as an untrusted servant of the user, and the user must explicitly grant it privileges to act on their behalf, after the user knows what the code wants to do (no, you can't send my bank password to a 419 scam site). That could be annoying, though (as in Vista).

WikiWiki, with support from a few bots provided by wikizens, is batting about two in four when it comes to spammers, vandals, stubborn ignorami, and people with an awful taste in jokes

So you are of the opinion that the wiki model is a failure and it's not worth generalizing? I guess you could be right, but it just doesn't seem like that to me right now (given how much I use Wikipedia).

Capability Links

Links:

www.erights.org is a good portal, but doesn't make it clear where to go next.

I would suggest reading a bit from the E wiki on secure, distributed computing. The same principles apply when dealing with local untrusted code. Reading up on 'capability patterns' may let you see how caps are managed in code.

You might read Ode to Capabilities, and Capability Myths Demolished.

If you're feeling up to it, Mark Miller's paper is also good... but it's a tad on the long side.

There is also a mailing list archive. You can peruse it, but it isn't well organized.

can capability be separated from code trust if the language is sufficiently restricted?

One can run untrusted code in a capability system, and capability is often fully separated from trust.

Usefully, one can make this promise: untrusted code running locally cannot gain any more authority than it would gain if running remotely across an ideal network. This means the only difference the location of execution makes is in performance and disruption tolerance, not authority.

no, you can't send my bank password to a 419 scam site

In practice, individual programmers decide how much authority to forward to untrusted code, whether that be all of it or none of it. This works because, in a capability system, a developer can forward only the authority they were already given, and do so with the expectation that whatever authority they grant will be used and potentially abused. Design patterns aimed at secure composition, revocation, rights amplification, and so on will tame any actual abuse.
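
For one concrete flavor of those patterns, here is a minimal Haskell sketch of a revocable forwarder (the classic "caretaker" pattern), under the simplifying assumption that a capability is modelled as a plain function:

  import Data.IORef

  type Cap a = a -> IO ()     -- a capability modelled as a function you may invoke

  -- Wrap a capability in a forwarder; the grantor keeps the revoke action.
  makeRevocable :: Cap a -> IO (Cap a, IO ())
  makeRevocable target = do
    enabled <- newIORef True
    let forwarder x = do
          ok <- readIORef enabled
          if ok then target x else ioError (userError "capability revoked")
        revoke = writeIORef enabled False
    pure (forwarder, revoke)

The untrusted module only ever sees the forwarder; once the grantor runs revoke, the forwarded authority is gone.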

It is possible to write a system that can query, asking for a new capability, and possible to forward this query all the way to the user.

But this is unnecessary. You might learn how CapDesk achieves functionality without this annoying behavior or vulnerability, and that is one approach among many.

you are of the opinion that the wiki model is a failure and its not worth generalizing?

Oh, dear no! As noted earlier, I am actively pursuing wiki as a basis for a development environment (see WikiIde). I am only saying we need to recognize the problems and handle them with greater attention than is required for unempowered text.

My first post in this topic discusses the use of vested online identities, building trust through having your edits signed by people already trusted, and via reviewing and signing edits and having people add your identity to their trust lists. This would largely prevent untrusted code from becoming empowered, which would greatly resist vandalism and spam. In a truly open system, like Git, a public key would serve well as an online identity (this would not require passwords or a trusted base).
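
A purely structural Haskell sketch of that idea (no real cryptography here; PublicKey and the signatures are placeholders): an edit is signed by its author and countersigned by reviewers, and only becomes empowered once someone on your trust list has signed it.

  newtype PublicKey = PublicKey String deriving (Eq, Show)

  data Edit = Edit
    { editAuthor  :: PublicKey
    , editPayload :: String
    , reviewedBy  :: [PublicKey]   -- reviewers who have signed this edit
    }

  type TrustList = [PublicKey]

  -- An edit is empowered only if its author, or someone who reviewed and
  -- signed it, is on the reader's trust list.
  empowered :: TrustList -> Edit -> Bool
  empowered trusted e =
    editAuthor e `elem` trusted || any (`elem` trusted) (reviewedBy e)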

There are other ways to vest online identities, of course, such as associating one with a real name subject to legal prosecution (this would require a mutually trusted third party to certify the association between an online ID and the real person; for anonymity, one could hide the real ID with the certification authority for anything short of a court order), or associating the online identity with a collateral insurance fund that will be forfeit in the case of misbehavior. All of these approaches could be used, but I favor the first because I'm cheap and favor approaches that don't hinder newcomers and open source development from third-world nations.

It is possible to write a

It is possible to write a system that can query, asking for a new capability, and possible to forward this query all the way to the user.

But this is unnecessary. You might learn how CapDesk achieves functionality without this annoying behavior or vulnerability, and that is one approach among many.

The principle-of-least-privilege approach seems too inflexible. What if I want to grant a program access to the network to, say, upload some statistics, but not to transmit my personal information? There is a difference between reading my personal information, uploading stuff to the network, and uploading my personal information to the network (ya, I know, information flow work is appropriate here).

My first post in this topic discusses the use of vested online identities, building trust through having your edits signed by people already trusted, and via reviewing and signing edits and having people add your identity to their trust lists.

You seem to prefer proactive rather than reactive solutions for security. I think anything that depends on just establishing trust, vesting, and so on is fallible: think sleeper agents, or programmers who earn trust and later go crazy. A reactive strategy accepts that code can be unsafe no matter what we do before the code executes, and minimizes the effects of unsafe code: being able to track persistent changes and revert them; processes to detect DoS attacks and limit memory and processor consumption; as well as explicit user confirmation for things that can't be undone.

Or maybe we should consider a mix: cheap proactive trust building (e.g., karma accumulation) and verification coupled with reactive measures so that proactive trust building can remain cheap.

What if I want to grant a

What if I want to grant a program access to the network to, say, upload some statistics, but not to transmit my personal information?

Grant the untrusted program only the capabilities for reading the statistics and writing to the desired network destination. Don't bother giving it access to personal information.

One thing to understand is that capabilities are fine-grained to whatever arbitrary degree you wish to make them.

You don't give a program a filename plus access to a capability that can turn filenames into capabilities. You don't do this because it would be equivalent to giving the program access to the whole filesystem. Instead, you give a capability to just the target file or directory. If you need to give programs the ability to raise a file selection dialog, you give the program a capability for raising a file selection dialog, which returns Maybe Filecap.

Similarly, you don't give programs a URI plus access to a capability to turn URIs into capabilities. That would be equivalent to giving them access to the whole Internet (including local domains). Instead, you give a capability to just a specific network service; the URI need not ever be transparent to untrusted code.
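
The file-selection-dialog case might look roughly like this (a hedged Haskell sketch; FileCap and FileDialogCap are stand-ins, not a real GUI API):

  data FileCap = FileCap
    { readFileCap  :: IO String
    , writeFileCap :: String -> IO ()
    }

  -- The only file-related authority handed to untrusted code: it may ask
  -- the *user* to pick a file, and receives a capability to that file alone.
  type FileDialogCap = IO (Maybe FileCap)

  -- Untrusted code cannot reach the rest of the filesystem; there is no
  -- "filename -> capability" converter in scope.
  appendSignature :: FileDialogCap -> String -> IO ()
  appendSignature askUser sig = do
    picked <- askUser
    case picked of
      Nothing  -> pure ()                  -- the user declined
      Just cap -> do
        txt <- readFileCap cap
        writeFileCap cap (txt ++ "\n" ++ sig)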

I think anything that depends on just establishing trust, vesting, and so on is fallible: think sleeper agents or programmers that earn trust and later go crazy.

Vested identities aren't about trust, but rather about raising the cost barrier and risk for 'going crazy'. Trust is simply one loss option; other possibilities include legal prosecution or forfeiture of collateral. The purpose here is to weaken bot attacks to the point they are nearly harmless, and to provide disincentives for retaliatory programming and other sorts of human nastiness.

If someone goes crazy even after establishing trust, they have something to lose. Of course, they could still do it, and accept the loss. Sleeper agents are still possible, as you say.

But any security solution is multi-layered. If I was depending on trust alone, I would not be promoting capability language features.

A reactive strategy accepts that code can be unsafe no matter what we do before the code executes

I have a problem accepting that when it isn't true.

Rather, code can be untrusted, even if it comes from a trusted source (like yourself).

being able to track persistent changes and being able to revert those; processes to detect DOS attacks and limit memory and processor consumption; as well as explicit user confirmation things that can't be undone

I suspect you'll bump into problems regarding scope of 'undo', especially in the face of concurrency. Transactions may be a more natural fit.

Control of memory and processor resource policies is something I've considered in the past, and there are a lot of non-trivial details (e.g., if you receive an unusually large message, how much of the cost of processing that message should be paid by you vs. by the sender?).

One point to make is that if you limit the language too much - if users cannot safely do interesting things with the authorities they possess - then the project is at best a quaint toy and likely will fail. Sandboxes are a place for children to play.

Maybe explore "convergence" schemes?

Ok, maybe I'm going to say something totally "stoopid" because it's totally off the point, but:

Unless I misunderstood, you mentioned several times your special interest in something that emphasizes very collaborative features, with somewhat "discrete" checkpoints on the code that gets produced (reviewed quality, builds, etc.). That is, versus something more structurally controlled/organized across the whole code base and its authoring population, which would qualify as, say, "more continuously constrained" (?)

Then, if just one huge, sort of "anonymous" and free-wheeling code base is the end result you want, why not investigate the kinds of features (imaginable, or already existing somewhere today) that would support maximal user-friendliness and safety for authoring what we usually call POCs, as the main kind of artifact? Then let the "entropy" of the "code soup" converge: POCs merged into each other, with an unavoidable de facto "crave" for good code-reuse qualities, unit-testability, etc. I guess that's also compatible with many pointers/suggestions in other responses in this thread(?) -- I mean, regarding good qualification criteria for the language's features, the layers of such an authoring platform, etc. Well, just a wild and raw idea.

As I see it, that's quite a significant part of our day-to-day experience (even when we assume the "Big Design Up Front" approach, we often proceed bottom-up, too, when it's time to implement) -- or, be it just in our "bedroom loneliness" (j/k) of surfing the web here and there, repeating those tasks (getting familiar with fine-grained POCs and algorithms, then reusing them with more context/constraints, bottom-up, etc.).

Otherwise, I can't really contribute more right now than just throwing that idea in the air, as I believe Z-Bo and Dmbarbour have already covered quite a number of satellite aspects (some I didn't even have an idea about, by the way) that you might need to consider. I mean, prudently, especially if there's a chance the problem is actually tougher than you first thought for the purpose you have in mind.

'Hope it helps!

Hmm... this idea might come in handy...

This might work well with the following scheme:

  1. Each version of a source file is immutable once created! Newer versions are like branches/deviations of the older versions. The version number is based on a flattening of the version branching tree.
  2. When one source file references another source file, it includes the version it's referencing, which becomes a part of the unique identifier of a particular source file version!

This provides the following benefits:

  • Backwards compatibility with graceful upgrading ability
  • Synchronization for source files
  • Tracking of derivative works
  • Since version migration is manual instead of automatic, there is no chance of malicious code insertion without giving the programmer a chance to inspect the new version
  • the GUARANTEE that versions they point to will NOT be changed in the future!

Yay for immutability! Yay for uniqueness typing!
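
A rough Haskell sketch of point 2 (SourceVersion is made up, and Data.Hashable stands in for a real cryptographic hash): a version's identifier is derived from its text together with the pinned identifiers of everything it references, so a reference can never silently change underneath you.

  import Data.Hashable (hash)

  newtype VersionId = VersionId Int deriving (Eq, Show)

  data SourceVersion = SourceVersion
    { sourceText   :: String        -- immutable once created
    , dependencies :: [VersionId]   -- pinned versions of referenced files
    }

  versionId :: SourceVersion -> VersionId
  versionId v =
    VersionId (hash (sourceText v, [ i | VersionId i <- dependencies v ]))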

Formalized mathematics

This kind of scheme has been proposed for formalization of mathematical proofs, including proofs in constructive mathematics. Of course, the latter may be regarded as programs in Total Functional Programming or similar sub-Turing models of computation, depending on the exact flavor of constructivism. See Cam Freer's website vdash.org for an overview of these efforts and [1], [2] for the MathWiki and ProofWeb projects at RU Nijmegen.

ETA: Here is a paper on Cooperative repositories for formal proofs which also discusses versioning issues. The parent post's statically linked approach is good for storing older or 'permlinked' revisions, so that they don't become immediately obsolete. However, the best approach is to dynamically link the 'current' revision of each file to the latest revision of its dependencies. A file which is made invalid by a change in its dependencies can be statically linked to the appropriate version and marked for manual update.

Well, ok...

Well, ok, that kinda made me smile. "Mouarf" :)

Vesta

2. When one source file references another source file, it includes the version it's referencing, which becomes a part of the unique identifier of a particular source file version!

Stamping files with dependencies is nothing new.

I also think it's a mistake to assume one source file is hardwired to one set of dependencies. That's rarely the case, which is why many languages have preprocessor systems that let you instrument the compiler with directives like #if DEBUG ... #else ... But the more interesting case is Gilad Bracha's pluggable types.

The scheme you propose is basically Vesta, which is about two decades old, except Vesta is more thought through. For example, the goal in Vesta, from a build perspective, is to ensure all dependencies are resolved. If you actually look at how Microsoft installs their own software these days, like SQL Server 2008, it's all based on the SNAP model, where you can do arbitrary checks before and after installation (most MS installers effectively include a rules engine). This is not far from Vesta, either, except it is poorly integrated and doesn't prove anything.

I also think it's a mistake

I also think it's a mistake to assume one source file is hardwired to one set of dependencies.

What do you mean by "one set of dependencies"? Do you mean one single version for each dependency file? I'm not quite following you.

This system is essentially code-agnostic, except for the version directives it must parse.

Some files can have more

Some files can have more than one set of dependencies, due to preprocessor directives controlled by the compiler.

What controls dependencies is an 'xref' (cross-reference) between files. It can still be automatically derived...

but being code-agnostic is worthless; you want to know if a version (in your sense) can produce a valid build. To do that, you need all dependencies resolved.

Edit: By the way, I do like your general idea; I just think you are off in a couple of ways, perhaps due to misunderstanding dependency boundaries in projects.

Better yet, the version

Better yet, the version number of the file is the file's hash. But this is really just a DVCS model, so I think a prior comment was right that this is the right place to start.

I also once had the idea of

I also once had the idea of having a wiki-like web environment for people to add/edit code.
I had a little working prototype:

   A web site where
      - you could enter and modify lisp-like expressions (in plain text, like a wiki)
      - a user could add a text entry (in plain text) and edit his own entries
      - a user could see all the entries from other users

   A backend server that could
      - detect well-formed "concept definitions" (similar to records or class definitions)
      - detect well-formed functions
      - detect well-formed functions marked as "test"

The server would save the concepts and functions and would run the tests.
Tests would show green/red on the webpage. So user A could write code (by adding entries)
using functions from user B, but his tests would fail once B changed all his code.

Some problems I encountered:

   - You need a way to safely execute user code (no IO, no endless CPU time). I did this
     by writing a little lisp interpreter myself (but there are better ways; I would do
     it differently now).

   - You need to make sure your environment can deal with errors and inconsistencies
     (for example: two users give different definitions of a concept, and only one can be
     added to the test server; or syntax errors in functions, a function that always crashes, ...).
     Ideally you would build a server that reasons about the code and lets the users know what
     is wrong or inconsistent about their entries.

   - You need to allow inconsistencies, and you need to be able to run half-working programs.
     A dynamic language that allows easy metaprogramming makes this a lot easier than trying
     it in Haskell.

   - Getting people to participate: I showed it to tens of people, but nobody was really
     interested. Most comments I got were "it doesn't really do anything", and only 10 or so
     things got added.

Feel free to learn <ul> and

Feel free to learn <ul> and <ol> and fix the above for people with small screens...

A new related discussion on the Erlang mailing list

By Joe Armstrong: Why do we need modules at all?

Teaser:

Why do we need modules at all?

This is a brain-dump-stream-of-consciousness-thing. I've been
thinking about this for a while.

I'm proposing a slightly different way of programming here
The basic idea is

- do away with modules
- all functions have unique distinct names
- all functions have (lots of) meta data
- all functions go into a global (searchable) Key-value database
- we need letrec
- contribution to open source can be as simple as
contributing a single function
- there are no "open source projects" - only "the open source
Key-Value database of all functions"
- Content is peer reviewed

If Joe came up with

If Joe came up with something similar, at least I know it's not that crazy of an idea. BTW, there is a corresponding Reddit thread.

I still think...to make this work, you have to rethink the language. I'm not sure whether Erlang would make it easier to define functions that are easily found and reused.
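
For what it's worth, a toy Haskell sketch of the "global searchable key-value database of functions with metadata" idea (FunEntry and its fields are hypothetical):

  import qualified Data.Map.Strict as Map

  data FunEntry = FunEntry
    { funName  :: String            -- globally unique name (no module prefix)
    , keywords :: [String]          -- metadata used for search
    , author   :: String
    , body     :: [Int] -> Int      -- the function itself (toy type for the example)
    }

  type FunDb = Map.Map String FunEntry

  -- Search the global database by a metadata keyword.
  search :: String -> FunDb -> [FunEntry]
  search kw = filter (\e -> kw `elem` keywords e) . Map.elems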

Code Catalog

Luke Palmer's blog announces code catalog, which focuses on small snippets of reusable code.

public code repository without names

Over the last few months I've been refining the idea of public and reviewable code repositories to work without imports and without a global namespace. I believe that highly reusable code cannot carry a tree or graph of dependencies since we need the ability to cherry-pick our code from other projects.

The aforementioned articles caused me to table many of my earlier designs towards a code-wiki for a few years, but recently I've been pursuing a design where linkers become constraint solvers, searching for code and filtering it based on its ability to meet declarative requirements. This gives me the reusability and connectivity without imports, and allows precise linking when you specify it, but by itself doesn't encourage peer review and sharing.
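
To give a flavor of what "linker as constraint solver" could mean, here is a deliberately toy Haskell sketch (invented types; the real design is far richer): candidate implementations are filtered against declarative requirements, here just a type tag plus some executable assertions.

  data Candidate = Candidate
    { candName :: String
    , provides :: String                         -- e.g. "sort :: [Int] -> [Int]"
    , impl     :: [Int] -> [Int]
    }

  data Requirement = Requirement
    { wantedType :: String
    , assertions :: [([Int] -> [Int]) -> Bool]   -- declarative, executable checks
    }

  -- "Linking" = filtering the pool of shared code by the declared requirements.
  -- No result means unresolved; several results are exactly the ambiguities
  -- discussed below, settled by adding assertions or by linking precisely.
  link :: Requirement -> [Candidate] -> [Candidate]
  link req = filter satisfies
    where
      satisfies c = provides c == wantedType req
                 && all ($ impl c) (assertions req)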

So I couple the constraint-based linking with sharing code via linked DVCS repositories, e.g. with parent/child and sibling push-pull relationships. Some of these DVCS repositories would be 'public', and public repositories would become grounds for social interaction, fulfilling the role of a wiki.

Sharing and social interactions between projects would occur almost entirely through the DVCS mechanisms.

For example, assume projects Q and R inherit from a fairly public repo, P. One day, Q pushes an update to P. Later, developers from R pull the changes. Developers from R say "hey, that's neat, but..." and start tweaking the code from Q to be a bit more friendly (e.g. to expose some more implementation, support some extra arguments, better disambiguate), and push the update back to P. The social interaction so far is Q -> P -> R -> P.

Now Q decides to push an update to P again. Q's developers get an error message saying: "Your version of this module is out of date." And, at this point, they'll be faced with a decision: (a) merge before pushing to P (and potentially reap benefits of R's tweaks, but possibly requiring they modify their own code), (b) get lazy and stop pushing code, (c) attempt to crush the opposition by overwriting the updated version with their own. Naturally, at this point their decision is quite socio-political in nature - P becomes either a battleground or a point of cooperation between project developers (same as any shared wiki name).

From R's perspective: developers of R use updates to P, but are concerned that new modules added to P will affect their active configuration (due to the linker as a constraint solver). So after they load code from P, but before they commit to it, they use a few simple refactoring tools to detect any 'new' ambiguities with respect to their project. If there are new ambiguities, and they disapprove, they'll be faced with a decision: (a) tweak their own code (e.g. to add a new assertion) so that the ambiguity is eliminated, (b) get lazy and add a filter to ignore the problem modules, (c) tweak Q's code to allow more precision in the selection and push the updates back to P. Once again, this is a socio-political decision. R could try to contact Q to coordinate some changes rather than simply publishing to P, but I'd avoid depending on side-channel interactions like that.

A noteworthy point is that both Q and R 'volunteer' for this sort of interaction... participation with DVCS repository P was not technologically forced upon them, and they could favor inheriting from a less-public or better-vetted repo if they wish to do so (putting more insulation between themselves and the public code). So we aren't forcing anything on anyone, though there would be enough benefits for hooking up with the public repositories to encourage developers to accept the extra work of social interaction through them.

Another noteworthy point is that the linker's model is generating technological pressure towards peer review and refactoring. Developers for project R would, presumably, not feel much pressure to review Q's contributions to a public repository except insofar as Q's changes could somehow affect their own project (whether positive or negative). With common namespaces, R would (historically) nest itself deep in a subtree such as 'Com.Example.ProjectR', and Q would do the same, and there would be very little interaction between projects (except competition for popular names). But with linker-as-constraint-solver, developers of R will be alerted to Q's contributions whenever they seem similar to something that R has already developed. This is much more valuable than resolving pointless squabbles over popular names! Effectively, R would be alerted to Q's code when R is in a position to possibly use it, correct it, or refactor it into existing code already developed by R.

We don't need to rely on altruism. The selfish motivations of individual development houses, combined with systematic technical pressure, are sufficient to achieve global review, reuse, and refactoring of code. And that's exactly what I'm aiming for.