Self-assembling Type-directed Dataflows

A blog post from Netflix about Self-Assembling Components.

But what if you could have functions that self-assemble? What if processors could simply advertise their inputs/outputs and the wiring between them were an emergent property of that particular collection of processors?
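To make the idea concrete, here is a minimal sketch of my own (not Netflix's actual system, and restricted to single-input processors), assuming Python-style annotations stand in for the advertised input/output types:

```python
import inspect

registry = []  # processors advertise their input/output types here

def processor(fn):
    """Register a function; its annotations advertise its inputs/outputs."""
    sig = inspect.signature(fn)
    ins = tuple(p.annotation for p in sig.parameters.values())
    out = sig.return_annotation
    registry.append((fn, ins, out))
    return fn

def wire(value):
    """Emergent wiring: feed each value to every processor whose
    advertised input type matches, including outputs produced along the way."""
    produced = [value]
    for v in produced:                     # appending extends the iteration
        for fn, ins, _out in registry:
            if ins == (type(v),):
                produced.append(fn(v))
    return produced

class RawTitle(str): pass
class EncodedTitle(str): pass

@processor
def encode(title: RawTitle) -> EncodedTitle:
    return EncodedTitle(title.upper())

print(wire(RawTitle("stranger things")))   # RawTitle flows into encode automatically
```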


Topic-based

Topic-based publish/subscribe systems (like DDS or ROS) tend to be similar in nature, gluing together components based on shared types, potentially conflating channels with named types.

Overall, I think this can be a convenient way to build medium-scale systems. It doesn't scale very well if you have too many components, and it simply isn't very useful if you have relatively few. But if the number of components is in a range where a human can glance at a visual representation of the connectivity graph and eyeball it for correctness, it can be very useful.

That said, I think there are a lot of use-cases for such designs (cf. stone soup programming and modularity without a name).

Linda

Linda does that implicitly: someone publishes tuples, and whoever is interested can subscribe using pattern matching.
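For anyone who hasn't seen it, here is a toy illustration of the Linda idea (my own sketch, covering only the out/rd primitives), with None standing in for a wildcard in the pattern:

```python
space = []  # the shared tuple space

def out(tup):
    """Publish a tuple into the space."""
    space.append(tup)

def rd(pattern):
    """Find (without removing) the first tuple matching the pattern.
    None fields are wildcards; anything else must match exactly."""
    for tup in space:
        if len(tup) == len(pattern) and all(
                p is None or p == t for p, t in zip(pattern, tup)):
            return tup
    return None

out(("temperature", "boston", 21.5))
print(rd(("temperature", None, None)))   # -> ("temperature", "boston", 21.5)
```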

how does this work in practice?

Being clueless here: there must be more to it than just matching on types, I should think. E.g., just because I have 30 million strings lying around does not mean I want to sort all their characters. How do these systems decide what should actually be combined? And what should be saved?

Fine-grained types

You simply create types specific to each purpose. And typically a string will be part of a larger structure that provides some context.

Obviously, a type such as 'String' by itself wouldn't be especially useful. But if, for example, you had JpegString vs. TGAString vs. BitmapString, you might be able to do something useful with them (like automatically composing functions).
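A quick sketch of what the fine-grained types buy you, reusing the type names from above (the annotation-based compatibility test is my own illustration):

```python
class JpegString(str): pass
class TGAString(str): pass
class BitmapString(str): pass

def jpeg_to_tga(s: JpegString) -> TGAString: ...
def tga_to_bitmap(s: TGAString) -> BitmapString: ...

def composable(f, g):
    """g can follow f automatically iff f's output type is g's input type."""
    g_input = next(t for n, t in g.__annotations__.items() if n != "return")
    return f.__annotations__["return"] is g_input

print(composable(jpeg_to_tga, tga_to_bitmap))   # True:  Jpeg -> TGA -> Bitmap
print(composable(tga_to_bitmap, jpeg_to_tga))   # False: no bogus path
```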

not just composing

Hmmm... even so, there will probably be bogus paths in the wiring. I interpreted the Netflix post as some magic that did it all right, but I guess they will burst the bubble when they post later that "oh, actually a human has to review the workflows, ha ha" :-}

eyeball scale

It works pretty well for some common scales... but it can grow hairy.

yeah, sounds like dataflow tuple-spaces

(My current mindset is oriented to appreciate it because I'm working on rule classification with rule data sets so large that space cost is a problem when coded for fastest constant-time dispatch. But the sheer number of matching processes doesn't look like an issue here. I'm about two decades' worth of experience too expensive to work at Netflix, so my remarks are just for fun, not looking for a job.)

This Netflix design approach looks both clever and beautiful, strongly related to dataflow tuple-spaces, so dmbarbour's stone-soup-programming page looks directly relevant. Erlang might work great on that problem description. Probably Golang too.

I assume binding is type-based instead of value-based, but you could make the auto-wiring work either way, since it looks like it's designed to use dynamic types. (Static typing looks like it needs generated code, at least in low-level languages like C. Runtime rewiring would be easier with dynamic types.) I like it as a problem to force into a lightweight process model I'm nurturing, because exploring concrete examples is more fun than abstract problem solving.

If the overall box is tied to a connection, and might be killed when the connection dies, aborts, or gets cancelled, then I'd use one lightweight process for the whole thing, with one fiber for each micro-scale tuple-processor function, if running async was useful. [In my model a fiber is like an Erlang process, and a lightweight process is like a process group, whose purpose is to gather all the fibers to be killed at once if the group is killed.] It amounts to logically (not physically) parallel multiplexing of coroutines scheduled together in the same scheduler tick in OS accounting. Queuing a typed message could dispatch by waking a thread handling that type, which would run next without a context switch unless you ran out of time. You can handle errors locally with a typed message, or signal the lightweight process at different strengths, up to killing it, which also ends each fiber inside.
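A minimal sketch of that dispatch-by-type idea, with Python's asyncio tasks standing in for fibers and a plain class standing in for the lightweight-process group (the Group/Ping names are my inventions, not the commenter's actual model):

```python
import asyncio
from collections import defaultdict

class Group:
    """Stands in for a 'lightweight process': a bag of fibers killed together."""
    def __init__(self):
        self.queues = defaultdict(asyncio.Queue)  # one queue per message type
        self.tasks = []

    def fiber(self, msg_type, handler):
        """Spawn a fiber that wakes whenever a message of msg_type is queued."""
        async def loop():
            q = self.queues[msg_type]
            while True:
                handler(await q.get())
        self.tasks.append(asyncio.create_task(loop()))

    def send(self, msg):
        # Dispatch by dynamic type: wake the fiber handling type(msg).
        self.queues[type(msg)].put_nowait(msg)

    def kill(self):
        # Killing the group cancels every fiber inside it at once.
        for t in self.tasks:
            t.cancel()

class Ping: pass

async def main():
    g = Group()
    g.fiber(Ping, lambda m: print("got", m))
    g.send(Ping())
    await asyncio.sleep(0)   # yield one scheduler tick so the fiber runs
    g.kill()

asyncio.run(main())
```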

If I had a C-to-C rewriter working for this, then writing a new processor micro-box would look like writing a standalone Unix-style filter that compiles to a coroutine that's ignorant of how the wiring works and merely responds to stimulus.

(This paragraph is only tenuously relevant; think of it as related to interference, and to writing non-conflicting tuple-space components.) After you think about a design problem for a long time, you find new interesting loose ends to resolve, around the edges, that never occurred to you before. The specific edges holding my attention for a while now are in the meaning of details in a daemon's virtual file system's namespace. (It's hard to define the meaning of committing or aborting persistent snapshot changes for non-ephemeral data populations.) I spend most of my daydreaming time on paths for things like metadata, and now every question gets translated to "what does that look like as a path?", which is probably not as useful as it seems. However, for problems like hosting plugins, the idea of mounting parts of a namespace can be used to hide things from each other when they are not designed to cooperate, which means noise suppression, so active processes don't get stray messages not intended for them.

Prolog Again

Seems a lot like Prolog (with a single functor and codata support). The inputs and outputs are the functor arguments. The database is searched for a matching functor (what they call a command). Types are just predicates on values in Prolog, so type directed is just functor selection with backtracking on a few predicate tests.
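Sketched in Python rather than Prolog (generators stand in for backtracking, and the clauses in the database are invented for illustration):

```python
# Each 'command' is a clause: a guard predicate on the input, plus a body.
database = [
    (lambda x: isinstance(x, int), lambda x: ("doubled", x * 2)),
    (lambda x: isinstance(x, str), lambda x: ("shouted", x.upper())),
]

def solve(x):
    """Functor selection with backtracking: yield every clause whose
    predicate tests succeed, in database order."""
    for guard, body in database:
        if guard(x):            # a failing test backtracks to the next clause
            yield body(x)

print(next(solve(21)))      # -> ("doubled", 42)
print(list(solve("hi")))    # -> [("shouted", "HI")]
```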

pattern matching nothing new, news at 11

Agreed that pattern-matching the inputs to the outputs is nothing new or non-obvious after a beer or two.

Since that is the case, I assumed it wouldn't be worth a blog post or three, so I interpreted their blog post as sorta saying that the system wires it all up correctly, without any mistakes or junk or overkill or crap. Which, if true, would be the amazing part to me.

So at the moment I have to assume I read too much into it, since it seems patently impossible :-)

every connection

I had assumed it wired things up dynamically as you connected and made requests. If you wrote the components in a fast low-level language, described all the IO in a Prolog database, and fed the communication protocol into a Prolog interpreter, that's what you would get. So when your request to stream file x in format y arrives, and the database says the file is in format z, Prolog would do a backtracking search to find a combination of available transcoders that would satisfy the requirement. In this case breadth-first or iterative-deepening search strategies would make the most sense.
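A sketch of that search in Python rather than Prolog, using breadth-first search so the shortest transcoder chain is found first (the formats and converter table are made up):

```python
from collections import deque

# Available transcoders: (input format, output format)
converters = [("mkv", "mp4"), ("mp4", "hls"), ("mkv", "avi"), ("avi", "hls")]

def plan(src, dst):
    """Breadth-first search for the shortest chain of transcoders src -> dst."""
    frontier = deque([(src, [])])
    seen = {src}
    while frontier:
        fmt, chain = frontier.popleft()
        if fmt == dst:
            return chain
        for a, b in converters:
            if a == fmt and b not in seen:
                seen.add(b)
                frontier.append((b, chain + [(a, b)]))
    return None  # the request can't be satisfied

# File stored as mkv, client asked for hls:
print(plan("mkv", "hls"))   # -> [("mkv", "mp4"), ("mp4", "hls")]
```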

ok

in a sufficiently constrained domain, ok, got it.