## word2vec

So I made some claims in another topic that the future of programming might be intertwined with ML. I think word2vec provides some interesting evidence for this claim. From the project page:

> A simple way to investigate the learned representations is to find the closest words for a user-specified word. The distance tool serves that purpose. For example, if you enter 'france', distance will display the most similar words and their distances to 'france', which should look like:

| Word | Cosine distance |
|---|---|
| spain | 0.678515 |
| belgium | 0.665923 |
| netherlands | 0.652428 |
| italy | 0.633130 |
| switzerland | 0.622323 |
| luxembourg | 0.610033 |
| portugal | 0.577154 |
| russia | 0.571507 |
| germany | 0.563291 |
| catalonia | 0.534176 |
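
The ranking that the distance tool performs can be sketched roughly as follows. This is a minimal illustration, not the word2vec implementation: the three-dimensional `toy_vectors` are invented for the example, whereas real embeddings are learned from a corpus and have far more dimensions. As far as I know, the scores in the "Cosine distance" column are cosine similarities, so higher means closer.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, k=3):
    """Rank every other word by cosine similarity to `word`, best first."""
    target = vectors[word]
    scored = [(other, cosine_similarity(target, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical toy embeddings, invented for this example only.
toy_vectors = {
    "france": [0.9, 0.8, 0.1],
    "spain": [0.8, 0.9, 0.2],
    "belgium": [0.9, 0.6, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

for name, score in nearest("france", toy_vectors):
    print(f"{name:10s} {score:.6f}")
```

With these made-up vectors the two countries rank well above the unrelated word, which is the same kind of grouping the table shows.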

Of course, we totally see this inferred relationship as a "type" for country (well, if you are OO inclined). Type then is related to distance in a vector space. These vectors have very interesting type-like properties that manifest as inferred analogies; consider:

> It was recently shown that the word vectors capture many linguistic regularities, for example vector operations vector('Paris') - vector('France') + vector('Italy') results in a vector that is very close to vector('Rome'), and vector('king') - vector('man') + vector('woman') is close to vector('queen') [3, 1]. You can try out a simple demo by running demo-analogy.sh.
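
The vector arithmetic behind such analogies can be sketched in a few lines. The two-dimensional vectors in `toy` are hand-picked assumptions (one axis loosely standing for "royalty", the other for "gender") so that the arithmetic works out; learned embeddings only approximate this, and this is not what demo-analogy.sh runs.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Word closest to vector(a) - vector(b) + vector(c), excluding the inputs."""
    query = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_similarity(query, vectors[w]))

# Hypothetical hand-made embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
toy = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.2, 0.8],
}

print(analogy("king", "man", "woman", toy))
```

The input words are excluded from the candidates because the query vector usually stays close to one of them, which would otherwise win trivially.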

I believe this could lead to some interesting augmentation in PL, in that types can then be used to find useful abstractions in a large corpus of code. But it probably requires an adjustment in how we think about types. The approach is also biased toward OO types, but I would love to hear alternative interpretations.

### Type inference through Machine Learning

Sorry I can't comment on word2vec. But I'll add on the machine learning theme: people used ML for type inference and got the result accepted at POPL.
Predicting Program Properties from “Big Code”
http://www.srl.inf.ethz.ch/papers/jsnice15.pdf

### Nice find! I guess POPL

Nice find! I guess POPL isn't as boring as I thought it was.

The examples seem kind of simple, and at least with respect to type inference, are quite easy problems. It doesn't seem to deal with subtyping or parametricity.

Here is a Hacker News post on the project. There is a link to an FSE paper that describes naming convention inference in Java.

### dnns for interpretation & summarization

I tried finding the reference but could not: some researchers are trying to 'learn' interpreters/compilers.

We've been using word2vec to summarize what is essentially exceptions within distributed programs. Smarter techniques for reading and writing programs become interesting here! Instead of describing a picture, describe a program. Instead of generating a picture, generate a program. FWIW, our use is definitely more conservative/constrained -- less crazy, but more reliable.

### Cool, I would love to know

Cool, I would love to know more about how you are using word2vec in the context of distributed systems. My colleagues would definitely be interested, but maybe it's a trade secret?

The problem with programs, as I was told by an ML researcher, is that code isn't differentiable: it can't be modeled as a value to be increased or reduced. So the image analogy breaks down quickly. I would guess that we would have to find a different representation for code that is more "continuous", and that the problem is tightly related to the one of live programming (scrubbing is just manual gradient descent). I don't have any good ideas on how to do that yet, but...we can start taking baby steps.

I think word2vec has a lot of potential in type system augmentation, especially OO type systems that already include similar relations in a much more clunky and manual way. But I really need to put some more time into it.

### Think of it more as clustering error messages

We're doing it in the context of barrages of alerts: program source has some English text (comments, var names), but alerts and exceptions are especially high signal/noise. Word2vec means that we can not only cluster but also summarize (inverse map).
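
To make "cluster, then summarize via the inverse map" concrete, here is one possible sketch. It is an assumption about how such a pipeline could look, not the commenter's actual system: each alert is embedded as the average of its word vectors, alerts are grouped greedily by cosine similarity, and a cluster is "summarized" by the vocabulary word nearest its centroid. The `word_vectors` table, the `alerts` list and the `threshold` value are all made up.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean(vectors):
    """Component-wise average of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def embed(text, word_vectors):
    """Average the vectors of the known words in `text` (None if there are none)."""
    known = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return mean(known) if known else None

def cluster(alerts, word_vectors, threshold=0.9):
    """Greedy clustering: join the first cluster whose centroid is similar enough."""
    clusters = []  # each cluster is a list of (alert, vector) pairs
    for alert in alerts:
        vec = embed(alert, word_vectors)
        if vec is None:
            continue
        for members in clusters:
            if cosine(mean([v for _, v in members]), vec) >= threshold:
                members.append((alert, vec))
                break
        else:
            clusters.append([(alert, vec)])
    return clusters

def summarize(members, word_vectors):
    """Inverse map: the vocabulary word closest to the cluster centroid."""
    centroid = mean([v for _, v in members])
    return max(word_vectors, key=lambda w: cosine(centroid, word_vectors[w]))

# Hypothetical embeddings and alerts, invented for this example only.
word_vectors = {
    "timeout": [1.0, 0.1], "connection": [0.9, 0.2], "refused": [0.8, 0.0],
    "disk": [0.1, 1.0], "full": [0.0, 0.9], "quota": [0.2, 0.9],
}
alerts = ["connection timeout", "connection refused", "disk full", "disk quota full"]

clusters = cluster(alerts, word_vectors)
for members in clusters:
    print(summarize(members, word_vectors), [alert for alert, _ in members])
```

A pre-trained model would supply the word vectors in practice; the averaging and the greedy grouping are just the simplest choices that show the idea.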

### Ah, I was thinking about

Ah, I was thinking about running word2vec on a code corpus to figure out the implied relationship between classes and methods. I think this is taking an existing model and using it to optimize your output, right?

### yep

Ours is closer to clarifying stack traces / exceptions, especially in distributed environments. The dynamic analysis view of things. (And, lightweight: it's exactly as I described -- feed in alerts, get back the clusters & essential human descriptions.) There's still value to retraining here, but Wikipedia is pretty good already :)

You're taking the static & lexical view. Still interesting -- imagine learning recipes, macros, extensions.

### The way it works...

The way it works is that contexts in which words occur are the basis vectors. For example
 My _ is fluffy. 

Dog, cat, bunny, pillow, etc can appear in this context, but lawyer, car, book, etc do not. So the former have a non-zero weight for this basis.

Carrying over the analogy to artificial languages like PLs, this says that e.g. each function that takes an Int is a basis vector - but this already is the definition of Int, i.e. INT_MIN..INT_MAX are the similar "words" that appear in these contexts. So I'm not sure this buys you anything that you don't already have in artificial languages.

Possibly you could find implicit subtypes, e.g. the Strings that appear in the context atoi(_) are similar in that they encode integers, but most Strings would not appear in this context.
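
The "contexts as basis vectors" picture can be sketched with plain counts. This is a deliberate simplification and an assumption on my part: word2vec learns dense vectors from a sliding window rather than counting whole-sentence contexts, and the tiny `corpus` is invented. It only shows how words that fill the same hole end up with overlapping non-zero coordinates.

```python
from collections import Counter, defaultdict

def context_vectors(sentences):
    """Map each word to a Counter of the contexts (sentence with a hole) it fills."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            context = " ".join(tokens[:i] + ["_"] + tokens[i + 1:])
            vectors[word][context] += 1
    return vectors

def shared_contexts(a, b, vectors):
    """Contexts in which both words have a non-zero weight."""
    return set(vectors[a]) & set(vectors[b])

# Hypothetical corpus, invented for this example only.
corpus = [
    "my dog is fluffy",
    "my cat is fluffy",
    "my pillow is fluffy",
    "my lawyer is expensive",
    "my car is expensive",
]
vectors = context_vectors(corpus)

print(shared_contexts("dog", "cat", vectors))     # both fill "my _ is fluffy"
print(shared_contexts("dog", "lawyer", vectors))  # no context in common
```

Replacing sentences with call sites such as `atoi(_)` would give the implicit-subtype reading: the strings that fill that hole share a coordinate that most strings lack.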

### Code completion. I mean,

Code completion. I mean, given the terms surrounding a context, what are the most likely terms to fill it? Having a ranked list is essential when you have 1000 choices!

### Check out Dave Mandelin's and Dan Grossman's work here

Dave Mandelin did some old-school statistical type-based code completion work, and looks like Dan's group has some good follow-ups. Same intuition: types restrict shape, and stats prioritize.
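
The intuition "types restrict shape, and stats prioritize" can be sketched as a filter followed by a sort. This is only an illustration of that intuition, not the algorithm from either group's work; the `candidates` table, the context label and the `usage_counts` numbers are all hypothetical.

```python
from collections import Counter

def complete(expected_type, context, candidates, usage_counts):
    """Keep candidates whose return type fits, then rank them by observed usage."""
    fitting = [name for name, ret_type in candidates.items()
               if ret_type == expected_type]
    return sorted(fitting,
                  key=lambda name: usage_counts[(context, name)],
                  reverse=True)

# Hypothetical method table: name -> return type.
candidates = {"length": "int", "indexOf": "int", "hashCode": "int", "trim": "str"}

# Hypothetical corpus statistics: (context, method) -> observed count.
usage_counts = Counter({
    ("for-loop bound", "length"): 120,
    ("for-loop bound", "indexOf"): 15,
    ("for-loop bound", "hashCode"): 1,
    ("for-loop bound", "trim"): 40,
})

print(complete("int", "for-loop bound", candidates, usage_counts))
```

Note that `trim` is dropped despite being popular in this context: the type filter guarantees every suggestion is well-typed, and the counts only decide the order.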

### That is an easy way to go

That is an easy way to go about it: use stats to prioritize a code completion menu, for example.

But I think there might be a lot of fruit to be found in vector representations of types themselves.

### Desperately trying not to be cryptic, but...

[much more garbage than intended, rephrased below]

### How would types as a distance work?

Standard treatments of types are based on set theory: types are sets of values and so the relationships that we expect between types are set-theoretic. Inclusion and exclusion tell us something about the possible values that can be instantiated from a type. Monotonic operators on sets give us program properties that we can reason about clearly: products of values and unions of values give us different forms of composition.

All of these standard treatments are instances of a single process, described succinctly by Luca Cardelli as:

> As soon as we start working in an untyped universe, we begin to organize it in different ways for different purposes. Types arise informally in any domain to categorize objects according to their usage and behavior. The classification of objects in terms of the purposes for which they are used eventually results in a more or less well-defined type system. Types arise naturally, even starting from untyped universes.

Distance measures estimated from co-occurrence are a completely different form of organisation, and the structure that they would reveal in code is something quite different to the treatments that we have already explored. This would suggest that the set of organisational facts that we could infer from a program would be different, and so the notion of what a type is would shift. Do you have any intuition on what kinds of organisation these forms of types would reveal? It's quite a long jump from co-occurrence of nouns in natural language to value-sets within code...

### Distances as a type: Meters

Types are different than sets.

Also sets are best defined in terms of types, e.g. see Sets defined using Types.

For example,

Plus.[x:Meters[aNumber], y:Meters[anotherNumber]]:Meters ≡ Meters[aNumber+anotherNumber]

Times.[aNumber:Number, x:Meters[anotherNumber]]:Meters ≡ Meters[aNumber*anotherNumber]
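
For readers unfamiliar with the notation, the two definitions could be transcribed into Python roughly as below. This is my own reading of them, not the commenter's code: the `value` field name and the choice to reject mixed operands are assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Meters:
    """A length in meters; deliberately not interchangeable with a bare number."""
    value: float

    def __add__(self, other):
        # Plus: Meters + Meters -> Meters
        if not isinstance(other, Meters):
            return NotImplemented
        return Meters(self.value + other.value)

    def __rmul__(self, scalar):
        # Times: Number * Meters -> Meters
        if not isinstance(scalar, (int, float)):
            return NotImplemented
        return Meters(scalar * self.value)

print(Meters(2) + Meters(3))
print(4 * Meters(2.5))
```

Adding a bare number to a `Meters` value raises a `TypeError`, which is the sense in which the type carries more than the set of numbers it wraps.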


### The idea that we could

The idea that we could establish statistical relationships between types without getting into set theoretics is kind of appealing from the direction of augmentation: clear reasoning doesn't help point me in the right direction, it only tells me when I'm done! Or even just ontological reasoning: if king - man + woman = queen and Beijing - china + Paris = France, then what are the relationships between the types to get there, and would it be a useful relationship to leverage in a programming language?

I'm not sure our types would really shift, at least they shouldn't shift, inasmuch as we can usually infer type information in code based on usage. The question is, can co-occurrence that is so locally defined be useful in reasoning about types that can be much more complicated (with depth!)?

What I hope we could find are richer implicit typing relationships; e.g. models of which methods are likely to be called on which types, which can then be used as feedback to programmers. Information that type systems oriented at correctness properties can't really convey. It isn't about value sets at all, at least not directly.

### Structural vs. Contextual, and the Denotational Zoo

Most code completion work separately handles structural (sound completions) from contextual (likely), and of course some heuristics in-between. Stuff like word2vec is contextual, and the party line would be that, with enough data, it emulates structural for all the programs we "care" about. An AI can transcribe an image, but there's no guarantee it makes sense.

In terms of switching from strings to typed code, the math for feature encoding is non-obvious to me. On one hand, maybe we can get lucky and natural language vs human languages encoded in text can be learned as easily as one another. On the other, despite knowing the static and dynamic form of programs, it's not obvious how to encode that w/ say http://scikit-learn.org/stable/modules/feature_extraction.html . The ML just recently started shifting from matrices to tensors, and to faithfully encode types or denotations seems hard!

For one of your questions, I'd love to see refactoring tools in the nominal and structural typing worlds around this. "Did you mean a monad? Most people call this IO!"

### re ontological reasoning

> Or even just ontological reasoning: if king - man + woman = queen and Beijing - china + Paris = France, then what are the relationships between the types to get there, and would it be a useful relationship to leverage in a programming language?

A dangerous conflation lurks there between normative and deductive ontology. It's dangerous because computational tools rooted in that kind of conflation are instruments of a governmentality based on the brutal technical management of an abstract population which has been parsed as and is manipulated as an assemblage of demographic categories.

In terms of human freedom, the dangerous conflation of normative and deductive ontology is the direct, technocratic instrument that creates, out of thin air, effective categories of people such as "terrorist watchlist subjects", "ex-felons", and "suspected communists".

Schematically, you have big institutions (like the federal "intelligence community" or the on-line ad companies) who assemble massive corpuses of ad hoc measurements of the population and its behavior.

You have next mechanistic, competitive, and speculative application of machine learning algorithms to those "big data" corpuses.

Constructed atop that you have ideological imposition of ontological hypotheses (such as the hypotheses that drive where to direct speculative law enforcement attention, employment policies, and so forth).

Those ideological hypotheses that resonate (not are proved by, merely resonate) with the empirical population data then become building blocks of policy. For example, we can find in the big data a broad category which we happen to dub "potential terrorist" -- and regardless of how good that interpretation of the demographic really is, it is nevertheless a demographic that helps the FBI get more money.

Finally, you have institutional interventions on the population based on these measurements and ideologically imposed interpretations of the abstract population. So, for example, the FBI unjustly targets certain groups of people such as activists for surveillance and harassment. In the course of this they can randomly find or incite some criminality, as they could with just about any random group of targets, but politically this is enough to justify the whole illogical circuit of power (and get renewed funding for next year).

"Toolishness" is the unconscious and uncritical volunteering or selling of technical expertise without regard to the social consequences of that volunteering.

I am not sure what would be the virtuous aim of trying to extend type systems to impose ideological categories over fuzzy demographic data.

All I see is the aim to amplify the range of brutal population management.

### Wow, I never realized type

Wow, I never realized type systems could be so....repressive. But really, I just want to brutally manage my objects and values, not other people. The technology already exists to be misapplied, but maybe it can also be applied for productive use.

### re maybe it can also be applied for productive use.

I think you misunderstand.

Your speculation about types contains a deep non-ideological technical problem: a conflation of normative and deductive ontology. Even if you wanted to be cunningly repressive, that error would still get in your way.

I did draw out some of the consequences of that kind of technical mistake that do, however, challenge the idea that "maybe it can also be applied for [good] use."

The mistake leads to putting controls on computational entities that on the controller side have normative ontological labels, but on the controlled back-end do something unintended to actual people.

### I still don't get it,

I still don't get it. Despite my tendency to anthropomorphize my code (I am an OOP enthusiast after all), I don't see how extracting relationships from a code corpus, or even just designing a better OO type system that supports analogy, leads to anything that is actually bad for humans. I can see how stereotyping people is bad, but stereotyping objects is just efficient.

### re I still don't get it

Here are some examples of computations that mesmerize to greater and lesser degrees:

Eliza, Emacs' "dissociated-press", the slightly more sophisticated examples that generate fake nonsense academic papers that slip by the false editors and reviewers of scam journals.

Another phenomenon that's related goes to a way old hack Martin Gardner popularized: If you ask people to make up out of their imagination a random sequence of heads or tails, but put a markov chain predictor on the other side, the predictor will usually swiftly train to predict the human's "heads or tails" choice with better than random probability.
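
A predictor of this kind is short enough to sketch. The version below is my own minimal take, not a specific published one: an order-k Markov model (with `order` assumed to be at least 1) that guesses the next call from whatever most often followed the last k calls so far. The sample sequence is invented.

```python
from collections import Counter, defaultdict

class CoinPredictor:
    """Order-k Markov predictor over a stream of 'H'/'T' calls."""

    def __init__(self, order=2):
        self.order = order  # assumed to be >= 1
        self.history = ""
        self.counts = defaultdict(Counter)  # recent context -> next-symbol counts

    def predict(self):
        context = self.history[-self.order:]
        seen = self.counts.get(context)
        if not seen:
            return "H"  # arbitrary default before any evidence exists
        return seen.most_common(1)[0][0]

    def update(self, actual):
        context = self.history[-self.order:]
        self.counts[context][actual] += 1
        self.history += actual

def score(sequence, order=2):
    """Fraction of calls guessed correctly while learning online."""
    predictor = CoinPredictor(order)
    hits = 0
    for call in sequence:
        hits += predictor.predict() == call
        predictor.update(call)
    return hits / len(sequence)

# A hypothetical human-made "random" sequence that alternates too regularly.
print(score("HTHTHHTHTHTTHTHTHTHHTHTH"))
```

On a truly random sequence the score hovers around one half; any habit in the human's choices, such as avoiding long runs, pushes it above that.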

One thing those examples have in common is that they create the illusion that the computer is doing something with a deeply human meaning. Early dupes easily spilled secrets to Eliza. It can take more than trivial scrutiny to be sure a fake nonsense paper is really machine generated nonsense. The coin-guessing machine can be maddeningly addictive.

Brains have bugs in that area. Trivially exploited bugs. Con artists and brain washers know about those bugs. A lot of age-old social custom and habit is in part an attempt to defend against such bugs.

Computers are capable of tilting the playing field and more intensively than ever exploiting those bugs to harm people and enrich others at their expense.

But it gets worse.

Computation can be a tool for exploiting those bugs but even worse, naively arranged computation can exploit those bugs without anybody in the "cybernetic" control circuit actually consciously intending to exploit those bugs.

Q. How can cybernetic computing systems accidentally exploit those bugs at industrial scale?

A. For example, by presenting the interface for control in terms of ideological intensions -- conscious demographic stereotypes -- while using that control signal to modulate an illusory coincidence; to generate effective operations on a population that don't actually reflect the intensions expressed at the input, but that create the illusion they might. Like Eliza or fake academic papers.

In the context of Eliza, back when, this bug in human brains and its unfortunate interaction with computation was an abstract anomaly contained in a lab.

In the context of what Google is doing, or the "intel community" is doing: It is as if authority is insisting that a fake academic nonsense paper is in fact real and true and everyone must adapt to living according to its claims.

### POV?

Sean has not presented any ideologically slanted context for his interests. You seem to be placing a very narrow and specific interpretation onto his words that I do not see any evidence of at all. Your writing style has changed quite dramatically in your comments in this thread, and the big questions thread. The associations that you are making to label research directions as "threats" to democracy / anti-globalisation / humanity are coming across (in my opinion) as paranoid and somewhat off-topic.

Over your past few responses you have equated "reasoning with distance metrics instead of set-theoretic operators" with "supporting an authoritarian regime". Is this not a somewhat extreme argument?

### re POV?

> Sean has not presented any ideologically slanted context for his interests.

I am not accusing Sean of any ideological slant.

I don't think so but I am bringing in some multi-disciplinary ideas that I guess are probably not familiar.

When we apply programming language theory to machine learning applied to "big data", I hope you agree that there's a kind of user interface question:

Users of these systems will have a mental model of what the inputs and outputs to these programs mean. In the field, current users interpret the inputs and outputs to mean things like "likely buyers" or "suspected activists".

This direction of computing research is intimately linked to the structure and power of big institutions (like ad-sellers or the FBI).

The inputs and outputs and how they are described by computing professionals is intimately linked to the decision making processes of those big institutions of power.

The discourse that drives decision making in those big institutions is unavoidably ideological. It entails an abstraction of society such as a breakdown into demographic groups. It entails interpretations of society in which those abstract categories of people are subjects and objects. And the decision making process modulates actual real-world power like what happens to consumers and how law enforcement efforts are directed. The decision makers contemplate society in terms of those abstractions in order to operate upon society.

It's well known in sociological and philosophical discourse that when big organizations operate on society that way, they are often not discovering some latent, objective demographic breakdown that exists out there in the world, waiting to be discovered. Often they are imposing an arbitrary and accidental structure on society. Power creates its own reality and without much regard to human concerns.

ML work and PLT work should be critiqued on the issue of how it is described to the rest of the world, most importantly how it appears in the decision making processes that control big, powerful social organizations.

As an analogy:

If we were talking about graphical user interfaces for controlling complicated machines, we would want to consider more specific cases like "What kinds of error does this approach make more likely if it is used in airplane control systems? Nuclear weapon launch control systems? Patient record systems?"

In the discussion of word2vec and "types", I saw an unconscious drift from what is literally an Eliza-level illusion of human meaning into uncritical, unqualified social-ontological interpretation. That is precisely the kind of UI error that is troublesome in the context of how this technology is used.

### Natural evolution

To be honest, I'm guessing you're being tongue-in-cheek here.

I do think there is a mixed data-and-society rethinking happening for language design, and wish it'd go faster. Starting ~6 years ago, I began wondering, "How should we design to leverage network effects?" What if we had all of the programs ever written and their execution data? Could we write languages/tools that became faster/more productive the more data we had? How do we structure this? We see weak forms of this today: package managers for deploying code, repos for collaborating on it, and gofmt for language evolution + refactoring.

To your point, we need to get the sharing right. Centralizing to a benevolent dictator or decentralizing a la bitcoin seems much more preferable than getting it owned by a corporation (ex: GitHub, NPM, Docker, Amazon.) As we figure out how to make the data more useful, getting this right becomes more important!

### This totally reminds me of

This totally reminds me of Vernor Vinge's programmer archeologist. It is crazy that reality could catch up to fiction.

### re tongue-in-cheek

> To be honest, I'm guessing you're being tongue-in-cheek here.

Not to the slightest degree. Mass surveillance, big data, and machine learning are epochal like the atomic bomb or biological warfare.

### I actually agree with you to

I actually agree with you to some extent (hey, SkyNet might just be around the corner), but I just don't know what it has to do with this conversation, which is more about how we can apply new technology for use in good rather than worrying about the evil implications. In the long run, we might very well lose the battle with the machines, we might be creating our own successor species that will render us obsolete, BUT I'm pretty sure applying word2vec to type systems is not going to bring about such doom.

I'll end this with something something Bill Joy.

### > the battle with the machines

> the battle with the machines

It's not a battle with the machines. Asymmetric mass surveillance and big-data computing is a phenomenon of a two-class society. It is class warfare.

> what it has to do with this conversation

Earlier, when you wrote:

> Or even just ontological reasoning: if king - man + woman = queen and Beijing - china + Paris = France, then what are the relationships between the types to get there, and would it be a useful relationship to leverage in a programming language?

To come full circle:

> A dangerous conflation lurks there between normative and deductive ontology. It's dangerous because computational tools rooted in that kind of conflation are instruments of a governmentality based on the brutal technical management of an abstract population which has been parsed as and is manipulated as an assemblage of demographic categories.

The data sets are about relations between words arising solely from their proximities in texts but you are suggesting looking for types that describe these relations as if they were social relations among people such as "queen" and "France".

My suggestion is that the more virtuous reifications of machine learning as programming language concepts will help users and programmers actively resist those kinds of conflations (hence have a more literal and less politically crazy view of what the programs in question mean).

### Is this just because I used

Is this just because I used Queen in my example? Because, that was just what was given in the technology description. Would it have been more socially responsible to use Monad as an example? As far as I can tell, monads are apolitical.

At the end of the day, words (and I would say types) obtain meaning via how they are used, not the other way around. To reject that is to basically reject free will.

### re is this just because I used...

> Is this just because I used Queen in my example?

No, it's because you were discussing statistical features of word proximity as if they measured social relations (and because the domain we're talking about here is big-data computations over mass surveillance data of one sort or another).

> At the end of the day, words (and I would say types) obtain meaning via how they are used, not the other way around. To reject that is to basically reject free will.

Free will does not imply individual or even collective choice about the meanings of words.

### This is just all over my

This is all just over my head, sorry. I guess I could see something like a 1984-ish Newspeak arising from a smart programming language that shapes the way we think with machine-learned bias. But I'm totally missing your very specific point.

### Enabling Technologies

After reading Thomas' longer reply on a different branch above I think I can see what he means, and his concern is probably one that we have all faced at some point. I would try to explain his point firstly by generalising it completely from this domain.

Any technology enables some set of applications. In general we don't associate degrees of evil or danger with specific technologies, but rather with the applications that they allow. In pop vernacular: "Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should". One way that a scientist would approach the issue of "should" would be to estimate the risk of a particular technology: sum up the potential damage from all of its applications, each scaled by the probability that it would happen.
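
Written out, that risk estimate is an expected-damage sum. The symbols are introduced here for illustration and are not in the original comment: $p_i$ is the probability that application $i$ of the technology happens, and $d_i$ is the damage it would do.

$$
R = \sum_{i} p_i \, d_i
$$

An application that is nearly certain to be attempted contributes almost its full damage $d_i$ to $R$, so the estimate then hinges on how well that damage can be quantified.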

I think in this case the application (new forms of type judgement) is less what worries Thomas than the technology (categorising nouns in texts by their semantic distance in a synthesized ontology). In this case the worrying application would be: take this big-data corpus of emails / phone-calls / messages and associate nouns and verbs in an ontology. Find X (in the vector sums presented before) where X is "terrorist", or X is "activist". What is the probability of a state-funded organisation attempting this? I would estimate it is about one. What is the potential damage? I don't know how to estimate that, and I suspect that the long-term fallout from the Snowden affair is the realisation that we have already entered the panopticon as a society.

As always with these things YMMV - personally I think the automatic building of ontologies whose elements respect an algebraic structure has many interesting applications that are beneficial or neutral. I guess that everyone forms an opinion based on their own cross-disciplinary interests. For me: the last time I saw something that provoked a similar reaction was the BAe contest in UAVs. Painting "civilians" with a laser designator in an urban environment where they are "needing rescue" from behind obstacles to "save" their lives. I considered that far too militaristic to work on as I have no intention of actively trying to build a first-generation terminator. It surprised me at the time that my colleagues had such a wide range of moral responses. Over time that surprise has evolved into an understanding that "morality in science" is not something that we (as society or as a particular industry) have really gotten to grips with.

### Simply put, the technology

Simply put, the technology is already being developed for these purposes, we in the PL community have no say in that. To apply the technology to PL does not make the situation any worse than it already is.

My dad worked in nuclear power. Though I thought nuclear bombs were evil, power generation seemed fine if done safely.

### re the technology is already being developed

> Simply put, the technology is already being developed for these purposes,

Yes.

> we in the PL community have no say in that.

I strongly disagree. In fact, I think we've found out here a fertile ground for public, not-for-profit, multi-disciplinary research that could and perhaps should have room for PLT researchers.

What can historians, sociologists, and journalists turn up about what work is being done in this area, including secretive unpublished work?

What can philosophers, political scientists and computer scientists figure out about the consonances and dissonances between what such software is literally doing vs. how it is conceived in organizational decision making processes?

Can these groups articulate for a general audience the new realities these systems are creating?

Such research could be of value to journalists as background, judges as expert analysis, legislators, etc.

### Encourage good != discouraging bad

I think Sean agrees that we're in a position to do more good, and that worrying about causing less badness is in this case... tricky. That's part of the reason I brought up, if data becomes central to PL, language designers must be more conscious of their role as community designers. And that, in turn, is why I brought up the already alarming effective privatization of source code to silicon valley companies like GitHub and NPM.

Another recent, interesting example: sharing security alerts across companies. EFF and friends got up in arms when the gov tried mandating sharing attack data between security groups across companies akin to what's already happened w/ fraud data between banks. Facebook, instead, is collecting that same data between companies and explicitly not the government. Replace 'security alerts' with 'runtime exceptions', and that's IMO within the realm of what a source repo can usefully track!

### It is not even a matter of opinion

I think engineers and other practitioners struggle sometimes when presented with ethical implications of their chosen careers or employment. More so because it seems more like a calling (vocation) than a choice, at least to some of us. But it is not even a matter of opinion of whether these things (oppression by data and toolishness--a cromulent word if there ever was) happen. On Facebook people are forced to out themselves instead of using a persona that would limit the ability of individual aggressors to continue to ruin their lives. This non-anonymity requirement, apart from the consequences I just called up, also facilitates gathering of data about individuals which is sometimes for unsavory purposes. I think it could be argued that any ACM member ought to resign from employment at Facebook due to ACM's code. Maybe none of us works for a bomb-making company, but all of us should monitor whether the activities we facilitate are those we would encourage our children to do.

### Neural Networks, Types, and Functional Programming

I don't have a deep understanding of this subject, but this article seems like a very nice introduction to one aspect of learning. He summarizes the three main narratives thusly:

> At present, three narratives are competing to be the way we understand deep learning. There’s the neuroscience narrative, drawing analogies to biology. There’s the representations narrative, centered on transformations of data and the manifold hypothesis. Finally, there’s a probabilistic narrative, which interprets neural networks as finding latent variables. These narratives aren’t mutually exclusive, but they do present very different ways of thinking about deep learning.

Sean started this thread seemingly on the probabilistic narrative, but the article argues convincingly for the data transformation narrative.

### I read that a couple of days

I read that a couple of days ago. It seems to point out ML uses transformations that can be expressed with FP, or ML is about learning abstractions, and abstractions can be expressed in FP. Well, ya duh...but what can we expect to take to the bank with this?

### The intuitive picture

I like the diagrams at the beginning of the article because they suggest a simple intuitive picture of what a NN does. There is a simple 2D input, with a contiguous smooth region being deformed through a relatively smooth deformation onto its output. Everything is nicely coloured to show a partition of the plane with relatively low entropy and I get a nice warm fuzzy feeling looking at it... yeah we could understand this.

But useful NNs feature multiple layers that handle arbitrary transformations of multi-dimensional spaces with non-binary features. They really are opaque black-boxes that encode very problem specific transformations, by their nature.

When we talk about the high-level architecture of building a NN out of simple(-ish) pieces that we understand, it can look like a simple programming problem. But overall the speculation that we could proceed in this way just doesn't strike me as convincing in general. It would be really nice if it worked that way in 30 years...

I would make a different prediction (and I would guess this has been said many times before): in any method that provides a competitive result in classification or regression, the comprehensibility of the result is inversely proportional to the accuracy. This seems to be a natural alternative expression of the over-fitting problem. If I look at SVMs, NNs or RFs (at scale) then the comprehensibility of the results seems to be a direct trade-off against the predictive accuracy.