Restructor: Full Program Automatic Refactoring

Wrote up a description of some personal research I did 2005-2010:

This algorithm will take an arbitrary program (including side effects) and refactor (create and inline functions) until the code contains no more under-abstraction (copy paste code) or over-abstraction. It is an implementation of code-compression: it finds a set of functions representing your code such that the total AST nodes are minimized.

I had originally intended this as a new way of programming, where you could modify your code using copy paste to your hearts content, and leave the annoying work of abstraction to the computer. I now realize that is a bit naive, and note some reasons why this may not be ideal on the page above. Still, an interesting idea that I've not seen before, would love to hear y'alls expert opinions :)

For reference, I posted about this idea on LTU in 2004 under the topic "abstractionless programming".

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Suggesting abstractions

I've experimented with similar algorithms, albeit for a trivial (Forth like) language. We can seek subprograms that are similar modulo a few holes, factor out the holes.

One challenge I ran into is consistently and efficiently recognizing commutative code. Like (a + _) vs (_ + a) can be collapsed to the same node for many types (numbers, matrices, etc.). Or that order of arguments curried to a function is irrelevant. Order of fields in a record, order of assignment to them. Not only is there a never ending supply of type-specific commutativity, but just searching for equivalent subroutines under reordering blows up into a problem of recognizing similar graphs, the algorithms for which tend to sit in NP.

I eventually gave up the idea of reducing code to a "normal form". And really, at the extreme interpretation, that's undecidable anyway as it would require solving the halting problem (so we can rewrite to infinite loop).

I doubt automatic whole codebase refactoring will see much use outside of possible size optimization and minification of code. But suggesting abstractions based on analysis of large codebases should be widely useful, and wouldn't break mental code maps.

No normal form

I thought it was provable that code has no normal form? You are essentially asking for equality on functions. That is does function A do the same thing as function B. This reduces to a proof search - given all the axioms of mathematics, find a path from A to B. This has the same properties as an algorithm as generalised proof search, which is certainly an open problem, if not impossible, due to the fact that we often have to take steps that move us further from the goal to not get stuck, a bit like navigating a maze. With our current understanding you would need something like a proof assistant, where a human gives hints to help narrow the search and find the path. There might be something like the mathematical equivalent of the 'follow the left edge' rule that could reduce the search space, but whether it would be enough to make the general proof search for function equivalence useful is doubtful.

normal forms

Yeah, it's undecidable to find a normal form in the extreme sense of "the minimal program with the same behavior". As I mentioned, that would require solving the halting problem. But we can still have a normal form in a weaker sense of having a confluent set of rewrite rules that systematically translate a program into another program with the same observable behavior.

My essential point above is that even this weaker "normal form" seems infeasible in practice (due to NP-complete searches) once we start contemplating commutative operations, reordering of data structures, and the rules needed to recognize and automatically refactor common behaviors under reordering.

I haven't proven I reach "normal form"

Note that while I speak of normal form, I haven't proven that I reach it, nor is that important for the practical use of this algorithm. The algorithm is very iterative/emergent/heuristic in nature, which keeps it simple, and it appears to find the set of abstraction that is close to optimal :)

I haven't considered commutative operations yet, but that could indeed be an interesting addition. While most of my examples are with operators, I'd expect the bulk of nodes in more normal code to be function call etc, most of which are not commutative.

For Forth this algorithm indeed becomes a lot more elegant, as it just finding the longest/most repeating sub-"string" in its most basic form. I have considerd Forth for this algorithm, and I do remember finding cases where even there abstraction with holes (a closure argument) would be beneficial, but it is less essential than in non-stack languages.

I agree a more human controlled version of this algorithm would work better for most people :0


Could it collapse a sequence of repeated function calls into a loop?
Or a sequence with small differences into a function mapped over a list of those differences.

It might have to do some sort of tradeoff between compressing the most common pattern and compressing the longest loop. Whichever got it smaller would be best I suppose.

edit: I see that that is in the first page of your brainstorming.

Loop refactoring..

It doesn't do those yet, but that would certainly be interesting. It will currently end up with multiple subsequent similar looking function calls, so that should be easy to recognize.

"Copy and Paste Redeemed"

Narasimhan and Reichenbach (2015) did work that seems relevant to your goals
("Copy-paste redeemed",—have you taken a look at that? How does your work compare?

(Narasimhan and Reichenbach, 2015) Krishna Narasimhan and Christoph Reichenbach. 2015. Copy and Paste Redeemed. In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE) (ASE '15). IEEE Computer Society, Washington, DC, USA, 630-640. DOI=


I can't tell if the author's recommendation that programmers use this tool instead of proper abstractions is meant to be humorous or not, which makes the introduction somewhat troubling, but the results on "real programs" correctly suggest that this can be a useful quality-increasing tool.

In other related work, I wrote in my own Search for Program Structure (a paper/essay about the interest of looking for canonical representations of programs) that while canonical representations are convenient for algorithms operating on programs, it is not the best representation for programmers writing programs. The criticism, quoted below, also applied to the proposed "full refactoring":

The choice of a highly non-canonical representation of programs allows programmers rich means of expressions of the same program. Two distinct expressions of the same program may correspond to stylistic difference, or an expression of intent in the way the program being worked on will evolve over time. Non-canonicity may be essential for flexibility, clarity and modularity.