Better language tools

I have recently been merging code from various branches. One rather annoying thing I noticed is that the diff and merge tools don't seem to understand enough about the language to produce a good picture of what's going on. For instance, if I am merging an insertion where the inserted code happens to begin with the same opening lines as an existing comment and end with the same closing lines, the tool splices the new piece into the middle of that comment rather than treating the comment as a single block. The same kind of thing happens often elsewhere.
So with that, my question is really whether better merge tools are viewed as a problem worth solving for languages, or whether the payoff is not considered worth the effort?

I realize this is not a question directly about PL (theory) in general, so if the question is inappropriate here, please take care of it accordingly.


Concur

(For whatever it is worth) I agree that tools are very important, and I furthermore think that the state of the industry is really lame! The sort of thing you mention is a great example. In the usual approach to software development there is a phase where you are doing nothing but fixing bugs, and often the bugs are in code that has a long history of edits. Trying to see what happened to that code with diff tools that show a mess, rather than structured alterations, is less than ideal.

If there were tools that understood the structure, then we could also get rid of the "how do you indent your code" problems (for you non-Haskell, non-Python folks) by simply translating the code into whatever style each individual prefers.
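For Python at least, the standard library already demonstrates the round trip in miniature: parse the source to an AST, then pretty-print it back in a canonical style. A toy sketch (with the caveat that ast.unparse throws away comments, which is exactly the concrete syntax a real structured editor would need to keep):

    # Minimal sketch: normalize formatting by round-tripping through the AST.
    # Standard library only; ast.unparse needs Python 3.9+.
    import ast

    messy = "def f(x):\n        return (x +1)\n"

    tree = ast.parse(messy)
    print(ast.unparse(tree))
    # Prints the same program in one canonical style:
    #   def f(x):
    #       return x + 1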

To put it in a PLT context...

There have been numerous attempts to create languages that are stored as ASTs instead of plain text, for the reason you mentioned. You can do a lot more with tools, show different representations to different users, etc. For some reason, none of them have taken off. I'm not sure if that means that plain text is inherently superior, or if other factors were involved.

Unfortunately, I can't think of any of these off the top of my head, or I'd at least throw out some links.

Intentional programming

It seems like the issue would be separable: don't pretty much all programming languages end up as an AST in the end?

I'm assuming the intentional programming style of development and tools rely on knowing all about the AST, while letting folks use regular ASCII as well. IDEs have sorta already gotten there, what with all the refactoring tools available in Eclipse or IntelliJ.

ASCII vs. AST

I'm assuming the intentional programming style of development and tools rely on knowing all about the AST, while letting folks use regular ASCII as well.

If I recall correctly, the original conception of intentional programming relied on graphical AST representations. This was quickly found to have horrible usability.

IDEs have sorta already gotten there, what with all the refactoring tools available in Eclipse or IntelliJ.

There are still some gaps, mostly in that extending those IDEs with either new DSLs or new refactoring/analysis tools is a decidedly expert undertaking. That said, it does seem like most of the value promised for intentional programming has been achieved by modern IDEs.

Re: ASCII vs. AST

If I recall correctly, the original conception of intentional programming relied on graphical AST representations. This was quickly found to have horrible usability.

Do you have any further information on these failures from the past?

I'd be interested...

... in pointers to additional info on some failures too. I'll start digging on my own, of course, but it would be great to shorten the process.

yes

Benjamin Pierce showed that diff3 doesn't satisfy properties you'd expect ("A Formal Investigation of Diff3") and has some interesting work that, in one sense, circles the area.

Miryung Kim has done some fun work centered on detecting code duplication and has moved on to study further structured changes.

More generally, Dawson Engler, as part of the Coverity project, did some statistical analysis to find basic library-usage invariants ("in probability..") and, implicitly, invariant violations - and thus extra structure to look for.

Furthermore, given that much of the development cycle is spent on maintenance/extension, verification should often be of changes.

I guess what I'm getting at here is that I view it as an SE issue that can bleed into a PL one - albeit one I find interesting.

Cost-benefit

So with that, my question is really whether better merge tools are viewed as a problem worth solving for languages, or whether the payoff is not considered worth the effort?

Yup, that's it exactly. It's a problem that's not quite interesting enough to be the subject of much academic research, and not quite marketable enough to be worth commercial or open-source IDE or VCS vendors' development efforts. It's one of probably thousands of smallish, vaguely annoying problems that will end up being solved once developer resources get cheap enough, but not until then.

Eclipse goes part of the way

Eclipse goes part of the way there, in that its diff viewer will show you a tree of which classes and methods contained changes. But that still works at the ASCII text level, as far as I'm aware.

There are also a number of XML diff utilities which work on the DOM tree of the XML document. For example, IBM's effort.

An AST-based diff would be great for avoiding silly problems like files being reported as changed when they've simply been reformatted. The lesson from XML seems to be that these sorts of tools will start to appear if obtaining an AST for a file is easy enough.
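For a language where a parser is that easy to get, even a crude version of this is only a few lines. A hedged sketch for Python source, using the standard ast module (with one caveat: comments vanish from the AST, so a comment-only edit would also be reported as "no change"):

    # Sketch: report whether two versions of a Python file differ only in
    # formatting, by comparing their ASTs instead of their text.
    import ast

    def same_modulo_formatting(old_src, new_src):
        # ast.dump gives a canonical string for the tree; whitespace,
        # parentheses, and comments do not appear in it at all.
        return ast.dump(ast.parse(old_src)) == ast.dump(ast.parse(new_src))

    print(same_modulo_formatting("x=1+2", "x = (1 + 2)"))   # True: reformat only
    print(same_modulo_formatting("x=1+2", "x = 1 + 3"))     # False: real change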

Lingua Franxmla

All compilers should have XML back-ends. Too bad that doesn't help, since you'd be diffing XML rather than source code. So then we just need to use some XSLT to fix it up to look like source code! Voila! (I kid.)

Why not...

After all, if it's good enough for gcc... :-)

Seriously though, the XML diff reference was meant as an example of tools working on an AST (the W3C's DOM in that case) in the hope that there might be some useful ideas to take from it - not as a suggestion that XML would be better than ASCII for diffing.
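One concrete idea that does carry over: compare canonicalized DOM trees rather than raw bytes, so formatting noise drops out of the comparison. A small sketch with Python's standard library (ET.canonicalize needs Python 3.8+):

    # Sketch: compare XML documents as trees rather than bytes, so that
    # attribute order and whitespace inside tags don't register as changes.
    import xml.etree.ElementTree as ET

    a = '<config debug="true" level="2"><item/></config>'
    b = '<config level="2" debug="true" ><item/></config>'

    # C14N canonicalization sorts attributes and normalizes tag whitespace.
    print(ET.canonicalize(a) == ET.canonicalize(b))   # True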

Agreed

Discussion forums are hard - I wasn't dogging, and didn't mean to sound like I was dogging, anything about the XML diff (and I have had need for good XML diffs quite often). And I could sorta like the idea of having XML as a core representation with conversion to and from the relevant language ASTs - although not because I think XML is anything other than a standard format; no super powers or silver-bullet-ness.

How much knowledge of a

How much knowledge of a language is necessary in order to get a less cumbersome diff? For languages like Java/C/C++, what if you taught diff what a block is, so that it could treat the contents of blocks relatively normally (as ASCII text)? It's not perfect, but would that bare minimum of knowledge be useful?
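To make the idea concrete, here's a toy experiment: if you lex just far enough to treat each comment as a single atomic unit, a generic sequence differ can no longer split a comment down the middle. A sketch for C-style block comments using Python's difflib (the regex is a simplification - a real lexer would also have to handle strings and // comments):

    # Toy block-aware diff: treat each /* ... */ comment as one atomic
    # chunk, then hand the chunk lists to an ordinary sequence differ.
    import difflib
    import re

    def chunks(src):
        # Split into block comments and runs of ordinary lines.
        parts = re.split(r"(/\*.*?\*/)", src, flags=re.DOTALL)
        out = []
        for p in parts:
            if p.startswith("/*"):
                out.append(p)                    # whole comment = one chunk
            else:
                out.extend(p.splitlines(keepends=True))
        return [c for c in out if c.strip()]

    old = "/* helper:\n   validates input */\nint validate(int x);\n"
    new = "/* helper:\n   normalizes input */\nint normalize(int x);\n" + old

    # Since comments are atomic, an insertion can only land *between*
    # chunks, never inside a comment.
    for op in difflib.SequenceMatcher(None, chunks(old), chunks(new)).get_opcodes():
        print(op)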

Lost information

Once you're diffing two files, you have already lost information. A simple approach might be for the editor to persistently remember every editing action (inserted and deleted characters, cut, paste, etc.) on all undo/redo paths. It's still a challenge to generate a meaningful diff from this information - you might be able to track movement of text with cut and paste, or points where you deleted/created lines.
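To make that concrete, here is a hedged sketch of what such an edit log might look like - the representation is invented for illustration, not any real editor's format:

    # Sketch: an editor-side log of primitive edits. A log-aware diff
    # could replay it and distinguish moved text from delete-plus-rewrite.
    from dataclasses import dataclass

    @dataclass
    class Insert:
        pos: int
        text: str

    @dataclass
    class Delete:
        pos: int
        length: int

    def apply_log(doc, log):
        for op in log:
            if isinstance(op, Insert):
                doc = doc[:op.pos] + op.text + doc[op.pos:]
            else:
                doc = doc[:op.pos] + doc[op.pos + op.length:]
        return doc

    # Cut-and-paste shows up as a Delete and an Insert of the same text,
    # which a tool reading the log could report as a *move*.
    log = [Delete(pos=0, length=6), Insert(pos=10, text="hello ")]
    print(apply_log("hello world and more", log))   # "world and hello more"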