Lambda the Ultimate

Seesoft
started 8/11/2002; 2:03:16 AM - last post 8/14/2002; 6:10:09 AM
jon fernquest - Seesoft
8/11/2002; 2:03:16 AM (reads: 1996, responses: 8)
Seesoft
Subtitled "A Tool for Visualizing Line Oriented Software Statistics." Seesoft provides a colored software map that makes it easy to gestalt significant patterns out of large bodies of source code. Kind of like zooming out from hardcopies of all your source code files laid out next to each other on the floor: "35 files containing 50,000 lines of code can comfortably fit on a standard high resolution (1280x1024) workstation color monitor." Each line of source code has a color that represents a statistic: "The display looks like a miniature picture of the text with the color showing the spatial distribution of the statistic within the text."

The statistics associated with text may be continuous, categorical, or binary. For a line in a computer program, when it was written is a continuous statistic, who wrote it is a categorical statistic, and whether or not the line executed during a regression test is a binary statistic.
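To make the three kinds of statistic concrete, here is a minimal sketch of how each might be turned into a per-line color. It is not Seesoft's actual implementation; the function names, the blue-to-red ramp, and the palette are all assumptions made for illustration.

```python
# Hypothetical color mappings for the three kinds of line statistic.
# Colors are (r, g, b) tuples with components in 0..255.

# Assumed palette for categorical values (e.g. authors).
PALETTE = [(230, 159, 0), (86, 180, 233), (0, 158, 115), (204, 121, 167)]


def continuous_color(value, lo, hi):
    """Continuous statistic (e.g. age of a line): blue at lo, red at hi."""
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    t = min(1.0, max(0.0, t))  # clamp values outside [lo, hi]
    return (round(255 * t), 0, round(255 * (1 - t)))


def categorical_color(value, categories):
    """Categorical statistic (e.g. who wrote the line): one palette entry each.

    Colors repeat if there are more categories than palette entries.
    """
    return PALETTE[categories.index(value) % len(PALETTE)]


def binary_color(flag):
    """Binary statistic (e.g. executed in a regression test): green or grey."""
    return (0, 160, 0) if flag else (200, 200, 200)
```

A display along the lines the paper describes would then draw one thin row of pixels per source line, filled with the color returned for that line's statistic.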

Seesoft is an instance of Exploratory Data Analysis, an approach to statistics devoted to identifying patterns in data. There is also a shorter paper (the source of the quotes above), a short online summary, and a description in an online chapter on GUIs for information retrieval. TileBars are a simpler technique that uses a grey-scale to represent statistics.

Posted to Software-Eng by jon fernquest on 8/11/02; 2:14:49 AM

Ehud Lamm - Re: Seesoft
8/12/2002; 12:50:48 AM (reads: 1158, responses: 0)
One of the keynotes at the recent Ada-Europe conference was about something like this (I think the title was Visualizing System Evolution).

I must say that I came out unconvinced about the benefits of this approach.

It's not that I am against visualization as such. In fact, I find that visualization can be helpful in a variety of fields. However, I think that analyzing source code is difficult, and finding the correct resolution is tricky. For example, I don't really think that program lines (is that statements, btw?) are the meaningful level to work on for most programs. Maybe modules are more appropriate.

Another problem is that changes are related to each other, something that is close to impossible to see with these tools. Version 2 may contain 100 source changes, of which 96 may be related: together they are really one change. Statistical analysis (clustering) may help, but doing it well is a hard problem.

The most important objection, however, is that seeing properties of code (esp. as regards code changes) usually tells you very little about a system. Is a change the result of debugging (and is, perhaps, an ugly workaround), the addition of a new feature, a refactoring? These are the things you want to know, and I don't see how a tool that simply does program analysis can help you. Perhaps connecting these tools with a CASE system can help (Fear and loathing time)

Software archeology is essential in today's world, where you often have to work on huge software systems you hardly know, much less understand. I think industry courses should deal with this, by the way (I even consult about it at times). I have yet to see a convincing tool.

jon fernquest - Re: Seesoft
8/12/2002; 3:58:35 AM (reads: 1145, responses: 1)
> I don't really think that program lines (is that statements, btw?)
> is the meaningful level to work on for most programs. Maybe
> modules are more appropriate.

The system studied was 15 years old and written in C. Programming language feature sets change so rapidly that you would definitely be shooting at a moving target. Programs always have Abstract Syntax Trees though, so perhaps a generalized AST query, like the one the JavaML author uses, would be useful.

I originally bumped into Seesoft used with natural language texts, where the density of word usage from a given set can indicate topic relevance for passages within a text, basically using it as a *within-text search engine*.
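A rough sketch of what that density statistic could look like, assuming fixed-size passages counted in words. The function name, the tokenizing regex, and the default window size are my own choices for illustration, not anything taken from the information retrieval papers.

```python
import re


def passage_density(text, terms, window=50):
    """Split text into passages of `window` words and return, for each
    passage, the fraction of its words that belong to `terms`."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    wanted = {t.lower() for t in terms}
    densities = []
    for start in range(0, len(words), window):
        chunk = words[start:start + window]  # last chunk may be shorter
        hits = sum(1 for w in chunk if w in wanted)
        densities.append(hits / len(chunk))
    return densities
```

Each returned value is a continuous statistic for one passage, so it could be shaded the same way a Seesoft display shades lines of code.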

This technology makes more sense for natural language because it is difficult to parse, whereas it is no problem to get a syntax tree for a programming language and then use it as the basis for more informative statistics.

Maybe this article should be filed under history. I'm looking for refinements of the idea that use information in the abstract syntax trees of software systems. Anyone know of such refinements?

Ehud Lamm - Re: Seesoft
8/12/2002; 12:35:42 PM (reads: 1182, responses: 0)
I have a feeling NL techniques don't translate well to PL (the funny thing is many linguists say the same thing about using PL techniques studying NL).

Thing is, PLs are not as complicated as natural languages (and are more precise to boot), while the artifacts we are interested in studying (i.e., huge software systems) are more complicated than the natural language texts that are susceptible to mechanical analysis.

jon fernquest - Re: Seesoft
8/13/2002; 1:20:37 AM (reads: 1116, responses: 1)
> I have a feeling NL techniques don't translate well to PL

With Seesoft it was the other way around, namely *Seesoft was a PL technique translated to NL*. Seesoft was developed by AT&T research for managing large code bases; the information retrieval (NL) people saw it and then adopted it.

Since the version control system that Seesoft used data from was line-oriented, Seesoft was line-oriented. I think the value of Seesoft, however, lies in the very idea of zooming out and getting a big picture of the source code and then zooming in to get the specifics.

The concern graphs paper had an interesting application of the Seesoft idea to mapping concerns across the code base of a large software system: Aspect Browser.

> (the funny thing is many linguists say the same thing about using PL techniques studying NL)

Pereira and Shieber (Prolog) sure didn't. There used to be a copy of an encyclopedia article they wrote on Shieber's Harvard class site that argued the relevance of PL semantics for NL.

jon fernquest - Re: Seesoft
8/13/2002; 1:58:46 AM (reads: 1101, responses: 0)
A good example of programming language semantics contributing to natural language (computational linguistics) is Grammatical Framework. A good overview of the project brings out its roots in industry (Xerox and Nokia).

A Finnish friend of mine even says the authors of the system studied Martin-Löf type theory at the University of Helsinki philosophy department. So we have a strange fusion indeed: computer science (Haskell), philosophy (logic), industrial applications (multi-lingual editor).

Ehud Lamm - Re: Seesoft
8/13/2002; 4:11:18 AM (reads: 1150, responses: 0)
Zooming out is, of course, useful. But this is textual zooming. My feeling is that this is "zooming" of the wrong map. Analogy: zooming a National Geographic photo of a lion won't tell you much about its genetic code.

> many linguists say the same thing about using PL techniques studying NL

Many, certainly not all.

jon fernquest - Re: Seesoft
8/14/2002; 3:57:32 AM (reads: 1119, responses: 1)
> Zooming out is, of course, useful. But this is textual zooming.
> My feeling is that this is "zooming" of the wrong map.
> Analogy: zooming a National Geographic photo of a lion won't
> tell you much about its genetic code.

Agreed that lines in either PLs or NLs have no intrinsic meaning. A sentence, word, or constituent phrase in NL, and an expression, statement, function, or module in PL, do. Coloring in the map obviously has to be keyed to meaningful structure and you need an AST to get this info. (Something like JavaML would probably do the trick.)

The Aspect Browser referenced above uses regular expressions and naming conventions to find cross-cutting concerns. Naming conventions usually supply info that you could get automatically from an AST (for example, that a name is an "integer variable" in a given function of a given module).

Concern graphs *must utilize* ASTs in their construction. (Implementation details were not given in the paper.)

Do version control systems utilize AST info nowadays? This would have made the statistics in the Seesoft paper more accurate (but significantly more accurate? In C a line corresponds pretty closely to a statement, but it certainly doesn't in Lisp, the target of the Aspects paper above).

Programmers traditionally use grep to find cross-cutting concerns in changes that affect several files. Grep is line-oriented regular expression matching. Regular expression matching on ASTs will be possible when tree regular expressions become available for XML. (Tgrep is a tree grep used on the UPenn treebank NL corpus.)
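To illustrate the difference between line-oriented and tree-oriented matching, here is a small sketch. It uses Python's standard `ast` module on a made-up Python snippet, which is an assumption of convenience: it is not the C, Lisp, or XML/JavaML setting discussed in this thread, and the helper names `grep_lines` and `calls_to` are hypothetical.

```python
import ast
import re

# Made-up source: one real call to log(), one keyword argument named log,
# and one commented-out call.
SOURCE = '''\
def save(record):
    log("saving")
    write(
        record,
        log=False,
    )
    # log("disabled")
'''


def grep_lines(source, pattern):
    """Line-oriented match: 1-based numbers of lines whose text matches."""
    rx = re.compile(pattern)
    return [n for n, line in enumerate(source.splitlines(), 1) if rx.search(line)]


def calls_to(source, name):
    """Tree-oriented match: line numbers of actual calls to function `name`."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id == name
    ]


print(grep_lines(SOURCE, r"log"))  # [2, 5, 7]
print(calls_to(SOURCE, "log"))     # [2]
```

The line-oriented search also reports the keyword argument and the comment, while the tree walk reports only the real call, which is the kind of structural information a line-based statistic cannot see.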

Ehud Lamm - Re: Seesoft
8/14/2002; 6:10:09 AM (reads: 1160, responses: 0)
> Programmers traditionally use grep

Yes. And bugs are often introduced during this type of maintenance.

Notice, however, that there's a big difference between identifying suspect files and then manually checking them, and producing statistics that are then open to misinterpretation.