Project Sikuli

Picture- or screenshot-driven programming from MIT.

From the Sikuli project page:

Sikuli is a visual technology to search and automate graphical user interfaces (GUI) using images (screenshots). The first release of Sikuli contains Sikuli Script, a visual scripting API for Jython, and Sikuli IDE, an integrated development environment for writing visual scripts with screenshots easily.

Hmm...

As a method to quickly write test code for GUIs, the system seems apropos. Although the idea isn't new, this may be the best implementation, for all I know.

However, I'd hate to see this expand to script generation for the purpose of performing user-oriented tasks. The user interface, although shinier than ever, should really not be the target of non-user manipulation. Programs have not made any strides (save for Accessibility APIs) toward becoming more semantically oriented or self-descriptive, and this would take things in the wrong direction. Until that happens, I believe that true dynamic compositionality, multiple/dynamically created GUIs, handling of other modalities, creating intelligent synergy between applications, etc., will not be possible, or at least will require unnecessary hand-jamming.

The last thing we need is another HyperCard-like phenomenon, this time built just by clicking around on existing apps - I can see the TODO-manager that's built by clicking around Excel for 10 minutes now :)
Luckily, the concept runs out of steam (as most visual aids do) when it comes to performing anything above modest complexity - after which it becomes worth learning how to program to the API.

But it is "nifty"...

As a method to quickly write

As a method to quickly write test code for GUIs, the system seems apropos.

Until that happens, I believe that true dynamic compositionality, multiple/dynamically created GUIs, handling of other modalities, creating intelligent synergy between applications, etc., will not be possible, or at least will require unnecessary hand-jamming.

There are roughly two sorts of compositionality we might care about: 1) end-user composable, and 2) just-in-time/model production composable (for software product lines). I don't see a good way to do 1) and if we do 2) then we probably do not need to depend upon an "agent" such as the one proposed here; instead we can use a Robots API / Automation Peer API and code synthesis tools to automatically explore paths through the system. I've read various books on unit testing UIs, such as Swing Extreme Testing, and also academic papers, and they usually contain severe unnecessary assumptions that increase the combinatorial search space and also decrease the practical suitability of probabilistic approaches to testing software, such as Orthogonal Array Testing Strategy (OATS). In software product lines, the basic approach is to use relations to define "feature profiles", and thus prune your automatic test generation search space using the bounded combinatorics methodology. [Edit: Concolic testing has already been empirically shown to not live up to the hype, as well, despite being better than the alternatives, and any methodology that does not directly address eliminating paths through the system will decrease the effectiveness of whatever testing approach you take.]
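
As a toy illustration of that pruning idea (my own sketch, not taken from any of the tools mentioned, with invented feature names and constraints): enumerate every combination of feature settings, then keep only the configurations that a feature profile's relations allow.

from itertools import product

# invented feature model: three features, each with a few settings
features = {
    "locale": ["en", "de", "ja"],
    "theme": ["light", "dark"],
    "plugin": ["none", "spellcheck", "ocr"],
}

def profile_allows(config):
    # an invented "feature profile" relation: the OCR plugin only ships for English
    return not (config["plugin"] == "ocr" and config["locale"] != "en")

names = sorted(features)
all_configs = [dict(zip(names, vals)) for vals in product(*(features[n] for n in names))]
test_configs = [c for c in all_configs if profile_allows(c)]

print(len(all_configs), "raw combinations pruned to", len(test_configs))

The point is only that the relations, not the raw cross-product, define the test-generation search space.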

Love the use of the phrase "hand-jamming", by the way :) I'll have to remember that one...

Seems too clever / too complicated

My biggest gripe with newfangled ideas from MIT is that places like MIT reward how clever you are, not how practical you are. My belief is that this is a direct result of how brilliant everyone there truly is -- a million times smarter than me.

I don't know. To me, it seems like:
a) a throwback to screen scraping, with an extra level of indirection; this is the antithesis of the Web movement and thus puts us back in the '70s, and
b) it lacks any strategy for "lifting" out the underlying service hydrating the UI.

I've written a simple, pragmatic prototype for WPF and Silverlight that does b) and doesn't depend on a), but it is dependent on a SearchMonkey-like data interchange format. I believe this is not at all clever, but simply the right approach.

No disrespect to the researchers... this is highly clever... just not that practical (in my humble opinion only).

Edit:

One other comment

A computer user hoping to learn how to use an obscure feature of a computer program could use a screen shot of a GUI — say, the button that depicts a lasso in Adobe Photoshop — to search for related content on the web. In an experiment that allowed people to use the system over the web, the researchers found that the visual approach cut in half the time it took for users to find useful content.

This is only half the battle, though. Nirvana is being able to automatically produce screenshots of your application, in a particular state, for a particular user. If you take the stance that a GUI is fundamentally a dynamically federated, dynamically distributed system... then assuming a given user will see the same screenshot is a mistake, and may even present an information secrecy problem.
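
For the purely mechanical half of that -- grabbing the pixels once the application has somehow been driven into the desired state -- a few lines of Jython (the language Sikuli scripts run on) against java.awt.Robot suffice; this sketch is my own and deliberately ignores the hard per-user, per-state problem:

from java.awt import Robot, Rectangle, Toolkit
from javax.imageio import ImageIO
from java.io import File

# capture the whole screen and save it; driving the app into a particular
# state for a particular user first is the part this does not solve
size = Toolkit.getDefaultToolkit().getScreenSize()
shot = Robot().createScreenCapture(Rectangle(0, 0, size.width, size.height))
ImageIO.write(shot, "png", File("state-snapshot.png"))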

Again, I am not meaning to disrespect Rob Miller, who always has great and fascinating ideas... this one just overlooks established thought on what to do and how to use hypermedia.

Drag-Drop + skepticism

The Sikuli-Script idea for Drag Drop is actually a pretty cool hack :)


# find the drop target once, then drag every match of Picture_2 onto it
y = find(Picture_1)[0]
for x in find(Picture_2):
    dragDrop(x, y)


def dragLeft(t):
    # drop the match 200 pixels to the left of its current position
    dragDrop(t, [t.x - 200, t.y])

click(Picture_3)
click(Picture_4)
for t in find(Picture_5):
    dragLeft(t)

By the way, ignoring questions of scaling this approach (posed by Michael Robin and myself), Sikuli states in one of their videos that it lets you do anything you can do visually... I am curious about a practical challenge in most screenscrapers, such as the Boston WorkStation scripting software used to screenscrape Meditech screens. With traditional screenscrapers, you cannot monitor what the "final" value of something is. For example, if a value is only known after a series of constraints have been solved, and those constraints continuously change, how do you monitor that value and then "finalize" it when the user clicks, say, "Done" or "Next"? For example: looking up patient information and then, context-sensitively, loading a web browser with a REST-based GET request for some additional information not provided by the closed-source, proprietary, non-JIT/MP-composable Meditech screens.

Thus, I think the ability to do "anything" is overpromised. In order for Sikuli to actually do "anything" it would need to be able to interactively debug the application it is monitoring. This would allow it to trap into debug mode on the "Done" click, and then capture the "final" value. Sikuli thus needs "capture" and "seize" primitive commands to go beyond normal screenscraping.
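
To make the gap concrete, here is a sketch (mine, not Sikuli's) of what such a capture might look like in Sikuli-style script, assuming wait()/waitVanish()-style calls are available; Done_Button and Value_Field stand for illustrative screenshots, and readValue() is exactly the kind of "capture" primitive that is missing:

def readValue(where):
    # hypothetical "capture" primitive: would need OCR or an interactive
    # debug hook into the monitored app; image matching alone cannot do it
    raise NotImplementedError("no such primitive in plain screenscraping")

wait(Done_Button)          # the dialog holding the constrained value is up
waitVanish(Done_Button)    # the user has clicked Done, so the value is final
final_value = readValue(Value_Field)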

Edit: Also, how do you support zoomable UIs?

Interesting

This is pretty cool... an interesting evolution of scripting. Besides the huge security holes, of course... (I'm assuming that if you can display a graphic on a user's screen, you can change how Sikuli scripts would execute.)

Confusing Sikuli

(I'm assuming that if you can display a graphic on a user's screen, you can change how Sikuli scripts would execute.)

Apparently so. Notice in the video that the Sikuli script window hides itself when run - presumably so that the screenshots in the script won't confuse the running script.