Speech-to-text friendly programming languages

I have a few friends who have severe repetitive stress injury and can effectively no longer type for long periods of time.

I'm trying to think of an environment and a language that would be suited to speech-to-text input. My first thought on the base language is Standard ML, since definitions are self-contained and require no "punctuation", e.g.

let x=5
let y=7

is valid and complete without double-semicolons at the end. Thus, you could say "let x be five, let y be seven" and produce the above code without too much interpolation.
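As a rough illustration of that mapping, here is a minimal sketch in Python. The phrase pattern "let NAME be NUMBER", the tiny number vocabulary, and the function name `transcribe` are all assumptions made up for this example; a real system would sit behind a speech recogniser and cover far more of the language.

```python
import re

# Hypothetical number vocabulary; a real system would cover far more.
NUMBER_WORDS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
}

# Assumed spoken form: "let <name> be <number word>".
PHRASE = re.compile(r"let (\w+) be (\w+)")


def transcribe(utterance):
    """Turn comma-separated spoken definitions into one declaration per line."""
    lines = []
    for clause in utterance.lower().split(","):
        clause = clause.strip()
        match = PHRASE.fullmatch(clause)
        if match is None:
            raise ValueError(f"unrecognised clause: {clause!r}")
        name, word = match.groups()
        if word not in NUMBER_WORDS:
            raise ValueError(f"unknown number word: {word!r}")
        lines.append(f"let {name}={NUMBER_WORDS[word]}")
    return lines


if __name__ == "__main__":
    print("\n".join(transcribe("let x be five, let y be seven")))
```

Each spoken clause becomes exactly one declaration, so no spoken punctuation is needed beyond the pause between clauses.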

That said, there would have to be a grammar that translated a precise spoken form into ML. To be really effective, the generated ML would have to be constantly reparsed and kept in a symbol-table state, so that the speech-processing program could use the inherent structure of the underlying language to disambiguate slurred or otherwise ambiguous speech. Another good reason to use Standard ML is that parsed ML contains more information than many other languages, because type safety restricts how variables and functions can be used -- more disambiguation possible.
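To make the symbol-table idea concrete, here is a small Python sketch. It assumes the recogniser hands over a rough spelling of what it heard, and it uses plain string similarity (`difflib` from the standard library) as a stand-in for real acoustic similarity. The symbol table contents, the type names, and the function `resolve` are hypothetical.

```python
import difflib


def resolve(heard, symbols, expected_type=None):
    """Pick the in-scope identifier closest to what the recogniser heard.

    `symbols` maps identifier names to type names. When the surrounding
    code fixes the expected type, only identifiers of that type are
    considered, which shrinks the set of possible confusions.
    """
    candidates = [
        name for name, type_name in symbols.items()
        if expected_type is None or type_name == expected_type
    ]
    matches = difflib.get_close_matches(heard, candidates, n=1, cutoff=0.6)
    return matches[0] if matches else None


if __name__ == "__main__":
    # Hypothetical symbol table built from the code parsed so far.
    table = {"count": "int", "counter": "string", "total": "int"}
    print(resolve("countr", table))         # no type information
    print(resolve("countr", table, "int"))  # type context narrows the choice
```

With no type information the slurred "countr" resolves to `counter`, the closest spelling. Once the context demands an `int`, the same input resolves to `count`, which is the kind of extra disambiguation a typed language could offer.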

Does anyone have any thoughts on other requirements for such a beast or pointers to research that has already been done that I might not find through an old-fashioned ACM/Citeseer/DBLP search?

Typing Injury FAQ

They have a lot of links, including using speech for programming.
(look for Emacs)

Interesting, thanks. Not sur

Interesting, thanks. Not sure emacs is the way I want to go, although this will definitely help me think about it more.

Spoken Programming Language Design

I have the same problem: RSI shut down my left wrist for six months, and I had to learn the right-hand-only Dvorak keyboard layout. Since I've gotten back to two-handed typing, I've been thinking about a good design for a usable spoken programming language. My first thought was to encode lambda calculus in a spoken form. That's an easy design: just write code to recognize three sounds.

But that's not really general purpose, so I wonder whether a better design is to find a nice round binary number of sounds that I can easily recognize with simple code and use that for the spoken 'charset'.

Derek Elkins suggested a concatenative language for spoken programming, I think that's a great use for a language like Joy.

Does anyone here have further ideas or can point out possible pitfalls for such a language?

-- Shae Erisson - www.ScannedInAvian.com

Are we talking just entering the source?

just write code to recognize three sounds

I thought the main challenge was editing, refactoring, or otherwise transforming and navigating the code by voice, not just entering it. Is this a completely wrong assumption?

Well if we're not...

Again, concatenative languages seem ideal for this. For example, to navigate Joy you'd pretty much only need "forward", "back", "in", "out". There are higher level navigation things you'd want, but those are less language or syntax specific, e.g. "goto definition of foo". Manipulating the source is equally simple, you simply insert, overwrite, or delete a Joy object. Further, since the syntax is so regular[1] it's intuitive and unambiguous what say "delete 4" would mean.
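Here is a rough sketch of how little machinery those four commands need, written in Python with a Joy program modelled as nested lists (quotations become inner lists). The `Cursor` class and its method names are invented for this illustration and are not taken from any Joy implementation.

```python
class Cursor:
    """A cursor into a nested-list program, moved by four spoken commands."""

    def __init__(self, program):
        self.program = program
        self.path = [0]  # indices from the root down to the current item

    def _parent(self):
        node = self.program
        for index in self.path[:-1]:
            node = node[index]
        return node

    def current(self):
        return self._parent()[self.path[-1]]

    def forward(self):
        if self.path[-1] + 1 < len(self._parent()):
            self.path[-1] += 1

    def back(self):
        if self.path[-1] > 0:
            self.path[-1] -= 1

    def into(self):  # spoken "in"; `in` is a reserved word in Python
        if isinstance(self.current(), list) and self.current():
            self.path.append(0)

    def out(self):
        if len(self.path) > 1:
            self.path.pop()

    def run(self, words):
        """Apply a sequence of spoken navigation words."""
        moves = {"forward": self.forward, "back": self.back,
                 "in": self.into, "out": self.out}
        for word in words:
            moves[word]()


if __name__ == "__main__":
    # The Joy program  [1 2 3] [dup *] map  as nested lists.
    cursor = Cursor([[1, 2, 3], ["dup", "*"], "map"])
    cursor.run("forward in forward".split())
    print(cursor.current())
```

Saying "forward, in, forward" lands on `*` inside the second quotation. Commands that cannot apply (such as "back" at the first item) are simply ignored in this sketch.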

To rile up a whole 'nother hornet's nest, I wonder if (but haven't looked at all yet) there have been any studies about moded v. modeless spoken editing, especially with regard to programming. voi anyone?

[1] One point I'd like to make is that Joy's abstract syntax is regular, whereas in Lisp only the concrete syntax is regular.

Step on giants' feet?

Not to veer too far off-topic like the "Chinese natual [sic] language" thread, but you might want to start with a larger language, and then add programming facilities. One of the goals of lojban was to make automated speech recognition somewhat easier. (E.g., word boundaries are clearly defined based on emphasis patterns; words fall into a fairly small set of patterns.) And its stated goals of reducing ambiguity (compared with natural languages) are certainly in the direction of PLs.

Spoken Language Support for Software Development


Sorry that I am not sure how to format this better:



Software development environments have not changed very much in the past thirty years. While developers discuss software artifacts with one another in terms of high-level conceptual notions, their environments force them to use low-level text editors and program representations designed for compiler input. This shift in level is error-prone and inefficient; in addition, the environments create frustrating barriers for the growing numbers of software developers that suffer from repetitive strain injuries and other related disabilities that make typing difficult or impossible. Our research helps to lower those barriers by enabling developers to work at a more conceptual level and by reducing their dependence on typing and text.

The specific technical issues to be addressed in this research are driven by two approaches: multi-modal (notably speech) interaction, and semantic and structural search, navigation, and transformation. The technology we are creating is not limited to programming languages; it extends to other specification, design, and command languages that are used by developers and that can be formally defined. Our research will be embedded in the Harmonia framework, also being developed at UC Berkeley. The first prototype language for which the linguistically-based methods will be created is Java.

Our research will create the first version of a form of Java that is more naturally verbalized by human developers than the standard Java language. Methods will be created to translate this form to the same annotated abstract syntax representation used by conventional text-based tools. The major technical challenge is to resolve the ambiguities that the new form allows. That ambiguity resolution requires new algorithms for interacting lexical, syntactic, semantic, and program-specific analysis. New methods of accommodating lexical, syntactic, and semantic errors and inconsistencies will be created, in order to sustain language-based services when the artifacts are incomplete and incorrectly formed.

This is excellent stuff!

Alright, this is more like what I was looking for in the first place. I see that there are releases of Harmonia out there -- how mature is the project? What's left to be done here? I see that there are only a very few languages supported and even fewer languages supported completely, but that's not so important as having a speech input mode that correctly handles the languages it does support, of course.

I do have a lot of the background knowledge needed to work on this -- e.g. machine learning, natural language processing, linguistics, etcetera. Are there any parts of this project where help is needed right now, or are they pretty well staffed? Being a full time PhD student with wife and kids and a side job, the main thing I lack is time, but if there were some way I could work on this in a meaningful way, I think I could devote some.

speech to text

Interesting idea. I'm assuming that you are using a speaker-dependent engine like Speechworks/Scansoft's Naturally Speaking or IBM ViaVoice, and not a speaker-independent tech like MS SALT and underpinnings...

If you do the speaker independent then the best advice I can give you is to really think through the spoken grammar and test it and test it and test it and test it. For some ideas - check out how Australians bet on horses.

If you do the speaker dependent thing - I assume you'll be spending most of your time on error correcting, or you'll give up and just let Naturally Speaking/Via Voice be the generic interface.

Do remember speech is probabilistic. There's a 5 to 10% chance that any utterance will be wrong... so editing becomes a real issue...and you won't be able to bring error rate down to less than 2%, which means you have to do some serious testing.
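To put rough numbers on that, here is a small Python sketch. It assumes each utterance fails independently with the same probability, which is a simplification that is not claimed above, and the function names are made up for the example.

```python
def all_correct(p_error, n_utterances):
    """Probability that every one of n utterances is recognised correctly,
    assuming independent errors with the same per-utterance rate."""
    return (1.0 - p_error) ** n_utterances


def expected_errors(p_error, n_utterances):
    """Expected number of misrecognised utterances out of n."""
    return p_error * n_utterances


if __name__ == "__main__":
    for rate in (0.10, 0.05, 0.02):
        print(f"error rate {rate:.0%}: "
              f"P(20 utterances all correct) = {all_correct(rate, 20):.2f}, "
              f"expected errors = {expected_errors(rate, 20):.1f}")
```

Under that assumption, a 20-utterance function comes through untouched only about 36% of the time at a 5% error rate, and about 67% of the time even at 2%, so the correction workflow matters at least as much as the entry grammar.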

Stay away from any speech words that get easily confused - especially words that start with s, or z (you'll see why if you look at the energy of s on a meter). And finally, KEEP YOUR COMMANDS TWO SYLLABLES, not one - one syllable words reco horribly.

Have I made my point about testing? Start testing now, and test often - you may find out what you're worried about is the wrong thing.

Good luck! Post back if you have success!

Ashok Khosla, CTO & CoFounder, TuVox Inc.

Not speech, but simple ways to edit a program

I think that speech recognition---although a tough and interesting problem, and cool in SciFi---is the wrong way to go. Instead, we should make it `easier' to edit programs. One way of thinking about this is that we should try to minimise the bandwidth required to program a computer.

For example, if Stephen Hawking had to type each letter in each word he wanted to use, he would have real problems communicating. Instead, he uses a text-to-speech synthesis program that has an optimised interface that requires few user inputs. The result is that he can communicate almost in real-time.

Of course, I am not advocating the replacement of the keyboard by some sort of weird device. Instead, I think we should make our interactions with our programming environments more efficient. If we can make sophisticated changes to programs quickly and easily, we should be able to be more productive and make fewer errors. E.g. I tend to slightly lose track of what I am doing when compiling code in a language such as Java, because I have to wait for the compile to finish and my mind tends to wander. I think a similar argument could be made when many keys need to be pressed to make a significant alteration. We should also be able to accommodate those who suffer from RSI or are otherwise physically less able, because less bandwidth is required from the programmer. This seems to be what IDEs should have been. Instead, they tend to be simply a combination of an editor, makefile generator, debugger and perhaps a WYSIWYG user interface editor.

I remember seeing a demo on the Microsoft Research website of a system where, if I remember correctly, the program's parse tree (or similar) and metadata (e.g. comments) were stored in a database. Users could customise their view onto the program (think CSS), could make quite sweeping changes to the code and they could even automatically translate from one programming language to another (though this was obviously quite limited). The system was similar to that proposed by McConnell in Code Complete.

If anyone has a link to this, could they post it, as I spent about 10 minutes looking, but I couldn't find it.


Not speech related, but I wonder what could be accomplished coupling a Dasher like interface with an advanced, programming language aware editor. (Dasher is a "low-bandwidth" text entry system using only a mouse)

The result would be...

... programming by slalom! Perhaps computers could be rigged with reward dispensing devices for when you've cleared x number of gates: "Well done: have a Pepsi!" :-)

Seriously, though: I was very impressed with Dasher when I first saw it. That's *sort* of what I was getting at. However, I was thinking more along the lines of being able to say "replace all the instances of this programming pattern with this new one that doesn't include that buffer overflow" (for suitable definitions of `pattern', of course), where you could define a pattern simply by marking the start and end of an example of the pattern, or something similar.

I wonder what could be accomp

I wonder what could be accomplished coupling a Dasher like interface with an advanced, programming language aware editor. (Dasher is a "low-bandwidth" text entry system using only a mouse)

Hey, Dasher *could* be an advanced, programming language aware editor. Our language models are just C++ classes, and one could easily be written that groks your language of choice. Programming in Dasher doesn't work so well currently because the native language model is a simple n-gram model (looks at the last n characters, predicts the (n+1)th), and so doesn't understand that there's a grammar involved when you write:

   let x=5
   let y=7

But that doesn't mean it has to be that way. I'd love to see someone hook in a language model that speaks something like yacc(1), so that we can throw any language we have a grammar for at it. (My speculation on why this hasn't happened yet is that coders don't have time to play with Dasher until they already *have* RSI, at which point they're not going to want to write huge chunks of language modelling code for us. ;-)
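For anyone who hasn't met one, here is a toy character n-gram predictor. It is written in Python purely for illustration and is not Dasher's actual model (which, as noted, is a C++ class); the names `train` and `predict` are invented for this sketch.

```python
from collections import Counter, defaultdict


def train(text, n):
    """Count which character follows each n-character context."""
    model = defaultdict(Counter)
    for i in range(len(text) - n):
        model[text[i:i + n]][text[i + n]] += 1
    return model


def predict(model, typed, n):
    """Return candidate next characters for the last n typed characters,
    most frequent first; empty if the context was never seen."""
    counts = model.get(typed[-n:])
    return [char for char, _ in counts.most_common()] if counts else []


if __name__ == "__main__":
    model = train("let x=5\nlet y=7\n", 3)
    print(predict(model, "let", 3))    # contexts seen in training
    print(predict(model, "let z", 3))  # unseen context, nothing to offer
```

After "let" the model has only ever seen a space, so it offers that. After "let z" it offers nothing, because the context "t z" never occurred in training, even though any grammar for the language would know that "=" has to come next. That gap is what a grammar-aware language model would fill.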

- Chris, the Dasher guy.

Would such a system ...

... need to be multi-modal? I imagine that inputting a program fragment and editing a program would require two very different modes of operation (even if both were accomplished using the Dasher interface).

I imagine also that a Dasher-like interface could potentially allow the programmer to match their mental model of the program to the actual program (and vice versa) more effectively, especially if they could interact with the code quickly.

For example, if Stephen Hawki

For example, if Stephen Hawking had to type each letter in each word he wanted to use, he would have real problems communicating. Instead, he uses a text-to-speech synthesis program that has an optimised interface that requires few user inputs. The result is that he can communicate almost in real-time.

Sadly, this is untrue. Professor Hawking's only communicating at about 3 wpm currently, using a single switch that he presses when a cursor is over the letter he's looking for. (Actually, it splits up the alphabet into four lines, and he first activates the line the correct letter is on, and then he chooses the letter itself.) There's basic language modelling going on, but writing a full sentence reply can take him up to about ten minutes, and is obviously stressful -- he tends to prefer communication by being asked questions and just responding with nodding or displeasure.

We're working on two new versions of Dasher we'd love to have him try out (along with our many other users with less than a single continuous muscle of input available that we can tap into):

  • A version using single button presses, where an arrow moves down the screen and you press the button when the arrow is over a point inside the node you're after. (And you can back up by activating the switch when the arrow is at the extreme top or bottom.)
  • A version that uses two buttons, and overlays boxes on the screen in a "menu" -- you use one switch to select the next box, and one to choose the currently highlighted one.

It's hoped that both of these could end up at around 15 wpm; the advantage of the second mode is that it's not time-critical, which is often a hurdle when you have limited motor control.

- Chris, the Dasher guy, hoping that all that is interesting to someone. ;-)

My mistake

I remember seeing Prof. Hawking use his text-to-speech system on TV a few years ago, and I guess I mis-remembered the interface. It seemed as though there was language modeling similar to that used in mobile phones.

Stephen Hawking needs Google Suggest

The title says it all.

data point - ANS Forth

Standard Forth deliberately specified largely unambiguous, American-English-like, pronounceable names for all the Forth words that weren't obviously English words anyway (e.g. : is defined to be spoken as "colon" when used as a Forth word, ! is defined to be spoken as "store" when used as a Forth word) - see the bold entries in the glossary of the (usual-ANSI-workaround-last-draft) spec.

Unfortunately, the standardised spoken forms are a mix of descriptions of the characters and the functions they perform - as my chosen examples illustrate - :'s function is actually "define a word", whereas ! is spoken as the function it performs.

Nonetheless, as Forth also has pretty unambiguous and straightforward syntax, it would probably be relatively easy to make a spoken Forth interpreter (you'd just have to wrap Forth entries in an attention word and a terminator word - e.g. "Computer! One One Plus dot ok" to which the computer could say "two ok") that maps to the existing standard language - though you'd end up shouting "colon!" at your computer an awful lot if you used the current standard spoken forms, which might be embarrassing I guess.
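A minimal sketch of that wrapper, in Python rather than Forth and covering only a handful of words, might look like the following. The attention word, the terminator, the tiny vocabulary and the function name `interpret` are assumptions for the example, and the input is taken to be already-recognised text rather than audio.

```python
# Hypothetical spoken vocabulary; only small numbers, "plus" and "dot".
SPOKEN_NUMBERS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
NUMBER_NAMES = {value: name for name, value in SPOKEN_NUMBERS.items()}


def interpret(utterance, attention="computer", terminator="ok"):
    """Evaluate one spoken Forth entry and return the spoken reply.

    Returns None when the utterance is not wrapped in the attention word
    and the terminator word. Stack underflow is not handled in this sketch.
    """
    words = utterance.lower().replace("!", "").split()
    if len(words) < 2 or words[0] != attention or words[-1] != terminator:
        return None
    stack, reply = [], []
    for word in words[1:-1]:
        if word in SPOKEN_NUMBERS:
            stack.append(SPOKEN_NUMBERS[word])
        elif word == "plus":  # spoken form of Forth's +
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif word == "dot":   # spoken form of Forth's . (pop and print)
            value = stack.pop()
            reply.append(NUMBER_NAMES.get(value, str(value)))
        else:
            raise ValueError(f"unknown word: {word!r}")
    return " ".join(reply + ["ok"])


if __name__ == "__main__":
    print(interpret("Computer! One One Plus dot ok"))
```

Given the example utterance, this replies "two ok"; anything not addressed to the attention word is ignored rather than executed.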

Of course, forth is not a high-visibility language these days, but it's still potentially useful in some embedded systems - it's amazing how much you can do with so little compared to the bloat of a conventional OS.

Er (editing comment) - This obviously fits in with Derek Elkins's stuff about Joy too, seeing as Forth is an ancestral mostly-concatenative language.