Language design: Escaping escapes

Most languages use special characters or special sequences of characters to denote special directives. For instance, in C format strings:

"\n" denotes a new line, and if you wanted a literal backslash followed by n, you'd have to write "\\n".

In VB, since " is used to delimit strings, you'd use "" to mean double quotes. eg. "He said ""Boo""!"

In each case, if you want to show source code in the same language, it gets very painful, since you have to escape the escapes. For instance, to show an example of the <img> tag in HTML, I'd have to write this:

&lt;img src="example" &gt;

In C you'd do this:

printf("For example: printf(\"Hello World\\n\"); prints \"Hello World\"\n");
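
The doubling can be produced mechanically, which makes it easy to see how each layer of quoting multiplies the backslashes. Here is a minimal Python sketch. It assumes that JSON string escaping (the standard json.dumps) is close enough to C's rules for these particular characters, and the variable names are only for illustration:

```python
import json

# The text we want to display: a C statement containing a backslash-n pair.
inner = r'printf("Hello World\n");'

# One layer of quoting: every quote and every backslash gains a backslash.
once = json.dumps(inner)

# A second layer has to escape the escapes again.
twice = json.dumps(once)

print(once)
print(twice)
```

Decoding twice with json.loads gets back the original text, so nothing is lost. The cost is purely in readability.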

Are there any languages or patterns which handle this kind of situation more elegantly?

Examples of patterns:

  1. Escaping HTML escapes is not really necessary because the user can "View source"
  2. Another example is Python, which uses triple quotes """.
  3. ASP, JSP, PHP use a sequence of unusual characters, e.g. <?php ?>

Escapes

Shells and Perl have "here-documents":

print <<END
this is a "here document"
END

Perl also offers "q" and "qq":

q*'non-interpolated' string to the "star"*;
qq winterpolated string to the next "double-u"w;

ruby does too

but ruby is a perl in the end ;)

so it uses 'string', "string" and here-documents, and even character-delimited strings:
%somecharacter blabla somecharacter

Not that I find the usage of % or q/qq reasonable.. why can't this just be named 'string' :)

In Python..

This is the relevant Python doc. You have both raw strings, which avoid escapes but do not tolerate newlines, like


r'this\n is a raw string with a windows path: c:\bof\ping.png'

and triple-quoted strings, which allow anything but unescaped triple-quotes, like this:


'''this
is a\nnice
triple-quoted string'''

which becomes:

this
is a
nice
triple-quoted string

You can also combine the two, having a raw triple-quoted string. String interpolation in Python is done with the % operator, so there is no need for interpolated/non-interpolated quoting distinction.
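
A small sketch of the combined form, a raw triple-quoted string. The contents are made up for illustration:

```python
# A raw triple-quoted string: backslashes stay literal and real newlines are allowed.
text = r'''path: c:\bof\ping.png
pattern: \d+\n (still two characters, a backslash and an n)
quotes: "double" and 'single' need no escaping'''

print(text)
```

One remaining wrinkle: even a raw string cannot end in an odd number of backslashes, because the lexer still treats a backslash before the closing quote specially.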

Python is programmer friendly

Python's three ways of quoting are extremely programmer friendly, because I can usually get around quoting problems by adopting a different set of quotes.

str1 = '<a href="http://test.com">Hyper Link</a>'
str2 = "Tom's Thumb"
str3 = '''
This is how you do a 
triple quotes in python:
  """Triple quote"""
'''

i agree

Python is the only language I've ever used that made working with string literals reasonably painless. I've never understood why more people haven't followed the Python approach. Yes, there are a lot of options, but in this case, they really are all useful. And it's not like it's a hard change to make in a language... only the lexer needs to change, in many cases...

Particularly for languages aspiring to compete in the Python/perl space (some Schemes, for instance), more options for string literals is just essential.

Even his record doesn't match his record.

I've never understood why more people haven't followed the Python approach.

I can think of two reasons.

First, next to integers (and maybe lists), strings are probably the most overused data type. People use strings when they are too lazy or inept to define a more structured representation for something. For example, when you manipulate XML or HTML, you should not be using a string type; you should be using at least a tree type, and preferably a set of tree types which correspond to the Schema in question, so you can read only well-typed documents and write only well-typed documents. It is really appalling to see how many webpage-producing applications don't do this. Adding all sorts of bells and whistles to string literals encourages this sort of misuse.
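
To make the first point concrete, here is a rough Python sketch that uses the standard xml.etree.ElementTree module as the tree type. It is only an untyped tree, not the schema-typed representation argued for above, and the function name user_list is hypothetical:

```python
import xml.etree.ElementTree as ET

def user_list(names):
    """Build a <ul> element with one <li> per name (hypothetical helper)."""
    ul = ET.Element("ul", {"class": "users"})
    for name in names:
        li = ET.SubElement(ul, "li")
        li.text = name  # plain data; no manual escaping anywhere
    return ul

# Escaping happens once, at serialization time, not in the program text.
markup = ET.tostring(user_list(["Tom & Jerry", "<script>"]), encoding="unicode")
print(markup)
```

The program never writes &amp; or &lt; itself, so there are no escapes to escape.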

Second, many people (not necessarily me, though) are of the opinion that program source should not be littered with constants. For example, if you are developing an application, it is better to put message text in an external resource file, or database, or whatever, which helps for example with localization. Here again, adding features to string literals would be sending the wrong, er, message.

Personally, I am not opposed to adding a few more bells and whistles to string literals, but IMO it is generally the poorest languages that spend the most effort on those sorts of trivialities. It's the language-design equivalent of bit-twiddling. Perl is a perfect example.

Mild disagreement on Perl...

While I don't like Perl very much, I think you are being somewhat unfair to its designers. Their focus on string manipulation really has led them in the direction of higher-level abstractions -- for example, Perl 6's string manipulation facilities are basically a nice syntax for parser combinators. This seems like fundamentally the right way to go to me, and they should be praised for adopting it.

I also think that there's some fairly interesting research questions about what the best way to print data is! All I know of in this regard is Olivier Danvy's paper on functional unparsing and Hughes's famous paper on pretty-printing, and I don't think that's the end of the story. A format string can be seen as sugar for these kinds of APIs, and as long as the language designers are thinking about this sort of issue when they are designing their literal syntax I don't think it's bit-twiddling.
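
As a rough illustration of a format being built from combinators instead of being written as a string, here is an untyped Python sketch. It is not Danvy's construction (which is typed and continuation-based), and the names lit, field, seq and render are all invented for this example:

```python
def lit(text):
    """A format piece that emits fixed text and consumes no arguments."""
    return lambda args: (text, args)

def field(convert):
    """A format piece that consumes one argument and converts it to text."""
    def run(args):
        if not args:
            raise ValueError("too few arguments")
        return convert(args[0]), args[1:]
    return run

def seq(*parts):
    """Run format pieces left to right, threading the remaining arguments."""
    def run(args):
        out = []
        for part in parts:
            piece, args = part(args)
            out.append(piece)
        return "".join(out), args
    return run

def render(fmt, *args):
    text, rest = fmt(tuple(args))
    if rest:
        raise ValueError("too many arguments")
    return text

greeting = seq(lit("Hello "), field(str), lit(", you are "),
               field(lambda n: f"{n:d}"), lit("% done\n"))
print(render(greeting, "Ann", 42), end="")
```

Because literal text and holes are separate values, the literal "%" needs no format-level escape of the kind printf requires with "%%".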

Take my wife, please!

Perl 6's string manipulation facilities are basically a nice syntax for parser combinators

You are far too generous. If what you say were true and Perl's designers actually had that much insight into what they were doing, then 1) those facilities would not be limited to strings, which are after all merely sequences of characters, 2) the facilities could handle other sorts of monads, and 3) they would be described via a source-to-source translation, like derived forms in Scheme or do-syntax in Haskell.

I've looked at the Perl apocalypses and they are hardly more than a scatter-brained enumeration of trivia and faux pas; it reads like a blind man's account of constructing a puzzle. It doesn't solve problems by solving problems; it's a misadventure in stochastic symbol-pushing. Perl 6 is just another iteration in that language's interminable and drunken meandering. Occasionally it stumbles across a good idea, in the same way that a construction worker might unearth a fossil when he drives his pickaxe through it. According to you, I guess "we can regard him as" a paleontologist, then.

A format string can be seen as sugar for these kinds of APIs

It can also be seen as a format string, that is, as a hack.

No reasonable designer would represent formats as strings by choice. Hell, we can represent anything as a string, right? That's not a brilliant observation: it's why we use text editors to write programs. Brilliant observations arise from revealing the structure of an object, not burying it. To say that the integers form a monoid is more insightful than to say they form a set. To say they form a group is more insightful than to say they form a monoid. To say they form a ring is more insightful than to say they form a group.

Hell, why stop at formats? Why not represent numbers as strings too? And characters? And arrays? And records? Or functions? Oh wait, I'm starting to get a brilliant idea... why don't we shove every data type into one universal data type! Yeah, that will solve all our problems.

Gee, how many other hacks "can be seen as sugar for" rationally-designed language features? Let's see: XML is just sugar for algebraic datatypes; goto, break and continue are just sugar for first-class continuations; const and final are just sugar for referential transparency; operator overloading is just sugar for type classes; C++ templates are just sugar for bounded polymorphism and MetaML.

Yeah, I guess you're right. PLT is superfluous—it's all been done before!

Well, I'm sorry for being so caustic, but I'm really sick of this "it's not a bug, it's a feature!" attitude. Even if "worse" is "better", that doesn't mean that the people who design the "worse" actually know they are doing it. In fact, in my experience, they often think they are designing the "better".

Format strings

Frank,

I've been reading with interest. How about some concrete examples of what you mean? and what alternatives are available? At the bottom of this discussion I suggested that if programs can be represented by Non-ASCII characters, some of these problems can be avoided.

What are your thoughts?

I really should start reading the posts in full.

.

quasiquote vs. quote

These are examples of quasiquotation and regular quotation, and they don't get much more elegant than they are in Lisp and Scheme. The difference is between, for example:

`(hello world ,newline)
'(hello world ,newline)

The first expression, prefixed with the backquote operator, computes the value of newline and splices that value into its position in the list. The second expression is entirely quoted, so it comes out exactly as-is.
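
For readers who don't know Lisp, a loose analogy in Python: an f-string behaves a little like quasiquotation (holes are evaluated), and an ordinary string behaves like plain quotation. This is only an analogy, since Lisp quotes structured data and Python quotes text:

```python
newline = "\n"

# Roughly like `(hello world ,newline): the hole is evaluated and spliced in.
quasi = f"hello world {newline!r}"

# Roughly like '(hello world ,newline): nothing is evaluated.
plain = "hello world {newline!r}"

# Inside an f-string, a literal brace must itself be escaped by doubling it.
braces = f"{{newline}} is {newline!r}"

print(quasi)
print(plain)
print(braces)
```

The last line shows the original problem coming back: once a character means "unquote", writing that character literally needs an escape.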

Quasiquotation/quotation turns out to be a prevalent but somewhat subtle concept. (A common mistake is to leave out ordinary quotation, since quasiquotation is seemingly "more powerful." But this causes problems like you describe.) It shows up whenever you're dealing with embedded languages like format strings. This also includes web development, since web pages designed with technologies like PHP, ASP, and JSP are typically creating quasiquoted HTML with unquoted meta-language code.

I recommend Alan Bawden's Quasiquotation in Lisp, which is a nice journal article summarizing the technology and history of quasiquotation in Lisp.

Thanks

The Quasiquotation article was certainly comprehensive in its treatment. I only know rudimentary Scheme, and my head started hurting at about page 7.

String formatting to language embedding

Isn't this related to the more general problem of language embedding?

It starts with a few control characters in some strings, expands with the printf string formatting minilanguage and soon you see source code written in no less than four languages: a general purpose host language, a database query language, a markup language for output and special markers and expressions in comments for automatic documentation extraction.

I find it interesting because it's the kind of problem that manifests itself as a question of syntax, but will take you all the way to general programming and language design questions.

paired quotes make the need for escapes less common

In Postscript, string literals are written in parentheses and can include nested matched parentheses. For example, (this is a (string) literal) is a single string literal. You only need escaping if you want to include unmatched parens, e.g. (this string contains \) an unmatched paren).
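
The matching rule is easy to implement with a depth counter. Below is a simplified Python sketch of such a scanner. The function name read_paren_string is made up, and it treats a backslash as "take the next character literally", which ignores the other escapes (such as \n or octal codes) that real PostScript defines:

```python
def read_paren_string(src, start=0):
    """Read a (string) beginning at src[start]; return (text, next_index)."""
    if start >= len(src) or src[start] != "(":
        raise ValueError("expected '('")
    depth = 1          # how many unclosed parentheses we are inside
    out = []
    i = start + 1
    while i < len(src):
        ch = src[i]
        if ch == "\\" and i + 1 < len(src):
            out.append(src[i + 1])   # escaped character, does not affect depth
            i += 2
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:           # the outermost paren closes the literal
                return "".join(out), i + 1
        out.append(ch)
        i += 1
    raise ValueError("unterminated string")

print(read_paren_string("(this is a (string) literal)")[0])
print(read_paren_string(r"(this string contains \) an unmatched paren)")[0])
```

Only the unbalanced case needs the backslash, which is why escapes become rare with paired delimiters.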

M4 (the macro language) uses matched quotes with nesting, like `this is a `quoted' m4 string containing a matched set of quotes', but has no escape. Instead, if you want to use an unmatched quote, you must change the quote strings entirely (using the changequote command).

Changequote is most structurally correct

In the end, the problem of escaping escapes arises because ASCII is being used to represent ASCII + 1.

In templating languages, ASCII is used to represent different data structures and it fails to varying degrees.

Changing quotes will work only if the change quote command itself is immune to the structural change.

Here's an imaginary language which uses [] and a command called changequote.

The quick brown fox jumps over the [username]

changequote()

The quick brown fox jumps over the (username).

But I'm stuck because I can't show you how to
changequote in this language, unless I intentionally break the command. eg.

changequote-`' (ignore the dash between the word "changequote" and the back tick)

In the end, these kinds of problems are solved by View source in HTML, or Edit in a wiki... because it allows users to actually learn what the correct representation is.

There's been some debate that source code shouldn't be represented as ASCII any more than building plans should be represented by words. The case of the escaping escape is a damning example of how software development is still stuck in the days of the teletype terminals.

BRL

I kinda liked the BRL approach.

nice nod to R5RS

From the section you linked to:

Other languages are designed by piling feature upon feature. BRL is designed by removing the restrictions on existing features that make additional features appear necessary.

From R5RS:

Programming languages should be designed not by piling feature on top of feature, but by removing the weaknesses and restrictions that make additional features appear necessary.

:)

BRL and ASP

BRL's approach is very similar to how an ASP or JSP compiler breaks up the embedded script and template into an actual program. Where it differs is in how much more forgiving it is when the special brackets are enclosed twice, e.g.
[ [literal string] ]
is syntactically correct in BRL, whereas in ASP this isn't:
 <%= <% literal string %> %> 
This reminds me of C-style comments where
/* /* this style of comment is illegal */ */

How does BRL resolve the brackets though? If I wanted to represent a right square bracket, how would it be done?

[ ] ] 

Does BRL have some precedence rules to resolve the ambiguity above?

Whatever you do, don't copy make...

I once drove some one line perl scripts from "make".

I realised that I had to contend, layered on top of each other, with the quote conventions of "make", the shell that make invoked and perl.
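
A small Python sketch of that layering, using the standard shlex module for the shell layer. The Perl one-liner is invented for illustration, and the make layer is reduced to its best-known rule, that a literal $ in a recipe must be written $$:

```python
import shlex

# What Perl itself should finally see.
perl_script = r'print "$ENV{HOME}\n"'

# Layer 1: quote the script so the shell passes it through untouched.
shell_cmd = "perl -e " + shlex.quote(perl_script)

# Layer 2: make expands $ before the shell runs, so each $ is doubled.
make_recipe = shell_cmd.replace("$", "$$")

print(perl_script)
print(shell_cmd)
print(make_recipe)
```

Each layer is simple on its own. The trouble is that the author has to apply all of them in their head, in the right order, on a single line.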

It drove me scatty.

editor feature

I'm not really a fan of languages that demand you use a smart editor before the language is really fully functional, but it seems to me that this scenario is a good one for it. In particular, Color Forth comes to mind--I don't know the exact details of what it allows though. Chui Tey's comment about needing ASCII+1, and "There's been some debate that source code shouldn't be represented as ASCII anymore", seems to hit at the core issue: if we can have rich text and word processors showing more than ASCII, then why not a program editor?

However, realistically, for transportability and such, I think it makes more sense not to demand that programs use some non-ASCII document format as source code; rather, we just continue using escapes like we do now, but give the editor smarts. I'm also no fan of syntax highlighting generally, but suppose you had an editor that displayed strings with a different background color (not foreground--you want to see spaces at ends of lines, and strings wrapping across lines); it could show embedded newlines and quotation marks unambiguously but unescaped. It could allow you to type them yourself using a mode of operation like a word processor's ctrl-i and ctrl-b for italics and boldface. The editor could still read in plain text files and save the program as plain text files with the strings escaped normally. As long as we're only doing simple things with them (like quotations and newlines), there's not much lost by hiding the innards of the string. (Tabs would be trickier.)

Intentional programming

For more on this topic, search for "intentional programming".

Also have a look at Interactive Source Code (Caution: powerpoint viewer required) by Lutz Roeder. The focus here is not on editing source code visually but browsing source code visually.

Current IDEs for JSPs and other embedded templating languages just don't recognize the existing embellishments and tags. Which of the following would you consider more readable?

<div class="header"><b>User Names </b><br/>
<% for user in users
 response.write("<i>" + user + "</i>")
%>
<B>Done.</B>

or

User Names
 <% for user in users
   response.write(user)
%>
Done.

In terms of readability, the latter wins hands down, and is more likely to make errors visible. Let us not forget that when VI was first invented, doubters said that it was not as portable as "ed", and they were right.

How Felix handles these things

Well, people have opinions on this kind of thing it seems. It's easy to analyse with hindsight, but how about lending your brains to a new design?

What I want is: convenience, expressiveness, predictability, regularity, familiarity.

Here's what Felix does with strings that may be relevant to this discussion. The most important issue needs special mention though: there are no character constants, which is an idea stolen from Python but doesn't work well in Felix.

  • It provides Python style strings, with single and tripled delimiters using single or double quotes. The Python 'r' prefix is also supported for raw strings.

    I still run out of quotes. I'm trying not to add backquotes as a third kind of quote mark in case I need them for something else.

  • ISO10646/Unicode supported by 'u' prefix.
  • slosh u and slosh U for short and long hex encoded code points, in 8 bit strings, generates UTF-8 sequences
  • traditional C style \n \t \r
  • Octal escapes banned
  • I waver on hex escapes
  • Application of a string to a string is interpreted as concatenation. Application of a string to an integer concatenates the code point the integer represents. For example:

    "Hello" 32 "World" // "Hello World"
    "Hello " person_name // works with any expression

    removes the need for escapes sometimes.

  • Combinator style regular definitions:

    regdef ident = letter (letter | digit) *;
    and syntax supporting matching and tokenisation eliminates the need for out-of-language lex script and reduces need for string based in language regular matching:

    regmatch expr with
    | regexp1 => expr1
    | regexp2 => expr2
    endmatch

    This construction is ultra fast and linear. There's a related reglex construction which uses iterators.

  • There is no special support for printf style formats. You can use them if you want, Felix doesn't try to stop you calling C functions, not even bad ones.

    At present, there is a convention to use overloaded 'print' and 'str' for quick printing and formatting.

    Actually I don't really have a more general design concept here: C++ iostreams is an example of such a more general system, but I'm not at all sure I like it.

  • MISSING. Here are things I know are missing that I'm thinking about.
    • Tcl style interpolation "hello $name"
    • Extension of regexps to 32 bits and perhaps generalised sequences
    • Easy run time compilation of regexps
    • Perl style quotations
    • HERE docs, and perhaps some more advanced control over formatting