IEEE Scheme expiring soon

Scheme has an IEEE standard, IEEE 1178-1990, which describes a version of the language slightly later than R4RS (#f and the empty list are definitely different) but not yet R5RS (no syntax-rules, dynamic-wind, or multiple values). That standard was reaffirmed unchanged in 2008, and will come up again for renewal in 2018. What's the Right Thing?

I see three reasonable courses of action:

1) Do the work to make R7RS-small the new edition of IEEE Scheme.

2) Do the (lesser amount of) work to reaffirm IEEE Scheme unchanged for the second time.

3) Do nothing and allow the IEEE standard to expire.

Does anyone still care about having an official, citable standard for Scheme?

(When I brought the question up on #scheme, someone asked what R7RS-small implementations exist. Currently there are Chibi, Chicken (partial), Foment, Gauche, Guile (partial), Husk, Kawa, Larceny, Mosh (partial), Picrin, Sagittarius.)


Well, I'm not going to do it this time...

I'm no longer a regular user of Scheme, I do not use the recent versions of it at all, and I'm no longer a member of IEEE. So I'm not going to spearhead the standard renewal process with IEEE again.

I also feel that recent language design choices have made the language less worthwhile to use, and wouldn't like the job of defending them to a committee.

Sadly, R4 may be the most recent step in the right direction that the language design took, and positive progress has ceased.

In particular, and starting with the worst misstep:

The handling of Unicode as specified since R6 is just plain wrong.

It makes people have to care how strings are represented and preserves a "character" data type which is representation-dependent and not a tenable primitive for expressing strings given Unicode string equality semantics. It breaks homoiconicity to have different rules for symbols and strings. And if you really need an array of codepoints, then it can only be because you are using it for something that is not a semantically sensible operation for strings. People want to treat strings as arrays of characters, but in order to do that they have to have a "character" unit which no longer maps to anything human beings think of as characters and also randomly assigns different lengths to the same string when it happens to be expressed using different sequences of codepoints. Transformations that "patch up" these inconsistencies usually do so in ways that create further inconsistencies; they make it possible, for example, to split strings and then concatenate the divided parts and get a different string than you started with.

It is an error to conflate arrays and strings, given that string equality as understood by humans and as specified in Unicode no longer depends on the equality of array elements, and that string length as understood by humans and specified by Unicode no longer matches the number of array elements. Furthermore, manipulating arrays of the codepoints that the standard now mistakes for characters requires the semantic string operations standardized by Unicode to be re-implemented in Scheme. Divisions on codepoint rather than character boundaries can create characters that were not there before and can create malformed strings that start with non-characters such as bare accents. Mistakes can easily result in other malformed sequences which are not strings, or in strings which contain things that are not and never have been characters, and treating strings as non-linguistic arrays of data encourages non-string misuse of the structure.

The only principled and consistent thing to do given the Unicode standard was for strings to be immutable values, for characters and strings to be the same type, and for characters to be distinguished only by there being no semantically consistent (ie, without creating a malformed string) way to subdivide them.

Something which is less bad, but still wrong:
Exceptions are a serious impedance mismatch for continuation-based control. And guard forms as specified by winding continuations since R5 are not compatible with exceptions at all. If you want exceptions, you need a multithreading mechanism that they do not conflict with. If you want continuation-based control you need continuation arguments to call in case of whatever event. If you want these to be dynamically scoped rather than explicit arguments, which most people who want continuations desire, then you need to nail down the semantics of a dynamically scoped environment for them to be variables within. Scheme already has dynamic scoping for certain constructs such as current-output-port and so on, and ought to have either scrapped these for consistency with lexical scoping, or nailed down the semantics of its dynamic scoping, long ago.

Moving on to something not quite as bad but still wrong:
The handling of uniform numeric vectors as specified in R7 is also just plain wrong, for similar reasons to the reasons why strings are now wrong. It invites people to write code which relies for its correctness on limitations of the underlying data representation, and in which the same constructs have different semantics when applied to different objects which are nominally of the same type.

Finally, something that is a damn shame although not quite completely wrong:
Multiple returns are a good idea but the multi-argument continuations specified in R5 and subsequent standards make them hard to use, ridiculous, and confusing. They needed a proper syntax that allowed binding the return values to variable names at the call site and within a definite scope, in a way consistent with the way lambda expressions bind values to variable names at the definition site and within a definite scope.

In general, I consider many major design decisions since R4 to be mistakes. Most new constructions are not sufficiently well designed to be consistent. In many cases elements have been introduced which conflict with each other or with existing design elements. No one is attempting to preserve the semantic simplicity which derived from being able to think in terms of semantic types with uniform operations rather than in terms of machine representations with many different semantics for similar operations. That semantic complication is a rich source of program errors and lessens the utility of the language as a means of expressing, explaining, or reasoning about algorithms.

R4 was a limited language, but at least the parts it had were parts that actually fit together. What we have now, isn't worth the effort to make a standard for.

I'm no longer a regular user

I'm no longer a regular user of Scheme, I do not use the recent versions of it at all, and I'm no longer a member of IEEE. So I'm not going to spearhead the standard renewal process with IEEE again.

I didn't expect it.

R4 may be the most recent step in the right direction that the language design took

You don't mention syntax-rules. Do you consider that a mistake as well?

The only principled and consistent thing to do given the Unicode standard was for strings to be immutable values, for characters and strings to be the same type, and for characters to be distinguished only by there being no semantically consistent (ie, without creating a malformed string) way to subdivide them.

I agree 110%. Unfortunately, it was simply not within the remit of the Working Group to make changes that are backward-incompatible with IEEE 1178. Our charter set an unreasonably high bar against that. I'd love to see R8RS take exactly this view.

I would add one further point and one caveat. The use of non-negative integers to index into strings is a mistake: string cursors should be opaque objects. SRFI 130 (a rewrite of Olin's SRFI 13 string library) takes this point of view. The caveat is that you assume that there is a single point of view about what constitutes a character to a human being, but that turns out not to be the case: people can and do disagree about how to divide the same text into characters. That being so, the codepoint view provides a rock-bottom mechanism that other more sophisticated views can be layered on in order to serve different purposes.

Scheme already has dynamic scoping for certain constructs such as current-output-port and so on, and ought to have either scrapped these for consistency with lexical scoping, or nailed down the semantics of its dynamic scoping, long ago.

Formal semantics is all Greek to me, so I can't defend this point, but I've been told that the R7RS formal semantics does nail this down.

The handling of uniform numeric vectors as specified in R7 is also just plain wrong

I don't understand your reasoning here. It's true that you can't do anything with bytevectors that cannot already be done with vectors, but you cannot do anything with vectors that cannot be done with lists. Indeed, a vector can be abstractly considered as a list of pairs whose cars are mutable and whose cdrs are immutable.

Multiple returns are a good idea but [...] needed a proper syntax that allowed binding the return values to variable names at the call site and within a definite scope, in a way consistent with the way lambda expressions bind values to variable names at the definition site and within a definite scope.

The R7RS syntax let-values, let*-values, and define-values serve this purpose, if I understand your objection correctly. The first two come from SRFI 11 and are present in R6RS as well.
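For instance, a minimal R7RS sketch (the div-mod helper is made up for illustration):

(define (div-mod n d)
  (values (quotient n d) (remainder n d)))

(let-values (((q r) (div-mod 7 2)))
  (list q r))                 ; => (3 1)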

quote

"You don't mention syntax-rules. Do you consider that a mistake as well?"

I do.

I'd be much happier with a more powerful system that wasn't a DSL.

I made comments about it here and here

syntax-rules

I'm conflicted about syntax-rules. On the one hand, it enables some things that truly needed to be enabled - hygiene is a somewhat serious restriction when a language has no access to any dynamically scoped structures, as reflected by the fact that more than half of Scheme's native syntax forms deal with its special predefined set of dynamic variables.

On the other hand, what was needed to serve that role was a proper semantics for dynamically scoped variables that didn't allow them to be confused with or shadowed by lexically scoped variables. That would have made all sensible usage of syntax-rules unnecessary, and the particular semantics they chose is unclear, inconsistent with the rest of the language, and can be used to provoke completely unexpected behavior such as hiding a function evaluation within the evaluation of a symbol which appears to be merely a variable reference.

So I'm categorizing syntax-rules under "the surgeon really did need a cutting instrument; a scalpel probably would have been more appropriate than a chain saw."

Could you help me?

Can you give an example of a system of macros with dynamically scoped variables that solves the hygiene problem? I say that because I really don't understand and I want to since I'm about to be implementing a macro system.

* I too would like to see an example

 

I as well

would like to hear this point expanded upon. Seems to me the part about not confusing dynamic variables with static ones is rather key. How do you distinguish them? A symbol is a symbol... unless you use some convention to distinguish, like the way classic BASIC suffixed $ on a variable name to denote a string. (Of course, Kernel coped with hygiene using first-class environments, which, whatever its merits or demerits, is quite a radical solution and is surely one of the reasons Kernel is, as remarked elsewhere in this thread, not a Scheme.)

ISLISP

In ISLisp there are special forms defdynamic to create a globally scoped dynamic variable, dynamic-let to bind one or more of them, dynamic to fetch its value, and setf with dynamic to assign it. So essentially they are in a distinct namespace, as function names are in ISLisp and Common Lisp.
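For comparison, the nearest thing in R7RS-small is parameter objects; a minimal sketch (the names here are mine, not part of the ISLisp description above):

(define indent (make-parameter 0))            ; a dynamically scoped variable

(define (show-line text)
  (display (make-string (indent) #\space))
  (display text)
  (newline))

(parameterize ((indent 4))                    ; rebind for a dynamic extent
  (show-line "indented by four spaces"))
(show-line "back to zero")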

The form I eventually went with used "with" as a keyword.

I used 'with' in pretty much exactly the same way as 'let' to create dynamically scoped variables, and had them occupy a separate name space (dynamic variable names must start and end with asterisks and have at least one character between them, and non-dynamic variable names are not allowed to do both).

Why not String = [Char]?

The only principled and consistent thing to do given the Unicode standard was for strings to be immutable values, for characters and strings to be the same type, and for characters to be distinguished only by there being no semantically consistent (ie, without creating a malformed string) way to subdivide them.

I understand that a Unicode codepoint isn't a character, but what's wrong with defining "character" along the lines you just sketched and taking a "string" to be a list of those? I assume there is a uniqueness property for Unicode (i.e. if ABC can be split into AB and C and also into A and BC, then it can be split into A, B, C). That's true, isn't it?

The term "character" is inherently vague

Not all scripts have as well-defined a notion of characters as Latin or Han. For example, there are three human-oriented ways to view Devanagari text: (1) the phonemic level, where letters (consonants and initial vowels) and vowel signs are separate units; (2) the default grapheme cluster level, where a letter with or without a vowel sign is the unit; (3) the akshara level, where a (possibly zero-length) sequence of graphically reduced letters followed by a fully written letter followed by an optional vowel sign or virama (vowel suppressor) is the unit. None of these is inherently superior to any other, and what people see as "a character" depends on their purpose at the time. Furthermore, none of these three levels is equivalent to the codepoint level, but all of them can be constructed on top of it.

Even the familiar Latin script is not so simple as it seems. In English æ is a mere typographical ligature, and it is all one whether you write Caesar or Cæsar, but in Norwegian æ is a separate letter, not interchangeable with ae. By the same token, sœur is the normative French spelling of the word for 'sister', but it is commonplace to write soeur nowadays because œ got squeezed out of the Latin-1 character set; however, moelle 'marrow' cannot be spelled mœlle (it is pronounced as if written moile). In German and French respectively, ä and é require their accents, but don't constitute separate letters of the alphabet, whereas in Swedish and Icelandic they are as separate as i and j are in English (but not in Italian); English ö in coöperate is a typographical nicety used by The New Yorker but hardly anyone else nowadays.

Nice collection of arguments but no show stopper

This wouldn't stop me treating code points as 'Latin characters.' Of course, you're right there will be lots of languages where that should be called a misnomer. But for lots of applications, it should be sufficient to treat a Unicode string as an opaque piece of text only used for storing, comparing, and rendering.

I think we agree

I'm arguing against the claim that there are uniquely defined natural objects called "characters" such that a string is a sequence of them. It doesn't sound to me like you disagree with that.

Avoiding nonsense is key to making decisions.

I think of it mostly in terms of whether operations that are supposed to be well-defined can create malformed-ness and/or nonsense. I accept 'œ' as a character, for example, because if you subdivide it into strings containing 'o' and 'e', then concatenate those strings, you get 'oe' not 'œ'. So 'sœur' and 'soeur' are alternate spellings of the same word as far as I'm concerned, not different ways of writing the same spelling. Some languages have rules that permit certain alternate spellings to be constructed, and that's okay. But if the string changed as a result of a pair of operations that are supposed to be inverses, then the operations are not inverses, and that is not okay. The emergence of nonsense when you divide and rejoin means œ, and other ligatures, should be treated as indivisible.

Likewise I accept ä and é as characters, for similar reasons; if you subdivide them, you get a letter on one side of the division, and a bare accent on the other. Because a bare accent is not a character by itself, you have produced a malformed string. A malformed string means the division operation was an error, and therefore string division should not have operated at that boundary. And anyway ä and é exist in multiple versions in unicode; whether they're one codepoint or two is ambiguous, and the only way to sensibly resolve the ambiguity is to be counting grapheme clusters instead of codepoints.
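A small sketch of that ambiguity, assuming an R7RS implementation whose string-length counts codepoints and whose string literals use the standard hex escapes:

(define precomposed "\xE9;")          ; U+00E9, é as a single codepoint
(define decomposed  "e\x0301;")       ; e followed by U+0301 combining acute

(string-length precomposed)           ; => 1
(string-length decomposed)            ; => 2
(string=? precomposed decomposed)     ; => #f under codepoint comparison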

There is another, IMO even stronger, reason why ä and é should not be subdivided; It is purely an accident of encoding that we're putting the codepoints for accents after the codepoints for the letters accented (Before the letter would actually have made more sense, if thinking in terms of designing electromechanical printers). If an arbitrary representation choice affects what you get as a result of an operation, then the operation is non-semantic, and your programming language is no longer working at the semantic level. In a machine-oriented language like C that's expected and appropriate, but C is for describing machine operations and Scheme is for describing semantic operations. The principle of least surprise cuts the other way.

What are we to do?

Most string operations aren't meaningful. One of the most important string operations in practice is substitution / interpolation, but that's already difficult to get right. In English you have 'a' vs. 'an', plurals and number agreement, gender agreement, etc.

Strings as a list of characters seems like a useful abstraction to me. You may need new code to handle unusual cases, just like you need code to handle the edge cases of English in interpolation.

I would consider an o with an umlaut to be a single character.

Not of themselves meaningful but with consistent axioms.

String operations by themselves are mostly not meaningful, but they should be operations that would make sense to non-programmers working with text, and must also obey axioms that programmers can exploit. Doing both harmoniously is the crux of the design effort.

For example the axioms I invoked above about string division not creating malformed strings, and about string division and string concatenation being inverse operations of each other. Those are axioms a programmer can use and exploit, and they are also in harmony with the way people working with language think of writing it. The only reason why a non-programmer wouldn't understand those axioms is because the non-programmer would never have even considered the possibility of an operation that fails to obey them.

You're absolutely right that substitution/interpolation is one of the most important string operations. To be most useful to programmers, it has axioms it must follow. At a minimum it must create no malformed strings. And I would consider it broken if any substitution of an X for a Y could not be undone directly by a substitution of a Y for an X. When you start splitting on codepoint boundaries instead of character boundaries, those axioms break on one point or the other.

I don't have a problem with programmers having to worry about the idiosyncrasies of individual languages, but for a language that's describing things at a higher abstraction than machine operations I'm going to be really obstinate about programmers being able to use the same units of representation and the same semantic operations that the people writing the language with other tools use.

If a scholar with a brush and an ink stone doesn't have to think in terms of code points, then to the extent possible, neither should the programmer have to think in terms of code points. The reason it's okay that the programmer has to be the one who worries about the rules of individual languages is because the programmer is now in the place of that scholar, and in order to function the programmer, like that scholar, has to know the rules of his language's orthography. As language implementers we're providing the brush and the ink stone.

We should be thinking in terms of the operations the scholar uses. "Put an accent over that character" or "Use a different word" are operations that make sense in terms of what the scholar does, and they're operations we must provide. But the brush and the ink aren't the part of the system that has to know the linguistic rules about how to form plurals and make noun/verb agreement work. That's what language scholars who are the ones using the tools do.

Hard cases make bad law

It's easy to say that about ö. But what about Devanagari क्न्य knya? At the akshara level, it's a unitary character. At the default grapheme level it's three characters क् k + न् n + य ya. At the letter level it's क ka + ् vowel killer + न na + ् again plus य ya. Visually, the क gets reduced to just its left half and the न to a squiggle. You type at the letter level, so when you hit backspace, the rightmost letter disappears. But when you navigate through the text using the arrow keys or the mouse, you can't select in the middle of an akshara: when the character appears as क्न्य, it is just one letter; when it appears in the equivalent (but less legible) way as क्‌न्‌य, it is three.

Defer to the people using it!

All we can do is provide primitive operations that behave in a consistent manner and don't produce malformed strings. It is certain that subdividing must not separate the vowel killer from the truncated-syllable that would otherwise have the vowel. It is also certain that 'adding a symbol to an akshara' is an operation the people using the language actually do, in the same way we do 'adding an accent to a letter,' so it isn't unnatural to them to enter it as multiple operations the same way we enter letters.

As for the silly question of whether it's one or three, I'm sure that people actually using the language are able to decide for themselves when they want to use which kind of count. If they write it, as you suggest, in visually different styles intending different numbers of characters, then it is within their paradigm to understand that a thing can be 'spelled' in more than one way because they are the ones making the distinction.

If they don't want to *say* that it's a different spelling, that's fine; it's as consistent as an English-literate person insisting that 'œ' is the same spelling as 'oe.' It is true within the rules of their language but if they see and use the distinction, they can see the difference, and they can treat it differently when appropriate.

Anything which cannot be decided at the level of tools has to be decided at the level of actual use.

Unicode and Indic languages

For at least the Indic languages Unicode is the way to go. Unicode's code points canonicalize things very well. For example, the first "akshar" of my name in Gujarati would be written as બ and while one can say it is composed of "half-letter" બ્ and vowel-sign ા, normally you wouldn't write it as two separate symbols. Nor would you write it in its true half-letter form (by omitting the vowel sign) except when it is immediately followed by a consonant, and even there the written form can vary depending on actual sounds. So for example "pr" would be written as પ્ર while "pl" would be written as પ્લ.

All this is now handled in how you present text via the font machinery. Internally, Unicode codepoints regularize things quite a bit. Until Unicode the situation was quite a mess (almost always the encoding stored in a file was just whatever mapping was provided by the particular font being used -- forget writing a common set of simple tools to work over all sorts of documents!).

If Scheme is to be used for Indic language processing, Unicode is pretty much required.

I very much like the model provided by Go (and in fact I first used Go to translate documents using font-specific encodings to Unicode). In Go a string is a UTF-8 byte string. If you index into it or extract a substring, you get raw bytes of the UTF-8 encoding, but if you range over a string with a "for loop", you get a sequence of Unicode code points (they call them runes). Something similar would work well in Scheme, e.g. string-for-each and string-map, and maybe a string-next-rune function. Then you can stop worrying about whether the underlying encoding is UTF-8 or something else. You still need string-ref and string-set! for building friendlier abstractions (such as regular expressions etc.).
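A sketch of the codepoint-level iteration in R7RS terms (the codepoints helper is mine):

(define (codepoints str)
  (let ((acc '()))
    (string-for-each
      (lambda (ch) (set! acc (cons (char->integer ch) acc)))
      str)
    (reverse acc)))

(codepoints "abc")     ; => (97 98 99)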

Codepoints - only better because they don't actively suck. :-)

I will agree that the codepoint representation provided by Unicode is a far better representation for pretty much all languages that have writing systems more complicated than ASCII can serve. Which is essentially all of them. Including English. Even for English-language writing in a presumed context of one country, ASCII was a harshly minimalistic judgment call about which things to leave out.

Any system which represents language well enough was going to break that one-codepoint-per-printing-character assumption that programmers used to rely on, because that assumption is pretty much wrong. A lot of languages have open-ended construction rules - even in the Latin alphabet that we're using here, there is no real limit (aside from practicality) on how many accents and diacritics are valid. Heck, I've seen mathematics use A-prime, A-double-prime, A-triple-prime, etc... There's not even a limit just in repetitions of the same diacritic, and sooner or later we were going to have to handle that.

Unicode is badly designed for a completely different reason. Well, a bunch of them actually, but only one that's truly unforgivable. It allows different sequences of values to be representations of the same string.

If two strings are identical I don't want to be able to detect any difference between them. I don't want it to matter what operations I did to get them. I don't want them to break apart differently. I don't want joining them to something to give different results. I don't want the same character location in identical strings to be indexed by different numbers. There is absolutely no way to deal with that except by abstracting string handling above the level of codepoints.

That implies eager normalization for strings - where absolutely every string and character is in the same normalization form, where there is no operation on strings that can yield a denormalized string, and where it doesn't even matter what normalization form that happens to be - where different implementations can use different normalization forms under the hood and software operating on the same strings will yield the same results.

Among other things that means a bunch of stuff we are accustomed to using strings for - essentially every usage where the string is used as anything other than a sequence of HUMAN LANGUAGE - needs to be expressed as operations on arrays instead.

And that includes handling sequences of codepoints, if we're handling them in any context other than as constituents in human language. A sequence of codepoints can be non-normalized, or can contain partial characters, or other things that make nonsense when considered in terms of human language, and if we allow strings to do those things we break every programming axiom in the world as regards string identity and consistency of operations.

"That implies eager normalization for strings "

That implies eager normalization for strings

The kind of character and string types you advocate for are agnostic as to normalization because they abstract it away. The question doesn't arise.

At the codepoint level and below, the option not to eagerly normalize is vital for practical reasons.

Right.

At the codepoint level and below, the option to not eagerly normalize, yes, is vital for practical reasons.

That's exactly why people must not be forced to work with strings at the codepoint level and below. At that level they are no longer strings.

What I mean by eager normalization is that if the programmer CARES about what normalization form the implementation uses to store strings, it means string identity semantics have failed. If the goal is that all instances of a particular string value must have the same semantics, then eager normalization (forcing all instances of that value to have the same representation) is the simplest way of achieving it.

The sole exception - the ONLY place codepoints ought to rise to the level of programmer attention - is when they are unavoidable. There are essentially two places where that happens. The first is when the codepoint is being used as an integer for some non-linguistic purpose such as indexing into a table or structure. In that case you're not using a character, you're using a numeric value cast from a character.

The second is binary I/O. Binary I/O ports, by definition, aren't reading and sending strings from and to the outside world, they're reading and sending blobs. If you send a string through a binary I/O port then the port will have to cast it to blob before it can send it.

Because there are several normalization forms and endiannesses, the i/o ports need to be able to make any of several different casts to do that. So it's reasonable that the programmer should be concerned with (able to specify) which casts a port will use to create or translate its blobs.

If you are using codepoint values in some way that allows the creation of or requires the interpretation of denormalized sequences then you are not using them as characters. If you ask a binary port for a blob, or ask it to write a blob, it will happily do that with no casts to or from strings - reading and writing blobs, after all, is what it's for. If you want to manipulate that sequence as a sequence of binary in uniform widths, that's fine too; that's what array operations are for.

But when you cast any blob value to a string, you get the same string value (or the same error, if it contains a non-codepoint or invalid sequence). When you cast any string value to a blob, you get the same blob value. Unfortunately blobs have a many-to-one relation to both strings and errors, so the conversion is lossy in one direction.
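A sketch of those casts with the R7RS bytevector conversions (the commented-out call shows the lossy/error direction):

(define bv (string->utf8 "Cæsar"))      ; string -> blob: always well defined
(utf8->string bv)                        ; blob -> string: => "Cæsar"
;; (utf8->string (bytevector #xFF))     ; not valid UTF-8: it is an error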

More to the point...

I tend to write answers that are too long.

In more direct and appropriate response to your point:

At the codepoint level and below, the option not to eagerly normalize is vital for practical reasons.

What you are talking about here is the sort of non-linguistic operations that we have been mistaking for string operations, but which are not. I do not deny the need for operations on arrays of binary values. I have no problem with the idea that binary values can be cast to and from characters and strings according to their values as codepoints. I have no problem with cast operations that can convert between arrays of integers and strings. But the array of binary values cannot be a string any longer, because the conversion from arrays of binary to strings is lossy; many arrays of binary cannot be made into strings, and many different arrays of binary can be made into the same string.

"Uncode is badly designed"

"Unicode is badly designed for a completely different reason."

That may be, but there is no other decent alternative at present if you want to work on some non-ASCII text. If you live in an ASCII-only world, UTF-8 encoding will work just fine for you! That is, we still need string-ref, string-set! but for "proper" Unicode handling we will need more advanced functions (which in turn may use string-ref, string-set!, substring etc.). In other words, decent Unicode support + UTF-8 will end up being a pure superset of the ASCII-only world (and a "small subset" Scheme can simply ignore Unicode!).

"If two strings are identical I don't want to be able to detect any difference between them."

This is a low level boring sort of difference that can be easily handled at a different level (or in a preprocessing phase or something). Base unicode library functions can assume some normalized representation. Or you just write a (normalize ...) function if the input is too grotty. When I was looking at Gujarati spell checking, "same but different" is not where I spent my time. The interesting stuff was figuring out complex but well defined combining rules (see sandhi) and the kind of mistakes people might make. For that working at the codepoint level was just about ideal. My guess is Scheme would be a very good programming language choice for all sorts of natural language processing.

afaik

There are standard "normalize" procedures.

If you have one, and libraries do, you can always compare strings.
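For example, a sketch using the R6RS (rnrs unicode) procedures (R7RS-small has no equivalent):

(import (rnrs) (rnrs unicode))

(string=? (string-normalize-nfc "e\x0301;")   ; e + combining acute
          (string-normalize-nfc "\xE9;"))     ; precomposed é
;; => #t once both sides are normalized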

I see what you're saying

As an aside, your choice of title confuses me. Examining a hard case to make law ... isn't that what you're doing?

I see what you're saying that using a Character abstraction makes assumptions that don't hold in every use case for Unicode. I think that's a good argument that Character shouldn't be how Unicode is exposed. I'm not sure that it proves that the Character abstraction shouldn't exist. I'll have to think about it more. Thanks for the discussion.

Living with Unicode

I don't know what scheme proposes to do with unicode, but you just have to live with the fact that human languages are messy and have ligatures and you can't always fit a grapheme in any given number of bytes. In short, there's no way around the fact that programmers have to up their game to deal with Unicode and you need libraries that deal with the cases.

I'd be happy with:
1) utf8 - that leaves ascii as ascii and it's short
or
2) 32 bit "characters"
or
3) graphemes (no one does this, but scheme has bigints, so it's within the style of the language).

In any case you need a library that converts unicode to canonical form so that strings that render the same can be compared as the same.

The standard defines how to do that, and I've worked on libraries that do that. It takes a big table and some code, but that's life in the big wide world of supporting every language.

Piling feature on top of feature

There's a passage in the introduction to the Scheme reports, starting with the R3RS, often quoted because it neatly captures an idea lots of people agree with (although they sometimes have aggressively incompatible notions of its practical implications):

Programming languages should be designed not by piling feature on top of feature, but by removing the weaknesses and restrictions that make additional features appear necessary.

I figure Scheme abrogated its claim on that philosophy when they added macros, dynamic-wind, and multiple-value returns. I've mused occasionally that the addition of macros might be dubbed the "second-class perversion" (alluding to Vernor Vinge's A Fire Upon the Deep). I presented alternatives to all three (mis)features in Kernel: first-class operatives instead of macros, guarded continuations instead of dynamic-wind, and generalized definiends instead of multiple-value returns. (It suddenly occurs to me I've never blogged about why I consider multiple-value returns wrong-headed, though it's discussed in a couple of places in the R-1RK, listed there in the index under "multiple-value returns". Key points are that $define! uses a generalized definiend — no discontinuous syntactic sugar for defining procedures — and that apply-continuation passes the argument tree to the continuation rather than merely passing the first argument as in Scheme.)

That IMO much-overquoted

That IMO much-overquoted remark has never applied to library features. If it did, R2RS (the first Scheme report to have a procedure library at all) would have had only car, cdr, cons, pair?, and null?, since lists and atoms (defined by exclusion) are a sufficient basis for programming anything, and there would have been no need for R[3-7]RS at all. (It actually suffices to allow just one atom.)

As you know, I admire Kernel very much, but it is not Scheme, and my purpose at present is determine how to standardize Scheme.

Unlibrary

Doesn't apply to library features; quite right. Though none of those three issues is about library features. (Admittedly apply-continuation is library, but I merely used it as a more direct illustration of my point than the primitive continuation->applicative under which the relevant rationale discussion occurs.) Also granted, Kernel is a bit too far afield in design space to qualify as a Scheme; but I do maintain that those three features added to R5RS Scheme are philosophically inappropriate for Scheme because they introduce lack-of-generality into the language whereas there are ways to accomplish such things that are philosophically consistent as the R5RS solutions are not.

Lack of generality has been there from the beginning

Scheme has always had second-class mutable variables rather than first-class boxes (cells). It's somewhat mysterious why this is so. Steele remarks somewhere that Scheme doesn't have cells because it would have taken an extra year to make it have them, and I always wondered what he meant by it. Certainly boxes (in the form of hunks) were available on Maclisp, and it would have simplified the Rabbit compiler considerably if all variables were immutable. I finally realized that he meant it would have taken too long to figure out how to optimize away the use of explicit boxes in cases where they were not necessary.

As dynamic languages go, Scheme is actually rather static in everything but its type system. For example, the effects of redefining a name in the global scope are undefined. Procedures are not S-expressions, though their external representations are.

huh?

I find this confusing.

When you say variables are second class instead of boxes I understand that as "you can't take a reference to a variable".

But then you say you want them to be immutable.

What's the point of immutable boxes? If they're immutable you might as well pass values.

A lot of languages can't pass by reference. It turns out not to be a big deal, since you can box by hand in the very few cases that you need it.

Assignment conversion

Assignment conversion changes variables that are actually mutated (lexical scoping makes it possible to determine these for sure) into immutable variables holding mutable boxes, thus putting all mutation into data structures rather than the language itself. Some Scheme compilers in fact do this, on the assumption that mutating variables are rare and a few such boxes scattered through the heap is no big deal.

However, if you were to get rid of set! and simply require users to use the boxes themselves, as ML does, it becomes much harder to optimize away boxes when they turn out not to be necessary. It is precisely the advantage of making things second-class that they are often easier to optimize.
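A minimal sketch of that transformation, using SRFI 111 boxes (box, unbox, set-box!); make-counter is an invented example:

;; Before conversion: n is a mutated variable.
(define (make-counter)
  (let ((n 0))
    (lambda () (set! n (+ n 1)) n)))

;; After conversion: n is an immutable binding holding a mutable box.
(define (make-counter*)
  (let ((n (box 0)))
    (lambda () (set-box! n (+ (unbox n) 1)) (unbox n))))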

Procedures are not

Procedures are not S-expressions, though their external representations are.

I find myself uncertain what you mean by this statement. It might be a statement that procedures are encapsulated atomic objects (which is also true of Kernel operatives, and has nothing to do with static-versus-dynamic); or a statement that the standard does not require procedure objects to be evaluable (which would imply that they aren't first-class, and would honestly surprise me); or some other statement that I'm failing to imagine as a possibility.

I meant the former

Procedures as encapsulated objects were introduced into the Lisp tradition by Scheme. Before that, and right up to CLtL1 and Elisp, lists of the appropriate form are procedures, though perhaps not the only type of procedure. You could cons up a list whose car was lambda and pass it directly to apply. That's not possible (portably) in CLtL2 or ANSI Common Lisp, and has never been possible (portably) in Scheme. All code objects are knowable in advance (hence static), and the only thing that can be freely created at runtime is a closure with dynamically determined values of the closed-over variables and a known code object.

The exception, of course, is a procedure created by eval. But eval did not exist in Scheme until R5RS, evaluates its first argument in an empty lexical scope, and there is no guarantee that the global scope in which the call to eval is executed is reified in such a way that it can be passed to eval at all. In principle, eval could invoke an entirely separate Scheme implementation that shares nothing with the calling program.

Procedures as encapsulated

Procedures as encapsulated objects were introduced into the Lisp tradition by Scheme.

The Art of the Interpreter views the pre-Scheme strategy as lacking first-class procedures, rather supporting only first-class representations of procedures. These are of course just different ways of saying the same thing, but I think I agree with Steele and Sussman that it's more useful to think of the earlier approach as not giving first-class status to procedure values. Procedures in this sense cannot be represented in source code — they are purely runtime entities, which is also why they are exempt from Wand's "theory of fexprs is trivial" effect (that is, Wand means by "theory" the theory of source expressions, and Lisp has a trivial theory of source expressions because all Lisp source expressions are passive data; procedures, being not source expressions, do not appear at all in Wand's trivial theory). Encapsulation of procedures is important for proving things about programs, but this is true regardless of whether the proofs apply to compile-time or runtime.

R5RS provides what I'd call a fake eval, since what makes eval a meaningful operation is first-class environments.

"Fake" is a rather harsh term

See my comment on conlanging and fossil faking to your most recent blog post. "Disjoint" might be less contentious.

It's surprisingly useless

compared with say, an eval in the current environment, if safer.

Current environment

Mm. Evaluating in the current environment can be dangerous too if it's easy to overlook the dependency, which is why Kernel's eval always requires an explicitly specified environment argument. ("Dangerous things should be difficult to do by accident." :-) Personally I think eval really only comes into its own when it synergizes with lexical procedures capable of optionally capturing their dynamic environments.

word choice

Yes, "fake" is a rather harsh word choice. But I think it applies: not merely "not real" (similarly to the way conlangs are not real), but not real while misleadingly claiming to be real. It doesn't require deliberate deception, although to produce such a thing without deliberate deception would seem to involve some failure to recognize the lack of efficacy in what was adopted. Possibly the puzzling outwardly visible result of some compromise with a convoluted internal story.

re fake eval ("word choice")

See the other comment for some thoughts on steering Scheme.

Fake eval arises from a general pattern of shunning any dynamic, reflective features that are simple and easy in an on-the-fly graph-based interpreter, but that can be used in ways that thwart optimizing compilation.

In the early history of Scheme it was widely recognized that Scheme made sense and had a potentially bright future in both dynamic and static forms, and that the two forms could be harmonized. (Aubrey Jaffer's SCM and his original CHEAPY compiler (later replaced by Hobbit) sketch and provide evidence of this understanding.)

Around the time of R4RS and then increasingly after, the authorities in control of the reports became hostile to the dynamic path. Leadership was dominated by the opinions of a few developers of academic research compilers.

Obsolete dichotomy

Now that we have just in time, and even tracing compilers (even in some scheme implementations like Racket) the research should be into combining the two.

Racket still shows problems from that fight though.

re obsolete dichotomy

Now that we have just in time, and even tracing compilers (even in some scheme implementations like Racket) the research should be into combining the two.

Uh..no, that's several confusions combined.

1. JIT was well known and convincingly practiced no later than 1987, published by 1990 (Self). R4RS arrived in 1991.

2. Really, JIT in the context of graph-based lisp interpreters was understood no later than 1990 (SCM).

3. The dynamic, reflective features under consideration here are interesting in no small part because they can be usefully applied in ways that defy even JIT compilation.

The attitude that an interpreter has to be justified by an imagined potential to use any technology, even JIT, to compete with optimizing compilers of static code -- that attitude -- is what helped the Scheme steerers drive away implementers like Jaffer (and me) and helped them lead the Scheme reports in the direction of being about the tastes of a very small number of compiler maintainers.

They treated Scheme standards as if they had to be protected like Cobol or Fortran standards; as if there were millions and millions of users more dependent on stability than advance.

Like Jefferson Airplane said: "Soon, you'll achieve the stability you strive for / in the only way it's granted / in a place someone puts fossils of our time."

The first JIT compiler that

The first JIT compiler that I know of was developed by HP in 1977 to compile APL. It worked line by line and assumed that the current array shapes of all variables referenced in the line would be the same the next time the line was reached. This compiler generated the correct number of nested loops and hard-coded the bound of each dimension: the result was "hard code".

It also generated a prologue that tested the variables to make sure their rank and shape was actually still the same, and if they weren't, it branched to another JIT compiler. This one assumed ranks were stable but pulled the dimensions out of the current variable, and generated "soft code". If the soft code prologue, which checked only ranks, failed, new soft code was created.

(For reasons of limited memory, the compilers generated compact bytecode rather than native code, but this is an implementation detail. In any case APL always spends most of its time in the primitives, which are in native code.)

re first JIT

It's nice to have a forum where necromancy is welcome.

I remember reading some old papers that discussed runtime compilation of procedurally generated functions. Sadly, I cannot recall the title or date, but there was an interesting factor: each first-class function was given only so much 'space' (e.g. forty bytes) to install itself in the actual executable code for the host. If this wasn't enough to embed the function, it was at least enough for a call-return. If it was too much, the generated code could always end with a jump forward.

Unsafe as hell, extremely unfriendly to reentrancy or concurrency, but potentially very fast in an environment where every bit of latency counted.

IMO, the right position for JITs is not some magical, hand-wavy background process, nor something as unsafe as the above, but something more precisely wielded as a tool by the programmer. For example, Surgical Precision JIT Compilers. It'd be very convenient if we can ensure static type safety and support heterogeneous computing (GPGPU, FPGA, cloud, web-app JS+DOM), too.

Necromancy...

...is a term that could be applied to any activity on this site.

PLT

I think we're in a pregnant moment. We understand one set of ideas, and have discussed them extensively here, and yet something is missing but we don't yet clearly know what. It's like the time before a thunderstorm when everything is still and yet you can smell the potential, and suspect at any moment the wind may whip up and the rain come down.

* The Cheapy compiler was by Steele (it came before Rabbit)

 

re cheapy compiler

Then I am using the wrong name, but the thing I'm talking about was Jaffer's, not Steele's.

Jaffer had a little hack that would compile a tiny ad hoc subset of Scheme to simple-minded C to compile together with SCM. When Hobbit came along, it was a greatly cleaned-up version of the same idea.

I had the impression Jaffer's hack was for the purpose of some specific "day jobs" he was up to.

"Fake" is exactly the right term

Without first-class mutable environments there is no way to use eval to accomplish anything that would justify its use in the first place. In pursuit of static optimizability, "eval" was effectively sacrificed; the only reason it remained in the language was to pay lip service to the tradition in which it had actually been useful.

It is well understood that 'eval' is hostile to static compilation, and that if you use it you will pay a heavy price. But that was not an adequate reason to blunt it into oblivion.

Well Understood

It is well understood that 'eval' is hostile to static compilation

Right. That's the problem I am running into. I am trying to make, not a Lisp, but a compiler for an untyped combinator language (with quotation) where most of the compiler internals are exposed to the language, such that it should be feasible to implement DSLs more easily.

But yeah. The problem sucks, I am not sure where to progress, and I am not sure I would end up with something worthwhile.

With Lisp, most of the low-hanging fruit regarding dynamic compilation seems to be eaten.

Module Loading

Isn't the problem with "eval" equivalent to module loading? You could compile the code in the eval to a shared object with unknown symbol bindings, and then let the dynamic linker resolve the environment. You would effectively build the code inside the eval as a .so or DLL and then pass the environment as a native symbol table to the dynamic library loader. That's probably as good as you can get statically, and would allow all static optimisations except inlining, which based on the fact you want to dynamically change symbol bindings is the best you could ever get statically.

Dunno

I am not exactly sure what you are proposing but it doesn't feel like the right direction.

I have dumbed it down to two competing requirements:

1. I want to be able to write arbitrary functions over the AST resulting in new ASTs directly in the language. (If I could get rid of quotation that would be even better, I guess.)

2. It should be efficient. (And it should work in a REPL, of course.)

So, something like it should be able to pick up an expression, compile it, run it, and then continue with the next expression?

Static or JIT?

Isn't the aim to statically compile what is inside the eval, deferring calls mapped to the environment (and values by reference so they are mutable) until runtime? This is exactly what the dynamic link loader does when loading shared objects. It sounds like this is a valid solution to me, is it just that you don't like it, or want to do something more just-in-time?

The thing is I don't see any difficulty with JIT compiling the eval contents? You might want an intermediate byte code representation though, so you are not string parsing at runtime, but this seems like normal JIT compiling, so I thought you must want something else?

I thought the problem was

what's in the eval might be
(set! TopLevelFunctionThatsCalledEverywhere (lambda () (print "oh my god I changed the program")))

Either forcing the compiler to never inline functions or forcing it to pause and deoptimize running code (like Self used to).

Don't do that :-)

As I commented elsewhere I don't think changing bindings for functions that are currently on the call stack is a good idea. What I wanted to allow was changing the bindings for the code in the eval, so you could call code with a different environment.

If you can pull the rug from code already in the call stack you would have to JIT compile, and set watches on symbols used in the call stack to invalidate the compile cache for that function, or indirect every call through a function table, still that would be no worse than virtual functions for performance.

You misunderstood, I think

The problem isn't TopLevelFunctionThatsCalledEverywhere is currently running. No, not at all.

The problem is that functions might have inlined TopLevelFunctionThatsCalledEverywhere, and that inline is now an invalid optimization. Those functions all have to be recompiled based on their optimization being wrong:
1) And some of THOSE functions might be running.

2) Worse than anything, imagine a loop that calls TopLevelFunctionThatsCalledEverywhere
a) and TopLevelFunctionThatsCalledEverywhere is inlined
b) and in this case it IS running
c) there is no call stack for it
d) you have to deoptimize the running program, CHANGE the call stack to simulate the function being on the call stack, deal with the impossible problem of how that function folded with outer code (I bet Self simply disabled most optimizations in the first place). And set it up so that the NEXT iteration gets the new function.

Think of how hard that is.

I don't see it's that hard.

A compiler is merely an optimisation, so "all" you need to do is preserve the behaviour you'd get if it was interpreted.

All you're changing is a binding; anything that already *has* that binding gets to keep it "as is", anything that uses the binding after it's been rebound gets the new version. Sure, dealing with inlined code is a bit tricksy, but I can think of a few ways of doing that which should only affect performance in the case where you *have* rebound stuff, and anyway, optimisation is hard.

The only hard stuff is if you actually *want* to rebind everything up the call stack. Perhaps I lack imagination, but I can't actually think of any case where you'd really want to do that.

That's because Keaan focused on Symbol Bindings

The problem I am looking at is more general. One of the problems I encountered, for instance, is: If you quote something with the explicit intention of doing a source-to-source translation and compile it then how often are you going to compile? Once, for every compile instruction, or every time you encounter it in the body of an expression? Moreover, do you want to quote the source code or a runtime expression?

If you go the way of code-is-data there are lots of design decisions you can make beyond adopting the manner in which Lisp does it.

Yep.

I saw this from time to time in scheme code. I think it came up most egregiously, in fact, in most of the early "records" implementations.

Somebody would want a user-definable type such as records, and they would implement it in terms of vectors, and then they would redefine vector-ref and vector-set! so that those functions did not work on their brand new records.
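A hypothetical sketch of the pattern (my reconstruction, not any particular library; it relies on top-level redefinition, which later reports frown on):

(define %vector-ref vector-ref)            ; save the real procedure

(define record-tag (list 'my-record))      ; a unique tag object
(define (make-my-record x y) (vector record-tag x y))
(define (my-record? obj)
  (and (vector? obj)
       (positive? (vector-length obj))
       (eq? (%vector-ref obj 0) record-tag)))

;; The step that started the fights: shadow the standard procedure so it
;; refuses to work on the new "records".
(define (vector-ref v i)
  (if (my-record? v)
      (error "not a vector" v)
      (%vector-ref v i)))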

And then of course that code would have a big ol' fight with someone else's code who had had the same idea, or with someone else's code whose continuations didn't catch at the point they were relying on having caught, or with someone else's winding continuations that were trying to protect references to particular other vectors with guard clauses, or ....

And that was about the point at which my hobby lisp started acquiring sigils that limited certain kinds of dynamism.

JIT

For one, I want to be able to have access to all ASTs. So, while compiling a quoted expression, where a symbol refers to a defined object, I want to be able to substitute its definition.

A bit uncommon but should allow one to build, for instance, computer algebra systems. (Although Mathematica would arguably be better at that.)

I am not sure about it. Weighing the pro's and cons' of various approaches.

Code as data

That doesn't seem that odd, although different from what I was thinking. You just need a data structure that can represent code. This would not be a byte code but as you say an AST, with a binary in memory representation. It doesn't seem that different from an ordinary algebraic datatype. You would have a front end parser that converts everything to the AST format, then you could self modify the code. You would then JIT compile and invalidate the JIT cache if a functions source AST is modified.

Hard Problem

Well. it turns out to be a hard problem (to get right.)

Actually, Scheme/CL style eval *is* useful

It allows the creation of dynamic code that executes in the same global environment as existing static code. It is a kind of JIT: your program creates safe Lisp code and evals it. Of course the usual injection concerns apply: you have to make sure that none of the code is externally supplied.
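A minimal sketch of that use in R7RS terms (eval comes from (scheme eval), interaction-environment from (scheme repl)):

(import (scheme base) (scheme eval) (scheme repl))

(define (make-adder n)                  ; build code at runtime, then eval it
  (eval `(lambda (x) (+ x ,n))
        (interaction-environment)))

((make-adder 5) 7)                      ; => 12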

Defining a real eval in terms of that

R5RS provides what I'd call a fake eval, since what makes eval a meaningful operation is first-class environments.

As long as the first-class environments can be iterated over, those are interdefinable. That said, I see R5RS and Kernel don't allow environments to be iterated over, so that might be a moot point.

I'll describe what I'm talking about anyway. Suppose our environments were association lists. We would like to write this:

(eval-in-env '(/ (+ b c) a)
  (list (cons '/ /)
        (cons '+ +)
        (cons 'a 1)
        (cons 'b 2)
        (cons 'c 3)))

We can achieve the same thing with this:

( (eval-in-empty-env
    '(lambda (/ + a b c)
       (/ (+ b c) a)))
  / + 1 2 3)

We can define eval-in-env as a procedure that does this.
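For instance, assuming eval-in-empty-env exists and the environment really is an association list we can traverse:

(define (eval-in-env expr env)
  (apply (eval-in-empty-env
          `(lambda ,(map car env) ,expr))   ; bind the names as parameters
         (map cdr env)))                    ; pass their values as arguments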

If we don't like having the ability to iterate over environments (and I don't, necessarily), then I think we can proceed to write a library where these assoc lists are wrapped up in an encapsulated data type.

Almost works in R7RS-small

Your first version can easily be written in R5RS or earlier, and in fact I intend to propose something related to it for R7RS-large in order to do partial evaluation, which is useful in connection with the possible introduction of second-class lexical syntax. More on that another day.

R5RS and R7RS-small do provide global environment objects. The trouble is that there is no way to create fresh objects. The only mutable environment object guaranteed to exist is the interaction environment, and there is no assurance about exactly what names it provides (informally: the names visible at the REPL). What is more, R5RS does not allow evaluating definitions in order to create new bindings, though R7RS does. I have a preliminary proposal to lift those restrictions for R7RS-large.
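To make the last difference concrete (counter is an arbitrary example name): under R7RS-small the first form below is allowed and creates a fresh binding in the interaction environment, while R5RS does not sanction evaluating a definition this way at all.

(import (scheme base) (scheme eval) (scheme repl))

(eval '(define counter 0) (interaction-environment))     ; R7RS: creates a binding
(eval '(set! counter (+ counter 1)) (interaction-environment))
(eval 'counter (interaction-environment))                ; => 1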

There is not, nor do I intend to propose, any machinery for first-class lexical environments.

Opinions vary on some of those points.

I don't have a problem with multiple return values. In fact if you have both continuations, and functions that take multiple arguments, then they make the semantics more regular and consistent. I'd have expressed them differently but that's bikeshedding. I don't have a problem with them.
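For instance, the receiving side of a multiple-value return has the same shape as an ordinary multi-argument call, and a captured continuation can be invoked with several values in the same way:

(call-with-values
  (lambda () (values 1 2 3))
  (lambda (a b c) (+ a b c)))        ; => 6

(call-with-values
  (lambda () (call/cc (lambda (k) (k 1 2))))
  (lambda (a b) (list a b)))         ; => (1 2)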

And lots of people think Scheme/Lisp macrology is natural and consistent; I think they are wrong, because these features create a staged runtime and its attendant set of complications, and the purposes they serve should probably have been served otherwise (as in Kernel).

Winding continuations on the other hand are in fact an ugly stain. They are a 'bandaid' on a very deep impedance mismatch that needed to be eliminated or resolved in a way that simplified things, rather than bandaged in a way that complicated them.

In terms of convenience

and getting rid of unnecessary glue boilerplate, having multiple values the way Lua does, where both calls AND returns can take different numbers of values IN THE SAME FUNCTION and MISMATCH WITHOUT AN ERROR, is much more convenient. It may be less safe, but getting rid of boilerplate is worth it.

Scheme allows that

The Lua / Common Lisp rules (extra values are dropped, missing values are set to null) are valid in Scheme, though not a requirement, and a round dozen of Schemes actually apply them. However, not requiring them means that low-rent but conforming Schemes like Chibi can use very simple implementations, because there is no requirement to signal an error on mismatch. Details on multiple values are available.

Lua

I've observed the potential neatness of that facet of Lua, and considered its implications for language design in general. Seems to me that while a language design may benefit from neat conceptual devices like this, the benefit is lost if there are too many of them not fundamentally connected to each other. JavaScript has lots of pieces with insufficient overall coherence, making it a tangled mess. Indeed, getting the whole small enough for serious synergy is quite difficult. I've obviously thought deeply about what the core concepts of Lisp are, and it seems to me multiple-value returns don't fit cleanly into the picture. What could be done to make a coherent Lua-like approach work is a separate (and engrossing) question; indeed, one could say the same for any language with some merit to it. A side project I've been mulling over for many years is to try to isolate and expand the elegant core of vintage 1970s BASIC; it has to have had such a core or it would never have been popular.

In fact if you have both

In fact if you have both continuations, and functions that take multiple arguments, then [multiple return values] make the semantics more regular and consistent.

I submit this is an illusion caused by having got things subtly wrong in the first place. The irregularity is already present, and multiple return values try to generalize from it with inevitably unfortunate results; trying to build a higher and higher tower on a flawed foundation eventually comes to grief, hence the Smoothness Principle I've proposed re abstraction theory (any roughness in a language design ultimately bounds its radius of abstraction).

Scheme doesn't allow you to write (apply (lambda x x) 2) even though common sense says this ought to evaluate to 2, because the language design fails to grant first-class status to the argument list passed to a procedure. Once you admit first-class argument lists, you can see that this entire structure is what ought to be passed to a continuation, so that instead of writing (c 2) one ought to write (apply c 2). The idea that there is something "more general" about passing multiple return values to a procedure is grounded in the illusion created by instantiating a continuation as a procedure that takes just one argument and discards the rest of its (second-class) argument list.
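For concreteness, here is what standard Scheme actually accepts (the Kernel analogue of the second form returns 2, which is the point being made above):

(apply (lambda x x) (list 2))   ; => (2) : apply's last argument must be a list
(apply (lambda x x) 2)          ; an error in Scheme: 2 is not an argument list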

Obvious?

It is not obvious to me that something which accepts a proper list as its argument structure should also accept improper lists, nor is it obvious what the utility of function calls of the form (f a . b) might be.

Uniform empowerment

The goal is to provide future programmers with a uniform language design that empowers them to imagine things we haven't thought of, by leveraging the uniformity we've given them, and, having thought of those things, to write them and have them work as expected. We expect to be unable to name specifically the ways such features might be used down the road. The fact that (apply (lambda x x) 2) in Scheme doesn't do what we expect it to do is a problem in itself because it deviates from our understanding of the uniform language design; the unacceptability of the deviation does not depend on our ability to name a specific situation where we want that. (I do recall using an improper operand-list in my library implementation of $cond, but the general point stands without that.)

Yes, I think that was in fact a mistake. But it's a small one.

In building a lisp of my own I decided it would be better if cyclic or improper lists were among the parts of the data representation that do not eval to runnable code. Their use in lambda lists seems like a minor wart to me.

I don't know if eliminating the improper-list lambda arguments would be a good choice for scheme; I certainly wouldn't be bringing it up in a scheme standardization effort because although I like the &listargs sigil better, the issue isn't all that important and most people think of the . as a sigil rather than list structure anyway.

The fact that it's list structure created inconsistencies when I defined 'lambda' as a procedure rather than as syntax. And rather than allowing the inconsistencies to spread and start affecting other things, I used a sigil list element instead. I'm also using other sigils (like &lazy) in lambda lists so it seemed to be the most consistent approach from the user's POV as well.

Scheme has been committed to the improper-list formal argument structure for a very long time and, unlike a lot of more important things, it does not seem to cause a lot of wailing and gnashing of teeth. So I'd pass it by unless people were reporting it as a source of pain or limitation.

argument case is correct; conclusion does not follow.

You are absolutely right when you say it is a failure of 'smoothness' when scheme does not allow something like applying a lambda expression directly.

But that happens because of the function/syntax roughness, not because of any arity roughness, so I don't see how it's an argument against having the same rules for return arity and argument arity.

In terms of arity, I see smoothness in a lambda calculus where a function can only take a single argument and then return a single value. It's even smooth when the single argument is a list of values and the single return is also a list of values. Where the implementation is consistent about boxing and unboxing (ie, never passes or returns a non-list) the multi-argument case is semantically identical to the single-argument case.

But passing a list of arguments one way and not passing a list of returns the other way breaks symmetry, especially where the language has continuations. For perfectly smooth semantics, functions ought to return the same way they're called - ie, the caller should not need to know whether this is a continuation or some other kind of function, should not have to check for and generate special-case code for that case, and should not need to know whether its own call stack is going to get garbage-collected after making the call. It just does what it does to call a function, and that's it.

If you want multiple arguments and single returns, that's smooth in the *absence* of continuations. And continuations are a bit of a semantic minefield in the first place, and a simple call stack with tail recursion is very general. So smoothness achieved without continuations would not make a language at all crippled.

I didn't give a complete

I didn't give a complete argument against multiple-value returns; I touched lightly on a few parts of it. Continuations come into it because the way continuations are applied in Scheme encourages the misconception that leads to multiple-value returns; but even without continuations the misconception would still be possible.

The basic notion behind multiple-value returns is that returning multiple values would be "more general than" returning just one. But if returning multiple values is more general than returning just one, wouldn't returning a first-class list of all the return values be "even more general than" multiple-value return? Of course it would, but doing so would seem silly because in that case why not just return the whole list as a single result. And indeed, it would be silly and one should just return the whole list as a single result. It should be clear that this is all just a question of how you want to arrange the syntax for returning multiple values — and that's a point on which Scheme is weak.

Indeed, I seem to recall seeing the same weakness in Backus's "Can Programming Be Liberated from the von Neumann Style?": although in theory a computation that produces multiple results can be handled by a functional expression that evaluates to a tuple, in practice that only works if you have syntax for very conveniently taking the tuple apart once it's been returned. Kernel does have such syntax: generalized definiends. In Kernel when you have a procedure p that returns a list of four values, you can write

($define! (a b c d) (p ...))

and those four values will be bound to symbols a, b, c, d. Scheme can't do that because it's overloaded its define syntax with some unfortunate "syntactic sugar" for defining procedures. (From some things I've heard about introductory Scheme classes, that syntactic sugar also tends to sabotage students' understanding of the elegant concepts behind Lisp, by encouraging them to think of procedures as second-class language elements.)

So what we're really dealing with is a very deep design choice that might seem to be a "simple" question about the syntax of function value return, but really has sweeping consequences across the whole design, making it difficult to give a compact explanation of the choice.

Okay, so for backward compatibility ...

... with the admittedly lame syntax sugar for defining procedures, we have to name our general multiple-value operators somewhat noisily: define-values, let-values, and let*-values. They've been around for 15 years and are part of R7RS-small. That seems a minor thing to make a fuss about.
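For comparison, the R7RS-small counterpart of the Kernel example above:

(define (p) (values 1 2 3 4))

(define-values (a b c d) (p))
(list a b c d)                  ; => (1 2 3 4)

(let-values (((a b c d) (p)))   ; the local-binding form
  (+ a b c d))                  ; => 10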

Uniformity looks like a

Uniformity looks like a small thing if you look for one small place where it'll make a huge difference. But it's ubiquitous, and has these individually seemingly-not-earthshaking effects that add up until, in the long run, its cumulative effect overwhelms everything else. Even then one might fail to notice it, because the effect is then on such a big scale, like failing to notice that you're standing on a continent.

I feel like you're overstating it.

If I were to complain about Scheme I'd pick the lack of convenience:
1) arrays don't grow for you. Why? WTF why?
2) records have an interface that's so bad it looks like it's from the 50s
3) basic syntactic sugar goes through things that are, once again, so crazy (like make-set-transformer or set-variable-transformer) that I want to run screaming
4) the new macro system is horrible for everything except a few simple examples from someone's paper. Forget about being able to implement an object system with instance variables visible in methods.
5) despite protestations to the contrary, syntactic tokens have hidden fields that you can't examine or set properly, such as the all-important one that specifies what context a variable is in - at best you can just pick a token at random from the source and make a new token based on it and hope it's at the right level... Damn, it's so bad.

I mean you're right about uniformity, but wrong in that your example is a bad one

Lack of convenience

1) Arrays don't grow for you because efficiency, and because you can easily layer growing arrays over non-growing ones (see the sketch below, after point 5).

2) In the 1950s records looked like this:

   01  MAILING-RECORD.
       05  COMPANY-NAME            PIC X(30).
       05  CONTACTS.
           10  PRESIDENT.
               15  LAST-NAME       PIC X(15).
               15  FIRST-NAME      PIC X(8).
           10  VP-MARKETING.
               15  LAST-NAME       PIC X(15).
               15  FIRST-NAME      PIC X(8).
           10  ALTERNATE-CONTACT.
               15  TITLE           PIC X(10).
               15  LAST-NAME       PIC X(15).
               15  FIRST-NAME      PIC X(8).
       05  ADDRESS                 PIC X(15).
       05  CITY                    PIC X(15).
       05  STATE                   PIC XX.
       05  ZIP                     PIC 9(5).

Not Scheme, I assure you.

3) That's Racket-specific.

4) Identifier-syntax macros can and do handle "instance variables visible in methods". I personally think that when you see a variable, a variable it should be and not a disguised method call, but the capability exists in R6RS.

5) That's syntax-case-specific. Despite R6RS, syntax-case is not and never has been the only low-level macro system.
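Regarding point 1, a minimal sketch of layering a growing array over fixed-size vectors (all the gvector names are made up; a real library would do rather more):

(define (make-gvector) (cons 0 (make-vector 4)))   ; (count . storage)

(define (gvector-length gv) (car gv))

(define (gvector-ref gv i) (vector-ref (cdr gv) i))

(define (gvector-push! gv x)
  (let ((n (car gv)) (v (cdr gv)))
    (when (= n (vector-length v))                  ; full: double the storage
      (let ((w (make-vector (* 2 n))))
        (do ((i 0 (+ i 1))) ((= i n)) (vector-set! w i (vector-ref v i)))
        (set-cdr! gv w)))
    (vector-set! (cdr gv) n x)
    (set-car! gv (+ n 1))))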

Yup.

It is exactly as you say. A single argument - which is a list. And a single return - which is also a list. Makes it exactly equal to the one-argument lambda calculus. Neither more nor less general. I wasn't arguing about generality, I was arguing about call/return semantics mismatch. Smoothness fails at the semantic level if there is a mismatch where one is *always* a list and the other might not be.

The syntactic failure here is only that Scheme doesn't make it as easy to accept multiple returns as it is to pass multiple arguments. For me that comes under "yes the syntax is a shame, but no the semantics aren't wrong."

some thoughts on steering scheme

I suggest asking those who did it why they pursued IEEE standardization in the first place. If they are not teetotalers, perhaps get a few drinks into them first. Ask: Are those reasons relevant today? Were the broader goals of IEEE standardization actually achieved? In retrospect, does it seem worth it (a) personally, (b) for the future development of research, development, and use of Scheme? Cui bono? Quid?

Scheme would be a different and, I think, more interesting language these days if Steele's advisors had looked at the Rabbit thesis and said "It's a bit thin. Add a second half about dynamically interpreted Scheme and reflective features made possible in an interpreter."

Instead, the authorities since r4rs have seemed hell bent on making sure that any such dynamic features are Not Scheme. At the same time, they keep raising the bar of what a complete implementation is supposed to comprise.

While that's what the authorities are up to, no small part of (what remains of) actual real world interest in Scheme seems to concern flyweight implementations and the use of Scheme for dynamically programming interactive environments. Go figure.

One reason for pursuing

One reason for pursuing standardisation for Scheme (and for Common Lisp) back then was a fear that otherwise someone else would initiate a standardisation process and would consequently have more control over the resulting language definition.

Both Scheme and Common Lisp had an informal group working on what the language should be, and standardisation in effect took those groups and turned them into the technical committees for the standardisation efforts.

The danger wasn't only from potential standards for the very same language (Scheme or Common Lisp) but also from ones that would seem to cover those languages, so that it wouldn't make sense for them to have a separate standard. One of the reasons the ISO standard is for ISLisp, rather than Lisp, is that the Americans were adamant that there mustn't be a standard for all of Lisp -- indeed, McCarthy said he would denounce any such standard -- and that Lisp be treated as a family of languages which could have separate standards, rather than as one language.

Another reason for pursuing standardisation, at least for Common Lisp, was that some funding bodies preferred or required the use of applicable standards. That put languages that were not standardized (or, rather, people who wanted to use those languages) at a disadvantage.

re: one reason

Another reason for pursuing standardisation, at least for Common Lisp, was that some funding bodies preferred or required the use of applicable standards. That put languages that were not standardized (or, rather, people who wanted to use those languages) at a disadvantage.

Sure, Common Lisp was clearly an attempt to draw the boundary lines between commodity forms to prevent lock-in to particular lisp vendors. It could be compared by analogy to the way POSIX was meant to resist lock-in to particular unix vendors.

The thing about Scheme is that it never actually encountered the threat of lock-in, because no implementation of Scheme at any point in its entire history has ever had quite the economic importance of any of the commercial lisp machines (never mind a unix). There's something cargo-cultish about standardizing it.

McCarthy said he would denounce any such standard -- and that Lisp be treated as a family of languages which could have separate standards, rather than as one language.

I had no idea. Smart guy.

Well....

If the reason the last RnRS failed to adapt to the new realities of Unicode was a desire to be consistent with the IEEE standard, then it is best to allow the IEEE standard to die, so that the mistake is not repeated.

However, I fear that having already standardized on the Wrong Thing, they will not correct their errors in that regard. Nor will they make a choice between flow-of-control paradigms; now that throw/catch and exceptions have been forced into the language, they are going to fight unto Scheme's death against continuations - particularly against winding continuations, which are a stain on the language in their own right.

So...

if the RnRS branch has gone irretrievably wrong, would that then be a reason to ignore it and strive to get the IEEE branch right?

Good luck with that

After doing all the work, you need to send out ballots to the electors, and get a 75% return on ballots and a 75% approval rate. "Politics is the art of the possible." (Peter Medawar)

Figuring out what one wants

Figuring out what one wants to happen is prerequisite to any political effort, however imperfect, to make it happen. It's not possible even to approximate achievement of goals one hasn't identified.

What do you mean, "die"?

Do you imagine that because the IEEE withdraws its official approval, the P1178 document (or R4RS+, whatever) ceases to exist? For every Schemer who wants Scheme to "adapt to the new realities of Unicode", there are at least two Schemers who think the ASCII repertoire (not even the ASCII encoding) is the perfect, jewel-like counterpart to small Scheme. Everything beyond ASCII is standardized but optional in R7RS, and that's what was needed in order to get to consensus. (The fact that ASCII, like Unicode, was a political compromise is simply forgotten.)

With a less conservative Steering Committee, I would have been happy to make R7RS a more radical break with the pre-Unicode past. But the community elected that Committee, and presumably got what it expected to get. There is no "gray They" here, just John Cowan and Alex Shinn and Art Gleckler and all the WG1 members and all the people who voted either for or against R7RS.

Ah well, "the part-time help of wits is no better than the full-time help of halfwits." (Wolcott Gibbs, I think)

Object system

I think that the new Scheme standard should contain a GOOPS- or CLOS-style object system.