Are scalars "just" degenerate matrices?

I'm working on a Haskell library implementing the ideas in George Hart's book Multidimensional Analysis: Algebras and Systems for Science and Engineering. In essence, the book explains how physical dimensions propagate through matrix equations when different locations in a single matrix may not share the same physical dimension, which happens all the time in a bunch of useful engineering/optimization/signal-processing applications.

This is all well and good, and the dimensional library is a great jumping-off point that handles all of the type-level arithmetic.

My question/puzzle is in several parts, so I'll start with a motivating example.

If you look at the dimensional library, or most any approach to static typing of physical dimensions (or units, whether to track dimensions or units is a separate battle which I'm not exploring today), it works by representing dimensioned quantities using a phantom type that encodes their dimensions. When you do this in a library you end up with Double and Quantity DOne Double (to use dimensional's names as an example) being two separate types. This is annoying because they are logically equivalent. Appearances of the former basically mean "I am a function that uses a Double and I was written before the language learned about the existence of dimensions", and this doesn't seem to be a useful property to encode in the type system. So you end up with *~ one and /~ one at a lot of interfaces between "dimensionally aware" and "dimensionally naive" libraries; dimensional even encodes _1 :: (Num a) => Quantity DOne a and a bunch of similar shorthands for expressing commonly used dimensionally aware, but dimensionless, constants.
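
To make the friction concrete, here is roughly what the boundary looks like today. This is a minimal sketch; I'm assuming dimensional's Numeric.Units.Dimensional.Prelude with its (*~), (/~), one, and the Dimensionless synonym, and the exact module layout and signatures may differ between versions:

import Numeric.Units.Dimensional.Prelude
import qualified Prelude as P

-- A "dimensionally naive" function from some pre-existing library:
naiveLogistic :: P.Double -> P.Double
naiveLogistic x = 1 P./ (1 P.+ P.exp (P.negate x))

-- A "dimensionally aware" caller has to shuttle values across the boundary
-- with /~ one and *~ one, even though both sides are just dimensionless doubles:
awareLogistic :: Dimensionless P.Double -> Dimensionless P.Double
awareLogistic q = naiveLogistic (q /~ one) *~ one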

There are two solutions to this untidiness:

The first is introducing a system for user-defined coercions between types. The drawbacks are that inferring where to put such coercions is difficult, that they complicate/clutter the principal type of pretty much every polymorphic function, and that they don't mix well with a lot of other desirable type system features. In summary, it's good if your language already has it, but adding it to your language to solve this problem opens a large can of worms.

The second is to simply declare that all numbers have a dimension; it's just a coincidence that that dimension is often 1. In Haskell-speak, you move all the types like Int, Integer, Float, Rational, etc. from kind * into a new kind, say Integer :: NumRep, fiddle with the definition of Num (and related things; I won't rehash the problematic nature of Num here) so that they apply at the new kind, and give all your numbers types like Quantity DOne Integer instead of Integer. The big advantage is that the type unification rules remain syntax-directed.
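
Here is a sketch of what that looks like if you simulate it in today's Haskell, just to pin down the kinds. All names are illustrative; a green-field language would bake this in rather than encode it:

{-# LANGUAGE DataKinds, KindSignatures, GADTs #-}

-- Raw numeric representations live in their own kind, not in *:
data NumRep = IntRep | IntegerRep | DoubleRep | RationalRep

-- Physical dimensions (only a stub of the basis shown):
data Dim = DOne | DLength | DTime

-- The only way a number reaches kind * (i.e. becomes a value type) is wrapped
-- together with a dimension; the dimension index can be erased at runtime.
data Quantity (d :: Dim) (r :: NumRep) where
  MkInteger :: Integer -> Quantity d 'IntegerRep
  MkDouble  :: Double  -> Quantity d 'DoubleRep

-- A literal 37 would then elaborate to a value of type Quantity 'DOne 'IntegerRep
-- rather than Integer.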

The second option seems justified for a green-field language design, even though it requires sweeping changes, because I can't think of a logical difference between a dimensionless 37 and a dimensionally naive 37, and because you can almost always erase the phantom type so there needn't be a performance problem. Question one: are there valid objections? Question two (bonus): literals with the value 0 can be interpreted as polymorphic in their dimension; are there good objections to teaching the compiler this rule? CSS is the only language I know of that does this, but it seems like a good idea.

When you start to track matrix sizes, you run into a similar problem: scalars and vectors start to look like degenerate matrices, in the same way that dimensionally naive 37s are degenerate dimensioned 37s at dimension one. Adopting this convention recovers a single multiplication operator, saving you from scalar * scalar, scalar * vector, scalar * matrix, vector * matrix, matrix * matrix, and all the flipped versions. Scalar addition, inversion, and exponentiation keep their meanings if you look at them as degenerate matrix addition, inversion, and exponentiation (at least for integer exponents?), and so do all the common trigonometric functions.

When you start to track matrix types for the non-uniformly dimensioned matrices I mentioned in the first paragraph, you realize that matrix size checking falls out naturally from checking matrix dimensions (in the sense of physical dimensions): you end up with an index type like MatrixShape :: Dim -> [Dim] -> [Dim] -> MatrixShape, and your multiplication rule involves unifying the (normalized) "column dimensions" of the left operand with the (normalized) "row dimensions" of the right operand, which happily takes care of the size checking.
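
For concreteness, here is roughly the shape of the index type I have in mind, written in today's Haskell. The names are illustrative only, and the dimension arithmetic and normalization are left abstract:

{-# LANGUAGE DataKinds, KindSignatures, TypeFamilies #-}

data Dim = DOne | DLength | DTime        -- same stub of the dimension lattice as above

-- A global dimension, per-row dimensions, and per-column dimensions:
data MatrixShape = MatrixShape Dim [Dim] [Dim]

data Matrix (s :: MatrixShape) a         -- representation omitted

type family MulD (x :: Dim) (y :: Dim) :: Dim   -- dimension multiplication, left abstract

-- Multiplication unifies the (normalized) column dimensions of the left operand
-- with the (normalized) row dimensions of the right operand; agreeing on the
-- shared list ds checks the physical dimensions and the sizes in one go.
matMul :: Matrix ('MatrixShape g rs ds) a
       -> Matrix ('MatrixShape h ds cs) a
       -> Matrix ('MatrixShape (MulD g h) rs cs) a
matMul = error "sketch only"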

This motivates the third question: what are the objections to treating all scalar numbers everywhere not only as degenerate dimensionless quantities, but as degenerate dimensionless matrices?

I have three objections so far:

  • It means you probably can't use your Matrix type generator as a generic two-dimensional array type, ruling out things like Matrix String. You end up wanting to restrict the element type of the matrix to the kind of types-that-may-sensibly-carry-a-physical-dimension which may not overlap precisely with types-that-have-a-useful-Field-instance. (This may or may not be a bad thing, I'm not sure. Maybe the latter group of types could be plausibly interpreted as having a physical dimension?)
  • Obviously you don't want to hit beginners with error messages that mention all the complicated phantom types just because they wrote "moose" + 3. (People have done a lot of interesting work on solving this problem, but it is still something to consider.)
  • Does this go far enough? Or am I just unaware of the next-more-general concept of which matrices are a degenerate form? What justifies stopping here as opposed to not starting down this road in the first place?

TL;DR: Is 3 :: (Num a) => a really the most general type of 3?


I've often thought it would

I've often thought it would be nice to have the literal interpreters `fromInteger` and `fromRational` be in separate typeclasses (from Num and Fractional). My own thoughts are more related to Data.AffineSpace and related types, where it can make sense to have numbers but not to have direct addition between them.
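
Something along these lines, I imagine. This is a sketch only: the class names are made up, and RebindableSyntax already lets you experiment with it today:

{-# LANGUAGE NoImplicitPrelude, RebindableSyntax #-}

import qualified Prelude as P

-- Literal-desugaring classes split out from Num/Fractional, so a type (say, a
-- point in an affine space) can accept numeric literals without promising addition.
class FromInteger a where
  fromInteger :: P.Integer -> a

class FromRational a where
  fromRational :: P.Rational -> a

instance FromInteger P.Integer where
  fromInteger = P.id

-- With RebindableSyntax, the literal 3 desugars to fromInteger 3 using whatever
-- fromInteger is in scope, so this already type checks:
three :: P.Integer
three = 3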

Haskell is still carrying a lot of warts from its early designs, in this case because there has never been a decent means of changing typeclasses after they become popular...

Implicit conversions?

Are implicit conversions what you want?

On the type of matrices: viewing matrices as having N*M independent entries with different physical units, as the book page you linked to does, may not be the best way. As far as I can see, a matrix's units can always be described by a rank-1 unit matrix (so there are only N+M independent units). Equivalently, you can view matrices as linear maps from vectors of length N to vectors of length M and then give that map a type (which will also give you N+M units instead of N*M). The page does give a possible hint at this later on:

The traditional concept of a vector as a quantity with direction and magnitude is far too narrow for engineering purposes, while the traditional concept of a matrix as an array of scalars is far too broad. (Most vectors have no magnitude. Most arrays are not matrices.)

In Bling, I encode

In Bling, I encode vector/matrix dimensions using real types to enable such coercions. This is only effective for a small number of dimensions (around 4, to support full 3D transformations), and we need to encode relationships like D4 - 1 = D3 in the type system.

Is there any reason why more

Is there any reason why more general types make implicit conversions problematic, or is that just a limitation of C#?

Like what? There aren't many

Like what? There aren't many type systems that include numbers in them, at least outside of the dependently typed kind. I'm sure I would do it the same way in Scala.

I agree, except that it

I agree, except that it works out that there are only N+M-1 independent units, which you can see pretty easily in the 1x1 case but which holds generally.

I think Hart does too, although it isn't really clear from just that web page.

Implicit conversions are what I want, but the bidirectional nature of the implicit conversions that I want makes them tricky. The "obvious" (natural, naive, ?) semantics results in rules that let you introduce an arbitrary number of wrappings/unwrappings around any term (for example, here it would be taking x to x *~ one *~ one /~ one *~ one /~ one /~ one), and the syntax of types doesn't provide any guidance for when you might actually want to do this. I have a similar wish for implicit conversions into and out of the Identity monad, but it has the same problem.

Implicit conversions also come with a bunch of coherence issues, but I am OK with only having some specific conversions built into the language and forbidding or restricting user-supplied ones, or in making user supplied ones come with an unverified proof obligation.

You're right of course. For

You're right, of course. For the interested, here's an explanation. If you have Ax = y, where A is a matrix and x, y are vectors, we can write that out componentwise like this:

sum_i A_ji x_i = y_j

In a sum all the units have to be equal, so we have:

[A_ji][x_i] = [y_j]

where the brackets mean "unit of". Therefore:

[A_ji] = [y_j]/[x_i]

So the N*M units of a matrix can be described by a vector of units [x_i] of length N and a vector of units [y_j] of length M, totaling N+M. But the catch that I didn't see before is that [y_j] and [x_i] can have a common factor: if you multiply all the [x_i] and all the [y_j] by the same unit, the units of the A_ji don't change. That's where the -1 comes from.

I'm not sure about how to solve the problems of implicit conversions you mention. What are the options already proposed in the literature?

Scalars are not matrices

(Or, at least, scalar multiplication is not matrix multiplication)

Suppose L: 1x2, M: 2x1, N: 5x7. Then L*M is a 1x1 matrix. If that meant it were a scalar, we could then write (L*M)*N. But that would bust associativity of multiplication, since M*N is not well typed.

I don't see a reason to

I don't see a reason to expect associativity of multiplication except when all three operands are 1x1.

I think there is some sleight of hand involved there. We can call L*M a scalar and write (L*M)*N and have it pass type checking. That doesn't necessarily mean we can write L*(M*N).

It's tough to pick a useful analogy, but for a similar reason: just because we can write (fromJust $ parseMatrix "[[6]]") * N, it doesn't follow that we should be able to pick out a subexpression like "[[6]]" and expect "[[6]]" * N to be well-typed. The fact that the result of a multiplication is a scalar doesn't mean we can infer that either of its operands was a scalar, so we can't go stealing one of its operands and assuming that it is a scalar.

(Am I missing a reason why this is an important problem? I ran the example through GHCi with my library and nothing untoward happened; I got the type error I expected when I changed the grouping from (L*M)*N to L*(M*N).)

The associativity of matrix

The associativity of matrix multiplication is a basic fact (matrices represent linear transformations, multiplication represents composition of transforms, and function composition is associative) that I assume you're familiar with. Combining scalar multiplication with matrix multiplication breaks the characterization of multiplication as composition of transforms. Combining two semantically different operations for the purpose of eliminating some syntax is ugly to me even if I don't see exactly what it would break (other than equational reasoning).

Except matrices do not only

Except matrices do not only represent linear transformations. Matrices also represent vectors. When they do, then it is correct to interpret row vector * column vector as a scalar dot product, not as a 1x1 matrix. Mathematicians do things like (L*M)*N all the time, using the fact that it only has one possible interpretation to reduce the notational burden. Some might even omit the parentheses, and rely on an alert reader!

Frankly, I have trouble imagining a case where you wouldn't want a 1x1 matrix to be interpreted as a scalar, but then I don't do much work in 1D geometry.

Note that floating-point multiplication isn't associative, so L*M*N isn't either, even if the matrices are compatible. The argument over associativity therefore seems somewhat moot.

Mathematicians play loose with notation though...

Whenever you see 1xN matrices being used as vectors, there's normally a bit of implicit transposing that has to happen to make the vector algebra work out.

Also, affine geometry makes the interpretation of a 1x1 matrix as a scalar problematic, because it's either a 1D vector, a 1D point, or a dimensionless number.

The problem with a lot of these implicit conversions is that you want different implicit conversions in different contexts.

Well I work in environments

Well I work in environments where you use different types for matrix, vector and scalar, and do dot products by dot(L,M), and I prefer that setup. I was just arguing against "matrices have one interpretation, therefore strict associativity". Manifestly, they have many interpretations, even within a single expression.

But how you perform matrix arithmetic doesn't depend on interpretation. You always happen to get the same numerical results if you treat a 1x1 as a scalar - there is no inconsistency or ambiguity. You just get less error checking, and we can argue until the cows come home about what forms of error checking are helpful and what forms just get in the way. It's too reliant on user expectation, which is highly variable.

I'm certainly not arguing for implicit transpose - that's just going to break things!

Scalars to Identity Matrix

Associativity is a very nice property to have when refactoring code. For matrix multiplication, it is also valuable for optimizing performance.

I would not allow multiplication of a 1x1 matrix with arbitrary other matrices. When you spoke of turning literal integers into polymorphic matrices, I was assuming you would turn, say, `6` into (scalarMult 6 idM) where idM provides an identity matrix of the appropriate size.

Something like:

class IdentM mtrx where
    idM :: mtrx
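
Fleshed out a little more, with an illustrative fixed-size representation. Nothing here comes from an existing library, and DataKinds, KindSignatures, and ScopedTypeVariables would be needed at the top of the module:

import Data.Proxy (Proxy (..))
import GHC.TypeLits (KnownNat, Nat, natVal)

newtype Sq (n :: Nat) a = Sq [[a]]        -- invariant: n rows of n columns

instance (KnownNat n, Num a) => IdentM (Sq n a) where
    idM = Sq [ [ if i == j then 1 else 0 | j <- [1 .. k] ] | i <- [1 .. k] ]
      where k = natVal (Proxy :: Proxy n)

scalarMult :: Num a => a -> Sq n a -> Sq n a
scalarMult c (Sq rows) = Sq (map (map (c *)) rows)

-- A literal 6 in matrix position would then elaborate to scalarMult 6 idM,
-- with the size n fixed by the surrounding expression.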

Aha! Good point about not

Aha! Good point about not allowing multiplication of a 1x1 matrix with arbitrary other matrices. That alone kills this idea entirely, without even reaching the associativity issue.

OK, well spotted, that seems extremely obvious in retrospect. Oops.

Wow

Now I'm confused. What was the good point that shows you shouldn't allow multiplication of arbitrary matrices by 1x1 matrices? As far as I can see, David just stated that he wouldn't allow it...

Edit: Oh, are you saying you didn't notice that 1x1 and MxN matrices weren't compatible under ordinary matrix multiplication? That would make sense. The associativity thing is an argument for why you shouldn't try to extend matrix multiplication to include multiplication by 1x1s as scalars, which is what I thought you were proposing. It is possible: scalar multiplication agrees with matrix multiplication on 1x1 matrices whenever the latter is defined. But it's not a good idea.

Edit: Sorry for the ninja edit. Not fast enough I guess :)

Right, he shows that 1x1

Right, he shows that 1x1 matrices aren't equivalent to scalars because you can multiply a scalar by a 3x5 matrix but you can't multiply a 1x1 matrix by a 3x5 matrix.

Looked at alternatively, he shows that it is probably a bad idea to treat scalars as identical to 1x1 matrices because of the wacky wacky type that that would give to the multiplication operator (a type that makes sure either that dimensions match or that one operand is 1x1).

In response to your edit: Yeah, I just didn't realize that. I don't have a good excuse, although it was late, it certainly is blindingly obvious now.


On a related topic

What say ye of languages (such as Javascript), which have a "string" type but no corresponding "character" type?

You can have an 'integer' type without a 'digit' type, right?

The position that a character doesn't exist independent of its identity as a string of length one is pretty much the same as the position that a digit doesn't exist independent of its identity as an integer between 0 and 9 inclusive. If we mostly accept the latter, why should we have a problem with the former? This is actually an advance.

I for one consider manipulating strings as individual characters to be a venial sin on the same order as manipulating numbers by twiddling individual bits or by manipulating strings of digits. Programmers should have to specify their own representation if they intend to lower the abstraction level and type pun on the representation. If performing operations on strings, they should be using string operations.

Ray

I for one consider

I for one consider manipulating strings as individual characters to be a venial sin

It's also flat out wrong given Unicode. Unicode strings have encodings that don't decode to individual characters.

I agree completely here. I'd

I agree completely here.

I'd rather see a code point type and a character type. But the character type should represent entire composed characters, possibly consisting of several code points (and not only because of surrogates; I am mostly talking about combining characters), along the lines of what .NET exposes via the StringInfo.GetTextElementEnumerator method.

Similarly, I don't necessarily think that all lists of code points should be valid String instances either, when they are sequenced improperly according to the Unicode standard.

One minor difference

is that an (abstract) integer type can be discussed independently of any particular implementation, such as a particular fixed-radix representation like decimal, binary, or whatever.

But as a general rule, abstractions may have components that they wish to keep private.

Unicode is indeed a good reason to be wary of treating strings as char[]--although JS strings aren't really helpful in this regard, leaving (as they do) issues like normalization and combining forms up to the user or library writer. And with a good string library, the need to manually iterate over a string generally goes away.

OTOH, a char is a distinct thing from a single-char string; a character is also a distinct thing from an integer, even though all important character encodings map chars to ints in some fashion.

My personal thought is that application code should need to work with chars rarely--the need to do so is usually a sign of an incomplete abstraction. OTOH, they are sometimes useful.

under what circumstances?

When is a character a different idea from a short string, and why wouldn't a short string work in that context? 'Cos I think it really is a situation like digits and integers; a digit has no meaning other than that given to it by its existence (and position) in a number, and a character has no meaning other than that given to it by its existence (and position) in a string.

There aren't useful operations on single Unicode code points that are closed (i.e., that never return sequences of more than one code point). So if you restrict your character type to one code point, you cannot implement basic character operations like switching to uppercase or lowercase.

Even if you have a character type which represents a Unicode base character code point plus a nondefective sequence of combining code points and accents, casing operations are STILL not closed for characters like eszett (whose traditional uppercase form is the TWO-character sequence "SS"). And you cannot map such characters to integers, save by using an algorithm by which you could map any string to an integer.
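
For a quick illustration in Haskell (assuming the standard Data.Char and the text package, and assuming I'm reading the case-mapping rules right): per-code-point "simple" case mapping cannot expand one character into two, while string-level "full" case mapping can.

import Data.Char (toUpper)
import qualified Data.Text as T

charLevel :: Char
charLevel = toUpper 'ß'             -- still 'ß': the simple mapping leaves it unchanged

textLevel :: T.Text
textLevel = T.toUpper (T.pack "ß")  -- "SS": the full case mapping expands it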

Finally, there are a lot of characters (initial, medial, and final forms, and a lot of accented forms) whose casing rules and glyph selection depend on whether they appear at the beginning, in the middle, or at the end of a word. There are also a lot of characters whose casing rules vary depending on what language they appear in (the Turkish dotless lowercase i, for example). You can't do basic operations on them without a surrounding context and knowledge of which language they appear in.

What the Unicode standard has made painfully obvious, though it has always been true, is that there is no general operation on characters which can be done outside the context of a string. So it's reasonable, I think, to abandon the character abstraction as one that supports no operations, and just use strings the way we use integers: considering them to have an opaque representation.

Ray

two circumstances

The definitions of parsing I've seen all take strings over an alphabet to some result. If you have no characters, what is your alphabet, and how can you write a function polymorphic in the alphabet?

A digit in any base can be defined using only natural numbers (as natural numbers less than 10), as can an algorithm converting a natural number to its list of digits. How would you define a character when only given strings? How about the algorithm converting a string to a list of characters?
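
For concreteness, the digit half of the claim is easy to write down (nothing fancy, just the obvious definition):

import Numeric.Natural (Natural)

-- A digit in base b is just a Natural less than b, and converting a natural
-- number to its digits needs nothing beyond arithmetic on naturals:
digitsIn :: Natural -> Natural -> [Natural]   -- base -> n -> digits, least significant first
digitsIn _ 0 = [0]
digitsIn b n = go n
  where
    go 0 = []
    go m = m `mod` b : go (m `div` b)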

Parsing

You can understand parsing in terms of matching substrings rather than individual characters. It even works better, since now you don't need that funny `empty` character for certain grammar rules.

If you have no characters, then you would not create an algorithm that turns strings into characters. You could, however, create algorithms that turn strings into lists of numbers or to lists of substrings.

(Aside: I've often thought about getting rid of `strings` and just having first-class grammars. A string would be a most primitive grammar.)

There is also a concept of

There is also a concept of "smallest possible nonempty substring" which could be useful.

how is this different from a character?

Is it not the case that every string is the concatenation of a list of strings all of which contain no nonempty proper substring? If it is indeed the case, this is a "character" type that you yourself say could be useful.

I do understand that C's "char", and other common character types, are obsolete. All I'm arguing is that a properly chosen type of characters is not obsolete.

It seems like "character"

It seems like "character" would have to be defined as a sequence of code points that produce only a single glyph. At the very least, that's an invariant you'd have to check at runtime which imposes significant constraints on implementations. I'm not sure what the gain is though, because the character representation then isn't really different from a string, it's just a string with stricter constraints, but constraints that don't seem to add anything supremely useful that I can see. Is there a specific problem you think needs characters?

I think "strings all the way down" and byte vectors for performing de/encoding are the only two abstractions you really need for working with text.

Grapheme Clusters

From the Unicode page here, in the document for Chapter 2 - General Structure, on page 46 (pdf page 41):

End users have various concepts about what constitutes a letter or “character” in the writing system for their language or languages. The precise scope of these end-user “characters” depends on the particular written language and the orthography it uses. In addition to the many instances of accented letters, they may extend to digraphs such as Slovak “ch”, trigraphs or longer combinations, and sequences using spacing letter modifiers, such as “kw”. Such concepts are often important for processes such as collation, for the definition of characters in regular expressions, and for counting “character” positions within text. In instances such as these, what the user thinks of as a character may affect how the collation or regular expression will be defined or how the “characters” will be counted. Words and other higher-level text elements generally do not split within elements that a user thinks of as a character, even when the Unicode representation of them may consist of a sequence of encoded characters.

The variety of these end-user-perceived characters is quite great—particularly for digraphs, ligatures, or syllabic units. Furthermore, it depends on the particular language and writing system that may be involved. Despite this variety, however, the core concept “characters that should be kept together” can be defined for the Unicode Standard in a language-independent way. This core concept is known as a grapheme cluster, and it consists of any combining character sequence that contains only nonspacing combining marks or any sequence of characters that constitutes a Hangul syllable (possibly followed by one or more nonspacing marks). An implementation operating on such a cluster would almost never want to break between its elements for rendering, editing, or other such text processes; the grapheme cluster is treated as a single unit. Unicode Standard Annex #29, “Unicode Text Segmentation,” provides a complete formal definition of a grapheme cluster and discusses its application in the context of editing and other text processes. Implementations also may tailor the definition of a grapheme cluster, so that under limited circumstances, particular to one written language or another, the grapheme cluster may more closely pertain to what end users think of as “characters” for that language.

Unicode Consortium should provide Lexers

The problem with Unicode is that the designers of languages, or of UIs, cannot be assumed to be experts in written natural language.

If you assume that Unicode should be supported in future (scripting) languages, something I am hardly sure about, then the only domain expert who can tell you how to lex a Unicode source text is the Unicode consortium.

(For example, it is really difficult to do error reporting since it is not easy to derive the column width of a word in Unicode; I am not even sure words have a discrete column width since some glyphs may be narrow. Moreover, I am not sure how to lex simple identifiers, or operators, since some languages don't even have whitespace, or punctuation. Of course, one could force western whitespace and punctuation. But it would be much simpler if the Unicode consortium would prescribe simple classes for identifiers and operators.)

I am not sure Unicode, as it encodes languages today, is really what we'll end up with in the future. Maybe the number of code points is enough to encode all grapheme clusters, in which case I assume that at some point people will clean up the mess Unicode now seems to be becoming, and a "Unicode 2.0" will be much more trivial than what we have now.

Thanks for the detailed

Thanks for the detailed reference!

Are there implementations which solved Unicode lexing?

Looking at Unicode, I could also ask the reverse question; i.e., instead of solving it (badly), let's assume that there must be languages which actually solved the problem of handling Unicode in programming source texts in an orthogonal and clean manner. Which languages/compilers would that be? Or are all solutions hacked in some form or another?

It is different because it does not invite an error.

I guess what it comes down to is that a "character type" is an invitation to accidental overspecification leading to an error, and a "string type" is not. Most of us assume things about characters that are true of strings, but not, in general, true about characters. Most of us also assume simple things about the relationships between characters and strings which are not true. When these assumptions affect our code designs, or when we implement string operations in terms of characters, we create code that results in errors.

In fact it is my contention that there is *NO* usefully true assumption that can be made about a character data type other than those which are more certainly true about strings.

If a string is a sequence of characters, and you implement string operations on it by manipulating individual characters within it, then, first, you are probably wrong for any language other than a few of the simplest, and second, even if you are correct, the result may appear in non-normalized form, making equivalence checks fail that ought to succeed.

In particular, you cannot even concatenate strings by simply adding the characters from the first onto the end of the second; the result will sometimes be non-normalized. Nor can you divide strings by copying characters from the start up to the point of division into one string, then copying characters from the point of division to the end into another string; again, the results will sometimes be non-normalized. In fact, it is not even true that a string X characters long and a string Y characters long can be reliably joined into a string with X+Y characters, nor that a string X+Y characters long can be reliably separated into strings with X and Y characters.

If a routine is specified to return a character, then when the "character" operations it actually performs return sequences of more than one character, you get a type error because it cannot return a string. It's far better to just specify it to return a string and then, if you care, check the length of the string on return. But why you should care escapes me just now.

And so on. It's just plain safer to have only a string datatype, and operations that work exclusively on strings, because it bypasses most mistakes that people make based on wrong expectations of how characters ought to behave or wrong expectations of the relationships between characters and strings.

Ray

you're still thinking about code points too much

With characters defined as normalized grapheme clusters (see Matt M's comment), and strings defined as lists of characters, many of the problems you list go away. The reason is that code points, and thus normalization, are completely abstracted away.

I follow you that just printing, in order, each of the code points in each of the characters of such a string will not give you properly normalized Unicode. Clearly input and output functions will need to be complicated. We are all used to this, though; most of our data types are not represented in the same form in which they are input and output.

What benefits do we get?

String concatenation is list concatenation, with all its nice properties; there is no question of normalization because we aren't yet trying to output Unicode, just to store a list of characters. String splitting is list splitting. Yes, this won't always give you correctly capitalized words, but we ought not expect it to. That's akin to building in a spell checker so only correctly spelled strings can be used, gibberish be damned.

Insertion, deletion, assignment, and access of individual characters all work, because characters are neither code points nor strings, so you can't get invalid code point sequences or normalization problems.

Equivalence checks work because there are no normalization problems.

Routines that take a character but may return a string are given that type; following your earlier example, toLower takes a character to a string. No type errors, no invalid assumptions, no problem!
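
To sketch the types I'm imagining (illustrative names only; grapheme-cluster segmentation itself needs the UAX #29 rules and is left abstract):

import qualified Data.Text as T

newtype Character = Character T.Text   -- invariant: exactly one normalized grapheme cluster
type Str = [Character]

-- Case mapping is honest about its type: one character in, possibly several out.
toLowerC :: Character -> Str
toLowerC (Character t) = map Character (graphemeClusters (T.toLower t))

-- Needs the Unicode segmentation rules (UAX #29); left abstract here.
graphemeClusters :: T.Text -> [T.Text]
graphemeClusters = error "UAX #29 segmentation goes here"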

Characters whose lower or upper case depend on where they are in the word (as in another earlier example of yours) still need to be able to be put in any case in any context for purposes like, for example, making a spreadsheet of all different lower and upper cases of a given character without displaying them actually in a word.

Reduced mental burden, due to there being no concept of normalization except at input and output (where it can be automatic), or when explicitly converting to a Unicode string, which shouldn't have a concept of a character, only of a code point.

The ability to write parse/lex/regex libraries that are polymorphic in the type of characters and still work on these strings.

(Aside:) why do I care about parsers polymorphic in characters? It's led to readable, reusable code many times for me. I have "parsed" complicated game boards for move validity, for example.

The real problem is that to

The real problem is that to find the next grapheme you must have a Unicode database available in some form. Only domain experts know how to encode that database, since it consists of a large number of foreign/exotic language rules and a Unicode-specific approach to encoding rules.

Unicode may be broken, or not, I wouldn't know.

What is a string?

A list of characters seems like a pretty good explanation to me. That there aren't many useful operations defined on individual characters (and I'd be surprised if all languages even have a concept of upper and lower case) doesn't mean the list abstraction isn't useful. Characters are the smallest unit we can select or copy/paste in an editor. That's useful.

" Characters are the

" Characters are the smallest unit we can select or copy/paste in an editor."

Great. So you agree that a Unicode 'character' is a number of Unicode codepoints?

Yeah, it really has become

Yeah, it really has become messy, right? I am stuck with Unicode myself. I needed to redesign the string representation for my language, and thought that I might make my language Unicode aware while I was at it. I agree, Unicode makes the idea of a character troubling, or even obsolete.

The only manner in which I think one can lex Unicode source text is probably only with Unicode regular expressions. But I am not sure if that is even feasible.

Excessive polymorphism and overloading

This is one of those things that tends to seem cool, but ends up being a headache. This sort of thing breaks down with mixed-mode operator overloading. That's where the type system has to find the appropriate semantics for type combinations which were perhaps not anticipated by the language designer.

Python provides a nice example of that failure. In Python, "+" is addition for numeric types, but concatenation for sequences. Mercifully, you can't add an integer to a sequence, so "1" + 1 is disallowed. But then someone decided that multiplication should be treated as repeated addition. That led to treating multiplication in the concatenation context as repeated concatenation, so "1" * 3 is "111". It gets worse: [5,10] * 3 is [5, 10, 5, 10, 5, 10], and [1,2] + [3,4] is [1,2,3,4].

On top of that, Python has a numerics package, "numpy", which has more efficient arrays than the built-in ones. So you can write arr = numpy.array([1,2]). "arr" is now [1,2]. But arr*2 is [2,4], because numpy arrays have numeric semantics for arithmetic, not concatenation semantics.

So what happens if we add a numpy array and a built-in array? I'll leave that as exercise for the reader. It's not an error; you do get a result.

This is one reason why polymorphism should not be overdone. And why you have to be really careful about the semantics of the common operators. It's possible to design constructs that look mathematically reasonable but do something completely different than the user might expect. Don't do that.

So what happens if we add a

So what happens if we add a numpy array and a built-in array? I'll leave that as exercise for the reader. It's not an error; you do get a result.

Without having numpy installed, one can safely say that list + array translates into

list.__add__(array) ==> array.__radd__(list) (the reflected method is used because list doesn't know how to add an array),

while array + list gives

array.__add__(list)

So in both cases an array-addition is performed.

There is no semantics in Python for operators other than method calls, i.e. no default behavior or anything like Java, where operators are actually special while methods are not.

So...

In my opinion, Python's + (where you have to know the type of a value before you can know whether + is commutative) is a wart.

That said, I do not think we should attempt "totally general numeric types", because that leads in too many directions.

That said, I am comfortable working with an array type (tensor) where the product of the dimensions of the tensor must equal the number of elements in the tensor. Here, scalars have zero dimensions, and matrices have 2. (And the list of dimensions of a tensor is itself one-dimensional, which can turn into an interesting topic -- but it's too much for this comment.)

That said, type theory is about treating static properties of your system, and when you get into such structured collections of numbers you have a surfeit of things which are tempting to think of as types. It's perhaps instructive to realize that Haskell typically deals with only "lower bound" issues in its type system. Generally speaking, it does not try to protect you from writing programs which become incorrect in the face of resource limitations. Similarly, when dealing with tensors you can find numerous properties which are locally static but which should probably be kept out of the static type system.

My favorite example here is that you probably do not want to treat the nullspace of a matrix in your type system. (I can think of counterexamples, but they are atypical.)

Wart, but a bigger issue


In my opinion, Python's + (where you have to know the type of a value before you can know whether + is commutative) is a wart.

True. More generally, though, it is possible to build systems which make programs hard to understand. Polymorphism, overloading, and implicit conversions can be figured out by the compiler. The question is whether the typical maintenance programmer can figure them out.

Matlab has considerable generality with math operators, but Matlab has very few types. C++ has gone down this road, probably too far.

The worst problems come from implicit conversions. That's where unexpected behavior tends to manifest itself. Without implicit conversions, you get compile time (or in more dynamic systems, run time) errors. With implicit conversions, polymorphism, and overloading, strange results are possible due to some conversion being invoked invisibly.

True. More generally,

True. More generally, though, it is possible to build systems which make programs hard to understand. Polymorphism, overloading, and implicit conversions can be figured out by the compiler. The question is whether the typical maintenance programmer can figure them out.

Very true. Also, Knuth said it more succinctly.

There Never Has Been, Nor Will There Ever Be,
Any Language In Which It Is The Least Bit Difficult
To Write Bad Code.

-- Donald Ervin Knuth