Is character as a type meaningless?

Assume 1,2,3 creates a list of three integers: 1, 2, and 3. Similarly, I could create a list of identifiers like func1, func2, func3. I could also have three characters, 'a', 'b', 'c'.

Having 'a', 'b', 'c' is really like having the string "abc", since strings are arrays of characters. Anyone who has read how Unicode works might have noticed that the definition of a character is tricky. Some characters are composed of several characters, so what do you mean by character? A Unicode code point, as Unicode would like to see it? A single UTF-8 code unit, as C and C++ typically see it? A UTF-16 code unit, as on Windows?
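
A quick Python illustration of how the counts differ depending on which of those units you pick (just a sketch of the point, nothing more):

    # One user-perceived character, "e with acute accent", written as
    # a base letter plus a combining mark (two code points).
    s = "e\u0301"

    print(len(s))                           # 2  code points
    print(len(s.encode("utf-8")))           # 3  UTF-8 code units (bytes)
    print(len(s.encode("utf-16-le")) // 2)  # 2  UTF-16 code units

    # An emoji outside the BMP makes the three views diverge further.
    t = "\U0001F600"                        # grinning face
    print(len(t))                           # 1  code point
    print(len(t.encode("utf-8")))           # 4  UTF-8 code units
    print(len(t.encode("utf-16-le")) // 2)  # 2  UTF-16 code units (a surrogate pair)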

All this makes me wonder: is there any reason to treat a character as anything other than a short string? It would be a subtype of string, constrained to evaluate to a single Unicode character. Would this make sense, or are there pitfalls in this kind of thinking?

I suppose that "it depends"

Is Person just a short list of People? In some systems, that would work fine. In most, though, that would be weird.

The reason that you have a type for something is so that you can ask it questions about itself. There are some reasonable questions that a character can answer about itself that are not answerable by a string of characters.
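
For instance, in Python (just as an illustration of the kind of question I mean):

    import unicodedata

    c = "Å"                         # a single code point
    print(ord(c))                   # 197: its scalar value
    print(c.isupper())              # True: it has a case
    print(unicodedata.category(c))  # 'Lu': letter, uppercase

    s = "Ångström"                  # a whole string has no single answer to
                                    # "what is your code point?" or
                                    # "what is your general category?"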

Variation on the theme

I'd put it this way, which amounts to much the same: Types are a PL construct, and thus are about how a value can be manipulated in the PL. If you say a string is an array of characters, that's not about the underlying implementation but about how you manipulate it in your code. And as for characters themselves, again it's a question of how you want to manipulate them.

Tbh I haven't worked much with Unicode at a low level. In the one case I have worked with, a string can be accessed as an array of ASCII characters or of bytes; if it's Unicode, it's represented in UTF-8, and there is a set of functions for treating it as a series of codepoints. Which, afaics, works tolerably well for the purposes I've used it for; admittedly, I haven't used it too intensively.

Small quibble

In most languages, arrays are mutable while strings are not. This has both runtime and semantic implications, so in my opinion it justifies a distinction of type.

Separately, there is the issue that "slots" in a string-array are not characters because of UTF-8. This has the consequence (at least) that the operations on the two are different. In some cases it leads to completely different data structures (e.g. cords).

Where the line should be between types and dirty runtime details is a question that has different answers depending on your language goals. In the most general sense, types are a means to validate things. Which things you want to validate depends on the problem you are trying to solve.

codepoints; extrinsic types

Slots in a string byte-array are not characters, nor are they codepoints, under UTF-8. However, an array of codepoints can be represented by an array of bytes, using UTF-8. The number of codepoints will, in general, be less than the number of bytes.

When you say "types are a means to validate things", seems you're talking about extrinsic types (per Reynolds), which describe properties of phrases whose meanings exist independent of the typing. As opposed to intrinsic types, which determine the meanings of expressions, so that ill-typed phrases in an intrinsically typed language have no meaning.

Depends on level of abstraction...

...but your point is well taken. I try to distinguish in my head between the logical string (of codepoints) and its encoding (UTF-8 bytes), but even so it's easy to mix them up.

I'm not familiar with Reynolds, so I can't say. I meant that types encode property assertions as mathematical terms about which one can reason. There are many encodable things, a small subset of which we commonly think of as "types".

Reynolds

"The Meaning of Types - From Intrinsic to Extrinsic Semantics", John C. Reynolds, BRICS Technical Report RS-00-32, December 2000.

Types are a PL construct,

Types are a PL construct, and thus are about how a value can be manipulated in the PL. If you say a string is an array of characters, that's not about the underlying implementation but about how you manipulate it in your code. And as for characters themselves, again it's a question of how you want to manipulate them.

Well said.

Putting it that way, I

Putting it that way, I realize an argument for having character as its own type is that characters have variable size. If characters aren't a type of their own, and each character is of variable size, how would you be able to talk about a string's length in terms of characters?
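
A rough Python sketch of the distinction (the cluster rule here, merging combining marks onto the preceding code point, is a simplification of the real UAX #29 segmentation):

    import unicodedata

    def char_length(s: str) -> int:
        """Count user-perceived characters, approximated as
        'a base code point plus any following combining marks'."""
        count = 0
        for cp in s:
            if unicodedata.combining(cp) == 0:  # not a combining mark
                count += 1
        return count

    s = "e\u0301le\u0301phant"       # "éléphant" in decomposed form
    print(len(s))                    # 10 code points
    print(len(s.encode("utf-8")))    # 12 UTF-8 bytes
    print(char_length(s))            #  8 user-perceived characters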

Thanks for clearing that fog for me!

Agreed - sort of

The size of a character is an implementation attribute rather than a semantic attribute, so it is not a justification for a distinct type in and of itself.

It took a while to notice, but the practical utility of the char type basically went away when UTF-8 was introduced. The only operational difference between char and string in modern languages is that char leaves out some operations (though they would be innocuous if present).

Actually, the more interesting part of the problem is that Unicode shows us fairly conclusively that the entire notion of a character, in the sense that ASCII or ISO Latin-1 imagined it, is incorrect. Unicode strings do not contain characters; they contain code points. The two are not the same, and the distinction exists for compelling reasons. Semantically, chars never existed.

Primacy of Text

There was an article a while back by Manuel Simoni, on The Supposed Primacy of Text. Also related, Bret Victor's Explorable Explanations.

I often think about these articles when thinking about what the 'type' of text should be. I feel we've done ourselves and our users a significant disservice by modeling 'text' as a mere array of bytes or codepoints, and the unit of text as a 'character'.

Why not think of the base unit of text as an 'explanation' or 'idea' or 'argument' or something else that can be explored and composed and rendered in various ways?

I've tried exploring it a few times, and I have an idea that an 'idea' is something similar to a soft constraint model - i.e. a vague, ambiguous, heuristic restriction of properties relative to the blank sheet of paper. Composition of ideas can further refine them with constraints and decisions. Explanations are a form of composition and refinement, and could be computed in some cases. An edit-time tool could explore and render different refinements of an idea, and help users select between potential explanations and lock them in.

But this is a pretty deep subject. And it seems related to the gap between PL (conventionally, very rigid and precise) and natural language (very elastic and ambiguous). I've been approaching it from the PL side, contemplating various features required to even get started with a new conception of text. (E.g. good support for metaprogramming, 'syntax' that is so extremely user-defined that it could admit an SQLite database, etc.)

Modeling

We have all these related concepts like text, typeface, characters, character arrays, glyphs, images, colors, color models, pixels, subpixels, rasterization, aspect ratios, printing. The list goes on. A more general phenomenon to understand is how to allow programmers to pick levels of abstraction and map between them.

The same types of issues apply in math to tensors and other gadgets. We have vectors and forms, tensors, complex numbers, quaternions and octonions, geometric algebra, tensor bundles, conformal structures, etc. There's overlap in the abstractions. You can often look at a gadget as existing in different spaces under different abstractions.

Ratcheted Abstractions

It is easier to map an abstraction to a more detailed representation - e.g. characters to pixels, or DSL to GPL - than it is to go in the opposite direction. Further, this is lattice-like, e.g. we could map a DSL to two different GPLs, then map both GPLs to x86.

Mapping text to glyphs could involve choices in this ratcheted lattice between Bezier curves, SVG, Logo code, PostScript, or Unicode arrays.

Abstractions can be understood as objects, a substance or media with various properties. After picking one, we'll tend to be stuck with all of its properties, not just the few we hoped to leverage.

So, I wonder how to defer making choices, especially choice of abstraction, for as long as possible.

My rough idea is to program at a meta-level where we can express intentions and modularize multiple strategies to implement intentions upon unspecified abstractions with specified properties. The choice of program abstraction would be left to program search, tuned by feedback and refinement. But now I need an abstraction for the metaprogram. Soft constraints seem viable, but incomplete.

Just thinking...

Greater generality/abstraction obviously covers a larger swath of cases, but may also be harder to cope with because it lacks exploitable specialized structure. I'm reminded of Saunders Mac Lane's meta remark, in Categories for the Working Mathematician, that "good general theory does not search for the maximum generality, but for the right generality." (He was talking about adjunctions.)

Strategies without Commitment

Most software doesn't have anything resembling a theory. It just has a hand-wavy idea like "A big red button that launches nukes."

A programmer, encountering this idea, might reach for their favorite choice of button providers (I choose you, LabVIEW!) and perhaps a toy implementation for 'launching nukes' (Randomly display one of these videos of nukes exploding) then spend several days struggling to display videos in LabVIEW.

A small refinement to the idea might be "A big red button that launches nukes. As a web app." But, to a programmer who has already committed to choices about architecture and abstractions, this will often be a huge change that requires rewriting all the things.

What I'd like to do is somehow separate the expression of intention vs. implementation strategy. Instead of building a brittle bridge between intention and implementation, the programmer would either represent intentions as something machine-meaningful, or extend a catalog of strategies. They could express commitment to a choice by extending the requirements, e.g. add a "toy" refinement for "launching nukes" in case someone later develops a strategy for launching the real thing. But programmers would avoid directly expressing design decisions.

It seems to me that we can still have a lot of "exploitable, specialized structure" within the different implementation strategies without committing to any particular decision.

Of course, this entire path only begins to become feasible with a sufficiently mature catalog of components, and sufficient support for reflection.

flexibility?

("Most software doesn't have anything resembling a theory." — The Mac Lane quote occurred to me because of what it says about generality, not for its fleeting reference to theory.)

Caveat. There seems a great deal of potential here for talking past each other, as the meaning of just about every key word we're dealing with here has ambiguity that could be discussed at great length without, most likely, getting anywhere.

For what I take to be your central point, I might suggest the terms interface versus implementation. Of possible choices of abstractions in the implementing programming language, presumably some would impose more constraints on the software interface, while others would afford more interface flexibility; one would therefore want to understand how different choices affect this design dimension, and choose something that affords maximal flexibility.

I'm not sure what the reference to reflection is about, tbh.

Re flexibility

I don't believe the 'interface' concept is effective for describing requirements or desiderata. When we start thinking about interfaces, we're already several steps removed from the original ideas and intentions. A single idea might have a huge surface area when projected onto an interface, and conversely an interface element can relate to many intentions. Further, there are design choices related to squeezing everything into the PL.

It's a boon to automation if software interfaces are more oriented around being robust, scalable, composable, efficient, easy to analyze. All the normal reasons to squeeze ideas into a computation model. Flexibility is potentially useful to humans, but non-sapient machines won't benefit from it.

So, my (very long term) goal isn't flexibility of interface so much as automating the "derived" design requirements and software design, shifting programs mostly into the domain of ideas, intentions, outcomes, high level requirements and desiderata. Another layer upwards for humans, high enough that internal and external interfaces and deployment are also implementation details, subject to the same constraints as everything else.

Developing strategies would serve to improve search performance and the quality of results. It would become a diminishing aspect as machine learning takes a bigger role.

Unless you're talking about syntax/dsl/etc. as the program's "interface" to the programmer? We do need flexibility in that interface, to ensure expression of ideas is very direct, minimally damaged by translation into PL.

A few common forms of damage to ideas are detailing, fitting to medium, and fragmentation. Being vague and hand-wavy, covering a wide scope and being subject to refinement, seems to be an essential feature of ideas. Reshaping an idea to fit the computation medium is unavoidable (even fitting ideas into natural language can be a problem) but should be minimal, keeping the idea recognizable. Fragmenting ideas, sprinkling the fragments like seasoning over a codebase, prevents us from later refining or removing the idea.

Regarding 'sufficient' reflection: opaque behavior is problematic for program search, but I probably don't need runtime reflection - staged is sufficient.

Representations

I've discussed "representation selection" in the past here, and am working to build it "soon", but I'm still not done with my language proper.

Step 1 is organizing the semantics of your language & libraries in a way that avoids commitment to unintended choices as much as possible. This step is hard on its own. The most general setting for your algorithm may be a commutative ring with properties X, Y, Z, but for the setting you actually need, you can just write Int. But if you plan to reuse the algorithm, you might want to define it against the more general abstraction.
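
For instance, in Python (a rough sketch; the Ring protocol here is just an illustrative stand-in for "commutative ring with properties X, Y, Z"):

    from fractions import Fraction
    from typing import Protocol, TypeVar

    class Ring(Protocol):
        def __add__(self, other): ...
        def __mul__(self, other): ...

    T = TypeVar("T", bound=Ring)

    def horner(coeffs: list[T], x: T) -> T:
        """Evaluate a polynomial using only + and *, so it works for
        anything ring-like rather than being pinned to Int."""
        acc = coeffs[0]
        for c in coeffs[1:]:
            acc = acc * x + c
        return acc

    print(horner([1, 2, 3], 10))                                  # 123 (ints)
    print(horner([Fraction(1, 2), Fraction(1, 3)], Fraction(2)))  # 4/3 (rationals)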

Step 2 is a set of tools for extracting a representation for the semantics you've defined. I see a lot of commonality between rep selection and theorem proving. Finding a representation will be much like using Coq tactics to find a proof. It will be inherently brittle, requiring updates in response to changes to the higher level specification, but we can try to mitigate that with more resilient tactics. Ultimately, we have the spec and can measure the effect of optimizations, so I expect we can get better and better tactics and eventually maybe machine learning techniques.

Search for Representations

I think that automating search for representations is a good idea. There are many ways to represent, say, a sequence of values: array, struct, intmap, linked list, finger tree, lazy generator. It would be nice if the compiler can pick smartly based either on global knowledge of how the sequence is constructed and consumed, or some type-level annotations for desired performance characteristics.
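
A hypothetical sketch of what such an annotation might look like (make_seq and the rep= hint are invented names, purely for illustration; a real system would infer the choice from usage):

    from collections import deque

    def make_seq(items, rep="array"):
        """Pick a backing representation based on a hint."""
        if rep == "array":
            return list(items)   # O(1) random access
        if rep == "deque":
            return deque(items)  # O(1) push/pop at both ends
        if rep == "lazy":
            return iter(items)   # no storage, single in-order pass
        raise ValueError(rep)

    xs = make_seq(range(1_000_000), rep="lazy")  # nothing materialized yet
    print(next(xs), next(xs))                    # 0 1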

Do it yourself tactics

I think the important thing is to expose it as meta programming. Then people can build their own libraries of tactics for their own purposes.

re: do it yourself meta programming

Static analysis, type-checking, optimization, compilation, parsing, etc. should all be modeled as meta programming, with the ability for users to define their own libraries.

My approach to this: every module computes a value (of dictionaries, lists, and natural numbers). To process a module file `foo.xyzzy`, I search for a module named `language-xyzzy`, whose value represents a program that parses and compiles the file's bytes into another value. A language module can load and freely process another module's value. The bootstrap consists of defining a simplified language module in terms of itself. Compilation can be represented by a module that computes an executable binary; instead of a command-line compiler tool, we have a command-line binary extractor. The compilation logic, including static analysis or decisions about runtime representations, would be expressed within various modules, and can be fully user-defined.
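
A hypothetical Python sketch of that kind of dispatch; the names here (LANGUAGES, load_module, the toy xyzzy compiler) are invented for illustration and are not the actual system described above:

    import pathlib

    LANGUAGES = {}                # registry: extension -> compiler function

    def language(ext):
        def register(fn):
            LANGUAGES[ext] = fn
            return fn
        return register

    @language("xyzzy")
    def compile_xyzzy(source: bytes):
        # a toy "language": each line of the file becomes a list entry
        return source.decode("utf-8").splitlines()

    def load_module(path: str):
        p = pathlib.Path(path)
        compiler = LANGUAGES[p.suffix.lstrip(".")]  # look up "language-<ext>"
        return compiler(p.read_bytes())             # a module is just a value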

However, effective support for program search isn't just about exposing meta programming. That just provides greater opportunity for users to also be language designers. It's still left to language designers to carefully design a program model that supports desired extensions, compositions, adaptations, analysis, optimizations, annotations, etc..

For example, I can represent floating-point numbers using just dictionaries and natural numbers. It's doable, though dealing with negative zero is still irksome. But for a high performance runtime representation, or accelerated compile-time computation, we ideally could use a compact representation with direct support from the CPU. To make this possible would require some expressive refinement types, and encoding the CPU's logic for floating-point arithmetic. This logic can get a little awkward because some CPUs will use extended precision internally. I feel this sort of problem should be already very well solved for potential users, preferably without the shortcut of just adding a floating-point primitive type.

Same for dealing with boxed and unboxed structures, etc..

Agree

Agreed on your first point. My language will be fully meta-circular, even to the point of proving itself consistent (as close as possible). I should follow up on the old thread where I was trying to get around Gödel and post my solution - I just wrote it up to someone a couple of days ago. Figuring out how to make tools that compile themselves without losing consistency strength isn't very practical, but I was happy to finally solve the puzzle.

I also agree that you need to avoid the Lisp problem of users trying to be language designers. And as you say, I think it's about providing them with a clean framework for doing so and providing examples for them to emulate when building their own extensions.

Regarding floating point, I'm punting on modeling CPU behavior at the moment. For the common case where users select Float as a black-box for computing with real numbers, I'm asking that users use Real instead, defined axiomatically and requiring Dedekind cuts or whatever to define semantically (non-computable semantics). Then we extract a representation of their code that simply represents Real with Float. Proving nice properties about this representation is going to be a hard problem, as you say. I say ignore it for now.

One other comment along those lines is that rep. selection should be able to extract representations at the semantic level, not just in some IR. In the example of Reals being implemented with Float, the tool that does that should produce a new language level value that type checks against code written directly against Float.

One walled garden to rule them all

Some model-driven things, like FOAM, which was mentioned on LtU before, seem like they want to wrap up everything in a new higher-level abstraction that automagically can power a web app vs. a mobile app vs. a command line vs. whatever else.

At a low level, there is such an impedance mismatch among paradigms that I feel like those wrapper things aren't easy to build in a way that supports all the features, let alone the UI/UX, correctly.

So then I wonder what are the 'mathematical/logical' abstractions for things like interactive systems?

Problems:

  1. Having just input -> calculation -> output seems pretty fundamental/simple (modulo issues of mapping types).
  2. There are features in the target system that themselves have dependencies or interactions within the target system that make them hard to treat as stand-alone. Like, in order to do some HTTP request you have to have some HTTP client object, and so on ad nauseam, so then you kind of have to wrap all that up into some middle layer that is the impedance-mismatch adapter; but then if you are successful you end up with OpenGL-style capability-bit hell, no?
  3. There are UI/UX nuances that are just different across target systems. How can one abstract those while honoring them?

Characters, words, paragraphs, ...

If you have characters, it wouldn't be far-fetched to also have words, paragraphs, links, and classifications like subject, header, and so on. Starts to feel like a markup language.

Sometimes you would like to traverse headers or words; sometimes just treat it all as an array of characters. Some parts of the text might have other rules, like a URL.

Being able to switch between those abstractions while still keeping everything in sync would be nice.

Terminology counts

It's even more confusing if we don't use clear terminology. For example, what users think of as characters are made up of code points, which are not themselves characters.

Semantically, the choice of encoding (UTF-8 etc) is irrelevant. The thing that makes UTF-8 interesting is mainly the pun with ASCII, which only works if you know that the constituent letters fall within the ASCII range. The selected encoding changes total memory consumption dramatically, but that's orthogonal to type.

In unicode, the thing that breaks glyphs (not characters) as a type is the combining constructs and the need for canonicalization. Either one, or both, makes it clear that strings are not a "sort", at which point their meaning as a ground type becomes hard to explain. This is part of why so many programming languages and runtimes take care to specify a canonical normalization. Which helps, but is not sufficient.
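
To make the combining/canonicalization point concrete, here is a small Python illustration using the standard library's normalization (just a sketch of the general problem, not tied to any particular language design):

    import unicodedata

    a = "\u00e9"     # 'é' as one precomposed code point
    b = "e\u0301"    # 'e' plus a combining acute accent

    print(a == b)                                  # False: different code points
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))         # True after canonicalization
    print(len(a), len(b))                          # 1 2: same glyph, different lengths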

If you somehow get past that, you can model strings of glyphs as a tumbler space (floating point is a large but length-bounded tumbler space).

With Unicode, Character as a type is in fact meaningless.

That was my conclusion after reviewing Unicode.

There is a meaningful character type when you can guarantee that characters take a constant amount of space. In Unicode, the things we think of as characters - i.e., "grapheme clusters" - take varying numbers of codepoints to represent. Further, in most of Unicode's encoding forms, individual codepoints take a different number of bits to represent depending on their value. You can't just random-access the 20th character in a string by relying on characters having a constant length; you actually have to count grapheme boundaries from the beginning of the buffer, turning an O(1) operation into an O(n) operation, where n is the size of the string.

There is a meaningful character type when you can guarantee that equality in representation bits is isomorphic to equality in the view of users. But in Unicode there are multiple ways to make the same canonical character.

There is a meaningful character type when you can guarantee that "normal" operations on characters do not change the space needed for representation. But in Unicode changing case can change the number of codepoints required to represent a character. This problem is made worse by multiple representation types because even character transformations that are one-to-one measured in codepoints result in the storage length of strings changing in all normalization forms except UTF32.

This annihilates the "contiguous memory" model of string handling - if you try storing a Unicode string in contiguous memory, changing the capitalization of a character can require copying the entire remainder of the string to move each codepoint after it by one, two, or seven positions toward the front or back. After you had to count from the front to even find the indexed character. Literally every routine O(1) operation, if implemented using "contiguous memory", becomes an O(n) operation in the length of the string.

Once you're dealing with Unicode, character sequences aren't "strings" any more in terms of data storage; at best they're trees, with each node of the tree giving the number of grapheme boundaries and the number of newlines (did we mention there are at least four different newline characters in Unicode?) in the subsequence managed by that node. Lots of O(1) operations become O(log n) operations. To be fair, that includes some operations that were O(n) under the contiguous storage model, like inserting a sequence in the middle of the string. But on average, this is a massive hit to efficiency.
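
A minimal Python sketch of that kind of tree node, with counts cached per node (real rope/cord implementations balance the tree and store larger chunks, and would compute the counts when building leaves):

    class Leaf:
        def __init__(self, text, char_count, newline_count):
            self.text = text            # a chunk of normalized text
            self.chars = char_count     # grapheme clusters in this chunk
            self.newlines = newline_count

    class Node:
        def __init__(self, left, right):
            self.left, self.right = left, right
            self.chars = left.chars + right.chars           # cached sums make
            self.newlines = left.newlines + right.newlines  # indexing O(log n)

    def char_at(node, i):
        """Descend to the chunk holding the i-th character: O(log n) hops."""
        while isinstance(node, Node):
            if i < node.left.chars:
                node = node.left
            else:
                i -= node.left.chars
                node = node.right
        return node, i    # a leaf chunk plus an offset within it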

There is a meaningful character type when you can guarantee that splitting a string yields two strings whose lengths in characters sum to the length of the original. But in Unicode you can split strings and rejoin partial strings in ways that change the total number of characters, in ways that divide characters on codepoint boundaries instead of character boundaries, and in ways that leave an equal number of codepoints on each side of the split but an unequal number of characters, or vice versa. Every operation has to count grapheme boundaries, and every substring produced by any operation has to be eagerly normalized in order for subsequent operations to have consistent semantics.

There is a meaningful character type when you can give a definite answer to how much space a string of N characters will take to represent. In Unicode you cannot. By restricting yourself to a subset of Unicode you can impose a maximum, but the set of possible characters (grapheme clusters) is unbounded: any restriction leaves infinitely many characters that cannot be represented.

There is a meaningful character type when you can iterate exhaustively over subranges. In Unicode you can't do that. Codepoints have to be checked against a map to see whether they have even been assigned a meaning, many codepoints that do have a meaning aren't meaningful in context depending on their combining class, and the set of valid "grapheme clusters" can't even be enumerated. So, at best, iteration can only be approximated. No matter how you manage it you're going to have if/thens that slow your code down by orders of magnitude by causing instruction cache misses, maps hanging around putting pressure on your data cache, and lots of different code paths that constantly need to be swapped in and out of the instruction cache. Even approximating an iteration over, say, 100 Unicode characters is going to take 200 times as long as iterating over 100 characters of a fixed character set.

So, no. There is no meaningful character type any more with Unicode strings. And text is no longer a "string" or array of characters.

At best, you can have a "text" type, represented as string segments managed in a tree structure with overhead to keep track of boundaries, and achieve LogN performance (times an achingly huge constant) on most operations.

And individual characters? Uh, maybe you could leave off the tree nodes, but you're still going to have overhead to keep track of how many codepoints the character contains, keep in mind that the number of codepoints required to represent a character can change with any operation, and keep in mind that the number of CHARACTERS can change with a case change. For example, when you uppercase an eszett (ß), you get "SS", two characters, meaning either your "Character" suddenly becomes "Text", or you give up supporting case operations on individual characters. You might as well have a tree node above it, since that is exactly the kind of overhead that tree nodes exist to handle in longer text.
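
Python's built-in case mapping shows this directly:

    s = "straße"
    print(len(s), len(s.upper()))   # 6 7: "STRASSE" gains a character
    print("ß".upper())              # 'SS': one character becomes two
    print("ﬁle".upper())            # 'FILE': the 'ﬁ' ligature also expands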

So. Nope. With Unicode there's no such thing as a character type, or at least no advantage in representing it or handling it differently from handling a sequence of characters.

And sequences of characters, likewise, can't be stored as strings/arrays any more either. Now they are just "Text", they have a hideous overhead compared to "Strings", and "Character" is just a name for "Text" that happens to have a length of one.

My preferred design choices

Yes, Unicode kills the "contiguous memory" string model. However, a lot of string manipulation is search-and-replace which will change the length of your string anyway; I'm not sure how much is lost by switching the likes of tr// to copy instead of mutate.

If you really want, you can recover Boyer-Moore type algorithms (i.e., search that relies on skipping a known number of array values in O(1) time) by picking (say) UTF8 and working with the bytes.
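
For example, in Python (bytes.find is not literally Boyer-Moore, but it shows the byte-level idea; because UTF-8 is self-synchronizing, a match of a whole-codepoint needle always starts on a codepoint boundary):

    hay = "naïve café society".encode("utf-8")
    needle = "café".encode("utf-8")

    i = hay.find(needle)     # byte offset of the match, or -1
    print(i)                 # 7: a byte offset ('ï' took two bytes;
                             #    the character offset would be 6)
    print(hay[i:i + len(needle)].decode("utf-8"))   # 'café'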

Anyway, I'm pretty sure you can do all practical string processing with decent efficiency by reading strings in order. My preferred design choice is to design the string library to support immutable strings with in-order string processing, rather than try to recover the mysterious lost era of ASCII.

I want my immutable strings to concatenate efficiently by reference, so yes: they will be backed (where necessary) with slow, tree-like datastructures. However, trees are only needed for large strings, which can be stored as fairly large chunks, so most of your time can be spent processing bytes, rather than wrangling trees.

All this is independent of "what even is a character, bro?". I admit that Unicode pedants who point out that the term "character" is ill-defined in the context of Unicode have a point, but I no longer find their policy prescription - to instead do everything with grapheme clusters, or extended grapheme clusters, or whatever becomes fashionable in future editions of Unicode - at all persuasive.

The fact is, codepoints are the only Unicode entity suitable to use as a basis for string processing. Mandating a particular cluster representation is problematic: clusters of various sorts are locale-dependent, and in any case are not stable between Unicode editions.

So, my string library will support processing strings by cluster, but those clusters will just be substrings (made up of codepoints). And it should be fully agnostic with respect to how those clusters are chosen.
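
A sketch of that kind of interface in Python (the default boundary rule below just merges combining marks, and is emphatically not the full, locale-aware UAX #29 segmentation; the point is that the library stays agnostic and clusters are just substrings):

    import unicodedata

    def clusters(s, boundary_before=None):
        """Split s into cluster substrings, staying agnostic about how
        cluster boundaries are chosen (pass in your own rule)."""
        if boundary_before is None:
            # simplistic default: break before any non-combining code point
            boundary_before = lambda prev, cp: unicodedata.combining(cp) == 0
        if not s:
            return []
        out, start = [], 0
        for i in range(1, len(s)):
            if boundary_before(s[i - 1], s[i]):
                out.append(s[start:i])
                start = i
        out.append(s[start:])
        return out

    print(clusters("e\u0301a"))   # ['é', 'a']: each cluster is just a substring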