Is character as a type meaningless?

Assume 1, 2, 3 creates a list of three integers: 1, 2, and 3. Similarly, I could create a list of identifiers like func1, func2, func3. I could also have three characters, 'a', 'b', 'c'.

Having 'a', 'b', 'c' is really like having the string "abc", since strings are arrays of characters. Anyone who has read how Unicode works might have noticed that the definition of a character is kind of tricky. Some characters are composed of several code points, so what do you mean by "character"? A Unicode code point, as Unicode would like to see it? A single UTF-8 code unit, as C and C++ typically see it? A UTF-16 code unit, as in Windows?
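To make the ambiguity concrete, here is a minimal Python sketch (not tied to any particular language, just for illustration) counting the same visible character in each of those units:

```python
# One user-perceived character ('é' built from 'e' + a combining accent),
# counted in the different units people call a "character".
import unicodedata

s = "e\u0301"                            # 'e' + COMBINING ACUTE ACCENT

print(len(s))                            # 2 -- Unicode code points
print(len(s.encode("utf-8")))            # 3 -- UTF-8 code units (bytes), the C/C++ view
print(len(s.encode("utf-16-le")) // 2)   # 2 -- UTF-16 code units, the Windows view

t = unicodedata.normalize("NFC", s)      # precomposed form, U+00E9
print(len(t))                            # 1 -- same visible character, one code point
```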

All this makes me wonder: is there any reason to treat character as anything other than just a short string? It would be a sub-type of string, constrained to evaluate to a single Unicode character. Would this make sense, or are there pitfalls in this kind of thinking?

I suppose that "it depends"

Is Person just a short list of People? In some systems, that would work fine. In most, though, that would be weird.

The reason that you have a type for something is so that you can ask it questions about itself. There are some reasonable questions that a character can answer about itself that are not answerable by a string of characters.

Variation on the theme

I'd put it this way, which amounts to much the same: Types are a PL construct, and thus are about how a value can be manipulated in the PL. If you say a string is an array of characters, that's not about the underlying implementation but about how you manipulate it in your code. And as for characters themselves, again it's a question of how you want to manipulate them.

Tbh I haven't worked much with Unicode at a low level. In the one case I have worked with, a string can be accessed as an array of ASCII characters, or of bytes; if it's Unicode, it's represented in UTF-8, and there is a set of functions for treating it as a series of codepoints. Which afaics can work tolerably well for the purposes I've used it for; admittedly, I haven't used it too intensively.

Small quibble

In most languages, arrays are mutable while strings are not. This has both runtime and semantic implications, so in my opinion it justifies a distinction of type.

Separately, there is the issue that "slots" in a string-array are not characters because of UTF-8. This has the consequence (at least) that the operations on the two are different. In some cases it leads to completely different data structures (e.g. cords).

Where the line should be between types and dirty runtime details is a question that has different answers depending on your language goals. In the most general sense, types are a means to validate things. Which things you want to validate depends on the problem you are trying to solve.

codepoints; extrinsic types

Slots in a string byte-array are not characters, nor are they codepoints, under UTF-8. However, an array of codepoints can be represented by an array of bytes, using UTF-8. The number of codepoints will, in general, be less than the number of bytes.
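As a rough illustration of that last point, here is a small Python sketch (decoding by hand instead of calling the standard library's decoder, purely to expose the structure) showing why byte "slots" and code points don't line up:

```python
# Hand-rolled UTF-8 decoding: byte slots are not code points.
def decode_utf8(data: bytes) -> list:
    """Minimal UTF-8 decoder; assumes well-formed input, no error handling."""
    cps, i = [], 0
    while i < len(data):
        b = data[i]
        if b < 0x80:            # 1-byte sequence (ASCII)
            cp, n = b, 1
        elif b < 0xE0:          # 2-byte sequence
            cp, n = b & 0x1F, 2
        elif b < 0xF0:          # 3-byte sequence
            cp, n = b & 0x0F, 3
        else:                   # 4-byte sequence
            cp, n = b & 0x07, 4
        for j in range(1, n):   # fold in the continuation bytes
            cp = (cp << 6) | (data[i + j] & 0x3F)
        cps.append(cp)
        i += n
    return cps

raw = "naïve".encode("utf-8")
print(len(raw))               # 6 bytes
print(len(decode_utf8(raw)))  # 5 code points
```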

When you say "types are a means to validate things", seems you're talking about extrinsic types (per Reynolds), which describe properties of phrases whose meanings exist independent of the typing. As opposed to intrinsic types, which determine the meanings of expressions, so that ill-typed phrases in an intrinsically typed language have no meaning.

Depends on level of abstraction...

...but your point is well taken. I try to distinguish in my head between the logical string (of codepoints) and its encoding (UTF-8 bytes), but even so it's easy to mix them up.

I'm not familiar with Reynolds, so I can't say. I meant that types encode property assertions as mathematical terms about which one can reason. There are many encodable things, a small subset of which we commonly think of as "types".

Reynolds

"The Meaning of Types - From Intrinsic to Extrinsic Semantics", John C. Reynolds, BRICS Technical Report RS-00-32, December 2000.

Types are a PL construct,

Types are a PL construct, and thus are about how a value can be manipulated in the PL. If you say a string is an array of characters, that's not about the underlying implementation but about how you manipulate it in your code. And as for characters themselves, again it's a question of how you want to manipulate them.

Well said.

Putting it that way, I

Putting it that way, I realize one argument for having character as its own type is that it has variable size. If characters aren't their own type, and each character is of variable size, how would you talk about a string's length in terms of characters?

Thanks for clearing that fog for me!

Agreed - sort of

The size of a character is an implementation attribute rather than a semantic attribute, so it is not a justification for a distinct type in and of itself.

It took a while to notice, but the practical utility of the char type basically went away when UTF-8 was introduced. The only operational difference between char and string in modern languages is that char leaves out some operations (though they would be innocuous if present).

Actually, the more interesting part of the problem is that Unicode shows us fairly conclusively that the entire notion of a character, in the sense that ASCII or ISO Latin-1 imagined it, is incorrect. Unicode strings do not contain characters; they contain code points. The two are not the same, and the distinction exists for compelling reasons. Semantically, chars never existed.

Primacy of Text

There was an article a while back by Manuel Simoni, on The Supposed Primacy of Text. Also related, Bret Victor's Explorable Explanations.

I often think about these articles when thinking about what the 'type' of text should be. I feel we've done ourselves and our users a significant disservice by modeling 'text' as a mere array of bytes or codepoints, or the unit of text as a 'character'.

Why not think of the base unit of text as an 'explanation' or 'idea' or 'argument' or something else that can be explored and composed and rendered in various ways?

I've tried exploring it a few times, and I have an idea that an 'idea' is something similar to a soft constraint model - i.e. a vague, ambiguous, heuristic restriction of properties relative to the blank sheet of paper. Composition of ideas can further refine them with constraints and decisions. Explanations are a form of composition and refinement, and could be computed in some cases. An edit-time tool could explore and render different refinements of an idea, and help users select between potential explanations and lock them in.

But this is a pretty deep subject. And it seems related to the gap between PL (conventionally, very rigid and precise) and natural language (very elastic and ambiguous). I've been approaching it from the PL side, contemplating various features required to even get started with a new conception of text. (E.g. good support for metaprogramming, 'syntax' that is so extremely user-defined that it could admit an SQLite database, etc.)

Modeling

We have all these related concepts like text, typeface, characters, character arrays, glyphs, images, colors, color models, pixels, subpixels, rasterization, aspect ratios, printing. The list goes on. A more general phenomenon to understand is how to allow programmers to pick levels of abstraction and map between them.

The same types of issues apply in math to tensors and other gadgets. We have vectors and forms, tensors, complex numbers, quaternions and octonions, geometric algebra, tensor bundles, conformal structures, etc. There's overlap in the abstractions. You can often look at a gadget as existing in different spaces under different abstractions.

Ratcheted Abstractions

It is easier to map an abstraction to a more detailed representation - e.g. characters to pixels, or DSL to GPL - than it is to go in the opposite direction. Further, this is lattice-like, e.g. we could map a DSL to two different GPLs, then map both GPLs to x86.

Mapping text to glyphs could involve choices in this ratcheted lattice between Bézier curves, SVG, Logo Writer code, PostScript, or Unicode arrays.

Abstractions can be understood as objects: a substance or medium with various properties. After picking one, we'll tend to be stuck with all of its properties, not just the few we hoped to leverage.

So, I wonder how to defer making choices, especially choice of abstraction, for as long as possible.

My rough idea is to program at a meta-level where we can express intentions and modularize multiple strategies to implement intentions upon unspecified abstractions with specified properties. The choice of program abstraction would be left to program search, tuned by feedback and refinement. But now I need an abstraction for the metaprogram. Soft constraints seem viable, but incomplete.

Just thinking...

Greater generality/abstraction obviously covers a larger swath of cases, but may also be harder to cope with because it lacks exploitable specialized structure. I'm reminded of Saunders Mac Lane's meta remark, in Categories for the Working Mathematician, that "good general theory does not search for the maximum generality, but for the right generality." (He was talking about adjunctions.)

Strategies without Commitment

Most software doesn't have anything resembling a theory. It just has a hand-wavy idea like "A big red button that launches nukes."

A programmer, encountering this idea, might reach for their favorite choice of button providers (I choose you, LabVIEW!) and perhaps a toy implementation for 'launching nukes' (Randomly display one of these videos of nukes exploding) then spend several days struggling to display videos in LabVIEW.

A small refinement to the idea might be "A big red button that launches nukes. As a web app." But, to a programmer who has already committed to choices about architecture and abstractions, this will often be a huge change that requires rewriting all the things.

What I'd like to do is somehow separate the expression of intention vs. implementation strategy. Instead of building a brittle bridge between intention and implementation, the programmer would either represent intentions as something machine-meaningful, or extend a catalog of strategies. They could express commitment to a choice by extending the requirements, e.g. add a "toy" refinement for "launching nukes" in case someone later develops a strategy for launching the real thing. But programmers would avoid directly expressing design decisions.

It seems to me that we can still have a lot of "exploitable, specialized structure" within the different implementation strategies without committing to any particular decision.

Of course, this entire path only begins to become feasible with a sufficiently mature catalog of components, and sufficient support for reflection.

flexibility?

("Most software doesn't have anything resembling a theory." — The Mac Lane quote occurred to me because of what it says about generality, not for its fleeting reference to theory.)

Caveat. There seems to be a great deal of potential here for talking past each other, as the meaning of just about every key word we're dealing with here has ambiguity that could be discussed at great length without, most likely, getting anywhere.

For what I take to be your central point, I might suggest the terms interface versus implementation. Of possible choices of abstractions in the implementing programming language, presumably some would impose more constraints on the software interface, while others would afford more interface flexibility; one would therefore want to understand how different choices affect this design dimension, and choose something that affords maximal flexibility.

I'm not sure what the reference to reflection is about, tbh.

Re flexibility

I don't believe the 'interface' concept is effective for describing requirements or desiderata. When we start thinking about interfaces, we're already several steps removed from the original ideas and intentions. A single idea might have a huge surface area when projected onto an interface, and conversely an interface element can relate to many intentions. Further, there are design choices related to squeezing everything into the PL.

It's a boon to automation if software interfaces are more oriented around being robust, scalable, composable, efficient, easy to analyze. All the normal reasons to squeeze ideas into a computation model. Flexibility is potentially useful to humans, but non-sapient machines won't benefit from it.

So, my (very long term) goal isn't flexibility of interface so much as automating the "derived" design requirements and software design, shifting programs mostly into the domain of ideas, intentions, outcomes, high level requirements and desiderata. Another layer upwards for humans, high enough that internal and external interfaces and deployment are also implementation details, subject to the same constraints as everything else.

The point of developing strategies is to improve search performance and the quality of results. It would become a diminishing aspect as machine learning takes a bigger role.

Unless you're talking about syntax/dsl/etc. as the program's "interface" to the programmer? We do need flexibility in that interface, to ensure expression of ideas is very direct, minimally damaged by translation into PL.

A few common forms of damage to ideas are detailing, fitting to medium, and fragmentation. Being vague and hand-wavy, covering a wide scope and being subject to refinement, seems to be an essential feature of ideas. Reshaping an idea to fit the computation medium is unavoidable (even fitting ideas into natural language can be a problem) but should be minimal, keeping the idea recognizable. Fragmenting ideas, sprinkling the fragments like seasoning over a codebase, prevents us from later refining or removing the idea.

Regarding 'sufficient' reflection: opaque behavior is problematic for program search, but I probably don't need runtime reflection - staged is sufficient.

Representations

I've discussed "representation selection" in the past here, and am working to build it "soon", but I'm still not done with my language proper.

Step 1 is organizing the semantics of your language & libraries in a way that avoids commitment to unintended choices as much as possible. This step is hard on its own. The most general setting for your algorithm may be a commutative ring with properties X, Y, Z, but for the setting you actually need, you can just write Int. But if you plan to reuse the algorithm, you might want to define it against the more general abstraction.

Step 2 is a set of tools for extracting a representation for the semantics you've defined. I see a lot of commonality between rep selection and theorem proving. Finding a representation will be much like using Coq tactics to find a proof. It will be inherently brittle, requiring updates in response to changes to the higher level specification, but we can try to mitigate that with more resilient tactics. Ultimately, we have the spec and can measure the effect of optimizations, so I expect we can get better and better tactics and eventually maybe machine learning techniques.

Search for Representations

I think that automating search for representations is a good idea. There are many ways to represent, say, a sequence of values: array, struct, intmap, linked list, finger tree, lazy generator. It would be nice if the compiler could pick smartly, based either on global knowledge of how the sequence is constructed and consumed, or on some type-level annotations for desired performance characteristics.
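As a toy Python sketch of the idea (the usage hints and the `make_sequence` name are invented for illustration, standing in for what a compiler might infer rather than any real facility):

```python
# Choosing a sequence representation from a declared usage hint, standing in
# for what a compiler might do with global knowledge or type-level annotations.
from array import array
from collections import deque

def make_sequence(values, usage="general"):
    if usage == "numeric-bulk":      # dense numeric data, bulk reads
        return array("d", values)    # unboxed array of doubles
    if usage == "deque":             # frequent pushes/pops at both ends
        return deque(values)
    if usage == "streaming":         # consumed once, possibly unbounded
        return iter(values)          # lazy generator
    return list(values)              # default: boxed array

xs = make_sequence(range(1000), usage="numeric-bulk")
```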

Do it yourself tactics

I think the important thing is to expose it as meta programming. Then people can build their own libraries of tactics for their own purposes.

re: do it yourself meta programming

Static analysis, type-checking, optimization, compilation, parsing, etc. should all be modeled as meta programming, with the ability for users to define their own libraries.

My approach to this: Every module computes a value (of dictionaries, lists, and natural numbers). To process a module file `foo.xyzzy`, I search for a module named `language-xyzzy`, whose value represents a program to parse and compile a file binary into another value. A language module can load and freely process another module's value. The bootstrap consists of defining a simplified language module in terms of itself. Compilation can be represented by a module that computes an executable binary. Instead of a command-line compiler tool, we have a command-line binary extractor. The compilation logic, including static analysis and decisions about runtime representations, would be expressed within various modules and can be fully user-defined.
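A minimal Python sketch of that dispatch (the `registry` and `load_module` names and the trivial bootstrap are my own stand-ins, not the actual system):

```python
# A module's value is produced by the language module named after its file
# extension; the "compiler" is just another value in the registry.
import pathlib

registry = {}                              # module name -> computed value

def load_module(path):
    p = pathlib.Path(path)
    ext = p.suffix.lstrip(".")             # "xyzzy" for foo.xyzzy
    lang = registry["language-" + ext]     # value representing a parse/compile program
    value = lang(p.read_bytes())           # file binary -> another value
    registry[p.stem] = value               # other modules may load and process it
    return value

# Bootstrap stand-in: a hand-written language module for ".xyzzy" files.
registry["language-xyzzy"] = lambda raw: {"source": raw.decode("utf-8")}
```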

However, effective support for program search isn't just about exposing meta programming. That just provides greater opportunity for users to also be language designers. It's still left to language designers to carefully design a program model that supports desired extensions, compositions, adaptations, analysis, optimizations, annotations, etc.

For example, I can represent floating-point numbers using just dictionaries and natural numbers. It's doable, though dealing with negative zero is still irksome. But for a high-performance runtime representation, or accelerated compile-time computation, we could ideally use a compact representation with direct support from the CPU. Making this possible would require some expressive refinement types, and encoding the CPU's logic for floating-point arithmetic. That logic can get a little awkward because some CPUs use extended precision internally. I feel this sort of problem should already be very well solved for potential users, preferably without the shortcut of just adding a floating-point primitive type.
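For what it's worth, a small Python sketch of what such an encoding might look like (my own illustration using the IEEE 754 bit fields as naturals, not the comment author's actual representation):

```python
# An IEEE 754 double decomposed into a dictionary of natural numbers.
import struct

def float_value(x: float) -> dict:
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return {
        "sign":     bits >> 63,              # 0 or 1, so -0.0 stays representable
        "exponent": (bits >> 52) & 0x7FF,    # biased exponent, a natural
        "mantissa": bits & ((1 << 52) - 1),  # fraction bits, a natural
    }

print(float_value(1.5))    # {'sign': 0, 'exponent': 1023, 'mantissa': 2251799813685248}
print(float_value(-0.0))   # {'sign': 1, 'exponent': 0, 'mantissa': 0}
```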

Same for dealing with boxed and unboxed structures, etc.

Agree

Agreed on your first point. My language will be fully meta-circular, even to the point of proving itself consistent (as close as possible). I should follow up on the old thread where I was trying to get around Gödel and post my solution - I just wrote it up for someone a couple of days ago. Figuring out how to make tools that compile themselves without losing consistency strength isn't very practical, but I was happy to finally solve the puzzle.

I also agree that you need to avoid the Lisp problem of users trying to be language designers. And as you say, I think it's about providing them with a clean framework for doing so and providing examples for them to emulate when building their own extensions.

Regarding floating point, I'm punting on modeling CPU behavior at the moment. For the common case where users select Float as a black-box for computing with real numbers, I'm asking that users use Real instead, defined axiomatically and requiring Dedekind cuts or whatever to define semantically (non-computable semantics). Then we extract a representation of their code that simply represents Real with Float. Proving nice properties about this representation is going to be a hard problem, as you say. I say ignore it for now.

One other comment along those lines is that rep selection should be able to extract representations at the semantic level, not just in some IR. In the example of Reals being implemented with Float, the tool that does so should produce a new language-level value that type checks against code written directly against Float.

one walled garden to rule them all

Some model-driven things, like FOAM, which was mentioned on LtU before, seem to want to wrap everything up in a new higher-level abstraction that can automagically power a web app vs. a mobile app vs. a command line vs. whatever else.

At a low level, there is such an impedance mismatch among paradigms that I feel those wrapper things aren't easy to build in a way that supports all the features, let alone the UI/UX, correctly.

So then I wonder what are the 'mathematical/logical' abstractions for things like interactive systems?

Problems:

  1. Having just input -> calculation -> output seems pretty fundamental/simple (modulo issues of mapping types).
  2. There are features in the target system that themselves have dependencies or interactions within the target system that make them hard to make stand-alone. Like, in order to do some HTTP request you have to have some HTTP client object, and so on ad nauseam, so then you kind of have to wrap all that up into some middle layer that is the impedance-mismatch adapter; but then if you are successful you end up with OpenGL-style capability-bit hell, no?
  3. There are UI/UX nuances that are just different across target systems. How can one abstract those while honoring them?

Characters, words, paragraphs, ...

If you have characters, it wouldn't be far-fetched to also have words, paragraphs, links, and classifications like subject, header, and so on. It starts to feel like a markup language.

Sometimes you would like to traverse headers or words, sometimes just treat it as an array of characters. Some parts of the text might have other rules, like a URL.

Being able to switch between those abstractions while still keeping everything in sync would be nice.
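A small Python sketch of what that could look like (all of the type and function names here are invented for illustration): one structured value, with two views over it that stay in sync by construction.

```python
# A document kept as structure, viewable either by headers or as characters.
from dataclasses import dataclass, field

@dataclass
class Span:
    text: str
    kind: str = "plain"              # e.g. "plain", "url"

@dataclass
class Section:
    header: str
    body: list = field(default_factory=list)

doc = [Section("Intro", [Span("See "), Span("http://example.com", kind="url")])]

def headers(doc):                    # traverse the structure
    return [sec.header for sec in doc]

def characters(doc):                 # treat it as an array of characters
    return list("".join(sp.text for sec in doc for sp in sec.body))

print(headers(doc))                  # ['Intro']
print(characters(doc)[:4])           # ['S', 'e', 'e', ' ']
```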