Algebra Of Pointers

I have been thinking a little about pointers, and what kind of mathematical structure they form. They seem to be like a one-dimensional vector (as opposed to a simple scalar). If we use the terms location for a point in n-dimensional space, and distance for a vector, then we get the following properties, adding a distance to a location results in a location, subtracting a distance from a location results in a location, subtracting a location from a location results in a distance.

Where is gets more tricky is that adding a location to a location would seem to be invalid (I don't know what type of thing this is), and yet to average two locations, to get the mid-point seems perfectly reasonable. Topologically what is going on here? What is the type of a location added to a location?

If we treat all locations as distances with an origin at zero, then everything makes sense again in terms of types, except for zero itself which seems to still be a location and not a distance (or it all gets horribly self-referential).

Any thoughts on how to make sense of this, what the type of the sum of two locations might be, etc?

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

I think where you're going wrong is...

… that subtracting a vector from a vector results in a vector, not a scalar. Vector distance is another operation entirely.

Where it gets a bit odd is that a "raw" pointer has an implicit positive direction, but subtracting two pointers may result in a "backwards" pointer. Addition mirrors subtraction, and the result is another pointer.

Distance

I don't agree with this. Subtracting two pointers may result in a negative distance, but this is fine as it just represents the offset of the first pointer to the second.

The simplest example of this is decrementing a pointer (which is adding a distance of minus one to it). Consider the following 'C' code:

void f(int* p) {
   int* q = --p;
   int distance = q - p;
}

Note the distance between p and q is minus one, which is completely correct as it represents the offset applied to 'p'. Note the distance is an integer not a pointer, and it is negative.

All of this makes sense so far. Adding two pointers does not result in another pointer, consider an array:

[1,2,3,4,5]

The address of the first element '1' is zero, the address of the second element '2' is one, we can subtract them to get the distance of '1', but this is not an addess, it is the distance between the two pointers into the array. What happens if we add the address of '1' to the address of '2' the result is not an address, and it is not a distance, it is something else.

See: http://geomalgorithms.com/points_and_vectors.html

I think this has a clue in something called Affine Addition.

coordinate system

My take: A vector is a directed magnitude. There are no locations in a vector algebra, only displacements. There is a choice of coordinate system, which allows locations to be described by vectors, but the choice of coordinate system is arbitrary. An actual pointer in C is a choice of coordinate system, and once you have that you can add and subtract integers — which from a typing perspective are vectors, displacements relative to the choice of coordinate system. Conversion from one choice of coordinate system (pointer) to another would be an affine transformation — scaling for type of element pointed to, translation for address in memory.

Vector Space vs Affine Space

To treat everything as a vector ignores this important point from the linked article:

We have already seen that the difference between two points can be considered as a vector. However, in general, it makes no sense to add two points together. Points denote an absolute position in space independent of any coordinate system describing them. Blindly adding individual coordinates together would give different answers for different coordinate reference frames.

So although we can add vectors we cannot add points (pointers) except in the special case of the affine-sum.

In general vs. in special case

Points *can* denote absolute positions in space independent of any coordinate system describing them. But no usable representation of points does. They all describe the position of points relative to an origin, and that description is a vector.

Vector sums and vector differences are well understood. In averaging or subtracting two vectors, the difference between the endpoints and the origin cancels. In an average the resulting point is always the midpoint on a straight line between the two original points, and in a subtraction the result is always a point with the same distance and opposite direction from the first arg, as the second arg w/r/t the first arg. These operations have the same locally-defined results irrespective of origin. But in adding two points, the result depends on the origin according to the parallelogram rule of vector addition.

This is very similar to the set of mathematical rules that apply to simple numeric operations. The average of any two numbers is the midpoint between them regardless of where they are relative to the origin, and the difference is likewise the same distance and opposite direction from one of them as that one is from the other, regardless of origin. But the sum depends on the distance of each number from the origin as well as its distance from the other number.

In which case, you're no

In which case, you're no longer treating pointers as one dimensional vectors, but rather as one dimensional coordinates. Not the same thing at all.

Centre of Mass

Looks like we can define memory as a one dimensional vector space, and define an affine space over that. This results in being able to have either barycentric co-ordinates (effectively the weighted sum of two (or more) pointers where the weights must add to one), or an affine-frame where there is a single pointer (the origin) and a set of scalars which are multipliers for a set of distances, if this is cartesian then in one dimension there is only one orthogonal vector, so the affine-frame would reduce to a pointer + a distance.

The mid-point of two pointers is the affine sum 0.5p + 0.5q, and represents the centre of mass of two equal weights put at those locations. We can also define all the locations between p and q in this way. I guess this makes sense in terms of a one-dimensional line equation: p + x * (q - p), where 0 <= x <= 1. Of course this is assuming Reals, memory being strictly Integer adds another interesting layer to this.

dimentional analysis

the reason that averaging works on even elements is that it's solving the sensible equation:

pointA + n = pointB - n
pointA + 2n = pointB
2n=pointB-pointA
n=(pointB-pointA)/2

then find the point at n
PointA+n = pointA+(pointB-pointA)/2
=(pointA+pointB)/2

See, it works because you can cast it as finding the distance to the center.

Good Point

I like this way of thinking about it. Basically:

int* mid_point(int* p, int* q) {
    int d = q - p;
    int m = d / 2; // half the distance
    return p + m;
}

I guess the weird bit is we can calculate this the other way, by adding and dividing by two, and get the correct answer, even though the 'sum of two points' is not a real thing.

This seems a little like 'i' being sqrt(-1), in that calculations involving imaginary steps can have real results.

I wonder if what we have here is an imaginary type? A type that is not in the universe of real things (like 'i' is not in the universe of real numbers). Perhaps the dimension of "p + q" is an imaginary-point, or a point in an imaginary-space?

machine address

If you choose to understand a "pointer" as a machine address, then you have chosen a coordinate system in which the origin is the address that the machine calls zero, and the scale is such that a unit on the coordinate axis is one byte (typically; some machines use a differently scaled address space). So to do "pointer arithmetic" in that sense, you've already chosen a coordinate system and converted the pointers into vectors.

To amplify my earlier remark about affine transformation: Type-casting a C pointer is a scaling operation. Adding an integer to a C pointer is translation. C doesn't support reflection, which is also an affine transformation, but if it did, that would be reversing the direction of the array: if you start with a pointer a (of some type), set b=a+10, then set b to the reflection of b (for which there is no notation in C), then you'd have two coordinate systems with the same scale, and origins at a distance of 10 from each other using that scale, but with opposite sense, so that (a+10)==b and (b+10)==a. And then (a+5)==(b+5).

[edit: typo, 4 instead of 5; doh]

on a slight tangent

In the safe assembly language I've been playing with you can only add one address to another without overflowing its type (for field or index access). So a pointer to a field inside an array of structs is a kind of spiralling in to the answer. Mid points are meaningless in this (safer, more secure) worldview.

Torsor


http://math.ucr.edu/home/baez/torsors.htm

https://ro-che.info/articles/2013-01-08-torsors


Pointer-to-const is also a monad
Unit = &
Join = *
Map f = & . f . *

Re: Torsor

Great, I think this is what I was looking for. I think I managed to understand the idea topologically, but could not relate it to an algebraic construct.

Pointer + Size

A pointer plus a size defines an affine space over the vector space of the memory, is there an algebraic equivalent to a (Cartesian) affine frame, IE a pair of Torsors or a pair consisting of a Torsor and a Vector?

Memory model

I think a central question when we think about pointers is the memory model. How are we thinking of values in memory and what is the formal structure that describes them?

Assembly language programmers have a very flat memory model where memory is a large array indexed by integers. My understanding of the C standard is that its memory model is sensibly more abstract. I essentially think of a C heap as a large bag of unrelated values, each of them having a rich structure: a tree (nesting of structs), with each sub-region of the tree having possibly several views on it in parallel (for example as typed data and as (char *), simultaneously).

We discuss memory models mostly in the setting of concurrent access to shared memory (the adventurous "weak memory models"), but the notion is also interesting for language designs of sequential (subsets of) languages. For example, I find these kind of models more useful than the plain "array of {bites,bytes}" view to understand the more subtle aspect of C (one-past pointers, aliasing restrictions, etc.). They also have been formalized in details, for example in A Formal C Memory Model for Separation Logic, Robbert Krebbers, 2015.

In this context, a pointer can be thought of as an entry point into one particular value in the sea/bag of values, under a particular view of a sub-tree of the value. It is very clear that it is not possible to jump to any other values from this pointer (there is no adressable transition from one to the other), but that one can explore other parts of the tree in this value. It would also be natural to consider pointers with a fixed span, that would denote a strict sub-tree and not other parts of the value -- this is in fact a representation often adopted by bound-checking tools.

Some questions emerge, such as what it means to cast a pointer to a different pointer type, that I think are more fundamental than trying to impose on pointers a particular algebraic structure that is probably not going to fit the bill. They have also detailed answers in the various formalizations of these memory models that have been proposed, such as the one linked above.

Arrays of bytes

I think there are certainly pointers for which relative addressing makes no sense, but in many ways every pointer can be considered an affine frame, if it includes a size element.

The kind of pointers you describe have an implicit size of one, hence you cannot legally increment or decrement them.

We can also have affine frames of size more than one, which represents an array of values. Arrays are very important due to memory locality issues (which can be more important than the 'O' performance of the algorithm on modern hardware). With an affine frame we can allow access to any element in the frame as an affine space (zero offset addressing, with all operations available on an integer or vector) you may recognise this space as the array index (and this holds for multi-dimensional arrays).

So I don't think there is any question of which memory model, all memories are vector spaces, over which you can define affine spaces (defined by an affine frame). Algebraically this makes pointers a Torsor, and the difference between pointers a Vector.

The only meaningful 'casting' is to take one array index and use it to access a different array, which corresponds to rebasing the affine space containing the vector (index).

Memory model

C formally uses a segmented architecture. Addition of offsets to pointers or formation of offsets by subtraction is well defined if, and only if, the pointers point at the head of, into, or one past the end of an object.

Pointers which can be subtracted can also form a total order and support comparisons.

Function pointers live in a separate space and support no operations except equality.

Because of this C++ adds a requirement for a template expressing a total order over ALL pointers (namely less<>, I'm responsible for this one :) This is not available in C.

The consequences of the rules are profound. In C data structures containing sorted collections of pointers to heap objects all have unspecified behavior. In C++ the behavior is well defined provided the
data structures uses the less template as a comparator.

dlsym() cannot possibly be implemented because a void* is only a valid holding place for data pointers, it may not hold a function address.

Note: these are rules of the abstract machine. There are of course additional rules related to types.

C rules have been abused for decades, mainly by ignorant Unix programmers. Those of us who started with Dos on x86 machines know better. Despite the fact most architectures these days use linear addressing, ISO C never did. One must note, however, that working in an extension of C or using it for platform specific programming is perfectly legitimate, but the extensions should be documented. For compilers, such documentation is often a mandatory requirement of the ISO Standard.

C memory models are particularly difficult because the correctness of an operation is not statically defined but depends on the liveness of the underlying object required to permit a calculation. For example it is quite hard to answer a really basic question such as: if a heap object is deleted, can you copy a pointer to it?

You may think it is only a bit pattern but you're wrong. On the x286 processor using segmented addresses a pointer is 48 bits and includes a resource descriptor: loading an invalid descriptor into a segment register causes a machine fault. Are you sure you don't have to NULL out every pointer to a deleted heap object in another object you might copy??

the malloc model

In the malloc model you can say every pointer is a triple (region, size, offset).

Region ids and sizes are unsigned integers. Offsets are signed.

Pointer differences are defined only for same-id regions.

Pointer and (signed) integer sums are defined and yield a same-region pointer.

All pointers with the same region id have the same size.

The null pointer exists.

Malloc always returns a new region id.

The rules of when and how a pointer can be derefenerced are hairy and of course are different before and after free is called.

Realloc adds similar complexity.

The algebra seems untidy because of various partial functions (like "pointer difference" which is defined only for pointers in the same region).

Affine Space

I am not specifically targeting the malloc model, but thinking about the abstract algebra of pointers. By getting pointers to behave according to the appropriate abstract algebra, we can get something much more consistent than the malloc model.

I think we can treat regions as affine spaces. As we don't know the affine-frame you cannot relate coordinates in one affine space to another. This would be a finite affine space defined by an affine frame. The affine frame defines the origin with a Torsor,and the scale with a vector (this is the unit size for the affine space).
A region would be a Torsor (base address), a unit vector (the size of an object), and a size (how many objects/unit-vectors in this region).

By this definition malloc simply defines a new affine space.

The whole contiguous memory (as seen by the CPU) is a vector space, and hence pointers are Torsors. This view is useful when subdividing a region. It seems to me the external view of a region is that a region is defined by a pair of Torsors or a Torsor and a vector.

In reality there is no null pointer, zero is a valid address. Option types should be used instead.

So the model fits very well. I think we can define these things _from_ the algebraic constructs and get a usable pointer system without any of the untidyness.

Well

I think the suggestion of using the terminology "torsor" was that a pointer would be an element of the group of additive integers and an offset would be a torsor of that group. You wouldn't use "vectors" as these require a scalar field that you don't have.

Also, I don't see how you're going to recover regions from an assumption that there aren't regions. Getting rid of regions is simplifying and corresponds to a "flat memory model" where all memory is addressable with a single pointer from a single offset. I don't think this is necessarily what the CPU will see. For example, don't x86 machines still support mapping some segment registers to different addressable memory?

flat memory

I think you're missing some key points. NO modern desktop level machine has a flat memory space. They all use virtual memory mapping. Depending on the system, this is similar to a segmented architecture: it allows the same memory to be addresses with distinct addresses, and it allows addresses that "map" to segment faults (unmapped address space). Most OS allow adding disk storage directly into the memory space (mmap). Some allow processes to share memory (Windows certainly does).

The primary difference between modern VM architectures and x286 style segmentation is that the hardware support for VM allows more flexible arrangement of the segments, except for the 4K boundary restriction of the paging system. Its slower, but the cost is usually hidden by lookahead, pipelining, caching etc.

Behaviour of Torsors

Torsors cannot be added together, they can only have an 'affine sum' operation defined on them. This is like points in a vector space cannot be added together, but we can perform an affine sum (which calculates the centre of mass of the points). This models pointers which cannot be added together. So:

pointers = Torsors (you cannot add them together, subtracting a pointer from a pointer gives a vector, we can add and subtract vectors to a pointer, and we can define the affine-sum an example of which is the mid-point of two pointers)

offsets = Vectors (you can add together and subtract freely, as they represent offsets, any operation on a vector results in another vector, for example if we have an offset of 2 from the start of an array and an offset of 3, we can add these together to get an offset of 5, which is something you cannot do with a pointer/Torsor).

memory = A vector space (which consists of points (behave as Torsors) and vectors (behave as Vectors), so memory is a vector space, as we have both addresses (points) and offsets (vectors)).

a region = An affine space, we have relative address (a vector) relative to an unknown affine frame. An affine frame consists of a Torsor (the base pointer for the region) and a Vector (defining the scale factor, which is the size of the objects in this region). So the affine frame for a region of 32bit words starting at address 0x1000 would be (0x1000, 4). The affine frame lets us translate between the underlying vector space and the affine space defined over it. So if we don't know the start address and/or object size (because it is opaque) we cannot translate from the affine space to the underlying vector space, and we are 'stuck' in the affine space, in which we can quite happily operate with vector offsets. The easiest example of this is an integer array index which behaves like a one-dimensional vector, and we can add indexes together meaningfully.

As of x86, the 64bit instruction set abandoned segment registers altogether, and all modern operating systems map a processes address space as contiguous memory starting from address zero. The MMU is used to make the process think it has a flat-memory-model machine all to itself.

Yeah I bungled my words

Yeah I bungled my words about torsors. My point was that you (technically) might not want to use vectors, but rather elements of the additive integer group. I think I'm just recalling that some compilers were using segment registers for thread local storage. That memory was probably aliased and also accessible via. some flat address.

Integers and Vectors.

Doesn't an integer behave like a one-dimensional vector?

Nitpicking

Vector spaces are defined in terms of a scalar field, which would include all the real numbers, which you probably don't want.

Even modules

Even modules are more than you want since they're over a ring.

[Or are they?]

Realistic

It would be realistic at least, as CPUs tend to wrap address spaces rather than have an overflow error, but I am not sure if it is a good idea to require this behaviour.

I think the critical properties are the difference between a point and a vector. Memory locations are like points (adding them is nonsensical, but an affine sum makes sense) and offsets are like vectors (and the relationships hold subtracting two points gives a vector, you can add and subtract vectors from points, you cannot add points, and you can add or subtract a pair of vectors). These are the important relationships I think.

Reals vs Integers

Yes, you are right of course. There is something called a finite affine space that looks like an integer version of an affine space. I would guess a finite vector space us the same for points?

Vector space: no

The CPU doesn't see a vector space. For a start that would require a field of scalars (integers don't form a field, integers modulo a prime do but that is hardly what is seen). So you're probably looking for the generalisation of a vector space, which is called an R-module, which only requires a ring. However the reality is nothing like that, the reality on say an Intel CPU is that an address is just a bit pattern and you can apply various instructions to it including addition and multiplication.

The whole point of a type system which has a model of pointers is to restrict the allowed calculations. No such restriction exists in C99 with casts since intptr_t and uintptr_t allow all arithmetic on pointers.

The question is not: what kind of algebraic structure do pointers form, but what kind of algebra SHOULD they form, that is, how should they be typed?

There is no simple answer: Cyclone, Rust, and Felix all have complex answers. ATS2 probably has the best theoretical answer (using dependent typing to define the size of point at objects).

FYI: Felix has an interesting model. Data types like arrays, records, tuples are product types. Values are first class (purely functional, immutable). Components are selected by projection functions (and they can be first class, i.e. be used as values).

Ok, so if you store a value in a variable, you get an object. That is, you can get the variable's address. So here is the trick: every product projection of type T is overloaded to also accept a pointer &T and returns a pointer to the corresponding component.

So now, the calculus of offsets is simply a composition of the projections. This gets rid of lvalues and removes the need for any kind of offset arithmetic. Now we have an algebraic structure for you: it's a category whose arrows are projections, the sole operator is composition.
[Well technically we also have closure formation]

So the question is reduced to: what kind of product types does the system support, because that defines the category of pointer operations.

What should they form

I am thinking of a fairly normal semantics, so that there are objects stored at addresses. Externally of any object address arithmetic is meaningless, however objects may be composed of sub-objects, and address arithmetic within an object may may sense (depending on the object).

The starting observation for all this was if I have a memory range between pointers A and B (where I don't know the base address of the space), I get something meaningful from subtracting A from B (the distance between the pointers). However adding the pointers gives nothing meaningful (as we don't know the base address.

For example consider an array of 32bit words starting at address 0x1000 and ending at address 0x2000. I can take any two address pointers in the object, say 0x1010 and 0x1020, and subtract them to get a meaningful distance of '10' units. However adding them together is not valid, in fact it gives an address outside the object itself (0x2030) so adding the pointers together is meaningless. However averaging the pointers (0.5 * 0x1010 + 0.5 * 0x1020) = 0x1018 which is meaningful as it is the mid-point between the pointers.

since my comment has been removed

Since a pretty banal comment of mine has been censored without prior notice or subsequent acknowledgment, I think this site is disgraceful.

Bye.

Hu ?

Are you sure this is not simply a database issue? This would seem more likely to me that a sudden, radical change in moderation behavior.

I suspect raould


i am man and superman

was i granted some superpower on the ltu site i never knew about? wow. cool. :-) i wish somebody would tell me so i could decline it.

I didn't see it.

I have not seen the comment, so it definitely has nothing to do with me.

I would say that I am interested in discussing the mathematics of addresses and offsets, rather than discussing if addresses and offsets are useful for programming, which is an entirely separate discussion. So if it was relevant to the discussion please repost it.

When was your comment posted?

When was your comment posted? There was a database crash on Sunday, due to overzealous webcrawlers. It's possible that caused an issue.

When was your comment posted?

When was your comment posted?

Monday.