Proposed extension to C - array size declarations

Somewhat to my surprise, I may have found a politically acceptable way to extend C in a way that leads to the prevention of buffer overflows. Many people have struggled with this, but the previous attempts led to incompatibility, excessive overhead, or a new language.

It turns out that combining C fixed size arrays, C++ references, and C99 variable-length automatic arrays seems to lead to a workable solution. An example is declaring the UNIX read call in this way:

Present C form of declaration:
int read(int fd, char* buf, size_t n);
Proposed form:
int read(int fd, char (&buf)[n], size_t n);
This is size-safe when called from new code, compatible with old code, and type-checkable at calls.

There's more, of course; see the draft paper: "Safe Arrays and Pointers for C" (PDF)

Now I need to find out if someone can find a flaw in this, so it needs to go before a qualified critical audience. So I'd like to see what the LtU crowd has to say about this. Thanks.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Unions

In most existing code, unions are used primarily for data being passed through networks, files, and message passing systems. Such data, since it comes from external sources, seldom contains pointers.

I think that's true of non-discriminated unions where the union is just used for (not so portable and technically undefined) type punning. But with discriminated unions I suspect pointers in unions are pretty common.

Re pointers in unions

I suspect pointers in unions are pretty common.

I miss Google Code Search. Then one could quickly answer questions about how common some construct was.

C, of course, doesn't really have tagged unions. You can have a union inside a struct with a tag, but the language doesn't connect the tag with the variant. If variant records (as in Pascal/Modula/Ada) or derived types (as in C++) were to be added to C, that would be a separate proposal. There may be a case for that. It needs new syntax, though, and I'm trying to avoid that if at all possible.

All alternatives of a union must have the same size

You say:

Unions cannot contain pointers or references, and all alternatives of a union must have the same size.

Do you mean array size or do you really mean we should manually pad the union?

I note that you sometimes use:

void_space (&)buf[n]

And you sometimes use:

void_space (&buf)[n]

The former is more readable. I think both look weird in C code.

How are you going to type

How are you going to type strcat?

strcat is deprecated, for

strcat is deprecated, for good cause.

strncat we can handle:
char (&strncat(char (&destination)[num], const char source[], size_t num))[num];

Except that "num" is a variable here, this is valid C++ syntax right now. (I didn't come up with that; that's the C++ standard.) The return type is the same as the input type for "destination". This maintains compatibility with the current API. Old calls will work in non-strict code. In strict mode, we might code something like this:

char dest[30] = "Hello";
char s1[] = "World";
auto ss = strncat(dest, s2, lengthof(dest)); // ss is a reference to dest
int stat = write(fd, ss, sizeof(ss)); // write to fd

Note that we have to pass the size of "dest" explicitly, just as we do now, but the compiler can check that we did it right. Probably at compile time. The compiler has to be able to verify that lengthof(dest) == lengthof(dest), and lengthof(ss)*sizeof(char) == sizeof(ss) to determine at run time that the checks are identities, are true, and can be optimized out.

Const arrays can be ambiguous as to size because there is only a risk of reading, not overwriting, invalid data. Clamping down on const arrays breaks too much existing code involving string constants. Projects writing security-critical code might want to restrict the use of the "[]" form, but making that mandatory would result in screams from the C programming community.

The problems I see are

The problems I see are twofold:

  1. Your proposal is much heavier than just implementing bounded arrays a-la Pascal, and even that never came to C. (Except for as libraries, and that's usually good enough for C programmers.) IF you're adding a language and runtime checks probably in the form of assertions, THEN you might as well implement Pascal style -lightweight- generated runtime bound checks.
  2. I've seen proposals like this before. I even made a similar one for a little-known compiler named Clean while I was a student. It's basically a form of dependent typing. The reason I mentioned strcat was because you'll quickly end up willing to express types like (in functional notation)
    fstrcat: char^n -> char^m -> char^(n+m)

    Something which opens a complete new can of worms because the compiler would need to prove a dependent type correct which it probably won't be able to do at compile time.

    Re: The problems I see are

    Your proposal is much heavier than just implementing bounded arrays a-la Pascal, and even that never came to C.

    It's not that bad. It doesn't require array descriptors or "fat pointers", like some of the past "safe C" projects. So the run-time representation of user-defined data doesn't change. (Contrast with the Safe C Compiler). This allows interoperability with old code.

    This proposal doesn't introduce new syntax, like CCured or SAL. It just relaxes some limitations of C syntax, while importing a very few known features from C++. This is for programmer acceptability.

    ... because the compiler would need to prove a dependent type correct which it probably won't be able to do at compile time.

    This proposal only allows talking about the size of arrays. It's not a full-scale dependent type system. One could create the dependent type matching problem by extending this proposal to C++ and then trying to make it interact with the matching system of templates. I am not proposing that. C is sufficiently limited that this isn't going to lead to undecidability issues.

    Where run-time checking is required, it will never be more complex than comparing two integers.

    And there you have it

    but making that mandatory would result in screams from the C programming community

    And there you have it. You can have a new law passed sometimes under just one year affecting an entire country's people on whether they can or cannot be sent to jail based on the law's criteria applicability held in a couple sentences of the legal jargon, but ...

    when it comes to just consider hard-deprecating, after decades past inception, a silly single syntax construct [or function] among hundreds or thousands [of a runtime library among thousands] of a computer language among thousands, though only used by a tiny(*) bunch of coders... the "weight of the legacy" makes for a whole different strength in its meaning.

    Our curse, folks. But also an interesting challenge.

    (*) (at least relatively to the expected-to be law-abiding general public population aforementioned, that is)

    Re: And there you have it

    Our curse, folks. But also an interesting challenge.

    Exactly. From a language theory perspective, this is a solved problem. We all know it can be done. The problem is solved in other languages, some of which are quite close to C.

    But not quite close enough. This is a minor enough and compatible enough extension to C to be within the range of political acceptability. Once the technical problems have been hammered out (which is why I'm doing this here), the political problems begin. At least one compiler implementation group needs to be convinced to try it as an extension, and then it needs to be pushed forward through the C standards process.

    The payoff is large. Every day, millions of programs crash or are compromised through buffer overflows. Almost always, C code is responsible. A practical solution to that problem is worth pursuing.

    The payoff is large. Every


    The payoff is large. Every day, millions of programs crash or are compromised through buffer overflows. Almost always, C code is responsible. A practical solution to that problem is worth pursuing.

    All of which could have been avoided had a Pascal dialect taken C place at the right moment.

    Unfortunately neither Turbo Pascal nor the too late Extended Pascal ISO standard were able to achieve that. Then Modula-2 was also largely ignored by the industry.

    I think the only hope to fix C's easiness to security exploits is to hope that eventually Ada, Go, D or C# (natively compiled like in Singularity) will manage to do that, if ever.

    Go, C#, and D are

    Go, C#, and D are garbage-collected, so they're not well suited to low-level code at the OS, driver, and real-time level. Ada, like Modula, never really caught on. We seem to be stuck with C as the predominant low-level language, like it or not. Thus we should fix it as best we can.

    Well, Native Oberon and A2

    Well, Native Oberon and A2 Bluebottle proved that it is possible to have operating systems with powerfull GUIs implemented in GC enabled systems languages, with Assembly restricted to the boot loader, and hardware interface for the device drivers.

    Operating system that was widely deployed at ETHZ IT department on its golden days, as far as I am aware.

    The Project Oberon book is well worth a read.

    What if "n" is altered?

    What are the precise semantics of an array's lengthof operator? Let's start with lengthof as applied to a C99 dynamic local array:

    int n = 100;
    int array[n];
    n--;
    assert(lengthof(array) == 100);

    Does the assert fire? In a naive implementation, it would, but then lengthof(array) would not be a function that told you the length of an array! Or, you could squirrel away the value of n as (say) const int __array_lengthof - but then you're introducing hidden local variables which may not be politically correct in C.

    We can extend this example to functions & structs.

    void copybyref(int (&a)[n], const int (&b)[n], size_t n)
    {
    n--; // <== decrement n
    for (int i = 0; i < lengthof(a); i++)
    { a[i] = b[i]; }
    }

    What happens here? Do the loops run one fewer iteration, or is the value of n on entry squirrelled away (perhaps twice, to __a_lengthof and __b_lengthof)?

    Finally, the struct case, I think, ties your hands:

    struct Array
    {
    int (&array)[length];
    int length;
    };

    Array->length--; // <== now, surely, array is shorter?

    Surely now array must be shorter, because you cannot introduce hidden constant fields into a struct? We must have lengthof(foo->array) === foo->length here, I think?

    Should the length field therefore always be const (to match the const nature of the ref binding) to try to avoid some of these issues?

    Re: What if "n" is altered?

    Those are good questions.

    In general, the length value is bound at initialization time. The first and second cases above do imply that the system must create a local temporary variable to retain the length. In the paper, I wrote "In the above worst case example, the compiler must generate a temporary to store the size of the array, because the inputs to the expression defining its size change. This is usually unnecessary." Compilers create temporary variables on the stack all the time, so that's not unreasonable.

    I don't think there is any case where the length can't be kept as a local temporary. Parameters and local variables are handled as above. Static and global variables are sized at compile time, so they're not an issue. Elements of struct variables fall under the struct case below.

    The struct case you give is tougher. We're trying to construct an object with internal consistency semantics in a language that doesn't have constructors. However, we do have C initializers, which can initialize a whole struct at once, and can initialize a const field. We could insist, as you suggest, that a field used as a length must be const, which forces initialization of the entire structure through structure initialization or replacement from another structure. I think you're right to add a "field used as input to a length expression must be const" rule. That wouldn't impact existing code, because existing code won't have length expressions.

    Initializing such things statically is straightforward. Initializing them with with malloc is going to be ugly. It's going to take some macros which use force_cast to do it cleanly. That area needs more design work.

    Thanks.

    static or dynamic?

    If C compiler wizards hang out here, I missed them. With your permission, I'll play low-key devil's advocate and just barely criticize. If you want more aggressive critical feedback you'll have to say so, because I'd rather be respectfully agreeable.

    Preventing buffer overflows is a fine goal. But can static declarations give a compiler enough info to see buffer overflows? What if logic involved is quite dynamic, because you have your own memory management; it might require figuring out what complex code means to see if a write ever goes beyond the end of space you intended.

    Do you only intend to detect buffer overflows in C's stack handling, rather than in user code memory management? I'm thinking: a C compiler might see where you declared space, and see you passed arguments allowing too large a write to occur, then warn you. But I can also imagine code for memory management a C compiler can't figure out, forcing it to shrug.

    When you say "compatible with old code" you don't mean it compiles in old C compilers, do you? If I stick that read() declaration in a C header file and compile in my current environment, the compiler emits a message like this one:

    In file included from foo.c:63:
    foo.h:1234: error: expected `)' before `&' token
    foo.h:1234: error: expected `;', `,' or `)' before `size_t'

    I think you mean if folks upgrade their compilers, new headers will be understood and old code will still work without changes. Doesn't that make headers using new sytnax depend on new compilers?

    I'll presume personal perspective is relevant and tell you how I try to avoid buffer overflow. Part of my point will be that it's not foolproof because some dynamic parts can defeat analysis. If lifetime of info describing arrays (and extents) is dynamic enough, you can't figure out whether it's stale by time of use, except indirectly.

    I haven't done this recently, but I wrote a lot of scatter/gather code using clones of struct iovec, so each buffer is described by pointer and length, and arrays of iovecs are also explicitly annotated with length. (Somebody has to write such code in network stacks, and I did it for products I worked on; so if a buffer overflow happened, blame would be mine. Note I know more about software than networking.)

    Using iovecs (or iovs for short) this way, one can aim to never write more bytes than a buffer describes. This works quite well, except when someone changes the iovecs while still in use, and this really happens before code is fully debugged.

    Hypothetically, suppose we define an async version of readv() named areadv(), where the caller passes an iovec array as part of a request to fill those fragments of memory with the number of bytes requested. Part of the contract is this detail: until callback when complete, the caller of areadv() can neither free the iovec array nor change the size of any iovec in the array, since this would subvert the areadv() implementation. (We actually do this, but the streams are not files and the method is not named areadv(). Sometimes several coordinate systems are involved in dedup, and it gets interesting.)

    Guess what happens? Sometimes the caller changes the iov_len fields while areadv() is still running, which is detected by seeing the sum of every iov_len field change between the call start and when writing is done. Naturally a fix is urgent. But how would changing C function declarations be able to find this problem's cause? I assume it's impossible, but I wouldn't mind learning otherwise.

    I have trouble seeing how static analysis in a C compiler will prevent dynamic async lifetime problems with non-deterministic control flow.

    Re: static or dynamic?


    But can static declarations give a compiler enough info to see buffer overflows?

    Short of full proof of correctness systems, no. However, that's not what is proposed here. What is proposed is to be able to say, within the C language, how big an array is. This allows a compile time crosscheck between function caller and function callee. It allows run-time checking if desired. It allows programmers to access array size size information from within code.

    When you say "compatible with old code" you don't mean it compiles in old C compilers, do you?

    No. This is a proposal for a backwards compatible extension to the C language, for the next generation of the C standard.

    I have trouble seeing how static analysis in a C compiler will prevent dynamic async lifetime problems with non-deterministic control flow.

    Me too. Although Microsoft's Static Driver Verifier does do something comparable. However, if you were required to change the size of the object and the pointer to the buffer in one assignment or initialization operation, it would be harder to dig yourself into that hole.

    To really solve thread-type async problems at the language level, you need something like Java "synchronized" or an Ada rendezvous. A fundamental problem with C asynchrony is that, while there are locks, the language has no way to say which lock locks what data. Fixing that would take the C language too far from existing C practice.

    (At various times in my career, I have spent considerable effort fixing defects in asynchronous code written by others. I thus tend to take the position that one should not try to be too clever in asynchronous code.)

    My own critique

    Now that this has been up for a few days, and no one has yet found a killer flaw, I'll say what I consider to be the major problems with the proposal.

    There's an implicit style associated with this kind of safe programming. It wants you to package up your global variables into structs, with size information about those variables in the same struct. This is generally considered good programming practice today. In object-oriented languages, you're almost forced to program that way. It also wants you to declare and initialize variables at the same place, with local scope, almost in a single-assignment style. If you write that way, this proposal won't affect your programming style much.

    But there are programmers, and programs, which use a large number of global variables. A long time ago, those were the people who came from FORTRAN. The style remains. There are programs with huge include files of global variables. A typical program in that style might obtain a default table size from a command line option, put that number in a global variable, and then allocate buffers accordingly. To express that safely, the pointers to those buffers need to go into a struct (not necessarily struct per buffer) so that the buffer can be tied to its size. We can't tie a size to a global non-const variable with this approach.

    Arguably, what I'm proposing forces a bit of object orientation onto C programmers. There are programmers who just don't like that. It's not a performance issue; a reference to an element of a static struct is an address fixed at compile time, just like a global variable. But the programmer has to write structname.varname instead of varname, and some people will hate that.

    Also, the initialization of a struct with an array of known size has to be done in a rather rigid way, so that the struct is initialized or assigned all at once. This is unnatural to people who don't think in terms of object-oriented programming. Some C programmers aren't even aware that structs are assignable.

    This I see as the biggest practical problem. The people who read Lambda the Ultimate are unlikely to have trouble with these concepts, but there are many C programmers who aren't comfortable with even the more advanced concepts in C.

    Assignment to structs

    Assignment to structs (records) all at once is not unique to OO. See aggregates in Ada. I also recall this in pl/1... There you also had a subtle discussion of structs containing sized arrays, particularly in the context of io (reading such structs from file).

    If it were about a proposed

    If it were about a proposed extension to C++, given that my coding-fu in the latter as a whole or part of its most subtle specifics, is getting definitely too rusty, I'd refrain to comment more. Assuming I'd even be interested (unlikely).

    Hopefully, alike bike-riding you can't really completely forget after enough practice, my C still is in a little better shape, as I happen from time to time to be forced back into getting my hands "dirty", as they say, with maintaining, adapting, or wrapping some legacy C library sources still around (at workplaces, that is) and that we also P/Invoke with, etc. I sure wish I had more opportunities to design/write really new C code, but heck, I'm no decider there (believe it or not, I like to read or write C almost as much as, say, C#. Please don't ask me what about my taste for "full-blown" C++, though, sorry.)

    I also wish this comment be more useful, though, since I can't really consider myself as a "C compiler wizard" either, if I ever was one in some sense (or as a user anyway). On the "fun" side, I can only remember my days at Borland tech support 15 years ago, dealing with a few of our support contracts customers lucky enough to stumble on some "funny" C++ template instanciation "bugs" in Borland C++ 4.51 (or was it 4.52 ?) etc. Memories.

    Anyway...

    Thus I like very much, would that be only it, the intent of your proposal. I read your proposal text twice and I didn't find any much obvious flaw. I did understand that it is at the same time humble and ambitious: I especially like your pragmatism which consists in thriving to leverage the closest existing feature of the C++ static type checker back towards C to have good-old-grandpa finally benefit from something that it overlooked so far and which has little to do with the rest of the OO paradigm shebang thing found in C++ (classes, ctors, dtors, operator overloading, templates, ... you name it), in fact.

    Arguably, what I'm proposing forces a bit of object orientation onto C programmers.

    I think I see what you mean, but IMO, you shouldn't extrapolate too much for yourself/your proposal around that, as I imagine some people will gladly be prompt to try convince you (or discourage, maybe?) that's the sort of psychological aspect you can't overcome because of X, Y, Z, etc, as precedents.

    AFAICT, I don't think anyone can really know/project/predict usefully enough in those waters, and I mean this either ways. After all, a good proportion of people did have to learn, for daily practice, beyond the old K & R style for formal param declaration. (To just name one if-only-syntactic move & not quite inconspicuous either, yet not so much formidable it prevented language evolution in the early days.)

    What really matters is the consistency and well-definedness of your proposal, of course (aside the implementation cost/benefit ratio to discuss).

    Also, since that you're going to need the C compiler to be "slightly" thus "enhanced" anyway, if your proposal has success, I'm not even sure about this being a definite technical impossibility, actually:

    To express that safely, the pointers to those buffers need to go into a struct (not necessarily struct per buffer) so that the buffer can be tied to its size. We can't tie a size to a global non-const variable with this approach.

    Couldn't your new flavor of (known size arrays-) strict mode, in the compiler proper, be a tiny little bit more clever, and (optionally? I don't know...) make an implicit hoisting of "not-already-owned-by-a-struct" globals into an "invisible" (*), purposedly-crafted static struct for the current compilation unit ? (Totally artificial, as helper for the type checker) (**)

    Anyway, I wish you best of success and, as you pointed out several times already, which made perfect sense to me, given the stakes, the current legacy, the "humble" overall scope of changes to the compiler, and the proposal's outcome hoped for, I, too, do think it's worth pursuing.

    'HTH

    EDIT
    (*) to the programmer

    (**) Ouch. Wait, silly me. That would likely also require at least one change of its own, in pair, to the linker, there (re: the would-be hoisted externs)

    New draft of proposal for comment

    In response to everyone's comments, I've put up a new draft of the proposal: August draft. This looks almost ready to go out to a larger community. So I'd like to get more comments.

    The main changes involve the primitives needed to use malloc in a way which is type-safe and memory-safe. Without constructors, that's a touchy area. The problem turns out to be solveable through the addition of a new built-in type, void_space, and a generic built-in function, init_space. The goal is to avoid using casts for routine operations.

    I guess I frequent Ltu less

    I guess I frequent Ltu less often than I use to. Computer Security may have become enough of a concern (see CVE-Mitre for example) that one might be tempted to attempt to address some of these concerns at the language level.

    I have a couple questions:

    1) Can interprocedural analysis be used to infer the size of an array bound in a called function from the args list?

    2) Could similar techniques be used for malloc, free when including type inference?

    It would seem to me that if one take a liberal interpretation of C semantics one could in the worse case pass the size argument without the programmer knowing. SSA based compilers with graph coloring register allocators have already made certain keywords obselete like the register keyword.