Lambda the Ultimate

Language Myths Again
started 5/12/2003; 12:44:52 PM - last post 5/18/2003; 1:58:48 PM
Isaac Gouy - Language Myths Again
5/12/2003; 12:44:52 PM (reads: 1022, responses: 22)
LtU has a Quotations page.
Maybe there should be a Language Myths page?

There was a light-hearted discussion of this last year. It seems there might be a more serious purpose: debunking the kind of myths that undermine new programming languages.

"The aim of this book has been to demonstrate that functional languages, in particular SML, can be used to build the kind of real-world projects that you might ordinarily only think of using C or C++ for... Unfortunately programmers are prone to myths about programming languages and seldom revisit them... Here are some common ones

Lisp is slow. Optimising compilers have been available for years.

Ada is huge and bloated. What, C++ isn't?

C must be efficient because you can program down to the bare metal. Nope, its 1970s-vintage machine model no longer fits well with modern computers."

Unix System Programming with Standard ML

Is The Truth out there?

Ehud Lamm - Re: Language Myths Again
5/12/2003; 12:48:09 PM (reads: 1019, responses: 0)
Maybe there should be a Language Myths page?

I think that in the discussion you mentioned I said that if someone created a nice page (with links to LtU messages, for instance), I'd be glad to host it here and link to it from the navigation bar on the left.

Isaac Gouy - Re: Language Myths Again
5/12/2003; 12:54:30 PM (reads: 1018, responses: 1)
Excellent! Now we just need to decide on our Top Ten Myths and sort out Myth from Truth ;-)

(I'm quite open to some of the Myths turning out to be true)

Ehud Lamm - Re: Language Myths Again
5/12/2003; 1:21:37 PM (reads: 1034, responses: 0)
Amazingly enough, people actually do real research about programming languages ("re-search" as opposed to "google-search")...

Isaac Gouy - Re: Language Myths Again
5/12/2003; 1:32:50 PM (reads: 1014, responses: 0)
For reasons I'll never comprehend, the Menlo Park library doesn't subscribe to Software Practice and Experience...

Predictably, I'd suggest the "C/C++ is so much more efficient" myth.
Just because I hear it over and over... maybe it's true!

Dan Shappir - Re: Language Myths Again
5/12/2003; 2:15:39 PM (reads: 1003, responses: 0)
Here's one for you: "Java is better than C++ because it's pure OO", except it isn't, while C++ is intentionally multi-paradigmatic.

Oh, and another one: "Java was innovative for introducing the concept of a VM."

Kimberley Burchett - Re: Language Myths Again
5/12/2003; 3:32:10 PM (reads: 996, responses: 0)
Here's one that as far as I know isn't a myth at all. C and C++ allow you to use much more compact memory representations than "safe" languages where pointer arithmetic isn't permitted. In other words, C and C++ are the best languages for systems programming and high-volume data structures.

For example, I'd love to see this optimization technique (or something similar) implemented in any safe language (ocaml, perl, java, smalltalk, etc). For the work I do, this every-byte-counts stuff actually adds up -- we routinely store hundreds of millions of objects in memory, and going 64-bit is not yet an option.

It's for this reason that when I daydream about language features that I'd like to have, I usually end up thinking about efficient, custom data structures generated from a high-level specification.

Isaac Gouy - Re: Language Myths Again
5/12/2003; 6:09:43 PM (reads: 985, responses: 0)
more compact memory representations than "safe" languages
What is and isn't a "safe" (type safe?) language?
Ada? Modula-2?

From a recent LtU discussion: "Generics are new"

Maybe we should hang out on Slashdot for a while to pick up some mythic themes...

Dan Shappir - Re: Language Myths Again
5/13/2003; 4:08:57 AM (reads: 961, responses: 0)
I can't resist: "in a hundred years everybody will be programming in LISP"

Ehud Lamm - Re: Language Myths Again
5/13/2003; 5:34:10 AM (reads: 965, responses: 2)
It is perhaps instructive to think about what different kinds of myths are out there.

I can think of these:

Statements that can be proven wrong, using the theoretical framework of the field. These are the easy ones.

Next we have things that require empirical evidence. "C is always faster than Scheme" can be debunked simply by showing *one* counterexample. This sort of thing is often the result of overgeneralizing from personal experience.

Next we have vague terms that result in meaningless statements (e.g., "OOP is the best methodology": neither 'OOP' nor 'methodology' has an exact meaning, but 'best' is really meaningless in this context).

The last type I want to mention shouldn't really be considered a myth. I am thinking about expressions of opinion. We shouldn't let opinions be disguised as facts, but controversial and even wrong opinions are not myths. They are sometimes the result of mass delusion, but that's beside the point. In this category I'd list things like "ten years from now everyone will be programming in Lisp". Statements in this category often also use vague terms: "imperative languages are easier for beginners".

Dan Shappir - Re: Language Myths Again
5/13/2003; 6:40:55 AM (reads: 996, responses: 1)
BTW Ehud, from your comment on the weather I deduce that you are still living in Israel. So am I. I believe we both even work in Jerusalem. And yet, a year after the original thread, we have yet to meet ;-)

Isaac Gouy - Re: Language Myths Again
5/13/2003; 7:39:00 AM (reads: 942, responses: 0)
Maybe 2 categories are enough:

Empirical Claims:

  • Programmers who use Language X (LISP) are always more productive

  • Programmers who use Language X (LISP) are more productive than those who use Language Y

  • Software written in Language X (C/C++) will have more defects than Language Y

  • Language X (hand coded assembly) is always the fastest

  • Language X (C) is faster than Language Y (Scheme, C++)

  • C and C++ allow you to use much more compact memory representations than "safe" languages where pointer arithmetic isn't permitted.

  • ?Language X (LISP) is more expressive than other languages

Opinion & Meaningless Generalization (maybe some of them can be restated):

  • Real Men don't need GC

  • Language X (Java/C++/C...) is the best programming language ever designed

  • Language X (Java/C++/C...) is the best language to base a first year / intro to programming course on, since Language X (Java/C++/C...) is what they use in industry.

  • To be a professional programmer it is enough to know one language well

  • Programming paradigm X (imperative, functional, OO) is more intuitive than Programming paradigm Y (imperative, functional, OO)

  • OOP is the best design methodology

  • Exceptions are bad

  • Language X (Ada, C++, Java, C#) is huge and bloated (compared to what?)

  • An OOPL must have a class construct

  • ?All programming languages are basically the same (equivalent or equally useful?)

  • ?Java is pure OO (compared to C++)

  • ?Inheritance is the best way to get code reuse

  • ?Static typing reduces productivity

  • ?Dynamic typing doesn't scale

Kimberley Burchett - Re: Language Myths Again
5/13/2003; 8:01:59 AM (reads: 944, responses: 0)
Well, at least you put my "myth" in the category of things that can be proven by counterexample. But until I actually get a counterexample, perhaps we need a third category:

Myths that are actually true.

;-)

Isaac Gouy - Re: Language Myths Again
5/13/2003; 11:43:23 AM (reads: 910, responses: 0)
Patience! ;-)
If it pleases you, think of them as claims about programming languages that it might be interesting to investigate further.

Are all those items really in the correct category?
What repeated, unsubstantiated claims are missing?

Frank Atanassow - Re: Language Myths Again
5/14/2003; 7:19:34 AM (reads: 883, responses: 0)
Here's one that as far as I know isn't a myth at all. C and C++ allow you to use much more compact memory representations than "safe" languages where pointer arithmetic isn't permitted. In other words, C and C++ are the best languages for systems programming and high-volume data structures.

I've looked at your "Efficient Representation of Shared Data" page, and have some remarks.

Here is the way I understand it.

There is an object S : Shared which is shared among N objects O1, O2, ..., ON : Unique. You want a type T such that there exist functions:

  mkT : (Shared, Unique) -> T
  getShared : T -> Shared
  getUnique : T -> Unique

that make T into a product of Shared and Unique.

Technique #1: Basically T = Shared * Unique.

Technique #2: This is really a non-technique, since it doesn't satisfy your requirements. You basically take T = Unique and require that a computation accessing the O's go through a type X = Shared * UnboxedArray T at some point in its history. Obviously there is no function getShared definable on T. But since all access to the O's must initially go through X, one can arrange that S is always in scope (there is always a free variable denoting S).

This suggests that, if we can force S to always be in scope for every computation on the O's, we can implement the wanted functions above. So define a Shared-computation (M a) to be a value of type a with a value of type Shared in scope:

  newtype M a = M (Shared -> a)

This forms a reader monad:

  return :: a -> M a
  return x = M (\s -> x)

  (>>=) :: M a -> (a -> M b) -> M b
  (M m) >>= k = M (\s -> let M m' = k (m s) in m' s)

  fetch :: M Shared
  fetch = M (\s -> s)

  runM :: Shared -> M a -> a
  runM s (M m) = m s

Now suppose we have a function which processes T's (producing Ints, say) and requires access to shared data. It should have type:

  f :: Unique -> M Int

This type is isomorphic to:

  Unique -> (Shared -> Int)

which is iso to:

  Unique * Shared -> Int

which, taking T = Unique * Shared, is iso to:

  T -> Int

So we can implement something like the wanted functions:

  mkT' :: Unique -> M T
  mkT' o = do { s <- fetch; return (o,s) }

  getShared' :: Unique -> M Shared
  getShared' _ = fetch

  getUnique' :: Unique -> M Unique
  getUnique' = return

I think this satisfies your requirements, though. Every Unique object is stored by itself, as is the object S, and no pointer arithmetic is required. Instead of changing the data representation, we changed the representation of computations on the data. The monad M ensures that every computation on a Unique gets passed the value S. I think in OO circles this is known as the Flyweight pattern.
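As a minimal usage sketch (assuming M is made an actual Monad instance so that do-notation works, and with hypothetical accessors size :: Unique -> Int and scale :: Shared -> Int that are not part of this thread), a computation over the O's would look like this, receiving S exactly once at the top level:

  -- Sum a scaled field of each unique object; the shared object is
  -- obtained via fetch instead of being stored alongside each Unique.
  total :: [Unique] -> M Int
  total os = do
    s <- fetch
    return (sum [ scale s * size o | o <- os ])

  -- At the top level, S is supplied once:
  --   runM s (total [o1, o2, o3])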

Kimberley Burchett - Re: Language Myths Again
5/14/2003; 10:07:12 AM (reads: 871, responses: 1)
I'm flattered. Not only did you look at the page, you took the time to understand it and translate it into Haskell! :)

But it seems you missed one important point: there can be more than one shared object, and each unshared object should be associated with a particular shared object. Judging from runM, it looks like you're assuming that there is only one shared object in the entire system. In that case, sure, you can just structure computations so that the shared object is supplied at some point in the future. But that's basically begging the question.

By the way, thanks for the exhibition of monadic transforms. They're still a bit new to me, so seeing them in a familiar context was useful.

Frank Atanassow - Re: Language Myths Again
5/14/2003; 10:45:17 AM (reads: 897, responses: 0)
But it seems you missed one important point: there can be more than one shared object, and each unshared object should be associated with a particular shared object. Judging from runM, it looks like you're assuming that there is only one shared object in the entire system.

I don't think there is a problem there. You have one object s associated with a set of objects. When you write:

runM s m

you're effectively creating that association: it associates s with all the objects mentioned in the computation m.

If you have more than one shared object `in the system', then you can use a reader monad transformer and standard techniques for combining monads. It is true in that case that there is no information about which objects are associated with which shared object, but that should not pose a problem, since you can almost certainly assume that if the set O is associated with s and the set O' with s', then O and O' are disjoint. (Why can you assume this? Because in your `efficient representation' the objects are stored by value in a contiguous array, so if they are part of two shared structures, then they have to either all be in common---O=O'---or not, and if they are all in common, then you can just merge their shared bits...)
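For concreteness, here is one sketch of what that stacking could look like, using ReaderT from Control.Monad.Reader; Shared is abstract in this thread, so a placeholder declaration stands in for it:

  import Control.Monad.Reader

  data Shared = Shared          -- placeholder; any concrete type would do

  -- Two shared objects: one outer and one inner reader layer.
  type M2 a = ReaderT Shared (Reader Shared) a

  fetchFirst :: M2 Shared
  fetchFirst = ask              -- environment of the outer layer

  fetchSecond :: M2 Shared
  fetchSecond = lift ask        -- environment of the inner layer

  runM2 :: Shared -> Shared -> M2 a -> a
  runM2 s s' m = runReader (runReaderT m s) s'

A computation built this way gets both shared objects supplied once, at the outside, just as runM supplied the single S before.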

Kimberley Burchett - Re: Language Myths Again
5/14/2003; 2:15:53 PM (reads: 856, responses: 1)
Okay, I'll assume that using these "standard techniques for combining monads", you can create computations that refer to more than just one shared object. However, even though I'm not familiar with the techniques you're referring to, I have a strong intuition that these "combined monads" will end up using more than one byte of overhead for each unique object.

Assuming I haven't exhausted your patience already, I'd be interested to see how your technique solves a problem like this (trying to keep it simple, yet still illustrate my point):

You have a list of unique objects, each of which has a shared object associated with it. Suppose, for simplicity, that the shared object is simply an integer. I want the sum of all the integers associated with the unique objects, disregarding the sharing. In Haskell:

sum uniques = foldl (+) 0 (map getShared uniques)

Feel free to lift the computation into a monad and do whatever other transformations you'd like.

Ehud Lamm - Re: Language Myths Again
5/14/2003; 10:37:02 PM (reads: 905, responses: 0)
we both even work in Jerusalem

Right! I'll try to get in touch next week (you can also email me), and we should get together some time.

Frank Atanassow - Re: Language Myths Again
5/15/2003; 5:28:01 AM (reads: 853, responses: 0)
However, even though I'm not familiar with the techniques you're referring to, I have a strong intuition that these "combined monads" will end up using more than one byte of overhead for each unique object.

If you have two shared objects, then the combined monad looks like:

Shared -> Shared -> a

or equivalently

(Shared, Shared) -> a

So there is really no overhead on the uniques. However, it is not really meaningful, I think, to talk about data representation overhead here, since there is no object stored in memory corresponding to a Shared plus all its associated uniques. Rather, I think you need to start talking about the space complexity of a computation in the monad.

You have a list of unique objects, each of which has a shared object associated with it. Suppose, for simplicity, that the shared object is simply an integer. I want the sum of all the integers associated with the unique objects, disregarding the sharing.

I'm afraid, if I understand you correctly, that this is not so straightforward. Since you have one `Shared' appearing in the type of the monad for each shared object, such a function needs to be polymorphically recursive and operate at a range of types:

()
Shared -> a
Shared -> Shared -> a
...

So at the least you need to do induction on types, which is possible, but not something I want to get into here. Also, I think you were right earlier and some more information needs to be added to this type in order to tell which unique is associated with which Shared; undoubtedly that extra information would exceed your minimum overhead.

Actually, a more serious problem occurred to me yesterday after I wrote my message. I said that if you have two Shared objects then it is almost certain that their associated uniques will be disjoint. The reason I wrote almost certain is that your efficient C representation stores the uniques by value in an array, and so if some were associated with two distinct Shared objects, then the intersection has to correspond to a suffix of the array for one Shared and a prefix of the array for the other Shared, which seemed pretty unlikely.

But I thought of an example where exactly that situation occurs, namely when the uniques are characters in a text editor buffer and the Shared object gives some styling or text properties which are associated with a contiguous range of characters. (This is, I think, the motivating example for the Flyweight pattern in the GOF book.)

Anyway, I think you could apply the reader monad approach in this case but it would certainly take more work, and I'm not sure about the space complexity off the top of my head.

Oleg - Re: Language Myths Again
5/15/2003; 4:13:46 PM (reads: 856, responses: 0)
We show two solutions to your problem -- both have lower memory requirements than the efficient solution on your page. Neither requires any pointer arithmetic. Both solutions can be efficiently implemented in any language that permits arrays (i.e., indexable collections).

First, the solution on your page has a space complexity of N + sizeof(int) + sizeof(pointer) bytes per shared object, provided there are at most 256 unique objects per shared object and objects cannot be allocated or freed individually (without an expensive reorganization).

It seems the gist of the problem is one-to-many correspondence. The solution follows from the relational database approach in a straightforward manner. We demonstrate two solutions here:

1. Consider the set of shared objects one relation, and the set of unique objects another relation. We also need a relation between shared and unique objects. Suppose we have at most 256 shared objects. We can allocate them in a vector. The index of a shared object in the vector would be that shared object's primary key. Unique objects can be allocated as you please. The handle (reference) to a unique object is its primary key. To relate a unique object and its shared part, we need a foreign key: the shared object's index. By assumption, there can be no more than 256 shared objects -- so one byte per foreign key should suffice.

        struct SharedData {
                /* shared data */
        };

        SharedData shared_data_table[] = { .... };

        struct MyObject {
                /* non-shared data */
                ...
                uint8 shared_data_key;
        };

        SharedData *get_shared_data(MyObject *object) {
                return &shared_data_table[object->shared_data_key];
        }

Complexity: N. Yes, we need only N extra bytes, where N is the number of unique objects. There is a limitation of 256 shared objects, but no limitation on how many unique objects can share common data. Unique objects can be allocated and disposed of individually.

2. What if we want to find all unique objects that share a given shared object? We should turn to a more general solution. We should ask ourselves, what should be a primary key of a unique object? It could be its handle -- or it could be a pair (shared_obj_key, unique_obj_key). Thus, a unique object is identified by a shared object and some additional key.

We can assume that there can be no more than 65536 shared objects and no more than 65536 unique objects per shared object.

        struct UniqueData {
                /* unique data */
        };

        struct SharedData {
                /* shared data */
                int UniqueDataCount;
                UniqueData *unique_obs;   // vector of unique objs
        };

        SharedData shared_data_table[] = { .... };

        typedef uint16 SharedRef;           // an index to a shared obj
        typedef uint16 UniqueOfSharedRef;   // an index for a unique obj

        struct UniqueObjRef {
                SharedRef shared_obj;
                UniqueOfSharedRef unique_obj;
        };

        SharedData *get_shared_data(UniqueObjRef unique_ref) {
                return &shared_data_table[unique_ref.shared_obj];
        }

        UniqueData *get_nonshared_data(UniqueObjRef unique_ref) {
                return &(shared_data_table[unique_ref.shared_obj]
                          .unique_obs[unique_ref.unique_obj]);
        }

Overhead (per shared object): sizeof(int) + sizeof(ptr). The overhead does not depend on the number of unique objects.

No address arithmetic is required. We only need array access. Furthermore, because UniqueObjRef does not depend on actual memory addresses, UniqueObjRef can refer to objects on disk as well as to objects in memory, can be persistent, and can be reliably stored in persistent and distributed data structures.

It must be noted that a virtual address is likewise a complex structure of various indices (into segment and page tables).
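As a sketch of the language-neutrality claim above, Solution 2 transcribes into, say, Haskell using only standard immutable arrays; the payload types here are invented stand-ins:

  import Data.Word (Word16)
  import Data.Array (Array, (!))

  type SharedPayload = String            -- stand-in for the shared data
  type UniquePayload = Int               -- stand-in for the unique data

  data SharedEntry = SharedEntry
    { payload    :: SharedPayload
    , uniqueObjs :: Array Int UniquePayload }

  type SharedTable = Array Int SharedEntry

  -- The analogue of UniqueObjRef: two 16-bit indices, no pointers.
  data UniqueObjRef = UniqueObjRef
    { sharedRef :: Word16
    , uniqueRef :: Word16 }

  getSharedData :: SharedTable -> UniqueObjRef -> SharedPayload
  getSharedData tbl r = payload (tbl ! fromIntegral (sharedRef r))

  getNonsharedData :: SharedTable -> UniqueObjRef -> UniquePayload
  getNonsharedData tbl r =
    uniqueObjs (tbl ! fromIntegral (sharedRef r)) ! fromIntegral (uniqueRef r)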

Kimberley Burchett - Re: Language Myths Again
5/15/2003; 6:20:38 PM (reads: 794, responses: 0)
I concede the point :)

Unfortunately, the first version won't work in practice. Every compiler I know of will pad that last byte up to the nearest word size, which on current machines is four bytes. If it didn't, then you would pay a heavy penalty for using any of the unique objects, since they would not be word-aligned and would therefore need to be copied out bytewise before being used. It also restricts you to 256 shared objects total, with no real way to increase that number.

However, the second version is good enough to prove me wrong. This version has two drawbacks and one advantage. The first drawback is that you have to know up-front how many shared objects you're going to have (although this can be offset by using a table that grows on demand, at a trivial cost). The second drawback is that it restricts the total number of shared objects, depending on how many bits you allocate for the shared and unique offsets, and this allocation is static and must be tailored to the problem at hand. On the other hand, it has the advantage that on a 64-bit cpu it will remain only 32 bits, while a pointer would double in size.

In nearly every situation, your second version would be sufficient to solve the problem. However, in the situation I was working in when I developed my technique, I had about 40 million shared objects (26 bits), and if I limited them to only 32 unique objects each (the remaining 6 bits), the shared objects would have doubled to 80 million, requiring another bit, and so on. So it wouldn't have worked for me, but I was in a very specific and odd situation.
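Concretely, the static split described above amounts to something like this hypothetical 26/6 packing of a 32-bit handle (the field widths are the ones from the example above):

  import Data.Bits ((.&.), (.|.), shiftL, shiftR)
  import Data.Word (Word32)

  -- Pack a 26-bit shared index and a 6-bit unique index into one handle;
  -- the handle stays 32 bits even on a 64-bit machine.
  mkRef :: Word32 -> Word32 -> Word32
  mkRef sh un = (sh `shiftL` 6) .|. (un .&. 0x3f)

  sharedOf, uniqueOf :: Word32 -> Word32
  sharedOf r = r `shiftR` 6
  uniqueOf r = r .&. 0x3f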

I'll link to this thread from my page.

Oleg - Re: Language Myths Again
5/18/2003; 1:58:48 PM (reads: 737, responses: 0)
I have to admit a failure to express Solution 1 clearly. You're absolutely right: if the non-shared data "..." in the object below consists entirely of integers, floats, and other aligned data, the lone uint8 at the end will consume four or even eight bytes rather than the expected one.
        struct MyObject {
                /* non-shared data */
                ...
                uint8 shared_data_key;
        };
I thought however that MyObject might include ASCII strings or pointers to ASCII strings. It would be beneficial then to include such strings within the object itself:
        struct MyObject {
                uint8 shared_data_key;
                char my_str[0];   // string stored inline (the "struct hack")
        };
In that case, the cost of shared_data_key becomes negligible.

Your comment highlights the often neglected fact that allocating memory for strings via malloc() is quite wasteful: the padding overhead averages 4 bytes (anywhere from 1 to 7), plus (on many systems) 8 bytes for the memory-block header. For short strings, the overhead can be quite large. Alas, in C all non-static strings that are to survive the return from the creating function must be allocated on the heap. Memory allocation is really a very difficult subject and should therefore be left to specialists such as Hans-J. Boehm.