Lambda the Ultimate

CDuce: a programming language adapted to the manipulation of XML documents
started 7/3/2003; 4:45:30 PM - last post 7/27/2003; 9:30:35 AM
Ken Shan - CDuce: a programming language adapted to the manipulation of XML documents
7/3/2003; 4:45:30 PM (reads: 1753, responses: 12)
CDuce: a programming language adapted to the manipulation of XML documents

Our point of view, and our guideline for the design of CDuce, is that a programming language for XML should take XML types (DTD, XML Schema, Relax-NG, ...) seriously. The benefits are the following:

  • static verifications (e.g.: ensure that a transformation produces a valid document);
  • in particular, we aim at smooth and safe compositions of XML transformations, and incremental programming;
  • static optimizations and efficient execution model (knowing the type of a document is crucial to extract information efficiently).

Some of CDuce's distinctive features:

  • XML objects can be manipulated as first-class values: elements, sequences, tags, characters and strings, attribute sets; sequences of XML elements can be specified by regular expressions, which also apply to character strings;
  • functions are themselves first-class values: they can be manipulated, stored in data structures, returned by a function, ...
  • a powerful pattern matching operation can perform complex extractions from sequences of XML elements;
  • a rich type algebra, with recursive types and arbitrary boolean combinations (union, intersection, complement), allows precise definitions of data structures and XML types; general-purpose types and type constructors are taken seriously (products, extensible records, arbitrary-precision integers with interval constraints, Unicode characters) -- a small sketch of the type algebra follows this list;
  • polymorphism through a natural notion of subtyping, and overloaded functions with dynamic dispatch;
  • a highly effective type-driven compilation scheme.
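
For instance, a rough sketch of the boolean type algebra in the syntax of the CDuce online demo (a hypothetical example, not taken from the CDuce documentation):

type Digit = 0--9                  (* interval type *)
type Small = 0--99
type Both = Digit & Small          (* intersection: equivalent to 0--9 *)
type NonZero = Digit \ 0           (* difference with the singleton type 0 *)
type IntOrString = Int | String    (* union of unrelated types *)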

Preliminary benchmarks suggest that, despite the overhead of static type verification, a CDuce program can run faster (30% to 60%) than an equivalent XSLT stylesheet (we performed benchmarks with the xsltproc tool from the Gnome libxslt library).


Posted to xml by Ken Shan on 7/3/03; 4:46:27 PM

Oleg - Re: CDuce: a programming language adapted to the manipulation of XML documents
7/4/2003; 9:01:51 PM (reads: 1056, responses: 0)
Ok, I'll bite. I'm quite irritated by the following claim:

- static verifications (e.g.: ensure that a transformation produces a valid document);

because the above goal is not attainable short of proving the whole transformation code correct with ACL2 or a similar system. I'm quite disturbed when people replace definitions with ones they find more agreeable. In the context of XML, "valid XML document" means only one thing -- the satisfaction of all validity constraints specified in the XML 1.0 Recommendation. I must emphasize: all constraints. These constraints include not only the content model constraints (e.g., that an element declared with empty content must indeed have empty content). A valid XML document must also satisfy the following constraints:

Validity constraint: ID
Values of type ID must match the Name [p.8] production. A name must not appear more than once in an XML document as a value of this type; i.e., ID values must uniquely identify the elements which bear them.

Validity constraint: IDREF
Values of type IDREF must match the Name [p.8] production, and values of type IDREFS must match Names [p.8] ; each Name [p.8] must match the value of an ID attribute on some element in the XML document; i.e. IDREF values must match the value of some ID attribute.

Validity constraint: Fixed Attribute Default
If an attribute has a default value declared with the #FIXED keyword, instances of that attribute must match the default value.

It's plain that to represent such constraints in a type system we must have a fully value-dependent type system. People may have different opinions about the merits of the above constraints and the wisdom of including them. These opinions are ultimately irrelevant: until a new version of the XML Recommendation rescinds those constraints, we are obliged to follow them all if we would like to call our XML documents valid.

Let us consider the following XML transformation problem: we have two documents of a type <!DOCTYPE Foo [ <!ELEMENT Bar EMPTY> <!ATTLIST Bar Attr ID #REQUIRED> ]> and we want to merge them (into a document of the same type). Clearly we cannot write a transformation stylesheet that assuredly produces a valid result for every pair of valid input documents. The reason is that the values of the attribute 'Attr' must be unique throughout the whole document. Each of the two input XML documents may have unique values of the attribute Attr, yet the result of their merger may contain duplicates. We are not at liberty to change the values of the attributes to guarantee their uniqueness: the values may have an independent meaning (e.g., part codes, catalog codes, etc.).
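
For concreteness, here is a hypothetical pair of instances (eliding the surrounding DTD); each is valid on its own, but a naive merger is not:

<!-- document 1: ID values i1 and i2 are unique -->
<Foo><Bar Attr="i1"/><Bar Attr="i2"/></Foo>

<!-- document 2: ID values i2 and i3 are unique -->
<Foo><Bar Attr="i2"/><Bar Attr="i3"/></Foo>

<!-- naive merger: "i2" appears twice as an ID value, violating the ID validity constraint -->
<Foo><Bar Attr="i1"/><Bar Attr="i2"/><Bar Attr="i2"/><Bar Attr="i3"/></Foo>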

So, we must introduce dynamic, run-time checks. We must formulate the type-soundness theorem for our transformation code in the following way: given valid input documents, the transformation code, if it terminates without error, will generate a valid document. Now I jump up and say: I can give exactly the same kind of assurance without any typing. I will transform input XML documents with SXSLT, and then run the result through a validating XML parser. If the parser reports no error, I'll send the result to the client. In both cases the client is guaranteed to receive a valid XML document. Thus, in the case of XML transformations, static typing alone cannot give assurances that the produced document is valid. As far as the validity of XML documents is concerned, static typing doesn't seem to convey benefits over dynamic typing.

The following message (posted on the SSAX-SXML mailing list) makes the same point with more evidence and less anger:
http://sourceforge.net/mailarchive/forum.php?thread_id=1562776&forum_id=599

castagna - Re: CDuce: a programming language adapted to the manipulation of XML documents
7/11/2003; 6:07:22 AM (reads: 920, responses: 0)
Thanks a lot for pointing this out. Even though it is on our main page, we had overlooked it. Of course you are right, and we should have specified "valid with respect to all the constraints that can be expressed in the type system". Now it is done (is it ok?).

And what about the rest? I have to admit that you have not convinced me that "as far as validity of the XML documents is concerned, static typing doesn't seem to convey benefits over dynamic typing". Earlier detection of type errors helps me a lot in the development phase. But I confess I do not know SSAX-SXML, so we should definitely have a look at it. Could you tell us a good starting point?

In any case, besides static safety, one of our points is that having types allows a *very* fast implementation. This is especially true for XML, where, if you know the type of your tree, checking just the tag gives you a lot of information about what is below and lets you avoid a lot of useless checks; we believe CDuce proves it. You see this especially on computationally intensive executions (roughly, transformations in which the load-and-parse time is not much larger than the execution time). I warmly invite you to check it in person by trying our beta release and comparing execution times with your favorite XML processor. If you do, please let us know your results; we are very interested in them, both negative and positive.

Also, if you have any question, remark, or suggestion, we are eager to receive it. Please post it _also_ on one of our mailing lists (e.g. devel@cduce.org or users@cduce.org). And, you know, not all that's wrong was written wrongly on purpose ;-)

Cheers

---Beppe


P.S. I forgot: the Fixed Attribute Default constraint can be verified in CDuce, as the type system has singleton types (types that contain just one value).
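
For instance (a hypothetical sketch): if a DTD declares <!ATTLIST bar version CDATA #FIXED "1.0">, the corresponding CDuce type can use the singleton string type "1.0" for the attribute, so any other value is a static type error:

type Bar = <bar version = "1.0">[]   (* only the exact value "1.0" is accepted *)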

Isaac Gouy - Re: CDuce: a programming language adapted to the manipulation of XML documents
7/11/2003; 7:37:08 AM (reads: 914, responses: 0)
Now it is done (is it ok?)
font-size:8px
My poor old eyes struggle with fine-print this small, especially when it's a fixed pixel size while the other text is em ;-)

Sjoerd Visscher - Re: CDuce: a programming language adapted to the manipulation of XML documents
7/11/2003; 8:23:57 AM (reads: 914, responses: 0)
Clearly we cannot write a transformation stylesheet that assuredly produces the valid result for any valid input documents.

But it is possible to detect that the code may produce invalid results, so the compiler could issue a warning. When the transformation stylesheet is changed to capture and resolve ID conflicts, the compiler could detect that too and compile without error. I don't know enough about type systems to tell how feasible this is, but it would be a very useful feature to have.

castagna - Re: CDuce: a programming language adapted to the manipulation of XML documents
7/11/2003; 10:14:33 AM (reads: 883, responses: 0)
>font-size:8px
>
>My poor old eyes struggle with fine-print this small

SORRY, on my 19" screen it looked nice, but now I am on my laptop and I see how awful it is. Changed to font-size:80%. Better, isn't it?

castagna - Re: CDuce: a programming language adapted to the manipulation of XML documents
7/11/2003; 10:28:47 AM (reads: 898, responses: 0)
Clearly we cannot write a transformation stylesheet that assuredly produces the valid result for any valid input documents.

Of course we can: the identity transformation will always produce a valid result from any valid input document. No, what we cannot do is statically decide, for every possible transformation, whether it produces a valid result for any valid input document.
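
For instance, a one-line sketch in the style of the other examples in this thread (hypothetical, untested):

type AnyXml = <_>_
let id (AnyXml -> AnyXml) x -> x    (* returns its input unchanged, so validity is trivially preserved *)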

Statically detecting such conflicts is out of reach of type systems. If you have to decide whether a given attribute value is unique, and the values of attributes are calculated by some expressions, this is equivalent to deciding whether a given expression returns a given value. We can produce good approximations with static analysis techniques (and thus issue warnings when a conflict may happen), but only a dynamic check can verify it.

On the other hand, type systems can find very nasty errors. For instance, look at this declaration:

<!ELEMENT person (name,children)>
<!ELEMENT children (person+)>
<!ELEMENT name (#PCDATA)>

Do you see the problem? There is no document (without refs) that can be valid: a person contains a sequence of children, which contains a non-empty sequence of persons, and so on, generating an infinite tree.

Let us write the same type in CDuce and look at the result (taken verbatim from the online demo, no color added):

type Person = <person>[ Name Children ]
type Children = <children>[Person+]
type Name = <name>[PCDATA]

Warning at chars 57-76:
type Children = <children>[Person+]
This definition yields an empty type for Children
Warning at chars 14-39:
type Person = <person>[ Name Children ]
This definition yields an empty type for Person

Ok.

We took special care to localize errors and suggest solutions. I can give you more significant examples if you want, but you can just try it yourself: pick the examples available in the online demo and put random errors in them.

Cheers

---Beppe

Oleg - Re: CDuce: a programming language adapted to the manipulation of XML documents
7/18/2003; 1:46:30 PM (reads: 806, responses: 1)
When I write a factorial function in a statically typed language and it compiles, the compiler does not guarantee that the function gives the right result (let's disregard dependent and undecidable type systems, as they are not common). The compiler will only guarantee that my function, if it ever finishes without an error, will return some integer. If the XML working group had included a specialist in programming languages, he would hopefully have explained to the rest of the group the benefits of separating some aspects of data (the ones that can be statically checked) from the other aspects. Perhaps the group would have introduced a category of "structurally valid XML documents" that is not burdened with "ID uniqueness" validity constraints. I must note that RELAX NG does include such a category and does draw the separation between the statically checkable and the other aspects of XML documents [1]. It seems James Clark learned a lot from XDuce.
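
To make the factorial point concrete in the notation used elsewhere in this thread, here is a hypothetical CDuce-style sketch (untested):

let fact (Int -> Int)
 | 0 -> 1
 | n -> n * (fact (n - 1))

(* The interface Int -> Int promises only that the result, if any, is an integer; it does not say the result is the factorial, and for a negative argument the function never returns -- which the type cannot rule out. *)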

There is an important aspect, however, that sets XML processing apart: the aspect of trust and reliability. The compiler and the typechecker implicitly trust the run-time system. If they compile my monomorphic factorial function, the function's code can be sure it will be given one and only one argument, and that that argument will be an integer. The function will never receive a "half" of an integer. These assurances do not generally apply to XML processing, where the producer and the consumer of an XML document are generally separate entities, communicating via an untrusted and unreliable medium. If you write an XML processing application that takes XML documents of doctype A, and I give you an XML document that claims to be of doctype A, would you take my word for it? In this day and age, it would be quite unwise to do so. Therefore, you would need to parse and validate my document against the document type declaration for doctype A. You write: "This is especially true for XML, where, if you know the type of your tree, checking just the tag gives you a lot of information about what is below and lets you avoid a lot of useless checks; we believe CDuce proves it." Indeed, after you have verified that the incoming document is of the right doctype, you already know a lot about the document, and the processing application proper can take advantage of that information and avoid useless checks. However, before your application gets to work, you must first do a full validating parse of the incoming document. To validate that the document is of the right doctype you have to check the entire document -- even the parts that your processing application might not care about. If a processing application repeatedly traverses the whole document, then the cost of the full validation check diminishes in comparison with the processing cost. If the processing application does a single scan of (part of) a document, the following optimistic scenario might be preferable.

Verifying the doctype of incoming documents is a pessimistic scenario. Allow me to suggest an optimistic, "untyped" alternative, which might work better in some circumstances. Suppose my processing application extracts several pieces of data from some elements and attributes of an incoming XML document. I won't care what the type of the incoming document is. I parse the document without validation and get to work. Of course, I have to check that all the needed pieces of data are indeed present. If they are, I finish successfully. If I miss something, I roll back and signal an error. If I only scan a part of the document, I don't need to check the parts I don't care about. Thus the total number of checks (parsing + processing) may be smaller in the optimistic scenario than in the pessimistic one. Furthermore, for some transformations, I may be able to do the processing as I parse the document. The SSAX parser [2] is especially conducive to such an approach. Therefore, I may be able to process the document in a stream-wise fashion. I can receive the document from one pipe and send the transformed document through another pipe, without any need to store the whole document in memory. This processing mode, when it applies, dramatically reduces the processing latency and lets me handle huge documents that won't fit into memory. Regarding the roll-back and error signaling: HTTP 1.1 conveniently provides for sending out-of-band notifications to a client. In the best case, my client will receive the result with little latency. In worse cases, I simply tell the client to forget what I have said during the current transaction, and roll back.

The optimistic approach has another advantage. Suppose my application is designed to handle documents of the following structure

<!DOCTYPE Foo PUBLIC "T1" [ 
  <!ELEMENT A EMPTY> 
  <!ELEMENT Foo (A)+> 
]>
That is, I expect that the root element is named 'Foo' and that it contains one or more children named 'A'. Suppose you have sent me a document of type T2:
<!DOCTYPE Foo PUBLIC "T2" [ 
  <!ELEMENT A EMPTY> 
  <!ELEMENT Foo (A)*> 
]>

If I am using the pessimistic, statically typed approach, I should reject your document: T2 is not a subtype of T1 (it's actually the other way around). Documents of type T2 may have no elements named 'A' in the content of the root element. However, there may be some (many?) documents of type T2 that contain at least one element 'A' as a child of 'Foo'. If the particular document you sent me is indeed of that class, I can handle it. The optimistic, "dynamically typed" approach, so to speak, permits me to be opportunistic. I hasten to add that whether being opportunistic is a good thing depends entirely on the application. For data mining, being opportunistic is good. On the other hand, if an incoming XML document is an order to perform a serious action, the application had better be sure the order is authentic. Full validation of the incoming document is one step in authentication.
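
Rendered as CDuce-style types (a hypothetical translation of the two DTDs), the direction of the subtyping is easy to see:

type A  = <A>[]
type T1 = <Foo>[ A+ ]   (* one or more A children *)
type T2 = <Foo>[ A* ]   (* zero or more A children *)

Every value of T1 also matches T2, so T1 <: T2; a consumer expecting T1 cannot, in general, accept an arbitrary T2 document.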

This brings us to the point that validation is a far trickier issue than it may appear. In fact, a consumer and a producer can use different schemas to validate the document against. James Clark discusses at some length why a consumer or an author might need different schemas, and the complexities of validation [3]. For that reason, RELAX NG does not provide a "standard" way of associating a schema with an XML document. The type of an XML document is in the eye of the beholder. There may be several beholders, not related by any particular hierarchy.

I should also point out that even well-formedness is far less clear-cut than it may appear. We saw that some applications might wish to accept invalid XML documents -- or documents with a wrong manifest type -- as long as the document contains all the data the application needs. One can make a case for an application that accepts even ill-formed XML documents. For example, let's consider the processing of invoice records, which come in the following long XML document:

   <Invoices>
    <Invoice trans_id='1' cust_id='2' product_id='3'/>
    <Invoice trans_id='2' cust_id='22' product_id='23'/>
   ...
My task is to load the invoices into a relational database. Suppose I receive the document in question from an HTTP pipe. In the optimistic scenario, I can start processing right away, loading the data into the database as I parse them. Suppose I have processed all the records and got to the end of the document. And then, just as I was about to read the closing </Invoices>, the network connection broke down. Stuff happens. What should I do? Should I roll back what I have loaded? One may argue that I should: had I downloaded the document first, I would have noticed that it is ill-formed and rejected it. One may also argue that each invoice comes in its own transaction, so I should keep the loaded records.

These examples show that there are several tenable points of view on the validity and typing of XML documents. I believe it's important to be aware of the variety, so that we can choose the approaches that best fit our problems.

But I confess I do not know SSAX-SXML, so we should definitely have a look at it. Could you tell us a good starting point?
http://ssax.sourceforge.net/
You might be particularly interested in papers about SXSLT.

[1] "The RELAX NG TC spent a considerable amount of time considering what support RELAX NG should provide for enforcing identity (uniqueness and cross-reference) constraints. In the end, the conclusion was that identity constraints were better separated out into a separate specification. Accordingly, RELAX NG itself provides no support for identity constraints.... Another reason is that it is often desirable to perform grammar processing separately from identity constraint processing. For example, it may be known that a particular document is valid with respect to a grammar but not known that it satisfies identity constraints. The type system of the language that was used to generate a document may well be able to guarantee that it is valid with respect to the grammar; it is unlikely that it will be able to guarantee that it satisfies the identity constraints. A document assembled from a number of components may guaranteed to be valid with respect to a grammar because of the validity of the components, but this will often not be the case with identity constraints. Even when a document is known to satisfy the identity constraints as well as be valid with respect to the grammar, it may be necessary to perform identity constraint processing in order to allow application programs to follow references." James Clark: The Design of RELAX NG. December 6, 2001. <http://www.thaiopensource.com/relaxng/design.html>

[2] SSAX XML parser
http://ssax.sourceforge.net/
In particular, the PADL02 paper.

[3] "By layering type assignment on top of RELAX NG validation, applications that do not require type assignment do not need to pay the cost for a feature that they do not use. It is also more flexible. For example, it allows there to be two schemas for a document: one strict schema that captures as many constraints as possible but does not satisfy the restrictions necessary for type assignment and another looser schema that cannot express all the constraints but which can be used for type assignment. One advantage to not performing type assignment with RELAX NG is that RELAX NG works well with existing APIs such as DOM and SAX. Type assignment would require major changes to report the types assignment to elements and attributes.

With XML 1.0, the XML document uses a DOCTYPE declaration to identify the DTD with respect to which it is valid. There is no provision for a document to be validated with respect to DTD that is specified independently of the document. This is unsatisfactory for interchange. When a document recipient receives a document from an untrusted sender, the recipient may need to check that the document is valid with respect to a particular DTD. The recipient cannot assume that the DOCTYPE declaration of the document correctly identifies that DTD. The recipient may want to validate against a DTD different from that used by the author: for example, the recipient may validate against a generalized, public DTD, whereas the author may validate against a restrictive, private DTD that is a subset of the public DTD. Unlike XML 1.0, RELAX NG does not tie a document to a single schema. The RELAX NG validation process has two inputs: a document and a schema against which to validate the document.

In fact, RELAX NG does not define any mechanism for associating a document with a RELAX NG schema. Although it is useful to be able to specify rules for determining the schema to be used to validate a particular document, this problem is not specific to RELAX NG. Validation is just one of many processes that can be applied to an XML document. For example, a user may wish to perform XInclude [21] processing or XSLT processing. A user may wish to perform validation before or after any of these other processes. The problem of associating a schema with a document is really just a special case of the problem of associating processing with a document. What is needed is a solution that can specify a series of processes to be applied to a document." James Clark: The Design of RELAX NG. December 6, 2001. <http://www.thaiopensource.com/relaxng/design.html>

wh - Re: CDuce: a programming language adapted to the manipulation of XML documents
7/22/2003; 2:16:37 AM (reads: 731, responses: 1)
I'd like to understand the motivation for overloaded functions in CDuce. It seems that the examples using overloaded functions in the white paper could just as well be expressed with mutually recursive functions (with different names). Are they simply a matter of convenience, or is there something specific about the way XML types are expressed/extended that requires overloading (and extensible overloading -- section 3.5, "incremental programming")? Thanks,

Warren

Alain Frisch - Re: CDuce: a programming language adapted to the manipulation of XML documents
7/26/2003; 12:05:41 PM (reads: 708, responses: 0)
Hi,

Thank you Oleg for the interesting remarks.

I don't quite understand the distinction you make between the optimistic (untyped) and the pessimistic (typed) approaches. It is possible in CDuce to validate the input document against a type that constrains the structure of the document only partially. For instance, validating an XML document against the type <a>[ <b>Any <c>String ] will not check anything under the tag <b>. The application can still perform "untyped" computations on the content of the b element by explicit pattern matching.
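
A hypothetical sketch of that idea (names made up): the type constrains only the shape around <b>, yet the program can still dig into its content dynamically.

type Partial = <a>[ <b>Any <c>String ]

(* statically, only the overall shape is known; the content of <b> is unconstrained *)
let get_c (Partial -> String) <a>[ <b>_ <c>s ] -> s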

You also seem to imply that with the CDuce approach, the application necessarily has to load the whole XML document in memory. It is true that the current implementation loads the whole document in memory, but I see this as an implementation issue. It would be possible to implement the load_xml operator lazily, so as to parse the XML document on demand (this can be useful if validating against a "partial" type). A step further would give the whole language a lazy semantics, thus paving the way to streaming processing.

I don't understand your point in the example with types T1 and T2. Why would the application reject the document with doctype T2? In CDuce, at least, the application would reject the document only if its content is empty (i.e., if the root element has no 'A' children). Indeed, CDuce does not use the doctype declaration when loading an XML document; instead, it validates the content itself (a possible optimization would be to skip validation when the declared doctype is a subtype of the expected type and the document comes from a trusted source).

Alain Frisch - Re: CDuce: a programming language adapted to the manipulation of XML documents
7/27/2003; 1:24:40 AM (reads: 700, responses: 0)
[ Note: this answer is also posted to the users@cduce.org mailing list. I propose to continue the discussion of technical points specific to CDuce there. ]

I'd like to understand the motivation for overloaded functions in CDuce. It seems that the examples using overloaded functions in the white paper could just as well be expressed with mutually recursive functions (with different names).

You're probably referring to the "split" function ("sort" in the white paper -- note that our ICFP03 paper, to be presented, is an updated version of the white paper; it is available at http://www.cduce.org/papers.html).

type Person = FPerson | MPerson
type FPerson = <person gender = "F">[ Name Children ]
type MPerson = <person gender = "M">[ Name Children ]
type Children = <children>[Person*]
type Name = <name>[ PCDATA ]

type Man = <man name=String>[ Sons Daughters ]
type Woman = <woman name=String>[ Sons Daughters ]
type Sons = <sons>[ Man* ]
type Daughters = <daughters>[ Woman* ]

let split (MPerson -> Man ; FPerson -> Woman)
 <person gender=g>[ <name>n <children>[(mc::MPerson | fc::FPerson)*] ] ->
    let tag = match g with "F" -> `woman | "M" -> `man in
    let s = map mc with x -> split x in
    let d = map fc with x -> split x in
      <(tag) name=n>[ <sons>s <daughters>d ]

The function could actually be expressed using mutually recursive functions, like:

let split_common([ Person* ] -> [ Sons Daughters ])
 [ (mc::MPerson | fc::FPerson)* ] ->
     let s = map mc with x -> splitM x in
     let d = map fc with x -> splitF x in
     [ <sons>s  <daughters>d ]

let splitM (MPerson -> Man) <person>[ <name>n <children>c ] -> <man name=n>(split_common c)

let splitF (FPerson -> Woman) <person>[ <name>n <children>c ] -> <woman name=n>(split_common c)

let split (Person -> Man|Woman) MPerson & x -> splitM x | x -> splitF x

To avoid duplicating code between splitM and splitF, we have to introduce a function split_common. We also introduce a function split to recover the fact that we want to apply the function to Person elements. Here it is easy to factor out the common code, but that is not always the case.

Another use of overloading is to give more precise information about the result when one has more information about the argument. A simple-minded example:

type Nat = 0--*
let succ (Int -> Int; Nat -> Nat) x -> x + 1

Here, we would have to completely duplicate the body of the function:

let succInt (Int -> Int) x -> x + 1
let succNat (Nat -> Nat) x -> x + 1

Suppose we want to use succ like this:

type T = [ Int* Nat* ]
let succSeq (s : T) : T = map s with x -> succ x

If we want to use succInt/succNat instead, we have to dispatch between Nat and Int in the caller:

let succSeq (s : T) : T = map s with x & Nat -> succNat x | x -> succInt x

But this dispatch is completely useless, because the dynamic behavior is the same in both cases. Now replace Nat and Int by complex XML types, and x -> x + 1 by a complex transformation, and you get the picture.

There are also probably interesting examples combining overloaded functions and higher-order functions, but we haven't found a convincing one yet.

Are they simply a matter of convenience, or is there something specific about the way XML types are expressed/extended that requires overloading (and extensible overloading -- section 3.5, "incremental programming")?

I let Beppe answer this part.

castagna - Re: CDuce: a programming language adapted to the manipulation of XML documents
7/27/2003; 9:29:23 AM (reads: 644, responses: 0)
Just let me add an example to what Alain said. The function

type XMLdoc = <_>_

let f(XMLdoc -> XMLdoc)
<m t> e -> g(m,t,e,f)

is a generic function that manipulates any well-formed XML document and returns another well-formed document. Well-formedness is the only requirement here (and the only property enforced on the result).


castagna - Re: CDuce: a programming language adapted to the manipulation of XML documents
7/27/2003; 9:30:35 AM (reads: 661, responses: 0)
Are they simply a matter of convenience, or is there something specific about the way XML types are expressed/extended that requires overloading (and extensible overloading -- section 3.5, "incremental programming")?

I let Beppe answer this part.

[posting on users@cduce.org: see previous message]

Ok, so let me add my two cents' worth. This is an old idea: to use dynamic overloading to mimic object orientation. The rough idea is that code specialization can be obtained by the incremental definition of overloaded functions. Adding a new method for a message msg corresponds to adding new code (i.e., a new arrow in the interface) to the overloaded function msg.

To exemplify, let us consider the split function of our example. Imagine that we define a subtype of Person that also contains a nationality attribute:

 type Citizen = FCitizen | MCitizen
 type FCitizen = <person gender = "F" nat = String>[ Name Children ]
 type MCitizen = <person gender = "M" nat = String>[ Name Children ]

Note that the Children are still Persons, so no nationality is recorded for them. Imagine now that we want to specialize the behavior of split. As it stands, split also works with Citizens, but it completely loses the information about nationality. To make things harder, imagine instead that split has to record the nationality in the result and propagate it to the children as well, that is:

type CMan = <man name=String nat=String>[ <sons>[CMan*] <daughters>[CWoman*] ]
type CWoman = <woman name=String nat=String>[ <sons>[CMan*] <daughters>[CWoman*] ]

Of course we can redefine split from scratch, but with static binding the new behavior will be available only to new functions. If instead we implement specialization with dynamic binding, then the "old" calls to split will also enjoy the specialized behavior.
So if split were defined in the module SomeModule, we could write (in a syntax that does not yet exist):

addmethod SomeModule.split (FCitizen -> CWoman; MCitizen -> CMan)
 <person gender=g nat=nn>[ <name>n <children>[(mc::MPerson | fc::FPerson)*] ] ->
   let tag = match g with "F" -> `woman | "M" -> `man in
   let s = map mc with <person (atr)> elms -> split <person (atr+{nat=nn})> elms in
   let d = map fc with <person (atr)> elms -> split <person (atr+{nat=nn})> elms in
     <(tag) name=n nat=nn>[ <sons>s  <daughters>d ]

[we add the nat attribute in the arguments of the recursive calls and in the result]

This would roughly correspond to transforming all the existing calls to split into calls to the following function:

let split (MPerson -> Man ; FPerson -> Woman ; FCitizen -> CWoman ; MCitizen -> CMan)
| <person gender=g nat=nn>[ <name>n <children>[(mc::MPerson | fc::FPerson)*] ] ->
   let tag = match g with "F" -> `woman | "M" -> `man in
   let s = map mc with <person (atr)> elms -> split <person (atr+{nat=nn})> elms in
   let d = map fc with <person (atr)> elms -> split <person (atr+{nat=nn})> elms in
     <(tag) name=n nat=nn>[ <sons>s  <daughters>d ]
| <person gender=g>[ <name>n <children>[(mc::MPerson | fc::FPerson)*] ] ->
   let tag = match g with "F" -> `woman | "M" -> `man in
   let s = map mc with x -> split x in
   let d = map fc with x -> split x in
     <(tag) name=n>[ <sons>s  <daughters>d ]

Of course there is still a long way to go; first of all, we need a real module system.

---Beppe---