Data schema languages


I've been thinking recently about schema languages for data, and in particular what it takes to give them a formal foundation.

I expect a lot of type system theory would apply, since these are effectively type systems for data, however they're a bit different to a typical programming language's type system.

On the one hand they often include value-level constraints like 'integer between 0 and 16', which I imagine might require dependent types to formalise in terms of type theory.

On the other hand, things are in some ways simpler, because soundness and completeness are only required with respect to correct serialisation and deserialisation of data, not with respect to correct execution of code.

Then there is the fact that often a schema language is designed to express more than just a true/false validation process, but to add extra metadata to types for reflection by various tools / for various purposes, eg:

  • Serialisation and deserialisation tools for various programming languages
  • Hypertext clients (see eg the hyper-schema extensions to json-schema)
  • Data persistence tools
  • Even simple UI tools like admin form generators

Wonder if anyone here has any good references on this stuff, in particular any attempts to formalise semantics and prove soundness etc for such a language? (FYI I'm interested in attempting this for a YAML or JSON-based schema language)


Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

Also type inference is not

Also type inference is not normally a requirement. Although there are some convenience tools which try to extrapolate a schema from some sample data, they're not usually relied on.

Also sometimes the trade-off may be in favour serialising the data as compactly as possible, hence omitting type constructors/tags wherever possible from the serialisation. Which obviously makes type inference harder to do reliably from the serialised data. It's assumed you'll have the schema around in order to deserialize correctly.

Data description calculus

Hi Matt,

You might find our recent JACM article of interest:

"The Next 700 Data Description Languages"

(available here:

While I'm not sure if our work is exactly what you're looking for, the related work and references will hopefully be useful. You might also take a look at the XSugar papers which address the theoretical aspects of round-tripping quite nicely. Finally, Nate Foster and others have a series of papers on "lenses" which you might also find relevant from the theory side.


Just the sort of thing I was

Just the sort of thing I was looking for, cheers!


Clickable link for the lazy:

More general than lenses would be the Boomerang project. General in the sense it provides a framework for constructing lenses and putting the theory into practice. Some silly people, like me, sometimes appreciate playing with a tool or seeing an idea in multiple forms to appreciate its significance.

... it's all the same people.

... it's all the same people (which isn't a bad thing)

Look at Wadler's papers on

Look at Wadler's papers on formalizing XML Schema:

Frank is back?

Have I not been paying attention in the past five years or so, or are we really witnessing Frank's comeback?

(Let's see, the last post I remember was about six years ago.)


Well, I've posted a couple of things since then, but not a lot.

As for "being back": I dunno. We'll see.

Seems so

Eight posts spread out over six days is definitely hanging around, at least for a while. Welcome back!

back to the future

Even if it's only for a little while, it's good to see you back (or at least posting more frequently).

A formal comparison of

A formal comparison of conceptual data modeling languages is probably a good start. See also Description Logics. Quoting:

The main effort of the research in knowledge representation is providing theories and systems for expressing structured knowledge and for accessing and reasoning with it in a principled way. Description Logics are considered the most important knowledge representation formalism unifying and giving a logical basis to the well known traditions of Frame-based systems, Semantic Networks and KL-ONE-like languages, Object-Oriented representations, Semantic data models, and Type systems.

I think the field of formal verification of data (modeling) languages has been ignored by the LtU community. Maybe this is due to an abuse of graphical modeling tools (UML/etc). I think that a formal approach to data modeling (at any level) would be really useful (in order to avoid inconsistencies in data, for example).

Regarding your YAML: I think that RelaxNG Compact is very good!.