"typed" files OR xml OR meta information for delim. files.

I hope this isn't off topic. It seems that a very large number of text files are delimited files of some sort (.csv, .tab, etc.). Awk seems to expect such files and cleanly allows dealing with only specific 'fields' in the file. Have there been any attempts to introduce some sort of (easy to use) metadata that describes the layout of the file and, perhaps more importantly, the type of data in it? Excel or specific database files are obviously tied to the application that created them.

XML works well here: it defines data types and some relationships among data points, but it is too verbose (a file with 5 columns but thousands of lines would be many times larger if xmlized).
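To make the question concrete, here is a minimal sketch of what such lightweight metadata could look like: a single header line of `name:type` pairs in front of otherwise ordinary delimited data. The `name:type` convention, the `read_typed` function and the sample data are all made up for illustration; this is not an existing standard.

```python
import csv
import io

# Hypothetical convention: the first line names each column and its type,
# e.g. "id:int,name:str,price:float". A column without ":type" is a string.
CONVERTERS = {"int": int, "float": float, "str": str}


def read_typed(text, delimiter=","):
    """Parse delimited text whose header line carries 'name:type' pairs."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    columns = []
    for field in header:
        name, _, type_name = field.partition(":")
        columns.append((name, CONVERTERS[type_name or "str"]))
    rows = []
    for raw in reader:
        if len(raw) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(raw)}")
        # Converting each value is where a type error in the data shows up.
        rows.append({name: convert(value)
                     for (name, convert), value in zip(columns, raw)})
    return rows


SAMPLE = "id:int,name:str,price:float\n1,widget,2.5\n2,gadget,10\n"
rows = read_typed(SAMPLE)
```

The overhead is one line per file rather than tags around every value, and a value that does not match its declared type fails loudly when the file is read.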

I've come across some information about type systems which describe memory layout (which I don't really understand yet). Could something like that also be used to describe disk files?



Take a look at the PADS language by Kathleen Fisher and Robert Gruber.

at javaone people kept mentio

at javaone people kept mentioning "infosets" as the solution to xml verbosity. i have no idea what they were talking about - it's on my list of things to track down. so far i've come up with this ("infoset" seems to be simply a formal way of looking at the data in an xml document; "binary infoset" seems more like what was intended, or maybe "binary encoding of infoset"?).


XML Information Set

It's basically the (or, a notion of) semantic content of an XML document. Morally speaking, it is what an "XML parser" ought to provide to an "XML application" after parsing an XML document.

ASN Fast Infoset

They were probably talking about the ASN.1-based Fast Infoset, which is supported in Java Web Services and has an open-source implementation. See this article. The X.891 standard should be available here any day now.

perhaps a general format

I don't know how various formatted files are organized (gif/jpeg/doc/etc). Perhaps there should be a very generic description of a file in its header: how its data is laid out, the types of that data, etc. This would help with performance and efficiency because the file system would know some general things about all files in it. I started wondering about it because by having 'well described' files, we could further reduce errors in programs. I've read recently about programs that can be checked for constraints (I guess Java does that) or where a programmer can provide explicit proofs. Extending such constraints or proofs (statically checked) to files should prove beneficial.
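As a rough sketch of a file that carries its own layout description, the following uses Python's `struct` format strings as the description language. The container layout (a 2-byte length, the format string, then fixed-size records) and the names `write_described` and `read_described` are invented here for illustration, not taken from any real file format.

```python
import io
import struct


# Hypothetical container: a 2-byte big-endian length, then that many ASCII
# bytes holding a struct format string, then records packed in that format.
def write_described(stream, fmt, records):
    encoded = fmt.encode("ascii")
    stream.write(struct.pack(">H", len(encoded)))
    stream.write(encoded)
    for record in records:
        stream.write(struct.pack(fmt, *record))


def read_described(stream):
    (length,) = struct.unpack(">H", stream.read(2))
    fmt = stream.read(length).decode("ascii")
    size = struct.calcsize(fmt)  # record size follows from the description
    records = []
    while True:
        chunk = stream.read(size)
        if not chunk:
            break
        if len(chunk) != size:
            raise ValueError("truncated record")
        records.append(struct.unpack(fmt, chunk))
    return fmt, records


# "<id8s": little-endian 32-bit int, 64-bit float, 8-byte string field.
buf = io.BytesIO()
write_described(buf, "<id8s", [(1, 2.5, b"widget"), (2, 10.0, b"gadget")])
buf.seek(0)
fmt, records = read_described(buf)
```

A reader that knows only the container convention can recover both the layout and the typed values. The string field comes back padded with NUL bytes to its declared width, which is the kind of detail a richer description language would have to address.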

Once we have a framework for formal 'description' of a file, we can start adding information such as: existence of read/write only bias, frequency of updates, security (stored with file so these attributes travel along on networks), expiration, etc., etc.

Anyway, I'm thinking as I'm writing this.


Anyway, I'm thinking as I'm writing this.

Often I feel that way too, but sometimes I find out that it wasn't true ;-)