Strongly-Typed Language Support for Internet-Scale Information Sources: F# Type Providers

Don Syme, Keith Battocchi, Kenji Takeda(1), Donna Malayeri, Jomo Fisher, Jack Hu, Tao Liu, Brian McNamara, Daniel Quirk, Matteo Taveggia, Wonseok Chae, Uladzimir Matsveyeu(2), Tomas Petricek(3)

1 Microsoft Research, Cambridge, United Kingdom
2 Microsoft Corporation, Redmond WA, USA
3 University of Cambridge, United Kingdom


...Most modern applications incorporate one or more external information sources as integral components. Providing strongly typed access to these sources is a key consideration for strongly-typed programming languages, to ensure low impedance mismatch in information access...

In this report we describe the design and implementation of the type provider mechanism in F# 3.0 and its applications to typed programming with web ontologies, web-services, systems management information, database mappings, data markets, content management systems, economic data and hosted scripting. Type soundness becomes relative to the soundness of the type providers and the schema change in information sources, but the role of types in information-rich programming tasks is massively expanded, especially through tooling that benefits from rich types in explorative programming.

What do you think of this approach?

Beyond Java

A few years ago Bruce Tate wrote a book called Beyond Java, which basically asked what the next great advancement would be after Java. He suggested dynamic languages, but what he was really trying to say was that explicit typing was the next piece of accidental complexity developers would try to eliminate.

Dynamic languages never could eliminate such accidental complexity, only shift and hide it.

F# type providers get us much closer, but the problem Don is punting on is this nasty little caveat:

Type soundness becomes relative to the soundness of the type providers and the schema change in information sources,

In other words, the so-called "traditional type bridging mechanisms don't scale" argument, while correct, does nothing to demonstrate a null hypothesis for F# Type Providers. For example, one reason things like WSDL-driven code proxy generators "don't scale" is that there is no guarantee the code they generate is sound with respect to the WSDL. I can provide two real world examples:

1) In Visual Studio 2010, a breaking change was introduced to the service proxy generator utility. Basically, the tool tries to intelligently decide how to serialize/de-serialize XML for you. The problem is, in general, it can intelligently decide to generate code which will silently fail to de-serialize XML correctly. WS-* faults get de-serialized as exception objects without exception info. In general, the .NET XmlSerializer has many serialization issues that cause silent failures.

2) In Java 5, some bright soul at Sun Microsystems broke backward compatibility in the BigDecimal class, by changing the string representation of an object given by calling toString. Consumers who want pre-Java-5 behavior now have to call a new method, toPlainString. Otherwise, they risk toString returning an engineering-notation representation instead of "plain notation". Every major Java open source framework has had to patch around the issue on the service side.
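
The behavior change is easy to reproduce. A minimal sketch, assuming any Java 5+ runtime:

```java
import java.math.BigDecimal;

public class BigDecimalNotation {
    public static void main(String[] args) {
        BigDecimal tiny = new BigDecimal("0.0000001");
        // Since Java 5, toString() uses scientific notation when the
        // adjusted exponent is less than -6:
        System.out.println(tiny.toString());      // "1E-7"
        // toPlainString() was added to recover the pre-Java-5 behavior:
        System.out.println(tiny.toPlainString()); // "0.0000001"
    }
}
```

Any consumer that round-trips the toString output through a parser expecting plain decimal notation breaks silently on exactly this kind of value.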

On the client side, Sun "solved" the problem by changing all their clients to accept engineering notation as valid XML WS-* schema decimals, which is contrary to the specification. Yet this still leaves all .NET clients broken. In WCF, there is no widely known good way to work around this issue from the client side. Automatically generating C# proxy classes from the WSDL produces code that uses the .NET Decimal data type, which WCF in turn de-serializes using the default NumberStyle settings. In other words, WCF cannot de-serialize engineering notation into a Decimal-backed property, which means the output of proxy-generation tools needs either hand-editing or custom code.
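
One generic client-side workaround is to validate and normalize the wire string before it reaches the deserializer. This is a hypothetical sketch in Java rather than C#/WCF, and the regex is a simplification of the xs:decimal lexical space (which, per the XML Schema spec, does not permit exponent notation):

```java
import java.math.BigDecimal;

public class DecimalNormalizer {
    // Simplified xs:decimal lexical space: optional sign, digits, optional
    // fractional part. Exponent forms like "1E-7" are not in the lexical space.
    private static final String XSD_DECIMAL = "[+-]?(\\d+(\\.\\d*)?|\\.\\d+)";

    public static boolean isPlainDecimal(String s) {
        return s.matches(XSD_DECIMAL);
    }

    // Accept engineering notation leniently, but re-emit plain notation
    // so a strict xs:decimal consumer can parse it.
    public static String normalize(String s) {
        return new BigDecimal(s).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(isPlainDecimal("1E-7")); // false: violates xs:decimal
        System.out.println(normalize("1E-7"));      // "0.0000001"
    }
}
```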

Given these examples, where I waste much of my time debugging and solving such issues, how can F# Type Providers really help? I think they cannot. Ultimately, it is type soundness that matters, and you can still hit issues 1 and 2 above even with type providers.

Anyway, feel free to disagree with me, but you asked for any old 2 cents.

Code Distribution, not Type Providers

The better approach to solve this problem is an ability to distribute code to operate on our behalf near the remote resources or services.

Given this ability, we can create programs that compile into shards distributed across the resources - i.e. first class abstraction of overlays. Those shards can perform ad-hoc data transforms to better fit each particular client of a service; further, at compilation, communication between the shards can be strongly typed internally even if they perform some serialization or parsing at the edges. Runtime upgrade and remote debugging is achievable by replacing the remote shards. This technique is robust to version updates of protocols, clients, services, and data types. It is very flexible and extensible.

With first-class abstraction of overlay networks, some benefits of type providers can be achieved through more local representations, such as type classes or SYB techniques. More importantly, many problems that type providers are intended to address can be avoided in the first place - i.e. there is less pressure to package all data into 'standard' formats and protocols for network distribution; instead, send some code over to interact with local APIs.

Today, this technique can be achieved within the scope of web services and cloud applications. The Opa language is a recent example. But the technique has been developed many times at more constrained scopes (e.g. in Smalltalk/Croquet, ToonTalk, Oz, Alice, E, OCaml). Cloud Haskell is a potential target as well, albeit a relatively new one.

A major constraint on scope for this technique has been various security and resource control concerns - i.e. controlling distribution of authority, eliminating risk of denial-of-service attacks - plus addressing the normal challenges of distributed programming. A language with good properties in these areas would be a better candidate for shards and overlays to operate near services and resources whose administrators do not fully trust their clients (i.e. sane administrators). It would also support deeper mashups and better open extension.

There is a lot to win by focusing on security, distribution, and performance control in PL designs. F* is probably the most promising language I've recently seen for this purpose. Of course, I have my own designs that I've been pursuing, but I haven't turned them into a language yet.

Code Distribution, not Type Providers

Just to mention this is not an either-or - the two approaches can be complementary. For the work covered by the tech-report, strongly typed queries (normally LINQ queries) are used as the primary mechanism for moving declarative logic - though not arbitrary code - to heterogeneous servers. For the case of information services, type providers give the scalable complement to LINQ queries. In many ways I see them as "LINQ 2.0" (though ErikM may well have that term reserved :-)).

Kind regards

We deal with these issues already

In any system that provides read/write invariance for data, or which can distribute parts of a computation over a network, the serialization interface, i.e., the methods for converting values of a type to and from a string representation, must be implemented for all data types. This is broadly (and on such systems properly) regarded as part of a type's definition.
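
As a hypothetical sketch of what "serialization as part of a type's definition" might look like (the `Wire` interface and `PointWire` names are illustrative, not from any real library):

```java
// A serialization interface treated as part of a type's contract:
// every distributable type must say how it reads and writes itself.
interface Wire<T> {
    String toWire(T value);   // value -> string representation
    T fromWire(String repr);  // string representation -> value
}

// An example implementation for a simple 2-element point.
class PointWire implements Wire<int[]> {
    public String toWire(int[] p) { return p[0] + "," + p[1]; }
    public int[] fromWire(String s) {
        String[] parts = s.split(",");
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }
}

public class WireDemo {
    public static void main(String[] args) {
        Wire<int[]> w = new PointWire();
        int[] p = w.fromWire(w.toWire(new int[] { 3, 4 }));
        System.out.println(p[0] + " " + p[1]); // round-trips: 3 4
    }
}
```

The breakage the parent comment describes is exactly what happens when the two ends of a connection hold incompatible `fromWire`/`toWire` pairs for "the same" type.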

At the syntactic level, this kind of breakage happens when a serialization interface fails to get distributed along with (or ahead of) values of that type, or when a value of that type is crossing a boundary into a system whose serialization interface for the same type is incompatible.

At a semantic level, it's a bit murkier. If system A can represent exact rationals and system B only has IEEE-float approximations to them, then when a "number" is transmitted from A to B, even if the serialization routines are compatible, the value received is not quite the same as the value transmitted. If A's integers have a 64-bit representation and B's integers have a 32-bit representation, then "5" may be emitted for an integer value by A, and "successfully" accepted by B as an integer having "the same value" -- but even though the value has been successfully transmitted, it will have slightly different semantics in the two systems because the operations on it will have different overflow/underflow behavior.
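
The overflow point is concrete: the same arithmetic on "the same value" gives different answers under 32-bit and 64-bit integer semantics. A minimal Java illustration, using int and long as stand-ins for systems B and A:

```java
public class OverflowSemantics {
    public static void main(String[] args) {
        int  a32 = 2_000_000_000;  // "the same value" in a 32-bit system
        long a64 = 2_000_000_000L; // and in a 64-bit system
        // Both ends "successfully" received 2000000000, but the
        // operations on it now behave differently:
        System.out.println(a32 + a32); // -294967296 (wraps around in 32 bits)
        System.out.println(a64 + a64); // 4000000000
    }
}
```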

And in this world where most programming is figuring out how to get different systems to communicate successfully with each other, this kind of type impedance is a very familiar problem.

What is *really* needed to make the problem go away is a standard for serialization. We used to do that with S-expressions and Lispy data serialization, but unless you were sucking it into another compatible Lisp system, where all values of user-defined types are also readable as values of some standard type (usually an array or list), the meaning and schema of each layout remained implicit and could potentially be misread on the other end.

XML tried to improve that, with its DTDs and schemas. The result is something we've grown to love and hate. The syntactic and semantic compatibility is in fact somewhat better than we got with S-expressions.

I *still* hate to admit that. Before I really thought about what DTDs are and do, I used to believe XML was strictly inferior to S-expressions. The actual serialized data transmissions have horrifically awful, bandwidth-wasteful syntax, and standard practice forces ridiculously high levels of latency with multiple roundtrips as related resources and inevitably the DTD/schema itself get recursively loaded. It's just an awful way to do things. But, sigh, yes, it's more reliable and transparent than S-expressions, because the DTDs explain in a machine-readable way what the data means. Transmitting bare S-expressions never made machine-readable explanations of their meanings available. So we accept the horrible wasteful syntax and the extra latency for the sake of marginally better semantics. Two steps forward, one-and-three-quarter steps back.

But even for all the bandwidth it uses/wastes and all the latency it forces, it's still not really reliable. It's okay for transmitting values, despite occasional breakages when someone like Sun or Microsoft decides it's okay to change the representation of a fundamental type. But it isn't a means of ensuring that those values really and truly mean the same thing (i.e., have the same semantics in some deep sense) on both ends of the transmission.


What is *really* needed to

What is *really* needed to make the problem go away is a standard for serialization.

Yeah, because the answer to too many inadequate standards is to build another one.

Rather than standard serialization, I'd focus on consistent semantics for types - e.g. so we don't have murky 32-bit vs. 64-bit issues. Floating point models should semantically include an error-model, in which case it would be much easier to ensure that the error constraints are respected.
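
For instance, narrowing a value from a 64-bit float representation to a 32-bit one already changes its meaning with no error signalled anywhere; a semantics with an explicit error model would at least make that loss visible. A tiny Java illustration of the silent part:

```java
public class FloatSemantics {
    public static void main(String[] args) {
        double d = 0.1;      // nearest 64-bit double to 1/10
        float  f = (float) d; // nearest 32-bit float to 1/10
        // Round-tripping through the narrower type changes the value,
        // and nothing in the type system records the introduced error:
        System.out.println(d == (double) f); // false
        System.out.println((double) f - d);  // the silently introduced error
    }
}
```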