The talk was dedicated to semantic filtering: the translation of a
document from one formal notation to another while preserving its
meaning. Examples include translating code from Fortran to C, an
image from the XWD to the TIFF format, or a database query into a
query for a differently structured database. In a simpler case,
semantic filtering is porting, e.g., a C program from Unix to Windows.
The semantics of the source language -- to be preserved during the
translation -- is represented compositionally by Horn clauses. This
makes the semantics directly executable by a logic programming system
and permits automatic generation of provably correct translators. In
many cases, a reverse translator can also be generated automatically
from the same specification.
The paper introduces the approach and describes it in more detail on
concrete applications: translating the Nemeth Math Braille notation
to LaTeX, translating HTML to VoiceXML, and providing
interoperability between several bioinformatics genome-analysis
packages.
The semantics of the source language to be preserved during
translation is denotational semantics. As the paper explains,
denotational semantics has three components:
 - a mapping from a concrete document to its parse tree (syntax);
 - the mathematical objects that express the meaning -- typically
   sets or lattices of values (semantic domain);
 - a mapping from the parse tree to elements of the semantic domain
   (valuation function).
In the approach of the paper, the valuation function is given as a set
of Horn clauses, and the semantic domain is the abstract or concrete
syntax of the target language. Thus the meaning of the source language
is given in terms of the target language.
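To make this concrete, here is a minimal sketch of what such a
valuation function might look like for a toy arithmetic language. The
predicate 'ast_relation/2' and all constructor names are mine, not the
paper's: it relates a source AST to a LaTeX-like target AST.

    % One Horn clause per source construct; the meaning of a term
    % is composed from the meanings of its immediate subterms.
    ast_relation(num(N),     tex_num(N)).
    ast_relation(var(V),     tex_var(V)).
    ast_relation(plus(X, Y), tex_seq(TX, '+', TY)) :-
        ast_relation(X, TX),
        ast_relation(Y, TY).
    ast_relation(pow(B, E),  tex_sup(TB, TE)) :-   % printed as TB^{TE}
        ast_relation(B, TB),
        ast_relation(E, TE).

Compositionality is evident: each clause mentions only the meanings of
the immediate subterms.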
The translation procedure is then as follows: parse the source
document into its Source_AST, prove the goal
'ast_relation(Source_AST,Target)', and pretty-print the resulting
'Target' AST.
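With the hypothetical sketch above, the translation goal for
pow(x,y)+1 would run as:

    ?- ast_relation(plus(pow(var(x), var(y)), num(1)), Target).
    Target = tex_seq(tex_sup(tex_var(x), tex_var(y)), +, tex_num(1)).

Pretty-printing that Target then yields x^{y}+1.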
The syntax of the source language is given as a Prolog Definite Clause
Grammar (DCG). Thus both the syntax and the semantics of the source
language are specified declaratively as Horn clauses. The
specification can be "executed" by a Prolog system, which is
tantamount to translation.
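Continuing the toy example, the syntax side might be a DCG such as the
following (again hypothetical, operating on a list of tokens), with a
driver clause that chains parsing and the semantic relation:

    expr(plus(X, Y)) --> term(X), [+], expr(Y).
    expr(X)          --> term(X).
    term(pow(B, E))  --> base(B), [^], term(E).
    term(X)          --> base(X).
    base(num(N))     --> [N], { number(N) }.
    base(var(V))     --> [V], { atom(V), V \== (+), V \== (^) }.

    % Parse, then prove the semantic relation: that is the translator.
    translate(Tokens, Target) :-
        phrase(expr(AST), Tokens),
        ast_relation(AST, Target).

For instance, translate([x, ^, y, +, 1], T) binds T to the target AST
shown earlier.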
Since the denotational program (which expresses the meaning of one AST
in terms of the other) is a logic program, it can also be used for
verification of the source. A reverse translator is possible as well.
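Indeed, with the pure 'ast_relation/2' sketch above, one can bind the
target AST and ask for the source:

    ?- ast_relation(Source, tex_sup(tex_var(x), tex_var(y))).
    Source = pow(var(x), var(y)).

Nothing in the clauses distinguishes "input" from "output", so the
same specification runs in both directions -- provided the clauses
stay pure, avoiding cuts, arithmetic, and other non-logical features.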
Prolog's definite clause grammars (DCG) can handle context-sensitive
languages. As the paper notes, many languages created by domain
experts (rather than computer scientists) turn out to be
context-sensitive.
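The standard illustration is the language a^n b^n c^n, which is not
context-free but falls out of a DCG once nonterminals carry an
argument (a sketch, using unary-encoded counts):

    anbncn --> as(N), bs(N), cs(N).

    as(0)    --> [].
    as(s(N)) --> [a], as(N).
    bs(0)    --> [].
    bs(s(N)) --> [b], bs(N).
    cs(0)    --> [].
    cs(s(N)) --> [c], cs(N).

Here phrase(anbncn, [a,a,b,b,c,c]) succeeds, while
phrase(anbncn, [a,b,b,c]) fails.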
The translation from the Nemeth Math Braille notation to LaTeX is a
very impressive application. The extremely convoluted Nemeth Math
Braille notation was first specified in 1951. The notation is highly
context-sensitive, so as to let the reader know as much of the context
of a phrase as possible. For example, pow(x,y)+1 is written in the
Nemeth Math Braille notation as x^y"+1 (the double quote means
returning to the base level), whereas pow(a,(pow(x,y)+1))+2 is written
as a^x^^y^+1"+2. The notation was specified as several hundred
examples; no formal description has ever existed. Machine translation
from the full Nemeth Math Braille notation was considered
_impossible_. With the logic programming approach, a translator to
LaTeX was built in only 3-4 man-months of work!
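To give a flavor of the level bookkeeping, here is a hedged sketch (my
reconstruction, not the paper's code) of a DCG that serializes the toy
AST from above into Nemeth-style level indicators, threading the
current superscript level through the clauses:

    % nem(AST, Level): emit tokens for AST while at superscript
    % level Level.
    nem(pow(B, E), L)  --> nem(B, L),
                           { L1 is L + 1 },
                           level(L1),        % enter the superscript
                           nem(E, L1),
                           level(L).         % return to the old level
    nem(plus(X, Y), L) --> nem(X, L), [+], nem(Y, L).
    nem(num(N), _)     --> [N].
    nem(var(V), _)     --> [V].

    level(0) --> ['"'].                 % double quote: back to base level
    level(N) --> { N > 0 }, carets(N).  % N carets: now at level N

    carets(0) --> [].
    carets(N) --> { N > 0 }, ['^'], { N1 is N - 1 }, carets(N1).

Running phrase(nem(Ast, 0), Tokens) on the AST of pow(a,(pow(x,y)+1))+2
indeed produces the token sequence a ^ x ^ ^ y ^ + 1 " + 2; edge cases
(such as a redundant trailing return indicator) are ignored in this
sketch.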
Another impressive application discussed in the paper is making
several genome-analysis packages interoperate (see www.swbic.org for
background). A typical analysis includes querying a database of
genetic information, aligning sequences, and inferring phylogenetic
information. The packages that run each stage read and write generally
incompatible data formats. Logic programming has been used to build
several _bi-directional_ translators, with NEXUS as an internal
format. Parsing NEXUS was challenging: NEXUS was developed by
computational biologists to be human-readable and highly general. As a
result, the format is horrific. For example, NEXUS permits nested
comments and comments inside tokens! DCG grammars proved sufficient,
but they had to resort to explicit non-determinism [guessing and
backtracking]. The NEXUS translator was produced in 6 man-months. It
was the first complete NEXUS translator. As the presenter mentioned,
the translation speed is good: translating 100K of bioinformatics data
from one format to another takes seconds.
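Nested comments, for instance, are beyond regular expressions but
straightforward in a DCG. A minimal sketch, assuming (hypothetically,
as a simplification of the real grammar) that comments are delimited
by square brackets and the input is a list of character atoms:

    comment --> ['['], comment_body.

    comment_body --> [']'].                    % end of this comment
    comment_body --> ['['], comment_body,      % a nested comment...
                     comment_body.             % ...then the rest
    comment_body --> [C], { C \== '[', C \== ']' }, comment_body.

Thus phrase(comment, ['[',a,'[',b,']',c,']']) succeeds. Comments
inside tokens would require further, uglier machinery in the
token-level rules.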
The paper also discusses HTML to VoiceXML conversion. However, the
authors used a subset of HTML (I believe, with explicitly closed
tags). They are still working on a translator from full HTML (where
closing tags can be omitted).
The paper concludes that
 - the speed of creating translators (a matter of several man-months
   for data formats previously considered intractable),
 - the ability to get the reverse translator for free,
 - the provable correctness of the generated translators,
 - a specification of the language semantics that is well suited to
   machine analyses and deductions
could make semantic filtering a killer application for logic
programming.
Related approaches:
S. Stepney. High Integrity Compilation. Prentice Hall, 1993.
   Building provably correct compilers by declaratively specifying the
   mapping between parse-tree patterns and machine instructions.
Non-LALR parser generators: ANTLR, a pred-LL(k) parser generator by
T. Parr: www.antlr.org
-- Oleg