MURATA Makoto (FAMILY Given)
Fuji Xerox Information Systems
On the basis of the tree automaton theory, this paper demonstrates DTD transformation. Controlled by patterns and contextual conditions, operators transform not only XML documents, but also DTD's. It is guaranteed that transformation of XML documents permitted by the input DTD creates XML documents permitted by the output DTD. Furthermore, the output DTD is minimally sufficient. Patterns are conditions on (possibly non-immediate) subordinate nodes, and contextual conditions are conditions on non-subordinate nodes (e.g., superior nodes, ancestor nodes, sibling nodes, and subordinates of sibling nodes).
XML, DTD, XML document, tree automaton, transformation, pattern, contextual condition
This presentation shows transformations of DTD's and XML documents. Each operator transforms not only XML documents, but also DTD's. The output instances are guaranteed to conform to the output DTD's. Furthermore, the output DTD's are minimally sufficient; in other words, they allow only those XML documents which may be generated.
Operators are controlled by patterns and contextual conditions. Patterns are conditions on (possibly non-immediate) subordinate nodes, and contextual conditions are conditions on non-subordinate nodes (e.g., superior nodes, ancestor nodes, sibling nodes, and subordinates of sibling nodes).
Such DTD transformation is highly important for at least reasons. First, it helps DTD evolution; by writing an update program with operators, we can update not only instances but also DTD's. Second, transformation among different DTD's become a lot easier; we can examine DTD's created by transformation operators.
This work is based on the theory of tree automatons. A DTD is first translated to a tree automaton. This tree automaton is then repeatedly transformed. Finally, this tree automaton is translated back to a DTD. Patterns and contexual conditions of operators are also captured as tree automatons.
In this presentation, we demonstrate an example of DTD transformation rather than providing theoretical details. Such details are presented in another article "Transformation of Documents and Schemas by Patterns and Contextual Conditions" that is to appear in the proceedings of Principles of Document Processing '96 (Springer LNCS).
Suppose that we have documents conforming to a source DTD as follows:
<!ELEMENT doc (sec-a | sec-b)*>
<!ELEMENT sec-a (title, para*, subsec*)>
<!ELEMENT sec-b (title, para*, subsec*)>
<!ELEMENT subsec (title, para*, subsec*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT para (#PCDATA)>
Suppose that para's and titles below sec-b's and those below sec-a's must always be handled differently. We might thus want to transform each of the documents by performing the following operations in sequence:
We also want to generate a target DTD by transforming the source DTD with these operations. In this case, we want to generate a DTD as follows:
<!ELEMENT doc (sec-a | sec-b)*>
<!ELEMENT sec-a (title-a, para-a*, subsec-a*)>
<!ELEMENT sec-b (title-b, para-b*, subsec-b*)>
<!ELEMENT subsec-a (subtitle-a, para-a*, subsec-a*)>
<!ELEMENT subsec-b (subtitle-b, para-b*, subsec-b*)>
<!ELEMENT title-a (#PCDATA)>
<!ELEMENT title-b (#PCDATA)>
<!ELEMENT subtitle-a (#PCDATA)>
<!ELEMENT subtitle-b (#PCDATA)>
<!ELEMENT para-a (#PCDATA)>
<!ELEMENT para-b (#PCDATA)>
For the rest of this document we study the construction steps (see below) of this DTD:
We first convert the source DTD to a tree automaton as follows:
Doc ::= doc <(Sec-A | Sec-B)*>
Sec-A ::= sec-a <Title Para* Subsec*>
Sec-B ::= sec-b <Title Para* Subsec*>
Subsec ::= subsec <Title Para* Subsec*>
Title ::= title<"">
Para ::= para<"">
This is not a DTD but rather a tree automaton: unlike DTD's, we have states between element types and content models. This automaton becomes the source automaton for the next step. The last line means that "Given a para node and states for data characters, the transition function provides a state Para." The first line means that "Given a doc node and a sequence of states Sec-A or Sec-B, the transition function provides a state Doc." The final state of this automaton is Doc.
Alternatively, one can assume that a tree automaton is an extension of XML DTD's. The only difference is the introduction of non-terminals. An element declaration specifies a non-terminal, and content models reference to non-terminals rather than to element types.
Pattern
Our pattern is title<#*>, where # denotes any tree. Any subtree rooted by a title node matches this pattern. A determinstic tree automaton that accepts such subtrees is as follows:
<!ELEMENT NOT title [q0] ((q0 | q1)*)>
<!ELEMENT title [q1] ((q0 | q1)*)>
Here q0, q1 are states, and q1 is the final state.
"NOT title" is merely syntax sugar. Instead of writing an element declaration for every element type except "title", we write "NOT title".
The contextual condition of this operation is that a sec-b must appear as an immediate superior.
The present example does not use conditions on siblings or siblings of ancestors but only uses conditions on ancestors. Thus, the required construction is significantly simpler than in the generalized case. Construction for the general case is presented in my PODP paper.
First, we create a deterministic (string) automaton that accepts the set of paths containing a "sec-b" as an immediate superior, namely ".* sec-b".
States: f0, f1
Start state: f0
Final state(s): f1
Transition:
tr(f0, sec-b) = f1
tr(f0, x) = f0 where x != sec-b
tr(f1, sec-b) = f1
tr(f1, x) = f0 where x != sec-b
Next, we create a non-deterministic unambiguous tree automaton. The states of this tree automaton are the states of the string automaton, namely f0 and f1. The final states of our tree automaton are the start state of the string automaton, namely f0. Given a symbol x, the transition function of our tree automaton returns a state f if and only if the current state sequence is in {tr(f,x)}*.
As a result, we construct a tree automaton as follows:
<ELEMENT sec-b [f0] (f1*)>
<ELEMENT NOT sec-b [f0] (f0*)>
<ELEMENT sec-b [f1] (f1*)>
<ELEMENT NOT sec-b [f1] (f0*)>
A match-identifying automaton performs the computation of the original tree automaton as well as the computation of the tree automaton constructed from the pattern and that constructed from the contextual condition. We create a match-identifying automaton by augumenting the original tree automaton with the two automaton by the intersecion operation.
The match-identifying automaton is shown below, where the final state is Doc-q0-f0 and the marked state is Title-q1-f1.
<ELEMENT doc [Doc-q0-f0] ((Sec-A-q0-f0 | Sec-B-q0-f0 )*)>
<ELEMENT sec-a [Sec-A-q0-f0] (Title-q1-f0, Para-q0-f0*, Subsec-q0-f0*)>
<ELEMENT sec-b [Sec-B-q0-f0] (Title-q1-f1, Para-q0-f1*, Subsec-q0-f1*)>
<ELEMENT subsec [Subsec-q0-f0] (Title-q1-f0, Para-q0-f0*, Subsec-q0-f0*)>
<ELEMENT subsec [Subsec-q0-f1] (Title-q1-f0, Para-q0-f0*, Subsec-q0-f0*)>
<ELEMENT title [Title-q1-f0] (#PCDATA)>
<ELEMENT title [Title-q1-f1] (#PCDATA)>
<ELEMENT para [Para-q0-f0] (#PCDATA)>
<ELEMENT para [Para-q0-f1] (#PCDATA)>
Target automaton
By renaming "title" in the rhs for Title-q1-f1, merging Para-q0-f0 and Para-q0-f1, merging Subsec-q0-f0 and Subsec-q0-f1, and appropriately renaming states, we have a tree automaton as follows:
<ELEMENT doc [Doc] ((Sec-A | Sec-B)*)>
<ELEMENT sec-a [Sec-A] (Title, Para*, Subsec*)>
<ELEMENT sec-b [Sec-B] (Title-B, Para*, Subsec*)>
<ELEMENT subsec [Subsec] (Title, Para*, Subsec*)>
<ELEMENT title [Title] (#PCDATA)>
<ELEMENT title-b [Title-B] (#PCDATA)>
<ELEMENT para [Para] (#PCDATA)>
Pattern
As in Step 2, our pattern is title<#*>, which is equivalent to a deterministic tree automaton as follows:
<ELEMENT NOT title [q0] ((q0 | q1)*)>
<ELEMENT title [q1] ((q0 | q1)*)>
Here q0, q1 are states, and q1 is the final state.
Contextual condition
The contextual condition is that a sec-a must appear as an immediate superior. As in Step 2, we construct a tree automaton as follows:
<ELEMENT sec-a [f0] (f1*)>
<ELEMENT NOT sec-a [f0] (f0*)>
<ELEMENT sec-a [f1] (f1*)>
<ELEMENT NOT sec-a [f1] (f0*)>
Match-identifying automaton
As in Step 2, we construct a match-identifying automaton (see below), where the final state is Doc-q0-f0 and the marked state is Title-q1-f1.
<ELEMENT doc [Doc-q0-f0] ((Sec-A-q0-f0 | Sec-B-q0-f0 )*)>
<ELEMENT sec-a [Sec-A-q0-f0] (Title-q1-f1, Para-q0-f1*, Subsec-q0-f1*)>
<ELEMENT sec-b [Sec-B-q0-f0] (Title-B-q0-f0, Para-q0-f0*, Subsec-q0-f0*)>
<ELEMENT subsec [Subsec-q0-f0] (Title-q1-f0, Para-q0-f0*, Subsec-q0-f0*)>
<ELEMENT subsec [Subsec-q0-f1] (Title-q1-f0, Para-q0-f0*, Subsec-q0-f0*)>
<ELEMENT title [Title-q1-f0] (#PCDATA)>
<ELEMENT title [Title-q1-f1] (#PCDATA)>
<ELEMENT title-b [Title-B-q0-f0] (#PCDATA)>
<ELEMENT para [Para-q0-f0] (#PCDATA)>
<ELEMENT para [Para-q0-f1] (#PCDATA)>
Target automaton
By renaming "title" in the rhs for Title-q1-f1, merging Para-q0-f0 and Para-q0-f1, merging Subsec-q0-f0 and Subsec-q0-f1, and appropriately renaming states, we have a tree automaton as follows:
<ELEMENT doc [Doc] ((Sec-A | Sec-B)*)>
<ELEMENT sec-a [Sec-A] (Title-A, Para*, Subsec*)>
<ELEMENT sec-b [Sec-B] (Title-B, Para*, Subsec*)>
<ELEMENT subsec [Subsec] (Title, Para*, Subsec*)>
<ELEMENT title [Title] (#PCDATA)>
<ELEMENT title-a [Title-A] (#PCDATA)>
<ELEMENT title-b [Title-B] (#PCDATA)>
<ELEMENT para [Para] (#PCDATA)>
Pattern
Our pattern is subsec<#*>. A determinstic tree automaton that accepts this language is as follows:
<ELEMENT NOT subsec [q0] ((q0 | q1)*)>
<ELEMENT subsec [q1] ((q0 | q1)*)>
Here q0, q1 are states, and q1 is the final state.
The contextual condition of this operation is that a "sec-b" must appear as a (possibly non-immediate) superior.
First, we create a deterministic (string) automaton that accepts the set of paths containing at least one "sec-b" as a (possibly non-immediate) superior, namely ".* sec-b .*".
States: f0, f1
Start state: f0
Final state(s): f1
Transition:
tr(f0, sec-b) = f1
tr(f0, x) = f0 where x != sec-b
tr(f1, x) = f1 for every x
Next, as in Step 2 we create a non-deterministic unambiguous tree automaton as follows:
<ELEMENT sec-b [f0] (f1*)>
<ELEMENT NOT sec-b [f0] (f0*)>
<ELEMENT ANY [f1] (f1*)>
Match-identifying automaton
We construct a match-identifying automaton (see below), where the final state is Doc-q0-f0 and the marked state is Subsec-q1-f1.
<ELEMENT doc [Doc-q0-f0]
((Sec-A-q0-f0 | Sec-B-q0-f0)*)>
<ELEMENT sec-a [Sec-A-q0-f0]
(Title-A-q0-f0, Para-q0-f0*, Subsec-q1-f0*)>
<ELEMENT sec-b [Sec-B-q0-f0]
(Title-B-q0-f1, Para-q0-f1*, Subsec-q1-f1*)>
<ELEMENT subsec [Subsec-q1-f0]
(Title-q0-f0, Para-q0-f0*, Subsec-q1-f0*)>
<ELEMENT subsec [Subsec-q1-f1]
(Title-q0-f1, Para-q0-f1*, Subsec-q1-f1*)>
<ELEMENT title [Title-q0-f0] (#PCDATA)>
<ELEMENT title [Title-q0-f1] (#PCDATA)>
<ELEMENT title-a [Title-A-q0-f0] (#PCDATA)>
<ELEMENT title-b [Title-B-q0-f1] (#PCDATA)>
<ELEMENT para [Para-q0-f0] (#PCDATA)>
<ELEMENT para [Para-q0-f1] (#PCDATA)>
Target automaton
By renaming "subsec" in the rhs for Subsec-q1-f1, merging Para-q0-f0 and Para-q0-f1, merging Title-q0-f0 and Title-q0-f1, and appropriately renaming states, we have a tree automaton as follows:
<ELEMENT doc [Doc] ((Sec-A | Sec-B)*)>
<ELEMENT sec-a [Sec-A] (Title-A, Para*, Subsec*)>
<ELEMENT sec-b [Sec-B] (Title-B, Para*, Subsec-B*)>
<ELEMENT subsec [Subsec] (Title, Para*, Subsec*)>
<ELEMENT subsec-b [Subsec-B] (Title, Para*, Subsec-B*)>
<ELEMENT title [Title] (#PCDATA)>
<ELEMENT title-a [Title-A] (#PCDATA)>
<ELEMENT title-b [Title-B] (#PCDATA)>
<ELEMENT para [Para] (#PCDATA)>
Pattern
As in Step 4, our pattern is subsec<#*>, which is equivalent to a determinstic tree automaton as follows:
<ELEMENT NOT subsec [q0] ((q0 | q1)*)>
<ELEMENT subsec [q1] ((q0 | q1)*)>
Here q0, q1 are states, and q1 is the final state.
Contextual condition
The contextual condition of this operation is that a "sec-a" must appear as a (possibly non-immediate) superior. As in Step 4, we construct a tree automaton as follows:
<ELEMENT sec-a [f0] (f1*)>
<ELEMENT NOT sec-a [f0] (f0*)>
<ELEMENT any [f1] (f1*)>
Match-identifying automaton
We construct a match-identifying automaton (see below), where the final state is Doc-q0-f0 and the marked state is Subsec-q1-f1.
<ELEMENT doc [Doc-q0-f0]
((Sec-A-q0-f0 | Sec-B-q0-f0)*)>
<ELEMENT sec-a [Sec-A-q0-f0]
(Title-A-q0-f1, Para-q0-f1*, Subsec-q1-f1*)>
<ELEMENT sec-b [Sec-B-q0-f0]
(Title-B-q0-f0, Para-q0-f0*, Subsec-B-q0-f0*)>
<ELEMENT subsec [Subsec-q1-f1]
(Title-q0-f1, Para-q0-f1*, Subsec-q1-f1*)>
<ELEMENT subsec-b [Subsec-B-q0-f0]
(Title-q0-f0, Para-q0-f0*, Subsec-B-q0-f0*)>
<ELEMENT title [Title-q0-f0] (#PCDATA)>
<ELEMENT title [Title-q0-f1] (#PCDATA)>
<ELEMENT title-a [Title-A-q0-f1] (#PCDATA)>
<ELEMENT title-b [Title-B-q0-f0] (#PCDATA)>
<ELEMENT para [Para-q0-f0] (#PCDATA)>
<ELEMENT para [Para-q0-f1] (#PCDATA)>
Target automaton
By renaming "subsec" in the rhs for Subsec-q1-f1, merging Para-q0-f0 and Para-q0-f1, merging Title-q0-f0 and Title-q0-f1, and appropriately renaming states, we have a tree automaton as follows:
<ELEMENT doc [Doc] ((Sec-A | Sec-B)*)>
<ELEMENT sec-a [Sec-A] (Title-A, Para*, Subsec-A*)>
<ELEMENT sec-b [Sec-B] (Title-B, Para*, Subsec-B*)>
<ELEMENT subsec-a [Subsec-A] (Title, Para*, Subsec-A*)>
<ELEMENT subsec-b [Subsec-B] (Title, Para*, Subsec-B*)>
<ELEMENT title [Title] (#PCDATA)>
<ELEMENT title-a [Title-A] (#PCDATA)>
<ELEMENT title-b [Title-B] (#PCDATA)>
<ELEMENT para [Para] (#PCDATA)>
Pattern
Our pattern is para<#*>. A determinstic tree automaton that accepts this language is as follows:
<ELEMENT NOT para [q0] ((q0 | q1)*)>
<ELEMENT para [q1] ((q0 | q1)*)>
Here q0, q1 are states, and q1 is the final state.
Contextual condition
The contextual condition of this operation is that a "sec-b" must appear as a (possibly non-immediate) superior. As in Step 4, we create a tree automaton as follows:
<ELEMENT sec-b [f0] (f1*)>
<ELEMENT NOT sec-b [f0] (f0*)>
<ELEMENT any [f1] (f1*)>
Match-identifying automaton
We construct a match-identifying automaton (see below), where the final state is Doc-q0-f0 and the marked state is Para-q1-f1.
<ELEMENT doc [Doc-q0-f0]
((Sec-A-q0-f0 | Sec-B-q0-f0)*)>
<ELEMENT sec-a [Sec-A-q0-f0]
(Title-A-q0-f0, Para-q1-f0*, Subsec-A-q0-f0*)>
<ELEMENT sec-b [Sec-B-q0-f0]
(Title-B-q0-f1, Para-q1-f1*, Subsec-B-q0-f1*)>
<ELEMENT subsec-a [Subsec-A-q0-f0]
(Title-q0-f0, Para-q1-f0*, Subsec-A-q0-f0*)>
<ELEMENT subsec-b [Subsec-B-q0-f1]
(Title-q0-f1, Para-q1-f1*, Subsec-B-q0-f1*)>
<ELEMENT title [Title-q0-f0] (#PCDATA)>
<ELEMENT title [Title-q0-f1] (#PCDATA)>
<ELEMENT title-a [Title-A-q0-f0] (#PCDATA)>
<ELEMENT title-b [Title-B-q0-f1] (#PCDATA)>
<ELEMENT para [Para-q1-f0] (#PCDATA)>
<ELEMENT para [Para-q1-f1] (#PCDATA)>
Target automaton
By renaming "para" in the rhs for Para-q1-f1, merging Title-q0-f0 and Title-q0-f1, and appropriately renaming states, we have a tree automaton as follows:
<ELEMENT doc [Doc] ((Sec-A | Sec-B)*)>
<ELEMENT sec-a [Sec-A] (Title-A, Para*, Subsec-A*)>
<ELEMENT sec-b [Sec-B] (Title-B, Para-B*, Subsec-B*)>
<ELEMENT subsec-a [Subsec-A] (Title, Para*, Subsec-A*)>
<ELEMENT subsec-b [Subsec-B] (Title, Para-B*, Subsec-B*)>
<ELEMENT title [Title] (#PCDATA)>
<ELEMENT title-a [Title-A] (#PCDATA)>
<ELEMENT title-b [Title-B] (#PCDATA)>
<ELEMENT para [Para] (#PCDATA)>
<ELEMENT para-b [Para-B] (#PCDATA)>
Pattern
As in Step 6, our pattern is para<#*>, which is equivalent to a determinstic tree automaton as follows:
<ELEMENT NOT para [q0] ((q0 | q1)*)>
<ELEMENT para [q1] ((q0 | q1)*)>
Here q0, q1 are states, and q1 is the final state.
Contextual condition
The contextual condition is the same as in Step 5. The tree automaton is:
<ELEMENT sec-a [f0] (f1*)>
<ELEMENT NOT sec-a [f0] (f0*)>
<ELEMENT ANY [f1] (f1*)>
Match-identifying automaton
We construct a match-identifying automaton (see below), where the final state is Doc-q0-f0 and the marked state is Para-q1-f1.
<ELEMENT doc [Doc-q0-f0]
((Sec-A-q0-f0 | Sec-B-q0-f0)*)>
<ELEMENT sec-a [Sec-A-q0-f0]
(Title-A-q0-f1, Para-q1-f1*, Subsec-A-q0-f1*)>
<ELEMENT sec-b [Sec-B-q0-f0]
(Title-B-q0-f0, Para-B-q0-f0*, Subsec-B-q0-f0*)>
<ELEMENT subsec-a [Subsec-A-q0-f1]
(Title-q0-f1, Para-q1-f1*, Subsec-A-q0-f1*)>
<ELEMENT subsec-b [Subsec-B-q0-f0]
(Title-q0-f0, Para-B-q0-f0*, Subsec-B-q0-f0*)>
<ELEMENT title [Title-q0-f0] (#PCDATA)>
<ELEMENT title [Title-q0-f1] (#PCDATA)>
<ELEMENT title-a [Title-A-q0-f1] (#PCDATA)>
<ELEMENT title-b [Title-B-q0-f0] (#PCDATA)>
<ELEMENT para [Para-q1-f1] (#PCDATA)>
<ELEMENT para-b [Para-B-q0-f0] (#PCDATA)>
Target automaton
By renaming "para" in the rhs for Para-q1-f1, merging Title-q0-f0 and Title-q0-f1, and appropriately renaming states, we have a tree automaton as follows:
<ELEMENT doc [Doc] ((Sec-A | Sec-B)*)>
<ELEMENT sec-a [Sec-A] (Title-A, Para-A*, Subsec-A*)>
<ELEMENT sec-b [Sec-B] (Title-B, Para-B*, Subsec-B*)>
<ELEMENT subsec-a [Subsec-A] (Title, Para-A*, Subsec-A*)>
<ELEMENT subsec-b [Subsec-B] (Title, Para-B*, Subsec-B*)>
<ELEMENT title [Title] (#PCDATA)>
<ELEMENT title-a [Title-A] (#PCDATA)>
<ELEMENT title-b [Title-B] (#PCDATA)>
<ELEMENT para-a [Para-A] (#PCDATA)>
<ELEMENT para-b [Para-B] (#PCDATA)>
Pattern
As in Step 2, our pattern is title<#*>, which is equivalent to a deterministic tree automaton as follows:
<ELEMENT NOT title [q0] ((q0 | q1)*)>
<ELEMENT title [q1] ((q0 | q1)*)>
Here q0, q1 are states, and q1 is the final state.
Contextual condition
The contextual condition of this operation is that a subsec-b must appear as an immediate superior. As in Step 2, we construct a tree automaton as follows:
<ELEMENT subsec-b [f0] (f1*)>
<ELEMENT NOT subsec-b [f0] (f0*)>
<ELEMENT subsec-b [f1] (f1*)>
<ELEMENT NOT subsec-b [f1] (f0*)>
Match-identifying automaton
We construct a match-identifying automaton (see below), where the final state is Doc-q0-f0 and the marked state is Title-q1-f1.
<ELEMENT doc [Doc-q0-f0]
((Sec-A-q0-f0 | Sec-B-q0-f0)*)>
<ELEMENT sec-a [Sec-A-q0-f0]
(Title-A-q0-f0, Para-A-q0-f0*, Subsec-A-q0-f0*)>
<ELEMENT sec-b [Sec-B-q0-f0]
(Title-B-q0-f0, Para-B-q0-f0*, Subsec-B-q0-f0*)>
<ELEMENT subsec-a [Subsec-A-q0-f0]
(Title-q1-f0, Para-A-q0-f0*, Subsec-A-q0-f0*)>
<ELEMENT subsec-b [Subsec-B-q0-f0]
(Title-q1-f1, Para-B-q0-f1*, Subsec-B-q0-f1*)>
<ELEMENT subsec-b [Subsec-B-q0-f1]
(Title-q1-f1, Para-B-q0-f1*, Subsec-B-q0-f1*)>
<ELEMENT title [Title-q1-f0] (#PCDATA)>
<ELEMENT title [Title-q1-f1] (#PCDATA)>
<ELEMENT title-a [Title-A-q0-f0] (#PCDATA)>
<ELEMENT title-b [Title-B-q0-f0] (#PCDATA)>
<ELEMENT para-a [Para-A-q0-f0] (#PCDATA)>
<ELEMENT para-b [Para-B-q0-f0] (#PCDATA)>
<ELEMENT para-b [Para-B-q0-f1] (#PCDATA)>
Target automaton
By renaming "title" in the rhs for Title-q1-f1, merging Para-B-q0-f0 and Para-B-q0-f1, merging Subsec-A-q1-f0 and Subsec-A-q1-f1, and appropriately renaming states, we have a tree automaton as follows:
<ELEMENT doc [Doc] ((Sec-A | Sec-B)*)>
<ELEMENT sec-a [Sec-A] (Title-A, Para-A*, Subsec-A*)>
<ELEMENT sec-b [Sec-B] (Title-B, Para-B*, Subsec-B*)>
<ELEMENT subsec-a [Subsec-A] (Title, Para-A*, Subsec-A*)>
<ELEMENT subsec-b [Subsec-B] (Subtitle-B, Para-B*, Subsec-B*)>
<ELEMENT title [Title] (#PCDATA)>
<ELEMENT subtitle-b [Subtitle-B] (#PCDATA)>
<ELEMENT title-a [Title-A] (#PCDATA)>
<ELEMENT title-b [Title-B] (#PCDATA)>
<ELEMENT para-a [Para-A] (#PCDATA)>
<ELEMENT para-b [Para-B] (#PCDATA)>
Pattern
As in Step 2, our pattern is title<#*>, which is equivalent to a deterministic tree automaton as follows:
<ELEMENT NOT title [q0] ((q0 | q1)*)>
<ELEMENT title [q1] ((q0 | q1)*)>
Here q0, q1 are states, and q1 is the final state.
Contextual condition
The contextual condition of this operation is that an subsec-a must appear as an immediate superior. As in Step 2, we construct a tree automaton as follows:
<ELEMENT subsec-a [f0] (f1*)>
<ELEMENT NOT subsec-a [f0] (f0*)>
<ELEMENT subsec-a [f1] (f1*)>
<ELEMENT NOT subsec-a [f1] (f0*)>
Match-identifying automaton
We construct a match-identifying automaton (see below), where the final state is Doc-q0-f0 and the marked state is Title-q1-f1.
<ELEMENT doc [Doc-q0-f0]
((Sec-A-q0-f0 | Sec-B-q0-f0)*)>
<ELEMENT sec-a [Sec-A-q0-f0]
(Title-A-q0-f0, Para-A-q0-f0*, Subsec-A-q0-f0*)>
<ELEMENT sec-b [Sec-B-q0-f0]
(Title-B-q0-f0, Para-B-q0-f0*, Subsec-B-q0-f0*)>
<ELEMENT subsec-a [Subsec-A-q0-f0]
(Title-q1-f1, Para-A-q0-f1*, Subsec-A-q0-f1*)>
<ELEMENT subsec-a [Subsec-A-q0-f1]
(Title-q1-f1, Para-A-q0-f1*, Subsec-A-q0-f1*)>
<ELEMENT subsec-b [Subsec-B-q0-f0]
(Subtitle-B-q0-f0, Para-B-q0-f0*, Subsec-B-q0-f0*)>
<ELEMENT title [Title-q1-f1] (#PCDATA)>
<ELEMENT title [Subtitle-B-q0-f0] (#PCDATA)>
<ELEMENT title-a [Title-A-q0-f0] (#PCDATA)>
<ELEMENT title-b [Title-B-q0-f0] (#PCDATA)>
<ELEMENT para-a [Para-A-q0-f0] (#PCDATA)>
<ELEMENT para-a [Para-A-q0-f1] (#PCDATA)>
<ELEMENT para-b [Para-B-q0-f0] (#PCDATA)>
Target automaton
By renaming "title" in the rhs for Title-q1-f1, merging Para-A-q0-f0 and Para-A-q0-f1, merging Subsec-A-q1-f0 and Subsec-A-q1-f1, and appropriately renaming states, we have a tree automaton as follows:
<ELEMENT doc [Doc] ((Sec-A | Sec-B)*)>
<ELEMENT sec-a [Sec-A] (Title-A, Para-A*, Subsec-A*)>
<ELEMENT sec-b [Sec-B] (Title-B, Para-B*, Subsec-B*)>
<ELEMENT subsec-a [Subsec-A] (Subtitle-A, Para-A*, Subsec-A*)>
<ELEMENT subsec-b [Subsec-B] (Subtitle-B, Para-B*, Subsec-B*)>
<ELEMENT subtitle-a [Subtitle-A] (#PCDATA)>
<ELEMENT subtitle-b [Subtitle-B] (#PCDATA)>
<ELEMENT title-a [Title-A] (#PCDATA)>
<ELEMENT title-b [Title-B] (#PCDATA)>
<ELEMENT para-a [Para-A] (#PCDATA)>
<ELEMENT para-b [Para-B] (#PCDATA)>
The above tree automaton is equivalent to a DTD (shown below), which is exactly what we have been aiming at!
<!ELEMENT doc (sec-a | sec-b)*>
<!ELEMENT sec-a (title-a, para-a*, subsec-a*)>
<!ELEMENT sec-b (title-b, para-b*, subsec-b*)>
<!ELEMENT subsec-a (subtitle-a, para-a*, subsec-a*)>
<!ELEMENT subsec-b (subtitle-b, para-b*, subsec-b*)>
<!ELEMENT subtitle-a (#PCDATA)>
<!ELEMENT subtitle-b (#PCDATA)>
<!ELEMENT title-a (#PCDATA)>
<!ELEMENT title-b (#PCDATA)>
<!ELEMENT para-a (#PCDATA)>
<!ELEMENT para-b (#PCDATA)>
We have presented an example of DTD transformation. Although this example uses a single source DTD, the underlying framework can easily handle multiple source DTD's.
Implementation on top of an automaton construction toolkit called "Grail" is in progress. A declarative query language for document database systems is also in progress.
The DTD-like notation for tree automatons is proposed by HIYAMA Masayuki (LAST First).
The best survey of the tree automaton theory is written by James Thatcher. However, only binary trees are considered in this survey.
@incollection{Thatcher:survey, author = "James W. Thatcher", title = "Tree Automata: An informal survey", booktitle = "Currents in the theory of computing", editor = "Aho, Alfred V.", publisher = "Prentice-Hall", pages = "143-172", year = "1987", }
There is a standard textbook on tree automata. This book is very mathematical. In this book, not only binary trees but also N-ary trees are considered. However, N is fixed for every automaton. In other words, automatons considered in this book cannot capture sections, as sections have an arbitrary number of paragraphs.
@book{gs:Automata, author = "F. G\'{e}cseg and M. Steinby", title = "Tree Automata", publisher = "Akad\'{e}miai Kiad\'{o}, Budapest", year= "1984", }
Automatons considered in these articles are flexible; they allow a section to have an arbitrary number of paragraphs.
@Article{Thatcher:68, author = "Thatcher, J.W. and Wright, J.B. ", title = "Generalized Finite Automata Theory with an Application to a Decision Problem of Second-Order Logic", journal = "Mathematical Systems Theory", year = "1968", volume = "2", number = "1", pages = "57-81" }
@Article{Pair:68, author = "C. Pair and A. Quere", title = "D\'{e}finition et Etude des Bilangages R\'{e}guliers", journal = "Information and Control", language = "French", year = "1968", month = dec, volume = "13", number = "6", pages = "565-593", }
@Article{Takahashi:75, author = "Masako Takahashi", title = "Generalizations of Regular Sets and Their Application to a Study of Context-Free Languages", journal = "Information and Control", year = "1975", volume = "27", pages = "1-36", }
@unpublished{MuSvy, author = "Makoto Murata", title = "Forest-regular languages and Tree-regular Languages", note = "preprint", }
(Note: This is a summary of relevant results in other papers. If you are interested, please e-mail me privately.)
Contextual conditions (conditions on non-subordinate nodes such as ancestor nodes, sibling nodes, subordinates of sibling nodes, etc.) are captured as pointed tree sets. Pointed trees were studied by Nivat and Podelski.
@Article{NivPod93, author = "Nivat and Podelski", title = "Another Variation on the Common Subexpression Problem", journal = "Theoretical Computer Science", volume = "114", year = "1993", }
@InCollection{Podelski92, author = "Podelski", title = "A Monoid Approach to Tree Automata", editor = "Nivat and Podelski", booktitle = "Tree Automata and Languages, Studies in Computer Science and Artificial Intelligence 10", publisher = "North-Holland", year = "1992", }
@InCollection{Murata96a, title = "Transformation of Documents and Schemas by Patterns and Contextual Conditions", booktitle = "Principles of Document Processing '96", author = "Makoto Murata", publisher = "Springer-Verlag", year = "1997", volume = "1293", series = "Lecture Notes in Computer Science", }
(Note: This is a mathematical version of my SGML/XML'97 paper. However, this paper is restricted to binary trees.)
MURATA Makoto / mura034@attglobal.net