Extensible Markup Language, or XML, is a markup language developed by the World Wide Web Consortium and used to create structured documents. It is based on Standard Generalized Markup Language (SGML), and Peter Flynn’s XML FAQ states that it is “an abbreviated version of SGML, to make it easier for you to define your own document types, and to make it easier for programmers to write programs to handle them. It omits the more complex and less-used parts of SGML in return for the benefits of being easier to write applications for, easier to understand, and more suited to delivery and interoperability over the Web.” XML resembles Hypertext Markup Language (HTML) but differs in important ways: users can define their own tags, and XML requires “begin and end tags for all elements” (http://www.bluestone.com/downloads/doc/051900_XML-FAQ.doc, p. 2). Flynn’s FAQ defines XML as “a metalanguage—a language for describing other languages—which lets you design your own markup.”

According to Bluestone Software, the major parts of an XML document are entities, markup, the document type definition (DTD), and the document object model (DOM). Entities are “storage units that can contain character data and markup” (p. 2). The markup consists of the tags, along with comments and processing instructions. DTDs define “what entities are allowed in an XML document, what attributes those entities may have, and even what values and attributes they may have” (p. 2). The DOM describes how an application can access and manipulate the contents of a document.
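To make the two differences from HTML concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The catalog document and its tag names are invented for illustration and come from none of the sources cited here. The sketch reads a document built from user-defined tags, then shows that a parser rejects an element with no end tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names are invented for this example,
# which is exactly what XML permits and HTML does not.
DOC = """<catalog>
  <!-- comments are part of the markup -->
  <book id="b1">
    <title>Hamlet</title>
    <author>William Shakespeare</author>
  </book>
</catalog>"""


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(DOC)

# Pull out content by tag name and by attribute.
titles = [el.text for el in root.iter("title")]
book_ids = [el.get("id") for el in root.iter("book")]

print(titles, book_ids)
print(is_well_formed("<p>one<br/></p>"))  # every element is closed
print(is_well_formed("<p>one<br></p>"))   # <br> is never closed
```

The last call fails because XML, unlike the HTML of the time, does not tolerate an unclosed tag such as a bare `<br>`.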

GoXML is a search engine designed with XML in mind. It indexes XML documents and lets users search for information within them. An advantage of an XML-based search engine is that it can search on tags and metadata as well as on text. The Technical Discussion page (http://www.goxml.com/product/goxml-syntax.htm) states that most search engines “simply treat XML as yet more text content. GoXML on the other hand is built from the ground up to leverage XML syntax.” A problem with HTML and similar languages, according to the company (http://www.xmlglobal.com/prod/search/index.jsp?openEl=1_0), is that they “do not provide a structure that enables precise categorization and searching.” GoXML searches documents “not just as text, but as a structured collection of content and semantics,” and it can “fuly [sic] leverage the paradigm shift that XML provides.” In other words, GoXML builds the structure of XML into its search engine, so that users can search more effectively and can write queries based on fields rather than on simple text alone. It can also, according to the corporate website (http://www.xmlglobal.com/prod/search/goxmlsearch_features.jsp?openEl=1_0), “index and retrieve information from dissimilar information structures and return them as a coherent set.” These benefits can make searching XML easier than searching plain text or HTML.

The GoXML page provides some demonstrations of XML-based search engines. One is a searchable database of the works of William Shakespeare. According to the site, this database “demonstrates how simply GoXML can cut through content-oriented data” (http://www.goxml.com/shakespeareSearch/). Searches can use XML tags as well as text, which allows results to be filtered. While this makes it simple to find a specific element in the text, it has a limitation: a searcher has to know which tags were used, and, since tags in XML are user-defined, searching a new database means learning a new set of tags. The Astronomical Data Center search (http://www.goxml.com/nasaSearch/) adds a pull-down menu of the tags that can be used in a search query, but some of them are still confusing. (What, for instance, does “history/ingest” mean?) A search can be done using only text, but including the tags would probably help the engine find a particular item more easily.
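The difference between a text-only search and a tag-filtered one can be sketched in a few lines of Python. This is not GoXML's query syntax, which the demonstration pages do not spell out here. The tag names and the sample lines are invented for illustration, and the functions are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Invented sample data; the tag names are assumptions, not the tags
# used by the GoXML Shakespeare demonstration.
PLAY = """<PLAY>
  <TITLE>Hamlet</TITLE>
  <SPEECH>
    <SPEAKER>HAMLET</SPEAKER>
    <LINE>To be, or not to be</LINE>
  </SPEECH>
  <SPEECH>
    <SPEAKER>HORATIO</SPEAKER>
    <LINE>Hamlet, attend the question</LINE>
  </SPEECH>
</PLAY>"""


def text_search(root, word):
    """Plain-text search: tags of every element whose text has the word."""
    word = word.lower()
    return [el.tag for el in root.iter()
            if el.text and word in el.text.lower()]


def tag_search(root, tag, word):
    """Tag-filtered search: only look inside elements named tag."""
    word = word.lower()
    return [el.text for el in root.iter(tag)
            if el.text and word in el.text.lower()]


def available_tags(root):
    """List the distinct tags, as a pull-down menu of tags might."""
    return sorted({el.tag for el in root.iter()})


root = ET.fromstring(PLAY)
print(text_search(root, "hamlet"))            # hits in three kinds of element
print(tag_search(root, "SPEAKER", "hamlet"))  # only the speaker named Hamlet
print(available_tags(root))
```

The text-only search cannot tell the play's title from a speaker's name or a spoken line. The tag filter can, but only for a searcher who already knows that a SPEAKER tag exists, which is the limitation noted above.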

The GoXML page defines the three major parts of its system as the indexer, the query engine, and the acceptor. The indexer “decomposes XML documents into the GoXML™ index schema” (http://www.goxml.com/product/specs.htm). The system “cross-references the XML content, both by its tag structure and its physical text content.” The query engine processes queries, which can be entered in various formats, and the acceptor converts documents into standard XML formats.
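The sources do not describe the GoXML index schema in detail, so the following is only a rough sketch of what cross-referencing content by tag structure and by text could look like: an inverted index that maps each word to the tag paths where it occurs. The functions build_index and query, and the sample data, are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_index(xml_text):
    """Map each word to the set of tag paths whose text contains it.

    A simplified stand-in for an indexer: it ignores attributes and any
    text that follows a child element (the element's "tail").
    """
    index = defaultdict(set)

    def walk(el, path):
        here = f"{path}/{el.tag}"
        for word in re.findall(r"[a-z0-9]+", (el.text or "").lower()):
            index[word].add(here)
        for child in el:
            walk(child, here)

    walk(ET.fromstring(xml_text), "")
    return index


def query(index, word, tag=None):
    """Return the paths containing word, optionally limited to one tag."""
    paths = index.get(word.lower(), set())
    if tag is not None:
        paths = {p for p in paths if p.rsplit("/", 1)[-1] == tag}
    return sorted(paths)


# Invented sample document.
SAMPLE = ("<catalog><star><name>Vega</name>"
          "<note>Vega is bright</note></star></catalog>")

index = build_index(SAMPLE)
print(query(index, "vega"))
print(query(index, "vega", tag="name"))
```

Because each entry records where in the tag structure a word appears, the same index can answer both a plain text query and a query restricted to a particular tag.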

A search engine like GoXML might be useful for digital libraries. Searching is a major function of library software, and allowing searchers to use XML could prove beneficial to a library. GoXML can work with documents that were not created in XML, yet it provides additional benefits for those documents that do incorporate XML. If a library is thinking of using XML in its web pages, GoXML, or a similar search engine, would be a handy way of searching these pages. Its usefulness would pertain mainly to programmers and technical staff, however, as it is doubtful that regular library patrons would know enough XML to reap the benefits of such an engine.

BIBLIOGRAPHY

Bluestone Software (May 2000). “Bluestone Software, Inc. XML Frequently Asked Questions (FAQ).” Available: http://www.bluestone.com/downloads/doc/051900_XML-FAQ.doc

Flynn, Peter, et al. (21 July 2000). “Frequently Asked Questions about the Extensible Markup Language.” Available: http://www.ucc.ie/xml/

“GoXML Product Architecture.” Available: http://www.goxml.com/product/specs.htm

“Leveraging XML Syntax: Technical Discussion.” Available: http://www.goxml.com/product/goxml-syntax.htm

“Search Astronomical Data from NASA.” Available: http://www.goxml.com/nasaSearch/

“Search the Works of Shakespeare.” Available: http://www.goxml.com/shakespeareSearch/

XML Global Technologies (2001). “GoXML Search: Part of the GoXML Foundation Suite of Products.” Available: http://www.xmlglobal.com/prod/search/index.jsp?openEl=1_0

XML Global Technologies (2001). “GoXML Search: Product Technical Features.” Available: http://www.xmlglobal.com/prod/search/goxmlsearch_features.jsp?openEl=1_0
