Data types in Tcl (Tool Command Language)
(c) 1996 by Ronald Ira Feigenblatt. All Rights Reserved

INTRODUCTION

Tcl is a widely ported, interpreted computer language especially suited for rapid prototyping and use as a scripting language. While Tcl itself is slow, it is easy to introduce into the language new primitives built on object code created from other (compiled) languages, allowing one to combine the rapid composition aided by an interpreted language with the rapid execution enabled by a compiled language. This article seeks to help the novice Tcl author distinguish between the various "data types" in Tcl so as to understand why one or the other is best employed to a particular end when designing a program.

HERITAGE

Newcomers to Tcl are sometimes frustrated by its parsing scheme because the complicated syntax of other languages is the basis of their intuition. In reality, Tcl syntax is simple and logical. Tcl is most directly a descendant of UNIX scripting languages like the Bourne shell; these in turn descend from the ancient language of LISP, being procedural variations on the venerable functional original. One commendable property of languages like these is that they have few types of data structure, or even only one; LISP, for example, uses only lists. Tcl basically uses only strings and associative arrays of strings. The "list" is sometimes cited as yet another data type; but as we shall soon see, it is just a special way to interpret a string.

THE LEXICAL BASIS OF TCL

As in many other languages, a Tcl program consists of an ordered series of character strings (hereinafter just "strings") called statements. Normally the newline character or semicolon delimits the statements. To overcome the very arbitrary nature of line length, before it does anything else (logically speaking), Tcl scans an entire program and replaces each backslash-newline pair, together with any whitespace that begins the following line, with a single space, thereby enabling statements to span lines. For simplicity, we will ignore this subtlety from now on.
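As a minimal sketch of the two statement delimiters and of line continuation (the variable names "a", "b" and "greeting" are made up for illustration):

```tcl
# Two statements on one line, separated by a semicolon.
set a 1; set b 2

# One statement spread over two lines: the backslash-newline pair
# acts as whitespace between the words "hello" and "world".
set greeting [list hello \
    world]

puts $greeting    ;# hello world
```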
Each statement consists of a series of words, each word being delimited from the next by "whitespace" (space(s) and/or tab(s)). Leading and trailing whitespace is ignored. A hash character (#) appearing as the first character of a command marks that command as a comment.

It being sometimes useful to include as words those character sequences which themselves contain whitespace, a method for overriding whitespace as a word delimiter is available, namely the bounding curly-brace pair. Between such a pair, all successive characters are taken to form a single word. Even newline characters lose their role as command delimiters and are kept as ordinary characters, all whitespace characters now being regarded as the equals of the others in the character sequence comprising the one word. Of course, the curly-brace pair also has another function when each statement is parsed and then executed by Tcl: namely, the inhibition of backslash-, command-, and variable-substitution.

Indeed, the double quote (") can also be used to inhibit the word-breaking character of whitespace (including the newline), logically turning what naively looks like several "words" into one. But the curly brace is more closely associated with the notion of list because:

  (a) in most list contexts we wish to suppress the substitution which is allowed within double quotes (e.g. when forming loop constructs);
  (b) nesting of lists (cf. below) REQUIRES curly braces, because they come in both left- and right-hand versions, whereas the double quote does not.

DATA STRUCTURES IN TCL

Not only is each statement in Tcl a string; so, more or less, is the value (or "contents") of any variable in Tcl (excluding arrays for now). Tcl is careful to distinguish between a variable and its value by requiring explicit "dereferencing" to obtain the latter. Thus the value of the variable "Myvar" is written "$Myvar" (or "${Myvar}", a useful variation). This makes manifest in Tcl and similar languages (e.g.
C-Shell, REXX, bash) what is only implicit in other languages (e.g. Java, C++, Visual Basic), where context defines the distinction.

If virtually the only Tcl data type is "string", what is meant by a "list"? A LIST IS BASICALLY A SPECIAL WAY TO INTERPRET A STRING. Specifically, it means regarding the successive whitespace-delimited words in the string as the ordered sequence of corresponding list elements. This definition is made recursive, so that any character sequence bounded by a curly-brace pair can be not only a word in a larger list, but also a list in itself, whose elements are the delimited words which comprise it.

Thus, for example, while each statement in a Tcl program is a string, it is at the same time a list. Tcl assigns meaning to a statement by regarding the first element in this list as the name which indicates which compiled code module ("primitive command") to call, the remainder of the list elements being the successive arguments passed to that module.

When dealing with the value of a variable in Tcl, one can also refer to it as either a string or a list, recognizing that the difference is one of semantic convenience and not fundamental distinction. In particular, the various primitive commands in Tcl take as their arguments either strings or lists, as documented for each command.

As indicated earlier, Tcl also supports one more data type, namely the (string) array. The Tcl array is associative, i.e. it consists of a set of unordered elements, each element being a string pair. One string in the pair is always designated the "index", the other the corresponding "value". The illusion of ordinality can be imposed by using (string representations of) numbers for the indices of the array elements, e.g. "5" or "234". The illusion of multidimensional ordinality can be perpetrated in turn by using indices like "981,56". (Unfortunately this illusion is easily dispelled when one seeks to transpose an array or invert the element order in a dimension.)
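The two views of a string, and the string-indexed array, can be tried out in a few lines. This is only an illustrative sketch; the names "s", "Grid", "row" and "col" are invented for the example.

```tcl
# A plain string...
set s "alpha {beta gamma} delta"

# ...read as a list has three elements; the braced one is itself a list.
puts [string length $s]         ;# 24 characters
puts [llength $s]               ;# 3 elements
puts [lindex $s 1]              ;# beta gamma
puts [llength [lindex $s 1]]    ;# 2 elements in the nested list

# An associative array: the index "981,56" is just a string
# that happens to look like a pair of coordinates.
set Grid(981,56) x
set row 981
set col 56
puts $Grid($row,$col)           ;# x
puts [array size Grid]          ;# 1
```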
In fact, any string used to indicate the index of an array element undergoes the same substitutions as any word in a statement string. Indeed, essentially any string can be used for either the array index or the corresponding value, including those strings which could be called multielement lists. As any string can be stored in any element, and such elements need not have as many characters as one another, Tcl arrays can also be used to create what is called a "structure" in C, or a "user-defined type" in Visual Basic. The space for all variable values in Tcl can dynamically grow and shrink; arrays are no exception, and even the number of elements can vary dynamically.

DATA VARIABLE SCOPE AND LIFETIME

Tcl supports both global and local (i.e. stack) variables. Global variables persist for the duration of a program, whereas local variables are deallocated automatically when the procedure ("proc") which uses them returns ("return"). There is no concept of local static storage, much less that of object member data. In place of local static storage one must use global variables, preferably with prefixed names, to avoid accidental name collisions. Of course, Tcl is not intended for massive programming jobs, which are better handled by languages with fancier scoping and data classes, like Java or C. But by using the "trace" command, data validation, one of the advantages of data hiding, can be achieved in Tcl without true object orientation. "Incr Tcl" is a Tcl extension which provides true object orientation.

Space is dynamically allocated for both global and local variables when "set" is first used to assign a value to them. Both kinds can be explicitly deallocated by the use of "unset", or one can rely on the automatic reclamation of local variable space described above. Variables in a calling scope, including those shadowed by local variables of the same name in the called procedure, can be reached by using the "upvar" and "uplevel" commands in Tcl.
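The following sketch, meant to be run at the global level of a script, puts several of these ideas together: an array used as a structure, a procedure that reaches its caller's array by name through "upvar", and a write trace that validates data. It assumes a Tcl release recent enough to offer "trace add variable" and "string is" (both postdate this article); the names "Emp", "Emp_lastAge", "describe" and "checkAge" are invented for the example.

```tcl
# An array used like a C structure: one element per "field".
set Emp(name) Ada
set Emp(age)  36

# A global with a prefixed name stands in for static storage:
# it remembers the last age that passed validation.
set Emp_lastAge 36

# The array is passed by NAME; upvar links "rec" to the caller's array.
proc describe {arrayName} {
    upvar 1 $arrayName rec
    return "$rec(name) is $rec(age)"
}
puts [describe Emp]    ;# Ada is 36

# Write trace: the new value is already stored when this runs,
# so a bad value is replaced by the last good one before the error.
proc checkAge {name index op} {
    global Emp_lastAge
    upvar 1 $name var
    if {![string is integer -strict $var($index)]} {
        set var($index) $Emp_lastAge
        error "age must be an integer"
    }
    set Emp_lastAge $var($index)
}
trace add variable Emp(age) write checkAge

# The bad assignment raises an error and the old value survives.
catch {set Emp(age) old} message
puts $Emp(age)         ;# 36
```

Because a write trace fires only after the assignment has happened, rejecting a value means restoring the previous one explicitly, which is why the sketch keeps a copy in "Emp_lastAge".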
The "upvar" command is especially important for arrays, which cannot be passed by value. Even within a local scope, it is nice to be able to force another level of parsing (as "uplevel" does) with "eval". (The distinction between a variable name and its value is critical here.) This allows one to dynamically build executable code, which Marvin Minsky, the dean and chief exponent of "classical" (non-connectionist) Artificial Intelligence ("AI"), explained is the key (and once unique) feature of LISP which made it the dominant language of AI.

DATA MANIPULATION PRIMITIVES

Control primitives define the order in which statements are executed, while most other features of a language enable the manipulation of data, which is our present interest. The programmer is free to represent data as a simple string, interpret that string as a "list", or aggregate strings or lists as the elements of arrays. We now compare the various manipulation primitives available to each type of data to guide the choice of data structure design.

Regrettably, since the character whose binary value is zero is tacitly (and covertly) used to terminate strings in Tcl, in the style of C, it is not possible to store every one of the 256 values of a byte in a character. The workarounds this forces make the handling of binary data especially inefficient.

DATA STRUCTURE CREATION AND TYPE CONVERSION

The first group of commands to consider are those that create entire structures for the first time, including those that translate from one type to another.

Strings, being the simplest type, can be composited from simpler elements merely by enclosing the entire sequence within a pair of double quotes, within which variable- and other types of "substitution" can take place. Strings can be assembled from, or parsed into, smaller ones using "format" and "scan". The conversion between strings and lists can be made using the "split" and "join" commands.
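A brief sketch of these string-building and conversion commands follows, ending with a command that is assembled as data and then run through "eval". All variable names here ("Name", "Year", "s1" and so on) are invented for illustration.

```tcl
set Name Tcl
set Year 1996

# Inside double quotes, substitution takes place.
set s1 "$Name in $Year"                 ;# Tcl in 1996

# Inside braces, the same characters are taken literally.
set s2 {$Name in $Year}                 ;# $Name in $Year

# "format" assembles a string; "scan" parses one into variables.
set stamp [format "%s-%04d" $Name 7]    ;# Tcl-0007
scan "12 apples" "%d %s" count what     ;# count is 12, what is apples

# "split" turns a delimited string into a list; "join" does the reverse.
set parts [split "a:b:c" ":"]           ;# a b c
set back  [join $parts "/"]             ;# a/b/c

# A command built as a list, then given another round of parsing by "eval".
set cmd [list string toupper $Name]
puts [eval $cmd]                        ;# TCL
```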
Arrays are created merely by assigning the first element using the "set" command like this:

    set Myarry(Someindex) Somevalue

A list can be composited from a series of arguments, each being considered a single element of the list, whatever it looks like, using the "list" command. Or the elements of a series of well-formed lists can become the elements of a new list by passing those lists as the arguments of the "concat" command. Actually, the "concat" command is a bit of "syntactic sugar", because the same method used to build a string can be used as an alternative.

DATA STRUCTURE INTERROGATION

Strings, lists and arrays are interrogated in rather similar ways, but arrays are blessed with fewer options than strings and lists. The constituents of a string are its characters; of a list, its words; of an array, its elements. Constituents are enumerated starting at ZERO for strings and lists; arrays in Tcl are not ordinal.

To determine the number of constituents, one uses:
  the "string length" command on strings,
  the "llength" command on lists,
  the "array size" command on arrays.

To extract a single constituent, one uses:
  the "string index" command on strings,
  the "lindex" command on lists,
  the "array(index)" notation for arrays.

To extract a group of contiguous constituents, one uses:
  the "string range" command on strings,
  the "lrange" command on lists,
  the "array get" command on arrays (the index is pattern-matched).

To search amongst the constituents, one uses:
  the "string first" or "string last" command on strings,
  the "lsearch" command on lists.

For arrays, use the "array names" command to generate all the indices. More work is needed to search the values.

DATA STRUCTURE MODIFICATION

Lists have the greatest number of options for modification, and arrays the fewest.
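Before turning to the individual modification commands, the creation and interrogation commands of the preceding sections can be tried on small samples. This is a sketch only; "Str", "Lst", "Arr" and "Both" are made-up names.

```tcl
set Str "hello world"
set Lst [list red "light green" blue]   ;# three elements, one containing a space
set Arr(one) 1
set Arr(two) 2
set Arr(ten) 10

# "concat" merges the elements of its list arguments into one list.
set Both [concat {a b} {c {d e}}]       ;# a b c {d e}

# Number of constituents.
puts [string length $Str]               ;# 11
puts [llength $Lst]                     ;# 3
puts [array size Arr]                   ;# 3

# A single constituent (strings and lists count from zero).
puts [string index $Str 4]              ;# o
puts [lindex $Lst 1]                    ;# light green
puts $Arr(ten)                          ;# 10

# A group of constituents.
puts [string range $Str 0 4]            ;# hello
puts [lrange $Lst 1 2]                  ;# {light green} blue
puts [array get Arr one]                ;# one 1

# Searching. Array indices come back in no particular order, hence lsort.
puts [string first o $Str]              ;# 4
puts [string last o $Str]               ;# 7
puts [lsearch $Lst blue]                ;# 2
puts [lsort [array names Arr t*]]       ;# ten two
```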
To append (a) new constituent(s), one uses the command:
  "append" for strings (syntactic sugar),
  "lappend" for lists,
  "set Myarry(Someindex) Somevalue" for arrays.

To insert new constituents, one uses the command:
  "linsert" for lists.

To replace constituents, one uses the command:
  "lreplace" for lists,
  "regsub" for strings.

To sort the constituents, one uses the command:
  "lsort" for lists.

Strings have a number of handy commands for changing case and trimming: "string tolower", "string toupper"; "string trim", "string trimleft", "string trimright".

DATA STRUCTURE COMPARISON

The operations here report the result of comparing two pieces of data. The "string match" and "regexp" commands return 1 (true) or 0 (false); "string compare" returns -1, 0 or 1 according to whether the first string sorts before, equal to, or after the second. All the commands below pertain to strings. Lists (as lists per se) and arrays are not easily compared with one another.

Commands:
  To compare two strings: string compare
  To compare a string and a glob pattern: string match
  To compare a string and a regexp: regexp

DATA STRUCTURE ITERATION

The "for" command together with "string index" can be used to loop over the characters of a string. The "foreach" command alone is a compact way to iterate through the words of a list. The "foreach" command, together with "array names", can be used to loop over the elements of an array.

SYNOPSIS

While the simple, humble string is the basis of all data (and commands!) in Tcl, a variety of methods exist to organize and interpret it. Given that Tcl script execution is relatively slow, it behooves the author of Tcl code to consider carefully how to structure his data.
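As a closing illustration, the modification, comparison and iteration commands surveyed above can be sketched as follows. The sample data and every variable name are invented for the example.

```tcl
# Appending.
set Str abc
append Str def                          ;# Str is now abcdef
set Lst {b a}
lappend Lst c                           ;# Lst is now b a c
set Arr(x) 1
set Arr(y) 2                            ;# an array grows by assignment

# Inserting, replacing, sorting: each returns a NEW list, Lst is unchanged.
set Inserted [linsert $Lst 1 z]         ;# b z a c
set Replaced [lreplace $Lst 0 0 q]      ;# q a c
set Sorted   [lsort $Lst]               ;# a b c
regsub def $Str XYZ Changed             ;# Changed is abcXYZ

# Case and trimming.
set Upper   [string toupper $Str]       ;# ABCDEF
set Trimmed [string trim "  pad  "]     ;# pad

# Comparison.
puts [string compare apple banana]      ;# -1
puts [string match a*e apple]           ;# 1
puts [regexp {^a.+e$} apple]            ;# 1

# Iteration over characters, list words, and array elements.
set Chars {}
for {set i 0} {$i < [string length $Str]} {incr i} {
    lappend Chars [string index $Str $i]
}
foreach word $Lst {
    puts $word
}
foreach index [lsort [array names Arr]] {
    puts "$index -> $Arr($index)"
}
```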