saved from url http://www-cs.canisius.edu/PL_TUTORIALS/C++/EXAMPLES/MRD/about_databases --> Relation Databases Using MRD Introduction ============ This document guides you through the basics of the MRD program, "Mark's Relational Database". In the process it explains the essential concepts of relational databases. These concepts apply to all commercial relational databases, such as dBase IV, Ingres, and any product that uses SQL. What is a Database? =================== A database is a collection of data that are stored in some permanent medium like tapes or disk drives. These data are interrelated and interconnected in various ways that reflect the complicated nature of the real world which they model. Databases came about because simple files did not provide enough intelligent support to programmers and users. A file is a group of data that are stored together on a disk drive. Some systems like UNIX allow any arbitrary collection of characters to comprise a file. Others, organize the data into a sequence of records, each record containing a series of fields. Usually one record de- scribes one real world entity like an employee or a piece of equipment. In one sense a modern database is a collection of files, since the file is a fundamental unit of disk storage. However this is a relic of the fact that operating systems tend to force us to store data into files because that is the main concept that an operating system uses and the principle abstraction that it provides. There are other possibilities. Relational databases ==================== There are several different types of databases, but the most widely used at the current time is the relational database. It replaced older types known as hierarchical or network databases. Today, object oriented databases are threatening the supremacy of the relational database. In a relational database, the principle "atom" of storage is the "relation". In MRD, a relation is stored in a disk file, so a database is really a collection of files, each storing one relation. Relations are also called "tables" because they resemble tables of data when printed out. Each relation consists of 0 or more rows, or records or "tuples." Each record consists of a number of pieces of data, such as an integer, a character string, another integer and a floating point number. The set of all similar pieces of data in a relation is called a "field" and has a name and a definite data type. Here is a relation: name city size ------------------------------------------ Canisius Buffalo 5000 UB Buffalo 13000 Niagara Univ Niagara Falls 3585 John Carroll Univ Cleveland 3671 Oberlin College Oberlin 2797 It is called "colleges" because each record (i.e. each row) gives a little information about a college or university. Notice that the name of the college is given, along with the city in which it resides and its size in students. Starting MRD and getting help ============================= To start MRD, just type % mrd and press RETURN. You will get the MRD> prompt. To see what the commands are, type "help". To quit, type "quit". MRD> help Here is a list of commands you will see when you type "help": batch filename info filename select oldrelation newrelation field1 > value1 or field2 = field2 ... cross relation1 relation2 newrelation join relation1 field1 relation2 field2 newrelation diff relation1 field1 relation2 field2 newrelation project relation1 newrelation field1 field2 field3 ... append relation1 relation2 newrelation print relation1 sort oldrelation newrelation aord field1 stats oldrelation newrelation field1 number oldrelation newrelation unique oldrelation newrelation addrecord relation changevalue relation recnum fieldname newvalue renamefield relation fieldname newname changewidth relation fieldname newwidth deleterecord relation recnum addfield relation fieldname datatype printwidth defaultvalue retype relation field1 field2 calc relation fieldname "arithmetic expression" These lines give the syntax of the command, not an explanation of what they do. To get information about a specific command, type "help" followed by the command's name. The syntax is given again, along with a short explanation and an example: MRD> help sort -- sort a relation based on the values in one of its fields. -- You can select either ascending or descending for the values. SYNTAX: sort oldrelation newrelation aord field1 EXAMPLE: sort planets planets2 a nummoons UNIX commands inside MRD ======================== To perform a UNIX command, put an exclamation mark in front. Thus, you can use "ls", "mail", and other commands while inside MRD. Non-interactive use of MRD ========================== You can also perform just one command from the command line. Surround the mrd command in double quotes and put an at-sign in front: % mrd "@print planets" This is handy if you want to use mrd in a UNIX shell script. MRD commands can be clumped into batch files. If you give the name of a batch file to mrd on the command line of UNIX, it will perform all the MRD commands in that file and then quit, rather than interact with you: % mrd commandsfile Prompting for parameters in interactive mode ============================================ When you are using MRD in interactive mode, you can either type in the complete command, as shown above in the help examples, or you can let MRD prompt you for all the parameters. Just type the name of the command and press RETURN and MRD will ask you for all the particulars. Interrupting MRD ================ If you start a command that you do not wish to finish in MRD, then simply press CONTROL-C and you will immediately get back to the UNIX prompt. Since all relations are stored as files on disk, nothing will be lost. How relations are stored on disk ================================ All relations in MRD are stored on disk in an ASCII readable format, so you can edit the disk file to fix the relation, if you want. Here's what the above colleges relation looks like on disk: name 0 c 20 city 1 c 15 size 2 i 6 # Canisius|Buffalo|5000 UB|Buffalo|13000 Niagara Univ|Niagara Falls|3585 John Carroll Univ|Cleveland|3671 Oberlin College|Oberlin|2797 The first three lines of the file are the "data dictionary" which describes the fields. Each field has a name (such as "city"), a position in the record start- ing with 0, a datatype and a print width. The datatype can be only one of 3 things: c = character string i = integer d = floating point number (chosen due to C's use of "double") The print width is an integer which describes how many characters long the field should be when it is printed out. The single pound sign in column 4 indicates the end of the data dictionary. Next come the actual records. Since there are three fields described in the data dictionary, there must be three fields of data in each record, separated by vertical bars. All blanks in the fields are considered to be part of the field, although blanks in integer or floating point fields do not affect the value. Notice that the fields do not line up in nice neat columns in the actual disk file, although they do when you print them out. The "dump" command in MRD will show you the relation in its "raw" form, more or less as you see it if you were to "cat" the file. If you edit a relation on disk, be very careful because you could change MRD's interpretation of things. For example: 1. In the data dictionary section, there must be only one blank between the field name and its position, and its datatype and its printwidth. The name must begin in the first column. 2. The # must be the first thing in the line that follows the data dictionary. 3. There must be no blank lines in the file. 4. There must be no extra blanks inside fields, as these are considered part of the data of the field. Some values obviously contain blanks, such as "New York". But do not put extra blanks in front of or in back of the vertical bar, unless you mean to. 5. The field names must be all one word -- no blanks, although you can use underscores. 6. Do not confuse the vertical bar with the exclamation mark! The three basic operations ========================== If you were to study the theoretical side of relational databases, you would learn that there are three fundamental operations: select -- choose some of the rows project -- choose some of the columns join -- splice together rows from 2 different relations Each operator creates a whole new relation which can then be used in further operations. Since we have a set of objects and a set of operators for those objects, we call this an "algebra" and in particular, this is sometimes called "relational algebra." Select ====== Select forms a new relation by choosing certain rows (records) from an existing relation. The new relation has the exact same fields, just fewer rows. Typically, the selection criteria are based on values of fields. In MRD, this is true, and the criteria can be conjoined with AND and OR. This should be clear in the following examples. Here is the states database: name capital nickname area population yearadmitted --------------------------------------------------------------- New York Albany Empire State 49576 17500000 1788 Ohio Columbus Buckeye State 41222 10700000 1803 Nebraska Lincoln Cornhusker State 77227 1500000 1867 Alaska Juneau Last Frontier 586412 302173 1959 If we wanted to choose only those states (in this list, which is of course not complete!) that were admitted after the Civil War (1865), then we would do the following in MRD: MRD> select Name of existing relation: states Now you will enter one or more selection criteria. Enter name of field to compare against: yearadmitted Enter comparison: = != < > <= >= > Enter the data to be used in comparison: 1865 Another? (y/n) n MRD> print Name of existing relation: name capital nickname area population yearadmitted --------------------------------------------------------------- Nebraska Lincoln Cornhusker State 77227 1500000 1867 Alaska Juneau Last Frontier 586412 302173 1959 Notice that a criterion is a comparison of a field, indicated by the field's name, against a value. Thus our criterion is, to put it succinctly: yearadmitted > 1865 There is only one criterion so we answered 'n' when it asked Another? MRD creates new relations and stores them in a file called zzzz.results. Subsequently, you can use this new relation in other operations. For example, we printed it above. You can either type in the name "zzzz.results" or you can simply press return when MRD asks you for Name of existing relation: and it knows that you mean zzzz.results. To save zzzz.results, you could use the UNIX "cp" command: MRD> cp zzzz.results statesafter1865 or use the "save" command: MRD> save What do you want to call the new relation? statesafter1865 MRD> ls DOC colleges states zzzz.oldresults cities mrd statesafter1865 Notice that this just renamed zzzz.results. As a protection measure, whenever MRD is about to put new results into the file "zzzz.results" and there is already a file by that name, it renames the previous one to "zzzz.oldresults". Therefore, if you make a mistake, you haven't obliterated the old relation, just renamed it to zzzz.oldresults, so you could undo your mistake by using the UNIX rename command: MRD> mv zzzz.oldresults zzzz.results or even save zzzz.oldresults as some permanent name. However, doing two operations in a row would cause the original zzzz.oldresults to be lost forever. Here's an example of a section criteria that is compound. We will ask for all records where population > 1000000 and area < 50000 Here's how we do it in MRD: MRD> select Name of existing relation: states Now you will enter one or more selection criteria. Enter name of field to compare against: population Enter comparison: = != < > <= >= > Enter the data to be used in comparison: 1000000 Another? (y/n) y How does this combine with the previous comparison? o=or a=and: a Enter name of field to compare against: area Enter comparison: = != < > <= >= < Enter the data to be used in comparison: 50000 Another? (y/n) n MRD> print Name of existing relation: name capital nickname area population yearadmitted --------------------------------------------------------------- New York Albany Empire State 49576 17500000 1788 Ohio Columbus Buckeye State 41222 10700000 1803 Notice that MRD asks you how the next criterion relates to the previous, by either AND or OR. You type in "a" or "o". There is no implied precedence to these operators, as there is in Boolean algebra. When MRD makes the selection, it just starts with the first criterion and works its way through. There is currently no way to specify precedence or priority, except to form intermediate relations by using only 1 or a few criteria and then performing selects on these intermediate results. Project ======= Project selects certain "columns" or fields, making a new relation. It is a way of removing some fields (and all the data in the records that is in those fields.) Here's an example of removing the nickname and yearadmitted fields from relation "states": MRD> project Name of existing relation: states Now enter one or more field names. Put one space between field names. Do not press return until all done. >> name capital area population MRD> print Name of existing relation: name capital area population ---------------------------------------- New York Albany 49576 17500000 Ohio Columbus 41222 10700000 Nebraska Lincoln 77227 1500000 Alaska Juneau 586412 302173 This may not look like a terribly useful operation, but it is, and it sometimes has some surprising implications. For example, let us choose to keep only the state field of the "cities" relation: MRD> print Name of existing relation: cities name state population ---------------------------------------- Lincoln Nebraska 200000 Buffalo New York 1000000 Niagara Falls New York 40000 Oberlin Ohio 100000 MRD> project Name of existing relation: cities Now enter one or more field names. Put one space between field names. Do not press return until all done. >> state MRD> print Name of existing relation: state ---------------- Nebraska New York New York Ohio Notice that there are still four records, but only one field. Further, two of the records are now identical -- both "New York". Many database systems would not permit two records to have identical data, but the implementation cost of weeding out duplicates turns out to be quite high. Thus MRD allows this to be an optional step. To make sure that every record is unique, use the "unique" command: MRD> unique Name of existing relation: MRD> print Name of existing relation: state ---------------- Nebraska New York Ohio In MRD, weeding out duplicate records also causes the new relation to be sorted. Join ==== The most fascinating and most difficult command to understand is join. It is a way of creating a new relation in which the records are concatenations of records from two pre-existing relations. These concatenations may or may be based on some criterion. There are two types of joins. One is called the "cross product" and is done by the MRD command "cross". If you are familiar with set theory, then a relation is really just a set of tuples, and the cross product of the two sets is what we are after. Let us take a simple example. There are two relations: MRD> print Name of existing relation: people name age --------------- John 37 Susan 25 MRD> print Name of existing relation: candy name color ------------------- red hots red bubble gum pink chocolate bar brown There are two rows in people and three rows in candy. If we perform the cross product, we will get a new relation with 6 rows. For each row in people, add on to the "end" of it each row from candy. Here's how it all looks: MRD> cross Name of first relation: people Name of second relation: candy MRD> print Name of existing relation: name age _name color ---------------------------------- John 37 red hots red John 37 bubble gum pink John 37 chocolate bar brown Susan 25 red hots red Susan 25 bubble gum pink Susan 25 chocolate bar brown Notice that both relations had fields named "name" so in the new relation, one of them had to be renamed to "_name". (You can change the name of a field by editing the data dictionary part of the file that holds the relation.) As you can see, each record of "people" is duplicated 3 times in the resulting relation, since there were three records in candy. And each record of candy is duplicated twice in the result since there were 2 records in people. The join operation is really nothing more than a cross product followed by a selection. However, if a database implementation actually took the full cross product of two relations, the result would be enormous, and most of the records might be thrown away anyway. Thus, as an optimization, the equijoin operator is defined as a way of joining two relations on a common field. Here's an example. Let's look at the two relations "cities" and "colleges": MRD> print Name of existing relation: cities name state population ---------------------------------------- Lincoln Nebraska 200000 Buffalo New York 1000000 Niagara Falls New York 40000 Oberlin Ohio 100000 MRD> print Name of existing relation: colleges name city size -------------------------------------------- Canisius Buffalo 5000 UB Buffalo 13000 Niagara Univ Niagara Falls 3585 John Carroll Univ Cleveland 3671 Oberlin College Oberlin 2797 If we wanted to find out the state in which the college was located, we could not directly look into the "colleges" relation since there is no state field. But the "city" field of colleges links up with the "name" field of cities to provide us this information. Here's how to join these in MRD: MRD> join Name of first relation: cities which field in this relation? name Name of second relation: colleges which field in this relation? city MRD> print Name of existing relation: name state population _name city size -------------------------------------------------------------------------- Buffalo New York 1000000 Canisius Buffalo 5000 Buffalo New York 1000000 UB Buffalo 13000 Niagara Falls New York 40000 Niagara Univ Niagara Falls 3585 Oberlin Ohio 100000 Oberlin College Oberlin 2797 Notice that a cross product would have created 4 x 5 = 20 records, whereas we only want four records in the equijoin. This operation is called an equijoin because we are joining two relations based on equal values of specified fields. Other kinds of joins are possible based on relationships such as less than or greater than or equal to. MRD only pro- vides equality joining on one field from each relation. Other MRD operations ==================== There are several other operations that MRD provides for convenience. One of them is "stats" which does simple statistics on a single numeric field. It creates a new relation with only one record. Let's do stats on the size field of colleges: MRD> stats Name of existing relation: colleges What is the name of the field to summarize? size MRD> print Name of existing relation: num sum min max avg ssq ------------------------ 5 28053 2797 13000 5610.60010 70757513.20000 Unfortunately the printing of this obscures some of the statistics. The statistics will be either integers or floating point numbers, depending on the type of the field. Since "size" is defined to be an integer, all the stats except average and ssq (sum of squares) are integers. If it were a floating point field, all stats except num (number of records) would be floating point. The "number" command inserts a new field called "seqnum" (sequence number), starting at 0: MRD> number Name of existing relation: stats Cannot open file stats MRD> number Name of existing relation: states MRD> print Name of existing relation: seqnum name capital nickname area population yearadmitted ---------------------------------------------------------------------- 0 New York Albany Empire State 49576 17500000 1788 1 Ohio Columbus Buckeye State 41222 10700000 1803 2 Nebraska Lincoln Cornhusker State 77227 1500000 1867 3 Alaska Juneau Last Frontier 586412 302173 1959 The "sort" command allows you to sort a relation based on a field. For example, to sort the states field by population: MRD> sort Name of existing relation: states What is the name of the field to sort on? population Arrange by ascending or descending values? (a/d): a MRD> print Name of existing relation: name capital nickname area population yearadmitted --------------------------------------------------------------- Alaska Juneau Last Frontier 586412 302173 1959 Nebraska Lincoln Cornhusker State 77227 1500000 1867 Ohio Columbus Buckeye State 41222 10700000 1803 New York Albany Empire State 49576 17500000 1788 Notice that you get the option to arrange by either ascending or descending values. Editing relations ================= Changes to relations may be done in two ways: 1. by editing the raw files 2. by using MRD commands to add fields, change values in fields, etc. If you edit the raw file, you must be EXTREMELY CAREFUL! For example, you must not destroy the data dictionary that occurs at the top of the file or MRD will not recognize your relation. Also, any field that contains string data that contains blanks must be surrounded by parentheses or else MRD will think that the first blank ends the data. 1