saved from url
http://www-cs.canisius.edu/PL_TUTORIALS/C++/EXAMPLES/MRD/about_databases -->
Relation Databases Using MRD
Introduction
============
This document guides you through the basics of the MRD program, "Mark's
Relational Database". In the process it explains the essential concepts of
relational databases. These concepts apply to all commercial relational
databases, such as dBase IV, Ingres, and any product that uses SQL.
What is a Database?
===================
A database is a collection of data that are stored in some permanent medium
like tapes or disk drives. These data are interrelated and interconnected in
various ways that reflect the complicated nature of the real world which they
model.
Databases came about because simple files did not provide enough intelligent
support to programmers and users. A file is a group of data that are stored
together on a disk drive. Some systems like UNIX allow any arbitrary collection
of characters to comprise a file. Others, organize the data into a sequence of
records, each record containing a series of fields. Usually one record de-
scribes one real world entity like an employee or a piece of equipment.
In one sense a modern database is a collection of files, since the file is a
fundamental unit of disk storage. However this is a relic of the fact that
operating systems tend to force us to store data into files because that is
the main concept that an operating system uses and the principle abstraction
that it provides. There are other possibilities.
Relational databases
====================
There are several different types of databases, but the most widely used at
the current time is the relational database. It replaced older types known
as hierarchical or network databases. Today, object oriented databases are
threatening the supremacy of the relational database.
In a relational database, the principle "atom" of storage is the "relation".
In MRD, a relation is stored in a disk file, so a database is really a
collection of files, each storing one relation.
Relations are also called "tables" because they resemble tables of data when
printed out. Each relation consists of 0 or more rows, or records or "tuples."
Each record consists of a number of pieces of data, such as an integer, a
character string, another integer and a floating point number. The set of all
similar pieces of data in a relation is called a "field" and has a name and a
definite data type. Here is a relation:
name city size
------------------------------------------
Canisius Buffalo 5000
UB Buffalo 13000
Niagara Univ Niagara Falls 3585
John Carroll Univ Cleveland 3671
Oberlin College Oberlin 2797
It is called "colleges" because each record (i.e. each row) gives a little
information about a college or university. Notice that the name of the college
is given, along with the city in which it resides and its size in students.
Starting MRD and getting help
=============================
To start MRD, just type
% mrd
and press RETURN. You will get the MRD> prompt. To see what the commands are,
type "help". To quit, type "quit".
MRD> help
Here is a list of commands you will see when you type "help":
batch filename
info filename
select oldrelation newrelation field1 > value1 or field2 = field2 ...
cross relation1 relation2 newrelation
join relation1 field1 relation2 field2 newrelation
diff relation1 field1 relation2 field2 newrelation
project relation1 newrelation field1 field2 field3 ...
append relation1 relation2 newrelation
print relation1
sort oldrelation newrelation aord field1
stats oldrelation newrelation field1
number oldrelation newrelation
unique oldrelation newrelation
addrecord relation
changevalue relation recnum fieldname newvalue
renamefield relation fieldname newname
changewidth relation fieldname newwidth
deleterecord relation recnum
addfield relation fieldname datatype printwidth defaultvalue
retype relation field1 field2
calc relation fieldname "arithmetic expression"
These lines give the syntax of the command, not an explanation of what they do.
To get information about a specific command, type "help" followed by the
command's name. The syntax is given again, along with a short explanation and
an example:
MRD> help sort
-- sort a relation based on the values in one of its fields.
-- You can select either ascending or descending for the values.
SYNTAX: sort oldrelation newrelation aord field1
EXAMPLE: sort planets planets2 a nummoons
UNIX commands inside MRD
========================
To perform a UNIX command, put an exclamation mark in front. Thus, you can use
"ls", "mail", and other commands while inside MRD.
Non-interactive use of MRD
==========================
You can also perform just one command from the command line. Surround the
mrd command in double quotes and put an at-sign in front:
% mrd "@print planets"
This is handy if you want to use mrd in a UNIX shell script.
MRD commands can be clumped into batch files. If you give the name of a
batch file to mrd on the command line of UNIX, it will perform all the MRD
commands in that file and then quit, rather than interact with you:
% mrd commandsfile
Prompting for parameters in interactive mode
============================================
When you are using MRD in interactive mode, you can either type in the complete
command, as shown above in the help examples, or you can let MRD prompt you
for all the parameters. Just type the name of the command and press RETURN and
MRD will ask you for all the particulars.
Interrupting MRD
================
If you start a command that you do not wish to finish in MRD, then simply
press CONTROL-C and you will immediately get back to the UNIX prompt. Since
all relations are stored as files on disk, nothing will be lost.
How relations are stored on disk
================================
All relations in MRD are stored on disk in an ASCII readable format, so you can
edit the disk file to fix the relation, if you want. Here's what the above
colleges relation looks like on disk:
name 0 c 20
city 1 c 15
size 2 i 6
#
Canisius|Buffalo|5000
UB|Buffalo|13000
Niagara Univ|Niagara Falls|3585
John Carroll Univ|Cleveland|3671
Oberlin College|Oberlin|2797
The first three lines of the file are the "data dictionary" which describes the
fields. Each field has a name (such as "city"), a position in the record start-
ing with 0, a datatype and a print width. The datatype can be only one of 3
things:
c = character string
i = integer
d = floating point number (chosen due to C's use of "double")
The print width is an integer which describes how many characters long the field
should be when it is printed out.
The single pound sign in column 4 indicates the end of the data dictionary.
Next come the actual records.
Since there are three fields described in the data dictionary, there must be
three fields of data in each record, separated by vertical bars. All blanks
in the fields are considered to be part of the field, although blanks in
integer or floating point fields do not affect the value. Notice that the
fields do not line up in nice neat columns in the actual disk file, although
they do when you print them out.
The "dump" command in MRD will show you the relation in its "raw" form, more
or less as you see it if you were to "cat" the file.
If you edit a relation on disk, be very careful because you could change MRD's
interpretation of things. For example:
1. In the data dictionary section, there must be only one blank between
the field name and its position, and its datatype and its printwidth.
The name must begin in the first column.
2. The # must be the first thing in the line that follows the data
dictionary.
3. There must be no blank lines in the file.
4. There must be no extra blanks inside fields, as these are considered
part of the data of the field. Some values obviously contain blanks,
such as "New York". But do not put extra blanks in front of or in
back of the vertical bar, unless you mean to.
5. The field names must be all one word -- no blanks, although you can
use underscores.
6. Do not confuse the vertical bar with the exclamation mark!
The three basic operations
==========================
If you were to study the theoretical side of relational databases, you would
learn that there are three fundamental operations:
select -- choose some of the rows
project -- choose some of the columns
join -- splice together rows from 2 different relations
Each operator creates a whole new relation which can then be used in further
operations. Since we have a set of objects and a set of operators for those
objects, we call this an "algebra" and in particular, this is sometimes called
"relational algebra."
Select
======
Select forms a new relation by choosing certain rows (records) from an existing
relation. The new relation has the exact same fields, just fewer rows.
Typically, the selection criteria are based on values of fields. In MRD, this
is true, and the criteria can be conjoined with AND and OR. This should be
clear in the following examples.
Here is the states database:
name capital nickname area population yearadmitted
---------------------------------------------------------------
New York Albany Empire State 49576 17500000 1788
Ohio Columbus Buckeye State 41222 10700000 1803
Nebraska Lincoln Cornhusker State 77227 1500000 1867
Alaska Juneau Last Frontier 586412 302173 1959
If we wanted to choose only those states (in this list, which is of course not
complete!) that were admitted after the Civil War (1865), then we would do the
following in MRD:
MRD> select
Name of existing relation: states
Now you will enter one or more selection criteria.
Enter name of field to compare against: yearadmitted
Enter comparison: = != < > <= >=
>
Enter the data to be used in comparison:
1865
Another? (y/n) n
MRD> print
Name of existing relation:
name capital nickname area population yearadmitted
---------------------------------------------------------------
Nebraska Lincoln Cornhusker State 77227 1500000 1867
Alaska Juneau Last Frontier 586412 302173 1959
Notice that a criterion is a comparison of a field, indicated by the field's
name, against a value. Thus our criterion is, to put it succinctly:
yearadmitted > 1865
There is only one criterion so we answered 'n' when it asked Another?
MRD creates new relations and stores them in a file called zzzz.results.
Subsequently, you can use this new relation in other operations. For example,
we printed it above. You can either type in the name "zzzz.results" or you can
simply press return when MRD asks you for
Name of existing relation:
and it knows that you mean zzzz.results.
To save zzzz.results, you could use the UNIX "cp" command:
MRD> cp zzzz.results statesafter1865
or use the "save" command:
MRD> save
What do you want to call the new relation? statesafter1865
MRD> ls
DOC colleges states zzzz.oldresults
cities mrd statesafter1865
Notice that this just renamed zzzz.results.
As a protection measure, whenever MRD is about to put new results into the
file "zzzz.results" and there is already a file by that name, it renames the
previous one to "zzzz.oldresults". Therefore, if you make a mistake, you
haven't obliterated the old relation, just renamed it to zzzz.oldresults, so
you could undo your mistake by using the UNIX rename command:
MRD> mv zzzz.oldresults zzzz.results
or even save zzzz.oldresults as some permanent name. However, doing two
operations in a row would cause the original zzzz.oldresults to be lost
forever.
Here's an example of a section criteria that is compound. We will ask for all
records where
population > 1000000 and area < 50000
Here's how we do it in MRD:
MRD> select
Name of existing relation: states
Now you will enter one or more selection criteria.
Enter name of field to compare against: population
Enter comparison: = != < > <= >=
>
Enter the data to be used in comparison:
1000000
Another? (y/n) y
How does this combine with the previous comparison?
o=or a=and: a
Enter name of field to compare against: area
Enter comparison: = != < > <= >=
<
Enter the data to be used in comparison:
50000
Another? (y/n) n
MRD> print
Name of existing relation:
name capital nickname area population yearadmitted
---------------------------------------------------------------
New York Albany Empire State 49576 17500000 1788
Ohio Columbus Buckeye State 41222 10700000 1803
Notice that MRD asks you how the next criterion relates to the previous, by
either AND or OR. You type in "a" or "o". There is no implied precedence
to these operators, as there is in Boolean algebra. When MRD makes the
selection, it just starts with the first criterion and works its way through.
There is currently no way to specify precedence or priority, except to form
intermediate relations by using only 1 or a few criteria and then performing
selects on these intermediate results.
Project
=======
Project selects certain "columns" or fields, making a new relation. It is a
way of removing some fields (and all the data in the records that is in those
fields.)
Here's an example of removing the nickname and yearadmitted fields from
relation "states":
MRD> project
Name of existing relation: states
Now enter one or more field names.
Put one space between field names.
Do not press return until all done.
>> name capital area population
MRD> print
Name of existing relation:
name capital area population
----------------------------------------
New York Albany 49576 17500000
Ohio Columbus 41222 10700000
Nebraska Lincoln 77227 1500000
Alaska Juneau 586412 302173
This may not look like a terribly useful operation, but it is, and it sometimes
has some surprising implications. For example, let us choose to keep only the
state field of the "cities" relation:
MRD> print
Name of existing relation: cities
name state population
----------------------------------------
Lincoln Nebraska 200000
Buffalo New York 1000000
Niagara Falls New York 40000
Oberlin Ohio 100000
MRD> project
Name of existing relation: cities
Now enter one or more field names.
Put one space between field names.
Do not press return until all done.
>> state
MRD> print
Name of existing relation:
state
----------------
Nebraska
New York
New York
Ohio
Notice that there are still four records, but only one field. Further, two of
the records are now identical -- both "New York". Many database systems would
not permit two records to have identical data, but the implementation cost of
weeding out duplicates turns out to be quite high. Thus MRD allows this to be
an optional step. To make sure that every record is unique, use the "unique"
command:
MRD> unique
Name of existing relation:
MRD> print
Name of existing relation:
state
----------------
Nebraska
New York
Ohio
In MRD, weeding out duplicate records also causes the new relation to be sorted.
Join
====
The most fascinating and most difficult command to understand is join. It is
a way of creating a new relation in which the records are concatenations of
records from two pre-existing relations. These concatenations may or may be
based on some criterion.
There are two types of joins. One is called the "cross product" and is done
by the MRD command "cross". If you are familiar with set theory, then a
relation is really just a set of tuples, and the cross product of the two sets
is what we are after.
Let us take a simple example. There are two relations:
MRD> print
Name of existing relation: people
name age
---------------
John 37
Susan 25
MRD> print
Name of existing relation: candy
name color
-------------------
red hots red
bubble gum pink
chocolate bar brown
There are two rows in people and three rows in candy. If we perform the cross
product, we will get a new relation with 6 rows. For each row in people,
add on to the "end" of it each row from candy. Here's how it all looks:
MRD> cross
Name of first relation: people
Name of second relation: candy
MRD> print
Name of existing relation:
name age _name color
----------------------------------
John 37 red hots red
John 37 bubble gum pink
John 37 chocolate bar brown
Susan 25 red hots red
Susan 25 bubble gum pink
Susan 25 chocolate bar brown
Notice that both relations had fields named "name" so in the new relation, one
of them had to be renamed to "_name". (You can change the name of a field by
editing the data dictionary part of the file that holds the relation.)
As you can see, each record of "people" is duplicated 3 times in the resulting
relation, since there were three records in candy. And each record of candy
is duplicated twice in the result since there were 2 records in people.
The join operation is really nothing more than a cross product followed by a
selection. However, if a database implementation actually took the full cross
product of two relations, the result would be enormous, and most of the records
might be thrown away anyway. Thus, as an optimization, the equijoin operator
is defined as a way of joining two relations on a common field.
Here's an example. Let's look at the two relations "cities" and "colleges":
MRD> print
Name of existing relation: cities
name state population
----------------------------------------
Lincoln Nebraska 200000
Buffalo New York 1000000
Niagara Falls New York 40000
Oberlin Ohio 100000
MRD> print
Name of existing relation: colleges
name city size
--------------------------------------------
Canisius Buffalo 5000
UB Buffalo 13000
Niagara Univ Niagara Falls 3585
John Carroll Univ Cleveland 3671
Oberlin College Oberlin 2797
If we wanted to find out the state in which the college was located, we could
not directly look into the "colleges" relation since there is no state field.
But the "city" field of colleges links up with the "name" field of cities to
provide us this information. Here's how to join these in MRD:
MRD> join
Name of first relation: cities
which field in this relation? name
Name of second relation: colleges
which field in this relation? city
MRD> print
Name of existing relation:
name state population _name city size
--------------------------------------------------------------------------
Buffalo New York 1000000 Canisius Buffalo 5000
Buffalo New York 1000000 UB Buffalo 13000
Niagara Falls New York 40000 Niagara Univ Niagara Falls 3585
Oberlin Ohio 100000 Oberlin College Oberlin 2797
Notice that a cross product would have created 4 x 5 = 20 records, whereas we
only want four records in the equijoin.
This operation is called an equijoin because we are joining two relations based
on equal values of specified fields. Other kinds of joins are possible based
on relationships such as less than or greater than or equal to. MRD only pro-
vides equality joining on one field from each relation.
Other MRD operations
====================
There are several other operations that MRD provides for convenience. One of
them is "stats" which does simple statistics on a single numeric field. It
creates a new relation with only one record. Let's do stats on the size field
of colleges:
MRD> stats
Name of existing relation: colleges
What is the name of the field to summarize? size
MRD> print
Name of existing relation:
num sum min max avg ssq
------------------------
5 28053 2797 13000 5610.60010 70757513.20000
Unfortunately the printing of this obscures some of the statistics. The
statistics will be either integers or floating point numbers, depending on the
type of the field. Since "size" is defined to be an integer, all the stats
except average and ssq (sum of squares) are integers. If it were a floating
point field, all stats except num (number of records) would be floating point.
The "number" command inserts a new field called "seqnum" (sequence number),
starting at 0:
MRD> number
Name of existing relation: stats
Cannot open file stats
MRD> number
Name of existing relation: states
MRD> print
Name of existing relation:
seqnum name capital nickname area population yearadmitted
----------------------------------------------------------------------
0 New York Albany Empire State 49576 17500000 1788
1 Ohio Columbus Buckeye State 41222 10700000 1803
2 Nebraska Lincoln Cornhusker State 77227 1500000 1867
3 Alaska Juneau Last Frontier 586412 302173 1959
The "sort" command allows you to sort a relation based on a field. For example,
to sort the states field by population:
MRD> sort
Name of existing relation: states
What is the name of the field to sort on? population
Arrange by ascending or descending values? (a/d): a
MRD> print
Name of existing relation:
name capital nickname area population yearadmitted
---------------------------------------------------------------
Alaska Juneau Last Frontier 586412 302173 1959
Nebraska Lincoln Cornhusker State 77227 1500000 1867
Ohio Columbus Buckeye State 41222 10700000 1803
New York Albany Empire State 49576 17500000 1788
Notice that you get the option to arrange by either ascending or descending
values.
Editing relations
=================
Changes to relations may be done in two ways:
1. by editing the raw files
2. by using MRD commands to add fields, change values in fields, etc.
If you edit the raw file, you must be EXTREMELY CAREFUL! For example, you
must not destroy the data dictionary that occurs at the top of the file or
MRD will not recognize your relation. Also, any field that contains string
data that contains blanks must be surrounded by parentheses or else MRD will
think that the first blank ends the data.