Look at your SQL query the same way. You should always know exactly what you want to accomplish and then strive to follow the path of least resistance. The more time you spend planning, the less time you'll have to spend revising later. Your goal should always be to retrieve accurate data and to do so in as little time as possible. An end user waiting on a slow query is like a hungry diner impatiently awaiting a tardy meal. Although you can write most queries in several ways, the arrangement of the components within the query is the factor that can make a difference of seconds, minutes, or sometimes hours when you execute the query. Streamlining SQL is the process of finding the optimal arrangement of the elements within your query.
In addition to streamlining your SQL statement, you should also consider several other factors when trying to improve general database performance, for example, concurrent user transactions that occur within a database, indexing of tables, and deep-down database tuning.
NOTE: Today's examples use Personal Oracle7 and tools that are available with the Oracle7.3 relational database management system. The concepts discussed today are not restricted to Oracle; they may be applied to other relational database management systems.
Try to read this statement:
SQL> SELECT EMPLOYEE_TBL.EMPLOYEE_ID, EMPLOYEE_TBL.NAME, EMPLOYEE_PAY_TBL.SALARY, EMPLOYEE_PAY_TBL.HIRE_DATE
  2  FROM EMPLOYEE_TBL, EMPLOYEE_PAY_TBL
  3  WHERE EMPLOYEE_TBL.EMPLOYEE_ID = EMPLOYEE_PAY_TBL.EMPLOYEE_ID AND
  4  EMPLOYEE_PAY_TBL.SALARY > 30000 OR (EMPLOYEE_PAY_TBL.SALARY BETWEEN 25000
  5  AND 30000 AND EMPLOYEE_PAY_TBL.HIRE_DATE < SYSDATE - 365);

Here's the same query reformatted to enhance readability:
SQL> SELECT E.EMPLOYEE_ID, E.NAME, P.SALARY, P.HIRE_DATE
  2  FROM EMPLOYEE_TBL E,
  3       EMPLOYEE_PAY_TBL P
  4  WHERE E.EMPLOYEE_ID = P.EMPLOYEE_ID
  5  AND P.SALARY > 30000
  6  OR (P.SALARY BETWEEN 25000 AND 30000
  7  AND P.HIRE_DATE < SYSDATE - 365);
The two queries are identical, but the second one is obviously much easier to read. It is very structured; that is, the logical components of the query have been separated by carriage returns and consistent spacing. You can quickly see what is being selected (the SELECT clause), what tables are being accessed (the FROM clause), and what conditions need to be met (the WHERE clause).
NOTE: Notice the use of table aliases in the preceding query. EMPLOYEE_TBL in line 2 has been assigned the alias E, and EMPLOYEE_PAY_TBL in line 3 has been assigned the alias P. You can see that in lines 4, 5, 6, and 7, the E and P stand for the full table names. Aliases require much less typing than spelling out the full table name, and even more important, queries that use aliases are more organized and easier to read than queries that are cluttered with unnecessarily long full table names.
You can avoid a full-table scan by creating an index on columns that are used as conditions in the WHERE clause of an SQL statement. Indexes provide a direct path to the data the same way an index in a book refers the reader to a page number. Adding an index speeds up data access.
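For instance, if queries routinely filter EMPLOYEE_PAY_TBL on SALARY, an index such as the following (the index name here is illustrative) gives the optimizer a direct path to the matching rows:

SQL> CREATE INDEX EMP_PAY_SALARY_IDX
  2  ON EMPLOYEE_PAY_TBL (SALARY);

Index created.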
Although programmers usually frown upon full-table scans, they are sometimes appropriate. For example, a full-table scan makes sense when you are selecting most of the rows in a table, when you are updating every row, or when the table is small.
In addition, indexes are best used on large tables. You should always consider table size when you are designing tables and indexes. Properly indexing tables involves familiarity with the data, knowing which columns will be referenced most, and may require experimentation to see which indexes work best.
NOTE: When speaking of a "large table," large is a relative term. A table that is extremely large to one individual may be minute to another. The size of a table is relative to the size of other tables in the database, to the disk space available, to the number of disks available, and to simple common sense. Obviously, a 2GB table is large, whereas a 16KB table is small. In a database environment where the average table size is 100MB, a 500MB table may be considered massive.
We have also seen this type of slowdown when a new screen or window has been added to a front-end application. One of the first things to do when you begin to troubleshoot is to find out whether the target table has an index. In most of the cases we have seen, the target table has an index, but one of the new conditions in the WHERE clause may lack an index. Looking at the WHERE clause of the SQL statement, we have asked, should we add another index? The answer may be yes if the new condition is highly selective (it returns a small fraction of the table's rows) and will be used often enough to justify the overhead of maintaining the index.
You do not have to use a column that is indexed in the WHERE clause, but it is obviously more beneficial to do so. Try to narrow down the results of the SQL statement by using an index that returns the fewest rows. The condition that returns the fewest records in a table is said to be the most restrictive condition. As a general statement, you should place the most restrictive conditions last in the WHERE clause. (Oracle's query optimizer reads a WHERE clause from the bottom up, so in a sense, you would be placing the most restrictive condition first.)

When the optimizer reads the most restrictive condition first, it is able to narrow down the first set of results before proceeding to the next condition. The next condition, instead of looking at the whole table, should look at the subset that was selected by the most restrictive condition. Ultimately, data is retrieved faster. The most restrictive condition may be unclear in complex queries with multiple conditions, subqueries, calculations, and several combinations of the AND, OR, and LIKE operators.
The following test is one of many we have run to measure the difference in elapsed time between two differently arranged queries with the same content. These examples use the Oracle7.3 relational database management system. Remember, the optimizer in this implementation reads the WHERE clause from the bottom up.
TIP: Always check your database documentation to see how SQL statements are processed in your implementation.
Before creating the SELECT statement, we selected distinct row counts on each condition that we planned to use. Here are the values selected for each condition:
Condition                 | Distinct Values
--------------------------|----------------
calc_ytd = '-2109490.8'   | 13,000+
dt_stmp = '01-SEP-96'     | 15
output_cd = '001'         | 13
activity_cd = 'IN'        | 10
status_cd = 'A'           | 4
function_cd = '060'       | 6
The next example places the most restrictive conditions first in the WHERE clause:
NOTE: The most restrictive condition is also the condition with the most distinct values, because an equality test against a column with many distinct values matches the fewest rows.
SQL> SET TIMING ON
SQL> SELECT COUNT(*)
  2  FROM FACT_TABLE
  3  WHERE CALC_YTD = '-2109490.8'
  4  AND DT_STMP = '01-SEP-96'
  5  AND OUTPUT_CD = '001'
  6  AND ACTIVITY_CD = 'IN'
  7  AND STATUS_CD = 'A'
  8  AND FUNCTION_CD = '060';
COUNT(*)
--------
       8

1 row selected.

Elapsed: 00:00:15.37

This example places the most restrictive conditions last in the WHERE clause:
SQL> SET TIMING ON
SQL> SELECT COUNT(*)
  2  FROM FACT_TABLE
  3  WHERE FUNCTION_CD = '060'
  4  AND STATUS_CD = 'A'
  5  AND ACTIVITY_CD = 'IN'
  6  AND OUTPUT_CD = '001'
  7  AND DT_STMP = '01-SEP-96'
  8  AND CALC_YTD = '-2109490.8';

COUNT(*)
--------
       8

1 row selected.

Elapsed: 00:00:01.80
Procedures are compiled by the database engine and then executed. Unlike an individual SQL statement, a procedure does not have to be parsed and optimized each time it is called. Procedures, as opposed to numerous individual queries, may be easier for the user to maintain and more efficient for the database.
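As a minimal sketch (the procedure name and its logic are illustrative, not from this lesson), an Oracle PL/SQL procedure that wraps a recurring update might look like this:

CREATE OR REPLACE PROCEDURE RAISE_SALARIES (PCT IN NUMBER) AS
BEGIN
   -- The UPDATE is parsed and optimized when the procedure is
   -- compiled, not every time a user calls it.
   UPDATE EMPLOYEE_PAY_TBL
   SET SALARY = SALARY * (1 + PCT / 100);
END;
/

A user could then run the procedure from SQL*Plus with EXECUTE RAISE_SALARIES(5);.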
SQL> SELECT *
  2  FROM FACT_TABLE
  3  WHERE STATUS_CD = 'A'
  4  OR STATUS_CD = 'B'
  5  OR STATUS_CD = 'C'
  6  OR STATUS_CD = 'D'
  7  OR STATUS_CD = 'E'
  8  OR STATUS_CD = 'F'
  9  ORDER BY STATUS_CD;

Here is the same query using IN:
SQL> SELECT *
  2  FROM FACT_TABLE
  3  WHERE STATUS_CD IN ('A','B','C','D','E','F')
  4  ORDER BY STATUS_CD;
Here is another example using SUBSTR and IN. Notice that the first query combines LIKE with OR.
SQL> SELECT *
  2  FROM FACT_TABLE
  3  WHERE PROD_CD LIKE 'AB%'
  4  OR PROD_CD LIKE 'AC%'
  5  OR PROD_CD LIKE 'BB%'
  6  OR PROD_CD LIKE 'BC%'
  7  OR PROD_CD LIKE 'CC%'
  8  ORDER BY PROD_CD;
SQL> SELECT *
  2  FROM FACT_TABLE
  3  WHERE SUBSTR(PROD_CD,1,2) IN ('AB','AC','BB','BC','CC')
  4  ORDER BY PROD_CD;
One of the biggest issues in a transactional database is rollback segments. The number and size of rollback segments depend heavily on how many users are concurrently accessing the database, as well as the amount of work in each transaction. The best approach is to have several rollback segments in a transactional environment.
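For example, a DBA might create an additional rollback segment like this (the segment name, tablespace, and sizes are illustrative):

CREATE ROLLBACK SEGMENT RBS02
TABLESPACE RBS_TS
STORAGE ( INITIAL 512K
          NEXT    512K
          MINEXTENTS 20 );

ALTER ROLLBACK SEGMENT RBS02 ONLINE;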
Another concern in a transactional environment is the integrity of the transaction logs, which are written to after each transaction. These logs exist for the sole purpose of recovery. Therefore, each SQL implementation needs a way to back up the logs for use in a point-in-time recovery. SQL Server uses dump devices; Oracle uses a database mode known as ARCHIVELOG mode. Transaction logs also involve a performance consideration because backing up logs requires additional overhead.
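In Oracle, for instance, the DBA enables ARCHIVELOG mode from a privileged session while the database is mounted but not open; a hedged sketch of the commands involved:

SHUTDOWN IMMEDIATE
STARTUP MOUNT
ALTER DATABASE ARCHIVELOG;
ALTER DATABASE OPEN;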
Because the purpose of this type of system is to retrieve useful decision-making data, you can expect many complex queries, which normally involve grouping and sorting of data. Compared to a transactional database, OLAP systems typically take more space for the sort area but less space for the rollback area.
Most transactions in an OLAP system take place as part of a batch process. Instead of having several rollback areas for user input, you may resort to one large rollback area for the loads, which can be taken offline during daily activity to reduce overhead.
Regular user input usually consists of SQL statements such as INSERT, UPDATE, and DELETE. These types of transactions are often performed by the end user, or the customer. End users are normally using a front-end application such as PowerBuilder to interface with the database, and therefore they seldom issue visible SQL statements. Nevertheless, the SQL code has already been generated for the user by the front-end application.
Your main focus when optimizing the performance of a database should be the end-user transactions. After all, "no customer" equates to "no database," which in turn means that you are out of a job. Always try to keep your customers happy, even though their expectations of system/database performance may sometimes be unreasonable. One consideration with end-user input is the number of concurrent users. The more concurrent database users you have, the greater the possibility of performance degradation.
What is a batch load? A batch load performs heaps of transactions against the database at once. For example, suppose you are archiving last year's data into a massive history table. You may need to insert thousands, or even millions, of rows of data into your history table. You probably wouldn't want to do this task manually, so you are likely to create a batch job or script to automate the process. (Numerous techniques are available for loading data in a batch.) Batch loads are notorious for taxing system and database resources. These database resources may include table access, system catalog access, the database rollback segment, and sort area space; system resources may include available CPU and shared memory. Many other factors are involved, depending on your operating system and database server.
Both end-user transactions and batch loads are necessary for most databases to be successful, but your system could experience serious performance problems if these two types of processing lock horns. Therefore, you should know the difference between them and keep them segregated as much as possible. For example, you would not want to load massive amounts of data into the database when user activity is high. The database response may already be slow because of the number of concurrent users. Always try to run batch loads when user activity is at a minimum. Many shops reserve times in the evenings or early morning to load data in batch to avoid interfering with daily processing.
You should always plan the timing for massive batch loads, being careful to avoid scheduling them when the database is expected to be available for normal use. Figure 15.1 depicts heavy batch updates running concurrently with several user processes, all contending for system resources.
Figure 15.1.
System resource contention.
As you can see, many processes are contending for system resources. The heavy batch updates throw a monkey wrench into the equation. Instead of the system resources being dispersed somewhat evenly among the users, the batch updates appear to be hogging them. This situation is just the beginning of resource contention. As the batch transactions proceed, the user processes may eventually be forced out of the picture. This is not a good way of doing business. Even if the system has only one user, a heavy batch process can create significant contention for that user.
Another problem with batch processing is that a batch process may hold locks on a table that a user is trying to access. If there is a lock on a table, the user will be refused access until the lock is freed by the batch process, which could take hours. If possible, batch processes should run when demand on system resources is at its lowest. Don't make the users' transactions compete with batch. Nobody wins that game.
When you load data into an indexed table, you can usually expect a great deal of index activity, especially if you are updating a high percentage of rows in the table. Look at it this way. If you are studying a book and highlighting key points for future reference, you may find it quicker to browse through the book from beginning to end rather than using the index to locate your key points. (Using the index would be efficient if you were highlighting only a small portion of the book.)
To maximize the efficiency of batch loads/updates that affect a high percentage of rows in a table, you can take these three basic steps to disable an index (a sketch in SQL follows this list):

1. Drop the table's index.

2. Load/update the table's data.

3. Rebuild the table's index.
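Here is a minimal Oracle sketch of the pattern, reusing the mailing list table from later in this lesson (the staging table MAILING_TBL_STAGE is hypothetical):

-- 1. Drop the index.
DROP INDEX MAILING_IDX;

-- 2. Load/update the table's data.
INSERT INTO MAILING_TBL
SELECT * FROM MAILING_TBL_STAGE;

-- 3. Rebuild the index.
CREATE INDEX MAILING_IDX
ON MAILING_TBL (INDIVIDUAL_ID);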
As you can guess, if you never issue a COMMIT or ROLLBACK command, transactions keep building within the rollback segments. Subsequently, if the data you are loading is greater in size than the available space in the rollback segments, the database will essentially come to a halt and disallow further transactional activity. Not issuing COMMIT commands is a common programming pitfall; regular COMMITs help to ensure stable performance of the entire database system.
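As a hedged PL/SQL sketch of this advice (the table names and the 1,000-row commit interval are assumptions; ask your DBA for a suitable interval), a batch archive that commits at regular intervals might look like this:

DECLARE
   v_rows NUMBER := 0;
BEGIN
   FOR rec IN (SELECT ORDER_ID, ORDER_DATE
               FROM ORDERS_TBL
               WHERE ORDER_DATE < SYSDATE - 365) LOOP
      INSERT INTO ORDERS_HISTORY_TBL (ORDER_ID, ORDER_DATE)
      VALUES (rec.ORDER_ID, rec.ORDER_DATE);
      v_rows := v_rows + 1;
      IF MOD(v_rows, 1000) = 0 THEN
         COMMIT;   -- release rollback segment space at regular intervals
      END IF;
   END LOOP;
   COMMIT;         -- commit any remaining rows
END;
/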
The management of rollback segments is a complex and vital database administrator (DBA) responsibility because transactions dynamically affect the rollback segments, and in turn, affect the overall performance of the database as well as individual SQL statements. So when you are loading large amounts of data, be sure to issue the COMMIT command on a regular basis. Check with your DBA for advice on how often to commit during batch transactions. (See Figure 15.2.)
Figure 15.2.
The rollback area.
As you can see in Figure 15.2, when a user performs a transaction, the changes are retained in the rollback area.
The result of constant high-volume changes to a database is growth, which in turn yields fragmentation. Fragmentation can easily get out of hand if growth is not managed properly. Oracle allocates an initial extent to a table when it is created. When loaded data fills the table's initial extent, a next extent, whose size is also defined when the table is created, is allocated.
Sizing tables and indexes is essentially a DBA function and can drastically affect SQL statement performance. The first step in growth management is to be proactive. Allow room for tables to grow from day one, within reason. Also plan to defragment the database on a regular basis, even if doing so means developing a weekly routine. Here are the basic conceptual steps involved in defragmenting tables and indexes in a relational database management system:
1. Back up the table(s) by making a copy of the data.

2. Drop the table(s) and/or index(es).
3. Rebuild the table(s) and/or index(es) with new space allocation.
4. Restore the data into the newly built table(s).
5. Re-create the index(es) if necessary.
6. Reestablish user/role permissions on the table if necessary.
7. Save the backup of your table until you are absolutely sure that the new table was built successfully. If you choose to discard the backup of the original table, you should first make a backup of the new table after the data has been fully restored.
The following example applies these steps to a mailing list table in an Oracle database environment.
WARNING: Never get rid of the backup of your table until you are sure that the new table was built successfully.
CREATE TABLE MAILING_TBL_BKUP AS SELECT * FROM MAILING_TBL;
Table Created.
DROP TABLE MAILING_TBL;

Table dropped.

CREATE TABLE MAILING_TBL
( INDIVIDUAL_ID     VARCHAR2(12)  NOT NULL,
  INDIVIDUAL_NAME   VARCHAR2(30)  NOT NULL,
  ADDRESS           VARCHAR2(40)  NOT NULL,
  CITY              VARCHAR2(25)  NOT NULL,
  STATE             VARCHAR2(2)   NOT NULL,
  ZIP_CODE          VARCHAR2(9)   NOT NULL )
TABLESPACE TABLESPACE_NAME
STORAGE ( INITIAL NEW_SIZE
          NEXT    NEW_SIZE );

Table created.

INSERT INTO MAILING_TBL
SELECT * FROM MAILING_TBL_BKUP;

93,451 rows inserted.

CREATE INDEX MAILING_IDX ON MAILING_TBL
( INDIVIDUAL_ID )
TABLESPACE TABLESPACE_NAME
STORAGE ( INITIAL NEW_SIZE
          NEXT    NEW_SIZE );

Index created.

GRANT SELECT ON MAILING_TBL TO PUBLIC;

Grant succeeded.

DROP TABLE MAILING_TBL_BKUP;

Table dropped.
Figure 15.3.
Using available disks to enhance performance.
The scenario in Figure 15.3 uses four devices: disk01 through disk04. The objective when spreading your heavy database areas and objects is to keep areas of high use away from one another.
Notice that in Figure 15.3 the tables and indexes are stored on different devices. You can also see how a "Big Table" or index may be striped across two or more devices. This technique splits the table into smaller segments that can be accessed simultaneously. Striping a table or index across multiple devices is a way to control fragmentation. In this scenario, tables may be read while their corresponding indexes are being referenced, which increases the speed of overall data access.
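One hedged way to achieve this layout in Oracle (the file names and sizes are illustrative) is to build the tablespace that holds a large table from datafiles on separate disks, so the table's extents can spread across both devices:

CREATE TABLESPACE FACT_TS
DATAFILE '/disk01/oradata/fact_ts01.dbf' SIZE 500M,
         '/disk02/oradata/fact_ts02.dbf' SIZE 500M;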
TIP: Also note how the application tables and indexes have been placed on each disk. Tables and indexes should be spread as much as possible.
This example is really quite simple. Depending on the function, size, and system-related issues of your database, you may find a similar method for optimizing system resources that works better. In a perfect world where money is no obstacle, the best configuration is to have a separate disk for each major database entity, including large tables and indexes.
Tuning a database depends very much on the specific database system you are using. Obviously, tuning a database entails much more than just preparing queries and letting them fly. On the other hand, you won't get much reward for tuning a database when the application SQL is not fine-tuned itself. Professionals who tune databases for a living often specialize in one database product and learn as much as they possibly can about its features and idiosyncrasies. Although database tuning is often looked upon as a painful task, it can provide very lucrative employment for the people who truly understand it.
NOTE: The DBA and system administrator should work together to balance database space allocation and optimize the memory that is available on the server.
This section summarizes the most common obstacles in system performance and database response time.
TIP: You can use various methods to optimize the structure of an SQL statement, depending upon the steps taken by the database server during SQL statement processing.
Oracle has two popular tools for managing SQL statement performance. These tools are explain plan and tkprof. The explain plan tool identifies the access path that will be taken when the SQL statement is executed. tkprof measures the performance by time elapsed during each phase of SQL statement processing. Oracle Corporation also provides other tools that help with SQL statement and database analysis, but the two mentioned here are the most popular. If you want to simply measure the elapsed time of a query in Oracle, you can use the SQL*Plus command SET TIMING ON.
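For instance, a minimal explain plan session might look like this (it assumes a PLAN_TABLE already exists, created with Oracle's utlxplan.sql script; the statement ID and query are illustrative):

EXPLAIN PLAN SET STATEMENT_ID = 'emp1' FOR
SELECT * FROM EMPLOYEE_TBL
WHERE EMPLOYEE_ID = '13245';

SELECT OPERATION, OPTIONS, OBJECT_NAME
FROM PLAN_TABLE
WHERE STATEMENT_ID = 'emp1';

The OPERATION and OPTIONS columns show, for example, whether a TABLE ACCESS was FULL or BY ROWID through an index.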
SET TIMING ON and other SET commands are covered in more depth on Day 20, "SQL*Plus."
Sybase's SQL Server has diagnostic tools for SQL statements. These options come in the form of SET commands that you can add to your SQL statements. (These commands are similar to Oracle's SET commands.) Some common commands are SET SHOWPLAN ON, SET STATISTICS IO ON, and SET STATISTICS TIME ON. These SET commands display output concerning the steps performed in a query, the number of reads and writes required to perform the query, and general statement-parsing information. SQL Server SET commands are covered on Day 19, "Transact-SQL: An Introduction."
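A hedged Transact-SQL sketch of turning these diagnostics on for a single query (the SET commands take effect for the batches that follow them):

SET SHOWPLAN ON
SET STATISTICS IO ON
SET STATISTICS TIME ON
GO
SELECT * FROM FACT_TABLE WHERE STATUS_CD = 'A'
GO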
Application developers must know the data. The key to an optimal database design is thorough knowledge of the application's data. Developers and production programmers must know when to use indexes, when to add another index, and when to allow batch jobs to run. Always plan batch loads and keep batch processing separate from daily transactional processing.
Databases can be tuned to improve the performance of individual applications that access them. Database administrators must be concerned with the daily operation and performance of the database. In addition to the meticulous tuning that occurs behind the scenes, the DBA can usually offer creative suggestions for accessing data more efficiently, such as manipulating indexes or reconstructing an SQL statement. The DBA should also be familiar with the tools that are readily available with the database software to measure performance and provide suggestions for statement tweaking.
A performance gain depends on the size of your tables, whether or not columns in the table are indexed, and other related factors. In a very large database, a complex query that runs for hours can sometimes be cut to minutes. In the case of transactional processing, streamlining an SQL statement can save important seconds for the end user.
Q How do I coordinate my batch loads or updates?
A Check with the database administrator and, of course, with management when scheduling a batch load or update. If you are a system engineer, you probably will not know everything that is going on within the database.
Q How often should I commit my batch transactions?
A Check with the DBA for advice. The DBA will need to know approximately how much data you are inserting, updating, or deleting. The frequency of COMMIT statements should also take into account other batch loads occurring simultaneously with other database activities.
Q Should I stripe all of my tables?
A Striping offers performance benefits only for large tables and/or for tables that are heavily accessed on a regular basis.
2. Should tables and their corresponding indexes reside on the same disk?
3. Why is the arrangement of conditions in an SQL statement important?
4. What happens during a full-table scan?
5. How can you avoid a full-table scan?
6. What are some common hindrances of general performance?
SELECT EMPLOYEE.LAST_NAME, EMPLOYEE.FIRST_NAME, EMPLOYEE.MIDDLE_NAME,
       EMPLOYEE.ADDRESS, EMPLOYEE.PHONE_NUMBER, PAYROLL.SALARY,
       PAYROLL.POSITION, EMPLOYEE.SSN, PAYROLL.START_DATE
FROM EMPLOYEE, PAYROLL
WHERE EMPLOYEE.SSN = PAYROLL.SSN
AND EMPLOYEE.LAST_NAME LIKE 'S%'
AND PAYROLL.SALARY > 20000;
593 individuals have the last name SMITH.
712 individuals live in INDIANAPOLIS.
3,492 individuals are MALE.
1,233 individuals earn a salary >= 30,000.
5,009 individuals are single.
SELECT M.INDIVIDUAL_NAME, M.ADDRESS, M.CITY, M.STATE, M.ZIP_CODE,
       S.SEX, S.MARITAL_STATUS, S.SALARY
FROM MAILING_TBL M,
     INDIVIDUAL_STAT_TBL S
WHERE M.INDIVIDUAL_NAME LIKE 'SMITH%'
AND M.CITY = 'INDIANAPOLIS'
AND S.SEX = 'MALE'
AND S.SALARY >= 30000
AND S.MARITAL_STATUS = 'S'
AND M.INDIVIDUAL_ID = S.INDIVIDUAL_ID;