|
SQLBy
Jonathan Gennick |
Read all about the SELECT statement in the Oracle SQL Reference manual, available for various releases of the Oracle database server in the Documenation section of Oracle Technology Network (http://technet.oracle.com/). |
For example, summarizing (using the GROUP BY clause) can mask poorly specified join conditions as well as other mistakes in the WHERE clause. Summarize only after you are certain you are summarizing the correct data. You might also incorrectly specify join conditions right from the start, but other conditions in the WHERE clause might make the results of the query appear to be working. You will get bitten down the road when the data changes or when someone else modifies the query.
Step 1. The FROM clause. First, I identify the tables that hold the data I need. Then, I work through the process of joining them correctly.
Step 2. The WHERE clause. Once I'm sure I've specified the join conditions correctly, I add any other necessary conditions to the WHERE clause.
Step 3. The GROUP BY clause. Only after I'm sure I have the correct data do I even think about summarizing it.
Step 4. The HAVING clause. Once I've summarized the data correctly, I eliminate any summary rows I don't want in the result set.
Step 5. The SELECT statement. Having gotten this far, if I still need to add any columns to the SELECT list, I add them at this time.
Step 6. The ORDER BY clause. Usually the very last thing I do in this process is specify the sort criteria.
It's true that I need to at least think about the columns I list in the SELECT statement in order to identify the tables, and it's also true that I need to select some data in order to test the join. However, I'm not contradicting myself. My focus during Step 1 is on the join, and any columns I select will be the primary- and foreign-key columns necessary to verify that I've done the join correctly.
When I'm working through these steps, I take care to sanity-check the query at each step. To verify that I've joined the tables correctly, for example, I select the columns involved in the join and review them. Equijoins are easy to specify and easy to validate—you just need to be sure the join columns from each table match. Things get a little tougher if you are joining two tables based on a range of values, such as a range of dates.
Where possible, I generate test data to verify that boundary conditions and other exceptional cases don't cause my query to fail. When joining tables on a range of dates, I may generate some test data that has a time component, because it's not uncommon for DATE fields to also have a time associated with them.
When building complex queries, here are some other things I do:
At times I am forced to violate my usual query-development order— if I'm joining tables based on summarized data, for example. In such a case, I work on that join after I've written the GROUP BY clause. But typically I find that using this approach is a good way to ensure that relationships among all data elements—even things that may be hidden from me at first, as we'll see in a minute—are taken into account to provide accurate information.
The Business Enterprise Program aims to enable blind and sight-impaired people to make a living by operating vending concessions, stores, and cafeterias in office buildings and rest areas throughout the state of Michigan. As you can see from Figure 1, the database design contains several relatively typical-looking many-to-one relationships among tables that hold data for facilities, operators, sites, permits, and periods. The one unusual thing is that many of the relationships have a time component, which makes the task of joining these tables a bit more "interesting" than usual.
Business Enterprise
Program Database Design |
FIGURE 1: This is a small snapshot of a larger design that I helped implement for the Michigan Commission for the Blind's Business Enterprise Program. The program enables blind and sight-impaired people to make a living by operating vending concessions, stores, and cafeterias in office buildings and rest areas throughout the state. The database contains several many-to-one relationships, and many of the relationships have a time component. |
These results look great, don't they? Two facilities of each type—two cafeterias, two stores, and two vending machines—were opened during the calendar year 2000. Although they may look good, these results are actually incorrect (as we'll see in a minute as we step through the incremental approach). If you just type the query in as a whole, that's not obvious.
If, on the other hand, you work through the query methodically and sanity-check the results as you go, the flaws quickly become apparent.
Let's look at the results of this interim query: The results list two lines of detail each for both sites 3 and 5 . But why, when we were expecting only one line each? Because our query doesn't take into account the one-to-many relationship between facility and site into account. We need a site record in order to get to a permit, but which site should we use, if a given facility has multiple sites?
Visit http://www.gennick.com/ for Oracle-related tips and techniques. |
The answer is that we should use the site flagged as the primary site. This doubtless comes as a surprise, because I haven't mentioned anything about the primary-site flag. Trust me, though, your clients will occasionally surprise you in the same way—sometimes you have to ferret out what they haven't thought to tell you. If we add the primary-site restriction to our join, our query now looks like that in Listing 3.
But wait! Can a site have more than one permit? The answer is yes. The database design allows multiple permits per site. This is a great example of how finding one problem can lead you to think of another. To resolve this issue, let's assert the business rule that the only permit that matters is the one most recently approved. To restrict the join to only the most recently approved permit, we must add a subquery to our WHERE clause, as in Listing 4.
This would also be a good time to generate some test cases for sites that do have multiple permits, so we can be sure that our subquery is behaving as we expect it to, but let's skip that step for now. If this subquery were any more complex, I would also be inclined to test it separately from the main query.
For a hands-on, step-by-step design and development project, see
www.oracle.com/ oramag/webcolumns/hyde.html. |
Notice that the open date for facility 3 includes a time component, which the original query didn't take into account. We need to write our query in such a way as to make the time of day irrelevant. Instead of checking to see if the open date is less than or equal to 31-Dec-2000, we can check to see if it is less than 1-Jan-2000. That way, any time up to midnight on December 31 will be included. Avoid using the TRUNC function on the date column, because that might preclude the Oracle database server from using any indexes on that column. Listing 6 shows the new version of our query.
This date example underscores the importance of looking at your data and sanity-checking your results. It also pays to validate any underlying assumptions your query makes about the data being retrieved.
Now that we know our query is returning the correct data, the next step is to summarize it. Our goal was to count the number of facilities of each type opened during the year. To achieve that, we need to summarize on the PER_SITE_TYPE column and we need to use the aggregate function COUNT to report the number of facilities of each type. Our final query, together with the results it returns, is shown in Listing 7.
These results are different from those reported by the first version of the query. By working through the query-development process in a methodical manner and sanity-checking our results after each step, we identified and resolved several problems that caused the query to return incorrect results.
For more information about the Michigan Commission for the
Blind, visit www.mfia.state.mi.us/ COMBLIND/MCB.htm |
In this article, we detected several problems just by looking at the data returned by the query in its various stages of development. An initial query gave erroneous results—which appeared to be just fine—because it didn't take into account the one-to-many relationship between facilities and sites. Other details of database design that might not have been immediately apparent—the primary site flag, for instance, were revealed only as we constructed our query piece by piece, testing the results as we went along. Had we not followed a methodical process, we might have missed some of these subtle, data-related issues.
Jonathan Gennick (jonathan@gennick.com) is an Oracle Certified Professional and independent consultant. He is author of several books, including Oracle SQL*Plus, The Definitive Guide (O'Reilly & Associates; ISBN: 1565925785) and is co-author of Oracle8i DBA Bible (IDG Books Worldwide; ISBN: 0764546236; 2000). Thanks to the Michigan Commission for the Blind Business Enterprise Program for its gracious permission to use real-world examples from its database for this article.