Definition
Query optimization is the refining of a standard
database query into a specific
method for retrieving
data.
The design of query optimization varies greatly depending on the type of database you are using. The most common databases that use query optimization are relational databases and object oriented databases.
SQL has been adapted as the industry query language standard, however many times query optimization is done with relational algebra. Regardless of what language you use the concepts are the same. However it may be useful to note that the word "SELECT" in SQL is actually the projection operator in relational algebra, and the word "WHERE" functions the same as the selection operator.
Broad Overview
The general idea of query optimization, ideally, is to figure out a list of possible plans and evaluate each one based on run time. That approach proves time consuming however so certain assumptions are usually made to derive an optimal plan initially.
The main operations in a database query are projections (column selection), selections(tuple or row selections) and joins (cartesian product of tables). See relational algebra for a more accurate description of these operators.
The order and way you execute each of these operations greatly varies the time a query will take. The other consideration is the searching mechanism used for finding the data.
In relational databases, data can be stored based on several keys, where a key is a unique data field that identifies any given tuple (row). Common data structures are hashes and Btrees, depending on the most likely query and type of data stored. Different data storage structures are useful if queries with range operators (>, <, >=, <=) are going to be present (ie. find all gpa scores higher than 3.0).
Simple Example
Here is an extremely simple example that will demonstrate some of the basic concepts behind query optimization.
Start with a relational database with a table for students and a table for classes. These tables are related so that you can look up which students belong in which classes. The way the tables are related (via another table or a shared key) will be ignored for now. Here is a query, in SQL, that might be requested.
SELECT class_ID
FROM Students, Classes
WHERE class_Name='Drinking 101' AND student_Name='Barney Gumble'
The same query in relational algebra
π_{class_ID}((σ_{student_Name='Barney Gumble'}(Students)) X (σ_{class_Name='Drinking 101'}(Classes))
The big X represents cartesian product and could be replaced with the join operator.
In order to execute this query four things have to be done: joining the two tables classes and students, selecting the classes with a className of Drinking 101, selecting the students with a studentName Barney Gumble, and projecting out of the remaining tuples the class ID. You can construct a tree to represent the order of these operations with the things to be executed first on the bottom, and things to be done last on the top.
π class_ID



σ class_Name='Drinking 101'



σ student_Name='Barney Gumble'



Students X Classes
In most situations the join (cartesian product) is the most
time consuming operation, therefore you want to reduce the size of the tables as much as possible before joining anything. With the tree above we are joining the table Students with the table Classes, which could be a really long operation if these tables are relatively large. If we do the
selections and
projections first it will reduce the size of the tables drastically. This is called pushing down projections and selections. Now we make a new tree to represent the change in operation order. Two operations on the same level can be run in
parallel.
Students X Classes



π class_ID
/ \
_____/ \_____
/ \
/ \
σ class_Name='Drinking 101' σ student_Name='Barney Gumble'
Notice in this tree that the selections and projections are closer to the bottom (ie pushed down) and get executed first. By doing this basic simplification we have drastically reduced the query time. Other considerations would be the type of join used (left outer joins are frequently optimal), and the search mechanisms used to find the class name and the student name. This gets fairly complicated with large queries and ranged searches, but that is the basics of query optimization.