The goal of this tutorial is to show you how to use extended statistics to optimize multi-column relationships and function-based statistics.
Approximately 30 minutes
This tutorial covers the following topics:
Place the cursor over this icon to load and view all the screenshots for this tutorial. (Caution: This action loads all screenshots simultaneously, so response time may be slow depending on your Internet connection.)
Note: Alternatively, you can place the cursor over an individual icon in the following steps to load and view only the screenshot associated with that step. You can hide an individual screenshot by clicking it.
In real-world data there is often a correlation between two
or more columns in a table. For example, job title and salary are related (the
VP of a company is likely to earn a lot more than the janitor does) or car make
and price (a BMW is likely to be a lot more expensive than a Honda). Up until
now, the optimizer has had no way of knowing that these relationships exist
between the columns in a table. When a query was executed against the table
with multiple, single column predicates, it was impossible for the optimizer
to calculate the correct selectivity of these predicates as it had no way of
knowing if the columns were related or not.
It has also been extremely hard for the optimizer to calculate the correct selectivity
for a column that has a function apply to it. For example, UPPER(surname)=.SMITH.
In Oracle Database 11g, extended statistics (multi-column statistics)
have been introduced, which allow you to collect statistics on a group of columns
as a whole and on functions, thus enabling the optimizer to calculate the selectivity
of these predicates correctly. As a result, the optimizer is aware of the correct
selectivity (cardinality). This tutorial demonstrates why you need extended
statistics and how you can create them.
Before starting this tutorial, you should first complete the following steps:
1. |
Install Oracle Database 11g.
|
|
2. |
Make sure the SH user is unlocked. Open a terminal window and execute the following: sqlplus sys/<syspassword> as sysdba
alter user sh identified by sh account unlock;
exit
|
|
3. |
Download and unzip the multicolstats.zip file into a working directory.
|
|
4. |
You need to create and load the data for the CUSTOMERS_OBE table. From your terminal window, execute the following: cd <sql_files_directory> imp sh/sh file= customers_obe.dmp log=imp.log full=y
|
A good example of correlated or related columns are the country_id and cust_state_province columns in the CUSTOMERS_OBE table. When the value of cust_state_province is 'CA' , the value of country_id is 'US'. There is a skew in the data in these two columns, which means the majority of rows in the table have the values 'CA' and 'US'. Both the relationship between the columns and the skew in the data make it difficult for the optimizer to calculate the selectivity or cardinality of these columns correctly when they are used together in a query. Extended statistics should help in this situation. Let's begin by checking what the actual cardinality or number of rows in the CUSTOMERS_OBE table is when the country_id is 'US' and cust_state_province is 'CA'. Perform the following steps:
1. | You first want to check what the actual cardinality or number of rows in the CUSTOMER_OBE table is when the country_id is 'US' and cust_state_province is 'CA'. Open a terminal window and execute the following commands: CD<sql_files_directory>
sqlplus sh/sh
set echo on
@check_cardinality
The check_cardinality.sql code is as follows: select count(*)
|
|
2. | Now that you know the actual number of rows returned, you can gather statistics on the table. From your terminal window, execute the following SQL script: @gather_stats
The gather_stats.sql code is as follows: exec dbms_stats.gather_table_stats(null,'customers_obe', -
|
|
3. |
Review the column statistics that were generated. From your terminal window, execute the following SQL scripts: @review_stats
The review_stats.sql code is as follows: select column_name, num_distinct, histogram
You see that the number of distinct values are for both the country_id and cust_state_province columns. Both of these values look accurate.
|
|
4. | Given these statistics, check the number of rows the optimizer estimates will be returned by the query. From your terminal window, execute the following SQL script: @explain_plan
The explain_plan.sql code is as follows: explain plan for
With just basic statistics, the optimizer thinks there will only be 1 row returned. Because you know that this is not true, you need to provide better statistics to the optimizer so it can determine the correct row count.
|
Prior to Oracle Database 11g, there was no way to tell the optimizer what the correct combined selectivity is for multiple columns. The only insight given to the optimizer was that there was a data skew in both of these columns. By making the optimizer aware of the data skew in Oracle Database 11g, a more accurate selectivity can be calculated. In Oracle Database 11g, you can gather histograms on the skewed columns. After you have histograms on the columns, you can recheck the optimizer's estimate for the number of rows. Perform the following steps:
1. | Gather a histogram on the skewed columns. From your terminal window, execute the following SQL script: @gather_histogram
The gather_histogram.sql code is as follows: exec dbms_stats.gather_table_stats(null,'customers_obe', -
|
|
2. | Now you can generate the explain plan to see what the optimizer estimates are now. From your terminal window, execute the following SQL script: @explain_plan
The explain_plan.sql code is as follows: explain plan for
There is a slight improvement in the estimate because there are histograms for individual columns. However, the optimizer is still not aware that there is a relationship or correlation between the two columns.
|
In Oracle Database 11g, the optimizer can learn about this correlation by creating extended statistics for the columns. Extended statistics means the next time you gather statistics on the CUSTOMERS_OBE table, an extra set of statistics for the combined group of country_id and cust_state_province is gathered. Perform the following steps:
1. | Create the extended statistics group. From your terminal window, execute the following SQL script: @create_extended_stats
The create_extended_stats.sql code is as follows: select dbms_stats.create_extended_stats(null,'customers_obe',
The output of create_extended_stats.sql is a system-generated name for the virtual column that gets created.
|
|
2. |
Execute the script you ran previously to gather a histogram on the skewed columns. From your terminal window, execute the following SQL script: @gather_histogram
The gather_histogram.sql code is as follows: exec dbms_stats.gather_table_stats(null,'customers_obe', -
You should see a new column listed for the table that has a system-generated name, which was returned by the create_extended_stats procedure.
|
|
3. | Rerun the explain plan. From your terminal window, execute the following script: @explain_plan
The explain_plan.sql code is as follows: explain plan for
Now the optimizer gets the correct number of rows for the query.
|
The optimizer can also have trouble finding the right cardinality when there is a function or expression used on a column. Take the LOWER function for example; this function takes a string or a character column and returns that string in all lowercase letters. Perform the following steps:
1. |
You want to run a query using a function that selects the total number of rows who have a country_id equal to 'US'. The value passed to the query is in lowercase so the LOWER function to the column country_id is specified. From your terminal window, execute the following script: @get_count_lower
The get_count_lower.sql code is as follows: select count(*)
The actual number of rows is returned.
|
|
2. | Now you can run the explain plan again to see how many rows the optimizer thinks this query will return. From your terminal window, execute the following script: @explain_plan_lowercase
The explain_plan_lowercase.sql code is as follows: explain plan for
The optimizer did not get close to the right number of rows.
|
|
3. |
If you create extended statistics for the expression LOWER(country_id), you can help the optimizer get the correct number of rows. In the standard gather_table_stats procedure, the method_opt argument allows you to specify the extension or the extended statistic you want to collect. The gather_table_stats procedure creates the extended statistics automatically if it does not exist. From your terminal window, execute the following script: @gather_stats_lower_col
The gather_stats_lower_col.sql code is as follows: exec dbms_stats.gather_table_stats(null,'customers_obe', -
|
|
4. |
Now that you created the extended statistic and gathered statistics, you can look at the statistics for the CUSTOMERS_OBE table. You should see a new column with a system-generated name. From your terminal window, execute the following script: @review_col_stats
The review_col_stats.sql code is as follows: select column_name, num_distinct, histogram
|
|
5. | Rerun the explain plan to see what the optimizer estimates. From your terminal window, execute the following script: @explain_plan_lowercase
The explain_plan_lowercase.sql code is as follows: explain plan for
|
|
6. |
You can also regenerate the virtual column definition from the system-generated name. From your terminal window, execute the following command: select dbms_metadata.get_ddl('TABLE','CUSTOMERS_OBE') from dual;
|
Now you can cleanup your environment by performing the following steps:
1. | From your SQL*Plus session, execute the following command: DROP TABLE CUSTOMERS_OBE;
|
In this tutorial, you learned how to:
Determine single column statistics | ||
Gather histograms on skewed columns | ||
Create extended statistics to correlate columns | ||
Create extended statistics for a function used on a column |