Using Extended Statistics to Optimize Multi-Column Relationships and Function Based Statistics

Using Extended Statistics to Optimize Multi-Column Relationships and Function-Based Statistics

Purpose

The goal of this tutorial is to show you how to use extended statistics to optimize multi-column relationships and function-based statistics.

Time to Complete

Approximately 30 minutes

Topics

This tutorial covers the following topics:

	Overview
	Prerequisites
	Determining Single Column Statistics
	Gathering Histograms on Skewed Columns
	Creating Extended Statistics to Correlate Columns
	Creating Extended Statistics for a Function Used on a Column
	Cleanup
	Summary

Viewing Screenshots

Place the cursor over this icon to load and view all the screenshots for this tutorial. (Caution: This action loads all screenshots simultaneously, so response time may be slow depending on your Internet connection.)

Note: Alternatively, you can place the cursor over an individual icon in the following steps to load and view only the screenshot associated with that step. You can hide an individual screenshot by clicking it.

Overview

In real-world data there is often a correlation between two or more columns in a table. For example, job title and salary are related (the VP of a company is likely to earn a lot more than the janitor does) or car make and price (a BMW is likely to be a lot more expensive than a Honda). Up until now, the optimizer has had no way of knowing that these relationships exist between the columns in a table. When a query was executed against the table with multiple, single column predicates, it was impossible for the optimizer to calculate the correct selectivity of these predicates as it had no way of knowing if the columns were related or not.

It has also been extremely hard for the optimizer to calculate the correct selectivity for a column that has a function apply to it. For example, UPPER(surname)=.SMITH.

In Oracle Database 11g, extended statistics (multi-column statistics) have been introduced, which allow you to collect statistics on a group of columns as a whole and on functions, thus enabling the optimizer to calculate the selectivity of these predicates correctly. As a result, the optimizer is aware of the correct selectivity (cardinality). This tutorial demonstrates why you need extended statistics and how you can create them.

Back to Topic List

Prerequisites

Before starting this tutorial, you should first complete the following steps:

1.	Install Oracle Database 11g.
2.	Make sure the SH user is unlocked. Open a terminal window and execute the following: sqlplus sys/<syspassword> as sysdba alter user sh identified by sh account unlock; exit
3.	Download and unzip the multicolstats.zip file into a working directory.
4.	You need to create and load the data for the CUSTOMERS_OBE table. From your terminal window, execute the following: cd <sql_files_directory> imp sh/sh file= customers_obe.dmp log=imp.log full=y

Back to Topic List

Determining Single Column Statistics

A good example of correlated or related columns are the country_id and cust_state_province columns in the CUSTOMERS_OBE table. When the value of cust_state_province is 'CA' , the value of country_id is 'US'. There is a skew in the data in these two columns, which means the majority of rows in the table have the values 'CA' and 'US'. Both the relationship between the columns and the skew in the data make it difficult for the optimizer to calculate the selectivity or cardinality of these columns correctly when they are used together in a query. Extended statistics should help in this situation. Let's begin by checking what the actual cardinality or number of rows in the CUSTOMERS_OBE table is when the country_id is 'US' and cust_state_province is 'CA'. Perform the following steps:

1.	You first want to check what the actual cardinality or number of rows in the CUSTOMER_OBE table is when the country_id is 'US' and cust_state_province is 'CA'. Open a terminal window and execute the following commands: CD<sql_files_directory> sqlplus sh/sh set echo on @check_cardinality The check_cardinality.sql code is as follows: select count(*) from customers_obe where country_id = 'US' and cust_state_province = 'CA';
2.	Now that you know the actual number of rows returned, you can gather statistics on the table. From your terminal window, execute the following SQL script: @gather_stats The gather_stats.sql code is as follows: exec dbms_stats.gather_table_stats(null,'customers_obe', - method_opt => 'for all columns size 1');
3.	Review the column statistics that were generated. From your terminal window, execute the following SQL scripts: @review_stats The review_stats.sql code is as follows: select column_name, num_distinct, histogram from user_tab_col_statistics where table_name = 'CUSTOMERS_OBE'; You see that the number of distinct values are for both the country_id and cust_state_province columns. Both of these values look accurate.
4.	Given these statistics, check the number of rows the optimizer estimates will be returned by the query. From your terminal window, execute the following SQL script: @explain_plan The explain_plan.sql code is as follows: explain plan for select * from customers_obe where country_id = 'US' and cust_state_province = 'CA'; select plan_table_output from table(dbms_xplan.display('plan_table',null,'BASIC ROWS')); With just basic statistics, the optimizer thinks there will only be 1 row returned. Because you know that this is not true, you need to provide better statistics to the optimizer so it can determine the correct row count.

Back to Topic List

Gathering Histograms on Skewed Columns

Prior to Oracle Database 11g, there was no way to tell the optimizer what the correct combined selectivity is for multiple columns. The only insight given to the optimizer was that there was a data skew in both of these columns. By making the optimizer aware of the data skew in Oracle Database 11g, a more accurate selectivity can be calculated. In Oracle Database 11g, you can gather histograms on the skewed columns. After you have histograms on the columns, you can recheck the optimizer's estimate for the number of rows. Perform the following steps:

Gather a histogram on the skewed columns. From your terminal window, execute the following SQL script:

@gather_histogram

The gather_histogram.sql code is as follows:

exec dbms_stats.gather_table_stats(null,'customers_obe', -
      method_opt => 'for all columns size skewonly');

select column_name, num_distinct, histogram
from user_tab_col_statistics where table_name = 'CUSTOMERS_OBE';

Now you can generate the explain plan to see what the optimizer estimates are now. From your terminal window, execute the following SQL script:

@explain_plan

The explain_plan.sql code is as follows:

explain plan for
select *
from customers_obe
where country_id = 'US' and cust_state_province = 'CA';

select plan_table_output
from table(dbms_xplan.display('plan_table',null,'BASIC ROWS'));

There is a slight improvement in the estimate because there are histograms for individual columns. However, the optimizer is still not aware that there is a relationship or correlation between the two columns.

Back to Topic List

Creating Extended Statistics to Correlate Columns

In Oracle Database 11g, the optimizer can learn about this correlation by creating extended statistics for the columns. Extended statistics means the next time you gather statistics on the CUSTOMERS_OBE table, an extra set of statistics for the combined group of country_id and cust_state_province is gathered. Perform the following steps:

Create the extended statistics group. From your terminal window, execute the following SQL script:

@create_extended_stats

The create_extended_stats.sql code is as follows:

select  dbms_stats.create_extended_stats(null,'customers_obe',
         '(country_id, cust_state_province)')
from dual;

The output of create_extended_stats.sql is a system-generated name for the virtual column that gets created.

Execute the script you ran previously to gather a histogram on the skewed columns. From your terminal window, execute the following SQL script:

@gather_histogram

The gather_histogram.sql code is as follows:

exec dbms_stats.gather_table_stats(null,'customers_obe', -
      method_opt => 'for all columns size skewonly');

select column_name, num_distinct, histogram
from user_tab_col_statistics where table_name = 'CUSTOMERS_OBE';

You should see a new column listed for the table that has a system-generated name, which was returned by the create_extended_stats procedure.

Rerun the explain plan. From your terminal window, execute the following script:

@explain_plan

The explain_plan.sql code is as follows:

explain plan for
select *
from customers_obe
where country_id = 'US' and cust_state_province = 'CA';

select plan_table_output
from table(dbms_xplan.display('plan_table',null,'BASIC ROWS'));

Now the optimizer gets the correct number of rows for the query.

Back to Topic List

Creating Extended Statistics for a Function Used on a Column

The optimizer can also have trouble finding the right cardinality when there is a function or expression used on a column. Take the LOWER function for example; this function takes a string or a character column and returns that string in all lowercase letters. Perform the following steps:

1.	You want to run a query using a function that selects the total number of rows who have a country_id equal to 'US'. The value passed to the query is in lowercase so the LOWER function to the column country_id is specified. From your terminal window, execute the following script: @get_count_lower The get_count_lower.sql code is as follows: select count(*) from customers_obe where lower(country_id) = 'us'; The actual number of rows is returned.
2.	Now you can run the explain plan again to see how many rows the optimizer thinks this query will return. From your terminal window, execute the following script: @explain_plan_lowercase The explain_plan_lowercase.sql code is as follows: explain plan for select * from customers_obe where lower(country_id) = 'us'; select plan_table_output from table(dbms_xplan.display('plan_table',null,'BASIC ROWS')); The optimizer did not get close to the right number of rows.
3.	If you create extended statistics for the expression LOWER(country_id), you can help the optimizer get the correct number of rows. In the standard gather_table_stats procedure, the method_opt argument allows you to specify the extension or the extended statistic you want to collect. The gather_table_stats procedure creates the extended statistics automatically if it does not exist. From your terminal window, execute the following script: @gather_stats_lower_col The gather_stats_lower_col.sql code is as follows: exec dbms_stats.gather_table_stats(null,'customers_obe', - method_opt => - 'for all columns size skewonly for columns (lower(country_id))');
4.	Now that you created the extended statistic and gathered statistics, you can look at the statistics for the CUSTOMERS_OBE table. You should see a new column with a system-generated name. From your terminal window, execute the following script: @review_col_stats The review_col_stats.sql code is as follows: select column_name, num_distinct, histogram from user_tab_col_statistics where table_name = 'CUSTOMERS_OBE';
5.	Rerun the explain plan to see what the optimizer estimates. From your terminal window, execute the following script: @explain_plan_lowercase The explain_plan_lowercase.sql code is as follows: explain plan for select * from customers_obe where lower(country_id) = 'us'; select plan_table_output from table(dbms_xplan.display('plan_table',null,'BASIC ROWS'));
6.	You can also regenerate the virtual column definition from the system-generated name. From your terminal window, execute the following command: select dbms_metadata.get_ddl('TABLE','CUSTOMERS_OBE') from dual;

Back to Topic List

Cleanup

Now you can cleanup your environment by performing the following steps:

From your SQL*Plus session, execute the following command:

DROP TABLE CUSTOMERS_OBE;

Back to Topic List

Summary

In this tutorial, you learned how to:

	Determine single column statistics
	Gather histograms on skewed columns
	Create extended statistics to correlate columns
	Create extended statistics for a function used on a column

Back to Topic List

Place your cursor on this icon to hide all screenshots.