alter session set statistics_level = all;
create table bricks (
brick_id integer not null primary key,
colour_rgb_value varchar2(10) not null,
shape varchar2(10) not null,
weight integer not null
);
create table colours (
colour_rgb_value varchar2(10) not null,
colour_name varchar2(10) not null
);
insert into colours values ( 'FF0000', 'red' );
insert into colours values ( '00FF00', 'green' );
insert into colours values ( '0000FF', 'blue' );
insert into bricks
select rownum,
case mod ( level, 3 )
when 0 then 'FF0000'
when 1 then '00FF00'
when 2 then '0000FF'
end,
case mod ( level, 3 )
when 0 then 'cylinder'
when 1 then 'cube'
when 2 then 'pyramid'
end,
floor ( 100 / rownum )
from dual
connect by level <= 100;
insert into bricks
select rownum + 1000,
case mod ( level, 3 )
when 0 then 'FF0000'
when 1 then '00FF00'
when 2 then '0000FF'
end,
case mod ( level, 3 )
when 0 then 'cylinder'
when 1 then 'cube'
when 2 then 'pyramid'
end,
floor ( 200 / rownum )
from dual
connect by level <= 200;
commit;
declare
stats dbms_stats.statrec;
distcnt number;
density number;
nullcnt number;
avgclen number;
begin
dbms_stats.gather_table_stats ( null, 'colours' );
dbms_stats.gather_table_stats ( null, 'bricks' );
dbms_stats.set_table_stats ( null, 'bricks', numrows => 30 );
dbms_stats.set_table_stats ( null, 'colours', numrows => 3000 );
dbms_stats.get_column_stats ( null, 'colours', 'colour_rgb_value',
distcnt => distcnt,
density => density,
nullcnt => nullcnt,
avgclen => avgclen,
srec => stats
);
stats.minval := utl_raw.cast_to_raw ( '0000FF' );
stats.maxval := utl_raw.cast_to_raw ( 'FF0000' );
dbms_stats.set_column_stats ( null, 'colours', 'colour_rgb_value', distcnt => 10, srec => stats );
dbms_stats.set_column_stats ( null, 'bricks', 'colour_rgb_value', distcnt => 10, srec => stats );
end;
/
You can query dictionary views to see the current stats for a table. Key statistics are the number of rows in the table (NUM_ROWS in *_TABLES) and, for each column, the number of distinct values and the low and high values (NUM_DISTINCT, LOW_VALUE and HIGH_VALUE in *_TAB_COL_STATISTICS).
Use this query to get these values for all the columns in the current schema:
select ut.table_name, ut.num_rows,
utcs.column_name, utcs.num_distinct,
case utc.data_type
when 'VARCHAR2' then
utl_raw.cast_to_varchar2 ( utcs.low_value )
when 'NUMBER' then
to_char ( utl_raw.cast_to_number ( utcs.low_value ) )
end low_val,
case utc.data_type
when 'VARCHAR2' then
utl_raw.cast_to_varchar2 ( utcs.high_value )
when 'NUMBER' then
to_char ( utl_raw.cast_to_number ( utcs.high_value ) )
end high_val
from user_tables ut
join user_tab_cols utc
on ut.table_name = utc.table_name
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
and utc.column_name = utcs.column_name
order by ut.table_name, utcs.column_name;
Note: the high and low value columns are in binary format. You need to convert them to view the values.
You can check to see how accurate these figures are by querying the tables:
select count ( distinct b.colour_rgb_value ) , count (*)
from bricks b;
select count ( distinct c.colour_rgb_value ) , count (*)
from colours c;
The actual values in the tables differ greatly from the stored stats! This makes it hard for the optimizer to find the best plan.
When choosing a plan for a query, the optimizer uses the statistics to estimate how many rows it will get from each table.
The bricks and colours tables have incorrect stats. So the optimizer chooses the "wrong" plan for this query:
select /*+ gather_plan_statistics */c.colour_name, count (*)
from bricks b
join colours c
on c.colour_rgb_value = b.colour_rgb_value
group by c.colour_name;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
The optimizer prefers to start a join with the table returning the fewest rows. For example, say your query joins a billion-row table to a hundred-row table. Before joining, it filters the billion-row table down to just ten rows. But it gets all one hundred rows from the other table. So the optimizer still prefers to read the billion-row table first, because it returns the fewest rows.
The previous query reads BRICKS first, which returns 300 rows, whereas the second table, COLOURS, returns just three. The tables should be the other way around in the join!
This happens because there's a large difference between the estimated and actual number of rows for each table. The closer the estimated and actual row counts are at each step in the plan, the more likely it is the optimizer has found the best plan.
When doing this assessment, you need to multiply the estimated rows by the number of starts for the operation:
estimated rows * starts ~ actual rows
When these are more than an order of magnitude different, it's likely there's a better plan available. To correct this, you should refresh the statistics!
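For example (hypothetical figures): if a plan step shows Starts = 3, E-Rows = 100 and A-Rows = 30,000, compare 3 * 100 = 300 estimated rows against 30,000 actual rows. That's two orders of magnitude out, so the stats feeding that step deserve a closer look.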
To update the statistics on a table, call the GATHER_TABLE_STATS routine in DBMS_STATS:
exec dbms_stats.gather_table_stats ( null, 'colours' ) ;
select ut.table_name, ut.num_rows, utcs.column_name, utcs.num_distinct
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'COLOURS';
Oracle Database has an automatic job that gathers table stats for you. In most cases you won't need to gather stats manually. But there are a couple of cases where you may want to do this.
If you insert, update or delete more than ten percent of the rows in a table it's likely there's a big mismatch between the stats and the real data. You know the process made a significant change to the data. So it can be worth gathering stats immediately, instead of waiting for the background job.
By default the database gathers stats on tables where more than ten percent of the rows have changed. On tables with billions of rows this means 100 million rows or more need to change before the automatic job will regather stats!
You can make the database start gathering sooner by lowering the gather threshold. Do this by setting table preferences. You view and set these using DBMS_STATS.
This sets the change threshold (stale percent) to one percent for the colours table:
select dbms_stats.get_prefs ( 'STALE_PERCENT', null, 'colours' ) from dual;
exec dbms_stats.set_table_prefs ( null, 'colours', 'STALE_PERCENT', 1 );
select dbms_stats.get_prefs ( 'STALE_PERCENT', null, 'colours' ) from dual;
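Once changes to a table pass its staleness threshold, the database flags it in the STALE_STATS column of *_TAB_STATISTICS. A quick check sketch (note the flag is only refreshed after the in-memory DML monitoring information is flushed, which happens periodically or via DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO):
select uts.table_name, uts.stale_stats
from user_tab_statistics uts
where uts.table_name in ( 'BRICKS', 'COLOURS' );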
Table preferences apply to both manual and automatic stats gathering routines. When gathering stats manually you can override the preference settings.
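For example, a minimal sketch (the parameter values here are only illustrative): parameters you pass explicitly to GATHER_TABLE_STATS take precedence over the stored preferences for that run, without changing the preferences themselves:
exec dbms_stats.gather_table_stats ( null, 'colours', method_opt => 'for all columns size 1' ) ;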
Gathering stats on a table may change the optimizer's row estimates for queries against it. If you have lots of queries accessing a table, updating stats may cause the database to have to create a new plan for them all. This adds significant workload to your database!
So the database decides when to invalidate the cursors. This means you may not see plan changes immediately after gathering stats on a table.
To force the optimizer to check queries for new plans immediately, set NO_INVALIDATE to FALSE when calling gather stats:
exec dbms_stats.gather_table_stats ( null, 'colours', no_invalidate => false ) ;
When getting plans in these tutorials you may find DBMS_XPLAN reporting the old, "wrong" plan. In these cases making small changes to the query (e.g. adding column aliases) forces the optimizer to reparse the query.
Remember: invalidating cursors when gathering stats will cause the optimizer to reparse all queries on this table. Use this with caution in production environments!
Replace /* TODO */ in this code to gather stats for the bricks table:
select ut.num_rows
from user_tables ut
where ut.table_name = 'BRICKS';
exec dbms_stats.gather_table_stats ( null, ' /* TODO */' ) ;
select count (*) from bricks;
select ut.num_rows
from user_tables ut
where ut.table_name = 'BRICKS';
Check this has updated the stats by verifying that the number of rows in the table matches the number reported in USER_TABLES.
After gathering stats on BRICKS and COLOURS, the join in the first module should now have the correct plan. This reads COLOURS first, and the row estimates match the actual rows:
select /*+ gather_plan_statistics */c.colour_name, count (*) c
from bricks b
join colours c
on c.colour_rgb_value = b.colour_rgb_value
group by c.colour_name;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
For simple queries having up-to-date stats is often good enough for the optimizer to choose the best plan. But there are a couple of common problems which can still lead to incorrect estimates: data skew and correlated columns.
The row estimates for this query are out by a factor of ten:
select /*+ gather_plan_statistics */count (*)
from bricks
where weight = 1;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
This is because the optimizer assumes there is an even distribution of the values. So it estimates the number of rows matching a particular value as:
# non-null rows in the table / number of distinct values in the column
In the real world many columns display skew - unevenness in the spread of values. Assuming a uniform distribution for these leads to wildly incorrect row estimates and slow plans.
Skew comes in two forms: range and value.
There is an uneven distribution of the weight values in the bricks table:
select weight, count (*)
from bricks
group by weight
order by weight;
There are 300 rows in the bricks table and 27 unique weight values. So the optimizer estimates ( 300 / 27 ) ~ 11 rows for both of these queries. But the first returns 150 rows and the second just one!
select /*+ gather_plan_statistics */count (*) from bricks
where weight = 1;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) from bricks
where weight = 200;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
You can also have skew in the range of the values. For example, there is a gap in the BRICK_IDs from 101 to 999:
with rws as (
select level r from dual
connect by level <= 15
)
select r, count ( brick_id )
from rws
left join bricks
on ceil ( brick_id / 100 ) = r
group by r
order by r;
Again, initially the database assumes the values are evenly distributed between the min and max values for BRICK_ID. So it estimates the same number of rows for these queries, even though the first and last return 100 rows and the middle returns none:
select /*+ gather_plan_statistics */count (*) from bricks
where brick_id between 0 and 100;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) from bricks
where brick_id between 400 and 500;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) from bricks
where brick_id between 1000 and 1100;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
To improve the row estimates, you can gather histograms on these columns.
Currently there are no histograms on BRICK_ID and WEIGHT in BRICKS. You can verify this by checking the HISTOGRAM and NUM_BUCKETS columns in *_TAB_COL_STATISTICS:
select utcs.column_name, utcs.histogram, utcs.num_buckets
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'BRICKS'
and utcs.column_name in ( 'BRICK_ID', 'WEIGHT' );
To overcome these skew problems, you can gather histograms on the relevant columns. The database will do this automatically, based on column usage and data skew, when you gather stats with the default SIZE AUTO setting. The rules for when this happens are: the column must appear in a WHERE clause, join condition or GROUP BY of a query you've run (so the database has recorded usage for it), and the data in the column must be skewed.
The queries in the previous module set the necessary column usage details. So gathering stats will create histograms on the WEIGHT and BRICK_ID columns:
exec dbms_stats.gather_table_stats ( null, 'bricks', no_invalidate => false ) ;
select utcs.column_name, utcs.histogram, utcs.num_buckets
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'BRICKS'
and utcs.column_name in ( 'BRICK_ID', 'WEIGHT' );
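You can also see which columns the database has recorded usage information for by calling DBMS_STATS.REPORT_COL_USAGE (a side check; it returns a CLOB report):
select dbms_stats.report_col_usage ( user, 'BRICKS' ) from dual;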
With the histogram in place, the optimizer gets much better row estimates for the previous queries on skewed values:
select /*+ gather_plan_statistics */count (*) c
from bricks
where weight = 1;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) c
from bricks
where weight = 200;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) c
from bricks
where brick_id between 0 and 100;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) c
from bricks
where brick_id between 400 and 500;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
There are four types of histograms in Oracle Database: frequency, top-frequency, hybrid and height-balanced.
When there are few different values in a column, the database will create a frequency histogram. These store the exact number of rows for each value in the column. These give the best row estimates for queries.
But there's a limit to how many different values a histogram can store exact counts for; by default a histogram has at most 254 buckets. So when a column has more distinct values than this, the database switches to a different histogram type.
From Oracle Database 12c, the optimizer will choose a hybrid or top-frequency histogram if there are many values in the column. Before this it would create height-balanced histograms. These were deprecated in 12c. You'll only see height-balanced histograms in 12c and later if you upgrade from an earlier release.
For more details on the algorithms for these, read about the histogram types in the docs.
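If you want to see the buckets themselves, you can query *_TAB_HISTOGRAMS. A quick sketch against the histogram on WEIGHT gathered above (for a frequency histogram, ENDPOINT_NUMBER is a cumulative row count):
select uth.endpoint_value weight, uth.endpoint_number cumulative_rows
from user_tab_histograms uth
where uth.table_name = 'BRICKS'
and uth.column_name = 'WEIGHT'
order by uth.endpoint_value;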
In most cases histograms help the optimizer find better plans for columns with data skew. But sometimes they can lead to unstable plans, in which case you may want to disable histograms on the columns used in the problematic queries.
The SIZE option of the METHOD_OPT parameter determines which histograms the database will create. By default, it uses:
for all columns size auto
This creates histograms based on the column usage and data skew rules described in the module Gathering Histograms.
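You can check a table's current METHOD_OPT preference the same way you checked STALE_PERCENT earlier:
select dbms_stats.get_prefs ( 'METHOD_OPT', null, 'bricks' ) from dual;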
To disable histograms for a column, set SIZE to one. Do this with the METHOD_OPT parameter of the gather stats routines. You can do this for a specific column or for ALL COLUMNS. This removes all the histograms on BRICKS:
exec dbms_stats.gather_table_stats ( null, 'bricks', method_opt => 'for all columns size 1' ) ;
select utcs.column_name, utcs.histogram, utcs.num_buckets
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'BRICKS';
If you want to ensure histograms remain disabled, it's best to do this by setting table preferences:
begin
dbms_stats.set_table_prefs (
null, 'bricks',
'method_opt', 'for all columns size 1'
);
dbms_stats.gather_table_stats ( null, 'bricks' ) ;
end;
/
select utcs.column_name, utcs.histogram
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'BRICKS';
Re-enable histograms for bricks by setting the size to auto using table preferences:
begin
dbms_stats.set_table_prefs (
null, 'bricks',
'method_opt', 'for TODO'
);
dbms_stats.gather_table_stats ( null, 'bricks' ) ;
end;
/
select utcs.column_name, utcs.histogram
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'BRICKS';
The query should show a HYBRID histogram for BRICK_ID and FREQUENCY histograms for WEIGHT and COLOUR_RGB_VALUE.
The row stats for BRICKS should now be up-to-date. But the row estimates for queries searching on both COLOUR_RGB_VALUE and SHAPE are still incorrect!
select /*+ gather_plan_statistics */count (*)
from bricks
where colour_rgb_value = 'FF0000'
and shape = 'cylinder';
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
This is because the colour and shape values are correlated. All the rows for each shape have the same colour. You can verify this with this query:
select colour_rgb_value, shape, count (*)
from bricks
group by colour_rgb_value, shape;
By default the optimizer doesn't know about this relationship. You can improve the row estimates in this case by creating extended statistics.
Define extended stats using dbms_stats as shown:
select dbms_stats.create_extended_stats ( null, 'bricks', '(colour_rgb_value, shape)' )
from dual;
exec dbms_stats.gather_table_stats ( null, 'bricks', method_opt => 'for columns (colour_rgb_value, shape)', no_invalidate => false ) ;
The argument (colour_rgb_value, shape) tells the optimizer to gather stats on the combinations of these columns. You can create extended stats on any set of columns in a table. Or even on expressions such as upper ( shape ).
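You can see which extensions exist on a table by querying *_STAT_EXTENSIONS; the EXTENSION column shows the column group or expression each one covers:
select se.extension_name, se.extension
from user_stat_extensions se
where se.table_name = 'BRICKS';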
With extended stats in place, the plans should now show the correct row estimates:
select /*+ gather_plan_statistics */count (*) c
from bricks
where colour_rgb_value = 'FF0000'
and shape = 'cylinder';
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
As with histograms, the database can detect and create column groups automatically based on column usage. To do this, you need to set the preference AUTO_STAT_EXTENSIONS to ON; by default it is OFF.
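A minimal sketch of enabling this at the table level (you can also set it globally with SET_GLOBAL_PREFS; check that your release supports this preference, which I believe arrived in 12.2):
exec dbms_stats.set_table_prefs ( null, 'bricks', 'AUTO_STAT_EXTENSIONS', 'ON' ) ;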
The optimizer also struggles to get correct row estimates when you apply functions to columns. This query fetches 150 rows from the table, but the optimizer thinks it'll get far fewer:
select /*+ gather_plan_statistics */ count (*) from bricks
where weight + 2 = 3;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
You can also overcome this with extended stats. Complete this code to create extended stats on (weight + 2) for the bricks table:
select dbms_stats.create_extended_stats ( null, 'bricks', '(TODO)' )
from dual;
exec dbms_stats.gather_table_stats ( null, 'bricks', method_opt => 'for all columns size skewonly', no_invalidate => false ) ;
Re-run the query to verify that the optimizer now has better row estimates:
select /*+ gather_plan_statistics */count (*) from bricks
where weight + 2 = 3;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
It's even better to avoid issues like this by removing functions from columns in your queries. For example, change the condition in this query to weight = ( 3 - 2 ). But this may not be possible with more complex formulas.
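For example, a rewritten version of the previous query (same result, but the optimizer can now use the column stats on WEIGHT directly):
select /*+ gather_plan_statistics */ count (*) from bricks
where weight = 3 - 2;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));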