alter session set statistics_level = all;
create table bricks (
brick_id integer not null primary key,
colour_rgb_value varchar2(10) not null,
shape varchar2(10) not null,
weight integer not null
);
create table colours (
colour_rgb_value varchar2(10) not null,
colour_name varchar2(10) not null
);
insert into colours values ( 'FF0000', 'red' );
insert into colours values ( '00FF00', 'green' );
insert into colours values ( '0000FF', 'blue' );
insert into bricks
select rownum,
case mod ( level, 3 )
when 0 then 'FF0000'
when 1 then '00FF00'
when 2 then '0000FF'
end,
case mod ( level, 3 )
when 0 then 'cylinder'
when 1 then 'cube'
when 2 then 'pyramid'
end,
floor ( 100 / rownum )
from dual
connect by level <= 100;
insert into bricks
select rownum + 1000,
case mod ( level, 3 )
when 0 then 'FF0000'
when 1 then '00FF00'
when 2 then '0000FF'
end,
case mod ( level, 3 )
when 0 then 'cylinder'
when 1 then 'cube'
when 2 then 'pyramid'
end,
floor ( 200 / rownum )
from dual
connect by level <= 200;
commit;
declare
stats dbms_stats.statrec;
distcnt number;
density number;
nullcnt number;
avgclen number;
begin
dbms_stats.gather_table_stats ( null, 'colours' );
dbms_stats.gather_table_stats ( null, 'bricks' );
dbms_stats.set_table_stats ( null, 'bricks', numrows => 30 );
dbms_stats.set_table_stats ( null, 'colours', numrows => 3000 );
dbms_stats.get_column_stats ( null, 'colours', 'colour_rgb_value',
distcnt => distcnt,
density => density,
nullcnt => nullcnt,
avgclen => avgclen,
srec => stats
);
stats.minval := utl_raw.cast_to_raw ( '0000FF' );
stats.maxval := utl_raw.cast_to_raw ( 'FF0000' );
dbms_stats.set_column_stats ( null, 'colours', 'colour_rgb_value', distcnt => 10, srec => stats );
dbms_stats.set_column_stats ( null, 'bricks', 'colour_rgb_value', distcnt => 10, srec => stats );
end;
/
You can query dictionary views to see the current stats for a table. Key statistics are the number of rows in the table (NUM_ROWS in *_TABLES) and, for each column, the number of distinct values and the low and high values (NUM_DISTINCT, LOW_VALUE and HIGH_VALUE in *_TAB_COL_STATISTICS).
Use this query to get these values for all the columns in the current schema:
select ut.table_name, ut.num_rows,
utcs.column_name, utcs.num_distinct,
case utc.data_type
when 'VARCHAR2' then
utl_raw.cast_to_varchar2 ( utcs.low_value )
when 'NUMBER' then
to_char ( utl_raw.cast_to_number ( utcs.low_value ) )
end low_val,
case utc.data_type
when 'VARCHAR2' then
utl_raw.cast_to_varchar2 ( utcs.high_value )
when 'NUMBER' then
to_char ( utl_raw.cast_to_number ( utcs.high_value ) )
end high_val
from user_tables ut
join user_tab_cols utc
on ut.table_name = utc.table_name
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
and utc.column_name = utcs.column_name
order by ut.table_name, utcs.column_name;
Note: the high and low value columns are in binary format. You need to convert them to view the values.
You can check to see how accurate these figures are by querying the tables:
select count ( distinct b.colour_rgb_value ) , count (*)
from bricks b;
select count ( distinct c.colour_rgb_value ) , count (*)
from colours c;
The actual values in the tables differ greatly from the stored stats! This makes it hard for the optimizer to find the best plan.
When choosing a plan for a query, the optimizer uses the statistics to estimate how many rows it will get from each table.
The bricks and colours tables have incorrect stats. So the optimizer chooses the "wrong" plan for this query:
select /*+ gather_plan_statistics */c.colour_name, count (*)
from bricks b
join colours c
on c.colour_rgb_value = b.colour_rgb_value
group by c.colour_name;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
The optimizer prefers to start a join with the table returning the fewest rows. For example, say your query joins a billion-row table to a hundred-row table. Before joining, it filters the billion-row table down to just ten rows. But it gets all one hundred rows from the other table. So the optimizer still prefers to read the billion-row table first, because it returns the fewest rows.
The previous query reads BRICKS first, which returns 300 rows, whereas the second table, COLOURS, returns just three. The tables should be the other way around in the join!
This happens because there's a large difference between the estimated and actual number of rows for each table. The closer the estimated and actual row counts are at each step in the plan, the more likely it is the optimizer has found the best plan.
When doing this assessment, you need to multiply the estimated rows by the number of starts for the operation:
estimated rows * starts ~ actual rows
When these are more than an order of magnitude different, it's likely there's a better plan available. To correct this, you should refresh the statistics!
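For example (hypothetical figures): if a plan step shows Starts = 3, E-Rows = 100 and A-Rows = 30,000, compare 3 * 100 = 300 estimated rows against 30,000 actual rows. That's two orders of magnitude out, so the stats feeding that step deserve a closer look.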
To update the statistics on a table, call the GATHER_TABLE_STATS routine in DBMS_STATS:
exec dbms_stats.gather_table_stats ( null, 'colours' ) ;
select ut.table_name, ut.num_rows, utcs.column_name, utcs.num_distinct
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'COLOURS';
Oracle Database has an automatic job that gathers table stats for you. In most cases you won't need to gather stats manually. But there are a couple of cases where you may want to do this.
If you insert, update or delete more than ten percent of the rows in a table it's likely there's a big mismatch between the stats and the real data. You know the process made a significant change to the data. So it can be worth gathering stats immediately, instead of waiting for the background job.
By default the database gathers stats on tables where more than ten percent of the rows have changed. On tables with billions of rows this means 100 million rows or more need to change before the automatic job will regather stats!
You can make the database start gathering sooner by lowering the gather threshold. Do this by setting table preferences. You view and set these using DBMS_STATS.
This sets the change threshold (stale percent) to one percent for the colours table:
select dbms_stats.get_prefs ( 'STALE_PERCENT', null, 'colours' ) from dual;
exec dbms_stats.set_table_prefs ( null, 'colours', 'STALE_PERCENT', 1 );
select dbms_stats.get_prefs ( 'STALE_PERCENT', null, 'colours' ) from dual;
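Once changes to a table pass its staleness threshold, the database flags it in the STALE_STATS column of *_TAB_STATISTICS. A quick check sketch (note the flag is only refreshed after the in-memory DML monitoring information is flushed, which happens periodically or via DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO):
select uts.table_name, uts.stale_stats
from user_tab_statistics uts
where uts.table_name in ( 'BRICKS', 'COLOURS' );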
Table preferences apply to both manual and automatic stats gathering routines. When gathering stats manually you can override the preference settings.
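For example, a minimal sketch (the parameter values here are only illustrative): parameters you pass explicitly to GATHER_TABLE_STATS take precedence over the stored preferences for that run, without changing the preferences themselves:
exec dbms_stats.gather_table_stats ( null, 'colours', method_opt => 'for all columns size 1' ) ;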
Gathering stats on a table may change the optimizer's row estimates for queries against it. If you have lots of queries accessing a table, updating stats may cause the database to have to create a new plan for them all. This adds significant workload to your database!
So the database decides when to invalidate the cursors. This means you may not see plan changes immediately after gathering stats on a table.
To force the optimizer to check queries for new plans immediately, set NO_INVALIDATE to FALSE when calling gather stats:
exec dbms_stats.gather_table_stats ( null, 'colours', no_invalidate => false ) ;
When getting plans in these tutorials you may find DBMS_XPLAN reporting the old, "wrong" plan. In these cases making small changes to the query (e.g. adding column aliases) forces the optimizer to reparse the query.
Remember: invalidating cursors when gathering stats will cause the optimizer to reparse all queries on this table. Use this with caution in production environments!
Replace /* TODO */ in this code to gather stats for the bricks table:
select ut.num_rows
from user_tables ut
where ut.table_name = 'BRICKS';
exec dbms_stats.gather_table_stats ( null, ' /* TODO */' ) ;
select count (*) from bricks;
select ut.num_rows
from user_tables ut
where ut.table_name = 'BRICKS';
Check this has updated the stats by verifying that the number of rows in the table matches the number reported in USER_TABLES.
After gathering stats on BRICKS and COLOURS, the join in the first module should now have the correct plan. This reads COLOURS first, and the row estimates match the actual rows:
select /*+ gather_plan_statistics */c.colour_name, count (*) c
from bricks b
join colours c
on c.colour_rgb_value = b.colour_rgb_value
group by c.colour_name;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
For simple queries having up-to-date stats is often good enough for the optimizer to choose the best plan. But there are a couple of common problems which can still lead to incorrect estimates: data skew and correlated columns.
The row estimates for this query are out by a factor of ten:
select /*+ gather_plan_statistics */count (*)
from bricks
where weight = 1;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
This is because the optimizer assumes there is an even distribution of the values. So it estimates the number of rows matching a particular value as:
# non-null rows in the table / number of distinct values in the column
In the real world many columns display skew - unevenness in the spread of values. Assuming a uniform distribution for these leads to wildly incorrect row estimates and slow plans.
Skew comes in two forms: range and value.
There is an uneven distribution of the weight values in the bricks table:
select weight, count (*)
from bricks
group by weight
order by weight;
There are 300 rows in the bricks table and 27 unique weight values. So the optimizer estimates ( 300 / 27 ) ~ 11 rows for both of these queries. But the first returns 150 rows and the second just one!
select /*+ gather_plan_statistics */count (*) from bricks
where weight = 1;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) from bricks
where weight = 200;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
You can also have skew in the range of the values. For example, there is a gap in the BRICK_IDs from 101 to 999:
with rws as (
select level r from dual
connect by level <= 15
)
select r, count ( brick_id )
from rws
left join bricks
on ceil ( brick_id / 100 ) = r
group by r
order by r;
Again, initially the database assumes the values are evenly distributed between the min and max values for BRICK_ID. So it estimates the same number of rows for these queries, even though the first and last return 100 rows and the middle returns none:
select /*+ gather_plan_statistics */count (*) from bricks
where brick_id between 0 and 100;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) from bricks
where brick_id between 400 and 500;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) from bricks
where brick_id between 1000 and 1100;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
To improve the row estimates, you can gather histograms on these columns.
Currently there are no histograms on BRICK_ID and WEIGHT in BRICKS. You can verify this by checking the HISTOGRAM and NUM_BUCKETS columns in *_TAB_COL_STATISTICS:
select utcs.column_name, utcs.histogram, utcs.num_buckets
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'BRICKS'
and utcs.column_name in ( 'BRICK_ID', 'WEIGHT' );
To overcome these skew problems, you can gather histograms on the relevant columns. The database will do this automatically, based on column usage and data skew, when you gather stats with the default SIZE AUTO setting. The rules for when this happens are: the column must appear in a WHERE clause, join condition or GROUP BY of a query you've run (so the database has recorded usage for it), and the data in the column must be skewed.
The queries in the previous module set the necessary column usage details. So gathering stats will create histograms on the WEIGHT and BRICK_ID columns:
exec dbms_stats.gather_table_stats ( null, 'bricks', no_invalidate => false ) ;
select utcs.column_name, utcs.histogram, utcs.num_buckets
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'BRICKS'
and utcs.column_name in ( 'BRICK_ID', 'WEIGHT' );
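You can also see which columns the database has recorded usage information for by calling DBMS_STATS.REPORT_COL_USAGE (a side check; it returns a CLOB report):
select dbms_stats.report_col_usage ( user, 'BRICKS' ) from dual;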
With the histogram in place, the optimizer gets much better row estimates for the previous queries on skewed values:
select /*+ gather_plan_statistics */count (*) c
from bricks
where weight = 1;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) c
from bricks
where weight = 200;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) c
from bricks
where brick_id between 0 and 100;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
select /*+ gather_plan_statistics */count (*) c
from bricks
where brick_id between 400 and 500;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
There are four types of histograms in Oracle Database: frequency, top-frequency, hybrid and height-balanced.
When there are few different values in a column, the database will create a frequency histogram. These store the exact number of rows for each value in the column. These give the best row estimates for queries.
But there's a limit to how many different values a histogram can store exact counts for; by default a histogram has at most 254 buckets. So when a column has more distinct values than this, the database switches to a different histogram type.
From Oracle Database 12c, the optimizer will choose a hybrid or top-frequency histogram if there are many values in the column. Before this it would create height-balanced histograms. These were deprecated in 12c. You'll only see height-balanced histograms in 12c and later if you upgrade from an earlier release.
For more details on the algorithms for these, read about the histogram types in the docs.
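If you want to see the buckets themselves, you can query *_TAB_HISTOGRAMS. A quick sketch against the histogram on WEIGHT gathered above (for a frequency histogram, ENDPOINT_NUMBER is a cumulative row count):
select uth.endpoint_value weight, uth.endpoint_number cumulative_rows
from user_tab_histograms uth
where uth.table_name = 'BRICKS'
and uth.column_name = 'WEIGHT'
order by uth.endpoint_value;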
In most cases histograms help the optimizer find better plans for columns with data skew. But sometimes they can lead to unstable plans, in which case you may want to disable histograms on the columns used in the problematic queries.
The SIZE option of the METHOD_OPT parameter determines which histograms the database will create. By default, it uses:
for all columns size auto
This creates histograms based on the column usage and data skew rules described in the module Gathering Histograms.
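You can check a table's current METHOD_OPT preference the same way you checked STALE_PERCENT earlier:
select dbms_stats.get_prefs ( 'METHOD_OPT', null, 'bricks' ) from dual;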
To disable histograms for a column, set SIZE to one. Do this with the METHOD_OPT parameter of the gather stats routines. You can do this for a specific column or for ALL COLUMNS. This removes all the histograms on BRICKS:
exec dbms_stats.gather_table_stats ( null, 'bricks', method_opt => 'for all columns size 1' ) ;
select utcs.column_name, utcs.histogram, utcs.num_buckets
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'BRICKS';
If you want to ensure histograms remain disabled, it's best to do this by setting table preferences:
begin
dbms_stats.set_table_prefs (
null, 'bricks',
'method_opt', 'for all columns size 1'
);
dbms_stats.gather_table_stats ( null, 'bricks' ) ;
end;
/
select utcs.column_name, utcs.histogram
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'BRICKS';
Re-enable histograms for bricks by setting the size to auto using table preferences:
begin
dbms_stats.set_table_prefs (
null, 'bricks',
'method_opt', 'for TODO'
);
dbms_stats.gather_table_stats ( null, 'bricks' ) ;
end;
/
select utcs.column_name, utcs.histogram
from user_tables ut
join user_tab_col_statistics utcs
on ut.table_name = utcs.table_name
where ut.table_name = 'BRICKS';
The query should show a HYBRID histogram for BRICK_ID and FREQUENCY histograms for WEIGHT and COLOUR_RGB_VALUE.
The row stats for BRICKS should now be up-to-date. But the row estimates for queries searching on both COLOUR_RGB_VALUE and SHAPE are still incorrect!
select /*+ gather_plan_statistics */count (*)
from bricks
where colour_rgb_value = 'FF0000'
and shape = 'cylinder';
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
This is because the colour and shape values are correlated. All the rows for each shape have the same colour. You can verify this with this query:
select colour_rgb_value, shape, count (*)
from bricks
group by colour_rgb_value, shape;
By default the optimizer doesn't know about this relationship. You can improve the row estimates in this case by creating extended statistics.
Define extended stats using dbms_stats as shown:
select dbms_stats.create_extended_stats ( null, 'bricks', '(colour_rgb_value, shape)' )
from dual;
exec dbms_stats.gather_table_stats ( null, 'bricks', method_opt => 'for columns (colour_rgb_value, shape)', no_invalidate => false ) ;
The argument (colour_rgb_value, shape) tells the optimizer to gather stats on the combinations of these columns. You can create extended stats on any set of columns in a table. Or even on expressions such as upper ( shape ).
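You can see which extensions exist on a table by querying *_STAT_EXTENSIONS; the EXTENSION column shows the column group or expression each one covers:
select se.extension_name, se.extension
from user_stat_extensions se
where se.table_name = 'BRICKS';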
With extended stats in place, the plans should now show the correct row estimates:
select /*+ gather_plan_statistics */count (*) c
from bricks
where colour_rgb_value = 'FF0000'
and shape = 'cylinder';
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
As with histograms, the database can detect and create column groups automatically based on column usage. To do this, you need to set the preference AUTO_STAT_EXTENSIONS to ON; by default it is OFF.
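A minimal sketch of enabling this at the table level (you can also set it globally with SET_GLOBAL_PREFS; check that your release supports this preference, which I believe arrived in 12.2):
exec dbms_stats.set_table_prefs ( null, 'bricks', 'AUTO_STAT_EXTENSIONS', 'ON' ) ;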
The optimizer also struggles to get correct row estimates when you apply functions to columns. This query fetches 150 rows from the table, but the optimizer thinks it'll get far fewer:
select /*+ gather_plan_statistics */ count (*) from bricks
where weight + 2 = 3;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
You can also overcome this with extended stats. Complete this code to create extended stats on (weight + 2) for the bricks table:
select dbms_stats.create_extended_stats ( null, 'bricks', '(TODO)' )
from dual;
exec dbms_stats.gather_table_stats ( null, 'bricks', method_opt => 'for all columns size skewonly', no_invalidate => false ) ;
Re-run the query to verify that the optimizer now has better row estimates:
select /*+ gather_plan_statistics */count (*) from bricks
where weight + 2 = 3;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));
It's even better to avoid issues like this by removing functions from columns in your queries. For example, change the condition in this query to weight = ( 3 - 2 ). But this may not be possible with more complex formulas.
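For example, a rewritten version of the previous query (same result, but the optimizer can now use the column stats on WEIGHT directly):
select /*+ gather_plan_statistics */ count (*) from bricks
where weight = 3 - 2;
select *
from table(dbms_xplan.display_cursor(:LIVESQL_LAST_SQL_ID, format => 'ROWSTATS LAST'));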