exec dbms_random.seed ( 0 );
create table bricks (
brick_id not null constraint bricks_pk primary key,
colour not null,
shape not null,
weight not null,
insert_date not null,
junk default lpad ( 'x', 50, 'x' ) not null
) as
with rws as (
select level x from dual
connect by level <= 10000
)
select rownum brick_id,
case ceil ( rownum / 2500 )
when 1 then 'blue'
when 2 then 'green'
when 3 then 'yellow'
when 4 then 'red'
end colour,
case mod ( rownum, 4 )
when 0 then 'cube'
when 1 then 'cylinder'
when 2 then 'pyramid'
when 3 then 'prism'
end shape,
round ( dbms_random.value ( 1, 1000 ) ),
date'2020-01-01' + ( rownum/24 ) + ( mod ( rownum, 24 ) / 36 ) insert_date,
lpad ( chr ( mod ( rownum, 26 ) + 65 ), 50, 'x' )
from rws;
create global temporary table bricks_temp as
select * from bricks
where 1 = 0;
create index brick_weight_i on
bricks ( weight );
create index brick_shape_i on
bricks ( shape );
create index brick_colour_i on
bricks ( colour );
create index brick_insert_date_i on
bricks ( insert_date );
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;
An index is said to be useful when it locates few rows in the table. But what exactly does "few" mean?
The BRICKS table stores 10,000 rows:
select count(*) from bricks;
It has indexes on the BRICK_ID, COLOUR, WEIGHT, SHAPE, and INSERT_DATE columns:
select ui.index_name,
listagg ( uic.column_name, ',' )
within group ( order by column_position ) cols
from user_indexes ui
join user_ind_columns uic
on ui.index_name = uic.index_name
where ui.table_name = 'BRICKS'
group by ui.index_name;
There are 88 rows with weights between 1 and 10, a small fraction of the total in the table (0.88%). But the database opts for a full table scan:
select /*+ gather_plan_statistics */ count ( distinct junk ), count (*)
from bricks
where weight between 1 and 10;
select *
from table(dbms_xplan.display_cursor( :LIVESQL_LAST_SQL_ID, format => 'IOSTATS LAST'));
But this query, searching for the first thousand BRICK_IDs, returns 10% of the table's rows. Yet it uses an index!
select /*+ gather_plan_statistics */ count ( distinct junk ), count (*)
from bricks
where brick_id between 1 and 1000;
select *
from table(dbms_xplan.display_cursor( :LIVESQL_LAST_SQL_ID, format => 'IOSTATS LAST'));
So why does the optimizer use an index for one query but not the other? It's all to do with how the rows are stored in the table.
Oracle Database stores rows in blocks. You can find the block number for a row using DBMS_ROWID:
select brick_id,
dbms_rowid.rowid_block_number ( rowid ) blk#
from bricks
where mod ( brick_id, 100 ) = 0;
By default, tables are heap organized in Oracle Database. This means that the database is free to place rows wherever it wants.
But indexes are ordered data structures. New entries must go in the correct location. For example, if you insert 42 in a number column, this entry must go between the last entry of 41 or lower and the first entry of 43 or higher in any indexes on that column.
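You can sketch this sorted-insert behaviour with Python's bisect module (the key values here are illustrative, not taken from the BRICKS table):

```python
import bisect

# Hypothetical index entries on a number column, kept in sorted order.
index_keys = [7, 19, 41, 43, 88]

# bisect_left finds the position a new key must occupy to keep the order:
# 42 lands between the last entry of 41 or lower and the first of 43 or higher.
pos = bisect.bisect_left(index_keys, 42)
index_keys.insert(pos, 42)
# index_keys is now [7, 19, 41, 42, 43, 88]
```

The table itself has no such constraint; only the index must maintain this order.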
The closer the physical order of rows matches the logical order of an index, the more effective that index will be.
The smallest unit of I/O in Oracle Database is a block. So the more consecutive index entries there are that point to the same block, the more rows you can fetch in one I/O. Thus the more effective the index will be.
This query builds a matrix of BRICK_IDs and blocks, counting the number of rows in each block and brick range:
with rws as (
select ceil ( brick_id / 1000 ) id,
ceil (
dense_rank () over (
order by dbms_rowid.rowid_block_number ( rowid )
) / 10
) rid
from bricks
)
select * from rws
pivot (
count (*) for rid in (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
)
)
order by id;
You can see the values for BRICK_ID are well clustered. All the rows for a given ID range are in the same block range. And many block ranges have zero rows for a given ID range.
Contrast this with the weights. Plugging these values into the matrix shows weight values are spread throughout the whole table:
with rws as (
select ceil ( weight / 100 ) wt,
ceil (
dense_rank () over (
order by dbms_rowid.rowid_block_number ( rowid )
) / 10
) rid
from bricks
)
select * from rws
pivot (
count (*) for rid in (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
)
)
order by wt;
The rows for the first weight range are stored in every block in the table. Whereas the rows for the first ID range are in 3-4 blocks.
This means the database will have to do lots more I/O to fetch the same number of rows for a range of weights compared to a range of identifiers. Thus an index on WEIGHT will be less effective than on BRICK_ID.
When determining how effective an index is, it's the number of I/O operations that matter. Not how many rows it accesses!
So how does the optimizer know how closely the logical and physical orders match?
It estimates using the clustering factor.
The clustering factor is a measure of how closely the logical index order matches the physical table order for rows. The database calculates this when gathering stats. It does this by asking:
Is the row for the current index entry in the same block as the previous entry or a different one?
Each time consecutive index entries point to different blocks, the database increments a counter. When stats gathering is complete, this gives a value ranging from the number of blocks in the table up to the number of rows in it. The lower this value, the better clustered the rows are and the more likely the database is to use the index.
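The counting described above can be sketched in a few lines of Python. This is a simplification of what DBMS_STATS does internally, fed with a made-up list of block numbers in index order:

```python
def clustering_factor(blocks):
    """Walk the table blocks in index-key order and count each time
    the block differs from the previous entry's block."""
    cf = 0
    prev = None
    for blk in blocks:
        if blk != prev:
            cf += 1
        prev = blk
    return cf

# Perfectly clustered: consecutive index entries share table blocks.
clustered = [1, 1, 1, 2, 2, 2, 3, 3, 3]
# Scattered: every entry points to a different block than the last.
scattered = [1, 2, 3, 1, 2, 3, 1, 2, 3]

clustering_factor(clustered)  # 3 = number of blocks
clustering_factor(scattered)  # 9 = number of rows
```

With the same nine rows in three blocks, the counter ranges from the block count (best case) to the row count (worst case), exactly the bounds described above.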
You can view the clustering factor by querying the *_INDEXES views:
select index_name, clustering_factor, ut.num_rows, ut.blocks
from user_indexes ui
join user_tables ut
on ui.table_name = ut.table_name
where ui.table_name = 'BRICKS';
The index on BRICK_ID (BRICKS_PK) has perfect clustering (clustering factor <= blocks). Whereas the clustering for the weight index is almost as bad as it can be (clustering factor ~ number of rows). The other indexes all fall between these extremes.
The INSERT_DATE index has an "average" clustering factor. This means the database is unlikely to use it, even when fetching relatively few rows. For example, this query fetches less than 5% of the rows, but still opts for a full table scan:
select /*+ gather_plan_statistics */ count ( distinct junk ), count (*)
from bricks
where insert_date >= date'2020-02-01'
and insert_date < date'2020-02-21';
select *
from table(dbms_xplan.display_cursor( :LIVESQL_LAST_SQL_ID, format => 'IOSTATS LAST'));
But zooming out to months, it seems the values are almost as well clustered as BRICK_ID:
with rws as (
select trunc ( insert_date, 'mm' ) dt,
ceil (
dense_rank () over (
order by dbms_rowid.rowid_block_number ( rowid )
) / 10
) rid
from bricks
)
select * from rws
pivot (
count (*) for rid in (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
)
)
order by dt;
So why is the clustering factor for its index so high?
This is because there are several rows at the start and end of each day that hop back and forth between two blocks:
with rws as (
select brick_id,
to_char ( insert_date, 'DD MON HH24:MI' ) dt,
dbms_rowid.rowid_block_number ( rowid ) current_block,
lag ( dbms_rowid.rowid_block_number ( rowid ) ) over (
order by insert_date
) prev_block
from bricks
where insert_date >= date '2020-01-01'
and insert_date < date '2020-02-01'
)
select * from rws
where current_block <> prev_block
order by dt;
This gives the optimizer the impression that the rows are much more poorly clustered than they really are. This is a common problem for always-increasing values, such as insert dates, which are "mostly" clustered. Luckily there's a trick to help in these scenarios:
Set the TABLE_CACHED_BLOCKS preference!
The TABLE_CACHED_BLOCKS preference tells the database how many recently visited table blocks to remember when it calculates the clustering factor. By default this is one:
select dbms_stats.get_prefs ( 'table_cached_blocks', null, 'bricks' )
from dual;
This means that, with the default of one, the clustering factor increases whenever consecutive index entries point to different blocks. Upping the preference slightly, e.g. to 16, enables the database to spot these mostly clustered values. You do this by setting the table preference:
begin
dbms_stats.set_table_prefs (
null, 'bricks', 'table_cached_blocks', 16
);
end;
/
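To see why a small window helps, here's a Python sketch that extends the clustering-factor counter with a "last N blocks seen" window. It's a simplification of DBMS_STATS's internal behaviour, run against a made-up block list that hops between two blocks the way the day-boundary rows do:

```python
from collections import deque

def clustering_factor(blocks, table_cached_blocks=1):
    """Count a block change only when the entry's block is not among
    the last N table blocks seen (N = TABLE_CACHED_BLOCKS)."""
    cf = 0
    recent = deque(maxlen=table_cached_blocks)
    for blk in blocks:
        if blk not in recent:
            cf += 1
        recent.append(blk)
    return cf

# Rows at day boundaries hop back and forth between two blocks:
hopping = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]

clustering_factor(hopping, 1)   # 10: every entry counts as a change
clustering_factor(hopping, 16)  # 2: the window "remembers" both blocks
```

With a window of one, the alternating blocks look maximally scattered; with a window of 16, the database recognises that only two blocks are involved.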
You need to regather stats for this change to take effect; doing so brings the clustering factor for BRICK_INSERT_DATE_I down to the same value as the primary key index:
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
This makes the optimizer much more likely to use the index when searching by date. In fact, the optimizer will now use this index when fetching more than 5% of the rows in the table:
select /*+ gather_plan_statistics cached_blocks */ count ( distinct junk ), count (*)
from bricks
where insert_date >= date'2020-02-01'
and insert_date < date'2020-03-01';
select *
from table(dbms_xplan.display_cursor ( :LIVESQL_LAST_SQL_ID, format => 'IOSTATS LAST' ) );
Replace /* TODO */ to set the table_cached_blocks to 128 for the bricks table:
begin
dbms_stats.set_table_prefs (
null, 'bricks', 'table_cached_blocks', /* TODO */
);
end;
/
select dbms_stats.get_prefs ( 'table_cached_blocks', null, 'bricks' )
from dual;
What effect do you think this will have on the clustering factor for all the indexes? Regather the stats to find out:
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
select dbms_stats.get_prefs ( 'table_cached_blocks', null, 'bricks' )
from dual;
So why has this happened?
All the rows fit within 128 blocks. So the preference is now higher than the number of blocks in the table, which makes every index appear perfectly clustered and therefore too attractive. Avoid setting this value too high!
Set the TABLE_CACHED_BLOCKS to 8 before continuing:
begin
dbms_stats.set_table_prefs (
null, 'bricks', 'table_cached_blocks', 8
);
end;
/
Setting TABLE_CACHED_BLOCKS helps the optimizer spot mostly clustered values.
But setting this to a huge value merely tricks the optimizer. It can fool it into thinking that values such as WEIGHT, which are spread throughout the whole table, are clustered in a few blocks. The database still needs to do the same number of I/Os to fetch the rows.
To make indexes on WEIGHT more effective, you need to physically re-order rows in the table. This re-ordering must move rows with the same value for WEIGHT to the same blocks.
It's possible to do this by copying all the rows to a temporary table. Then deleting all the rows from BRICKS and re-inserting them sorted by WEIGHT:
insert into bricks_temp
select * from bricks;
delete bricks;
insert into bricks
select * from bricks_temp
order by weight;
commit;
exec dbms_stats.gather_table_stats ( null, 'bricks' );
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
This leads to a big improvement in the clustering of the weight index. But it comes at the cost of the other columns: the indexes on BRICK_ID and INSERT_DATE are now almost as poorly clustered as they can be!
Re-sorting all the rows in a table like this is impractical in most applications. Luckily Oracle Database has several data structures you can use to force physical order on data.
Deleting then re-inserting all the rows in a table is an unworkable solution for most tables. It's better to place rows where you want them on insert.
Oracle Database has several data structures that do this to a greater or lesser extent. These include:
- Index-organized tables
- Table clusters
- Partitioning
These all have various pros and cons. And adding them to an existing table requires a rebuild, making them impractical for most existing tables. There is another, more lightweight technique available from Oracle Database 12.2:
Attribute Clustering
Attribute clustering defines a physical order for rows. Using a linear order sorts the values in the same way as ORDER BY; it works through the columns left-to-right, only using the next column to resolve ties.
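This left-to-right, ties-only comparison is exactly how Python compares tuples, so linear ordering can be sketched as a tuple sort (the colour/shape rows are illustrative):

```python
rows = [("red", "cube"), ("blue", "prism"),
        ("red", "cylinder"), ("blue", "cube")]

# Linear order works like ORDER BY colour, shape: sort by the first
# column, using the second column only to break ties.
ordered = sorted(rows)  # tuples compare element-by-element, left to right
# ordered == [('blue', 'cube'), ('blue', 'prism'),
#             ('red', 'cube'), ('red', 'cylinder')]
```

Note that the second column only matters within groups sharing the same first value, so it does little for queries that filter on the trailing column alone.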
This adds linear clustering by INSERT_DATE. So the database will sort the rows by this column:
alter table bricks
add clustering
by linear order ( insert_date );
But this has no effect on existing data. Only rows added using direct-path inserts use this ordering. To apply it to current rows, move the table:
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
alter table bricks
move online;
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
Linear sorting gives perfect clustering for the leading column of the cluster, along with any other columns highly correlated with it. But it's rare that all queries on a table filter on the same column. So sorting on one column improves the performance of those queries at the cost of queries on other columns.
Usually there are several key queries, each filtering on different columns. You want to ensure all these columns can use an index if possible. To help with this it's possible to interleave the values from two or more columns.
Interleaved clustering sorts similar values together. This can improve the clustering for two or more loosely correlated columns at the same time. This interleaves the values for BRICK_ID and SHAPE:
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
alter table bricks
drop clustering ;
alter table bricks
add clustering
by interleaved order ( brick_id, shape );
alter table bricks
move online;
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
This gives near-perfect clustering for these two columns. The effect of this is to trace a Z-shape in a matrix of these values:
with rws as (
select ceil ( brick_id / 100 ) id,
shape,
dense_rank () over ( order by dbms_rowid.rowid_block_number ( rowid ) ) rid
from bricks
)
select * from rws
pivot (
max ( rid )
for id in (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
)
)
order by shape;
Notice that some RID values appear in two rows. If you trace these you'll follow a Z-like pattern through the matrix.
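Interleaved ordering is based on a space-filling Z-order curve. Oracle doesn't document the exact function it uses, but the core idea, interleaving the bits of the two column values and sorting by the result, can be sketched in Python (the function name and 4x4 grid are illustrative):

```python
def interleave_bits(x, y, bits=4):
    """Morton / Z-order encoding: weave the bits of x and y together.
    Sorting by this key traces the Z-shaped curve through the (x, y)
    grid, keeping nearby values of BOTH columns close together."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y at odd positions
    return z

# Order the cells of a 4x4 grid by their Morton key:
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: interleave_bits(*c))

# The first four cells form the top-left 2x2 square - one "Z":
# (0, 0), (1, 0), (0, 1), (1, 1)
```

Unlike the tuple sort used for linear ordering, neither column dominates: cells with similar values in either coordinate end up near each other in the sorted sequence.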
Experiment with replacing /* TODO */ with different sets of columns from BRICKS. Which combination of columns gives the lowest overall clustering factor? Which gives the highest?
alter table bricks
drop clustering ;
alter table bricks
add clustering
by interleaved order ( /* TODO */ );
alter table bricks
move online;
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;
select index_name, sum ( clustering_factor )
from user_indexes
where table_name = 'BRICKS'
group by rollup ( index_name );
But it's not the total clustering factor for your indexes that matters. You need to assess the performance needs of your queries. It may be that queries by weight or colour are part of the application search page. These are constantly executed and must be as fast as possible.
Whereas searches by date only happen once per day as part of a background job. So this query can take a little longer. Once you've found these requirements you can choose the appropriate columns to cluster on. In this example WEIGHT and COLOUR.