exec dbms_random.seed ( 0 );
create table bricks (
brick_id not null constraint bricks_pk primary key,
colour not null,
shape not null,
weight not null,
insert_date not null,
junk default lpad ( 'x', 50, 'x' ) not null
) as
with rws as (
select level x from dual
connect by level <= 10000
)
select rownum brick_id,
case ceil ( rownum / 2500 )
when 1 then 'blue'
when 2 then 'green'
when 3 then 'yellow'
when 4 then 'red'
end colour,
case mod ( rownum, 4 )
when 0 then 'cube'
when 1 then 'cylinder'
when 2 then 'pyramid'
when 3 then 'prism'
end shape,
round ( dbms_random.value ( 1, 1000 ) ),
date'2020-01-01' + ( rownum/24 ) + ( mod ( rownum, 24 ) / 36 ) insert_date,
lpad ( chr ( mod ( rownum, 26 ) + 65 ), 50, 'x' )
from rws;
create global temporary table bricks_temp as
select * from bricks
where 1 = 0;
create index brick_weight_i on
bricks ( weight );
create index brick_shape_i on
bricks ( shape );
create index brick_colour_i on
bricks ( colour );
create index brick_insert_date_i on
bricks ( insert_date );
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;
An index is said to be useful when it locates few rows in the table. But what exactly does "few" mean?
The BRICKS table stores 10,000 rows:
select count(*) from bricks;
It has indexes on the BRICK_ID, COLOUR, WEIGHT, SHAPE, and INSERT_DATE columns:
select ui.index_name,
listagg ( uic.column_name, ',' )
within group ( order by column_position ) cols
from user_indexes ui
join user_ind_columns uic
on ui.index_name = uic.index_name
where ui.table_name = 'BRICKS'
group by ui.index_name;
There are 88 rows with weights between 1 and 10, a small fraction of the total in the table (0.88%). But the database opts for a full table scan:
select /*+ gather_plan_statistics */ count ( distinct junk ), count (*)
from bricks
where weight between 1 and 10;
select *
from table(dbms_xplan.display_cursor( :LIVESQL_LAST_SQL_ID, format => 'IOSTATS LAST'));
But this query, searching for the first thousand BRICK_IDs, returns 10% of the table's rows. Yet it uses an index!
select /*+ gather_plan_statistics */ count ( distinct junk ), count (*)
from bricks
where brick_id between 1 and 1000;
select *
from table(dbms_xplan.display_cursor( :LIVESQL_LAST_SQL_ID, format => 'IOSTATS LAST'));
So why does the optimizer use an index for one query but not the other? It's all to do with how the rows are stored in the table.
Oracle Database stores rows in blocks. You can find the block number for a row using DBMS_ROWID:
select brick_id,
dbms_rowid.rowid_block_number ( rowid ) blk#
from bricks
where mod ( brick_id, 100 ) = 0;
By default, tables are heap organized in Oracle Database. This means that the database is free to place rows wherever it wants.
But indexes are ordered data structures. New entries must go in the correct location. For example, if you insert 42 in a number column, this entry must go between the last entry of 41 or lower and the first entry of 43 or higher in any indexes on that column.
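You can sketch this sorted-insert behaviour with Python's bisect module (the key values here are illustrative, not taken from the BRICKS table):

```python
import bisect

# Hypothetical index entries on a number column, kept in sorted order.
index_keys = [7, 19, 41, 43, 88]

# bisect_left finds the position a new key must occupy to keep the order:
# 42 lands between the last entry of 41 or lower and the first of 43 or higher.
pos = bisect.bisect_left(index_keys, 42)
index_keys.insert(pos, 42)
# index_keys is now [7, 19, 41, 42, 43, 88]
```

The table itself has no such constraint; only the index must maintain this order.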
The closer the physical order of rows matches the logical order of an index, the more effective that index will be.
The smallest unit of I/O in Oracle Database is a block. So the more consecutive index entries there are that point to the same block, the more rows you can fetch in one I/O. Thus the more effective the index will be.
This query builds a matrix of BRICK_IDs and blocks, counting the number of rows in each block and brick range:
with rws as (
select ceil ( brick_id / 1000 ) id,
ceil (
dense_rank () over (
order by dbms_rowid.rowid_block_number ( rowid )
) / 10
) rid
from bricks
)
select * from rws
pivot (
count (*) for rid in (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
)
)
order by id;
You can see the values for BRICK_ID are well clustered. All the rows for a given ID range are in the same block range. And many block ranges have zero rows for a given ID range.
Contrast this with the weights. Plugging these values into the matrix shows weight values are spread throughout the whole table:
with rws as (
select ceil ( weight / 100 ) wt,
ceil (
dense_rank () over (
order by dbms_rowid.rowid_block_number ( rowid )
) / 10
) rid
from bricks
)
select * from rws
pivot (
count (*) for rid in (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
)
)
order by wt;
The rows for the first weight range are stored in every block in the table. Whereas the rows for the first ID range are in 3-4 blocks.
This means the database will have to do lots more I/O to fetch the same number of rows for a range of weights compared to a range of identifiers. Thus an index on WEIGHT will be less effective than on BRICK_ID.
When determining how effective an index is, it's the number of I/O operations that matter. Not how many rows it accesses!
So how does the optimizer know how closely the logical and physical orders match?
It estimates using the clustering factor.
The clustering factor is a measure of how closely the logical index order matches the physical table order for rows. The database calculates this when gathering stats. It does this by asking:
Is the row for the current index entry in the same block as the previous entry or a different one?
Each time consecutive index entries point to different blocks, the database increments a counter. When stats gathering is complete, this gives a value ranging from the number of blocks in the table up to the number of rows in it. The lower this value, the better clustered the rows are and the more likely the database is to use the index.
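The counting described above can be sketched in a few lines of Python. This is a simplification of what DBMS_STATS does internally, fed with a made-up list of block numbers in index order:

```python
def clustering_factor(blocks):
    """Walk the table blocks in index-key order and count each time
    the block differs from the previous entry's block."""
    cf = 0
    prev = None
    for blk in blocks:
        if blk != prev:
            cf += 1
        prev = blk
    return cf

# Perfectly clustered: consecutive index entries share table blocks.
clustered = [1, 1, 1, 2, 2, 2, 3, 3, 3]
# Scattered: every entry points to a different block than the last.
scattered = [1, 2, 3, 1, 2, 3, 1, 2, 3]

clustering_factor(clustered)  # 3 = number of blocks
clustering_factor(scattered)  # 9 = number of rows
```

With the same nine rows in three blocks, the counter ranges from the block count (best case) to the row count (worst case), exactly the bounds described above.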
You can view the clustering factor by querying the *_INDEXES views:
select index_name, clustering_factor, ut.num_rows, ut.blocks
from user_indexes ui
join user_tables ut
on ui.table_name = ut.table_name
where ui.table_name = 'BRICKS';
The index on BRICK_ID (BRICKS_PK) has perfect clustering (clustering factor <= blocks). Whereas the clustering for the weight index is almost as bad as it can be (clustering factor ~ number of rows). The other indexes all fall between these extremes.
The INSERT_DATE index has an "average" clustering factor. This means the database is unlikely to use it, even when fetching relatively few rows. For example, this query fetches less than 5% of the rows, but still opts for a full table scan:
select /*+ gather_plan_statistics */ count ( distinct junk ), count (*)
from bricks
where insert_date >= date'2020-02-01'
and insert_date < date'2020-02-21';
select *
from table(dbms_xplan.display_cursor( :LIVESQL_LAST_SQL_ID, format => 'IOSTATS LAST'));
But zooming out to months, it seems the values are almost as well clustered as BRICK_ID:
with rws as (
select trunc ( insert_date, 'mm' ) dt,
ceil (
dense_rank () over (
order by dbms_rowid.rowid_block_number ( rowid )
) / 10
) rid
from bricks
)
select * from rws
pivot (
count (*) for rid in (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
)
)
order by dt;
So why is the clustering factor for its index so high?
This is because there are several rows at the start and end of each day that hop back and forth between two blocks:
with rws as (
select brick_id,
to_char ( insert_date, 'DD MON HH24:MI' ) dt,
dbms_rowid.rowid_block_number ( rowid ) current_block,
lag ( dbms_rowid.rowid_block_number ( rowid ) ) over (
order by insert_date
) prev_block
from bricks
where insert_date >= date '2020-01-01'
and insert_date < date '2020-02-01'
)
select * from rws
where current_block <> prev_block
order by dt;
This gives the optimizer the impression that the rows are much more poorly clustered than they really are. This is a common problem for always-increasing values, such as insert dates, which are "mostly" clustered. Luckily there's a trick to help in these scenarios:
Set the TABLE_CACHED_BLOCKS preference!
The TABLE_CACHED_BLOCKS preference tells the database how many recently visited table blocks to remember when it calculates the clustering factor. By default this is one:
select dbms_stats.get_prefs ( 'table_cached_blocks', null, 'bricks' )
from dual;
This means that, with the default of one, the clustering factor increases whenever consecutive index entries point to different blocks. Upping the preference slightly, e.g. to 16, enables the database to spot these mostly clustered values. You do this by setting the table preference:
begin
dbms_stats.set_table_prefs (
null, 'bricks', 'table_cached_blocks', 16
);
end;
/
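To see why a small window helps, here's a Python sketch that extends the clustering-factor counter with a "last N blocks seen" window. It's a simplification of DBMS_STATS's internal behaviour, run against a made-up block list that hops between two blocks the way the day-boundary rows do:

```python
from collections import deque

def clustering_factor(blocks, table_cached_blocks=1):
    """Count a block change only when the entry's block is not among
    the last N table blocks seen (N = TABLE_CACHED_BLOCKS)."""
    cf = 0
    recent = deque(maxlen=table_cached_blocks)
    for blk in blocks:
        if blk not in recent:
            cf += 1
        recent.append(blk)
    return cf

# Rows at day boundaries hop back and forth between two blocks:
hopping = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]

clustering_factor(hopping, 1)   # 10: every entry counts as a change
clustering_factor(hopping, 16)  # 2: the window "remembers" both blocks
```

With a window of one, the alternating blocks look maximally scattered; with a window of 16, the database recognises that only two blocks are involved.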
You need to regather stats for this change to take effect; doing so brings the clustering factor for BRICK_INSERT_DATE_I down to the same value as the primary key index:
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
This makes the optimizer much more likely to use the index when searching by date. In fact, the optimizer will now use this index when fetching more than 5% of the rows in the table:
select /*+ gather_plan_statistics cached_blocks */ count ( distinct junk ), count (*)
from bricks
where insert_date >= date'2020-02-01'
and insert_date < date'2020-03-01';
select *
from table(dbms_xplan.display_cursor ( :LIVESQL_LAST_SQL_ID, format => 'IOSTATS LAST' ) );
Replace /* TODO */ to set the table_cached_blocks to 128 for the bricks table:
begin
dbms_stats.set_table_prefs (
null, 'bricks', 'table_cached_blocks', /* TODO */
);
end;
/
select dbms_stats.get_prefs ( 'table_cached_blocks', null, 'bricks' )
from dual;
What effect do you think this will have on the clustering factor for all the indexes? Regather the stats to find out:
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
select dbms_stats.get_prefs ( 'table_cached_blocks', null, 'bricks' )
from dual;
So why has this happened?
All the rows fit within 128 blocks. So the preference is now higher than the number of blocks in the table, which makes every index appear perfectly clustered and therefore too attractive. Avoid setting this value too high!
Set the TABLE_CACHED_BLOCKS to 8 before continuing:
begin
dbms_stats.set_table_prefs (
null, 'bricks', 'table_cached_blocks', 8
);
end;
/
Setting TABLE_CACHED_BLOCKS helps the optimizer spot mostly clustered values.
But setting this to a huge value merely tricks the optimizer. It can fool it into thinking that values such as WEIGHT, which are spread throughout the whole table, are clustered in a few blocks. The database still needs to do the same number of I/Os to fetch the rows.
To make indexes on WEIGHT more effective, you need to physically re-order rows in the table. This re-ordering must move rows with the same value for WEIGHT to the same blocks.
It's possible to do this by copying all the rows to a temporary table. Then deleting all the rows from BRICKS and re-inserting them sorted by WEIGHT:
insert into bricks_temp
select * from bricks;
delete bricks;
insert into bricks
select * from bricks_temp
order by weight;
commit;
exec dbms_stats.gather_table_stats ( null, 'bricks' );
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
This leads to a big improvement in the clustering of the weight index. But it comes at the cost of the other columns: the indexes on BRICK_ID and INSERT_DATE are now almost as poorly clustered as they can be!
Re-sorting all the rows in a table like this is impractical in most applications. Luckily Oracle Database has several data structures you can use to force physical order on data.
Deleting then re-inserting all the rows in a table is an unworkable solution for most tables. It's better to place rows where you want them on insert.
Oracle Database has several data structures that do this to a greater or lesser extent. These include:
- Index-organized tables
- Table clusters
- Partitioning
These all have various pros and cons. And adding them to an existing table requires a rebuild, making them impractical for most existing tables. There is another, more lightweight technique available from Oracle Database 12.2:
Attribute Clustering
Attribute clustering defines a physical order for rows. Using a linear order sorts the values in the same way as ORDER BY; it works through the columns left-to-right, only using the next column to resolve ties.
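This left-to-right, ties-only comparison is exactly how Python compares tuples, so linear ordering can be sketched as a tuple sort (the colour/shape rows are illustrative):

```python
rows = [("red", "cube"), ("blue", "prism"),
        ("red", "cylinder"), ("blue", "cube")]

# Linear order works like ORDER BY colour, shape: sort by the first
# column, using the second column only to break ties.
ordered = sorted(rows)  # tuples compare element-by-element, left to right
# ordered == [('blue', 'cube'), ('blue', 'prism'),
#             ('red', 'cube'), ('red', 'cylinder')]
```

Note that the second column only matters within groups sharing the same first value, so it does little for queries that filter on the trailing column alone.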
This adds linear clustering by INSERT_DATE. So the database will sort the rows by this column:
alter table bricks
add clustering
by linear order ( insert_date );
But this has no effect on existing data. Only rows added using direct-path inserts use this ordering. To apply it to current rows, move the table:
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
alter table bricks
move online;
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
Linear sorting gives perfect clustering for the leading column of the cluster, along with any other columns highly correlated with it. But it's rare that all queries on a table filter on the same column. So sorting on one column improves the performance of those queries at the cost of queries on other columns.
Usually there are several key queries, each filtering on different columns. You want to ensure all these columns can use an index if possible. To help with this it's possible to interleave the values from two or more columns.
Interleaved clustering sorts similar values together. This can improve the clustering for two or more loosely correlated columns at the same time. This interleaves the values for BRICK_ID and SHAPE:
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
alter table bricks
drop clustering ;
alter table bricks
add clustering
by interleaved order ( brick_id, shape );
alter table bricks
move online;
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;
select index_name, clustering_factor
from user_indexes
where table_name = 'BRICKS';
This gives near-perfect clustering for these two columns. The effect of this is to trace a Z-shape in a matrix of these values:
with rws as (
select ceil ( brick_id / 100 ) id,
shape,
dense_rank () over ( order by dbms_rowid.rowid_block_number ( rowid ) ) rid
from bricks
)
select * from rws
pivot (
max ( rid )
for id in (
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
)
)
order by shape;
Notice that some RID values appear in two rows. If you trace these you'll follow a Z-like pattern through the matrix.
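Interleaved ordering is based on a space-filling Z-order curve. Oracle doesn't document the exact function it uses, but the core idea, interleaving the bits of the two column values and sorting by the result, can be sketched in Python (the function name and 4x4 grid are illustrative):

```python
def interleave_bits(x, y, bits=4):
    """Morton / Z-order encoding: weave the bits of x and y together.
    Sorting by this key traces the Z-shaped curve through the (x, y)
    grid, keeping nearby values of BOTH columns close together."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x at even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y at odd positions
    return z

# Order the cells of a 4x4 grid by their Morton key:
cells = sorted(((x, y) for x in range(4) for y in range(4)),
               key=lambda c: interleave_bits(*c))

# The first four cells form the top-left 2x2 square - one "Z":
# (0, 0), (1, 0), (0, 1), (1, 1)
```

Unlike the tuple sort used for linear ordering, neither column dominates: cells with similar values in either coordinate end up near each other in the sorted sequence.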
Experiment with replacing /* TODO */ with different sets of columns from BRICKS. Which combination of columns gives the lowest overall clustering factor? Which gives the highest?
alter table bricks
drop clustering ;
alter table bricks
add clustering
by interleaved order ( /* TODO */ );
alter table bricks
move online;
exec dbms_stats.gather_table_stats ( null, 'bricks' ) ;
select index_name, sum ( clustering_factor )
from user_indexes
where table_name = 'BRICKS'
group by rollup ( index_name );
But it's not the total clustering factor for your indexes that matters. You need to assess the performance needs of your queries. It may be that queries by weight or colour are part of the application search page. These are constantly executed and must be as fast as possible.
Whereas searches by date only happen once per day as part of a background job. So this query can take a little longer. Once you've found these requirements you can choose the appropriate columns to cluster on. In this example WEIGHT and COLOUR.