Approximate Query Processing - Getting Started

View All Tutorials Login to Run Tutorial

Tutorial Approximate Query Processing - Getting Started
Description Approximate Query Processing provides approximate result capabilities in SQL. The motivation behind this feature is to provide highly accurate results and order of magnitude faster than exact query equivalents.
Tags sql analytics,approximate query processing
Area SQL Analytics
Contributor Nigel Bayliss (Oracle)
Created Wednesday January 04, 2017
Modules 13

Module 1

Create Test Schema

We'll start by creating a simple table called approxt. It includes a value column we'll use for calculations and we'll be aggregating the results by US state and county.

declare
  ORA_00942 exception; pragma Exception_Init(ORA_00942, -00942);
begin
  execute immediate 'drop table approxt';
exception when ORA_00942 then null;
end;
/

create table approxt (
id number, 
volume number, 
state varchar2(5), 
county varchar2(30));

insert into approxt
        select level id, floor(dbms_random.value()*1000) volume, 'CA' state, 'COUNTY1' county
                from dual
                connect by level <= 10000;

insert into approxt
        select level+10000 id, floor(dbms_random.value()*1000) volume, 'CA' , 'COUNTY2' town
                from dual
                connect by level <= 10000;

insert into approxt
        select level+20000 id,  floor(dbms_random.value()*1000) volume, 'NM' , 'COUNTY3' 
                from dual
                connect by level <= 10000;

insert into approxt
        select level+30000 id,  floor(dbms_random.value()*1000) volume, 'DL' , 'COUNTY4' 
                from dual
                connect by level <= 10000;

insert into approxt
        select level+40000 id,  floor(dbms_random.value()*1000) volume, 'MA' , 'COUNTY5' 
                from dual
                connect by level <= 10000;

commit;

exec dbms_stats.gather_table_stats(ownname=>null,tabname=>'approxt');

Module 2

Approximate Count Distinct

Let us recap approximate count distinct, which has been available since Oracle Database 12.1.

REM Use ALTER SESSION to ensure that COUNT DISTINCT will not be converted to APPROX COUNT DISTINCT. More on this later.

alter session set approx_for_count_distinct = FALSE ;

REM Exact count(distinct volume) calculation.

select count(distinct volume) from approxt;

REM Looking at the execution execution plan, you'll see that it includes a hash group by and sort aggregate.

explain plan for select COUNT(distinct volume) from approxt;

select * from table(dbms_xplan.display);

REM This time, use the approximate calculation using the explicit function: APPROX_COUNT_DISTINCT(volume).

select APPROX_COUNT_DISTINCT(volume) from approxt;

REM Notice that the execution plan for approximate count distinct includes a "sort aggregate approx".
REM This operation is much cheaper than the equivalent exact sort operation. 
REM The performance benefits of approximate processing are, to a large extent, associated 
REM with the removal of the expensive sort operation. 
REM This saving becomes more significant for larger data volumes on columns with 
REM large numbers of distinct values.

explain plan for select APPROX_COUNT_DISTINCT(volume) from approxt;

select * from table(dbms_xplan.display);

Module 3

Zero Code Changes

Oracle Database 12.2 added the ability to replace exact calculations with approximate calculations without making code changes. This can be done using a database parameter, "approx_for_count_distinct", which can be set at the session or database-level.

REM Transparently replace exact COUNT DISTINCT with APPROX COUNT DISTINCT

alter session set approx_for_count_distinct = TRUE ;

REM Approximate count distinct, even though query is "count(distinct)". 
REM Check the result with the previous one.

select count(distinct volume) from approxt;

REM An now, "back to normal"...

alter session set approx_for_count_distinct = FALSE ;

REM Exact...

select count(distinct volume) from approxt;

Module 4

New Median Calculation

The new approximate median calculation.

REM Exact median calculation...

select median(volume) from approxt where state = 'CA';

REM The new approximate calculation for median...

select approx_median(volume) from approxt where state = 'CA';

REM Look at the execution plan for approximate median...

explain plan for select approx_median(volume) from approxt where state = 'CA';

select * from table(dbms_xplan.display);

Module 5
New Percentile Calculation
The new approximate percentile calculation.

"Approximate percentile" is an umbrella term for "approx_percentile" and "approx_median" because these functions are closely related to one another. A percentile of "0.5" is equivalent to a median. For the rest of this tutorial, the term "approximate percentile" can be taken to be referring to the percentile and median functions.
```
REM A percentile of "0.5" is equivalent to MEDIAN...

select approx_percentile(0.5) within group (order by volume) 
from approxt where state = 'CA';

select approx_percentile(0.1) within group (order by volume) 
from approxt where state = 'CA';
```

Module 6

Error Rate and Confidence Level

The percentile calculations offer error rate and confidence reporting.


REM MEDIAN error rate...
select approx_median(volume,'ERROR_RATE') from approxt where state = 'CA';
REM MEDIAN confidence...
select approx_median(volume,'CONFIDENCE') from approxt where state = 'CA';
REM PERCENTILE error rate...
select approx_percentile(0.1,'ERROR_RATE') within group (order by volume) 
from approxt where state = 'CA';
REM PERCENTILE confidence...
select approx_percentile(0.1,'CONFIDENCE') within group (order by volume) 
from approxt where state = 'CA';

Module 7

Deterministic and Non-deterministic

Two approximate algorithms are offered for the approximate percentile functions:

Non-deterministic: fastest method, whenever reproducibility is not required. This is the default method.
Deterministic: slightly slower, but delivers deterministic results.

Note that the differences will not be apparent with the small dataset in use here.

REM Approximate median, NON DETERMINISTIC

select approx_median(volume), approx_median(volume,'ERROR_RATE') 
from approxt where state = 'CA';

REM Approximate median, DETERMINISTIC

select approx_median(volume deterministic), approx_median(volume deterministic,'ERROR_RATE') 
from approxt 
where state = 'CA';

REM Approximate percentile, DETERMINISTIC

select approx_percentile(0.1 deterministic) within group (order by volume) 
from approxt where state = 'CA';

Module 8
Control Using Database Parameters
Transparent replacement of exact percentile and median calculations is possible using database parameters.

The umbrella parameter approx_for_aggregation controls both approximate count distinct and the approximate percentile functions. Setting to TRUE will cause the exact count distinct and percentile calculations to be replaced with the equivalent approximate functions.

Finer control is achieved using approx_for_count_distinct (demonstrated above) and approx_for_percentile.
The parameter approx_for_percentile can be used to finely control what exact percentile functions are replaced and whether to use the deterministic or non-deterministic algorithm. The choices for this parameter are: all deterministic, percentile_disc deterministic, percentile_cont deterministic, all, percentile_disc, percentile_cont and none.
```
REM Set "all deterministic" so that all percentile calculations are replaced with approximate, deterministic calculation...

alter session set approx_for_percentile = 'all deterministic';

REM This query uses MEDIAN, which will be replaced with the approx deterministic calculation...

select median(volume) from approxt where state = 'CA';

REM Check the execution plan...

explain plan for select median(volume) from approxt where state = 'CA';

select * from table(dbms_xplan.display);

REM Back to the default

alter session set approx_for_percentile = 'none';
```

Module 9

New Aggregation Capabilities

Approximate aggregations can be materialized and converted back to approximate count distinct and approximate percentiles. In this way, a single table can be used for queries that require approximate answers for aggregations sliced-and-diced over different dimensions.


declare
  ORA_00942 exception; pragma Exception_Init(ORA_00942, -00942);
begin
  execute immediate 'drop table agg';
exception when ORA_00942 then null;
end;
/

REM Materialize the aggregations in a table called <i>agg</i>.
REM The <i>detail</i> column stores a binary object that encodes the distinct value information
REM for interpretation by the <i>approx_percentile_agg</i> function for different aggregations.

create table agg as
select state,
       county,
       approx_percentile_detail(volume) detail 
from   approxt group by state,county;

REM Look at the non-readable content of the <i>detail</i> column...
select county, utl_raw.cast_to_raw( dbms_lob.substr( detail, 128, 1 ) ) as detail
from agg
fetch first 5 rows only;

REM Aggregation by state.
REM The table was created using an aggregation by state and 
REM county but below, the "agg" function is able to calculate
REM the approximate percentile when aggregating by state alone.

REM First, here is the exact result from the base table...
select state,median(volume) 
from approxt 
group by state;

REM This is the approximate result generated using the intermediate <i>agg</i> table. 

select state, to_approx_percentile(approx_percentile_agg(detail),0.5) 
from agg 
group by state;

REM This is the foundation for materialized view rewrite functionality (coming up next).

Module 10

New Materialized View Support

The approximate query functions can be used with materialized views for transparent query re-write.

REM Clean up...

declare
  ORA_12003 exception; pragma Exception_Init(ORA_12003, -12003);
begin
  execute immediate 'drop materialized view count_d_mview';
exception when ORA_12003 then null;
end;
/

REM Create a materialized view (MV) for APPROX PERCENTILE

create materialized view pctl_mview enable query rewrite as
select state, county, approx_percentile_detail(volume) detail
from   approxt
group by state, county;

REM Gather stats...

exec dbms_stats.gather_table_stats(ownname=>null,tabname=>'pctl_mview');

REM The following query re-writes to the materialized view, aggregating by state.
REM Note how Approx Median can use the Approx Percentile that was defined in the MV.

select state, approx_median(volume) 
from approxt
where state = 'CA'
group by state;

REM Take a look at the plan...

explain plan for 
select state, approx_median(volume) 
from approxt
where state = 'CA'
group by state;

SELECT * FROM TABLE(dbms_xplan.display);

Module 11

MV and Approx Count Distinct

Query re-write can be used with approximate count distinct too.

declare
  ORA_12003 exception; pragma Exception_Init(ORA_12003, -12003);
begin
  execute immediate 'drop materialized view count_d_mview';
exception when ORA_12003 then null;
end;
/

REM Create an MV for approximate count distinct

create materialized view count_d_mview enable query rewrite as
select state, county, approx_count_distinct_detail(volume) detail
from  approxt
group by state, county;

REM Gather stats...

exec dbms_stats.gather_table_stats(ownname=>null,tabname=>'count_d_mview');

REM Query re-writes to MV aggregating by state and county...

select state, county, approx_count_distinct(volume) 
from  approxt
where state = 'CA'
group by state,county;

REM check the execution plan...

explain plan for select state, county, approx_count_distinct(volume) 
from  approxt
where state = 'CA'
group 
by state,county;

SELECT * FROM TABLE(dbms_xplan.display);

REM Query can also re-write to the MV, aggregating by state alone...

select state, approx_count_distinct(volume) 
from approxt
where state = 'CA'
group by state;

REM Check the execution plan...

explain plan for
select state, approx_count_distinct(volume) 
from approxt
where state = 'CA'
group by state;

SELECT * FROM TABLE(dbms_xplan.display);

Module 12

Re-write Exact Queries

This example is very similar to the previous step, but this time it demonstrates how it is possible to re-write an exact query transparently to use an approximate calculation defined in a materialized view. This is a very powerful feature: it is possible to achieve huge performance benefits without changing existing queries.

declare
  ORA_12003 exception; pragma Exception_Init(ORA_12003, -12003);
begin
  execute immediate 'drop materialized view count_d_mview';
exception when ORA_12003 then null;
end;
/

REM Create an MV for approximate count distinct

create materialized view count_d_mview enable query rewrite as
select state, county, approx_count_distinct_detail(volume) detail
from  approxt
group by state, county;

REM Gather stats...

exec dbms_stats.gather_table_stats(ownname=>null,tabname=>'count_d_mview');

REM Alter the session to enable approx for count distinct...

alter session set approx_for_count_distinct = TRUE ;

REM Query re-writes to MV aggregating by state and county...

select state, county, count(distinct volume) 
from  approxt
where state = 'CA'
group by state,county;

REM check the execution plan...

explain plan for 
select state, county, count(distinct volume) 
from  approxt
where state = 'CA'
group by state,county;

SELECT * FROM TABLE(dbms_xplan.display);

REM Query can also re-write to the MV, aggregating by state alone...

select state, count(distinct volume) 
from approxt
where state = 'CA'
group by state;

REM Check the execution plan...

explain plan for
select state, count(distinct volume) 
from approxt
where state = 'CA'
group by state;

SELECT * FROM TABLE(dbms_xplan.display);

REM Back to "normal"...

alter session set approx_for_count_distinct = FALSE ;

Module 13

Incremental Refresh

Materialized views with approximate aggregations can be incrementally refreshed


REM Clean up...

declare
  ORA_12003 exception; pragma Exception_Init(ORA_12003, -12003);
begin
  execute immediate 'drop materialized view count_d_mview';
exception when ORA_12003 then null;
end;
/

declare
  ORA_12002 exception; pragma Exception_Init(ORA_12002, -12002);
begin
  execute immediate 'drop materialized view log on approxt';
exception when ORA_12002 then null;
end;
/

REM Create MV log...

create materialized view log on approxt
with sequence, rowid (id,volume,state,county)
including new values;

REM Create MV...

create materialized view count_d_mview 
refresh fast
enable query rewrite 
as
select state, county, approx_count_distinct_detail(volume) detail
from approxt
group by state, county;


REM Gather stats...

exec dbms_stats.gather_table_stats(ownname=>null,tabname=>'count_d_mview');

REM Get status of materialized view...

select mview_name, staleness 
from user_mviews where mview_name='COUNT_D_MVIEW';

REM Do some dml...


insert into approxt
        select level+50000 id,  floor(dbms_random.value()*1000) volume, 'FL' , 'COUNTY5' 
                from dual
                connect by level <= 100;
                
commit;

REM Check status of materialized view...

select mview_name, staleness 
from user_mviews where mview_name='COUNT_D_MVIEW';

REM Do an incremental refresh...

begin
   DBMS_MVIEW.REFRESH('count_d_mview', method => 'F');
end;
/

REM Check status of materialized view...

select mview_name, staleness 
from user_mviews where mview_name='COUNT_D_MVIEW';

Additional Information

Database on OTN SQL and PL/SQL Discussion forums
Oracle Database
Download Oracle Database