The basis of this module is a ticker table storing stock prices per day, in this case for the imaginary stock PLCH (PL/SQL Challenge).
select *
from ticker
order by symbol, day;
A classic example of row pattern matching is to search to V shapes in the graph. Here is an example of that taken from the Data Warehousing Guide manual. Just try to execute it, then afterwards we'll go through the parts of it.
select *
from ticker
match_recognize (
partition by symbol
order by day
measures
strt.day as start_day
, final last(down.day) as bottom_day
, final last(up.day) as end_day
, match_number() as match_num
, classifier() as var_match
all rows per match
after match skip to last up
pattern ( strt down+ up+ )
define
down as down.price < prev(down.price)
, up as up.price > prev(up.price)
)
order by symbol, match_num, day;
So we'll try different statements each showing parts of the statement above until we end with the same statement.
At first in the next statement we'll try to search for the "down" part of the V shape first, using the different parts of the MATCH_RECOGNIZE clause:
select *
from ticker
match_recognize (
partition by symbol
order by day
measures
strt.day as start_day
, running last(down.day) as down_day
, final last(down.day) as bottom_day
, match_number() as match_num
, classifier() as var_match
all rows per match
after match skip past last row
pattern ( strt down+ )
define
down as down.price < prev(down.price)
)
order by symbol, match_num, day;
We can do the exact same thing defining UP to search for "up" parts of the V shape. Just like in the statement above, we can observe the difference between RUNNING and FINAL keyword in front of the navigational functions - in this case LAST. RUNNING (which is the default if nothing is specified) evaluates on the row itself, while FINAL evaluates on the last row of the match.
MATCH_NUMBER and CLASSIFIER are very useful to add to your query while you are developing it and testing out your pattern. When you go to production it may often not be necessary to include these functions, but they are very instructive to observe how the pattern matching works.
select *
from ticker
match_recognize (
partition by symbol
order by day
measures
strt.day as start_day
, running last(up.day) as up_day
, final last(up.day) as top_day
, match_number() as match_num
, classifier() as var_match
all rows per match
after match skip past last row
pattern ( strt up+ )
define
up as up.price > prev(up.price)
)
order by symbol, match_num, day;
Just like we can use RUNNING and FINAL on the navigational functions (used on LAST in the previous statement), we can also use them on aggregate/analytic functions. Here we use a RUNNING and FINAL COUNT, where RUNNING works like analytic ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW and FINAL works like ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
select *
from ticker
match_recognize (
partition by symbol
order by day
measures
final count(up.*) as days_up
, up.price - prev(up.price) as up_day
, (up.price - strt.price) / running count(up.*) as running_up_avg
, (final last(up.price) - strt.price) / final count(up.*) as total_up_avg
, up.price - strt.price as up_total
, match_number() as match_num
, classifier() as var_match
all rows per match
after match skip to last up
pattern ( strt up+ )
define
up as up.price > prev(up.price)
)
order by symbol, match_num, day;
So we can combine the search for downs and the search for ups in a single statement, where we have two DEFINE classifications and use both in the PATTERN clause, so we specify we want any one row, followed by at least one DOWN row, followed by at least one UP row - thereby finding V shapes in the graph.
We use the FINAL LAST twice, since by specifying down.day respectively up.day, we get the last DAY value of those rows classified DOWN respectively UP. If we had only used last(day) without specifying any classification variably in front, we would get the DAY value of the last row in the match, whichever classification that row might have. In this case the last row of the match will always be an UP row, but that is not always the case. You can try running this query as is and then edit and change last(down.day) to last(day) and see what happens.
select *
from ticker
match_recognize (
partition by symbol
order by day
measures
strt.day as start_day
, final last(down.day) as bottom_day
, final last(up.day) as end_day
, match_number() as match_num
, classifier() as var_match
all rows per match
after match skip past last row
pattern ( strt down+ up+ )
define
down as down.price < prev(down.price)
, up as up.price > prev(up.price)
)
order by symbol, match_num, day;
But the above query is not quite identical to the example from the manual given at the top of the module, and it does also not quite give the same result. The query from the manual does not use AFTER MATCH SKIP PAST LAST ROW, it uses AFTER MATCH SKIP TO LAST UP. The difference is that when a match has been found, it will begin searching for a new match from the last row of the old match, not the row that follows the last row. For our V shape search, this is important, as after the last UP row can very easily be a DOWN row.
So here we'll repeat the manual query with the change to SKIP TO LAST UP, so you can observe the difference between the output of the previous query and this one, where you'll notice two rows with 2011-04-10, one as last UP row in MATCH_NUM 1, then again as STRT row in MATCH_NUM 2. And this is perfectly OK that a row can become part of more than one match, and hence be output more than once.
select *
from ticker
match_recognize (
partition by symbol
order by day
measures
strt.day as start_day
, final last(down.day) as bottom_day
, final last(up.day) as end_day
, match_number() as match_num
, classifier() as var_match
all rows per match
after match skip to last up -- Changed here
pattern ( strt down+ up+ )
define
down as down.price < prev(down.price)
, up as up.price > prev(up.price)
)
order by symbol, match_num, day;
All the examples so far have used ALL ROWS PER MATCH. But we can also specify ONE ROW PER MATCH, which in effect is somewhat like changing from using analytic functions (ALL ROWS PER MATCH) to aggregating with GROUP BY (ONE ROW PER MATCH.) So we can get an aggregated row for each match like this.
select *
from ticker
match_recognize (
partition by symbol
order by day
measures
strt.day as start_day
, strt.price as start_price
, final last(down.day) as bottom_day
, final last(down.price) as bottom_price
, final last(up.day) as end_day
, final last(up.price) as end_price
, match_number() as match_num
one row per match
after match skip to last up
pattern ( strt down+ up+ )
define
down as down.price < prev(down.price)
, up as up.price > prev(up.price)
)
order by symbol, match_num;
The use of FINAL can be nice as sort of "self-documenting", but when we use ONE ROW PER MATCH it does not matter that much, as RUNNING in effect gives us the same result. Date in measures are evaluated on the last row when we use ONE ROW PER MATCH, so using RUNNING will be evaluated on the last row and hence give the same result as FINAL.
select *
from ticker
match_recognize (
partition by symbol
order by day
measures
strt.day as start_day
, strt.price as start_price
, running last(down.day) as bottom_day
, running last(down.price) as bottom_price
, running last(up.day) as end_day
, running last(up.price) as end_price
, match_number() as match_num
one row per match
after match skip to last up
pattern ( strt down+ up+ )
define
down as down.price < prev(down.price)
, up as up.price > prev(up.price)
)
order by symbol, match_num;
But in these examples we can even simplify it further when we use ONE ROW PER MATCH, since when it evaluates on the last row, we don't even need to use LAST navigational function. We can get the same result by simply using the classification variables.
select *
from ticker
match_recognize (
partition by symbol
order by day
measures
strt.day as start_day
, strt.price as start_price
, down.day as bottom_day
, down.price as bottom_price
, up.day as end_day
, up.price as end_price
, match_number() as match_num
one row per match
after match skip to last up
pattern ( strt down+ up+ )
define
down as down.price < prev(down.price)
, up as up.price > prev(up.price)
)
order by symbol, match_num;
However, using FINAL LAST can be nice as self-documenting code, plus it has the advantage of making it quicker to switch between ONE ROW PER MATCH and ALL ROWS PER MATCH, which often is needed as a debugging device for finding out if you have problems with your pattern definitions.
You have now seen the basic elements of the MATCH_RECOGNIZE clause, exemplified with stock ticker data. The next modules will give various use-cases, some where the thinking of "patterns" make sense, some where it is more obscure how it is a "pattern" but where the MATCH_RECOGNIZE clause is a very efficient solution.
The basis of this module is a simple table of integers.
select *
from numbers
order by numval;
Our task is to group all consecutive numbers (sequences). Whenever there is a gap, a new group should be started.
To start with, in the MATCH_RECOGNIZE we order by NUMVAL, so we can define a row classifier B with a condition that the value should be precisely 1 larger than the previous value.
The pattern A B* then states, that we start with any row, then must come zero or more occurences of a row classified as B. So the first row will be A, then the match will go on matching B rows until we reach a "gap" where the row no longer will be classified B. At that point the match ends, and the following row (which will be the first in the new group) will be an A row and starts a new match. We can observe the behaviour with ALL ROWS PER MATCH.
select *
from numbers
match_recognize (
order by numval
measures
match_number() match_no
, classifier() class
, a.numval a_numval
, b.numval b_numval
, first(numval) firstval
, final last(numval) lastval
, final count(*) cnt
all rows per match
pattern ( a b* )
define
b as numval = prev(numval) + 1
)
order by numval;
Having tested the behaviour with ALL ROWS PER MATCH, it is simple to change to ONE ROW PER MATCH (which is default so we can simply omit the PER MATCH specification) to get the desired groups of sequences.
select *
from numbers
match_recognize (
order by numval
measures
first(numval) firstval
, last(numval) lastval
, count(*) cnt
pattern (
a b*
)
define
b as numval = prev(numval) + 1
)
order by firstval;
The method can easily be adapted to for example find groups of consecutive dates. Also the condition could be more complex than simply "value must be exactly one higher than the previous".
For example we can say that we can accept a gap of one integer missing from the sequence and still allow it to be a group by a simple change to the DEFINE clause. Or you can try it out with "+ 3" and see what happens.
select *
from numbers
match_recognize (
order by numval
measures
first(numval) firstval
, last(numval) lastval
, count(*) cnt
pattern (
a b*
)
define
b as numval <= prev(numval) + 2 -- Just change here
)
order by firstval;
This module has shown a basic, but useful, pattern matching technique to group rows as long as some condition is met. A pattern of A B* and a definition of B with some condition (simple or complex) using PREV to compare the row to the previous row, that is a "template" that can be applied to several use cases.
The basis of this module is a table defining periods with a START_DATE and an END_DATE. We can sort the data by START_DATE.
select *
from periods
order by start_date, end_date;
Or we can sort the data by END_DATE - later we'll see what difference that makes.
select *
from periods
order by end_date, start_date;
The START_DATE is included in the period and the END_DATE is excluded. This means if the END_DATE of one period is equal to the START_DATE of another period, the two periods are contiguous and should be merged. At first we only want to consider merging for cases where the periods match up exactly as described - overlapping periods we are not merging.
So for row pattern matching we can define classifier B to be a row where the START_DATE is exactly equal to the END_DATE of the previous row (sorted by START_DATE, END_DATE.) Then a pattern of A B* will match first any row (classified A) followed by zero or more rows where the START_DATE is equal to the previous END_DATE (classified B.)
select *
from periods
match_recognize(
order by start_date, end_date
measures
prev(end_date) prev_end_date
, match_number() match_no
, classifier() class
all rows per match
pattern( a b* )
define
b as start_date = prev(end_date)
)
order by start_date, end_date;
Looking at the output we notice that the last period in the output has a START_DATE equal to the END_DATE of one of the periods before. But because there is the overlapping period "in between", this is not matched. If this is not the result we want, we can change the ORDER BY clause in the MATCH_RECOGNIZE and observe the difference this makes in the classification and matching of the rows.
select *
from periods
match_recognize(
order by end_date, start_date
measures
prev(end_date) prev_end_date
, match_number() match_no
, classifier() class
all rows per match
pattern( a b* )
define
b as start_date = prev(end_date)
)
order by end_date, start_date;
So if this is the merging we want, we can now change to ONE ROW PER MATCH to get the "merged" periods.
select *
from periods
match_recognize(
order by end_date, start_date
measures
first(start_date) start_date, last(end_date) end_date
, first(id) start_id , last(id) end_id
, min(id) min_id , max(id) max_id
, count(*) periods
one row per match
pattern( a b* )
define
b as start_date = prev(end_date)
)
order by end_date, start_date;
If we also wish to merge overlapping, then change "=" in the DEFINE clause to "<=". (This won't handle all cases of overlap, but for this simple situation it suffices.)
select *
from periods
match_recognize(
order by end_date, start_date
measures
first(start_date) start_date, last(end_date) end_date
, first(id) start_id , last(id) end_id
, min(id) min_id , max(id) max_id
, count(*) periods
one row per match
pattern( a b* )
define
b as start_date <= prev(end_date)
);
We might have a situation where the periods can be "open-ended", meaning that a NULL means "infinity", so a NULL in START_DATE means "from the beginning of time until END_DATE", and a NULL in END_DATE means "from START_DATE until the end of time." Let us add some periods with NULL dates.
begin
insert into periods values ( 8, null, date '2014-01-01');
insert into periods values ( 9, null, date '2014-01-02');
insert into periods values (10, date '2014-02-13', null);
insert into periods values (11, date '2014-02-14', null);
commit;
end;
/
To handle this, we need to handle the NULLs in the sorting by approriate NULLS LAST and NULLS FIRST, so a NULL in END_DATE is considered "greater than" any other date, and a NULL in START_DATE is considered "less than" any other date. With that ordering in place, we can add NULL handling to the definition of classifier B to take the NULLs into consideration.
select *
from periods
match_recognize(
order by end_date nulls last, start_date nulls first
measures
first(start_date) start_date, last(end_date) end_date
, first(id) start_id , last(id) end_id
, min(id) min_id , max(id) max_id
, count(*) periods
one row per match
pattern( a b* )
define
b as start_date is null
or start_date <= prev(end_date)
or prev(end_date) is null
)
order by end_date nulls last, start_date nulls first;
You have seen that date comparison between rows in a pattern classification definition can "match up" periods for merging. Quite similarly to how we grouped sequential numbers in the previous row. This way of comparison from row to row can form the basis of several similar case solutions.
The basis of this module is a table that every midnight samples the used space in each tablespace in the database. The dates are continuous (without gaps.)
select tabspace, sampledate, gigabytes
from space
order by tabspace, sampledate;
We want to discover where there has been "spurts" of growth in a tablespace, but spurts can be two different things - either a "fast" spurt where a tablespace grew at least 25% in a single day, or a "slow" spurt where the tablespace grew between 10% and 25% every day for at least 3 days.
As the samples are at midnight, if we compare the value of the current row with the value of the next row, we get the growth for the current day. This is easily accomplished in the DEFINE clauses by using the NEXT function to read the gigabyte values of the next row (= next date). So we create two classifiers, FAST and SLOW, each with a formula that fits the growth percentage we want to search for. Then in the PATTERN clause we use the operator | which means OR, so that we search for a pattern of either one or more FAST rows or three or more consecutive SLOW rows.
select tabspace, spurttype, match_no, sampledate, gigabytes
, end_of_day_gb, end_of_day_gb - gigabytes as growth
, round(100 * (end_of_day_gb - gigabytes) / gigabytes, 1) as pct
from space
match_recognize (
partition by tabspace
order by sampledate
measures
classifier() as spurttype
, match_number() as match_no
, next(gigabytes) as end_of_day_gb
all rows per match
after match skip past last row
pattern ( fast+ | slow{3,} )
define
fast as next(fast.gigabytes) / fast.gigabytes >= 1.25
, slow as next(slow.gigabytes) / slow.gigabytes >= 1.10 and
next(slow.gigabytes) / slow.gigabytes < 1.25
)
order by tabspace, sampledate;
Having observed with ALL ROWS PER MATCH that the pattern gives us what we want, we can now transform it to ONE ROW PER MATCH to get a simpler output for our growth spurt report.
Note in particular here, that in ONE ROW PER MATCH output, aggregates and PREV/NEXT functions get their values from the last row of the match. When we use ALL ROWS PER MATCH we could get the same value for DAYS by using FINAL COUNT(*), since FINAL keyword specifies to get the last (final) value. But if we tried FINAL NEXT(gigabytes) we would have gotten an error, as the combination of FINAL and NEXT is for some reason not allowed. However, here in a ONE ROW PER MATCH output, we can achieve the result as if we had used FINAL NEXT, since the NEXT is from the last row and therefore in effect FINAL (just without FINAL keyword.)
select tabspace, spurttype, startdate, startgb, enddate, endgb, days
, (endgb - startgb) / days as avg_growth
, round(100 * ((endgb - startgb) / days) / startgb, 1) as avg_pct
from space
match_recognize (
partition by tabspace
order by sampledate
measures
classifier() as spurttype
, first(sampledate) as startdate
, first(gigabytes) as startgb
, last(sampledate) as enddate
, next(gigabytes) as endgb
, count(*) as days
one row per match
after match skip past last row
pattern ( fast+ | slow{3,} )
define
fast as next(fast.gigabytes) / fast.gigabytes >= 1.25
, slow as next(slow.gigabytes) / slow.gigabytes >= 1.10 and
next(slow.gigabytes) / slow.gigabytes < 1.25
)
order by tabspace, startdate;
In the previous query we calculated the AVG_GROWTH and AVG_PCT columns in the SELECT list. But it is also allowed to put complex expressions in the MEASURES clause, so we could just as well place the calculations in the measures like this.
select tabspace, spurttype, startdate, startgb, enddate, endgb, days
, avg_growth, avg_pct
from space
match_recognize (
partition by tabspace
order by sampledate
measures
classifier() as spurttype
, first(sampledate) as startdate
, first(gigabytes) as startgb
, last(sampledate) as enddate
, next(gigabytes) as endgb
, count(*) as days
, (next(gigabytes) - first(gigabytes)) / count(*) as avg_growth
, round(100 * ((next(gigabytes) - first(gigabytes)) / count(*)) / first(gigabytes), 1) as avg_pct
one row per match
after match skip past last row
pattern ( fast+ | slow{3,} )
define
fast as next(gigabytes) / gigabytes >= 1.25
, slow as next(slow.gigabytes) / slow.gigabytes >= 1.10 and
next(slow.gigabytes) / slow.gigabytes < 1.25
)
order by tabspace, startdate;
You have now seen how to use multiple classifiers and | (OR) in the expressions to search for multiple row patterns simultaneously, if what you are searching is multiple conditions.
The basis of this module is the standard SCOTT.EMP table, where we want a hierarchical output with a column for each employee showing how many subordinates that employee has. With a scalar subquery we can find this with a fairly simple query.
select empno
, lpad(' ', (level-1)*2) || ename as ename
, (
select count(*)
from emp sub
start with sub.mgr = emp.empno
connect by sub.mgr = prior sub.empno
) subs
from emp
start with mgr is null
connect by mgr = prior empno
order siblings by empno;
But that requires accessing the table many times - not very efficient. We can make a much more efficient query using row pattern matching.
Since row pattern matching is very dependent on something to order the data by, we need to be able to preserve the hierarchical ordering for use in the MATCH_RECOGNIZE clause. For that purpose we create a subquery factoring WITH clause where we put the hierarchical query in an inline view, so we can use ROWNUM and give it an alias RN. This WITH clause will be the basis of our row pattern queries.
with hierarchy as (
select lvl, empno, ename, rownum as rn
from (
select level as lvl, empno, ename
from emp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
)
)
select *
from hierarchy
order by rn;
If we order the data by the hierarchy (order by rn) and start at any given employee, then following employees with a higher level are subordinates - until we reach a point where the level is the same or less. In MATCH_RECOGNIZE we use DEFINE to define a HIGHER row to be one where the level is greater than the starting row of the match. Then the pattern simply is that a given row must be followed by zero or more HIGHER rows. When we reach a row with the same or lower level than the starting row, the match stops.
Counting the number of HIGHER rows then is the number of subordinates we want. But once we've reached the end of the match, normally we'd search from there for a new match - in this case instead we want to go back to row after the beginning of the match and start searching for a new match from there. This is accomplished by AFTER MATCH SKIP TO NEXT ROW.
ALL ROWS PER MATCH allows us to see the details of each match in the output, so we can observe what actually is happening here.
with hierarchy as (
select lvl, empno, ename, rownum as rn
from (
select level as lvl, empno, ename
from emp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
)
)
select match_no, rn
, empno, lpad(' ', (lvl-1)*2) || ename as ename
, rolling_cnt, subs, class
, strt_no, strt_name, high_no, high_name
from hierarchy
match_recognize (
order by rn
measures
match_number() as match_no
, classifier() as class
, strt.empno as strt_no
, strt.ename as strt_name
, higher.empno as high_no
, higher.ename as high_name
, count(higher.lvl) as rolling_cnt
, final count(higher.lvl) as subs
all rows per match
after match skip to next row
pattern ( strt higher* )
define
higher as higher.lvl > strt.lvl
)
order by match_no, rn;
Just as a visual aid, we can take the match numbers of the previous query and PIVOT it to visualize which rows are in match number 1, which are in match number 2, and so on.
with hierarchy as (
select lvl, empno, ename, rownum as rn
from (
select level as lvl, empno, ename
from emp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
)
)
select rn, empno, ename
, case "1" when 1 then 'XX' end "1"
, case "2" when 1 then 'XX' end "2"
, case "3" when 1 then 'XX' end "3"
, case "4" when 1 then 'XX' end "4"
, case "5" when 1 then 'XX' end "5"
, case "6" when 1 then 'XX' end "6"
, case "7" when 1 then 'XX' end "7"
, case "8" when 1 then 'XX' end "8"
, case "9" when 1 then 'XX' end "9"
, case "10" when 1 then 'XX' end "10"
, case "11" when 1 then 'XX' end "11"
, case "12" when 1 then 'XX' end "12"
, case "13" when 1 then 'XX' end "13"
, case "14" when 1 then 'XX' end "14"
from (
select match_no, rn, empno
, lpad(' ', (lvl-1)*2) || ename as ename
from hierarchy
match_recognize (
order by rn
measures match_number() as match_no
all rows per match
after match skip to next row
pattern ( strt higher* )
define
higher as higher.lvl > strt.lvl
)
)
pivot (
count(*)
for match_no in (
1,2,3,4,5,6,7,8,9,10,11,12,13,14
)
)
order by rn;
So if we change to using ONE ROW PER MATCH, we can get an output just like the original scalar subquery based query shown in the beginning of the module. But this query does not access the table multiple times - for larger datasets this scales much better and uses far fewer ressources.
with hierarchy as (
select lvl, empno, ename, rownum as rn
from (
select level as lvl, empno, ename
from emp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
)
)
select empno
, lpad(' ', (lvl-1)*2) || ename as ename
, subs
from hierarchy
match_recognize (
order by rn
measures
strt.rn as rn
, strt.lvl as lvl
, strt.empno as empno
, strt.ename as ename
, final count(higher.lvl) as subs
one row per match
after match skip to next row
pattern ( strt higher* )
define
higher as higher.lvl > strt.lvl
)
order by rn;
Of course even when using ONE ROW PER MATCH we can still include more data if we wish, using FIRST or LAST or aggregate functions, like demonstrated here with MAX.
with hierarchy as (
select lvl, empno, ename, rownum as rn
from (
select level as lvl, empno, ename
from emp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
)
)
select empno, lpad(' ', (lvl-1)*2) || ename as ename
, subs, high_from, high_to, high_max
from hierarchy
match_recognize (
order by rn
measures
strt.rn as rn
, strt.lvl as lvl
, strt.empno as empno
, strt.ename as ename
, count(higher.lvl) as subs
, first(higher.ename) as high_from
, last(higher.ename) as high_to
, max(higher.lvl) as high_max
one row per match
after match skip to next row
pattern ( strt higher* )
define
higher as higher.lvl > strt.lvl
)
order by rn;
It could be that we only want to output those employees that have subordinates. One way could be to put the query above in an inline view and filter on "subs > 0", but a simpler method is to simply change the PATTERN clause. If we change HIGHER* to HIGHER+, we indicate that a match only can be found if there is at least one HIGHER row, which means at least one subordinate.
with hierarchy as (
select lvl, empno, ename, rownum as rn
from (
select level as lvl, empno, ename
from emp
start with mgr is null
connect by mgr = prior empno
order siblings by empno
)
)
select empno
, lpad(' ', (lvl-1)*2) || ename as ename
, subs
from hierarchy
match_recognize (
order by rn
measures
strt.rn as rn
, strt.lvl as lvl
, strt.empno as empno
, strt.ename as ename
, count(higher.lvl) as subs
one row per match
after match skip to next row
pattern ( strt higher+ )
define
higher as higher.lvl > strt.lvl
)
order by rn;
Here you have seen a use case for AFTER MATCH SKIP TO NEXT ROW - a clause that not often is used, so can be easy to forget. But if you have a use case for it, it is highly efficient.
The basis of this module is a table with results of scientific studies of "sites" (like genomes or similar). Our task is assign the study sites to groups in such a manner, that the sum of the CNT column does not exceed 65000 - but the sites in each group has to be consecutive. In other words we need to perform a rolling sum of CNT in order of STUDY_SITE, but stop the rolling sum at the row before it exceeds 65000. Let us try a simple analytic rolling sum.
select study_site, cnt
, sum(cnt) over (
order by study_site
rows between unbounded preceding and current row
) as rolling_sum
from sites
order by study_site;
We can spot that at site 1023 the rolling sum is 73946, which is too much, so our first group (bin) will have to be from site 1001 to site 1022, and then the rolling sum will have to be restarted from site 1023 so we can begin "filling a new bin".
The analytic sum cannot be "restarted" like that, but pattern matching can do it. Here we make a definition of classifier A that a row is classified A as long as the sum of all the rows in the match so far (including the current row) is less than or equal to 65000. The pattern simply states that a match is one or more consecutive rows classified A. So the rolling sum will continue up to site 1022, then in site 1023 the row will no longer be classified as an A row, so the match stops. With AFTER MATCH SKIP PAST LAST ROW, a new match is then started at the next row after the last row of the previous match, so the new match starts at site 1023. With ALL ROWS PER MATCH we can see how the rolling sum restarts whenever needed, so the MATCH_NO identifies which "bin" we group the rows in.
select study_site, cnt, rolling_sum, match_no
from sites
match_recognize (
order by study_site
measures
sum(cnt) rolling_sum
, match_number() match_no
all rows per match
after match skip past last row
pattern ( a+ )
define a as sum(cnt) <= 65000
)
order by study_site;
Of course we could then group by MATCH_NO to get the set of "bins" and which sites go where, but we can easily get the same result by changing the query to use ONE ROW PER MATCH and then use aggregates in the MEASURES clause.
select first_site, last_site, cnt_sum
from sites
match_recognize (
order by study_site
measures
first(study_site) first_site
, last(study_site) last_site
, sum(cnt) cnt_sum
one row per match
after match skip past last row
pattern ( a+ )
define a as sum(cnt) <= 65000
)
order by first_site;
In the previous modules, formulas in DEFINE have accessed the current row, the previous row, following row, first row, or some combination thereof. In this module you have now learned that the DEFINE clause can contain aggregates too, just like MEASURES, enabling you to define a row classification based on data from a whole set of rows.
The basis of this module is a simple table of items with a certain value (indicating for example weight or volume or some other measure).
select *
from items
order by item_value desc;
We have 3 bins that we want to fill as equally as possible - meaning that the sum of the item values in each bin should be as near equal as we can get it. There is a fairly simple algorithm for this purpose:
In the following MATCH_RECOGNIZE clause we accomplish this with the use of the OR operator | to define a pattern, that matches zero or more occurences of either a BIN1 row or a BIN2 row or a BIN3 row.
So when a row is examined to find out which classifier the row matches, it will first try to find out if it matches the definition of BIN1. When testing if it matches, it is always assumed that it does, which means that in the formula used in the BIN1 definition, the COUNT and SUM will include the row we are testing to find out if it is a BIN1 row or not. So when we test COUNT(bin1.*) = 1, we actually test will this row be the first row in BIN1. Similarly, the SUM(bin1.item_value) is the sum of all BIN1 rows, which includes this row, so we need to subtract this row from the sum to get sum of all the previous rows. That way we can test if the sum of all previous rows in BIN1 is less than or equal to the smallest of the sums of BIN2 and BIN3 - if it is, then BIN1 is the bin with the smallest sum so far, so the row will be assigned to BIN1.
If the row was not assigned to BIN1, then BIN1 was not the bin with the smallest sum so far. Then we test for BIN2 if the row would be the first row in BIN2 or if the sum of previous rows in BIN2 is smaller than or equal to the BIN3 sum - if yes, then BIN2 is the the bin with the smallest sum so far, so we assign the row to BIN2.
if the row was not assigned to BIN2, then it must go in BIN3. We do not need to do any define for BIN3, as any classification used in the PATTERN but not defined in DEFINE is automatically true for any row (if the row has not matched any of the other defined classifications.)
As we defined the pattern with the * meaning zero or more occurences, the match will actually continue and include all the rows in a single match (so in a sense we are not really "searching for a pattern", as the pattern matches all rows.) But doing that allows the running COUNT and SUM in the DEFINE clauses to keep adding to the running totals all the way to the last row.
select item_name, item_value, bin#, bin1, bin2, bin3
from items
match_recognize (
order by item_value desc
measures
to_number(substr(classifier(),4)) bin#,
sum(bin1.item_value) bin1,
sum(bin2.item_value) bin2,
sum(bin3.item_value) bin3
all rows per match
pattern (
(bin1|bin2|bin3)*
)
define
bin1 as count(bin1.*) = 1
or sum(bin1.item_value) - bin1.item_value <= least(sum(bin2.item_value), sum(bin3.item_value))
, bin2 as count(bin2.*) = 1
or sum(bin2.item_value) - bin2.item_value <= sum(bin3.item_value)
);
With a little bit of simplification, the output gives us very easily the result of which items are distributed in each bin.
select bin#, item_value, bin_total, item_name
from items
match_recognize (
order by item_value desc
measures
to_number(substr(classifier(),4)) bin#
, case classifier()
when 'BIN1' then final sum(bin1.item_value)
when 'BIN2' then final sum(bin2.item_value)
when 'BIN3' then final sum(bin3.item_value)
end bin_total
all rows per match
pattern (
(bin1|bin2|bin3)*
)
define
bin1 as count(bin1.*) = 1
or sum(bin1.item_value) - bin1.item_value <= least(sum(bin2.item_value), sum(bin3.item_value))
, bin2 as count(bin2.*) = 1
or sum(bin2.item_value) - bin2.item_value <= sum(bin3.item_value)
)
order by bin#, item_value desc;
We can observe that the match actually includes all the rows if we change ALL ROWS PER MATCH to ONE ROW PER MATCH. But then we lose information which items are in each bin.
select *
from items
match_recognize (
order by item_value desc
measures
sum(bin1.item_value) bin1_total,
sum(bin2.item_value) bin2_total,
sum(bin3.item_value) bin3_total
one row per match
pattern (
(bin1|bin2|bin3)*
)
define
bin1 as count(bin1.*) = 1
or sum(bin1.item_value) - bin1.item_value <= least(sum(bin2.item_value), sum(bin3.item_value))
, bin2 as count(bin2.*) = 1
or sum(bin2.item_value) - bin2.item_value <= sum(bin3.item_value)
);
You now know that not just the patterns may be complex in row pattern matching - the DEFINE clause can also incorporate a logic in them with a higher complexity than merely comparing values. With techniques like this, you can make your row pattern matching solve very complex problems.
In total in these modules you have seen various examples of very different use cases for row pattern matching. It is a quite different way of thinking, but when you get into the habit, it can solve cases where even the use of analytic functions still make for complex queries. It can also be more efficient by being designed for working with multiple rows rather than one row at a time. When you learn the syntax you can write highly declarative analytic queries, where (if you choose good names in the DEFINE clause) you write the data patterns you search for almost in plain readable English.