MATCH_RECOGNIZE - importance of PARTITION BY and ORDER BY

View All Tutorials Login to Run Tutorial

Tutorial MATCH_RECOGNIZE - importance of PARTITION BY and ORDER BY
Description The aim of this tutorial is to explain the importance of using PARTITION BY and ORDER BY to ensure the correct results are returned and explores how and when predicates are applied.
Tags PATTERN MATCHING, MATCH_RECOGNIZE,
Area SQL Analytics
Contributor Keith Laker (Oracle)
Created Thursday October 13, 2016
Modules 8

Module 1
Overview of SQL Pattern Matching
The following tutorial provides an introduction to the various keywords within the 12c MATCH_RECOGNIZE clause.
Module 2
Reset my tutorial environment
Just in case...

If at anytime you need to restart this tutorial then simply run the following statement to reset your environment
```
DROP TABLE ticker PURGE;
```

Module 3

Create my dataset

Creating our ticker data set

First step is to setup our data table and then populate it with data

CREATE TABLE ticker (SYMBOL VARCHAR2(10), tstamp DATE, price NUMBER);

BEGIN
INSERT INTO ticker VALUES('ACME', '01-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '17-Apr-11', 8);
INSERT INTO ticker VALUES('GLOBEX', '01-Apr-11', 11);
INSERT INTO ticker VALUES('OSCORP', '20-Apr-11', 9);
INSERT INTO ticker VALUES('ACME', '02-Apr-11', 17);
INSERT INTO ticker VALUES('OSCORP', '19-Apr-11', 11);
INSERT INTO ticker VALUES('ACME', '03-Apr-11', 19);
INSERT INTO ticker VALUES('GLOBEX', '03-Apr-11', 13);
INSERT INTO ticker VALUES('OSCORP', '18-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '02-Apr-11', 12);
INSERT INTO ticker VALUES('ACME', '04-Apr-11', 21);
INSERT INTO ticker VALUES('GLOBEX', '04-Apr-11', 12);
INSERT INTO ticker VALUES('OSCORP', '17-Apr-11', 14);
INSERT INTO ticker VALUES('OSCORP', '15-Apr-11', 12);
INSERT INTO ticker VALUES('OSCORP', '14-Apr-11', 15);
INSERT INTO ticker VALUES('OSCORP', '16-Apr-11', 16);
INSERT INTO ticker VALUES('ACME', '05-Apr-11', 25);
INSERT INTO ticker VALUES('GLOBEX', '05-Apr-11', 11);
INSERT INTO ticker VALUES('ACME', '06-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '06-Apr-11', 10);
INSERT INTO ticker VALUES('ACME', '07-Apr-11', 15);
INSERT INTO ticker VALUES('GLOBEX', '07-Apr-11', 9);
INSERT INTO ticker VALUES('GLOBEX', '08-Apr-11', 8);
INSERT INTO ticker VALUES('ACME', '08-Apr-11', 20);
INSERT INTO ticker VALUES('OSCORP', '13-Apr-11', 11);
INSERT INTO ticker VALUES('ACME', '13-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '10-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '11-Apr-11', 19);
INSERT INTO ticker VALUES('ACME', '09-Apr-11', 24);
INSERT INTO ticker VALUES('GLOBEX', '09-Apr-11', 9);
INSERT INTO ticker VALUES('OSCORP', '12-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '10-Apr-11', 9);
INSERT INTO ticker VALUES('OSCORP', '11-Apr-11', 15);
INSERT INTO ticker VALUES('GLOBEX', '11-Apr-11', 9);
INSERT INTO ticker VALUES('OSCORP', '10-Apr-11', 15);
INSERT INTO ticker VALUES('ACME', '12-Apr-11', 15);
INSERT INTO ticker VALUES('GLOBEX', '12-Apr-11', 9);
INSERT INTO ticker VALUES('OSCORP', '09-Apr-11', 16);
INSERT INTO ticker VALUES('GLOBEX', '13-Apr-11', 10);
INSERT INTO ticker VALUES('OSCORP', '08-Apr-11', 20);
INSERT INTO ticker VALUES('ACME', '14-Apr-11', 25);
INSERT INTO ticker VALUES('GLOBEX', '14-Apr-11', 11);
INSERT INTO ticker VALUES('OSCORP', '07-Apr-11', 17);
INSERT INTO ticker VALUES('OSCORP', '06-Apr-11', 20);
INSERT INTO ticker VALUES('ACME', '15-Apr-11', 14);
INSERT INTO ticker VALUES('GLOBEX', '15-Apr-11', 12);
INSERT INTO ticker VALUES('ACME', '17-Apr-11', 14);
INSERT INTO ticker VALUES('ACME', '16-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '16-Apr-11', 11);
INSERT INTO ticker VALUES('OSCORP', '05-Apr-11', 17);
INSERT INTO ticker VALUES('ACME', '18-Apr-11', 24);
INSERT INTO ticker VALUES('GLOBEX', '18-Apr-11', 7);
INSERT INTO ticker VALUES('OSCORP', '04-Apr-11', 18);
INSERT INTO ticker VALUES('OSCORP', '03-Apr-11', 19);
INSERT INTO ticker VALUES('ACME', '19-Apr-11', 23);
INSERT INTO ticker VALUES('GLOBEX', '19-Apr-11', 5);
INSERT INTO ticker VALUES('OSCORP', '02-Apr-11', 22);
INSERT INTO ticker VALUES('ACME', '20-Apr-11', 22);
INSERT INTO ticker VALUES('GLOBEX', '20-Apr-11', 3);
INSERT INTO ticker VALUES('OSCORP', '01-Apr-11', 22);

commit;

END;

Module 4
Viewing my data
Quick review of our ticker data

Now let's check to see how many rows are in our dataset
```
SELECT count(*) FROM ticker;
SELECT symbol, min(tstamp), max(tstamp), count(*) FROM ticker GROUP BY symbol;
```
You should have 60 rows of data spread across three symbols (ACME, GLOBEX, OSCORP) with 20 rows of data for each ticker symbol. Our ticker data for each symbol starts on April 1 and ends on April 20.

You can view the full data set using the following code:
```
SELECT * FROM ticker ORDER BY symbol, tstamp;
```
Compared to previous MATCH_RECOGNIZE tutorials, here there is a small change to the data for ticker symbol ACME. The price on Apr-16 has been changed to 14 so that there are now three consecutive rows (15-Apr, 16-Apr and 17-Apr) with the same value (14) for price.
```
SELECT * FROM ticker 
WHERE symbol = 'ACME' 
ORDER BY tstamp;
```
Module 5
Correctly bucketing your data
The following pattern matching query just uses the mandatory order by clause and is using an always true event (price=price). This means that every row matches the pattern
```
SELECT * FROM ticker 
MATCH_RECOGNIZE (
 ORDER BY tstamp 
 MEASURES 
  e.symbol as symbol,
  e.tstamp as tstamp,
  e.price as price 
 ONE ROW PER MATCH 
 PATTERN (e) 
 DEFINE 
   e AS price=price);
```
Why is the data in a random order?

You will see that the data is in random order when it comes to the stock ticker because there is more than one record per tstamp. Now we do it the right way and add the PARTITION BY clause
```
SELECT * FROM ticker 
MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp 
 MEASURES
  e.tstamp as tstamp,
  e.price as price 
 ONE ROW PER MATCH 
 PATTERN (e) 
 DEFINE 
   e AS price=price);
```
Module 6
Missing rows
In this example we have enhanced our pattern by slightly changing our always-true event: this time we are going to test if the current tstamp is greater than the previous timestamp, which of course if will be!
```
SELECT * FROM ticker 
MATCH_RECOGNIZE (
  ORDER BY tstamp 
  ALL ROWS PER MATCH 
  PATTERN (e+) 
  DEFINE 
    e AS tstamp > prev(tstamp));
```
Something is incorrect with the output because we are only getting 19 rows returned even though we are using the keywords ALL ROWS PER MATCH. Why is this?

The answer is because like earlier, we have multiple stock ticker entries per tstamp
```
SELECT * 
FROM ticker 
MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp 
 ALL ROWS PER MATCH 
 PATTERN (e+) 
 DEFINE 
   e AS tstamp > prev(tstamp));
```
But we still seem to have a incorrect result because there are only 57 rows output and not the 60 rows that we were expecting. Why is this?

This is because the very first record in each data stream has no previous timestamp to compare with, therefore, the first record is excluded from our output!

This can easily be corrected by incorporating a dummy event
```
SELECT *
FROM ticker 
 MATCH_RECOGNIZE (
  PARTITION BY symbol ORDER BY tstamp 
  ALL ROWS PER MATCH 
  PATTERN (strt e+) 
  DEFINE 
   e AS tstamp > prev(tstamp));
```
Module 7
Same result is not same result
Now let's our simple always-true pattern again
```
SELECT
 esymbol,
 etstamp,
 eprice 
FROM ticker 
MATCH_RECOGNIZE (
 ORDER BY tstamp 
 MEASURES 
    e.symbol as esymbol, 
    e.tstamp as etstamp, 
    e.price as eprice 
 ALL ROWS PER MATCH 
 PATTERN (e) 
 DEFINE 
   e AS price=price)
WHERE symbol='ACME';
```
Notice that we get 20 rows back. Now let's run the almost same query again but first let's make a subtle alteration to the pattern definition of the always true event:
```
SELECT
 esymbol,
 etstamp,
 eprice 
FROM ticker 
MATCH_RECOGNIZE (
 ORDER BY tstamp 
 MEASURES 
    e.symbol as esymbol, 
    e.tstamp as etstamp, 
    e.price as eprice 
 ALL ROWS PER MATCH 
 PATTERN (e+)
 DEFINE 
  e AS price=price)
WHERE symbol='ACME';
```
Is this the same result? We have still returned 20 rows, so it must be the same result, correct?

The answer to the question lies in the match of individual records to a pattern. Let's run the same statements again, using some built-in functions:
- classifier()
- match_number()
Please note that if you want more information about these two built-in measures there is a separate tutorial available.

Now, Let's run the last example again:
```
SELECT
 esymbol,
 etstamp,
 eprice,
 event,
 match 
FROM ticker 
MATCH_RECOGNIZE (
 ORDER BY tstamp 
 MEASURES 
    e.symbol as esymbol, 
    e.tstamp as etstamp, 
    e.price as eprice, 
    classifier() event, 
    match_number() match  
 ALL ROWS PER MATCH 
 PATTERN (e+) 
 DEFINE 
   e AS price=price)
WHERE symbol='ACME';
```
Now let's run the first example again:
```
SELECT
 esymbol,
 etstamp,
 eprice,
 event,
 match 
FROM ticker 
MATCH_RECOGNIZE (
 ORDER BY tstamp 
 MEASURES 
    e.symbol as esymbol, 
    e.tstamp as etstamp, 
    e.price as eprice, 
    classifier() event, 
    match_number() match  
 ALL ROWS PER MATCH 
 PATTERN (e) 
 DEFINE 
   e AS price=price)
WHERE symbol='ACME';
```
You can immediately see that the two results are not in fact the same result!

The e+ pattern searches for one or more instances of our price = price pattern which results in a single match across all 20 rows. The first example, e, searches for a single occurrence of price = price and it is able to match 20 individual occurrences where price equals price.

BUT, the number for each match is not contiguous. Why is that? Why does our match number column not start at 1?

The answer is because the match number function is evaluated prior to filtering the resultset to using the predicate WHERE symbol='ACME'.

How do we solve this? There are two options: 1) we can reintroduce the PARTITION BY clause - did you spot that it was removed! or 2) we can move the predicate inside the pattern, as shown here:
```
SELECT
 esymbol,
 etstamp,
 eprice,
 event,
 match 
FROM ticker 
MATCH_RECOGNIZE (
 PARTITION BY symbol ORDER BY tstamp 
 MEASURES 
    e.symbol as esymbol, 
    e.tstamp as etstamp, 
    e.price as eprice,
    classifier() event, 
    match_number() match  
 ALL ROWS PER MATCH 
 PATTERN (e) 
 DEFINE 
  e AS price=price and symbol='ACME');
```
Now we have a sequential number for our match number column.
Module 8
Summary
Hopefully this tutorial has helped you to understand the importance of using the keywords PARTITION BY and ORDER BY within your MATCH_RECOGNIZE statements to ensure that the correct results are returned.

The last example illustrated how predicates are applied when used in conjunction with MATCH_RECOGNIZE statement. Essentially, they are applied after the pattern matching process has been completed. That is to day that the pattern matching process is run against the complete data set and then any predicates that are part of the WHERE clause are finally applied.

Additional Information

Database on OTN SQL and PL/SQL Discussion forums
Oracle Database
Download Oracle Database

MATCH_RECOGNIZE - importance of PARTITION BY and ORDER BY

Overview of SQL Pattern Matching

Reset my tutorial environment

Just in case...

Create my dataset

Creating our ticker data set

Viewing my data

Quick review of our ticker data

Correctly bucketing your data

Why is the data in a random order?

Missing rows

Same result is not same result

Summary

Links

Additional Information