This simple tutorial provides a nice gentle introduction to the various keywords within the 12c MATCH_RECOGNIZE clause. It's broken into three parts
If you have any questions about this tutorial or the MATCH_RECOGNIZE clause then please contact me via email: keith.laker@oracle.com.Part 1 covers setting up the ticker dataset
Part 2 covers running your first simple pattern matching query. This focuses on the PARTITION BY and ORDER BY clauses showing what happens if you omit these optional keywords.
Part 3 covers running a more complex pattern matching query where you are going to search for V-shaped patters within the ticker data. This section looks at the other keywords that make up the MATCH_RECOGNIZE clause such as PATTER, DEFINE, SKIP TO..., ALL ROWS, ONE ROW etc.
First step is to setup our data table and then populate it with data
CREATE TABLE ticker (SYMBOL VARCHAR2(10), tstamp DATE, price NUMBER);
BEGIN
INSERT INTO ticker VALUES('ACME', '01-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '17-Apr-11', 8);
INSERT INTO ticker VALUES('GLOBEX', '01-Apr-11', 11);
INSERT INTO ticker VALUES('OSCORP', '20-Apr-11', 9);
INSERT INTO ticker VALUES('ACME', '02-Apr-11', 17);
INSERT INTO ticker VALUES('OSCORP', '19-Apr-11', 11);
INSERT INTO ticker VALUES('ACME', '03-Apr-11', 19);
INSERT INTO ticker VALUES('GLOBEX', '03-Apr-11', 13);
INSERT INTO ticker VALUES('OSCORP', '18-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '02-Apr-11', 12);
INSERT INTO ticker VALUES('ACME', '04-Apr-11', 21);
INSERT INTO ticker VALUES('GLOBEX', '04-Apr-11', 12);
INSERT INTO ticker VALUES('OSCORP', '17-Apr-11', 14);
INSERT INTO ticker VALUES('OSCORP', '15-Apr-11', 12);
INSERT INTO ticker VALUES('OSCORP', '14-Apr-11', 15);
INSERT INTO ticker VALUES('OSCORP', '16-Apr-11', 16);
INSERT INTO ticker VALUES('ACME', '05-Apr-11', 25);
INSERT INTO ticker VALUES('GLOBEX', '05-Apr-11', 11);
INSERT INTO ticker VALUES('ACME', '06-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '06-Apr-11', 10);
INSERT INTO ticker VALUES('ACME', '07-Apr-11', 15);
INSERT INTO ticker VALUES('GLOBEX', '07-Apr-11', 9);
INSERT INTO ticker VALUES('GLOBEX', '08-Apr-11', 8);
INSERT INTO ticker VALUES('ACME', '08-Apr-11', 20);
INSERT INTO ticker VALUES('OSCORP', '13-Apr-11', 11);
INSERT INTO ticker VALUES('ACME', '13-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '10-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '11-Apr-11', 19);
INSERT INTO ticker VALUES('ACME', '09-Apr-11', 24);
INSERT INTO ticker VALUES('GLOBEX', '09-Apr-11', 9);
INSERT INTO ticker VALUES('OSCORP', '12-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '10-Apr-11', 9);
INSERT INTO ticker VALUES('OSCORP', '11-Apr-11', 15);
INSERT INTO ticker VALUES('GLOBEX', '11-Apr-11', 9);
INSERT INTO ticker VALUES('OSCORP', '10-Apr-11', 15);
INSERT INTO ticker VALUES('ACME', '12-Apr-11', 15);
INSERT INTO ticker VALUES('GLOBEX', '12-Apr-11', 9);
INSERT INTO ticker VALUES('OSCORP', '09-Apr-11', 16);
INSERT INTO ticker VALUES('GLOBEX', '13-Apr-11', 10);
INSERT INTO ticker VALUES('OSCORP', '08-Apr-11', 20);
INSERT INTO ticker VALUES('ACME', '14-Apr-11', 25);
INSERT INTO ticker VALUES('GLOBEX', '14-Apr-11', 11);
INSERT INTO ticker VALUES('OSCORP', '07-Apr-11', 17);
INSERT INTO ticker VALUES('OSCORP', '06-Apr-11', 20);
INSERT INTO ticker VALUES('ACME', '15-Apr-11', 14);
INSERT INTO ticker VALUES('GLOBEX', '15-Apr-11', 12);
INSERT INTO ticker VALUES('ACME', '17-Apr-11', 14);
INSERT INTO ticker VALUES('ACME', '16-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '16-Apr-11', 11);
INSERT INTO ticker VALUES('OSCORP', '05-Apr-11', 17);
INSERT INTO ticker VALUES('ACME', '18-Apr-11', 24);
INSERT INTO ticker VALUES('GLOBEX', '18-Apr-11', 7);
INSERT INTO ticker VALUES('OSCORP', '04-Apr-11', 18);
INSERT INTO ticker VALUES('OSCORP', '03-Apr-11', 19);
INSERT INTO ticker VALUES('ACME', '19-Apr-11', 23);
INSERT INTO ticker VALUES('GLOBEX', '19-Apr-11', 5);
INSERT INTO ticker VALUES('OSCORP', '02-Apr-11', 22);
INSERT INTO ticker VALUES('ACME', '20-Apr-11', 22);
INSERT INTO ticker VALUES('GLOBEX', '20-Apr-11', 3);
INSERT INTO ticker VALUES('OSCORP', '01-Apr-11', 22);
commit;
END;
Now let's check to see how many rows are in our dataset
SELECT count(*) FROM ticker;
SELECT symbol, min(tstamp), max(tstamp), count(*) FROM ticker GROUP BY symbol;
You should have 60 rows of data spread across three symbols (ACME, GLOBEX, OSCORP) with 20 rows of data for each ticker symbol. Our ticker data for each symbol starts on April 1 and ends on April 20.
For this data set we would expect to be searching for patterns within each stock symbol and to correctly find a "pattern" we will need to have the data ordered by the timestamp for each trade. Hopefully, that is an obvious statement.
Here is your first MATCH_RECOGNIZE statement. It's relatively simple and we are going to built on it during this tutorial. Firstly, notice that we have used the ORDER BY clause to sort the data by the timestamp for each trade.
Let's just run the code and view the results
SELECT * FROM ticker
MATCH_RECOGNIZE (
ORDER BY tstamp
MEASURES e.tstamp as st, e.symbol as s, e.price as p
ONE ROW PER MATCH
PATTERN (e)
DEFINE
e AS price=price);
First observation: you should have noticed that the data is in random order when it comes to the stock ticker because there is more than one record per tstamp. This is why the results appear in a random order in terms of the stock ticker column.
What we need to do is group the data by each stock ticker symbol so we need to use the PARTITION BY clause
Let's amend our example and now we do it the right way by adding a PARTITION BY clause:
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES e.tstamp as st, e.symbol as s, e.price as p
ONE ROW PER MATCH
PATTERN (e)
DEFINE
e AS price=price);
There is more information about PARTITION BY and ORDER BY in section 14 of this tutorial
Now let's change our simple pattern and test to see if the timestamp in the current row is greater than the timestamp in the previous row. Which it obviously will be because we are ordering by the timestamp column. The new pattern is described within the DEFINE clause.
How many rows will be returned when you run this query?
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
ALL ROWS PER MATCH
PATTERN (e+)
DEFINE
e AS tstamp > prev(tstamp));
You should be expecting to see 60 rows returned because that is how many rows are in our source table. So why do we only get 57 rows returned and not 60?
This answer is because the very first record in each partition has no previous timestamp to compare it with, therefore, the first record in each partition is not matched. As we have three partitions (3 symbols: ACME, GLOBEX, OSCORP) that means we lose three rows.
To correct the "missing first row" issue we need to add an always true event to our PATTERN clause, as shown below.
This query will now return all 60 rows from our dataset as expected because the timestamp for each row is greater than the timestamp in the previous row.
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
ALL ROWS PER MATCH
PATTERN (strt e+)
DEFINE
e AS tstamp > prev(tstamp));
How can we check the results from our MATCH_RECOGNIZE query? We can use two of the built-in measures to help us make sure that our pattern is being correctly applied to our data set. These two measures are CLASSIFIER and MATCH_NUMBER().
Let's run the code below and focus on the results just for symbol ACME by adding a WHERE clause.
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES classifier() event,
match_number() match
ALL ROWS PER MATCH
PATTERN (strt e+)
DEFINE
e AS tstamp > prev(tstamp))
WHERE symbol='ACME';
What you should see is that the first row (01-Apr-11) is mapped to our always-true event "STRT" and all the other rows in our partition are mapped to our timestamp-based event "E".
Now let's look at what happens to our MATCH_NUMBER() function within each PARTITION by removing the WHERE clause.
Let's run the code below:
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES e.tstamp as t,
e.symbol as s,
e.price as p,
classifier() event,
match_number() match
ALL ROWS PER MATCH
PATTERN (strt e+)
DEFINE
e AS tstamp > prev(tstamp));
What you should see is that the first row (01-Apr-11) in each partition is mapped to the "STRT" event and the MATCH_NUMBER function starts at 1 within each partition.
Now let's build a slightly more complex example where we search our dataset for V-shaped trading patterns, i.e. where the price goes down and then the price goes up.
Below we have changed the PATTERN and DEFINE clauses to use an always-true event (so we can capture the first row in each partition) as before along with two pattern variables that will test the value of the price in the current row against the price in the previous row
Let's run the code:
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES
FIRST(DOWN.tstamp) AS start_ts,
LAST(DOWN.tstamp) AS bottom_ts,
LAST(UP.tstamp) AS end_ts
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (STRT DOWN+ UP+)
DEFINE
DOWN AS price < PREV(price),
UP AS price > PREV(price)
) MR
ORDER BY MR.symbol, MR.start_ts;
You should see 6 rows returned
The following sections will explain each of the main keywords in more detail.
If you have used the analytic SQL functions (windows) then you will already be familiar with how these keywords are used and MATCH_RECOGNIZE simply refuses these same concepts.
The keywords PARTITION BY divides the data from the Ticker table into logical groups where, in this case, each group contains one stock symbol.
This clause is optional but it is highly likely that every MATCH_RECOGNIZE clause that you write will contain a PARTITION BY clause
The ORDER BY clause orders the data within each logical group. This clause is also optional but as before, it is highly likely that every MATCH_RECOGNIZE clause that you write will contain a ORDER BY clause
The pattern matching clause enables you to create expressions useful in a wide range of analyses. These are presented as columns in the output table by using the MEASURES clause. This clause defines row pattern measure columns, whose value is computed by evaluating an expression related to a particular match.
In our example we are defining three measures:
The start_ts uses the FIRST() function to extract the first timestamp value from the rows matched to the DOWN pattern variable. The bottom_ts and end_ts measures use the LAST() function to extract the last timestamp values from the rows matched to the DOWN and UP pattern variables.
You will sometimes want summary data about the matches and other times need more detailed information. You can control this by using one of following clauses...
ONE ROW PER MATCH means that for every match found, there will be one row of output.
The alternative is to use the ALL ROWS PER MATCH keywords which provides a more verbose report and opens up the opportunity to use some of the built-in MEASURES that can be used to help you debug your pattern matching process
Compare the output of the two statements:
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES
FIRST(down.tstamp) AS start_ts,
LAST(DOWN.tstamp) AS bottom_ts,
LAST(UP.tstamp) AS end_ts
ALL ROWS PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (STRT DOWN+ UP+)
DEFINE
DOWN AS price < PREV(price),
UP AS price > PREV(price)
) MR
ORDER BY MR.symbol, MR.start_ts;
Notice that when we run the code below the number of columns returned is reduced as well as the number of rows
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES
FIRST(down.tstamp) AS start_ts,
LAST(DOWN.tstamp) AS bottom_ts,
LAST(UP.tstamp) AS end_ts
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (STRT DOWN+ UP+)
DEFINE
DOWN AS price < PREV(price),
UP AS price > PREV(price)
) MR
ORDER BY MR.symbol, MR.start_ts;
The AFTER MATCH SKIP PAST LAST ROW clause means that whenever we find a match, we restart the search process for the next match at the next row after the last row mapped to the UP pattern variable. This is the default behavior.
We have a lot of control over where to restart the matching process. For this example what we want to explore is the impact on the results if we change the default behavior. Lets try AFTER MATCH SKIP TO LAST UP and see what happens
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
PARTITION BY symbol
ORDER BY tstamp
MEASURES
FIRST(down.tstamp) AS start_ts,
LAST(DOWN.tstamp) AS bottom_ts,
LAST(UP.tstamp) AS end_ts
ONE ROW PER MATCH
AFTER MATCH SKIP TO LAST UP
PATTERN (STRT DOWN+ UP+)
DEFINE
DOWN AS price < PREV(price),
UP AS price > PREV(price)
) MR
ORDER BY MR.symbol, MR.start_ts;
You will now find that more rows are returned because we are starting our search for the next match at a point where we a guaranteed to find that the price of next row is lower than the price of the previous row. The STRT variable is mapped to the last row mapped to up - in effect we have a row being double mapped.
Hopefully, it is obvious that this is an important clause and you will need to think very carefully about where you to restart searching for your patterns.
The PATTERN clause defines a regular expression in similar way to the existing Oracle regular expression functions. However, within this clause we have create a completely new a highly expressive way to search for patterns.
In the above example, the PATTERN clause is defined as:
PATTERN (STRT DOWN+ UP+)
this syntax indicates that the pattern we are searching for has three pattern variables: STRT, DOWN, and UP. The plus sign (+) after DOWN and UP means that at least one row must be mapped to each of them.
The plus sign (+) is known as a "quantifier" and there is a large library of quantifiers available to help you define your own patterns. Quantifiers define the number of iterations accepted for a match. Use the link below to access the documentation to learn more about quantifiers.
The DEFINE keyword gives us the conditions that must be met for a row to map to your row pattern variables STRT, DOWN, and UP. Because there is no condition for STRT, any row can be mapped to STRT. It is in effect and "always-true" event.
Why have a pattern variable with no condition?
We are using the STRT variable as a starting point for testing and anchoring our matching process. It ensures that we capture and test our pattern beginning at the very first row within each of our partition.
Both DOWN and UP take advantage of the PREV() function, which lets them compare the price in the current row to the price in the prior row. DOWN is matched when a row has a lower price than the row that preceded it, so it defines the downward (left) leg of our V-shape.
A row can be mapped to UP if the row has a higher price than the row that preceded it.