First step is to setup our data table and then populate it with data
CREATE TABLE ticker (SYMBOL VARCHAR2(10), tstamp DATE, price NUMBER);
BEGIN
INSERT INTO ticker VALUES('ACME', '01-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '17-Apr-11', 8);
INSERT INTO ticker VALUES('GLOBEX', '01-Apr-11', 11);
INSERT INTO ticker VALUES('OSCORP', '20-Apr-11', 9);
INSERT INTO ticker VALUES('ACME', '02-Apr-11', 17);
INSERT INTO ticker VALUES('OSCORP', '19-Apr-11', 11);
INSERT INTO ticker VALUES('ACME', '03-Apr-11', 19);
INSERT INTO ticker VALUES('GLOBEX', '03-Apr-11', 13);
INSERT INTO ticker VALUES('OSCORP', '18-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '02-Apr-11', 12);
INSERT INTO ticker VALUES('ACME', '04-Apr-11', 21);
INSERT INTO ticker VALUES('GLOBEX', '04-Apr-11', 12);
INSERT INTO ticker VALUES('OSCORP', '17-Apr-11', 14);
INSERT INTO ticker VALUES('OSCORP', '15-Apr-11', 12);
INSERT INTO ticker VALUES('OSCORP', '14-Apr-11', 15);
INSERT INTO ticker VALUES('OSCORP', '16-Apr-11', 16);
INSERT INTO ticker VALUES('ACME', '05-Apr-11', 25);
INSERT INTO ticker VALUES('GLOBEX', '05-Apr-11', 11);
INSERT INTO ticker VALUES('ACME', '06-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '06-Apr-11', 10);
INSERT INTO ticker VALUES('ACME', '07-Apr-11', 15);
INSERT INTO ticker VALUES('GLOBEX', '07-Apr-11', 9);
INSERT INTO ticker VALUES('GLOBEX', '08-Apr-11', 8);
INSERT INTO ticker VALUES('ACME', '08-Apr-11', 20);
INSERT INTO ticker VALUES('OSCORP', '13-Apr-11', 11);
INSERT INTO ticker VALUES('ACME', '13-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '10-Apr-11', 25);
INSERT INTO ticker VALUES('ACME', '11-Apr-11', 19);
INSERT INTO ticker VALUES('ACME', '09-Apr-11', 24);
INSERT INTO ticker VALUES('GLOBEX', '09-Apr-11', 9);
INSERT INTO ticker VALUES('OSCORP', '12-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '10-Apr-11', 9);
INSERT INTO ticker VALUES('OSCORP', '11-Apr-11', 15);
INSERT INTO ticker VALUES('GLOBEX', '11-Apr-11', 9);
INSERT INTO ticker VALUES('OSCORP', '10-Apr-11', 15);
INSERT INTO ticker VALUES('ACME', '12-Apr-11', 15);
INSERT INTO ticker VALUES('GLOBEX', '12-Apr-11', 9);
INSERT INTO ticker VALUES('OSCORP', '09-Apr-11', 16);
INSERT INTO ticker VALUES('GLOBEX', '13-Apr-11', 10);
INSERT INTO ticker VALUES('OSCORP', '08-Apr-11', 20);
INSERT INTO ticker VALUES('ACME', '14-Apr-11', 25);
INSERT INTO ticker VALUES('GLOBEX', '14-Apr-11', 11);
INSERT INTO ticker VALUES('OSCORP', '07-Apr-11', 17);
INSERT INTO ticker VALUES('OSCORP', '06-Apr-11', 20);
INSERT INTO ticker VALUES('ACME', '15-Apr-11', 14);
INSERT INTO ticker VALUES('GLOBEX', '15-Apr-11', 12);
INSERT INTO ticker VALUES('ACME', '17-Apr-11', 14);
INSERT INTO ticker VALUES('ACME', '16-Apr-11', 12);
INSERT INTO ticker VALUES('GLOBEX', '16-Apr-11', 11);
INSERT INTO ticker VALUES('OSCORP', '05-Apr-11', 17);
INSERT INTO ticker VALUES('ACME', '18-Apr-11', 24);
INSERT INTO ticker VALUES('GLOBEX', '18-Apr-11', 7);
INSERT INTO ticker VALUES('OSCORP', '04-Apr-11', 18);
INSERT INTO ticker VALUES('OSCORP', '03-Apr-11', 19);
INSERT INTO ticker VALUES('ACME', '19-Apr-11', 23);
INSERT INTO ticker VALUES('GLOBEX', '19-Apr-11', 5);
INSERT INTO ticker VALUES('OSCORP', '02-Apr-11', 22);
INSERT INTO ticker VALUES('ACME', '20-Apr-11', 22);
INSERT INTO ticker VALUES('GLOBEX', '20-Apr-11', 3);
INSERT INTO ticker VALUES('OSCORP', '01-Apr-11', 22);
commit;
END;
Now let's check to see how many rows are in our dataset
SELECT count(*) FROM ticker;
SELECT symbol, min(tstamp), max(tstamp), count(*) FROM ticker GROUP BY symbol;
You should have 60 rows of data spread across three symbols (ACME, GLOBEX, OSCORP) with 20 rows of data for each ticker symbol. Our ticker data for each symbol starts on April 1 and ends on April 20.
You can view the full data set using the following code:
SELECT * FROM ticker ORDER BY symbol, tstamp;
Compared to previous MATCH_RECOGNIZE tutorials, here there is a small change to the data for ticker symbol ACME. The price on Apr-16 has been changed to 14 so that there are now three consecutive rows (15-Apr, 16-Apr and 17-Apr) with the same value (14) for price.
SELECT * FROM ticker
WHERE symbol = 'ACME'
ORDER BY tstamp;
The following pattern matching query just uses the mandatory order by clause and is using an always true event (price=price). This means that every row matches the pattern
SELECT * FROM ticker
MATCH_RECOGNIZE (
ORDER BY tstamp
MEASURES
e.symbol as symbol,
e.tstamp as tstamp,
e.price as price
ONE ROW PER MATCH
PATTERN (e)
DEFINE
e AS price=price);
You will see that the data is in random order when it comes to the stock ticker because there is more than one record per tstamp. Now we do it the right way and add the PARTITION BY clause
SELECT * FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES
e.tstamp as tstamp,
e.price as price
ONE ROW PER MATCH
PATTERN (e)
DEFINE
e AS price=price);
In this example we have enhanced our pattern by slightly changing our always-true event: this time we are going to test if the current tstamp is greater than the previous timestamp, which of course if will be!
SELECT * FROM ticker
MATCH_RECOGNIZE (
ORDER BY tstamp
ALL ROWS PER MATCH
PATTERN (e+)
DEFINE
e AS tstamp > prev(tstamp));
Something is incorrect with the output because we are only getting 19 rows returned even though we are using the keywords ALL ROWS PER MATCH. Why is this?
The answer is because like earlier, we have multiple stock ticker entries per tstamp
SELECT *
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
ALL ROWS PER MATCH
PATTERN (e+)
DEFINE
e AS tstamp > prev(tstamp));
But we still seem to have a incorrect result because there are only 57 rows output and not the 60 rows that we were expecting. Why is this?
This is because the very first record in each data stream has no previous timestamp to compare with, therefore, the first record is excluded from our output!
This can easily be corrected by incorporating a dummy event
SELECT *
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
ALL ROWS PER MATCH
PATTERN (strt e+)
DEFINE
e AS tstamp > prev(tstamp));
Now let's our simple always-true pattern again
SELECT
esymbol,
etstamp,
eprice
FROM ticker
MATCH_RECOGNIZE (
ORDER BY tstamp
MEASURES
e.symbol as esymbol,
e.tstamp as etstamp,
e.price as eprice
ALL ROWS PER MATCH
PATTERN (e)
DEFINE
e AS price=price)
WHERE symbol='ACME';
Notice that we get 20 rows back. Now let's run the almost same query again but first let's make a subtle alteration to the pattern definition of the always true event:
SELECT
esymbol,
etstamp,
eprice
FROM ticker
MATCH_RECOGNIZE (
ORDER BY tstamp
MEASURES
e.symbol as esymbol,
e.tstamp as etstamp,
e.price as eprice
ALL ROWS PER MATCH
PATTERN (e+)
DEFINE
e AS price=price)
WHERE symbol='ACME';
Is this the same result? We have still returned 20 rows, so it must be the same result, correct?
The answer to the question lies in the match of individual records to a pattern. Let's run the same statements again, using some built-in functions:
Please note that if you want more information about these two built-in measures there is a separate tutorial available.
Now, Let's run the last example again:
SELECT
esymbol,
etstamp,
eprice,
event,
match
FROM ticker
MATCH_RECOGNIZE (
ORDER BY tstamp
MEASURES
e.symbol as esymbol,
e.tstamp as etstamp,
e.price as eprice,
classifier() event,
match_number() match
ALL ROWS PER MATCH
PATTERN (e+)
DEFINE
e AS price=price)
WHERE symbol='ACME';
Now let's run the first example again:
SELECT
esymbol,
etstamp,
eprice,
event,
match
FROM ticker
MATCH_RECOGNIZE (
ORDER BY tstamp
MEASURES
e.symbol as esymbol,
e.tstamp as etstamp,
e.price as eprice,
classifier() event,
match_number() match
ALL ROWS PER MATCH
PATTERN (e)
DEFINE
e AS price=price)
WHERE symbol='ACME';
You can immediately see that the two results are not in fact the same result!
The e+ pattern searches for one or more instances of our price = price pattern which results in a single match across all 20 rows. The first example, e, searches for a single occurrence of price = price and it is able to match 20 individual occurrences where price equals price.
BUT, the number for each match is not contiguous. Why is that? Why does our match number column not start at 1?
The answer is because the match number function is evaluated prior to filtering the resultset to using the predicate WHERE symbol='ACME'.
How do we solve this? There are two options: 1) we can reintroduce the PARTITION BY clause - did you spot that it was removed! or 2) we can move the predicate inside the pattern, as shown here:
SELECT
esymbol,
etstamp,
eprice,
event,
match
FROM ticker
MATCH_RECOGNIZE (
PARTITION BY symbol ORDER BY tstamp
MEASURES
e.symbol as esymbol,
e.tstamp as etstamp,
e.price as eprice,
classifier() event,
match_number() match
ALL ROWS PER MATCH
PATTERN (e)
DEFINE
e AS price=price and symbol='ACME');
Now we have a sequential number for our match number column.
Hopefully this tutorial has helped you to understand the importance of using the keywords PARTITION BY and ORDER BY within your MATCH_RECOGNIZE statements to ensure that the correct results are returned.
The last example illustrated how predicates are applied when used in conjunction with MATCH_RECOGNIZE statement. Essentially, they are applied after the pattern matching process has been completed. That is to day that the pattern matching process is run against the complete data set and then any predicates that are part of the WHERE clause are finally applied.