Recognizing patterns in a sequence of rows has been a capability that was widely desired, but not really possible with SQL until now. There were many workarounds, but these were difficult to write, hard to understand, and inefficient to execute.
With Oracle Database 12c you can use the MATCH_RECOGNIZE clause to perform pattern matching in SQL to do the following:
This tutorial is designed to help show you how you can run sessionization analysis on application logs, web logs, etc.
For this tutorial we are going to define a session as a sequence of one or more events where the inter-timestamp gap is less than 10 minutes between events.
Using our simplified table of click data, which will create in the next step, we will create a sessionization data set which tracks each session, the duration of the session and the number of clicks/events using the new 12c MATCH_RECOGNIZE feature.
First step is to setup our data table and then populate it with data
CREATE TABLE clickdata (tstamp integer, userid varchar2(15));
INSERT INTO clickdata VALUES(1, 'Mary');
INSERT INTO clickdata VALUES(2, 'Sam');
INSERT INTO clickdata VALUES(11, 'Mary');
INSERT INTO clickdata VALUES(12, 'Sam');
INSERT INTO clickdata VALUES(22, 'Sam');
INSERT INTO clickdata VALUES(23, 'Mary');
INSERT INTO clickdata VALUES(32, 'Sam');
INSERT INTO clickdata VALUES(34, 'Mary');
INSERT INTO clickdata VALUES(43, 'Sam');
INSERT INTO clickdata VALUES(44, 'Mary');
INSERT INTO clickdata VALUES(47, 'Sam');
INSERT INTO clickdata VALUES(48, 'Sam');
INSERT INTO clickdata VALUES(53, 'Mary');
INSERT INTO clickdata VALUES(59, 'Sam');
INSERT INTO clickdata VALUES(60, 'Sam');
INSERT INTO clickdata VALUES(63, 'Mary');
INSERT INTO clickdata VALUES(68, 'Sam');
commit;
Running our first sessionization report using only the main keywords within MATCH_RECOGNIZE clause
The keywords that we are going to use are:
ALL ROWS PER MATCH
You will sometimes want summary data about the matches and other times need details. For this report we are going to return a detailed report.PATTERN
The PATTERN clause lets you define which pattern variables must be matched, the sequence in which they must be matched, and the quantity of rows which must be matched. The PATTERN clause specifies a regular expression for the match search.DEFINE
The PATTERN clause depends on pattern variables, therefore, you must have a clause to define these variables. They are specified in the DEFINE clause. DEFINE is a required clause, used to specify the conditions that a row must meet to be mapped to a specific pattern variable.
So let's write our first MATCH_RECOGNIZE statement based on our business requirements:
SELECT
tstamp,
userid
FROM clickdata MATCH_RECOGNIZE(
ALL ROWS PER MATCH
PATTERN (b s+)
DEFINE
s as (tstamp - prev(tstamp) <= 10)
);
As we mentioned earlier, the PATTERN defines a regular expression, which is a highly expressive way to search for patterns. PATTERN (b s+) says that the pattern we are searching for has two pattern variables: b and s. The star sign (*) after s means that zero or more rows must be mapped to confirm a match.
The plus sign (+) is part of a library of quantifiers that can be used to describe your pattern. We will explore these in a separate tutorial
DEFINE gives us the conditions that must be met for a row to map to your row pattern variables b and s. Because there is no condition for b, any row can be mapped to b. Why have a pattern variable with no condition? You use it as a starting point for testing for matches. S takes advantage of the PREV() function, which lets them compare the timestamp in the current row to the timestamp in the prior row. S is matched when a row has a timestamp than is within 10 minutes of the row that preceded it.
Congratulations your report runs! We have a list of events where each event is within 10 minutes of the previous event
But we don't actually know how many sessions each user logged, so how can we find out this information?
We can use the internal MATCH_NUMBER() function to return the session id for each user - assuming that a session contains clicks that are within a 10 minute interval of previous click
What does MATCH_NUMBER() actually do?
You might have a large number of matches for your pattern inside a given row partition. How do you tell these matches apart? This is done with the MATCH_NUMBER() function. Matches within a row pattern partition are numbered sequentially starting with 1 in the order they are found. This numbering starts at 1 within each row pattern partition, because there is no linked ordering across row pattern partitions.
SELECT
tstamp,
userid,
session_id
FROM clickdata MATCH_RECOGNIZE(
MEASURES match_number() as session_id
ALL ROWS PER MATCH
PATTERN (b s+)
DEFINE
s as (tstamp - prev(tstamp) <= 10)
);
What you will notice is that each row has a session id of 1 which means that we are only finding one instance of our pattern within our data set.
Obviously this is not correct because just looking at the raw data we can see that Mary has logged more than one session
SELECT *
FROM clickdata
WHERE userid = 'Mary'
What we need to do is use the PARTITION BY and ORDER BY key phrases to correctly divide up our data set by each user and then within each user make sure the data is ordered by our timestamp so we can determine if the clicks recorded in our log by each user are within 10 minutes of their previous click.
We need to expand our MATCH_RECOGNIZE() clause as shown here:
SELECT
tstamp,
userid,
session_id
FROM clickdata MATCH_RECOGNIZE(
PARTITION BY userid ORDER BY tstamp
MEASURES match_number() as session_id
ALL ROWS PER MATCH
PATTERN (b s+)
DEFINE
s as (tstamp - prev(tstamp) <=10)
);
Now you should see that Mary has logged 3 sessions and Sam has logged three sessions. Which means that have now correctly specified our MATCH_RECOGNIZE clause to create our sessionization report.
After we have found a match (and this actually includes empty matches as well!) we need to decide where to start searching for the next match. This is controlled by the AFTER MATCH SKIP clause. There are five possible options for controlling the start point and the default for the clause is AFTER MATCH SKIP PAST LAST ROW.
In this case we are going to use the default.
SELECT
tstamp,
userid,
session_id
FROM clickdata MATCH_RECOGNIZE(
PARTITION BY userid ORDER BY tstamp
MEASURES match_number() as session_id
ALL ROWS PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (b s+)
DEFINE
s as (tstamp - prev(tstamp) <=10)
);
Note the position of the SKIP TO syntax within the MATCH_RECOGNIZE code - it comes just before the PATTERN clause
The results from the above code are exactly the same as the results from the previous example which indicates that although the previous example did not include a SKIP TO clause the default processing was implemented.
The next stage is to count the number of events within each session and calculate the duration of each session
How do we do that? We can use some of the other built-in measures such as FIRST() and LAST() to extract values from our resultset and we can calculate new values such as the duration of a session.
In our code we will add some new measures
SELECT
tstamp,
userid,
session_id,
no_of_events,
start_time,
session_duration
FROM clickdata MATCH_RECOGNIZE(
PARTITION BY userid ORDER BY tstamp
MEASURES match_number() as session_id,
COUNT(*) as no_of_events,
FIRST(b.tstamp) start_time,
LAST(s.tstamp) end_time,
LAST(s.tstamp) - FIRST(s.tstamp) session_duration
ALL ROWS PER MATCH
PATTERN (b s+)
DEFINE
s as (tstamp - prev(tstamp) <= 10)
);
This is looking good but it would it would be better if we could create a summary report
What we really want is a report that shows one line for each session, shows the number of events, the start time and the session duration.
To return a summary report we only need to change one line of code!
We just need to change the output type from ALL ROWS PER MATCH to ONE ROW PER MATCH and remove the reference to tstamp within the SELECT clause.
SELECT
userid,
session_id,
no_of_events,
start_time,
session_duration
FROM clickdata MATCH_RECOGNIZE(
PARTITION BY userid ORDER BY tstamp
MEASURES match_number() as session_id,
COUNT(*) as no_of_events,
FIRST(b.tstamp) start_time,
LAST(s.tstamp) end_time,
LAST(s.tstamp) - FIRST(b.tstamp) session_duration
ONE ROW PER MATCH
PATTERN (b s+)
DEFINE
s as (tstamp - prev(tstamp) <= 10)
);
...and there is our finished sessionization report.