In Oracle Database 12c there is a new SQL construct, MATCH_RECOGNIZE, for finding rows that match a pattern across a sequence of rows, using regular expressions.
Pattern matching using SQL is based around four logical concepts:
1. Partition and order the input data, so the pattern is visible within each sequence of rows.
2. Define the pattern of rows to search for, expressed as a regular expression over pattern variables.
3. Define the measures: the values to return for each match.
4. Define the output: a summary row per match (ONE ROW PER MATCH) or every matched row (ALL ROWS PER MATCH).
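These four concepts map directly onto the clauses of the MATCH_RECOGNIZE construct. As a sketch (the table and column names here are placeholders, not part of this tutorial's schema):

SELECT *
FROM some_table                      -- hypothetical input table
MATCH_RECOGNIZE (
  PARTITION BY grouping_col          -- 1. group the rows into sequences
  ORDER BY ordering_col              --    and order the rows within each sequence
  MEASURES                           -- 3. values to return for each match
    FIRST(a.ordering_col) AS start_of_match
  ONE ROW PER MATCH                  -- 4. one summary row per match (or ALL ROWS PER MATCH)
  PATTERN (A+ B)                     -- 2. the pattern: one or more A rows, then a B row
  DEFINE                             --    the conditions a row must meet to map to A or B
    A AS some_col < 10,
    B AS some_col >= 10
);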
In this tutorial we want to explore how to use MATCH_RECOGNIZE to find potentially fraudulent transactions.
Let's define a suspicious money transfer pattern for an account as: three or more small transfers (each less than $2,000) within a 30-day window, followed by a large transfer ($1,000,000 or more) within 10 days of the last small transfer.
When we find this pattern we need to report the following information: the account, the date of the first small transfer, the date of the last large transfer, and the amount of the large transfer.
The first step is to create the JSON table that will hold our transaction log. The log provides the time, the account/user id, the type of event (deposit, withdrawal, or transfer), and the transaction amount.
Oracle Database 12c supports storing JSON documents inside the database. Use the following code to create a table to store the transaction log, which is in JSON format:
CREATE TABLE json_transactions
(transaction_doc CLOB,
CONSTRAINT "VALID_JSON" CHECK (transaction_doc IS JSON) ENABLE
);
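The IS JSON check constraint guarantees that only well-formed JSON documents can be stored in the column. For example, an insert of a malformed document (here, one with a missing closing brace) is rejected with an ORA-02290 check constraint violation; the schema prefix in the error message will vary:

-- This insert fails: the document is not valid JSON
INSERT INTO json_transactions VALUES ('{"time_id":"01-JAN-12"');
-- ORA-02290: check constraint (<schema>.VALID_JSON) violated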
The next step is to add some data to our JSON table using the normal JSON notation of key-value pairs:
INSERT INTO json_transactions VALUES ('{"time_id":"01-JAN-12","user_id":"John","event_id":"Deposit","trans_amount":1000000}');
INSERT INTO json_transactions VALUES ('{"time_id":"02-JAN-12","user_id":"John","event_id":"Transfer","trans_amount":1000}');
INSERT INTO json_transactions VALUES ('{"time_id":"05-JAN-12","user_id":"John","event_id":"Withdrawal","trans_amount":2000}');
INSERT INTO json_transactions VALUES ('{"time_id":"10-JAN-12","user_id":"John","event_id":"Transfer","trans_amount":1500}');
INSERT INTO json_transactions VALUES ('{"time_id":"20-JAN-12","user_id":"John","event_id":"Transfer","trans_amount":1200}');
INSERT INTO json_transactions VALUES ('{"time_id":"25-JAN-12","user_id":"John","event_id":"Deposit","trans_amount":1200000}');
INSERT INTO json_transactions VALUES ('{"time_id":"27-JAN-12","user_id":"John","event_id":"Transfer","trans_amount":1000000}');
INSERT INTO json_transactions VALUES ('{"time_id":"02-FEB-12","user_id":"John","event_id":"Deposit","trans_amount":500000}');
COMMIT;
Using the new JSON SQL dot notation, we can query our data with a simple SELECT statement:
SELECT
j.transaction_doc.time_id as time_id,
j.transaction_doc.user_id as user_id,
j.transaction_doc.event_id as event_id,
j.transaction_doc.trans_amount as amount
FROM json_transactions j;
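The simple dot notation used above is shorthand; the same projection can be written with the JSON_VALUE function, which also lets you control the return type explicitly (this is an equivalent alternative, not a step the tutorial requires):

SELECT
  JSON_VALUE(j.transaction_doc, '$.time_id')  AS time_id,
  JSON_VALUE(j.transaction_doc, '$.user_id')  AS user_id,
  JSON_VALUE(j.transaction_doc, '$.event_id') AS event_id,
  JSON_VALUE(j.transaction_doc, '$.trans_amount' RETURNING NUMBER) AS amount
FROM json_transactions j;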
Here is the list of transfer transactions that are of interest for our analysis:
SELECT
TO_DATE(j.transaction_doc.time_id, 'DD-MON-YY') as time_id,
j.transaction_doc.user_id as user_id,
j.transaction_doc.event_id as event_id,
TO_NUMBER(j.transaction_doc.trans_amount) as amount
FROM json_transactions j
WHERE j.transaction_doc.event_id = 'Transfer';
The input to MATCH_RECOGNIZE is a SELECT statement that returns the JSON data in a standard relational table format:
SELECT user_id, first_t, last_t, big_amount
FROM (SELECT
TO_DATE(j.transaction_doc.time_id, 'DD-MON-YY') as time_id,
j.transaction_doc.user_id as user_id,
j.transaction_doc.event_id as event_id,
TO_NUMBER(j.transaction_doc.trans_amount) as amount
FROM json_transactions j
WHERE j.transaction_doc.event_id = 'Transfer')
MATCH_RECOGNIZE(
PARTITION BY user_id
ORDER BY time_id
MEASURES
FIRST(x.time_id) AS first_t,
LAST(y.time_id) AS last_t,
y.amount AS big_amount
ONE ROW PER MATCH
PATTERN (X{3,} Y)
DEFINE
X as (amount < 2000) AND
LAST(time_id) - FIRST(time_id) < 30,
Y as (amount >= 1000000) AND
time_id - LAST(x.time_id) < 10);
How does this statement break down?
To make the pattern "visible" within the sequence of rows, we first need to group and order the data: we partition the rows by user_id and then sort the rows within each partition by the transaction date.
The MEASURES clause lists the columns that will be returned. These can be existing columns from the input table/view and/or calculated columns. In this case we need to return the account/user id along with the amount of the large transfer, and we need to calculate the date of the first small transfer and the date of the last large transfer.
For this query we only need a summary report, so the ONE ROW PER MATCH clause is used.
We have two pattern variables, X and Y. In this example we are searching for three or more occurrences of X followed by a single occurrence of Y.
For a match on variable X we are searching for small transfers of less than $2,000, and all the small transfers must occur within a 30-day window.
For a match on variable Y we are searching for a large transfer of $1,000,000 or more. The large transfer must occur within 10 days of the last small transfer.
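Putting the two definitions side by side with the business rules they encode (this is the same DEFINE clause as in the query above, just annotated):

DEFINE
  X as (amount < 2000) AND                    -- a small transfer: less than $2,000
       LAST(time_id) - FIRST(time_id) < 30,   -- all small transfers within a 30-day window
  Y as (amount >= 1000000) AND                -- a large transfer: $1,000,000 or more
       time_id - LAST(x.time_id) < 10         -- within 10 days of the last small transfer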
We can check whether our pattern has found all the relevant matches by using two of the built-in measures: MATCH_NUMBER() and CLASSIFIER().
You might have a large number of matches for your pattern inside a given row partition. How do you tell these matches apart? This is done with the MATCH_NUMBER() function. Matches within a row pattern partition are numbered sequentially starting with 1 in the order they are found. This numbering starts at 1 within each row pattern partition, because there is no linked ordering across row pattern partitions.
Along with knowing which MATCH_NUMBER you are seeing, you may want to know which component of a pattern applies to a specific row. This is done using the CLASSIFIER() function. The classifier of a row is the pattern variable that the row is mapped to by a row pattern match. The CLASSIFIER() function returns a character string whose value is the name of the pattern variable defined within the PATTERN clause.
The last step is to change the amount of information returned from a summary to a detailed report. To do this we use ALL ROWS PER MATCH WITH UNMATCHED ROWS; there is a separate LiveSQL tutorial that covers this topic.
SELECT user_id, time_id, amount, first_t, last_t, first_small_amount, last_small_amount, big_amount, mn, classifier
FROM (SELECT
TO_DATE(j.transaction_doc.time_id, 'DD-MON-YY') as time_id,
j.transaction_doc.user_id as user_id,
j.transaction_doc.event_id as event_id,
TO_NUMBER(j.transaction_doc.trans_amount) as amount
FROM json_transactions j
WHERE j.transaction_doc.event_id = 'Transfer')
MATCH_RECOGNIZE(
PARTITION BY user_id
ORDER BY time_id
MEASURES
FIRST(x.time_id) AS first_t,
LAST(y.time_id) AS last_t,
FIRST(x.amount) AS first_small_amount,
LAST(x.amount) AS last_small_amount,
y.amount AS big_amount,
match_number() AS mn,
classifier() AS classifier
ALL ROWS PER MATCH WITH UNMATCHED ROWS
PATTERN (X{3,} Y)
DEFINE
X as (amount < 2000) AND
LAST(time_id) - FIRST(time_id) < 30,
Y as (amount >= 1000000) AND
time_id - LAST(x.time_id) < 10);
As you can now see, we have correctly identified the three small transfers on 02-Jan, 10-Jan, and 20-Jan; the difference between 02-Jan and 20-Jan is less than 30 days. The large transfer meets our threshold of $1,000,000 or more, and the time difference between the last small transfer on 20-Jan and the large transfer on 27-Jan is less than 10 days.
Having found some fraudulent transfers, the business has now come back with additional requirements: each small transfer must go to a different account than the previous one, and the total sum of the small transfers must be less than $20,000.
The application developers have modified the JSON model to include the id of the person receiving the transfer so let's update our table with this new information:
TRUNCATE TABLE json_transactions;
INSERT INTO json_transactions VALUES ('{"time_id":"01-JAN-12","user_id":"John","event_id":"Deposit","trans_amount":1000000}');
INSERT INTO json_transactions VALUES ('{"time_id":"02-JAN-12","user_id":"John","event_id":"Transfer","transfer_id":"John","trans_amount":100}');
INSERT INTO json_transactions VALUES ('{"time_id":"02-JAN-12","user_id":"John","event_id":"Transfer","transfer_id":"Bob","trans_amount":1000}');
INSERT INTO json_transactions VALUES ('{"time_id":"05-JAN-12","user_id":"John","event_id":"Withdrawal","trans_amount":2000}');
INSERT INTO json_transactions VALUES ('{"time_id":"10-JAN-12","user_id":"John","event_id":"Transfer","transfer_id":"Allen","trans_amount":1500}');
INSERT INTO json_transactions VALUES ('{"time_id":"20-JAN-12","user_id":"John","event_id":"Transfer","transfer_id":"Tim","trans_amount":1200}');
INSERT INTO json_transactions VALUES ('{"time_id":"25-JAN-12","user_id":"John","event_id":"Deposit","trans_amount":1200000}');
INSERT INTO json_transactions VALUES ('{"time_id":"27-JAN-12","user_id":"John","event_id":"Transfer","transfer_id":"Tim","trans_amount":1000000}');
INSERT INTO json_transactions VALUES ('{"time_id":"02-FEB-12","user_id":"John","event_id":"Deposit","trans_amount":500000}');
COMMIT;
As before, we can use the new JSON SQL notation to query our data with a simple SELECT statement:
SELECT
j.transaction_doc.time_id as time_id,
j.transaction_doc.user_id as user_id,
j.transaction_doc.event_id as event_id,
j.transaction_doc.transfer_id as transfer_id,
j.transaction_doc.trans_amount as amount
FROM json_transactions j;
Here is the list of transfer transactions that are of interest for our analysis, which now includes the new column, transfer_id:
SELECT
TO_DATE(j.transaction_doc.time_id, 'DD-MON-YY') as time_id,
j.transaction_doc.user_id as user_id,
j.transaction_doc.event_id as event_id,
j.transaction_doc.transfer_id as transfer_id,
TO_NUMBER(j.transaction_doc.trans_amount) as amount
FROM json_transactions j
WHERE j.transaction_doc.event_id = 'Transfer';
Now that we have modified our original JSON model to include the id of the person receiving the transfer, it is still possible to run the original pattern matching query, which is based on the original business rules:
SELECT user_id, first_t, last_t, big_amount
FROM (SELECT
TO_DATE(j.transaction_doc.time_id, 'DD-MON-YY') as time_id,
j.transaction_doc.user_id as user_id,
j.transaction_doc.event_id as event_id,
j.transaction_doc.transfer_id as transfer_id,
TO_NUMBER(j.transaction_doc.trans_amount) as amount
FROM json_transactions j
WHERE j.transaction_doc.event_id = 'Transfer')
MATCH_RECOGNIZE(
PARTITION BY user_id
ORDER BY time_id
MEASURES
FIRST(x.time_id) AS first_t,
LAST(y.time_id) AS last_t,
y.amount AS big_amount
ONE ROW PER MATCH
PATTERN (X{3,} Y)
DEFINE
X as (amount < 2000) AND
LAST(time_id) - FIRST(time_id) < 30,
Y as (amount >= 1000000) AND
time_id - LAST(x.time_id) < 10);
As we can see, our original query still runs, even though we have added new fields to our JSON data.
Let's change the definitions of the two pattern variables to include the new business requirements:
1) Consecutive small transfers must go to different accounts:
PREV(transfer_id) <> transfer_id
2) The total sum of the small transfers must be less than $20,000:
SUM(x.amount) < 20000
SELECT user_id, first_t, last_t, big_amount
FROM (SELECT
TO_DATE(j.transaction_doc.time_id, 'DD-MON-YY') as time_id,
j.transaction_doc.user_id as user_id,
j.transaction_doc.event_id as event_id,
j.transaction_doc.transfer_id as transfer_id,
TO_NUMBER(j.transaction_doc.trans_amount) as amount
FROM json_transactions j
WHERE j.transaction_doc.event_id = 'Transfer')
MATCH_RECOGNIZE(
PARTITION BY user_id
ORDER BY time_id
MEASURES
FIRST(x.time_id) AS first_t,
LAST(y.time_id) AS last_t,
y.amount AS big_amount
ONE ROW PER MATCH
PATTERN (X{3,} Y)
DEFINE
X as (amount < 2000) AND
PREV(transfer_id) <> transfer_id AND
LAST(time_id) - FIRST(time_id) < 30,
Y as (amount >= 1000000) AND
time_id - LAST(x.time_id) < 10 AND
SUM(x.amount) < 20000);
We can extract a lot more information from our pattern using some of the built-in functions. For example, we can extract data for each of the three small transfers by using the FIRST() and LAST() functions. Notice that FIRST(value, offset) returns the value offset rows after the first match, so FIRST(x.transfer_id, 1) returns the transfer_id for the second match of X.
SELECT user_id, first_t, amount_1, transfer_1, amount_2, transfer_2, amount_3, transfer_3, last_t, big_amount,transfer_4
FROM (SELECT
TO_DATE(j.transaction_doc.time_id, 'DD-MON-YY') as time_id,
j.transaction_doc.user_id as user_id,
j.transaction_doc.event_id as event_id,
j.transaction_doc.transfer_id as transfer_id,
TO_NUMBER(j.transaction_doc.trans_amount) as amount
FROM json_transactions j
WHERE j.transaction_doc.event_id = 'Transfer')
MATCH_RECOGNIZE(
PARTITION BY user_id
ORDER BY time_id
MEASURES
FIRST(x.time_id) AS first_t,
LAST(y.time_id) AS last_t,
FIRST(x.amount) AS amount_1,
FIRST(x.transfer_id) AS transfer_1,
FIRST(x.amount,1) AS amount_2,
FIRST(x.transfer_id,1) AS transfer_2,
LAST(x.amount) AS amount_3,
LAST(x.transfer_id) AS transfer_3,
y.amount AS big_amount,
y.transfer_id AS transfer_4
ONE ROW PER MATCH
PATTERN (X{3,} Y)
DEFINE
X as (amount < 2000) AND
PREV(transfer_id) <> transfer_id AND
LAST(time_id) - FIRST(time_id) < 30,
Y as (amount >= 1000000) AND
time_id - LAST(x.time_id) < 10 AND
SUM(x.amount) < 20000);
This tutorial has explained how to use the new JSON feature to store data inside the database and query it using familiar SQL syntax.
We have designed a fraud pattern matching SQL statement based on a set of business rules and then enhanced the pattern matching SQL statement in response to changing business rules.
Finally, we used two of the built-in measures, MATCH_NUMBER() and CLASSIFIER(), to validate that our pattern matching statement is working correctly, along with the FIRST() and LAST() functions to retrieve specific data points within our pattern.