create table quickstart (
id number primary key,
country varchar2(2),
full_name varchar2(40)
);
Now insert some test data into that table:
insert into quickstart values (1, 'US', 'John Doe');
insert into quickstart values (2, 'GB', 'John Smith');
insert into quickstart values (3, 'NZ', 'Peter Smith-Smith');
Now a simple query against that table might look like
select * from quickstart where country = 'GB';
Now let's say we want to search for anyone with "Smith" in their name. We could do this by simply extending the query to:
select * from quickstart where country = 'GB' and upper(full_name) like '%SMITH%';
but there are a few problems with this:
create index full_name_index on quickstart( full_name ) indextype is ctxsys.context;
Note this is like any other "create index" statement, with the addition of the phrase "indextype is ctxsys.context", telling the kernel to create a specialized CONTEXT index and allow the use of query operators associated with that indextype.
For example we can do:
select * from quickstart where country = 'GB' and contains ( full_name, 'smith') > 0;
A few things to note:
select score(99), id, full_name from quickstart where contains ( full_name, 'smith', 99) > 0 order by score(99) desc;
This produces the output:
SCORE(99) ID FULL_NAME ---------- ---------- ------------------------------ 7 3 Peter Smith-Smith 4 2 John SmithNote that the first item here scores higher than the second, because the search term smith appears twice. Note there is no "absolute" meaning to those scores - you can't say the first is a "7% match" or anything like that - all you can say is that a record with a higher score is more relevant than one with a lower score. Scoring is quite a complex topic, which we'll cover in more detail later.
We will be looking at Oracle Text queries in a bit more detail. If you've just completed Part 1, jump to "Jump Here" below. Otherwise, we'll need to recreate the table, data and index:
create table quickstart ( id number primary key, country varchar2(2), full_name varchar2(40) );
insert into quickstart values (1, 'US', 'John Doe');
insert into quickstart values (2, 'GB', 'John Smith');
insert into quickstart values (3, 'NZ', 'Peter Smith-Smith');
create index full_name_index on quickstart( full_name ) indextype is ctxsys.context;
We mentioned before that Oracle Text CONTAINS queries look for whole words. This means that a search for "smit" would not succeed, since "smit" does not occur as a whole word in any of our text. If we want to do partial word matches, we must use wildcards. As with the SQL "LIKE" operator, we can use underscore ("_") to mean "any unknown single character" and percent ("%") to mean "any number of unknown characters - including none".
So to find words starting with "smit" we would do:
select score(99), id, full_name from quickstart where contains ( full_name, 'smi%', 99) > 0 order by score(99) desc;
SCORE(99) ID FULL_NAME ---------- ---------- ------------------------------ 7 3 Peter Smith-Smith 4 2 John Smith
Again we see that the first record scores higher than the second as there are two matches for "smit%" in the first.
Note that a search for "%mit%" will work, but may be slow. If you need to use leading wildcards (that is, a wild card at the beginning of the word) you should consider using the SUBSTRING_INDEX option - look it upif you need it. But remember that Oracle Text is primarily a word-based index, it's not really designed, at least with its default settings, for finding substrings anywhere in the text.
So what if we want to search for a phrase - two or more words together, in order? That's easy, we just use the phrase in the search:
select score(99), id, full_name from quickstart where contains ( full_name, 'peter smith', 99) > 0 order by score(99) desc;
Note that unlike some search engines such as Google, the CONTAINS search is quite precise. If you look for the phrase "peter smith", it will not find documents containing "Smith Peter" or "Peter John Smith". If you wanted to find containing combinations of those words, you could use an AND or an OR query. Let's look at one of those now:
select score(99), id, full_name from quickstart where contains ( full_name, 'john OR smith', 99) > 0 order by score(99) desc;
SCORE(99) ID FULL_NAME ---------- ---------- ------------------------------ 7 3 Peter Smith-Smith 4 1 John Doe 4 2 John Smith
We'll take an aside to look at the scoring here - skip to the next paragraph if you're not interested. The first row scores higher because "smith" occurs twice. OK, but why doesn't the third one score higher, as it has two hit terms. "john" and "smith"? The answer is that the OR operator scores the higher of the two expressions it joins. Since "john" and "smith" both score similar low scores, the result is just the one low score. In contrast, the AND operator scores the lower of two expressions. This might seem unobvious, but it makes sense if you consider what happens when one of the search terms doesn't exist - and therefore scores zero. The OR operator scores the higher of (zero and something), so always returns a score when one term is present. The AND operator scores the lower of (zero and something) - which is zero - so always returns a zero score unless both terms are present.
Let's see a few more simple queries and their results. First the AND operator we've already mention - both terms must be present.
select score(99), id, full_name from quickstart where contains ( full_name, 'john AND smith', 99) > 0 order by score(99) desc;
SCORE(99) ID FULL_NAME ---------- ---------- ------------------------------ 4 2 John Smith
Then there's the NOT operator - if the second term is present at all then the query fails:
select score(99), id, full_name from quickstart where contains ( full_name, 'john NOT smith', 99) > 0 order by score(99) desc;
SCORE(99) ID FULL_NAME ---------- ---------- ------------------------------ 4 1 John Doe
And the ACCUM operator - similar to OR but instead of scoring the lower of the two expressions, it adds them together (or accumulates them)
select score(99), id, full_name from quickstart where contains ( full_name, 'john ACCUM smith', 99) > 0 order by score(99) desc;
SCORE(99) ID FULL_NAME ---------- ---------- ------------------------------ 52 2 John Smith 4 3 Peter Smith-Smith 2 1 John Doe
And a glimpse at one of the more powereful capabilities of the query language - the FUZZY operator: do an "inexact" search for words which are spelled or sound similar:
select score(99), id, full_name from quickstart where contains ( full_name, 'fuzzy((smythe))', 99) > 0 order by score(99) desc;
SCORE(99) ID FULL_NAME ---------- ---------- ------------------------------ 7 3 Peter Smith-Smith 4 2 John Smith
You can add these operators together in arbitrarily complex expressions, using parenthesis to establish precedent if required:
select score(99), id, full_name from quickstart where contains ( full_name, '((john ACCUM peter) OR smith) NOT peter', 99) > 0 order by score(99) desc;
SCORE(99) ID FULL_NAME ---------- ---------- ------------------------------ 4 2 John Smith 2 1 John Doe
A few last things:
Oracle Text has a large number of reserved words - AND, OR, ACCUM and FUZZY all being in that list. If we want to search for any of those reserved words, we must enclose them in braces to negate their special meaning:
select score(99), id, full_name from quickstart where contains ( full_name, '{fuzzy} OR {accum}', 99) > 0 order by score(99) desc;
Case sensitivity: CONTEXT indexes are case-insensitive by default in most environments. "smith" will match "Smith", etc. However, for certain languages (notably German) this is switched and the indexes are case-sensitive by default. The default language will be based on the normal NLS settings for your database. Case sensitivity for an index can be chosen as a setting when the index is created.
This is only a glimpse at the capabilities of the Oracle Text CONTAINS query capabilities. For a more detailed look at the operators available, see the documentation.
In this module we look at keeping indexes up to date and in good order
If you've just completed Part 2, jump to "Jump Here" below. Otherwise, we'll need to recreate the table, data and index:
create table quickstart ( id number primary key, country varchar2(2), full_name varchar2(40) );
insert into quickstart values (1, 'US', 'John Doe');
insert into quickstart values (2, 'GB', 'John Smith');
insert into quickstart values (3, 'NZ', 'Peter Smith-Smith');
create index full_name_index on quickstart( full_name ) indextype is ctxsys.context;
(note: to restart the this module, manually enter "drop table quickstart" into the SQL Worksheet and run it, then start from above)
This post follows on from Part 1 and Part 2, and uses the same example "quickstart" table.
One thing that surprised new users is that Oracle Text CONTEXT indexes are not synchronous. That is, updates to the table are not immediately reflected in the index. Instead, the index must be synchronized. We can do this manually with a call to CTX_DDL.SYNC_INDEX, or we can arrange to have it done automatically for us.
Let's show the manual method first:. We'll insert a new run then search for it. We won't find it because the index is not up-to-date. Then we'll call SYNC_INDEX, giving it the name of the index, and search again:
insert into quickstart values (4, 'GB', 'Michael Smith');
select score(99), id, full_name from quickstart where contains ( full_name, 'michael', 99) > 0 order by score(99) desc;
Nothing is found because the update has not been sync'd. So sync the index:
execute ctx_ddl.sync_index('full_name_index')
and search again...
select score(99), id, full_name from quickstart where contains ( full_name, 'michael', 99) > 0 order by score(99) desc;
This time we find our new record.
If we don't want to bother with manual SYNC calls, we can use the parameter string "SYNC (ON COMMIT)" when we create the index. That means that immediately after updates are committed to the table, a SYNC operation will be run to get the index into step with the table. We need to add a PARAMETERS clause to the create index statement for this:
drop index full_name_index;
create index full_name_index on quickstart( full_name )
indextype is ctxsys.context
parameters ('sync (on commit)');
We have a new index. Now insert yet another row - and immediately search for it:
insert into quickstart values (5, 'US', 'Scott Peters');
select score(99), id, full_name from quickstart where contains ( full_name, 'scott', 99) > 0 order by score(99) desc;
We didn't find anything from that search because the insert wasn't committed yet. But LiveSQL commits after every batch of statements, so if we try the query again:
select score(99), id, full_name from quickstart where contains ( full_name, 'scott', 99) > 0 order by score(99) desc;
This time it finds it.
Because of the way CONTEXT indexes are built, frequent changes will cause fragmentation of the index, decreasing performance over time. So a common way is to arrange for indexes to be synchronized at certain fixed intervals. We can do this manually, using the database scheduler, or we can have it done automatically with a "SYNC (EVERY ... )" string in the parameters clause. The time clause in that uses scheduler syntax, and can be a little convoluted. Every 1 minute can be represented in days as 1/24/60".
Unfortunately we can't try that here, as LiveSQL doesn't allow us to create jobs. But if you want to try this on your own database, the syntax would be:
create index full_name_index on quickstart( full_name ) indextype is ctxsys.context parameters ('sync ( every SYSDATE+1/24/60 )');Back in Part 1 we mentioned that it was a good idea to grant the "CREATE JOB" privilege to a text user - this is why. If the user doesn't have CREATE JOB privilege, then we will get an "insufficient privileges" error if we try to create the index with "SYNC (EVERY ...)".
Finally, we should talk about index optimization. CONTEXT indexes, over time, get less efficient as they get updated. This inefficiency takes two forms:
execute ctx_ddl.optimize_index('full_name_index', 'FULL')
The "FULL" parameter there is the mode of optimization. You can choose from the following:
A typical "maintenance regime" at many customers is to run SYNC every five minutes, then run OPTIMIZE in FULL mode nightly, and OPTIMIZE in REBUILD mode once a week.