Oracle Text Quickstart

View All Tutorials Login to Run Tutorial

Tutorial Oracle Text Quickstart
Description Getting started with Oracle Text
Tags Oracle Text, CTX, CONTAINS, CONTEXT
Area Oracle Text
Contributor Roger Ford (Oracle)
Created Tuesday June 04, 2019
Modules 3

Module 1
Part 1 - Basic Index Creation and Queries
Creating an index and running a query against it

If you're running this tutorial outside of Live SQL, you'll need to make sure the user who runs it has CTXAPP role. CREATE JOB privilege is strongly recommended as well.
Let's create a simple table:
```
create table quickstart (
  id         number primary key,
  country    varchar2(2),
  full_name  varchar2(40) 
);
```
Now insert some test data into that table:
```
insert into quickstart values (1, 'US', 'John Doe');
insert into quickstart values (2, 'GB', 'John Smith');
insert into quickstart values (3, 'NZ', 'Peter Smith-Smith');
```
Now a simple query against that table might look like
```
select * from quickstart where country = 'GB';
```
Now let's say we want to search for anyone with "Smith" in their name. We could do this by simply extending the query to:
```
select * from quickstart where country = 'GB' and upper(full_name) like '%SMITH%';
```
but there are a few problems with this:
- There is no index. The kernel must read every full_name field and scan it. Not a problem here, but certainly a problem if there are millions of records.
- The query would match SMITHSON and BLACKSMITH. Not a problem if that's what we want, but tricky otherwise.
We can create a word-based index on the FULL_NAME column using the following syntax:
```
create index full_name_index on quickstart( full_name ) indextype is ctxsys.context;
```
Note this is like any other "create index" statement, with the addition of the phrase "indextype is ctxsys.context", telling the kernel to create a specialized CONTEXT index and allow the use of query operators associated with that indextype.
For example we can do:
```
select * from quickstart where country = 'GB' and contains ( full_name, 'smith') > 0;
```
A few things to note:
- The CONTAINS operator can only be used if an index is present on the column specified in the first argument.
- The search string doesn't need any wildcards either side. We are looking for the whole word 'smith' (if we wanted to match 'smithson' as well, then we could use a wildcard: 'smith%').
- SQL doesn't have boolean functions, so CONTAINS returns zero for a non-match, and greater-than-zero for a match. The actual value returned is the 'score' (see later) but there are very rarely any circumstances where you do anything other than testing where the return value is greater than zero.
Scoring: Most SQL queries just select between rows that match a criterion, and those that don't. There is rarely the concept of a good or better match. But if we're searching a lot of technical papers for the word "Exadata", then a document which has many occurrences of this term is likely to be a better match for our search than one where there is only a single match.
Therefore CONTEXT indexes return a score for each search. As mentioned, the score is returned by the CONTAINS function, but is rarely useful there. It's usually better to use it in the SELECT clause, and perhaps in an ORDER BY clause as well.
A score is associated with a particular CONTAINS clause, of which a query may have several, so we use a number as the third argument in the CONTAINS function, and SCORE operator takes the same number. It doesn't matter what that number is so long as it's the same in both cases. We'll use 99 in this example:
```
select score(99), id, full_name from quickstart where contains ( full_name, 'smith', 99) > 0 order by score(99) desc;
```
This produces the output:
```
 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          2 John Smith
```
Note that the first item here scores higher than the second, because the search term smith appears twice. Note there is no "absolute" meaning to those scores - you can't say the first is a "7% match" or anything like that - all you can say is that a record with a higher score is more relevant than one with a lower score.
Scoring is quite a complex topic, which we'll cover in more detail later.
Module 2
Part 2 - More Queries
More Queries

We will be looking at Oracle Text queries in a bit more detail. If you've just completed Part 1, jump to "Jump Here" below. Otherwise, we'll need to recreate the table, data and index:
```
create table quickstart ( id number primary key, country varchar2(2), full_name varchar2(40) );

insert into quickstart values (1, 'US', 'John Doe');
insert into quickstart values (2, 'GB', 'John Smith');
insert into quickstart values (3, 'NZ', 'Peter Smith-Smith');

create index full_name_index on quickstart( full_name ) indextype is ctxsys.context;
```
Jump Here
Start here if you've come straight from Part 1:
We mentioned before that Oracle Text CONTAINS queries look for whole words. This means that a search for "smit" would not succeed, since "smit" does not occur as a whole word in any of our text. If we want to do partial word matches, we must use wildcards. As with the SQL "LIKE" operator, we can use underscore ("_") to mean "any unknown single character" and percent ("%") to mean "any number of unknown characters - including none".

So to find words starting with "smit" we would do:
```
select score(99), id, full_name from quickstart where contains ( full_name, 'smi%', 99) > 0 order by score(99) desc;
```
```
 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          2 John Smith
```
Again we see that the first record scores higher than the second as there are two matches for "smit%" in the first.

Note that a search for "%mit%" will work, but may be slow. If you need to use leading wildcards (that is, a wild card at the beginning of the word) you should consider using the SUBSTRING_INDEX option - look it upif you need it. But remember that Oracle Text is primarily a word-based index, it's not really designed, at least with its default settings, for finding substrings anywhere in the text.

So what if we want to search for a phrase - two or more words together, in order? That's easy, we just use the phrase in the search:
```
select score(99), id, full_name from quickstart where contains ( full_name, 'peter smith', 99) > 0 order by score(99) desc;
```
Note that unlike some search engines such as Google, the CONTAINS search is quite precise. If you look for the phrase "peter smith", it will not find documents containing "Smith Peter" or "Peter John Smith". If you wanted to find containing combinations of those words, you could use an AND or an OR query. Let's look at one of those now:
```
select score(99), id, full_name from quickstart where contains ( full_name, 'john OR smith', 99) > 0 order by score(99) desc;
```
```
 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          1 John Doe
         4          2 John Smith
```
We'll take an aside to look at the scoring here - skip to the next paragraph if you're not interested. The first row scores higher because "smith" occurs twice. OK, but why doesn't the third one score higher, as it has two hit terms. "john" and "smith"? The answer is that the OR operator scores the higher of the two expressions it joins. Since "john" and "smith" both score similar low scores, the result is just the one low score. In contrast, the AND operator scores the lower of two expressions. This might seem unobvious, but it makes sense if you consider what happens when one of the search terms doesn't exist - and therefore scores zero. The OR operator scores the higher of (zero and something), so always returns a score when one term is present. The AND operator scores the lower of (zero and something) - which is zero - so always returns a zero score unless both terms are present.

Let's see a few more simple queries and their results. First the AND operator we've already mention - both terms must be present.
```
select score(99), id, full_name from quickstart where contains ( full_name, 'john AND smith', 99) > 0 order by score(99) desc;
```
```
 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         4          2 John Smith
```
Then there's the NOT operator - if the second term is present at all then the query fails:
```
select score(99), id, full_name from quickstart where contains ( full_name, 'john NOT smith', 99) > 0 order by score(99) desc;
```
```
 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         4          1 John Doe
```
And the ACCUM operator - similar to OR but instead of scoring the lower of the two expressions, it adds them together (or accumulates them)
```
select score(99), id, full_name from quickstart where contains ( full_name, 'john ACCUM smith', 99) > 0 order by score(99) desc;
```
```
 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
        52          2 John Smith
         4          3 Peter Smith-Smith
         2          1 John Doe
```
And a glimpse at one of the more powereful capabilities of the query language - the FUZZY operator: do an "inexact" search for words which are spelled or sound similar:
```
 select score(99), id, full_name from quickstart where contains ( full_name, 'fuzzy((smythe))', 99) > 0 order by score(99) desc;
```
```
 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         7          3 Peter Smith-Smith
         4          2 John Smith
```
You can add these operators together in arbitrarily complex expressions, using parenthesis to establish precedent if required:
```
select score(99), id, full_name from quickstart where contains ( full_name, '((john ACCUM peter) OR smith) NOT peter', 99) > 0 order by score(99) desc;
```
```
 SCORE(99)         ID FULL_NAME
---------- ---------- ------------------------------
         4          2 John Smith
         2          1 John Doe
```
A few last things:

Oracle Text has a large number of reserved words - AND, OR, ACCUM and FUZZY all being in that list. If we want to search for any of those reserved words, we must enclose them in braces to negate their special meaning:
```
select score(99), id, full_name from quickstart where contains ( full_name, '{fuzzy} OR {accum}', 99) > 0 order by score(99) desc;
```
Case sensitivity: CONTEXT indexes are case-insensitive by default in most environments. "smith" will match "Smith", etc. However, for certain languages (notably German) this is switched and the indexes are case-sensitive by default. The default language will be based on the normal NLS settings for your database. Case sensitivity for an index can be chosen as a setting when the index is created.

This is only a glimpse at the capabilities of the Oracle Text CONTAINS query capabilities. For a more detailed look at the operators available, see the documentation.
Module 3
Part 3 - Index Maintenance
Part 3: Index Maintenance

In this module we look at keeping indexes up to date and in good order
If you've just completed Part 2, jump to "Jump Here" below. Otherwise, we'll need to recreate the table, data and index:
```
create table quickstart ( id number primary key, country varchar2(2), full_name varchar2(40) );

insert into quickstart values (1, 'US', 'John Doe');
insert into quickstart values (2, 'GB', 'John Smith');
insert into quickstart values (3, 'NZ', 'Peter Smith-Smith');

create index full_name_index on quickstart( full_name ) indextype is ctxsys.context;
```
(note: to restart the this module, manually enter "drop table quickstart" into the SQL Worksheet and run it, then start from above)
Jump Here

Start here if you've come straight from Part 2:

This post follows on from Part 1 and Part 2, and uses the same example "quickstart" table.
One thing that surprised new users is that Oracle Text CONTEXT indexes are not synchronous. That is, updates to the table are not immediately reflected in the index. Instead, the index must be synchronized. We can do this manually with a call to CTX_DDL.SYNC_INDEX, or we can arrange to have it done automatically for us.
Let's show the manual method first:. We'll insert a new run then search for it. We won't find it because the index is not up-to-date. Then we'll call SYNC_INDEX, giving it the name of the index, and search again:
```
insert into quickstart values (4, 'GB', 'Michael Smith');

select score(99), id, full_name from quickstart where contains ( full_name, 'michael', 99) > 0 order by score(99) desc;
```
Nothing is found because the update has not been sync'd. So sync the index:
```
execute ctx_ddl.sync_index('full_name_index')
```
and search again...
```
select score(99), id, full_name from quickstart where contains ( full_name, 'michael', 99) > 0 order by score(99) desc;
```
This time we find our new record.
If we don't want to bother with manual SYNC calls, we can use the parameter string "SYNC (ON COMMIT)" when we create the index. That means that immediately after updates are committed to the table, a SYNC operation will be run to get the index into step with the table. We need to add a PARAMETERS clause to the create index statement for this:
```
drop index full_name_index;

create index full_name_index on quickstart( full_name )
indextype is ctxsys.context
parameters ('sync (on commit)');
```
We have a new index. Now insert yet another row - and immediately search for it:
```
insert into quickstart values (5, 'US', 'Scott Peters');

select score(99), id, full_name from quickstart where contains ( full_name, 'scott', 99) > 0 order by score(99) desc;
```
We didn't find anything from that search because the insert wasn't committed yet. But LiveSQL commits after every batch of statements, so if we try the query again:
```
select score(99), id, full_name from quickstart where contains ( full_name, 'scott', 99) > 0 order by score(99) desc;
```
This time it finds it.

Time-scheduled Sync
SYNC(ON COMMIT) is fine if you don't make too many updates to your table, but is not ideal if you have a very large table and lots of changes take place on it.
Because of the way CONTEXT indexes are built, frequent changes will cause fragmentation of the index, decreasing performance over time. So a common way is to arrange for indexes to be synchronized at certain fixed intervals. We can do this manually, using the database scheduler, or we can have it done automatically with a "SYNC (EVERY ... )" string in the parameters clause. The time clause in that uses scheduler syntax, and can be a little convoluted. Every 1 minute can be represented in days as 1/24/60".
Unfortunately we can't try that here, as LiveSQL doesn't allow us to create jobs. But if you want to try this on your own database, the syntax would be:
```
create index full_name_index on quickstart( full_name )
indextype is ctxsys.context 
parameters ('sync ( every SYSDATE+1/24/60 )');
```
Back in Part 1 we mentioned that it was a good idea to grant the "CREATE JOB" privilege to a text user - this is why. If the user doesn't have CREATE JOB privilege, then we will get an "insufficient privileges" error if we try to create the index with "SYNC (EVERY ...)".
Finally, we should talk about index optimization. CONTEXT indexes, over time, get less efficient as they get updated. This inefficiency takes two forms:
1. The index gets filled with garbage - it contains references to deleted or updated documents which are no longer current
2. The "postings lists" get fragmented. The pointers to which documents contain each word get broken up into small chunks instead of the idea long strings.
We can fix this by running index optimization. As with the manual call to sync indexes, this is done with a call to the PL/SQL package CTX_DDL:
```
execute ctx_ddl.optimize_index('full_name_index', 'FULL')
```
The "FULL" parameter there is the mode of optimization. You can choose from the following:
- FAST: Only posting fragmentation is fixed - no garbage collection occurs
- FULL: Both fragmentation and garbage collection is dealt with
- REBUILD: The entire index is copied to a fresh table (requires more disk space but produces the best results).
The more frequently you run SYNC, the more fragmented your index will become, and the more often you will need to optimize.
A typical "maintenance regime" at many customers is to run SYNC every five minutes, then run OPTIMIZE in FULL mode nightly, and OPTIMIZE in REBUILD mode once a week.

Additional Information

Database on OTN SQL and PL/SQL Discussion forums
Oracle Database
Download Oracle Database

Oracle Text Quickstart

Part 1 - Basic Index Creation and Queries

Creating an index and running a query against it

Part 2 - More Queries

More Queries

Jump Here

Part 3 - Index Maintenance

Part 3: Index Maintenance

Jump Here

Start here if you've come straight from Part 2:

Time-scheduled Sync

Additional Information