Remove Outlier (stddev)

Script Name Remove Outlier (stddev)
Visibility Unlisted - anyone with the share link can access
Description Example using the stddev function to remove outlier data from reports. For more detail, see http://www.grassroots-oracle.com/2017/06/removing-outliers-using-stddev.html
Area SQL General / Analytics
Contributor swesley_perth
Created Friday June 16, 2017

Statement 1
Create a simple table to store a bunch of values. This is an easy skill to acquire, and you should be entitled to the necessary privileges in a dev database.
Create test table
```
create table outlier_demo  (a number, val number)
```
```
Table created.
```

Statement 2

Insert a bunch of random data into the table - great skill to have

Populate table

```
insert into outlier_demo
select rownum -- we don't care
  , round(dbms_random.value(20,50),1) -- some random numbers between 20 and 50
from dual
-- let's create 49 rows
connect by level < 50
```

Statement processed.

Statement 3
Let's add an outlier to the data. Something that doesn't normally belong, and 'the business' would like to discard/ignore for certain reports.
Add outlier
```
insert into outlier_demo values (0,500)
```
```
1 row(s) inserted.
```

Statement 4

We now have a bunch of random data within a certain range, plus the value of 500

Check sample data

```
select * from outlier_demo
```


A VAL
1 27.9
2 42.5
3 49.7
4 22.3
5 42.3
6 20.2
7 33.9
8 46.8
9 34
10 26.6
11 29.9
12 37
13 28.1
14 34.6
15 35.7
16 45.2
17 24.1
18 35.4
19 28.4
20 42.2
21 33
22 30
23 47
24 32.8
25 24
26 48.6
27 47.4
28 41.9
29 47.4
30 46.2
31 25.9
32 40.2
33 38.7
34 42
35 35.1
36 35.7
37 46.4
38 47.2
39 28.5
40 49.3
41 29.2
42 45.4
43 35.3
44 27.3
45 33.4
46 49.3
47 47.2
48 45.7
49 21.8
0 500

50 rows selected.

Statement 5

The average may be in the 40s, but without the outlier it would be back down in the 30s. The median can be a better assessment of typical data.

Calculate averages

```
select count(*) c
   -- what's our average?
  ,round(avg(val)) the_avg
   -- what would the average be without the known outlier?
  ,round(avg(case when val < 500 then val end)) no_outlier
  ,round(stddev(val)) std
   -- what's the median
  ,round(median(val)) the_median
from outlier_demo
```


C THE_AVG NO_OUTLIER STD THE_MEDIAN
50 46 37 66 36
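
To make the point about the median concrete, here is a minimal sketch that puts the average and the median side by side, with and without the outlier row. It assumes the outlier is the row inserted with `a = 0`, as in Statement 3; the column aliases are illustrative and not part of the original script.

```sql
-- Sketch: compare how much avg and median move when the outlier row is ignored.
-- Assumes the outlier is the single row with a = 0 (see Statement 3).
-- Both avg and median skip the NULL that the case expression returns for that row.
select round(avg(val)) avg_all
  ,round(median(val)) median_all
  ,round(avg(case when a <> 0 then val end)) avg_no_outlier
  ,round(median(case when a <> 0 then val end)) median_no_outlier
from outlier_demo
```

The expectation is that the two median columns stay close together while the two averages differ noticeably, since a single extreme value shifts the mean far more than it shifts the middle of the ordered data.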

Statement 6

Transform stddev from a group (aggregate) function to the analytic syntax so the value can be referenced by each row. Here I've multiplied the standard deviation by 2.

Go Analytical!

```
select a, val
   -- what is 2 standard deviations away
   -- provide result for each row
   ,round(2*stddev(val) over (order by null)) local_std
from outlier_demo
```


A VAL LOCAL_STD
1 27.9 132
2 42.5 132
3 49.7 132
4 22.3 132
5 42.3 132
6 20.2 132
7 33.9 132
8 46.8 132
9 34 132
10 26.6 132
11 29.9 132
12 37 132
13 28.1 132
14 34.6 132
15 35.7 132
16 45.2 132
17 24.1 132
18 35.4 132
19 28.4 132
20 42.2 132
21 33 132
22 30 132
23 47 132
24 32.8 132
25 24 132
26 48.6 132
27 47.4 132
28 41.9 132
29 47.4 132
30 46.2 132
31 25.9 132
32 40.2 132
33 38.7 132
34 42 132
35 35.1 132
36 35.7 132
37 46.4 132
38 47.2 132
39 28.5 132
40 49.3 132
41 29.2 132
42 45.4 132
43 35.3 132
44 27.3 132
45 33.4 132
46 49.3 132
47 47.2 132
48 45.7 132
49 21.8 132
0 500 132

50 rows selected.
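
The same analytic approach can expose the average on every row as well, which makes it possible to see a band around the mean rather than just the width of that band. The following is a sketch only: it uses the empty window `over ()`, which covers the whole result set just as `over (order by null)` does here, and the aliases `all_avg`, `two_std`, `lower_bound` and `upper_bound` are illustrative names, not part of the original script.

```sql
-- Sketch: whole-table average and 2 x stddev repeated on every row,
-- plus the band they define around the mean.
-- All column aliases are illustrative.
select a, val
   ,round(avg(val) over ()) all_avg
   ,round(2*stddev(val) over ()) two_std
   ,round(avg(val) over () - 2*stddev(val) over ()) lower_bound
   ,round(avg(val) over () + 2*stddev(val) over ()) upper_bound
from outlier_demo
```

Any row whose `val` falls outside `lower_bound` and `upper_bound` is more than two standard deviations from the mean.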

Statement 7

This is one solution; how you apply it will depend on how the check would integrate with your SQL. I've turned the previous query into an inline view in the FROM clause, then limited the results to rows where the value is less than twice the standard deviation (132 here). The final result shows the average matching what we expect from the case expression, and the stddev is now much lower, below 10.

The Solution

```
select count(*) c
  ,round(avg(val)) the_avg
  ,round(avg(case when val < 500 then val end)) no_outlier
  ,round(stddev(val)) new_stddev
  ,round(avg(local_std)) filter
from
  (select a, val
   -- what is 2 standard deviations away
   -- provide result for each row
   ,round(2*stddev(val) over (order by null)) local_std
   from outlier_demo
  )
-- only average those values within two standard deviations
where val < local_std
```


C THE_AVG NO_OUTLIER NEW_STDDEV FILTER
49 37 37 9 132
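
Note that the filter above compares `val` directly with twice the standard deviation. That works for this sample because the values are small and positive, but it would discard everything for data centred on, say, 1000 with a small spread. A possible variant, sketched here as an assumption rather than as the original author's method, measures each value's distance from the mean instead. The aliases `all_avg` and `two_std` are illustrative.

```sql
-- Sketch: keep rows whose distance from the mean is under two standard deviations.
-- all_avg and two_std are illustrative aliases, not part of the original script.
select count(*) c
  ,round(avg(val)) the_avg
  ,round(stddev(val)) new_stddev
from
  (select a, val
     ,avg(val) over () all_avg
     ,2*stddev(val) over () two_std
   from outlier_demo
  )
-- only aggregate rows within two standard deviations of the mean, in either direction
where abs(val - all_avg) < two_std
```

For the sample data shown, this should keep the same 49 rows, since the outlier of 500 sits far more than 132 away from an average of about 46 while every value between 20 and 50 sits well inside that range. Because it uses `abs`, it would also catch unusually low values.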
