- Tutorial Aggregating Data using Weighted Averages in Analytic Views
- Description Weighted averages are often needed to correctly aggregate average in aggregate levels of a hierarchy. For example, demographic statistics are typically aggregated with averages weighted by population. This tutorial explains how to easily create a calculated measure that aggregates using a weighted average.
- Tags analytic view, aggregation, weighted average
- Category SQL Analytics
- Contributor Oracle
- Created Wednesday August 15, 2018
- Modules 5

- Module 1
## About Weighted Averages

A simple average calculates the center of a data set. For example, the simple average of 1,2,3,4,5 is 2.5. A simple average is meaningful when each data point counts equally in the average. In many cases each data point does not county equally, for example when certain data points have a higher population or a greater number of units sold. In that case, a weighted average is more appropriate.

Consider a case where there are a certain number of students in a school between the ages of 5 and 10.

- Age 5, 100 students.
- Age 6, 200 students.
- Age 7, 300 students.
- Age 8, 500 students.
- Age 9, 800 students.
- Age 10, 1200 students.
A simple average of ages would be calculated as:

(5 + 6 + 7+ 8 + 9 + 10) / 6 = 7.5

Because there are more older students than younger students, a simple average might be misleading. An average of age weighted by the number of students might be more appropriate. The weighted average would be calculated as the sum of the products of age and number of students, divided by the number of students. In this example:

((5 * 100) + (6 * 200) + (7 * 300) + (8 * 500) + (9 * 800) + (10 * 1200)) / (100 + 200 + 300 + 500 + 800 + 1200)

(500 + 1200 + 2100 + 4000 + 7200 + 12000 ) / 3100

27000 / 3100

Weighted Average = 8.7 years.

Module
2

This tutorial uses data in the AV schema, which contains a star schema with sales data varying by time, product and geography. A copy of the fact table was created in the current schema by the setup script of this tutorial. This copy of the SALES_FACT table includes a new column PERCENT_MARGIN, which is the profit margin on for that particular sale.

```
SELECT * FROM sales_fact WHERE rownum <= 10;
```

The setup script also created three attribute dimensions and three hierarchies.

```
SELECT * FROM user_hier_levels ORDER BY hier_name, order_num;
```

Module
3

By looking at a small slice of the fact table it can be seen that a simple average of PERCENT_MARGIN is not very meaningful. Category -532 has a high margin at 36% but relatively few sales at 675 units. Category -529 has a low margin at 7% and the most sales at 7,336 units. Category -532 should have a greater weighting in the average margin calculation.

```
SELECT * FROM sales_fact WHERE rownum <= 50 AND month_id = 'Apr-15' AND state_province_id = 'TEXAS_US';
```

The following query returns simple average of PERCENT_MARGIN. This might be very misleading.

`SELECT AVG(percent_margin) FROM sales_fact WHERE rownum <= 50 AND month_id = 'Apr-15' AND state_province_id = 'TEXAS_US';`

It is probably more meaningful to weight the average of PERCENT_MARGIN using the number of units sold. The weighted average is calculated as the sum of (UNITS * PERCENT_MARGIN) divided by the sum of UNITS.

```
SELECT
SUM(units * percent_margin) / SUM(units)
FROM
sales_fact
WHERE
rownum <= 50
AND month_id = 'Apr-15'
AND state_province_id = 'TEXAS_US';
```

Using this formula, categories that have a higher number of units sold are given a higher weight than categories with a lower number of units sold.

The following query shows that there are varying numbers of units sold for each product category in South America.

```
SELECT
t.year_name,
p.category_name,
g.region_name,
SUM(f.units) AS units
FROM
av.time_dim t,
av.product_dim p,
av.geography_dim g,
sales_fact f
WHERE
t.month_id = f.month_id
AND p.category_id = f.category_id
AND g.state_province_id = f.state_province_id
AND t.year_name = 'CY2015'
AND g.region_name = 'South America'
GROUP BY
t.year_name,
p.category_name,
g.region_name;
```

The next query calculates the simple and weighted average of percent margin. Note that in this query, PERCENT_MARGIN * UNITS is first calculated for each row and then summed.

```
SELECT
t.year_name,
g.region_name,
SUM(f.units) AS units,
SUM(f.sales) AS sales,
ROUND(AVG(f.percent_margin),3) AS simple_average_percent_margin,
ROUND(SUM(f.percent_margin * f.units) / SUM(f.units),3) AS units_weighted_average_percent_margin
FROM
av.time_dim t,
av.geography_dim g,
sales_fact f
WHERE
t.month_id = f.month_id
AND g.state_province_id = f.state_province_id
AND t.year_name = 'CY2015'
AND g.region_name = 'South America'
GROUP BY
t.year_name,
g.region_name;
```

Module
4

To calculate the weighted average in the analytic view, the same calculation steps are required as with the SELECT from tables.

- Calculate the weighting factor for each detail row.
- Sum the weighting factor and the number of units sold
- Divide the summed weighting factor by the summed units.

To do this:

- Add a virtual column (PERCENT_MARGIN_WEIGHT) to the table to calculate the weighting factor for each detail row (or add a column to a view.)
- Add the measure PERCENT_MARGIN_WEIGHT aggregated by SUM to the analytic view.
- Add the WEIGHTED_AVERAGE_PERCENT_MARGIN calculated measure to the analytic view.

Add PERCENT_MARGIN_WEIGHT to the SALES_FACT table:

```
ALTER TABLE sales_fact ADD percent_margin_weight AS (units * percent_margin);
```

Query the fact table to view the new column:

```
SELECT *
FROM sales_fact
WHERE
rownum <= 50
AND month_id = 'Apr-15'
AND state_province_id = 'TEXAS_US';
```

Create the analytic view (note the weighting factor and weighted average measures):

```
CREATE OR REPLACE ANALYTIC VIEW sales_av
CLASSIFICATION caption VALUE 'Sales AV'
CLASSIFICATION description VALUE 'Sales Analytic View'
CLASSIFICATION created_by VALUE 'George Jones'
USING sales_fact
DIMENSION BY
(time_attr_dim
KEY month_id REFERENCES month_id
HIERARCHIES (
time_hier DEFAULT),
product_attr_dim
KEY category_id REFERENCES category_id
HIERARCHIES (
product_hier DEFAULT),
geography_attr_dim
KEY state_province_id
REFERENCES state_province_id
HIERARCHIES (
geography_hier DEFAULT)
)
MEASURES
(sales FACT sales AGGREGATE BY SUM,
units FACT units AGGREGATE BY SUM,
-- Simple average.
avg_percent_margin FACT percent_margin AGGREGATE BY AVG,
-- Weighting factor.
percent_margin_weight FACT percent_margin_weight AGGREGATE BY SUM,
-- Weighted average.
weighted_average_percent_margin AS (percent_margin_weight / units)
)
DEFAULT MEASURE SALES;
```

Query the analytic view:

```
SELECT
time_hier.member_name AS time,
geography_hier.member_name AS geography,
sales,
units,
ROUND(avg_percent_margin,3) AS average_price_per_unit,
ROUND(weighted_average_percent_margin,3) AS weighted_average_percent_margin
FROM sales_av
HIERARCHIES (
time_hier,
geography_hier)
WHERE
time_hier.level_name = 'YEAR'
AND geography_hier.level_name = 'REGION'
AND time_hier.member_name = 'CY2015'
AND geography_hier.member_name = 'South America'
ORDER BY
time_hier.member_name,
geography_hier.member_name;
```

Module
5

In the SALES_FACT table PERCENT_MARGIN is dense relative to UNITS. That is, for every row where there is UNITS data there is also PERCENT_MARGIN data. Many times it will be the case that one of the measure will be NULL. These cases require addition handling.

Consider the case where a row exists with a value for UNITS and there are NULL values for PERCENT_MARGIN. In this case the sum of UNITS cannot be used at the denominator to the weighted average calculation because weight of those rows is unknown.

Run the following query and note that for every value of UNITS there is a value for PERCENT_MARGIN.

```
SELECT *
FROM
sales_fact
WHERE
state_province_id like '%_MX'
AND month_id = 'Dec-15'
ORDER BY
state_province_id,
category_id;
```

Run the following query to select UNITS, PERCENT_MARGIN_WEIGHT and WEIGHTED_AVERAGE_PERCENT_MARGIN from the tables.

```
SELECT
f.month_id,
g.country_id,
SUM(f.units),
SUM(f.percent_margin_weight) AS percent_margin_weight,
ROUND(SUM(f.percent_margin_weight) / SUM(f.units),3) AS weighted_average_percent_margin
FROM
av.geography_dim g,
sales_fact f
WHERE
g.state_province_id = f.state_province_id
AND g.country_id = 'MX'
AND f.month_id = 'Dec-15'
GROUP BY
f.month_id,
g.country_id;
```

The weighted average percent margin is 12316.52 percent_margin_weight / 95778 units = .129.

Next, set PERCENT_MARGIN = NULL for Baja California.

```
UPDATE sales_fact
SET percent_margin = null
WHERE
month_id = 'Dec-15'
AND state_province_id = 'BAJA_CALIFORNIA_MX';
COMMIT;
```

View the updated data.

```
SELECT *
FROM sales_fact
WHERE
state_province_id LIKE '%_MX'
AND month_id = 'Dec-15'
ORDER BY
state_province_id,
category_id;
```

Note that PERCENT_MARGIN_WEIGHT is also NULL.

Run the following query to select UNITS, PERCENT_MARGIN_WEIGHT and WEIGHTED_AVERAGE_PERCENT_MARGIN again.

```
SELECT
f.month_id,
g.country_id,
SUM(f.units),
SUM(f.percent_margin_weight) AS percent_margin_weight,
ROUND(SUM(f.percent_margin_weight) / SUM(f.units),3) AS weighted_average_percent_margin
FROM
av.geography_dim g,
sales_fact f
WHERE
g.state_province_id = f.state_province_id
AND g.country_id = 'MX'
AND f.month_id = 'Dec-15'
GROUP BY
f.month_id,
g.country_id;
```

Note that the WEIGHTED_AVERAGE_PERCENT_MARGIN is .118 rather than .129. This is because the sum of PERCENT_MARGIN weight has decreased while the sum of UNITS has remained the same.

Because the weight of rows with NULL values is unknown those rows should not be considered in the weighted average calculation. To account for this, create a new column in the fact table that returns NULL when PERCENT_MARGIN is NULL.

```
ALTER TABLE sales_fact
ADD percent_margin_units
AS (CASE
WHEN percent_margin IS NULL THEN NULL
ELSE units
END);
```

View the data.

`SELECT * FROM sales_fact WHERE state_province_id like '%_MX' and month_id = 'Dec-15' ORDER BY state_province_id, category_id;`

Run the following query to view UNITS, PERCENT_MARGIN_UNITS, PERCENT_MARGIN_WEIGHT and WEIGHTED_AVERAGE_PERCENT_MARGIN again.

```
SELECT
f.month_id,
g.country_id,
SUM(f.units),
SUM(f.percent_margin_units),
SUM(f.percent_margin_weight) AS percent_margin_weight,
ROUND(SUM(f.percent_margin_weight) / SUM(f.percent_margin_units),3) AS weighted_average_percent_margin
FROM
av.geography_dim g,
sales_fact f
WHERE
g.state_province_id = f.state_province_id
AND g.country_id = 'MX'
AND f.month_id = 'Dec-15'
GROUP BY
f.month_id,
g.country_id;
```

The rows where PERCENT_MARGIN are NULL are not longer considered in the calculation and the weighted average percent margin is .129.

To apply this the analytic view add the PERCENT_MARGIN_UNITS measure and use it in the denominator of the WEIGHTED_AVERAGE_PERCENT_MARGIN measure.

```
CREATE OR REPLACE ANALYTIC VIEW sales_av
CLASSIFICATION caption VALUE 'Sales AV'
CLASSIFICATION description VALUE 'Sales Analytic View'
CLASSIFICATION created_by VALUE 'George Jones'
USING sales_fact
DIMENSION BY
(time_attr_dim
KEY month_id REFERENCES month_id
HIERARCHIES (
time_hier DEFAULT),
product_attr_dim
KEY category_id REFERENCES category_id
HIERARCHIES (
product_hier DEFAULT),
geography_attr_dim
KEY state_province_id
REFERENCES state_province_id
HIERARCHIES (
geography_hier DEFAULT)
)
MEASURES
(sales FACT sales AGGREGATE BY SUM,
units FACT units AGGREGATE BY SUM,
-- Weighted Average Units.
percent_margin_units FACT percent_margin_units AGGREGATE BY SUM,
-- Simple average.
avg_percent_margin FACT percent_margin AGGREGATE BY AVG,
-- Weighting factor.
percent_margin_weight FACT percent_margin_weight AGGREGATE BY SUM,
-- Weighted average.
weighted_average_percent_margin AS (percent_margin_weight / percent_margin_units)
)
DEFAULT MEASURE SALES;
```

Run the following query to select UNITS, PERCENT_MARGIN_UNITS, PERCENT_MARGIN_WEIGHT and WEIGHTED_AVERAGE_PERCENT_MARGIN from the analytic view.

```
SELECT
time_hier.member_name AS time,
geography_hier.member_name AS geography,
units,
percent_margin_units,
percent_margin_weight,
ROUND(weighted_average_percent_margin,3) AS weighted_average_percent_margin
FROM sales_av
HIERARCHIES (
time_hier,
geography_hier)
WHERE
time_hier.level_name = 'MONTH'
AND geography_hier.level_name = 'COUNTRY'
AND time_hier.member_name = 'Dec-15'
AND geography_hier.member_name = 'Mexico'
ORDER BY
time_hier.member_name,
geography_hier.member_name;
```