**The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site.** See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.

# Group By Aggregate Functions

The following is a list of the Aggregate Functions available to the user during **Group By** shaping operations. Click the link to learn more about each individual function:

Array

Average

Count

Count (Numbers Only)

Count Distinct

First

Last

Max

Min

Median

Mode

Stdev

Stdevp

Sum

Var

Varp

## Array

Of the available aggregate functions available, **array** is unique in that it operates on both text and numeric values. Rather than perform a mathematical operation on collapsed rows, all values in the reference column (the column to which **array** is applied) are temporarily stored. As the unique single row is created, **array** assembles the reference column data from the set into a single, comma-separated string within the column.

Excluding the column to which the **array** is being applied, all rows are examined column-by-column in order to find identical rows. The header of the reference column receives a name change to become “Array of <*column name*>”.

**Examples**The small dataset below will be used to show how

**array**operates.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

two | two | 7 |

1 | two | 4 |

**Example 1**

Applying the **array** function to *Column C* reduces the row count from four to two. The value in the column *Array of Column C* shows the sum of of the *Column C* values in the duplicate rows that were collapsed during the operation.

Column A | Column B | Array of Column C |
---|---|---|

1 | two | 5.0, 6.0, 4.0 |

two | two | 7.0 |

**Example 2 **

Applying the **array** function to *Column A* results in no row count reduction since examination of *Column B* and *Column C* values in each row reveals that each is already unique. The value in the column *Array of Column A* therefore displays each row with its original value—however, numbers have been converted into text.

Array of Column A | Column B | Column C |
---|---|---|

1.0 | two | 5 |

two | two | 7 |

1.0 | two | 6 |

1.0 | two | 4 |

## Average

This aggregate function finds an average of the numbers in the reference column (the column to which **average** is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which the **average** is being applied, all rows are examined column-by-column in order to find identical rows.

A mathematical average (also called an “arithmetic mean” or simply “mean”) is calculated by adding all of the numeric values in a set and then dividing the resulting sum by the number of items the set contained. Note that attempts to apply **average** to a text value in the reference column will result in a **0** for that row. The header of the reference column receives a name change to become “Average of <*column name*>”.

Related aggregate functions are Median and Mode.

**Examples**The small dataset below will be used to show how

**average**operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

two | two | 7 |

1 | two | 4 |

**Example 1 **

Applying the **average** function to *Column C* reduces the row count from four to two. The value in the column *Average of Column C* shows the average of of the *Column C* values in the duplicate rows that were collapsed during the operation: (5 + 6 + 4) ÷ 3 = 5 while 7 ÷ 1 = 7.

Column A | Column B | Average of Column C |
---|---|---|

1 | two | 5 |

two | two | 7 |

**Example 2 **Applying the

**average**function to

*Column A*results in no row count reduction since examination of

*Column B*and

*Column C*values in each row reveals that each is already unique. The value in the column

*Average of Column A*therefore displays each row with its original numeric value (in this case, a number

**1**) since the

**average**function found no duplicate rows that could be involved in a mathematical operation. The

**0**replaces the text “two” because the

**average**operation cannot be applied to text values.

Average of Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

1 | two | 4 |

0 | two | 7 |

## Count

This aggregate function returns the number of duplicate rows in a dataset. Excluding the column to which the **count** is being applied (the reference column), all rows are examined column-by-column in order to find duplicate rows. Those rows that contain duplicate data are collapsed into a single unique row. The reference column receives a name change to become “Count of <*column name*>” and the number that appears in the column indicates the number of duplicate rows that have been collapsed.

**Examples**

The small dataset below will be used to show how **count** operates.

Column A | Column B | Column C |
---|---|---|

one | two | 5 |

one | two | 6 |

two | two | 7 |

one | two | 4 |

**Example 1**

Applying the **count** function to *Column C* reduces the row count from four to two. The value in the column *Count of Column C* shows the count of (number of times) duplicate rows that were collapsed during the operation.

Column A | Column B | Count of Column C |
---|---|---|

one | two | 3 |

two | two | 1 |

**Example 2 **

Applying the **count** function to *Column A* results in no row count reduction since examination of *Column B* and *Column C* values in each row reveals that each is already unique. The value in the column *Count of Column A* therefore shows a value of **1** for each of the four rows.

Count of Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

1 | two | 7 |

1 | two | 4 |

## Count (Numbers Only)

This aggregate function operates exactly the same as the Count function (see "**Count**" above.) However, **Count (Numbers Only)** only counts numeric values and ignores text values during the counting process.

## Count Distinct

Count Distinct returns the number of unique values in the column being counted, as opposed to the Count function which counts all values.

## First

The First function searches the data for duplicate rows (based on the columns chosen by the user) and condenses them into a single row of data. The First function will then display the first value within the data that appeared for the duplicate rows. Any values following the first value will be lost during this process.

**Example**The small dataset below will be used to show how First operates.

Column A | Column B | Column C |
---|---|---|

one | two | 5 |

one | two | 6 |

two | two | 7 |

one | two | 4 |

Applying the First function to *Column C* reduces the row count from four to two. The value in the column *First of Column C* shows the first value from Column C of the duplicate rows that were collapsed during the operation.

Column A | Column B | First of Column C |
---|---|---|

one | two | 5 |

two | two | 7 |

## Last

The Last function searches the data for duplicate rows (based on the columns chosen by the user) and condenses them into a single row of data. The Last function will then display the last value within the data that appeared for the duplicate rows. Any values prior to the last value will be lost during this process.

**Example**The small dataset below will be used to show how Last operates.

Column A | Column B | Column C |
---|---|---|

one | two | 5 |

one | two | 6 |

two | two | 7 |

one | two | 4 |

Applying the Last function to *Column C* reduces the row count from four to two. The value in the column *Last** of Column C* shows the last value from Column C of the duplicate rows that were collapsed during the operation.

Column A | Column B | Last of Column C |
---|---|---|

one | two | 4 |

two | two | 7 |

## Max

This aggregate function returns the largest of the numbers in the reference column (the column to which **max** is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which the **max** is being applied, all rows are examined column-by-column in order to find identical rows.

The counterpart to this function is **Min (Minimum)**. See the next section for details.

**Example**The small dataset below will be used to show how

**max**operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

two | two | 7 |

1 | two | 4 |

Applying the **max** function to *Column C* reduces the row count from four to two. The value in the column *Max of Column C* shows the maximum of of the *Column C* values in the duplicate rows that were collapsed during the operation.

In the dataset that is returned by the **max** function (shown below) number **6** in the first row resulted from the set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when *Column A* and *Column B* were examined. (*Column C* was excluded from this examination because it is the reference column.) In this set of three numbers, **6** is the largest—therefore it became the value shown in the reference column.

The number **7** results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since **7** is both the minimum and maximum of the single-row set, the function returns a **7** for that row.

## Min

This aggregate function returns the smallest of the numbers in the reference column (the column to which **min** is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which the **min** is being applied, all rows are examined column-by-column in order to find identical rows.

The counterpart to this function is **Max (Maximum)**. See the previous section for details.

**Example**The small dataset below will be used to show how

**min**operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

two | two | 7 |

1 | two | 4 |

Applying the **min** function to *Column C* reduces the row count from four to two. The value in the column *Min of Column C* shows the minimum of of the *Column C* values in the duplicate rows that were collapsed during the operation.

In the dataset that is returned by the **min** function (shown below) number **4** in the first row resulted from the set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when *Column A* and *Column B* were examined. (*Column C* was excluded from this examination because it is the reference column.) In this set of three numbers, **4** is the smallest—therefore it became the value shown in the reference column.

The number **7** results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since **7** is both the minimum and maximum of the single-row set, the function returns a **7** for that row.

Column A | Column B | Min of Column C |
---|---|---|

1 | two | 4 |

two | two | 7 |

## Median

This aggregate function finds the median of the numbers in the reference column (the column to which **median** is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which the **median** is being applied, all rows are examined column-by-column in order to find identical rows.

A median value is one in the middle of a range of numbers ordered from lowest to highest value. This means that half of the numbers are to the “right” of the value returned and half of the numbers are to the “left” of the value returned. Where there is an even set of numbers (*i.e.*, no single number in the middle of the collection), the function calculates an average on the pair of numbers in the middle of the range (*i.e.*, the two numbers on either side of the mid-point).

Note that attempts to apply **median** to a text value in the reference column will result in a error for that row. The header of the reference column receives a name change to become “Median of <*column name*>”.

Related aggregate functions are Average and Mode.

**Example**The small dataset below will be used to show how

**median**operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

one | two | 5 |

one | two | 6 |

two | two | 7 |

one | two | 4 |

Applying the **median** function to *Column C* reduces the row count from four to two. The value in the column *Median of Column C* shows the median of the *Column C* values in the duplicate rows that were collapsed during the operation.

In the dataset that is returned by the **median** function (shown below) number **5** in the first row resulted from the ordered set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when *Column A* and *Column B*were examined. (*Column C* was excluded from this examination because it is the reference column.) In this set of three numbers, **5** is the middle value with one number in the set on either side of it.

The number **7** results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since **7** is in the middle of the set (there are zero numbers on either side of it) the function returns a **7** for that row.

Column A | Column B | Median of Column C |
---|---|---|

one | two | 5 |

two | two | 7 |

## Mode

A mode is the value that occurs most frequently in a set of numbers. This aggregate function finds the most frequently occurring number in the reference column (the column to which **mode** is applied) among those rows that are identical. All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which **mode** is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of a set against which the mode function operates.

Although **mode** will still combine rows that have text in the reference column, it ignores text in finding the actual mode for the set. The header of the reference column receives a name change to become “Mode of <*column name*>”.

**Important:** Where there is a “tie” for the mode of a set (*i.e.*, where multiple numbers have an equal number of occurrences and there is no other number that appears more frequently) the result of **mode** is unpredictable. Although one of the “tied” numbers will appear in the column “Mode of <*column name*>” for the appropriate row, it is not possible to determine which of the values will occur.

Related aggregate functions are Average and Median.

**Example**The small dataset below will be used to show how

**mode**operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

one | two | 3 |

one | two | 6 |

two | two | 7 |

one | two | 3 |

Applying the **mode** function to *Column C* reduces the row count from four to two. The value in the column *Mode of Column C* shows the mode of the *Column C* values in the duplicate rows that were collapsed during the operation.

In the resulting dataset (shown below) number **3** in the first row resulted from the ordered set of numbers {3, 6, 3} where each member of the set was contributed by row that was duplicate and then collapsed into a single row. In this set of three numbers, **3** occurs most frequently (twice in the set of three numbers) and is therefore the mode for the set.

The number **7** results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since **7** is the only number in the set, it is therefore also the most frequently occurring.

Column A | Column B | Median of Column C |
---|---|---|

one | two | 3 |

two | two | 7 |

## Stdev (Standard Deviation)

Estimates the standard deviation (how much variation from the average) exists within a sample set of data. This aggregate function calculates the standard deviation of the numeric values in the reference column (the column to which **stdev** is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which **stdev** is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the standard deviation calculation. The header of the reference column receives a name change to become “Stdev of <*column name*>”.

If there are text values in the reference column, they will be ignored within the **stdev **calculation. Note also that the aggregate **stdev** function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

The standard deviation for data is the square root of its Var. If the set under analysis represents all data points (referred to as a “population”) use of Stdevp is recommended for a more accurate result. A related function that deals with statistical variance is Varp.

**Example**The dataset below will be used to show how

**stdev**operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

one | two | 0.2 |

one | two | 0.1 |

one | two | 1.1 |

one | two | 0.2 |

one | two | 0.6 |

one | one | 0.2 |

one | one | 0.27 |

one | two | 0.2 |

one | two | 0.4 |

As shown in the table below, applying the **stdev** function to *Column C* reduces the row count from nine to two. The value in the column *Stdev of Column C* shows the standard deviation of the *Column C* sample data values in the duplicate rows that were collapsed during the operation.

Column A | Column B | Stdev of Column C |
---|---|---|

one | two | 0.3511884584284246 |

one | one | 0.049497474683058325 |

## Stdevp (Standard Deviation for a Population)

Computes the standard deviation (how much variation from the average) exists within an entire set of data (population). This aggregate function calculates the standard deviation for a population using the numeric values in the reference column (the column to which **stdevp** is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which **stdevp** is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the standard deviation calculation for the population. The header of the reference column receives a name change to become “StdevP of <*column name*>”.

If there are text values in the reference column, they will be ignored within the **stdevp **calculation. Note also that the aggregate **stdevp** function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

If the set under analysis represents only a sample of data, use of Stdev is recommended for a more accurate result. Other related functions that deal with statistical variance are Var and Varp.

**Example**The dataset below will be used to show how

**stdevp**operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

one | two | 0.2 |

one | two | 0.1 |

one | two | 1.1 |

one | two | 0.2 |

one | two | 0.6 |

one | one | 0.2 |

one | one | 0.27 |

one | two | 0.2 |

one | two | 0.4 |

As shown in the table below, applying the **stdevp** function to *Column C* reduces the row count from nine to two. The value in the column *StdevP of Column C* shows the standard deviation of the population contained in *Column C* among the duplicate rows that were collapsed during the operation.

Column A | Column B | StdevP of Column C |
---|---|---|

one | two | 0.32513733362117264 |

one | one | 0.034999999999999996 |

## Sum

This aggregate function executes an addition operation on the numbers in the reference column (the column to which **sum** is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which the **sum** is being applied, all rows are examined column-by-column in order to find identical rows. Note that attempts to apply **sum** to a text value in the reference column will result in a **0** for that row. The header of the reference column receives a name change to become “Sum of <*column name*>”.

**Examples**

The small dataset below will be used to show how **sum** operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

two | two | 7 |

1 | two | 4 |

**Example 1**

Applying the **sum** function to *Column C* reduces the row count from four to two. The value in the column *Sum of Column C* shows the sum of of the *Column C* values in the duplicate rows that were collapsed during the operation: 5 + 6 + 4 = 15 while 7 + 0 = 7.

Column A | Column B | Sum of Column C |
---|---|---|

1 | two | 15 |

two | two | 7 |

**Example 2 **

Applying the **sum** function to *Column A* results in no row count reduction since examination of *Column B* and *Column C* values in each row reveals that each is already unique. The value in the column *Sum of Column A* therefore displays each row with its original numeric value (in this case, a number **1**) since the **sum** function found no duplicate rows that could be involved in an adding operation. The **0** replaces the text “two” because the **sum** operation cannot be applied to text values.

Sum of Column A | Column B | Column C |
---|---|---|

1 | two | 5 |

1 | two | 6 |

1 | two | 4 |

0 | two | 7 |

## Var (Variance)

Estimates the how much dispersion exists (how much the values are spread out) within a sample set of data. This aggregate function calculates the variance of the numeric values in the reference column (the column to which **var** is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which **var** is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the variance calculation. The header of the reference column receives a name change to become “Var of <*column name*>”.

If there are text values in the reference column, they will be ignored within the **var **calculation. Note also that the aggregate **var** function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

If the set under analysis represents all data points (referred to as a “population”) use of Varp is recommended for a more accurate result. Related functions that deal with statistical variance are Stdev and Stdevp.

**Example**The dataset below will be used to show how

**var**operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

one | two | 0.2 |

one | two | 0.1 |

one | two | 1.1 |

one | two | 0.2 |

one | two | 0.6 |

one | one | 0.2 |

one | one | 0.27 |

one | two | 0.2 |

one | two | 0.4 |

As shown in the table below, applying the **var** function to *Column C* reduces the row count from nine to two. The value in the column *Var of Column C* shows the variance of the*Column C* sample data values in the duplicate rows that were collapsed during the operation.

Column A | Column B | Var of Column C |
---|---|---|

one | two | 0.12333333333333334 |

one | one | 0.00245 |

## Varp (Variance for a Population)

Computes the dispersion that exists (how much the values are spread out) for an entire set of data (population). This aggregate function calculates the variance for a population contained within the reference column (the column to which **varp** is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which **varp** is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the variance calculation for the population. The header of the reference column receives a name change to become “VarP of <*column name*>”.

If there are text values in the reference column, they will be ignored within the **varp **calculation. Note also that the aggregate **varp** function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

If the set under analysis represents only a sample set of data, use of Var is recommended for a more accurate result. Related functions that deal with statistical variance are Stdev and Stdevp.

**Example**The dataset below will be used to show how

**varp**operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column A | Column B | Column C |
---|---|---|

one | two | 0.2 |

one | two | 0.1 |

one | two | 1.1 |

one | two | 0.2 |

one | two | 0.6 |

one | one | 0.2 |

one | one | 0.27 |

one | two | 0.2 |

one | two | 0.4 |

As shown in the table below, applying the **varp** function to *Column C* reduces the row count from nine to two. The value in the column *VarP of Column C* shows the variance of the population contained in *Column C* among the duplicate rows that were collapsed during the operation.

Column A | Column B | VarP of Column C |
---|---|---|

one | two | 0.10571428571428572 |

one | one | 0.001225 |