DataRobot Paxata Official Cloud DocumentationGroup By Aggregate Functions

(warning) The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.

Group By Aggregate Functions

The following is a list of the Aggregate Functions available to the user during Group By shaping operations. Click the link to learn more about each individual function:


Array
Average
Count
Count (Numbers Only)
Count Distinct
First
Last
Max
Min
Median
Mode
Stdev
Stdevp
Sum
Var
Varp


Array

Of the available aggregate functions available, array is unique in that it operates on both text and numeric values. Rather than perform a mathematical operation on collapsed rows, all values in the reference column (the column to which array is applied) are temporarily stored. As the unique single row is created, array assembles the reference column data from the set into a single, comma-separated string within the column.

Excluding the column to which the array is being applied, all rows are examined column-by-column in order to find identical rows. The header of the reference column receives a name change to become “Array of <column name>”.

Examples
The small dataset below will be used to show how array operates.

Column AColumn BColumn C
1two5
1two6
twotwo7
1two4

Example 1 
Applying the array function to Column C reduces the row count from four to two. The value in the column Array of Column C shows the sum of of the Column C values in the duplicate rows that were collapsed during the operation.

Column AColumn BArray of Column C
1two5.0, 6.0, 4.0
twotwo7.0

Example 2 
Applying the array function to Column A results in no row count reduction since examination of Column B and Column C values in each row reveals that each is already unique. The value in the column Array of Column A therefore displays each row with its original value—however, numbers have been converted into text.

Array of Column AColumn BColumn C
1.0two5
twotwo7
1.0two6
1.0two4

Average

This aggregate function finds an average of the numbers in the reference column (the column to which average is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which the average is being applied, all rows are examined column-by-column in order to find identical rows.

A mathematical average (also called an “arithmetic mean” or simply “mean”) is calculated by adding all of the numeric values in a set and then dividing the resulting sum by the number of items the set contained. Note that attempts to apply average to a text value in the reference column will result in a 0 for that row. The header of the reference column receives a name change to become “Average of <column name>”.

Related aggregate functions are Median and Mode.

Examples
The small dataset below will be used to show how average operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column AColumn BColumn C
1two5
1two6
twotwo7
1two4

Example 1 
Applying the average function to Column C reduces the row count from four to two. The value in the column Average of Column C shows the average of of the Column C values in the duplicate rows that were collapsed during the operation: (5 + 6 + 4) ÷ 3 = 5 while 7 ÷ 1 = 7.

Column AColumn BAverage of Column C
1two5
twotwo7

Example 2 
Applying the average function to Column A results in no row count reduction since examination of Column B and Column C values in each row reveals that each is already unique. The value in the column Average of Column A therefore displays each row with its original numeric value (in this case, a number 1) since the average function found no duplicate rows that could be involved in a mathematical operation. The 0 replaces the text “two” because the average operation cannot be applied to text values.

Average of Column AColumn BColumn C
1two5
1two6
1two4
0two7


Count

This aggregate function returns the number of duplicate rows in a dataset. Excluding the column to which the count is being applied (the reference column), all rows are examined column-by-column in order to find duplicate rows. Those rows that contain duplicate data are collapsed into a single unique row. The reference column receives a name change to become “Count of <column name>” and the number that appears in the column indicates the number of duplicate rows that have been collapsed.

Examples 
The small dataset below will be used to show how count operates.

Column AColumn BColumn C
onetwo5
onetwo6
twotwo7
onetwo4

Example 1
Applying the count function to Column C reduces the row count from four to two. The value in the column Count of Column C shows the count of (number of times) duplicate rows that were collapsed during the operation.

Column AColumn BCount of Column C
onetwo3
twotwo1

Example 2 
Applying the count function to Column A results in no row count reduction since examination of Column B and Column C values in each row reveals that each is already unique. The value in the column Count of Column A therefore shows a value of 1 for each of the four rows.

Count of Column AColumn BColumn C
1two5
1two6
1two7
1two4



Count (Numbers Only)

This aggregate function operates exactly the same as the Count function (see "Count" above.) However, Count (Numbers Only) only counts numeric values and ignores text values during the counting process.



Count Distinct

Count Distinct returns the number of unique values in the column being counted, as opposed to the Count function which counts all values.



First

The First function searches the data for duplicate rows (based on the columns chosen by the user) and condenses them into a single row of data. The First function will then display the first value within the data that appeared for the duplicate rows. Any values following the first value will be lost during this process.

Example
The small dataset below will be used to show how First operates.

Column AColumn BColumn C
onetwo5
onetwo6
twotwo7
onetwo4

Applying the First function to Column C reduces the row count from four to two. The value in the column First of Column C shows the first value from Column C of the duplicate rows that were collapsed during the operation.

Column AColumn BFirst of Column C
onetwo5
twotwo7



Last

The Last function searches the data for duplicate rows (based on the columns chosen by the user) and condenses them into a single row of data. The Last function will then display the last value within the data that appeared for the duplicate rows. Any values prior to the last value will be lost during this process.

Example
The small dataset below will be used to show how Last operates.

Column AColumn BColumn C
onetwo5
onetwo6
twotwo7
onetwo4

Applying the Last function to Column C reduces the row count from four to two. The value in the column Last of Column C shows the last value from Column C of the duplicate rows that were collapsed during the operation.

Column AColumn BLast of Column C
onetwo4
twotwo7



Max

This aggregate function returns the largest of the numbers in the reference column (the column to which max is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which the max is being applied, all rows are examined column-by-column in order to find identical rows.

The counterpart to this function is Min (Minimum). See the next section for details.

Example
The small dataset below will be used to show how max operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column AColumn BColumn C
1two5
1two6
twotwo7
1two4

Applying the max function to Column C reduces the row count from four to two. The value in the column Max of Column C shows the maximum of of the Column C values in the duplicate rows that were collapsed during the operation.

In the dataset that is returned by the max function (shown below) number 6 in the first row resulted from the set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when Column A and Column B were examined. (Column C was excluded from this examination because it is the reference column.) In this set of three numbers, 6 is the largest—therefore it became the value shown in the reference column.

The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is both the minimum and maximum of the single-row set, the function returns a 7 for that row.


Min

This aggregate function returns the smallest of the numbers in the reference column (the column to which min is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which the min is being applied, all rows are examined column-by-column in order to find identical rows.

The counterpart to this function is Max (Maximum). See the previous section for details.

Example
The small dataset below will be used to show how min operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column AColumn BColumn C
1two5
1two6
twotwo7
1two4

Applying the min function to Column C reduces the row count from four to two. The value in the column Min of Column C shows the minimum of of the Column C values in the duplicate rows that were collapsed during the operation.

In the dataset that is returned by the min function (shown below) number 4 in the first row resulted from the set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when Column A and Column B were examined. (Column C was excluded from this examination because it is the reference column.) In this set of three numbers, 4 is the smallest—therefore it became the value shown in the reference column.

The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is both the minimum and maximum of the single-row set, the function returns a 7 for that row.

Column AColumn BMin of Column C
1two4
twotwo7



Median

This aggregate function finds the median of the numbers in the reference column (the column to which median is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which the median is being applied, all rows are examined column-by-column in order to find identical rows.

A median value is one in the middle of a range of numbers ordered from lowest to highest value. This means that half of the numbers are to the “right” of the value returned and half of the numbers are to the “left” of the value returned. Where there is an even set of numbers (i.e., no single number in the middle of the collection), the function calculates an average on the pair of numbers in the middle of the range (i.e., the two numbers on either side of the mid-point).

Note that attempts to apply median to a text value in the reference column will result in a error for that row. The header of the reference column receives a name change to become “Median of <column name>”.

Related aggregate functions are Average and Mode.

Example
The small dataset below will be used to show how median operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column AColumn BColumn C
onetwo5
onetwo6
twotwo7
onetwo4

Applying the median function to Column C reduces the row count from four to two. The value in the column Median of Column C shows the median of the Column C values in the duplicate rows that were collapsed during the operation.

In the dataset that is returned by the median function (shown below) number 5 in the first row resulted from the ordered set of numbers {4, 5, 6}. Each of these numbers exists in this set because each was a member of an identical row when Column A and Column Bwere examined. (Column C was excluded from this examination because it is the reference column.) In this set of three numbers, 5 is the middle value with one number in the set on either side of it.

The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is in the middle of the set (there are zero numbers on either side of it) the function returns a 7 for that row.

Column AColumn BMedian of Column C
onetwo5
twotwo7



Mode

A mode is the value that occurs most frequently in a set of numbers. This aggregate function finds the most frequently occurring number in the reference column (the column to which mode is applied) among those rows that are identical. All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which mode is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of a set against which the mode function operates.

Although mode will still combine rows that have text in the reference column, it ignores text in finding the actual mode for the set. The header of the reference column receives a name change to become “Mode of <column name>”.

Important: Where there is a “tie” for the mode of a set (i.e., where multiple numbers have an equal number of occurrences and there is no other number that appears more frequently) the result of mode is unpredictable. Although one of the “tied” numbers will appear in the column “Mode of <column name>” for the appropriate row, it is not possible to determine which of the values will occur.

Related aggregate functions are Average and Median.

Example
The small dataset below will be used to show how mode operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column AColumn BColumn C
onetwo3
onetwo6
twotwo7
onetwo3

Applying the mode function to Column C reduces the row count from four to two. The value in the column Mode of Column C shows the mode of the Column C values in the duplicate rows that were collapsed during the operation.

In the resulting dataset (shown below) number 3 in the first row resulted from the ordered set of numbers {3, 6, 3} where each member of the set was contributed by row that was duplicate and then collapsed into a single row. In this set of three numbers, 3 occurs most frequently (twice in the set of three numbers) and is therefore the mode for the set.

The number 7 results from a set with a single number {7} because there were no duplicate rows that could contribute other numbers. Since 7 is the only number in the set, it is therefore also the most frequently occurring.

Column AColumn BMedian of Column C
onetwo3
twotwo7



Stdev (Standard Deviation)

Estimates the standard deviation (how much variation from the average) exists within a sample set of data. This aggregate function calculates the standard deviation of the numeric values in the reference column (the column to which stdev is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which stdev is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the standard deviation calculation. The header of the reference column receives a name change to become “Stdev of <column name>”.

If there are text values in the reference column, they will be ignored within the stdev calculation. Note also that the aggregate stdev function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

The standard deviation for data is the square root of its Var. If the set under analysis represents all data points (referred to as a “population”) use of Stdevp is recommended for a more accurate result. A related function that deals with statistical variance is Varp.

Example
The dataset below will be used to show how stdev operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column AColumn BColumn C
onetwo0.2
onetwo0.1
onetwo1.1
onetwo0.2
onetwo0.6
oneone0.2
oneone0.27
onetwo0.2
onetwo0.4

As shown in the table below, applying the stdev function to Column C reduces the row count from nine to two. The value in the column Stdev of Column C shows the standard deviation of the Column C sample data values in the duplicate rows that were collapsed during the operation.

Column AColumn BStdev of Column C
onetwo0.3511884584284246
oneone0.049497474683058325



Stdevp (Standard Deviation for a Population)

Computes the standard deviation (how much variation from the average) exists within an entire set of data (population). This aggregate function calculates the standard deviation for a population using the numeric values in the reference column (the column to which stdevp is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which stdevp is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the standard deviation calculation for the population. The header of the reference column receives a name change to become “StdevP of <column name>”.

If there are text values in the reference column, they will be ignored within the stdevp calculation. Note also that the aggregate stdevp function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

If the set under analysis represents only a sample of data, use of Stdev is recommended for a more accurate result. Other related functions that deal with statistical variance are Var and Varp.

Example
The dataset below will be used to show how stdevp operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column AColumn BColumn C
onetwo0.2
onetwo0.1
onetwo1.1
onetwo0.2
onetwo0.6
oneone0.2
oneone0.27
onetwo0.2
onetwo0.4

As shown in the table below, applying the stdevp function to Column C reduces the row count from nine to two. The value in the column StdevP of Column C shows the standard deviation of the population contained in Column C among the duplicate rows that were collapsed during the operation.

Column AColumn BStdevP of Column C
onetwo0.32513733362117264
oneone0.034999999999999996




Sum

This aggregate function executes an addition operation on the numbers in the reference column (the column to which sum is applied) as all duplicate rows are collapsed into unique single rows. Excluding the column to which the sum is being applied, all rows are examined column-by-column in order to find identical rows. Note that attempts to apply sum to a text value in the reference column will result in a 0 for that row. The header of the reference column receives a name change to become “Sum of <column name>”.

Examples
The small dataset below will be used to show how sum operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column AColumn BColumn C
1two5
1two6
twotwo7
1two4

Example 1
Applying the sum function to Column C reduces the row count from four to two. The value in the column Sum of Column C shows the sum of of the Column C values in the duplicate rows that were collapsed during the operation: 5 + 6 + 4 = 15 while 7 + 0 = 7.

Column AColumn BSum of Column C
1two15
twotwo7

Example 2 
Applying the sum function to Column A results in no row count reduction since examination of Column B and Column C values in each row reveals that each is already unique. The value in the column Sum of Column A therefore displays each row with its original numeric value (in this case, a number 1) since the sum function found no duplicate rows that could be involved in an adding operation. The 0 replaces the text “two” because the sum operation cannot be applied to text values.

Sum of Column AColumn BColumn C
1two5
1two6
1two4
0two7

Var (Variance)

Estimates the how much dispersion exists (how much the values are spread out) within a sample set of data. This aggregate function calculates the variance of the numeric values in the reference column (the column to which var is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which var is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the variance calculation. The header of the reference column receives a name change to become “Var of <column name>”.

If there are text values in the reference column, they will be ignored within the var calculation. Note also that the aggregate var function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

If the set under analysis represents all data points (referred to as a “population”) use of Varp is recommended for a more accurate result. Related functions that deal with statistical variance are Stdev and Stdevp.

Example
The dataset below will be used to show how var operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column AColumn BColumn C
onetwo0.2
onetwo0.1
onetwo1.1
onetwo0.2
onetwo0.6
oneone0.2
oneone0.27
onetwo0.2
onetwo0.4

As shown in the table below, applying the var function to Column C reduces the row count from nine to two. The value in the column Var of Column C shows the variance of theColumn C sample data values in the duplicate rows that were collapsed during the operation.

Column AColumn BVar of Column C
onetwo0.12333333333333334
oneone0.00245

Varp (Variance for a Population)


Computes the dispersion that exists (how much the values are spread out) for an entire set of data (population). This aggregate function calculates the variance for a population contained within the reference column (the column to which varp is applied) among those rows that are identical.

All duplicate rows are discovered by conducting a column-by-column examination (excluding the column to which varp is being applied) before they are collapsed into unique single rows. For each resulting single row, the value in the reference column from the contributing duplicate row becomes part of the variance calculation for the population. The header of the reference column receives a name change to become “VarP of <column name>”.

If there are text values in the reference column, they will be ignored within the varp calculation. Note also that the aggregate varp function requires at least two values. In other words, there must be at least two identical rows available for aggregation for each unique single row that is returned. Rows that occur only once will contribute only one value in the reference column for the calculation—resulting in an error.

If the set under analysis represents only a sample set of data, use of Var is recommended for a more accurate result. Related functions that deal with statistical variance are Stdev and Stdevp.

Example
The dataset below will be used to show how varp operates. All numbers are numeric values (not text) in order to illustrate behavior of this aggregate function.

Column AColumn BColumn C
onetwo0.2
onetwo0.1
onetwo1.1
onetwo0.2
onetwo0.6
oneone0.2
oneone0.27
onetwo0.2
onetwo0.4

As shown in the table below, applying the varp function to Column C reduces the row count from nine to two. The value in the column VarP of Column C shows the variance of the population contained in Column C among the duplicate rows that were collapsed during the operation.

Column AColumn BVarP of Column C
onetwo0.10571428571428572
oneone0.001225