The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.
In your Project, locate the Window icon in the tool-bar along the left-hand side of your Project to perform Aggregate, Shift, or Rank. In order to perform the Fill function, simply hover over the column operations menu in your data grid and select the desired Fill operation from the drop-down.
Note: If you do not see the Window tool, contact your System Administrator who can enable this option for you.
Window Aggregates is a tool that enables you to group sets of rows for the purpose of computing a specific function. In Paxata, instead of writing formulas, each calculation is easily created through a few point-and-click Steps.
In Paxata, a Window is defined as the set of rows participating in a computation. Windows can be identified as:
Fixed window: a grouping of rows based on a shared similar value in one or more columns.
Sliding window: a grouping of rows relative to the current row (i.e. rolling or moving functions).
Note: Aggregation in GroupBy is different than the Window Aggregates function. One main distinction is that in GroupBy the number of rows are reduced to only leave the aggregate row value, while Window functions compute the aggregates on each row of the dataset. For more information on GroupBy please see our GroupBy documentation.
Define your Window Aggregates
Create your output column. Notice that a preview of this new column appears next to the column you select to do the function on.
Function type: Drop-down includes Average, Sum, Count, First, Last, Min, Max, Median. Average is the default.
Pick the column to do the computation on. The column type must be numeric for Average, Sum, Min, Max, and Median. Count, First, and Last can be any column type.
Name the output column.
Group and Sort your rows.
Group by: From the drop-down, select which column(s) contains the values by which you want to group your rows by. Select at least one, but you may include multiple columns here to make more/smaller windows. Aggregated rows in the column(s) will show up in ascending order.
Sort by (optional): From the drop-down, select the column(s) to define the order by which your rows within the aggregate should be stacked. Click the blue arrow on the selected column to toggle between sorting the values in ascending or descending order. This is especially important for the First and Last functions, as the selection of the First/Last is done after the sort is applied.
Set the boundaries for the function.
Top of window: For each row in the aggregate, the function will begin with the value in the first row of the window to which it belongs.
Bottom of window: For each row in the aggregate, the function will end with the value in the last row of the window to which it belongs.
Current row with offset of __ rows: Selecting this as the top, bottom, or both boundaries will create a rolling window that changes with relation to the row being computed (the current row). The computation will begin/end with the specified number of rows above or below the row being calculated. For instance, if you select the current row with offset of 0 as starting and ending, the computation for each row will just be based on that single row value. If you offset by -1, it will compute the function, including the row itself and 1 row before. See the examples below.
Top of window to bottom of window (Fixed window)
Top of window to current row with offset (Sliding window)
Current row with offset to bottom of window (Sliding window)
Current row with offset to current row with offset (Sliding window)
Retail Company Sales
The simple dataset below shows regional sales for a retail company.
Use case 1: Fixed Window
The company wants to compare the Average Regional Sales between territories to compare their performances.
With the settings in the image below, the rows have been grouped by Territory and sorted in ascending order alphabetically by their Region.
Setting the boundaries as the top of the window through the bottom window is indicative of a fixed window. This means the average is computed for each row using all of the values in the Regional Sales column within the aggregate. The new preview column, Average Regional Sales, shows up next to the Regional Sales column which is being used for the computation.
You can now see that all rows in a given window have the same calculated average value.
Thus, the company can now compare the Average Regional Sales per Territory and find that, on average, the APAC territory is performing the best and EMEA is performing the worst.
The simple dataset below shows employee salaries in a company.
Use Case 2: Sliding Window
The company wants to analyze salaries within the Sales department compared to the average salary of the two people previously hired in that department.
To do this, group the employees based on their Department, and stack them in the order they were hired using their Employee Number as the sorting order. Instead of finding the average for the entire Sales department, create a sliding (or rolling) window that shows the average salary for each Sales employee and the Sales employee hired immediately prior. Set the window to start one row before each employee (with offset of -1) and ending with that row (offset of 0).
Now, look at the Sales department to see that the window aggregate changes for each computation. Each average is based on the previous and current row.
Using this information, the Company determines employee number 7 is the only employee in the Sales department who earns a salary lower, on average, than the previously hired Sales team members.
Use Case 3: Sliding Window
The company wants to analyze how the average salary in a department changes over time.
To do this, create a sliding window that starts at the top of the window and grows with each row that follows. This is similar to creating a RunningAverage function in Excel, but without needing to write a formula including “$” to set your absolute cells.
Like the previous example, group the rows together by Department and sort the Employee Numbers in ascending order. But unlike the previous example, the top boundary remains as a fixed starting point while the bottom of the window changes with relation to the row being calculated (i.e current row). Thus, set the window to start from the top row and end at the current row with an offset of 0. See the windows created below for the Sales Department:
Using the output column of this sliding window function, you can see that over time the average salary in the Sales Department has generally increased. Also, an interesting trend emerges from this calculation which shows that the variance between the averages is getting smaller as time goes on, meaning salaries are becoming more consistent.
Use Case 4: Sliding Window
The company wants a running total of salaries within each department for the purpose of analyzing which department’s total salary exceeds $15,000.
Grouping by Department and sorting Employee Number in descending order will stack your rows with the most recent hires at the top of your aggregate. Set a sliding window by starting the window with the current row being calculated (offset by 0) and end it at the bottom of the window. This is similar to creating a Running Total in Excel, but without needing to write a formula including “$” to set your absolute cells.
As illustrated in the image above, the running total for the HR and Research departments are both under the $15,000 amount being analyzed. However, the current total payroll for the Sales Department is $25,400. Upon further examination, you can locate when the total salary threshold was crossed. The green window has salaries for Employees 4, 5, and 7 and is below the $15,000 threshold. However, when Employee 10 is hired, the threshold is crossed as indicated by the blue window.
The Shift feature will create a new column next to the original column being referenced with the row values shifted up or down by the set number of cells. You may be familiar with this action as Lag or Lead in SQL. Upon selecting the Shift tool, identify the column to shift and give it a new name. Then, define the direction and amount of rows to offset by. Optionally, you may use the Group and Sort fields to order the row values as needed.
For example, if a company wants to compare monthly sales to the previous month, it would be useful to see the sales figures for consecutive months next to each other for comparison. Use the Shift Down feature to create a new "Previous Month" column based on the original Sales column, with an offset of 1 row. Now, with the Sales values side-by-side the company can easily create a Computed Column in order to quantify the difference between the sales figures month to month.
The Numbering function includes the ability to add a column to your dataset to denote Rank, Dense Rank, or Row Number.
Rank: Two or more rows share the same rank if they have tied values. The next row will have the next nonconsecutive value. For example, two matching cells would both be given the rank of 1, and the next cell will be ranked 3. If you use the Sort by option, the ranking is based on the sort order. If you use the Group by option, the ranking is specified within a group and resets to 1 for each top entry of the next grouping of rows.
Dense Rank: Similar to Rank, each tied row value will share the same rank. However, the next unique value will simply continue with the next consecutive integer, without regard to the number of values that shared the rank before it. For example, two matching cells would both be given the rank of 1, and the next unique cell will be ranked 2. The same Sort and Group by rules apply as the rank operation.
Row Number: This function assigns a sequential ID to each row, starting with 1, based on the sort order of the window (see Image 1 below). Use the Group or Sort fields to change the order of the rows, and the number values will start with 1 for each partition created (see Image 2 below).
The Fill function can be used to populate blank cells within a column based on the known values directly preceding or following the blank(s). To use the Fill function for a column, hover over the column operations menu, scroll to Fill in the drop-down, and then select the action of choice. The Group and Sort by options are available for all Fill functions. Click Save to apply the filled in values to your Project.
Fill Up - When a data point is missing, it will be filled in using the next available non-blank value.
Fill Down - When a data point is missing, it will be filled in using the most recently seen non-blank value.
Fill Average - When a data point is missing, it will be filled in with the average of the nearest previous and following non-blank values within a partition. Consecutive blanks will be filled with the same value because their input values are the same.
Fill Linear Fit - When a series of data points are missing, they will be filled in with values fitting a line between the surrounding available values. This is different from Fill Average, as Fill Average assigns the same average value to all missing values. Fill Linear Fit calculates a linear average and adjusts the values based on the number of missing values.
Note: Fill Average and Linear Fit may only be applied to numeric column types.