Data Shaping Tools explained

The following operations can be performed with the Shape Tool:

Each are discussed below.


Deduplicate

The Deduplicate function searches your data for rows that match each other exactly, and condenses them into a single row, removing duplicate entries from your data.

In the Deduplicate window, you have the option to add/remove columns from the Columns field. Any columns added to this field will be included in the deduplication process. As you add columns, they will appear in the preview. NOTE: Only those columns added to the Columns field will be included in the deduplication process and resulting output; any columns not selected will not be considered during the deduplication process and will be removed from the data upon its completion. Once you are satisfied with the parameters of the deduplication function, click the Save button to finalize the process and commit it to your Project.

A Beta feature that can be enabled for the Deduplicate function is a "Fuzzy" option. When enabled, you see this option as an alternate selection to "Exact" match.



Fuzzy deduplication finds matching rows using a fuzzy algorithm. This means similar values, that are not exact matches, are grouped together and deduplicated. For example:

Similarly, Fuzzy Deduplication will group similar items even if they have blank values:

 

To enable this Beta feature, contact Paxata System Administrator and request that Fuzzy Deduplication be enabled.


Group By

This function allows for the user to utilize a number of different aggregate functions (listed below) on any of the existing columns within the dataset. When the Group By button is selected, a pane will appear above the dataset which allows the user to specify which columns to include in the Group By process, which column they wish to perform an aggregate function on, which aggregate function to use, as well as specify the name of the new aggregate column being created. The data will display a preview of the selections made by the user, highlighted in blue, so that you can see how it will affect the data. It is important to remember that only those columns included in the "Columns (Aggregates)" field will remain in your data following the Group By function. Those columns included will be used when identifying duplicate rows for grouping.

For a list of available aggregate functions, see Group by Aggregate Functions.

These operations are called aggregate because they find matching rows in the dataset and then combine them into one row. A matching row is defined as one that, excluding the reference column, share the same values in a column-by-column examination. The reference column is excluded from the column-by-column examination because its value are submitted to the aggregate function in order to produce the reference column value in the single-row result.


Transpose

Transpose is the simplest of the Shape functions, essentially allowing you to switch your rows and columns (imagine rotating your data 90 degrees).

The transpose function allows you to select any one column to create new column headers with. The values of the selected column will become the new column headers, while all other column headers will shift and become row headers (unless the user removes them during the transpose process). In this new transposition, if there is more than one value that matches a particular set of headings, Paxata will display the last available value from the original dataset.

Here we have a simple sales dataset:

Upon opening the Shaping function, you will have five options to choose from. Selecting the "Transpose" button at the top of the Shaping window will open the Transpose options.

You will be able to choose one column for the column header (the values of this chosen column will become the new column headers) as well as any columns you would like to include as rows in the newly transposed data. Pay attention to the grid below the Shaping window as it will display a preview of how the data will appear based on the options you have chosen for the Transpose process.

Be aware, depending on which column you choose for your transpose, you may not end up with a complete picture of your data. Notice how in our example there is more than one instance where the "Item Sold" value is "Baseball". If you were to transpose with the "Item Sold" as your new columns, Paxata will only display the last value in the dataset, as shown here:

Once you are satisified with the options you have chosen for the Transpose process, clicking the "Save" button in the upper-right hand corner of the screen will finalize and commit the process to your project.


Pivot

Pivot is similar to transpose in that it allows you to rearrange your column headers into row headers; but, it also allows you to perform aggregate functions on a selected column to be displayed as the body of data within your pivot table. Unlike the Transpose function, multiple headings can be chosen as column headers.

When you open the Shaping function and select the Pivot button, the Pivot options will display. You will be able to choose which columns to use as your column headers, as well as which columns to include as rows in the Pivot table to be created. Finally, you will also be able to choose which aggregate function will be conducted on which column to create the body of the Pivot table (clicking the "+" button will allow you to add more than one aggregation to your Pivot table).

The aggregate function chosen will be conducted on the chosen column, and then display the total based on the intersection of your chosen Column and Row headers. For a list of the available Aggregate functions, see Group by Aggregate Functions.

Remember to pay attention to the data displayed below the Shaping window, as it will show a preview of the Pivot table you are generating based on your selected options. Once you are satisfied with the setup of your Pivot table, clicking the "Save" button in the upper-right hand corner of the screen will finalize and commit the action to your project.



Depivot

The Depivot function is a very powerful function that allows you to take your data and stack columns into two columns (one column containing the Column headers that the original data came from, a second column containing the value from that column). While a very powerful feature, for users who have never used a function like Depivot, it can be tough to understand exactly what is happening during the Depivot process.

In the Depivot window, you will be given multiple options to choose/designate.

Row Labels: Here, you select which columns you want to remain static in your data and not be "stacked" during the Depivot process. 

Values: This option allows you to choose which columns from your data will be included (or "stacked") during the Depivot process.

Column Label: Allows you to designate a column name for the new column containing the Column Labels from the columns selected in the "Values" field.

Value Label: Allows you to designate a column name for the new column containing the values from the columns selected in the "Values" field.

In the example above, the preview displays how the Depivot function takes the selected columns from the "Values" field and "stacks" them into the two new columns. For an easier way to visualize exactly what is happening, view the images below. The columns have been color coded to make it more easily noticeable where specific values are coming from and moving to during the Depivot process.

In this example, the "LOCATION" column has been selected as the only value for the "Row Label". Due to the "stacking" of the column values during the Depivot process, you will notice that there are now duplicate rows for each location. This is to allow each value from the columns selected in the "Values" field to have its own row. You can see that by using the Depivot function, all of the sales numbers are in a single column. In this format, Group By functions can be very useful and provide insight into your data that may otherwise go unnoticed.