The Cluster + Edit function is used to quickly normalize column data. It is especially useful for spotting inconsistencies and errors in a column. When you execute this function on a column in your dataset:
All column values are searched and closely matching values are grouped together in a cluster.
Each cluster is listed in the Cluster + Edit panel, along with its Cluster Size (the number of unique values in the cluster) and its Row Count (the number of times those unique values occur in the column).
Based on the clustered data, Paxata suggests a single replacement value to normalize all of the values in the cluster. You can accept the suggestion or specify another value to use for normalizing the cluster.
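The grouping described above can be illustrated with a minimal Python sketch. It is not Paxata's implementation: the `fingerprint` key (lowercase, strip punctuation, sort tokens) is an assumed, generic key-collision technique, the function names are hypothetical, Row Count is assumed to be the total number of rows covered by the cluster, and the suggestion is assumed to be the most frequent value in the cluster.

```python
import re
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Hypothetical clustering key: lowercase, drop punctuation, sort unique tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))


def cluster_column(values):
    """Group values that share a fingerprint and summarize each group."""
    counts = Counter(values)              # occurrences of each unique value
    groups = defaultdict(list)
    for value in counts:                  # iterate over unique values only
        groups[fingerprint(value)].append(value)

    clusters = []
    for members in groups.values():
        if len(members) < 2:              # a single value is not a cluster
            continue
        clusters.append({
            "values": sorted(members),
            "cluster_size": len(members),                      # unique values
            "row_count": sum(counts[m] for m in members),      # assumed: total rows
            "suggestion": max(members, key=lambda m: counts[m]),  # assumed: most frequent
        })
    return clusters


column = ["Acme Co.", "acme co", "Acme Co.", "ACME CO", "Globex", "Initech", "initech."]
clusters = cluster_column(column)
for cluster in clusters:
    print(cluster)
```

In this sketch the three spellings of "Acme Co." collapse into one cluster and "Globex" stays unclustered because no other value shares its key. A real clustering algorithm may match values far more loosely than this exact-key comparison.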
Common scenarios for using Cluster + Edit
Bulk correction of data entry mistakes, spelling errors, and use of different abbreviations or shorthand conventions (e.g., Acme Co., Acme Company, Acme Comp)
Reclassifying detailed values into an aggregate value (e.g., “12oz Soda” and “8oz Soda” both become “Soda”)
Consolidating values that are consistent within each source system but differ across systems, which surface when data from different systems is combined in a single column.
Steps to Perform a Cluster + Edit operation on a Column
- In the column you want to cluster and edit, click the drop-down arrow for that column and select "Cluster + Edit":
- The Cluster + Edit panel opens:
A. Verify this is the column on which you want to perform the operation.
B. Use the drop-down box to select one of the available algorithms to use for the clustering operation. See the Understanding the algorithms section below for details.
C. Use the drop-down box to select one of the available output options. The option you choose determines the suggested replacement value for each cluster. See the Understanding the output options section below for details. Note that you can always edit the suggested value.
D. Click to select the cluster you want to edit.
E. Optional: remove any value from a cluster so it will not be impacted by the edit operation. Click the “x” at the left side of the value to remove it.
F. Optional: if you want to change the suggested value to use for the cluster, enter a different value or click a value in the cluster list to use it.
Save your changes and the cluster is automatically updated.
- Continue making individual cluster edits as described in the steps above. Or, you can make bulk edits to quickly normalize all of the clusters. The following tools are available for bulk editing:
A. “Select page” selects all clusters on the current page. Note: the number of clusters on the page is determined by the adjacent setting for "Page Size". After you have reviewed all of the suggested replacements and made any edits, click Save to update all of the clusters on the page.
B. “Cluster automatically” selects every cluster in the column. This is a quick way to normalize every cluster if you are sure you want to accept all of the suggested replacement values. Click Save to update all clusters.
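The effect of saving cluster edits, whether one cluster at a time or in bulk, can be sketched in Python as building a value-to-replacement mapping and applying it to the column. This is an illustration only, not Paxata's API: `build_replacements`, `apply_replacements`, and the `overrides` and `excluded` parameters are hypothetical names standing in for editing a suggested value (step F) and removing a value from a cluster (step E).

```python
def build_replacements(clusters, overrides=None, excluded=None):
    """Map each clustered value to its replacement.

    overrides: hypothetical {cluster index: replacement} for edited suggestions.
    excluded:  hypothetical set of values removed from their cluster.
    """
    overrides = overrides or {}
    excluded = excluded or set()
    mapping = {}
    for index, cluster in enumerate(clusters):
        target = overrides.get(index, cluster["suggestion"])
        for value in cluster["values"]:
            if value not in excluded:     # removed values are left untouched
                mapping[value] = target
    return mapping


def apply_replacements(column, mapping):
    """Replace mapped values; anything not in the mapping is kept as-is."""
    return [mapping.get(value, value) for value in column]


# Example clusters with their suggested replacement values (made-up data).
clusters = [
    {"values": ["Acme Co.", "Acme Company", "Acme Comp"], "suggestion": "Acme Co."},
    {"values": ["12oz Soda", "8oz Soda"], "suggestion": "12oz Soda"},
]
column = ["Acme Company", "8oz Soda", "Acme Comp", "12oz Soda", "Acme Co."]

# Accept the first suggestion, override the second, and exclude one value.
mapping = build_replacements(clusters, overrides={1: "Soda"}, excluded={"Acme Comp"})
normalized = apply_replacements(column, mapping)
print(normalized)
```

Calling `build_replacements(clusters)` with no overrides or exclusions corresponds to accepting every suggested replacement, which is why that path is best reserved for columns where the suggestions have already been reviewed.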
Tools for working with Clusters