The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.
Cluster and Edit
Overview
The Cluster + Edit function is used to quickly normalize column data. It is especially useful in spotting inconsistencies and errors in a column. When you execute this function on a column in your dataset:
All column values are searched and closely matching values are grouped together in a cluster.
Each cluster is listed in the Cluster + Edit panel, along with its Cluster Size (the number of unique values in the cluster) and its Row Count (the number of times those values occur in the column).
Based on the clustered data, Paxata suggests a single replacement value to normalize all of the values in the cluster. You can accept the suggestion or specify another value to use for normalizing the cluster.
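The grouping described above can be pictured with a short Python sketch. This is not Paxata's implementation: the matching rule in simple_key (ignore capitalization and extra whitespace) is a deliberately simple stand-in for the real clustering algorithms, all function names are hypothetical, and treating Row Count as the total across the cluster is an assumption.

```python
from collections import Counter, defaultdict

def simple_key(value: str) -> str:
    # Hypothetical matching rule: ignore capitalization and extra whitespace.
    return " ".join(value.lower().split())

def build_clusters(column):
    # Count how often each unique, non-blank value occurs in the column.
    counts = Counter(v for v in column if v is not None and v.strip())
    groups = defaultdict(dict)
    for value, n in counts.items():
        groups[simple_key(value)][value] = n
    clusters = []
    for members in groups.values():
        if len(members) < 2:
            continue  # a single unique value has nothing to normalize
        clusters.append({
            "values": members,                            # unique value -> occurrences
            "cluster_size": len(members),                 # number of unique values
            "row_count": sum(members.values()),           # assumed: total rows in the cluster
            "suggested": max(members, key=members.get),   # most frequent value
        })
    return clusters

column = ["Acme Co.", "acme co.", "Acme Co.", "ACME CO.", "Widget Inc", None, ""]
clusters = build_clusters(column)
print(clusters)
```

Here the three spellings of "Acme Co." form one cluster, while "Widget Inc" has no close match and is left alone.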
Common scenarios for using Cluster + Edit
Bulk correction of data entry mistakes, spelling errors, and use of different abbreviations or shorthand conventions (e.g., Acme Co., Acme Company, Acme Comp)
Reclassifying detailed values into an aggregate value (e.g., “12oz Soda” and “8oz Soda” both become “Soda”)
Consolidating consistent, but different, values that emerge when the data from different systems is combined in a single column.
Steps to Perform a Cluster + Edit operation on a Column
- From the column where you want to perform the cluster and edit, click the drop-down arrow for that column and select "Cluster + Edit":
- The Cluster + Edit panel opens:
A. Verify this is the column on which you want to perform the operation.
B. Use the drop-down box to select one of the available algorithms to use for the clustering operation. See the Understanding the algorithms section below for details.
C. Use the drop-down box to select one of the available algorithms to use for the output option. The algorithm you choose determines the new value suggestion for the cluster. See the Understanding the output options section below for details. Note that you can always edit the suggested value.
D. Click to select the cluster you want to edit.
E. Optional: remove any value from a cluster so it will not be impacted by the edit operation. Click the “x” at the left side of the value to remove it.
F. Optional: if you want to change the suggested value to use for the cluster, enter a different value or click one from the cluster list to use. Save your changes and the cluster is automatically updated.
- Continue making individual cluster edits as described in the steps above. Or, you can make bulk edits to quickly normalize all of the clusters. The following tools are available for bulk editing:
A. “Select page” selects all clusters on the current page. Note: the number of clusters on the page is determined by the adjacent setting for "Page Size". After you have reviewed all of the suggested replacements and made any edits, click Save to update all of the clusters on the page.
B. “Cluster automatically” selects every cluster in the column. This is a quick way to normalize every cluster if you are sure you want to accept all of the suggested replacement values. Click Save to update all clusters.
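Conceptually, saving one or more cluster edits replaces every remaining member of each cluster with the chosen value. The Python sketch below illustrates that idea only; it is not how Paxata applies the step, and apply_cluster_edits and its argument shapes are hypothetical.

```python
def apply_cluster_edits(column, edits):
    """Apply cluster edits to a column.

    edits is a list of (new_value, members) pairs, where members is the set of
    original values that remain in the cluster after any removals (step E).
    """
    replacement = {}
    for new_value, members in edits:
        for member in members:
            replacement[member] = new_value
    # Values that are not in any cluster (including nulls) pass through unchanged.
    return [replacement.get(value, value) for value in column]

column = ["Acme Co.", "Acme Company", "Acme Comp", "Acme Compost", None]
# "Acme Compost" was removed from the cluster, so it is not listed as a member.
edits = [("Acme Company", {"Acme Co.", "Acme Company", "Acme Comp"})]
result = apply_cluster_edits(column, edits)
print(result)
```

Passing several (new_value, members) pairs at once corresponds to a bulk edit of a page or of the whole column.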
Tools for working with Clusters
The following tools provide visual cues to better recognize how the suggested value for a Cluster was derived:
- Fixed-width font setting: by default, Cluster values display in a variable-width font. Click this option to display Cluster values in a fixed-width font. The fixed-width option aligns all text characters, which allows you to more easily identify extra spaces within a Cluster value and differentiate characters across the Clusters.
- Highlight tools: highlighting allows you to easily recognize how the suggested Cluster replacement value was derived. The Additions tool highlights the characters that are in addition to all common characters. The Deletions tool indicates where deletions have been made in order to derive the common characters. Deletions are condensed into a red (x). The Additions and Deletions tools can be simultaneously enabled.
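One way to picture the Additions and Deletions highlights is as a character-level diff between a cluster value and the suggested value. The sketch below uses Python's standard difflib module to approximate this; the "[+...]" and "(x)" markers and the mark_differences function are assumptions for illustration, and the product's actual comparison logic may differ.

```python
from difflib import SequenceMatcher

def mark_differences(value, suggested):
    # Compare the suggested value (a) with the cluster value (b), character by character.
    matcher = SequenceMatcher(None, suggested, value, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(value[j1:j2])
        if tag in ("insert", "replace"):
            # Characters present in the value but not in the suggestion ("addition").
            parts.append("[+" + value[j1:j2] + "]")
        if tag in ("delete", "replace"):
            # Characters of the suggestion missing from the value, condensed to one marker.
            parts.append("(x)")
    return "".join(parts)

print(mark_differences("Apple Computer Inc", "Apple Computer"))
print(mark_differences("Aple Computer", "Apple Computer"))
```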
Understanding the algorithms
- Metaphone groups words together based on their English language pronunciation. It is classified as a “phonetic” algorithm because it is based on how similar or different the text would sound if spoken. This algorithm is particularly useful when working with manually entered data (where misspellings may occur) and data appended from multiple source systems (where minor variations may occur).
- N-gram breaks the data in the column into a specified number (n) of characters. These “chunks” (or grams) of text are then compared based on the probability of what might follow each. A familiar application of the n-gram algorithm is search engine autocomplete: as a user enters characters into the search bar, the engine examines the probability of what form the final search terms might take and makes suggestions as the user types.
- Fingerprint groups similar values into a cluster where the only differences are punctuation, word order, and capitalization. A frequently seen application of the Fingerprint algorithm is matching names, for example: "Adèle Smith" and "SMITH, ADELE".
For all three of these algorithms, blanks and nulls are not included when building a cluster.
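To make the Fingerprint idea concrete, here is a minimal Python sketch of a fingerprint-style key. It follows the general key-collision approach used by open-source data-cleaning tools rather than Paxata's actual code. The accent folding is an assumption suggested by the "Adèle"/"ADELE" example, and the function names are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value: str) -> str:
    # Fold accents (assumption, suggested by the "Adèle"/"ADELE" example).
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Ignore capitalization and drop ASCII punctuation.
    cleaned = ascii_text.lower().translate(str.maketrans("", "", string.punctuation))
    # Ignore word order (and repeated words) by sorting the unique tokens.
    return " ".join(sorted(set(cleaned.split())))

def cluster_by_fingerprint(values):
    groups = defaultdict(list)
    for value in values:
        if value is None or not value.strip():
            continue  # blanks and nulls are not included in any cluster
        groups[fingerprint(value)].append(value)
    # Only keys shared by two or more values form a cluster.
    return [members for members in groups.values() if len(members) > 1]

names = ["Adèle Smith", "SMITH, ADELE", "smith adele", "John Doe", "", None]
clusters = cluster_by_fingerprint(names)
print(clusters)
```

Values that reduce to the same key land in the same cluster; "John Doe" has no partner and is left out.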
Understanding the output options
The output option determines the default replacement value for the cluster. There are three options:
- Most Frequent Value: the value that occurs most frequently in the cluster.
- All Common Words: the string of matching words, starting at the beginning of the string, regardless of order. The frequency at which each string occurs then determines the New Value.
Cluster Example:
Apple Computer Corporation
Apple Computer Inc
Apple Corporation Computer
Apple Computer
Apple Corp Computer
New Value: Apple Computer
Important: the algorithm used to build your cluster(s) affects the New Value suggestion:
- Because Metaphone intends to preserve the semantic meaning of the word(s) in your cluster(s), you may notice that some of the New Value suggestions do not strictly reflect all common words in your cluster(s). For example, this may be the case when punctuation is included in your cluster(s).
- The N-gram algorithm must be used in order to include non-consecutive, common words in the cluster.
- Consecutive Common Words: the longest sequence of matching, consecutive words, starting at the beginning of the string. Values that occur in less than 10% of the Cluster are not included when determining the New Value recommendation. Note that most punctuation does not interrupt the sequencing for the match.
Example:
Apple-Computer
Apple Computer
Apple ComputerAG
Apple Computer Corp
Apple Computer Corporation
Apple Computer Inc
New Cluster Value: Apple Computer
The output options listed above intend to make the best recommendation for the New Value replacement. However, the replacement value can always be manually edited to meet your specific business requirements.
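The three output options can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the product's logic: punctuation is treated as a word separator, the 10% threshold is applied to each value's share of the cluster's rows, and the row counts in the second example are invented for illustration (the examples above do not give them). All function names are hypothetical.

```python
import re

def words(value):
    # Assumption: punctuation separates words, so "Apple-Computer" -> ["Apple", "Computer"].
    return re.findall(r"[A-Za-z0-9]+", value)

def most_frequent_value(row_counts):
    # row_counts maps each unique value in the cluster to its number of rows.
    return max(row_counts, key=row_counts.get)

def all_common_words(row_counts):
    values = list(row_counts)
    common = set(words(values[0]))
    for value in values[1:]:
        common &= set(words(value))
    # Report the shared words in the order they appear in the first value.
    return " ".join(w for w in words(values[0]) if w in common)

def consecutive_common_words(row_counts, min_share=0.10):
    total = sum(row_counts.values())
    # Ignore values that account for less than 10% of the cluster's rows.
    kept = [words(v) for v, n in row_counts.items() if n / total >= min_share]
    prefix = []
    for position in zip(*kept):
        if len(set(position)) != 1:
            break  # the values disagree at this word position
        prefix.append(position[0])
    return " ".join(prefix)

cluster_a = {"Apple Computer Corporation": 1, "Apple Computer Inc": 1,
             "Apple Corporation Computer": 1, "Apple Computer": 1,
             "Apple Corp Computer": 1}
# Row counts below are invented for illustration.
cluster_b = {"Apple-Computer": 3, "Apple Computer": 10, "Apple ComputerAG": 1,
             "Apple Computer Corp": 4, "Apple Computer Corporation": 5,
             "Apple Computer Inc": 4}

print(all_common_words(cluster_a))
print(consecutive_common_words(cluster_b))
print(most_frequent_value(cluster_b))
```

With these invented counts, "Apple ComputerAG" falls under the 10% threshold and is ignored, so all three calls return "Apple Computer".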