(warning) The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.

Cluster and Edit



Overview

The Cluster + Edit function is used to quickly normalize column data. It is especially useful in spotting inconsistencies and errors in a column. When you execute this function on a column in your dataset:

  • All column values are searched and closely matching values are grouped together in a cluster.

  • Each cluster is listed in the Cluster + Edit panel, along with its Cluster Size (each unique value in the cluster) and its Row Count (the number of times each unique value occurs in the column.)

Based on the clustered data, Paxata suggests a single replacement value to normalize all of the values in the cluster. You can accept the suggestion or specify another value to use for normalizing the cluster.

Common scenarios for using Cluster + Edit

  • Bulk correction of data entry mistakes, spelling errors, and use of different abbreviations or shorthand conventions (e.g., Acme Co., Acme Company, Acme Comp)

  • Reclassifying detailed values into an aggregate value (e.g., “12oz Soda” and “8oz Soda” both become “Soda”)

  • Consolidating consistent, but different, values that emerge when the data from different systems is combined in a single column. 

Steps to Perform a Cluster + Edit operation on a Column


  1. From the column where you want to perform the cluster and edit, click the drop-down arrow for that column and select "Cluster + Edit":


  2. The Cluster + Edit panel opens:

    A. Verify this is the column on which you want to perform the operation.

    B. Use the drop-down box to select one of the available algorithms to use for the clustering operation. See the Understanding the algorithms section below for details.

    C. Use the drop-down box to select one of the available algorithms to use for the output option. The algorithm you choose determines the new value suggestion for the cluster. See the Understanding the output options section below for detasils. Note that you can always edit the suggested value.

    D. Click to select the cluster you want to edit.

    E. Optional: remove any value from a cluster so it will not be impacted by the edit operation. Click the “x” at the left side of the value to remove it. 

    F. Optional: if you want to change the suggested value to use for the cluster, enter a different value or click one from the cluster list to use.

    Save your changes and the cluster is automatically updated.

  3. Continue making individual cluster edits as described in the steps above. Or, you can make bulk edits to quickly normalize all of the clusters. The following tools are available for bulk editing:

    A. “Select page” selects all clusters on the current page. Note: the number of clusters on the page is determined by the adjacent setting for "Page Size". After you have reviewed all of the suggested replacements and made any edits, click Save to update all of the clusters on the page.

    B. “Cluster automatically” selects every cluster in the column. This is a quick way to normalize every cluster if you are sure you want accept all of the suggested replacement values. Click Save to update all clusters.

Tools for working with Clusters

The following tools provide visual queues to better recognize how the suggested value for a Cluster was derived:

  • Fixed-width font setting: by default, Cluster values display in a variable-width font. Click this option to display Cluster values in a fixed-width font. The fixed-width option aligns all text characters, which allows you to more easily identify extra spaces within a Cluster value and differentiate characters across the Clusters.

  • Highlight tools: highlighting allows you to easily recognize how the suggested Cluster replacement value was derived. The Additions tool highlights the characters that are in addition to all common characters. The Deletions tool indicates where deletions have been made in order to derive the common characters. Deletions are condensed into a red (x). The Additions and Deletions tools can be simultaneously enabled.

Understanding the algorithms

In order to find the values that should be grouped together, the application makes available three (3) algorithms for your use. For most people, however, the names of the algorithms do not effectively communicate how each works, why they may give different results, or when the use of one may be preferred over another.
  • Metaphone groups words together based on their English language pronunciation. It is classified as a “phonetic” algorithm because it is based on how similar or different the text would sound if spoken. This algorithm is particularly useful when working with manually entered data (where misspellings may occur) and data appended from multiple source systems (where minor variations may occur).

  • N-gram breaks the data in the column into a specified number (n) of characters. These “chunks” (or grams) of text are then compared based on the probability of what might follow each. A frequently seen application of the n-gram algorithm is used by search engines: as a user enters characters into the search bar, the engine examines the probability of what form the final search terms might take and makes suggestions as the user types.

  • Fingerprint groups similar values into a cluster where the only differences are: punctuation, word order and capitalization. A frequently seen application of the Fingerprint algorithm is for matching names, for example: "Adèle Smith" and "SMITH, ADELE".

For all three of these algorithms, blanks and nulls are not included when building a cluster.

Understanding the output options

The output option determines the default replacement value for the cluster. There are three options:

  • Most Frequent Value: the value that occurs most frequently in the cluster.

  • All Common Words: the string of matching words, starting at the beginning of the string, regardless of order. The frequency at which each string occurs then determines the New Value 
    Cluster Example:
    Apple Computer Corporation
    Apple Computer Inc
    Apple Corporation Computer
    Apple Computer
    Apple
     Corp Computer
    New Value: Apple Computer

Important: the algorithm used to build your cluster(s) affects the New Value suggestion:

  • Because metaphone intends to preserve the semantic meaning of the word(s) in your cluster(s), you may notice that some of the New Value suggestions do not strictly reflect all common words in your cluster(s). For example, this may be the case when punctuation is included in your cluster(s).

  • The ngram algorithm must be used in order to include non-consecutive, common words in the cluster.
  • Consecutive Common Words: the longest sequence of matching, consecutive words, starting at the beginning of the string. Values that occur in less that 10% of the Cluster are not included when determining the New Value recommendation. Note that most punctuation does not interrupt the sequencing for the match.
    Example:
    Apple-Computer
    Apple Computer

    Apple ComputerAG
    Apple Computer Corp
    Apple Computer Corporation
    Apple Computer Inc
    New Cluster Value: Apple Computer

The output options listed above intend to make the best recommendation for the New Value replacement. However, the replacement value can always be manually edited to meet your specific business requirements.


#FORM FOLLOWS