The following topics are reviewed in this article:
Matching Methods for Lookups and Joins
Select an IntellifusionTM matching method algorithm to customize how to match text values in columns that have a connection: Exact or Fuzzy. Each option is explained below.
There are three selections you can make for an Exact match:
- Automatic Match (the default): for matching text values, this data prep step ignores case, word order, and punctuation around words. Generally speaking, space characters and punctuation define word boundaries, but there are important exceptions. This matching method ignores punctuation around words, so .ave. matches AVE. However, Automatic Match detects special kinds of words that have punctuation inside the word. Any punctuation inside a word does not define a word boundary, and the punctuation must match exactly in both datasets.
Automatic Match detects the following words with inner punctuation:
- Numbers with periods: currency and floating point numbers such as 12.34. Currency symbols and punctuation around the number are ignored. Comparison is text comparison not numeric, so 3.0 and 3 do not match. Tip: To reduce issues, use the "Parse cell text into numbers" option during import or convert a column to numbers using the column drop-down menu.
- Email addresses: email addresses are one word including internal periods.
- Acronyms with periods: acronyms such as U.S.A. include inner punctuation and counts as one word. However, U.S.A. does not match USA because punctuation inside a word is not ignored when comparing words.
The following table shows the behavior of Automatic Match. The first two columns show example values. The third column shows if those values match with Automatic Match. For examples where Exact Match and Automatic Match return different results, the answers appears in bold.
- Exact Match: every character in the two values must exactly match.
- Custom Match: for making precise adjustments, this option allows you to selectively choose how to handle: word order, case, whitespace, and specific punctuation values. This matching method ignores or keeps word order and case depending on your selection. These options are commonly used for data containing names. Any combination of Ignore and Keep for word order and case can be selected. Click the Ignore and Keep buttons to see an example of how word order and case is handled based on your selections.
For Whitespace, specify whether to ignore, keep or split on when whitespace is encountered. The Split On button splits your data into separate terms on whitespace. Whitespace includes characters from the space bar, tab key and carriage returns. The Whitespace options are commonly used to increase matches for data containing address information. Click the Ignore, Keep and Split On buttons to see an example of how a whitespace is handled based on your selection.
Punctuation options allow you to ignore, keep or split on for specific punctuation. The default punctuation values are: comma and hyphen. Click the white Other button to add to the default values. A new field appears for adding more punctuation. To delete the defaults or any punctuation you have added, move your mouse over a punctuation field and click the orange "X" that appears in the top of that field. If you delete all of the punctuation fields, the default is to keep all punctuation.
Settings that you specify for word order, case, whitespace, and punctuation apply only to the current Lookup data prep step. These settings do not apply to other Lookup steps
This option is only available for a Standard Lookup and must be enabled by your System Administrator in order for it to display as an option in the UI.
The Fuzzy option employs an edit distance algorithm to predict the probability of likely matches between two join keys that you select. For example, you have a base dataset with company names and you want to enrich that data with addresses for each company. You have a second dataset, with address information, but the company names do not match closely enough to your base dataset. In this case, you can use the Fuzzy option to quickly create a join for the company names without having to do any cleanup ahead of time on the "company name" column in either dataset:
1. Your base dataset has company names:
Note that you can simultaneously create multiple join keys for the fuzzy option by adding more column names in the Connections panel:
However, both of your datasets must contain the exact same number of columns. Otherwise, you will receive an error and cannot proceed.
Additional options you can take after the data is joined:
- use the Columns tool to remove the Similarity Score column and one of the Company name columns.
- use a Filtergram on the remaining Company column to ensure the name formats are all correct. If you find names that you want to edit, simply use the Find and Replace option to correctly update those names.
Examples for using Lookups and Joins
A Reverse Lookup operation is useful when you want to achieve a one-to-many relationship between your current dataset and an enrichment dataset. This Reverse operation uses your current base dataset as the Lookup table—meaning only the first match from the base dataset is brought into the selected enrichment dataset.
Example: you have two datasets—Orderswith unique Order IDs andOrder Detailsthat has all of the ordering details associated with an Order ID.
You create a new Project with the Orders dataset as your Base dataset and you want to enrich it with data from Order Details based on the "Order ID" column in both datasets. In this case, you select the Lookup Reverse option, and then choose "Order_ID" as the column for the matching the datasets:
Note: to keep any unmatched (blank) rows containing the "Order_ID", choose an Outer or Left Lookup type. Alternatively, you can discard those rows by selecting an Inner Lookup type. Refer to Lookup Type, which defines what to do with any unmatched rows.
Important: a Reverse Lookup affects the number of rows that will be added to your base dataset and, possibly, the sort order.