(warning) The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.

Lookup and Join functions

The following topics are reviewed in this article:

Lookup functions

Join functions

Matching Methods for Lookups and Joins

Examples of how to use Lookups and Joins


Lookup Functions

The Lookup tool provides a lookup type operation, which is comparable to an MS Excel VLOOKUP. A reverse lookup is also supported.

After you select the Lookup tool, select a "Lookup Source" dataset from your Data Library. You then select the columns for the lookup operation. The "Current" column refers to your base dataset, and when you click in each column, the available columns from each dataset display.

Note: if you use click to select the green "Detect Joins" option, the Connections menu displays the column(s) that your two datasets share. Additionally, a percentage score provides guidance for how best to combine the datasets. See How scores are calculated for Detect Joins for details.

After selecting the columns for the lookup, a preview of the data as it will be combined is displayed on the grid:



Next, click the Options tab to define:

  • the Lookup Type, which defines what to do with any unmatched rows.
  • the Matching Method algorithm to use for the lookup operation. Note: the Fuzzy Matching Method is only available for a Standard type lookup.

When you are satisfied with how your lookup is previewed on the grid, click the green "Save" button to complete the lookup operation.


How Lookup Scores are Calculated for the "Detect Joins" option

A score calculation considers two factors:

  • selectivity: the extent to which each column in the lookup connection is populated with unique values
  • overlap: the percentage of rows that match in the lookup connection.

These two factors produce a percentage score to reflect the quality of the connection. So the higher the percentage score, the better the connection. However, scores less than 100% do not necessarily indicate a problem with the proposed combine operation. In fact, there are legitimate use cases where the percentage may be less than 1%

Rather than being an absolute indicator of “right” or “wrong,” the percentage should serve as a sanity check that should match some level of expectation you have regarding the amount of data the two datasets have in common.

Examples of data that produces low and high scores:

Low Scores

Base (Driving) dataset

Lookup (Source) dataset

Explanation

a,b,c

a,a,b,b,c,c

The lookup has duplicate values

a,b,c

c,d,e,f

The lookup has only one overlap value—'c'



High Scores

Base (Driving) dataset

Lookup (Source) dataset

Explanation

a,b,c

a,b,c,d,e,f,g

All lookup values are unique

a,a,b,b,c,c

a,b,c

All lookup values overlap



Join Functions

The Join tool supports the following join types:

  • Standard Join that combines all matches across both datasets—comparable to a SQL join.
  • Range Join that combines datasets based on a marker column that is matched against two columns representing a range in the joining dataset.
  • Cross Product Join that combines all rows from both datasets. Important: the Cross-Product join significantly increases the number of rows that will be added to your Project because every row from the joining dataset is brought into the Project for each row in the Base dataset.

Note: the Join tool must be enabled by your System Administrator in order for it to be available in the UI.


Standard Join

After you select the Join tool, select the "Standard" type, then select the columns for the join operation. The "Current" column refers to your base dataset, and when you click in each column, the available columns from each dataset display. Begin by selecting the columns on which to create the Join:

After selecting the columns for the join, a preview of the data as it will be combined is displayed on the grid:

Next, click the Options tab to define:

  1. the Lookup Type, which defines what to do with any unmatched rows.
  2. the Matching Method algorithm to use for the join operation.

When you are satisfied with how your lookup is previewed on the grid, click the green "Save" button to complete the lookup operation


Range Join

The Range Join feature allows you to join datasets together based on a "marker" column in your base dataset that is matched against two separate "range" columns in the adjoining dataset. Important: the Range Join feature is only for column types numeric and date. You can always use the Change Menu to convert columns into these types if the data in those columns can be validly converted.
Note: the Range Join feature must be enabled by your System Administrator in order for it to display as a join option in the UI.

After you select the Join tool, select the "Range" option, then select the columns for the join operation. The "Current" column refers to the marker column in your base dataset. It is used to match against the lower and upper range of values in the adjoining dataset:

There are two ways to utilize the Range Join feature:

  • Keyless: a keyless range join groups together entries that fall within a certain range of the marker value.
  • Keyed: "Optional Keys" are extra columns you can select from the base and adjoining dataset to more carefully increase the specificity of the join criteria.


After selecting the columns for the join, a preview of the data as it will be combined is displayed on the grid. Example: our base dataset is a lending club dataset that contains typical information about individual loans. To that base, we are joining the "loan_riskfactors.csv" dataset that contains averaged risk factor of loans based on grade and date. The goal of our Project is to determine the risk factor of each loan in lending club base dataset with the industry averages provided by the risk factor dataset.

Here we are using the column "loan_date" as our marker column in the lending club base dataset and matched it against the "StartDate" and "EndDate" columns from the "loan_riskfactors.csv" dataset. Notice the grid displays a preview of the resulting join:

Tip: if the resulting dataset contains too many rows after the join, remember that you can use the Optional Keys to create a keyed join that increases the specificity of the join criteria, thereby reducing the number of resulting rows.

Next, click the Options tab to define:

  • the Lookup Type, which defines what to do with any unmatched rows.
  • the Matching Method algorithm to use for the join operation.

When you are satisfied with how your lookup is previewed on the grid, click the green "Save" button to complete the lookup operation.


Cross Product Join

After you select the Join tool, select the "Cross Product" type, then select the columns for the join operation. The "Current" column refers to your base dataset, and when you click in each column, the available columns from each dataset display. Begin by selecting the columns on which to create the Join:

After selecting the columns for the join, a preview of the data as it will be combined is displayed on the grid:

Note: unlike the other Join types, there are no lookup or matching options for a Cross Product join because all rows will be matched in this operation.

When you are satisfied with how your lookup is previewed on the grid, click the green "Save" button to complete the lookup operation.

Important: the Cross-Product join significantly increases the number of rows that will be added to your Project because every row from the joining dataset is brought into the Project for each row in the Base dataset. If the number of rows exceeds your Project's row limit, then the Join operation will fail with an error message. In this case, reduce the number of rows in your datasets before combining them or speak with your System Administrator regarding the Project row limit.



Lookup Type Options for Lookups and Joins

The Lookup Type options specify what to do with unmatched rows. There are four options: Left Lookup, Inner Lookup, Right Lookup, and Outer Lookup. Note: when you mouse over the icons for each type, its associated type displays.

One of the easiest ways to understand the different Lookup Types is by using a Venn diagram:

Left Lookup

A Left Lookup will return a combined dataset containing all rows with matching values in the specified columns, as well as all rows from the left (Base) dataset that had no corresponding matching value in the right (Lookup) dataset.

Inner Lookup

An Inner Lookup will return a combined dataset containing only those rows which contain matching values within the specified columns:

Right Lookup

A Right Lookup is similar to a Left Lookup in that it will return a combined dataset containing all rows with matching values in the specified columns; however, rather than return the rows with non-matching values from the left (Base) dataset, it will return all non-matching rows from the right (Lookup) dataset instead.

Outer Lookup

An Outer Lookup will return a combined dataset containing all rows with matching values in the specified columns, as well as all rows from both datasets that contained no matching values:



Matching Methods for Lookups and Joins

Select an IntellifusionTM matching method algorithm to customize how to match text values in columns that have a connection: Exact or Fuzzy. Each option is explained below.

Exact option

There are three selections you can make for an Exact match:

  • Automatic Match (the default): for matching text values, this data prep step ignores case, word order, and punctuation around words. Generally speaking, space characters and punctuation define word boundaries, but there are important exceptions. This matching method ignores punctuation around words, so .ave. matches AVE. However, Automatic Match detects special kinds of words that have punctuation inside the word. Any punctuation inside a word does not define a word boundary, and the punctuation must match exactly in both datasets.
    Automatic Match detects the following words with inner punctuation:
    • Numbers with periods: currency and floating point numbers such as 12.34. Currency symbols and punctuation around the number are ignored. Comparison is text comparison not numeric, so 3.0 and 3 do not match. Tip: To reduce issues, use the "Parse cell text into numbers" option during import or convert a column to numbers using the column drop-down menu.
    • Email addresses: email addresses are one word including internal periods.
    • Acronyms with periods: acronyms such as U.S.A. include inner punctuation and counts as one word. However, U.S.A. does not match USA because punctuation inside a word is not ignored when comparing words.

The following table shows the behavior of Automatic Match. The first two columns show example values. The third column shows if those values match with Automatic Match. For examples where Exact Match and Automatic Match return different results, the answers appears in bold.

  • Exact Match: every character in the two values must exactly match.

  • Custom Match: for making precise adjustments, this option allows you to selectively choose how to handle: word order, case, whitespace, and specific punctuation values. This matching method ignores or keeps word order and case depending on your selection. These options are commonly used for data containing names. Any combination of Ignore and Keep for word order and case can be selected. Click the Ignore and Keep buttons to see an example of how word order and case is handled based on your selections.

    For Whitespace, specify whether to ignore, keep or split on when whitespace is encountered. The Split On button splits your data into separate terms on whitespace. Whitespace includes characters from the space bar, tab key and carriage returns. The Whitespace options are commonly used to increase matches for data containing address information. Click the IgnoreKeep and Split On buttons to see an example of how a whitespace is handled based on your selection.

    Punctuation options allow you to ignore, keep or split on for specific punctuation. The default punctuation values are: comma and hyphen. Click the white Other button to add to the default values. A new field appears for adding more punctuation. To delete the defaults or any punctuation you have added, move your mouse over a punctuation field and click the orange "X" that appears in the top of that field. If you delete all of the punctuation fields, the default is to keep all punctuation.

    Settings that you specify for word order, case, whitespace, and punctuation apply only to the current Lookup data prep step. These settings do not apply to other Lookup steps


Fuzzy Option

This option is only available for a Standard Lookup and must be enabled by your System Administrator in order for it to display as an option in the UI.

The Fuzzy option employs an edit distance algorithm to predict the probability of likely matches between two join keys that you select. For example, you have a base dataset with company names and you want to enrich that data with addresses for each company. You have a second dataset, with address information, but the company names do not match closely enough to your base dataset. In this case, you can use the Fuzzy option to quickly create a join for the company names without having to do any cleanup ahead of time on the "company name" column in either dataset:

1. Your base dataset has company names:

while your enrichment dataset with addresses has company names listed this way:


2. Now tune the Fuzzy options: 


3. After finding the best threshold match, click "Save" and your data is joined.

Note that you can simultaneously create multiple join keys for the fuzzy option by adding more column names in the Connections panel:

However, both of your datasets must contain the exact same number of columns. Otherwise, you will receive an error and cannot proceed.

Additional options you can take after the data is joined:

  • use the Columns tool to remove the Similarity Score column and one of the Company name columns.
  • use a Filtergram on the remaining Company column to ensure the name formats are all correct. If you find names that you want to edit, simply use the Find and Replace option to correctly update those names.



Examples for using Lookups and Joins

Reverse Lookup

A Reverse Lookup operation is useful when you want to achieve a one-to-many relationship between your current dataset and an enrichment dataset. This Reverse operation uses your current base dataset as the Lookup table—meaning only the first match from the base dataset is brought into the selected enrichment dataset.

Example: you have two datasets—Orderswith unique Order IDs andOrder Detailsthat has all of the ordering details associated with an Order ID.

OrdersOrder Details

You create a new Project with the Orders dataset as your Base dataset and you want to enrich it with data from Order Details based on the "Order ID" column in both datasets. In this case, you select the Lookup Reverse option, and then choose "Order_ID" as the column for the matching the datasets:

Note: to keep any unmatched (blank) rows containing the "Order_ID", choose an Outer or Left Lookup type. Alternatively, you can discard those rows by selecting an Inner Lookup type. Refer to Lookup Type, which defines what to do with any unmatched rows.

Important: a Reverse Lookup affects the number of rows that will be added to your base dataset and, possibly, the sort order.


Join Datasets

This selection combines all matches across both datasets. The Join operation is useful when you want to achieve a many-to-many relationship between two datasets. Note: this selection affects the number of rows that will be added to your Project.

Example:
You have two datasets:

  • Transactions with a set of purchase transaction IDs, each row indicating a book that was purchased. Note that there may be multiple transactions IDs with the same book because that book may have sold multiple times.
  • Books with a set of books and their authors. Note that each book may have multiple authors, such that one book may be listed in multiple rows, once for each co-author.
TransactionsBooks

Your objective is to determine the number of transactions per author. And because there are multiple authors for many of the books—instead of a single, unique value for author—the "Join" is the option you want to select for combining these datasets.

Create a new Project with the Transactions dataset. Then do a standard Join operation with the Books dataset, and choose "Book" as the column for joining the datasets:

Note that performing a Join can substantially increase the number of rows in the resulting dataset. If the number of rows exceeds your Project's row limit, then the Join operation will fail with an error message. In this case, reduce the number of rows in your datasets before joining them or speak with your System Administrator regarding the Project row limit.


Cross Product Join to return all combinations

This selection combines all rows from both of the datasets in the operation.

Example:
You have three datasets:

  • Customer Master with all of your customers IDs.
  • Products Master with all of your products and associated IDs.
  • Customer Orders with the Customer IDs, and the associated Products and Quantity that were purchased.
Customer MasterProduct MasterCustomer Orders

Your objective is to create a master Project that includes all customers and all products, and then determine all of the products that each customer has not purchased.

Create a new Project with the Customer Master as your base dataset. Then do a Cross-Product join operation with the Products Master dataset. Choose "Customer_ID" and "Product_ID" as the columns for the cross-product operation.

You now have a master dataset that has all of your customers and all of your products:

Next, do a Standard Lookup operation with the Customer Orders dataset. Choose "Customer_ID" and "Product_ID" as the columns for the lookup, and be sure to keep all unmatched rows. Refer to Lookup Type, which defines what to do with any unmatched rows.

The resulting dataset enables you to easily identify all of the products that each customer has not yet purchased:

Each time you update the Customer Orders dataset for this Project, you automatically see the data on the grid updated to reflect the new purchases. To capture all of the purchase information over time, you can create a Project Lens to publish an AnswerSet for each result.

Note: the Cross-Product operation significantly increases the number of rows that will be added to your Project because every row from the Lookup source is brought into the Project for each row in the Base dataset. If the number of rows exceeds your Project's row limit, then the Join operation will fail with an error message. In this case, reduce the number of rows in your datasets before combining them or speak with your System Administrator regarding the Project row limit.