(warning) The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.

Column typing and working with heterogeneous data types in Paxata

Paxata's data transformation capabilities support heterogenous data types—meaning that data types are automatically identified at the cell level when importing a dataset into the Paxata Library. The ability to support heterogenous data types, and within the same Paxata Project column, is powerful because it enables you to bring all of the data into your Project. The mixed data types with inherent data quality issues can then be easily homogenized and harmonized with Paxata, as explained in the Best Practices section below. In other strongly-typed applications that do not support heterogenous data, the source data must be homogenized, using a different tool, before your data prep work can even begin.

The purpose of this article is to explain when and how Paxata determines data types for cells and columns, and the best practices for working with heterogeneous data in a Paxata Project.

When and how does Paxata identify data types?
When you import a dataset into the Paxata Library, all data, in all cells, is automatically identified as one of the following data types during the import process:

·       Numeric

·       Boolean

·       String or Text

·       Date Time (under conditions described below)

Paxata does this through an algorithm that follows these rules:

1.     If the value is null, ignore the value

2.     If the value is, exactly, either “true” or “false”, treat the value as a Boolean

3.     If the value can be programmatically read as a number, treat it as Numeric

4.     All other values default to String

For example: you have a dataset with 10 columns and 1 million rows of data. This translates to a total of 10 million cells. In this case, Paxata identifies the data type for each one of the 10 Million cells following the algorithm rules above.

Then, the data type for each column is cast based on the predominant data type that resides within each cell of a column.

What about Date Time values?
As a rule, due to the inherent complexities required to analyze and resolve for the multitude of date time formats, Paxata does not identify date time values in flat files. However, there are some exceptions to this rule. Under the following conditions, cell data will be recognized as type date time when imported from:

·       Database table (JDBC, Hive, etc) and the database provides a schema

·       Parquet file and the Parquet format provides the schema

·       Microsoft Excel file and the Excel format specifies a data type for each cell

How does Paxata determine the column type for heterogeneous data within the same column?
Returning to the example above, in 1 million rows of data, there's a good possibility that, within the same column, there is data belonging to different data types—for example string and numeric values could be mixed in the same column. In this case, Paxata has further logic to determine how to cast the column's data type. Let's use another very simple example to illustrate that logic.

Here is a column of data that has 15 rows. The first 9 rows are identified as type Numeric, and the remaining 6 values are identified as type String. Notice the column type has been cast as Numeric:

During the import process, the algorithm calculation is automatically performed to detect that 60% of values are Numeric and 40% Strings. The predominant data type in a column then determines how Paxata will cast the column type—in this case, the column is cast as type Numeric. 

What if there's a tie for data types in a column?
In the event there is a tie—meaning 50% of the column values are one type while 50% are another type—the calculation logic provides these additional rules to break the tie:

To summarize, in order of predominance, ties are broken are as follows:
1. Boolean
2. String
3. Numeric
4. Date

What happens if the predominant data type in a column changes when I bring in new data to my Paxata Library or an existing Project?
Column type inferencing and subsequent casting only occurs during the import process to the Paxata Library. There are two potential scenarios when a column type may not accurately reflect the predominant data type in a column:

(1) During import into the Paxata Library, the first 1,000 rows of data are used to inference the column type. As a general rule, Paxata has found that 1,000 rows of data—a configurable value—are sufficient to accurately inference and cast column types for your datasets. These first 1,000 rows are informally known as the "preview" state and it's the state you see in the application while a dataset is loading—either for the first time, or as an updated version for an existing dataset:

There may be unusual cases in which the predominant data type for a column changes after the first 1,000 rows. In this case, the column will remain cast using those first 1,000 rows. Though this "preview" state of 1,000 rows is configurable during import, Paxata best practices recommends that you use Filtergrams in your Project, as part of your standard data harmonization practice, to identify and address data quality issues. See the next section for Best Practices details.

(2) After a lookup or an append operation in an existing Project, the predominant type for a column may change based on the data that comes into the column as a result of the operation. Because inferencing for column type occurs only during the import process, the column type, as it was originally cast, will remain despite the new predominant type. However, Paxata best practices recommends that, as part of your standard data harmonization practices, you always use Filtergrams after blending your data from multiple sources to identify and address data quality issues. See the next section for Best Practices details.

Best Practices: how do I use Paxata to locate and remediate data typing issues in my data?
Paxata was built from the very beginning to identify and address such data quality issues. Typically, as soon as a dataset is imported into the Library or appended in a Project, the next recommended step is to harmonize the data type so that data quality is enhanced. Data harmonization is one of the key aspects of data preparation and Paxata provides you with visual indicators and tools like Filtergrams for your harmonization exercise.

Example: following an append into an existing Project, the predominant data type changes for this column from "numeric" to "string". The column was initially and correctly cast as numeric during the import process and subsequent use in this Project. After the append operation, notice the column type remains string type, though the predominant type is now numeric. However, a visual indicator—numeric values are right-justified—allows you to quickly notice a typing disparity in the column:

When opening a Filtergram on this column, you can quickly determine the values in the column that are not "valid" for this type of column:

After you've identified the "invalid" data types, which are all types other than the predominate type, and filtered to display only those values, you can create a lens in your Project to generate an AnswerSet that lists only those non-conforming values. The AnswerSet can then be used to assist in your remediation process for those values. If, after reviewing the "invalid" types, you want to convert the column type to another data type, this can easily be done through the column drop-down menu: