Paxata's data transformation capabilities support heterogeneous data types, meaning that data types are automatically identified at the cell level when a dataset is imported into the Paxata Library. Support for heterogeneous data types, even within the same Paxata Project column, is powerful because it enables you to bring all of the data into your Project. Mixed data types, with their inherent data quality issues, can then be homogenized and harmonized with Paxata, as explained in the Best Practices section below. In strongly typed applications that do not support heterogeneous data, the source data must first be homogenized with a different tool before your data prep work can even begin.

...

·       Boolean

·       String or Text

·       Date Time (under conditions described below)

Paxata does this through an algorithm that follows these rules:

1.     If the value is null, ignore the value

...

3.     If the value can be programmatically read as a number, treat it as Numeric

4.     All other values default to String


For example, suppose you have a dataset with 10 columns and 1 million rows of data, for a total of 10 million cells. In this case, Paxata identifies the data type for each of the 10 million cells by following the algorithm rules above.
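
The cell-level rules can be illustrated with a minimal Python sketch. This is not Paxata's implementation: the function name infer_cell_type is hypothetical, Python's float() stands in for "can be programmatically read as a number" (Paxata's own number parsing may accept or reject different inputs), and the rule elided from the list above is deliberately left out rather than guessed.

```python
from typing import Optional


def infer_cell_type(value: Optional[str]) -> Optional[str]:
    """Return a type label for one cell, or None when the cell is ignored.

    Hypothetical helper for illustration only.
    """
    # Rule 1: null values are ignored.
    if value is None:
        return None
    # Rule 2 is elided in the text above, so it is not modelled here.
    # Rule 3: anything that parses as a number is treated as Numeric.
    # Assumption: float() approximates "programmatically read as a number".
    try:
        float(value)
        return "Numeric"
    except ValueError:
        pass
    # Rule 4: every other value defaults to String.
    return "String"


cells = ["42", "3.14", "abc", None]
print([infer_cell_type(c) for c in cells])
```

Applied to every cell of a 10-column, 1-million-row dataset, a function of this shape would be called 10 million times, once per cell.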

...

·       Parquet file and the Parquet format provides the schema

·       Microsoft Excel file and the Excel format specifies a data type for each cell



How does Paxata determine the column type for heterogeneous data within the same column?
Returning to the example above, in 1 million rows of data there is a good possibility that a single column contains data belonging to different data types; for example, string and numeric values could be mixed in the same column. In this case, Paxata applies further logic to determine which data type to cast the column to. Let's use another very simple example to illustrate that logic.

...


What if there's a tie for data types in a column?
In the event of a tie, meaning 50% of the column values are one type while 50% are another type, the calculation logic provides these additional rules to break the tie:

To summarize, in order of precedence, ties are broken as follows:
1. Boolean
2. String
3. Numeric
4. Date
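
The tie-break order above can be sketched in Python as follows. The sketch assumes that the "predominant" type is simply the most frequent non-null cell type in the column; the function name infer_column_type and the string labels are hypothetical, and only the precedence list itself comes from the text above.

```python
from collections import Counter
from typing import Iterable, Optional

# Tie-break precedence from the list above: earlier entries win.
TIE_BREAK_ORDER = ["Boolean", "String", "Numeric", "Date"]


def infer_column_type(cell_types: Iterable[Optional[str]]) -> Optional[str]:
    """Pick a column type from per-cell type labels (hypothetical helper).

    Labels must be drawn from TIE_BREAK_ORDER; None marks an ignored cell.
    """
    # Null cells are ignored, so they never count toward any type.
    counts = Counter(t for t in cell_types if t is not None)
    if not counts:
        return None
    # Assumption: the predominant type is the most frequent one.
    top = max(counts.values())
    tied = [t for t in counts if counts[t] == top]
    # When several types share the top count, the precedence list decides.
    return min(tied, key=TIE_BREAK_ORDER.index)


print(infer_column_type(["Numeric", "String", "String", "Numeric"]))
```

In this example the column is split evenly between Numeric and String, so the sketch falls back to the precedence list and selects String.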


What happens if the predominant data type in a column changes when I bring in new data to my Paxata Library or an existing Project?
Column type inference and the subsequent casting occur only during the import process to the Paxata Library. There are two potential scenarios in which a column type may not accurately reflect the predominant data type in a column:

(1) During import into the Paxata Library, the first 1,000 rows of data are used to infer the column type. As a general rule, Paxata has found that 1,000 rows of data (a configurable value) are sufficient to accurately infer and cast column types for your datasets. These first 1,000 rows are informally known as the "preview" state, which is the state you see in the application while a dataset is loading, either for the first time or as an updated version of an existing dataset.

There may be unusual cases in which the predominant data type for a column changes after the first 1,000 rows. In this case, the column remains cast according to those first 1,000 rows. Although this "preview" size of 1,000 rows is configurable during import, Paxata best practices recommend that you use Filtergrams in your Project, as part of your standard data harmonization practice, to identify and address data quality issues. See the next section for Best Practices details.
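
To see how a preview-based inference can diverge from the full column, consider this small Python sketch. It is an illustration under stated assumptions, not Paxata's logic: the names PREVIEW_ROWS and predominant_type are hypothetical, "predominant" is taken to mean "most frequent", and tie-breaking is omitted for brevity.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Optional

PREVIEW_ROWS = 1000  # default preview size described above (configurable)


def predominant_type(cell_types: Iterable[Optional[str]],
                     limit: Optional[int] = None) -> Optional[str]:
    """Most frequent non-null type among the first `limit` cells.

    With limit=None the whole column is scanned. Hypothetical helper.
    """
    sample = islice(cell_types, limit)
    counts = Counter(t for t in sample if t is not None)
    return counts.most_common(1)[0][0] if counts else None


# A column whose first 1,000 cells are numeric but whose remainder is text.
column = ["Numeric"] * 1000 + ["String"] * 5000

print(predominant_type(column, PREVIEW_ROWS))  # type seen in the preview
print(predominant_type(column))                # type across the full column
```

Here the preview yields Numeric while the full column is predominantly String, which is the kind of mismatch that the paragraph above describes.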

(2) After a lookup or an append operation in an existing Project, the predominant type for a column may change based on the data that comes into the column as a result of the operation. Because column type inference occurs only during the import process, the column keeps the type it was originally cast to, despite the new predominant type. However, Paxata best practices recommend that, as part of your standard data harmonization practice, you always use Filtergrams after blending data from multiple sources to identify and address data quality issues. See the next section for Best Practices details.



Best Practices: how do I use Paxata to locate and remediate data typing issues in my data?
Paxata was built from the very beginning to identify and address such data quality issues. Typically, as soon as a dataset is imported into the Library or appended in a Project, the recommended next step is to harmonize its data types to improve data quality. Data harmonization is one of the key aspects of data preparation, and Paxata provides you with visual indicators and tools such as Filtergrams for your harmonization exercise.

...

After you've identified the "invalid" data types, which are all types other than the predominant type, and filtered to display only those values, you can create a lens in your Project to generate an AnswerSet that lists only those non-conforming values. The AnswerSet can then be used to assist in your remediation process for those values. If, after reviewing the "invalid" types, you want to convert the column to another data type, you can do so through the column drop-down menu.



...