(warning) The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.

Profile a Dataset

Your Paxata Administrator must enable this feature in your application.

Overview

When you profile a dataset, you generate statistics about the data in that dataset. The results are displayed on the dataset's Profile page.


The profile is also saved automatically in the Library as an AnswerSet, with a name that identifies it as a profile.




How can I use profiles of my data?

Data profiles help you assess the quality of the data in a dataset before you begin working with it. For example, you can quickly determine whether the data contains mixed types, nulls, non-printable characters, or patterns that don't belong, and then address those quality issues by bringing the data into a Paxata Project. As you continue to update versions of a dataset in the Library, through either manual or automated import, you can profile each subsequent version. In this way, you can monitor the data quality of a dataset from version to version and remediate as necessary.
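As a rough illustration of version-over-version monitoring, the sketch below compares two profiles and flags dataset columns whose "% Blank" value has risen. It is not part of Paxata: the dictionary layout (dataset column name mapped to its profile statistics), the function name quality_regressions, and the tolerance parameter are all assumptions made for this example.

```python
def quality_regressions(previous, current, metric="% Blank", tolerance=1.0):
    """Return {column: (old, new)} where `metric` rose by more than `tolerance`.

    `previous` and `current` are hypothetical in-memory profiles:
    {dataset_column_name: {statistic_name: value}}.
    """
    flagged = {}
    for column, new_stats in current.items():
        old_stats = previous.get(column)
        if old_stats is None:
            # Column is new in this version; there is nothing to compare against.
            continue
        old_value = old_stats[metric]
        new_value = new_stats[metric]
        if new_value - old_value > tolerance:
            flagged[column] = (old_value, new_value)
    return flagged


# Hypothetical profiles for two versions of the same dataset.
profile_v1 = {"city": {"% Blank": 2.0}, "zip": {"% Blank": 0.0}}
profile_v2 = {
    "city": {"% Blank": 2.5},
    "zip": {"% Blank": 6.0},
    "phone": {"% Blank": 1.0},
}

regressions = quality_regressions(profile_v1, profile_v2)
print(regressions)
```

Columns flagged this way are candidates for remediation in a Project. The same comparison works for any numeric profile statistic by passing a different metric name.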

What is the meaning of each column that I see in a profile's AnswerSet?

When you profile a dataset, the result is an AnswerSet that has one row for each column in your dataset. Each column in the profile AnswerSet holds one statistic about that dataset column:


Column definitions in a profile AnswerSet
Column Name: Definition
% Blank: the percentage of blanks in the column
% Text: the percentage of text values in the column
% Number: the percentage of numeric values in the column
% Date: the percentage of date values in the column
% Boolean: the percentage of boolean values in the column
% of Predominant Data Type: the percentage of values in the column that are of the most dominant data type
Predominant Data Type: the most dominant data type in the column
# of Unique Values: the number of unique values in the column
# of Phonetically Unique Values (Metaphone): the number of unique values in the column after clustering like values using the Metaphone (sounds like) algorithm. For example, "Good Samaritan" and "Good Samertitan" are clustered to count as the same value.
% Possible Phonetic Duplicates (Metaphone): # of Phonetically Unique Values (Metaphone) / # of Unique Values. This ratio indicates the possibility of duplicates in the column. A higher number indicates a higher probability of values that are potential duplicates, and indicates that you may need to do a cluster-and-edit operation on the column to identify duplicate values.
Top 5: the five most common values in the column
Min String Length: the shortest string length in the column
Max String Length: the longest string length in the column
Avg String Length: the average string length for strings in the column
# of NAs or NONEs or NULLs: the number of times the column contains "na", "none", or "null"
% All Upper Case: the percentage of cells containing all upper case characters
% All Lower Case: the percentage of cells containing all lower case characters
% with Non Standard ASCII Chars: the percentage of cells containing non-printable characters, for example control characters
% with HTML Tags: the percentage of cells containing HTML tags
Avg # of Consecutive Spaces: the average number of consecutive spaces found in the column
% Negative Numbers: the percentage of cells containing negative numbers
% Zeros: the percentage of cells containing the value zero
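To make a few of these definitions concrete, here is a minimal Python sketch that computes some of the statistics for a single column. It is an illustration only, not Paxata's implementation. The function name profile_column is hypothetical, and several choices are assumptions: cells are strings or None, percentages are taken over all cells (blanks included), blanks are excluded from the unique-value count, and "all upper case" follows Python's str.isupper() rule.

```python
from collections import Counter


def profile_column(values):
    """Compute a few profile-style statistics for one column of string cells."""
    total = len(values)
    if total == 0:
        return {}

    def pct(count):
        # Assumption: percentages are relative to all cells, blanks included.
        return 100.0 * count / total

    def as_number(text):
        try:
            return float(text)
        except ValueError:
            return None

    # Assumption: a blank is None, an empty string, or whitespace only.
    non_blank = [v for v in values if v is not None and v.strip() != ""]
    numbers = [n for n in map(as_number, non_blank) if n is not None]
    lengths = [len(v) for v in non_blank]

    return {
        "% Blank": pct(total - len(non_blank)),
        "% Number": pct(len(numbers)),
        "# of Unique Values": len(set(non_blank)),
        "Top 5": [value for value, _ in Counter(non_blank).most_common(5)],
        "Min String Length": min(lengths) if lengths else 0,
        "Max String Length": max(lengths) if lengths else 0,
        "Avg String Length": sum(lengths) / len(lengths) if lengths else 0.0,
        "% All Upper Case": pct(sum(1 for v in non_blank if v.isupper())),
        "% All Lower Case": pct(sum(1 for v in non_blank if v.islower())),
        "% Negative Numbers": pct(sum(1 for n in numbers if n < 0)),
        "% Zeros": pct(sum(1 for n in numbers if n == 0)),
    }


# Hypothetical column with mixed content.
sample = ["ACME", "acme", "", "42", "-7", "0", "ACME", None]
stats = profile_column(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

The phonetic statistics are left out because Metaphone is not in the Python standard library. Data type detection for dates and booleans is also omitted, since the rules the product applies are not described here.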


How do I create a new profile for my dataset?

Follow these steps to create a data profile.

Step 1: On the Library screen, hover your cursor over the dataset for which you want to create a data profile.

Step 2: Click the "More Actions" button, and then select the "profile" option.
Result: The Profile screen appears.

Step 3: Click the green Generate Profile button.
Result: The profile appears in the Profile panel. In addition, the profile is automatically saved as an AnswerSet in the Library.
Note: The Library preview of the AnswerSet is limited to the first 100 rows of the profile.