The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.
Profile a Dataset
Note: Your Paxata Administrator must enable this feature in your application.
Overview
When you profile a dataset, you generate statistics about the data in that dataset. The results are displayed on the dataset's Profile page.
Your profile is also automatically saved in the Library, with a name indicating that it is a profile-type AnswerSet.
How can I use profiles of my data?
Data profiles are essential for determining the quality of data in a dataset before you begin working with that data. For example, you can quickly determine whether the data contains mixed types, nulls, non-printable characters, or patterns that don’t belong, and then address those quality issues by bringing the data into a Paxata Project. As you continue to update versions of a dataset in the Library, through either manual or automated import, you can profile each subsequent version. In this way, you can monitor the data quality of a dataset from version to version and remediate as necessary.
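As an illustration of version-over-version monitoring, the sketch below compares one statistic between two profiles and reports columns where it got worse. It is not part of Paxata: the function name `quality_regressions`, the `tolerance` parameter, and the dictionary layout (column name mapped to its profile statistics) are all assumptions made for this example, which presumes you have exported the profile AnswerSets into that shape.

```python
def quality_regressions(previous, current, metric="% Blank", tolerance=1.0):
    """Return columns whose metric rose by more than `tolerance` points.

    `previous` and `current` are hypothetical dicts of the form
    {column_name: {statistic_name: value}}, one per dataset version.
    """
    regressions = {}
    for column, stats in current.items():
        if column not in previous:
            continue  # new column: nothing to compare against
        delta = stats[metric] - previous[column][metric]
        if delta > tolerance:
            regressions[column] = round(delta, 2)
    return regressions


# Two made-up profile versions of the same dataset.
v1 = {"city": {"% Blank": 2.0}, "zip": {"% Blank": 0.5}}
v2 = {"city": {"% Blank": 9.5}, "zip": {"% Blank": 0.7}}
print(quality_regressions(v1, v2))  # prints {'city': 7.5}
```

A tolerance is used so that small fluctuations between versions are not flagged; the right value depends on your data.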
What is the meaning of each column that I see in a profile's AnswerSet?
When you profile a dataset, the result is an AnswerSet that has a row to represent each column in your dataset. Each column in the profile AnswerSet provides statistics about the columns in your dataset:
Column definitions in a profile AnswerSet
Column Name | Definition |
---|---|
% Blank | the percentage of blanks in the column |
% Text | the percentage of text values in the column |
% Number | the percentage of numeric values in the column |
% Date | the percentage of date values in the column |
% Boolean | the percentage of boolean values in the column |
% of Predominant Data Type | the percentage of values in the column that are of the predominant data type |
Predominant Data Type | the most dominant data type in the column |
# of Unique Values | number of unique values in the column |
# of Phonetically Unique Values (Metaphone) | number of unique values in the column after clustering like values using the Metaphone ("sounds like") algorithm; for example, "Good Samaritan" and "Good Samertitan" are clustered to count as the same value |
% Possible Phonetic Duplicates (Metaphone) | # of Phonetically Unique Values (Metaphone) / # of Unique Values |
Top 5 | the top five most common values in the column |
Min String Length | the shortest string length in the column |
Max String Length | the longest string length in the column |
Avg String Length | the average string length for strings in the column |
# of NAs or NONEs or NULLs | the number of times the column contains "na", "none", or "null" |
% All Upper Case | the percentage of cells containing all upper case characters |
% All Lower Case | the percentage of cells containing all lower case characters |
% with Non Standard ASCII Chars | the percentage of cells containing non-printable characters, for example control characters |
% with HTML Tags | the percentage of cells containing HTML tags |
Avg # of Consecutive Spaces | the average number of consecutive spaces that are found in the column |
% Negative Numbers | the percentage of cells containing negative numbers |
% Zeros | the percentage of cells containing the value zero |
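To make a few of these definitions concrete, here is a minimal sketch that computes a subset of the statistics for a single column. It is not Paxata's implementation. The function name `profile_column` is hypothetical, and two choices are assumptions the table does not specify: a cell counts as blank if it is missing or contains only whitespace, and percentages are taken over all cells in the column, blanks included.

```python
from collections import Counter


def profile_column(values):
    """Compute a few profile-style statistics for one column (a list of cells)."""
    total = len(values)
    # Assumption: a cell is "blank" if it is None or only whitespace.
    non_blank = [str(v) for v in values if v is not None and str(v).strip() != ""]
    blanks = total - len(non_blank)

    def pct(count):
        # Assumption: percentages are taken over all cells, blanks included.
        return round(100.0 * count / total, 2) if total else 0.0

    def as_number(text):
        # Treat a cell as numeric if Python can parse it as a float.
        try:
            return float(text)
        except ValueError:
            return None

    numbers = [n for n in (as_number(s) for s in non_blank) if n is not None]
    lengths = [len(s) for s in non_blank]
    counts = Counter(non_blank)

    return {
        "% Blank": pct(blanks),
        "# of Unique Values": len(counts),
        "Top 5": [value for value, _ in counts.most_common(5)],
        "Min String Length": min(lengths) if lengths else 0,
        "Max String Length": max(lengths) if lengths else 0,
        "Avg String Length": round(sum(lengths) / len(lengths), 2) if lengths else 0.0,
        "% All Upper Case": pct(sum(1 for s in non_blank if s.isupper())),
        "% All Lower Case": pct(sum(1 for s in non_blank if s.islower())),
        "% Negative Numbers": pct(sum(1 for n in numbers if n < 0)),
        "% Zeros": pct(sum(1 for n in numbers if n == 0)),
    }


sample = ["NY", "ny", "", None, "-3", "0", "NY", "Boston"]
stats = profile_column(sample)
print(stats["% Blank"], stats["# of Unique Values"], stats["Avg String Length"])
```

Calling `profile_column` once per column and collecting the results gives one row per column, mirroring the layout of a profile AnswerSet.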
How do I create a new profile for my dataset?
Follow these steps to create a data profile.
Step | Action |
---|---|
1 | On the Library screen, hover your cursor over the dataset for which you want to create a data profile. |
2 | Click the "More Actions" button, and then select the "profile" option. |
3 | Click the green Generate Profile button. Result: The profile appears in the Profile panel. In addition, the profile is automatically saved as an AnswerSet in the Library. |