The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.
Getting Started in Paxata
Contents
What you can do with Paxata
Video overview of Paxata
Tour the basics of Paxata
Gather your data in the Library
Library overview
Sources of datasets
Import a local file
Tour the Library screen
Project overview
Video overview of Projects
Start a new Project
Tour the Projects screen
Overview of column operations
Overview of TOOLS
Share your prepped data as an AnswerSet
AnswerSet overview
Publish an AnswerSet
Export overview
Export your AnswerSet to your computer
Meet Paxata
What you can do with Paxata
Paxata provides a clean, familiar, spreadsheet-like feel. The challenge of prepping data is simplified to single clicks for each action. This provides a point-and-click experience that empowers you to quickly gather data, simply explore and prepare it, and then easily share it.
Quickly gather all your data into Paxata’s Library | Simply prepare your data in a Paxata Project – with clicks, not code | Easily publish your work as a Paxata AnswerSet™ for reliable analytics |
Video overview of Paxata
The video below provides a high-level overview of how you will use Paxata.
Tour the basics of Paxata
Here is an overview of the basic elements of Paxata. These elements are always available and provide quick access to common functions.
Element | Function |
---|---|
Account Menu | Access account-specific options like updating your password or logging out |
Help Toggle | Show or hide the Help Panel |
Help Panel | Get helpful information related to the current screen |
Navigation Menu | Navigate between the screens used to perform specific actions in Paxata. The primary screens are:
|
Notification Bell | Know when Paxata encounters a warning or error |
Gather your data in the Library
Library Overview
The Library is where you gather your data. Data, like an Excel spreadsheet, that is imported into the Library is called a dataset. Once you have imported a dataset, you can begin prepping your data in a Project. See the Prep your data in a Project section of this article. When you have finished prepping your data, you can publish it back to the Library as an AnswerSet. See the Share your prepped data as an AnswerSet section of this article.
Sources of datasets
Datasets can be imported from local files on your computer or from connected data sources. Some examples of connected data sources are:
- Cloud storage like Amazon S3
- The Hadoop Distributed File System (HDFS)
- Relational databases like MySQL
- Secure File Transfer Protocol (SFTP)
Import a local file
Follow these steps to import a dataset from your computer:
Step | Action |
---|---|
1 | On the Library screen, click + import |
2 | Click + Upload local file. Result: The Upload local file panel appears |
3 | To upload a file,
|
4 | Check the preview. Does your data look correct?
|
5 | Click Finish. Result: Your data is imported as a dataset and ready to be prepared |
Tour the Library screen
Now that there's data in your Library, here are the main sections of the Library:
For complete details of all the information provided on this page and the actions you can take, see the Library article.
Prep your data in a Project
Project overview
A Project is where you explore and prepare your data. The following is an overview of how you can explore and prepare your data in a Project.
Action | Description |
---|---|
Explore | Explore your data using dynamic visuals to highlight patterns, duplicates, blanks, errors, and missing data |
Clean | Clean your data by standardizing values, removing duplicates, finding and fixing errors, and more |
Shape | Shape your data using tools like pivot, transpose, group by, and more |
Combine | Combine additional datasets to enrich your data and provide more context |
Video overview of Projects
The video below provides a high-level overview of what you do in a Project.
Start a new Project
There are two ways you can start a new Project:
- from the Projects screen where you start an empty Project and then add your data to it
- from the Library where you select the dataset you know you want to use as the starting point for your Project
To start a new Project from the Projects screen:
Step | Action |
---|---|
1 | On the Projects screen, click + add |
2 | Enter a name for your Project in the Name field |
3 | To save your new Project and start prepping your data
|
To start a new Project from the Library:
- Open the Library page and locate the dataset that you want to use as your base dataset for your new Project.
- Hover over that dataset and click the Create Project button that displays:
Tour the Project Preparation screen
Before you begin preparing data, check out some of the useful areas of the Projects screen.
Element | Description |
---|---|
Project Name | The name you gave your Project will display here. In the above example, the Project name is Example Data Prep |
TOOLS | These are the tools you will use to clean, shape, combine, and ultimately prep your data. See the Overview of TOOLS section of this article |
Steps Editor | Every action you perform while prepping your data is logged as a step. The Steps Editor panel allows you to:
|
Versions Panel | Any time you save your Project, a new version is created. The Versions Panel gives you access to previous versions of your Project. See the Version History article |
Data Preview | Simply put, this is your data. You will see your data change as you prep it |
Column Operations | Opens the Column Operations Menu. These operations will help you clean and standardize your data. See the Overview of Column Operations section of this article |
Grid Tools and Status Updates | Grid Tools allow you to locate specific columns in your dataset, specify column widths, and adjust how cell text displays. Status Updates display when transformations that affect the Data Preview grid or filters are in progress. Note: the number of tasks displayed in the update messages may change dynamically as an operation progresses towards completion. |
Overview of Column Operations
The following is an overview of the operations you can perform to clean your data.
Operation | Description |
---|---|
FILTER | The FILTER operation combines functionality of filters with the power of histograms. The result is called a Filtergram™. With a Filtergram, you see the relative frequency of each value in a column and select values to temporarily hide some of your data. See the Data Filtergrams article |
CHANGE | The CHANGE operations allow you to standardize the values in a column. For example, you could change all numbers in a column to numeric values |
COLUMN | The COLUMN operations allow you to make changes to a column of data. You can do things like:
|
WHITESPACE | The WHITESPACE operations allow you to remove leading and trailing spaces as well as extra spaces within your data |
OTHER | The OTHER operation is Cluster + Edit. This operation allows you to find values in a column that are similar and edit them so they are the same. Example: Before using Cluster + Edit, Apple Computer appears in a column as "Apple Computer", "Apple Corporation", and "Apple Computer Corporation". After using Cluster + Edit, all instances of Apple Computer can be standardized to "Apple Computer". See the Cluster and Edit article. |
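The FILTER, WHITESPACE, and Cluster + Edit operations described above can be pictured with a short Python sketch. This is only an illustration of the ideas, not how Paxata implements them: the function names are hypothetical, and the word-overlap similarity and its 0.3 threshold are assumptions chosen so that the Apple Computer example groups together.

```python
from collections import Counter


def relative_frequencies(column):
    """Share of rows holding each distinct value, most common first.

    A rough stand-in for what a Filtergram displays for a column.
    """
    counts = Counter(column)
    total = len(column)
    return {value: count / total for value, count in counts.most_common()}


def normalize_whitespace(value):
    """Remove leading/trailing spaces and collapse extra inner spaces."""
    return " ".join(value.split())


def token_overlap(a, b):
    """Jaccard similarity of the lowercase word sets of two values (assumed metric)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def cluster_values(values, threshold=0.3):
    """Greedily group values that are similar to any member of an existing cluster."""
    clusters = []
    for value in values:
        for cluster in clusters:
            if any(token_overlap(value, member) >= threshold for member in cluster):
                cluster.append(value)
                break
        else:
            clusters.append([value])
    return clusters


def edit_cluster(column, cluster, replacement):
    """Replace every value that belongs to the cluster with one standard value."""
    members = set(cluster)
    return [replacement if value in members else value for value in column]


raw = ["Apple Computer", " Apple  Corporation", "Microsoft", "Apple Computer Corporation "]
column = [normalize_whitespace(value) for value in raw]

# Cluster the distinct values, then standardize the first cluster.
clusters = cluster_values(dict.fromkeys(column))
cleaned = edit_cluster(column, clusters[0], "Apple Computer")
frequencies = relative_frequencies(cleaned)

print(clusters)
print(cleaned)
print(frequencies)
```

In Paxata you review and edit clusters with clicks; the sketch only shows why cleaning whitespace first and then merging similar values changes the frequencies you see for a column.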
Overview of TOOLS
The following is an overview of the tools you will use to prepare your data.
Tool | Description |
---|---|
highlight | The more data you have, the harder it can be to notice small details. The highlight tools provide visual cues to help you see:
|
attach | When you need to add datasets to your Project, use the attach tool. Rows of data can be added to the bottom of your Project. If your datasets have a matching column of data, the additional data can be combined with the data in the Project. See the Lookup article |
columns | Sometimes you may want to make minor adjustments to your columns. The columns tools let you:
|
compute | There may be a time when you need to write an expression. Maybe you want to concatenate data from multiple cells into one value, or perform a mathematical operation based on data. The compute tool is how you do that. See the Computed Columns article |
remove | Part of cleaning data is removing information that is not needed. The remove tool lets you remove rows of data. See the Remove Rows article |
sampling | You may find it useful to work with a sample of a dataset before bringing all the data into your Project. For large datasets, this can make initial exploration and discovery easier. The sampling tool also gives you the flexibility to filter down to a specific set of rows in your data, and then sample on the remainder |
shape | Change the shape of your data using the shape tools. With these tools, you can:
|
auto # | The auto # tool assigns each row a number. This is helpful if you need to give each row a unique identifier |
new lens | Lenses create publishing points from Steps in your Project. When you publish from a Lens, the resulting AnswerSet is a snapshot at a particular Step in your Project. The AnswerSet is saved to the Library. Lenses are also essential for Project Automation because they define the publishing points to use for automated jobs. See the Project Lenses article |
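To make the two attach behaviors and the auto # tool concrete, here is a minimal Python sketch over rows stored as dictionaries. It is an illustration under stated assumptions, not Paxata's implementation: the function names and the sample columns (`customer_id`, `amount`, `name`) are hypothetical, and the lookup is assumed to keep every row of the Project and leave unmatched rows blank.

```python
def append_rows(base, extra):
    """Add the rows of another dataset to the bottom of the base dataset."""
    return base + extra


def lookup(base, other, key):
    """Combine columns from `other` into `base` where the `key` column matches.

    Assumption: every base row is kept; rows without a match get None
    in the added columns, and the first matching row in `other` wins.
    """
    index = {}
    for row in other:
        index.setdefault(row[key], row)

    # Collect the columns that the other dataset contributes.
    extra_columns = []
    for row in other:
        for column in row:
            if column != key and column not in extra_columns:
                extra_columns.append(column)

    result = []
    for row in base:
        match = index.get(row[key], {})
        merged = dict(row)
        for column in extra_columns:
            # Existing base columns are not overwritten.
            merged.setdefault(column, match.get(column))
        result.append(merged)
    return result


def auto_number(rows, column="row_id", start=1):
    """Give each row a unique sequential number."""
    return [{column: number, **row} for number, row in enumerate(rows, start)]


orders = [{"customer_id": "C1", "amount": 10}, {"customer_id": "C2", "amount": 25}]
more_orders = [{"customer_id": "C3", "amount": 5}]
customers = [{"customer_id": "C1", "name": "Ada"}, {"customer_id": "C3", "name": "Grace"}]

all_orders = append_rows(orders, more_orders)
enriched = lookup(all_orders, customers, "customer_id")
numbered = auto_number(enriched)

for row in numbered:
    print(row)
```

Appending grows the data downward, while a lookup grows it sideways; that difference is the main thing to keep in mind when choosing how to attach a dataset.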
Share your prepped data as an AnswerSet
AnswerSet Overview
When you're ready to save and share the data you prepped, publish it to the Library as an AnswerSet. An AnswerSet is like a dataset. The difference is an AnswerSet is the published result of your data prep. Once published, you can reuse the AnswerSet in other Projects or export the AnswerSet to share with other applications. Your published AnswerSet is always published to the Library. AnswerSets can also be created at any time and for any set of specific Steps in your Project using a Lens.
Publish an AnswerSet
Follow these steps to publish an AnswerSet:
Step | Action |
---|---|
1 | Click steps in the TOOLS menu. Result: The Steps Editor panel opens |
2 | Click the step you want to publish an AnswerSet from. Note: Paxata defaults to the last step in the Project |
3 | At the top of the Steps Editor, click Publish. Result: The Publish AnswerSet to Library window appears |
4 | Enter a name for the AnswerSet in the Name field |
5 | Click Publish. Result: The Publishing AnswerSet message appears. Paxata publishes an AnswerSet using the steps up to and including the selected step. The AnswerSet is published to the Library |
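The rule that an AnswerSet uses "the steps up to and including the selected step" can be sketched in Python as a list of step functions applied in order. This is a conceptual model only: the `publish` function, the step functions, and the `name` column are hypothetical and do not correspond to a Paxata API.

```python
def trim_names(rows):
    """Example step: remove leading and trailing spaces from the name column."""
    return [{**row, "name": row["name"].strip()} for row in rows]


def drop_blank_names(rows):
    """Example step: remove rows whose name is empty."""
    return [row for row in rows if row["name"]]


def publish(dataset, steps, selected=None):
    """Return a snapshot of the data after steps 0..selected (inclusive).

    When no step is selected, the last step is used, mirroring the
    default described in the table above.
    """
    if selected is None:
        selected = len(steps) - 1
    rows = [dict(row) for row in dataset]  # leave the base dataset untouched
    for step in steps[: selected + 1]:
        rows = step(rows)
    return rows


base_dataset = [{"name": "  Ada "}, {"name": ""}, {"name": "Grace"}]
steps = [trim_names, drop_blank_names]

after_first_step = publish(base_dataset, steps, selected=0)
after_all_steps = publish(base_dataset, steps)

print(after_first_step)
print(after_all_steps)
```

Publishing from an earlier step gives you a snapshot that still contains the blank row, which is the same idea a Lens captures at a particular Step.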
Export your prepped data
Export overview
Datasets and AnswerSets can be exported out of Paxata. Exporting lets you share your prepped data with other applications.
Export your AnswerSet to your computer
Follow these steps to download an AnswerSet or dataset locally:
Step | Action |
---|---|
1 | On the Library screen, hover your mouse over the AnswerSet you want to export |
2 | Click the Export button that displays |
3 | Click Download file locally. Result: The Export Settings screen appears |
4 | Click Export. Result: The AnswerSet is downloaded to your computer as a comma-separated values (CSV) file. The Export Log screen appears |
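Once the file is on your computer, any tool that reads CSV can use it. The following Python sketch reads an exported file with the standard `csv` module. The file name `example_answerset.csv`, the column names, the UTF-8 encoding, and the presence of a header row are all assumptions for illustration; the sketch writes a small stand-in file first so that it runs on its own.

```python
import csv
import os
import tempfile


def load_export(path):
    """Read an exported CSV file into a list of dictionaries, one per row.

    Assumes the first line is a header row and the file is UTF-8 encoded.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as folder:
    # Stand-in for a file downloaded from the Library (hypothetical name and contents).
    path = os.path.join(folder, "example_answerset.csv")
    with open(path, "w", newline="", encoding="utf-8") as handle:
        handle.write("customer_id,name\nC1,Ada\nC3,Grace\n")

    rows = load_export(path)

print(rows)
```

To use it with a real export, call `load_export` with the path of the file you downloaded. Note that every value is read as text, so convert numeric columns yourself if you need them as numbers.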
Glossary
The following are definitions for terms used in this document.
Term | Definition |
---|---|
AnswerSet | Like a dataset except that it is the published result of your data prep |
Base dataset | The data on which all other action in the Project will be performed |
Data source | The source of your dataset |
Dataset | Data that is imported into the Library is called a dataset |
Filtergram | The combination of the functionality of filters with the power of histograms |