(warning) The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.

Getting Started in Paxata

Meet Paxata

What you can do with Paxata

Paxata provides a clean, familiar, spreadsheet-like feel. The challenge of prepping data is simplified to single clicks for each action. This provides a point-and-click experience that empowers you to quickly gather data, simply explore and prepare it, and then easily share it.


  • Quickly gather all your data into Paxata’s Library
  • Simply prepare your data in a Paxata Project – with clicks, not code
  • Easily publish your work as a Paxata AnswerSet for reliable analytics


Video overview of Paxata

The following video provides a high-level overview of how you will use Paxata.


Tour the basics of Paxata

Here is an overview of the basic elements of Paxata. These elements are always available and provide quick access to common functions.


Element | Function
Account Menu | Access account-specific options like updating your password or logging out.
Help Toggle | Show or hide the Help Panel.
Help Panel | Get helpful information related to the current screen.
Navigation Menu | Navigate between the screens used to perform specific actions in Paxata. The primary screens are:
  • Library, where you access your imported and published data
  • Projects, where you prepare your data
  • Admin, where connections to data sources are made and user permissions are controlled
  • Automation, where details for automated datasets and Projects are provided
Note: The screens available to each user are based on the user's permissions.
Notification Bell | Know when Paxata encounters a warning or error.



Gather your data in the Library

Library Overview

The Library is where you gather your data. Data that you import into the Library, such as an Excel spreadsheet, is called a dataset. Once you have imported a dataset, you can begin prepping your data in a Project. See the Prep your data in a Project section of this article. When you have finished prepping your data, you can publish it back to the Library as an AnswerSet. See the Share your prepped data as an AnswerSet section of this article.

Sources of datasets


Datasets can be imported from local files on your computer or from connected data sources. Some examples of connected data sources are:

  • Cloud storage like Amazon S3
  • The Hadoop Distributed File System (HDFS)
  • Relational databases like MySQL
  • Secure File Transfer Protocol (SFTP)


Import a local file

Follow these steps to import a dataset from your computer:

Step | Action
1 | On the Library screen, click + import.
Result: The Import Data screen appears.
2 | Click + Upload local file.
Result: The Upload local file panel appears.
3 | To upload a file, either:
  • Click the Upload local file panel and select the file, or
  • Drag and drop the file into the Upload local file panel
Result: The Parsing screen appears. It displays a preview of how Paxata structured your data.
4 | Check the preview. Does your data look correct?
  • If yes, continue to the next step
  • If no, try adjusting the import options
5 | Click Finish.
Result: Your data is imported as a dataset and is ready to be prepared.
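
The preview check in step 4 can also be pictured outside Paxata. The sketch below uses Python's standard `csv` module purely as an illustration; it is not how Paxata parses files. The sample text, the `preview` helper, and the choice of the delimiter as the import option being adjusted are all assumptions. It shows how a wrong delimiter collapses each row into a single column, and how adjusting it restores the expected structure.

```python
import csv
import io

# Hypothetical file contents, semicolon-delimited on purpose.
SAMPLE = "name;city;amount\nAda;London;10\nGrace;New York;25\n"


def preview(text, delimiter=",", max_rows=5):
    """Return up to max_rows rows of text, parsed with the given delimiter."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for index, row in enumerate(reader):
        if index >= max_rows:
            break
        rows.append(row)
    return rows


# Default option: each row lands in one column, so the preview looks wrong.
wrong = preview(SAMPLE)

# Adjusted option: the columns separate as expected.
right = preview(SAMPLE, delimiter=";")

for row in right:
    print(row)
```

If the preview rows each contain a single long value, the parsing option is the likely culprit rather than the data itself.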


Tour the Library screen

Now that there's data in your Library, here are the main sections of the Library:

For complete details of all information provided on this page and the actions you can take, see the Library article.




Prep your data in a Project

Project overview

A Project is where you explore and prepare your data. The following is an overview of how you can explore and prepare your data in a Project.

Action | Description
Explore | Explore your data using dynamic visuals to highlight patterns, duplicates, blanks, errors, and missing data.
Clean | Clean your data by standardizing values, removing duplicates, finding and fixing errors, and more.
Shape | Shape your data using tools like pivot, transpose, group by, and more.
Combine | Combine additional datasets to enrich your data and provide more context.


Video overview of Projects

The following video provides a high-level overview of what you do in a Project.


Start a new Project

There are two ways you can start a new Project:

  • from the Projects screen where you start an empty Project and then add your data to it
  • from the Library where you select the dataset you know you want to use as the starting point for your Project

To start a new Project from the Projects screen:

Step | Action
1 | On the Projects screen, click + add.
Result: The Start a New Project window appears.
2 | Enter a name for your Project in the Name field.
3 | To save your new Project and start prepping your data:
  • Click Save and Open.
    Result: The Start with a dataset screen appears. This is where you select the base dataset. A base dataset forms the foundation of your Project. It is the data on which all other actions in the Project will be performed.
  • Locate the dataset you want to add. If you can find it, click SELECT next to the dataset. If you cannot find it, click the Datasets + import button at the top of the screen to import the dataset you want to use.
    Result: The Project opens with the selected base dataset, which is now ready for preparation.

To start a new Project from the Library:

  1. Open the Library page and locate the dataset that you want to use as your base dataset for your new Project. 
  2. Hover over that dataset and click the Create Project button that appears.


Tour the Project Preparation screen

Before you begin preparing data, check out some of the useful areas of the Projects screen.

Element | Description
Project Name | The name you gave your Project displays here. In the above example, the Project name is Example Data Prep.
TOOLS | These are the tools you will use to clean, shape, combine, and ultimately prep your data. See the Overview of TOOLS section of this article.
Steps Editor | Every action you perform while prepping your data is logged as a step. The Steps Editor panel allows you to:
  • View your steps in order
  • Mute a step
  • Edit what happens during a step
  • Rearrange the order of your data preparation steps
  • Delete steps
See the Steps Editor article.
Versions Panel | Any time you save your Project, a new version is created. The Versions Panel gives you access to previous versions of your Project. See the Version History article.
Data Preview | Simply put, this is your data. You will see your data change as you prep it.
Column Operations | Opens the Column Operations Menu. These operations help you clean and standardize your data. See the Overview of Column Operations section of this article.
Grid Tools and Status Updates | Grid Tools allow you to locate specific columns in your dataset, specify column widths, and adjust how cell text displays. Status Updates display when transformations that affect the Data Preview grid or filters are in progress. Note: The number of tasks displayed in the update messages may change dynamically as an operation progresses towards completion.
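
The Steps Editor behavior described above (ordered steps, muting, rearranging) can be pictured with a small Python sketch. This is only a conceptual model, not Paxata's implementation: the `Step` class, the `run` function, and the two sample steps are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Rows = List[Dict[str, str]]


@dataclass
class Step:
    """One logged preparation step (hypothetical model)."""
    name: str
    apply: Callable[[Rows], Rows]
    muted: bool = False


def run(steps, rows):
    """Apply each step in order, skipping muted ones."""
    for step in steps:
        if not step.muted:
            rows = step.apply(rows)
    return rows


def trim_names(rows):
    # Remove leading and trailing spaces from the "name" column.
    return [{**row, "name": row["name"].strip()} for row in rows]


def drop_blank_names(rows):
    # Keep only rows whose "name" is not an empty string.
    return [row for row in rows if row["name"]]


data = [{"name": " Ada "}, {"name": "   "}, {"name": "Grace"}]

trim = Step("trim names", trim_names)
drop = Step("drop blank names", drop_blank_names)

# Original order: trim first, so the whitespace-only name becomes blank and is dropped.
in_order = run([trim, drop], data)

# Rearranged: the whitespace-only name is not yet blank when the drop step runs.
reordered = run([drop, trim], data)

# Muting a step skips it without deleting it.
drop.muted = True
muted = run([trim, drop], data)

print([row["name"] for row in in_order])
```

The sketch makes one point concrete: because steps run in sequence, rearranging or muting a step can change the result of every step after it.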

Overview of Column Operations

The following is an overview of the operations you can perform to clean your data.

Operation | Description
FILTER | The FILTER operation combines the functionality of filters with the power of histograms. The result is called a Filtergram. With a Filtergram, you see the relative frequency of each value in a column and select values to temporarily hide some of your data. See the Data Filtergrams article.
CHANGE | The CHANGE operations allow you to standardize the values in a column. For example, you could change all numbers in a column to numeric values.
COLUMN | The COLUMN operations allow you to make changes to the column of data. You can do things like:
  • Split values into multiple columns based on a delimiter character or a given number of characters. See the Split Column article
  • Find and replace specified values in the column
  • Duplicate the column
WHITESPACE | The WHITESPACE operations allow you to remove leading and trailing spaces as well as extra spaces within your data.
OTHER | The OTHER operation is Cluster + Edit. This operation allows you to find values in a column that are similar and edit them so they are the same.
Example: Before using Cluster + Edit, Apple Computer appears in a column as "Apple Computer", "Apple Corporation", and "Apple Computer Corporation". After using Cluster + Edit, all instances of Apple Computer can be standardized to "Apple Computer".
See the Cluster and Edit article.
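
The Cluster + Edit example above can be approximated in a few lines of Python. This article does not describe the clustering method Paxata uses, so the sketch below swaps in a simple assumption: two values are considered similar when they share at least half of their words, and similar values are grouped transitively. The function names, the threshold, and the extra value "Banana Inc" are hypothetical.

```python
from itertools import combinations


def tokens(value):
    """Lowercased set of words in a value."""
    return set(value.lower().split())


def similarity(a, b):
    """Word overlap (Jaccard index) between two values, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not (ta | tb):
        return 0.0
    return len(ta & tb) / len(ta | tb)


def cluster(values, threshold=0.5):
    """Group distinct values whose similarity chains meet the threshold."""
    parent = {value: value for value in values}

    def find(value):
        # Follow parent links to the representative of the group.
        while parent[value] != value:
            value = parent[value]
        return value

    for a, b in combinations(values, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for value in values:
        groups.setdefault(find(value), []).append(value)
    return list(groups.values())


column = [
    "Apple Computer",
    "Apple Corporation",
    "Apple Computer Corporation",
    "Banana Inc",
    "Apple Computer",
]

# Cluster the distinct values, keeping their first-seen order.
clusters = cluster(list(dict.fromkeys(column)))

# Edit: standardize every member of the first cluster to one spelling.
mapping = {value: "Apple Computer" for value in clusters[0]}
cleaned = [mapping.get(value, value) for value in column]

print(clusters)
print(cleaned)
```

Note that "Apple Computer" and "Apple Corporation" share only one of three words, so they end up in the same cluster only because both are similar to "Apple Computer Corporation". A stricter method could keep them apart, which is why the edit is applied only to a cluster you have chosen.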

Overview of TOOLS

The following is an overview of the tools you will use to prepare your data.

Name | Description
highlight | The more data you have, the harder it can be to notice small details. The highlight tools provide visual cues to help you see:
  • Patterns
  • Spaces
  • Ranges
attach | When you need to add additional datasets to your Project, use the attach tool. Rows of data can be added to the bottom of your Project. If your datasets have a matching column of data, the additional data can be combined with the data in the Project. See the Lookup article.
columns | Sometimes you may want to make minor adjustments to your columns. The columns tools let you:
  • Edit your column names
  • Rearrange your column order
  • Remove columns
compute | There may be a time when you need to write an expression. Maybe you want to concatenate data from multiple cells into one value, or perform a mathematical operation based on data. The compute tool is how you do that. See the Computed Columns article.
remove | Part of cleaning data is removing information that is not needed. The remove tool lets you remove rows of data. See the Remove Rows article.
sampling | You may find it useful to work with a sample of a dataset before bringing all the data into your Project. For large datasets, this can make initial exploration and discovery easier. The sampling tool also gives you the flexibility to filter down to a specific set of rows in your data, and then sample on the remainder.
shape | Change the shape of your data using the shape tools. With these tools, you can:
  • Deduplicate
  • Group data
  • Pivot
  • Depivot
  • Transpose
See the Data Shaping Tools article.
auto # | The auto # tool assigns each row a number. This is helpful if you need to give each row a unique identifier.
new lens | Lenses create publishing points from Steps in your Project. When you publish from a Lens, the resulting AnswerSet is a snapshot at a particular Step in your Project. The AnswerSet is saved to the Library. Lenses are also essential for Project Automation because they define the publishing points to use for automated jobs. See the Project Lenses article.
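
The two ways of using the attach tool described above, adding rows to the bottom and combining data through a matching column, can be sketched in Python. This is an illustration under assumptions, not Paxata's implementation: the sample data and the `append_rows` and `lookup` helpers are hypothetical, and the sketch assumes that rows without a match keep an empty value in the new columns.

```python
project = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
more_rows = [{"id": 3, "name": "Linus"}]
cities = [{"id": 1, "city": "London"}, {"id": 3, "city": "Helsinki"}]


def append_rows(base, extra):
    """Add the rows of extra to the bottom of base."""
    return base + extra


def lookup(base, other, key):
    """Add the columns of other to base where the key column matches."""
    index = {row[key]: row for row in other}
    extra_columns = [col for col in other[0] if col != key] if other else []
    combined = []
    for row in base:
        match = index.get(row[key], {})
        # Assumption: unmatched rows get None in the added columns.
        added = {col: match.get(col) for col in extra_columns}
        combined.append({**row, **added})
    return combined


appended = append_rows(project, more_rows)
combined = lookup(appended, cities, key="id")

for row in combined:
    print(row)
```

Appending changes the number of rows, while a lookup changes the number of columns, which is a useful way to decide which form of attach fits your data.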



Share your prepped data as an AnswerSet

AnswerSet Overview

When you're ready to save and share the data you prepped, publish it to the Library as an AnswerSet. An AnswerSet is like a dataset. The difference is that an AnswerSet is the published result of your data prep. Once published, you can reuse the AnswerSet in other Projects or export the AnswerSet to share with other applications. An AnswerSet is always published to the Library. You can also use a Lens to create an AnswerSet at any time from a specific Step in your Project.

Publish an AnswerSet

Follow these steps to publish an AnswerSet:

Step | Action
1 | Click steps in the TOOLS menu.
Result: The Steps Editor panel opens.
2 | Click the step you want to publish an AnswerSet from.
Note: Paxata defaults to the last step in the Project.
3 | At the top of the Steps Editor, click Publish.
Result: The Publish AnswerSet to Library window appears.
4 | Enter a name for the AnswerSet in the Name field.
5 | Click Publish.
Result: The Publishing AnswerSet message appears. Paxata publishes an AnswerSet using the steps up to and including the selected step. The AnswerSet is published to the Library.
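
The rule in step 5, that the AnswerSet reflects the steps up to and including the selected step, can be expressed as a short Python sketch. It is a conceptual model only: the `publish` function and the three sample steps are hypothetical and do not represent a Paxata API.

```python
def trim(values):
    # Step 1: remove leading and trailing spaces.
    return [value.strip() for value in values]


def drop_blanks(values):
    # Step 2: remove empty values.
    return [value for value in values if value]


def uppercase(values):
    # Step 3: convert values to upper case.
    return [value.upper() for value in values]


def publish(steps, values, selected=None):
    """Apply steps up to and including the selected index.

    When no step is selected, default to the last step in the list.
    """
    if selected is None:
        selected = len(steps) - 1
    for step in steps[: selected + 1]:
        values = step(values)
    return values


steps = [trim, drop_blanks, uppercase]
data = [" ada ", "  ", "grace"]

snapshot = publish(steps, data, selected=1)  # trim and drop_blanks only
final = publish(steps, data)                 # all three steps

print(snapshot)
print(final)
```

Publishing from an earlier step leaves later steps out of the result, which is the same idea a Lens uses to define a publishing point.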



Export your prepped data

Export overview

Datasets and AnswerSets can be exported out of Paxata. Exporting amplifies your ability to get the most out of your data.

Export your AnswerSet to your computer

Follow these steps to download an AnswerSet or dataset locally:

Step | Action
1 | On the Library screen, hover your mouse over the AnswerSet you want to export.
2 | Click the Export button that appears.
Result: The Exporting screen appears.
3 | Click Download file locally.
Result: The Export Settings screen appears.
4 | Click Export.
Result: The AnswerSet is downloaded to your computer as a comma-separated values (CSV) file. The Export Log screen appears.
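
For readers unfamiliar with the comma-separated values format mentioned in step 4, the sketch below shows what such a file generally looks like, using Python's standard `csv` module. The sample rows and the `to_csv` helper are hypothetical, and the exact quoting and line endings of a file exported by Paxata may differ.

```python
import csv
import io

# Hypothetical prepared data standing in for an AnswerSet.
answerset = [
    {"name": "Ada", "city": "London"},
    {"name": "Grace", "city": "New York, NY"},
]


def to_csv(rows):
    """Serialize a list of row dictionaries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv(answerset)
print(csv_text)
```

A value that itself contains a comma is wrapped in quotation marks, so other applications can still split the columns correctly.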


Glossary

The following are definitions of terms used in this document.

Term | Definition
AnswerSet | Like a dataset, except that it is the published result of your data prep
Base dataset | The data on which all other actions in the Project will be performed
Data source | The source of your dataset
Dataset | Data that is imported into the Library
Filtergram | The combination of the functionality of filters with the power of histograms
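
The Filtergram definition above can be made concrete with a small Python sketch: count how often each value appears in a column (the histogram part), then hide selected values without changing the underlying data (the filter part). This is only an illustration of the idea. The column values and the `filtergram` and `visible_rows` helpers are hypothetical.

```python
from collections import Counter

# Hypothetical column of values.
column = ["CA", "NY", "CA", "TX", "CA", "NY"]


def filtergram(values):
    """Return each distinct value with its relative frequency, most common first."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.most_common()}


def visible_rows(values, hidden):
    """Return the values that remain visible; the input list is left unchanged."""
    return [value for value in values if value not in hidden]


frequencies = filtergram(column)
shown = visible_rows(column, hidden={"TX"})

# Print a simple text histogram, one bar per distinct value.
for value, share in frequencies.items():
    print(f"{value:<3} {'#' * round(share * 12)}")
```

Because the filter only hides rows, clearing the selection brings all of the data back into view.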