(warning) The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.

Automatic Project Flows (APF)

Your Paxata Administrator must enable this feature in your application.


Introduction

Automatic Project Flows, or APF, allows you to intelligently operationalize curated data flows. With a single click, APF computes the entire sequence of data prep Steps across Paxata Projects, datasets and AnswerSets to produce an end-to-end, automated output Flow for your data. You can set the Flow to run on a recurring time-based schedule, or run it just once to produce an end-result AnswerSet. All runs can then be easily managed through the Monitoring Interface.

How does APF help you with your data prep work?
Business Analysts and Data Engineers can simplify complex data flows by breaking them into smaller groups of Paxata Projects that can be operationalized—with each Project focused on performing a related or cohesive set of Steps for improved readability and limited complexity. When you're finished creating your Projects, simply select the final Project in the sequence as your "target" Project. APF takes care of the rest—sequencing, preparing and automating the entire end-to-end flow without any manual stitching required.

How does APF help your team with its data prep work?
A team that requires input from both Business and IT can simplify the data prep process when members build Paxata Projects that depend on output AnswerSets created by others. Everyone completes their data prep work in their own Paxata Project, and then the entire sequence is operationalized from a single "target" Project. APF takes care of the rest with no manual stitching required, regardless of who created or owns the Projects and AnswerSets. Members of the team can then use the APF Monitoring Interface to view how their Projects and AnswerSets participate in the Flow's final output.

Example of APF in action

In the example illustrated above, the end-state "Sales Variance Report" is produced from a series of Paxata Projects and AnswerSets produced by multiple people. Bob connects to the data lake for his "Product Hierarchy" data, preps and produces an AnswerSet that is shared with Susan who, additionally, pulls in "Sales Transaction history" data from a Cloud application. She preps all of this data and produces an AnswerSet, which is then shared with you for the Sales Variance Project that you maintain. In addition to the AnswerSet from Susan, you also need to combine data from an Excel report that you pull in from a cloud storage system. When you're finished with your data prep, you then produce a "Sales Variance Report" AnswerSet. Because you need to produce this report each week, the APF feature makes your data prep work a breeze. You simply click the "Create Project Flow" button in your Sales Variance Project, configure a time-based trigger for running the Flow, and APF takes care of the rest by intelligently traversing back through the Flow of related Projects, AnswerSets and datasets to create the dependency chain required to produce your end-state AnswerSet. You can then use the APF Monitoring Interface to manage all subsequent runs of your Flow.
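
The dependency chain in this example can be pictured as a small graph. The following is a minimal sketch of that idea, not Paxata's implementation: the item names come from the example, while the `DEPENDS_ON` mapping and the `flow_order` function are hypothetical. It walks back from the target and orders the upstream items so that each one comes after its inputs (it uses `graphlib`, available in Python 3.9 and later).

```python
from graphlib import TopologicalSorter

# Hypothetical model of the example: each item maps to the items it consumes.
DEPENDS_ON = {
    "Product Hierarchy AnswerSet": {"Product Hierarchy data"},
    "Susan's AnswerSet": {"Product Hierarchy AnswerSet", "Sales Transaction history"},
    "Sales Variance Report": {"Susan's AnswerSet", "Excel report"},
}

def flow_order(target, depends_on):
    """Return the target and its upstream items, ordered so inputs come first."""
    needed, stack = {}, [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed[node] = set(depends_on.get(node, ()))
            stack.extend(needed[node])
    # TopologicalSorter takes {node: predecessors} and yields predecessors first.
    return list(TopologicalSorter(needed).static_order())

order = flow_order("Sales Variance Report", DEPENDS_ON)
print(order)
```

The target always comes last in the result, because every other item is one of its direct or indirect inputs.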

A few important things to note about APF:

  • You must have permissions to all of the datasets and all of the Projects in the Flow before creating a Flow; otherwise, the Flow will never run successfully.
    Note: if a user has permissions to an AnswerSet, but not the Project from which that AnswerSet was produced, the user can still create a Flow up to the point at which the user ceases to have the "read" permission. This flexibility in Flow creation enables users to independently manage the operationalization of Flows for the portions they have permissions to access.

  • You must also have permissions to all of the datasets and all of the Projects in the Flow in order to manage it from the Monitoring Interface. Your Paxata System Administrator provides these permissions.

  • Anything produced downstream from your target Project is not included in your defined Flow. For example, returning to the illustration above, if there were a Project that consumed your "Sales Variance Report" AnswerSet, then that Project would never be included in the Flow—the target Project is always the end point for a Flow.
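
The permission boundary and the downstream rule described above can be sketched as a bounded graph walk. This is an illustrative model only: the `flow_members` function, the `depends_on` mapping, the `readable` set and the "Downstream Project" name are all assumptions made for the example, not part of the product.

```python
def flow_members(target, depends_on, readable):
    """Collect the target and every upstream item the user can read.

    The walk stops at any item the user cannot read, so nothing behind it
    is included. Downstream consumers are never visited because the walk
    only follows dependencies backwards from the target.
    """
    members, stack = set(), [target]
    while stack:
        node = stack.pop()
        if node in members or node not in readable:
            continue
        members.add(node)
        stack.extend(depends_on.get(node, ()))
    return members

# Hypothetical layout: each item maps to the items it depends on.
DEPENDS_ON = {
    "Sales Variance Project": ["Susan's AnswerSet", "Excel report"],
    "Susan's AnswerSet": ["Susan's Project"],
    "Susan's Project": ["Product Hierarchy AnswerSet", "Sales Transaction history"],
    "Downstream Project": ["Sales Variance Project"],  # consumes the target's output
}
# The user can read Susan's AnswerSet but not the Project that produced it.
READABLE = {"Sales Variance Project", "Susan's AnswerSet", "Excel report", "Downstream Project"}

members = flow_members("Sales Variance Project", DEPENDS_ON, READABLE)
print(sorted(members))
```

In this sketch the Flow ends at Susan's AnswerSet on the upstream side and at the target Project on the downstream side.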



Steps to Set Up a Project Flow

Creating a Project Flow is as simple as opening your "target" Project—the Project that will produce your end-state AnswerSet—and clicking the "Create Project Flow" button in the top right-hand corner of that Project.
Note: APF is a feature that must be enabled. If you do not see this button in your Project, contact your Paxata System Administrator.


You are prompted to provide a name and optional description for the Flow. By default, your target Project's name is used for the Flow, but you can change it here. Then click the Create button.

The intelligent automation engine then calculates all of the Flow dependencies for you and presents the Configuration Interface where you set the triggers and notifications, and tweak any settings for the Flow's input and output datasets. See the next section for an explanation of the Configuration Interface.



APF Configuration Interface

The Configuration Interface has three tabs where you configure the settings for your Flow. It is presented when you first create a Flow and is also opened when you choose to "edit" any of the saved Flows displayed in the Monitoring Interface. The three tabs are described below. Note that the "Graph", "Actions", "Discard Changes" and "Save" buttons are always present in the Configuration Interface and provide common actions you can take for all Flows.


General tab

  • Update the Name and Description of a Flow that you've created.
  • Specify the triggers to run your Flow. The triggers are time and frequency based. You can also use the custom option to provide a cron expression for the trigger.
  • Provide email addresses for run status. Separate each address you add with a comma.
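
For readers unfamiliar with cron expressions, here is a minimal sketch of how one breaks down into fields. It assumes the common five-field Unix layout; this document does not state which cron dialect the custom trigger accepts (some schedulers add a seconds field, for example), so check the product before relying on this layout. The `describe_cron` helper is hypothetical.

```python
# Field names for the common five-field Unix cron layout (an assumption here).
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")

def describe_cron(expression):
    """Split a five-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))

# 06:00 every Monday, which would suit a weekly report.
weekly = describe_cron("0 6 * * 1")
print(weekly)
```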

Note: as soon as a Flow is created, a Project Flow ID will also display on the General tab. This ID is used to identify the Flow for REST API calls and also for any required troubleshooting of the Flow.


Inputs tab



The Inputs tab provides a list of all the datasets used in the Flow, the versions of those datasets that are used to create the Flow, and the Projects in which each dataset is used. There are three actions you can take on this tab:

  • specify that a dataset is automatically reimported every time the Flow is run: by default, all Projects are configured to use the latest version of a dataset saved in the Library. However, newer versions of a dataset may be available from the original data source before a new version of it is manually imported to the Paxata Library. In this case, you can configure a dataset to be automatically reimported from its original data source every time the Flow is run. This latest version will then always be saved in the Library. To enable this automatic update, click the "Reimport dataset on run" option. When the option is enabled, a button for "Configure Reimport Options" also displays. The button opens the Library import panel where you can change the data source path or query and any parse options. Note that these options are always saved with the dataset in the Library and you only need to configure them here if you want to change the current settings.

  • configure a dataset's version to use for the Project: by default, all Projects are configured to use the latest versions of datasets saved in the Library. However, you may want to change this default behavior, which can be done when you click the Edit button (in the "Options for Datasets as used in Projects" column):
    • Pin to version: specify that the dataset remains the exact version currently used by the Project.
    • Fail if columns changed: specify that the dataset should fail to import into the Project if the latest version coming in from the Library has a different layout (schema), for example new columns added, columns removed that are not used in the Project's Steps, different column types for existing columns, or a new column order.

  • if more than one Project is using the same dataset as input for the Flow, then this is noted in the Projects column. Click "See All Projects" to view all of the Projects using the dataset and to, optionally, configure different versions of the dataset to use per Project—for example, you can specify that one Project use the latest version of the dataset from the Library, while another Project uses the exact version of the dataset currently saved in the associated version of the Project. See below for explanation of configuration options.
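
To make the "Fail if columns changed" idea concrete, the following sketch compares two schemas and lists the kinds of layout differences named above. It is a simplified illustration under stated assumptions: the `schema_changes` function and the sample schemas are hypothetical, and the sketch does not model which columns a Project's Steps actually use.

```python
def schema_changes(pinned, latest):
    """Compare two schemas given as ordered lists of (column, type) pairs."""
    old, new = dict(pinned), dict(latest)
    changes = []
    changes += [f"added: {c}" for c in new if c not in old]
    changes += [f"removed: {c}" for c in old if c not in new]
    changes += [f"type changed: {c}" for c in old if c in new and old[c] != new[c]]
    # Compare the relative order of the columns that exist in both versions.
    shared_old = [c for c, _ in pinned if c in new]
    shared_new = [c for c, _ in latest if c in old]
    if shared_old != shared_new:
        changes.append("column order changed")
    return changes

pinned = [("sku", "string"), ("qty", "number")]
latest = [("qty", "string"), ("sku", "string"), ("region", "string")]
changes = schema_changes(pinned, latest)
print(changes)  # prints ['added: region', 'type changed: qty', 'column order changed']
```

An empty list means the layouts match, which is the case in which an import guarded this way would proceed.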

Note: you can easily determine metadata statistics for the dataset inputs by hovering your mouse over a dataset name in the DATASETS column. The dataset's version, creation date and user who added it to the Library, and the number of columns and rows are displayed in a pop-up window.


Outputs tab

The Outputs tab provides a list of all the output AnswerSets that are published from the Flow. Because a publishing lens is always required to create a publishing point from a Paxata Project, all of the outputs are configured at the lens level. There are times when your Flow may include a Project that has multiple lenses, but not all of those lenses are required to produce output AnswerSets required for the Flow. By default, only required lenses automatically publish AnswerSets that are saved in the Library. However, if you'd like to enable the publish for AnswerSets that are not required for the Flow, then you can enable them here.
Note: lenses that produce output AnswerSets required for the Flow can never be disabled.

In addition to adjusting the publish options for non-essential AnswerSets, you may choose to publish any lens output AnswerSet to an external data source, for example a database, cloud storage system, etc. To specify a publish location in addition to the Paxata Library, click the Configure Lens button to open the Exports panel.

There are two actions you can take on this tab:

  • disable a non-bridging lens (that is, a lens whose AnswerSet is not required by the Flow) to prevent it from publishing AnswerSets to the Library. Click the slider adjacent to that lens to disable it.

  • export the published AnswerSet to a data source (in addition to the default Library setting). Click the Configure Lens button for the lens and the Export panel opens at the bottom of the page. By default, AnswerSets are published to the Paxata Library. To also publish out to an external data source, click the drop-down for the Export Lens field and select "Library and Export". You can then specify the output location details and any export parsing options for that AnswerSet.
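
The publishing rules on this tab can be summarized in a small model. This is a sketch only: the `LensOutput` class, its field names and the lens names are hypothetical, and it assumes that optional lenses start out disabled, as the default described above suggests.

```python
from dataclasses import dataclass

@dataclass
class LensOutput:
    name: str
    required: bool                 # its AnswerSet is needed elsewhere in the Flow
    enabled: bool = False          # user toggle for lenses that are not required
    destination: str = "Library"   # or "Library and Export"

    def publishes(self):
        """Required lenses always publish; optional ones only when enabled."""
        return self.required or self.enabled

    def set_enabled(self, enabled):
        # A lens the Flow depends on can never be switched off.
        if self.required and not enabled:
            raise ValueError(f"lens {self.name!r} is required by the Flow")
        self.enabled = enabled

bridge = LensOutput("Bridging lens", required=True)
extra = LensOutput("Audit lens", required=False)
extra.set_enabled(True)
extra.destination = "Library and Export"
print(bridge.publishes(), extra.publishes())  # prints True True
```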



APF Monitoring Interface

The Monitoring Interface is used to monitor the status of all Flows. The interface is organized by Snapshots, Runs, and Chores because these are the key components for generating a Flow's output.

The following diagram illustrates how to locate the monitoring information you need for a Flow. Each of these pages is explained below the diagram.

1. Project Flows page lists all of the Flows that you have permissions to view and edit, and the current status of the most recent run for each (succeeded, failed, etc.). There are three actions you can take from this page:

  • Edit the configuration details for the Flow. Click the Edit button to open the APF Configuration Interface where you can make any adjustments to the configuration. See the APF Configuration Interface section of this document for all configuration options.
  • Run the Flow manually by clicking the Run button. A manual start of a Flow is particularly useful if you need to test out a new Flow, or a configuration change to the Flow, and don't want to wait for the time-based trigger to start it.
  • Show all of the Snapshots for the Flow. Click Show all Snapshots to open the Snapshots Panel. 

In addition, the More Actions option on the page allows you to:

  • view the Permissions required to work with this Flow. You will need to know these Permission settings if you want to share this Flow with another person. Note that Permissions are only visible to the user who created the Flow or to users with whom the creator has shared all of the permissions.
  • quickly jump to the latest Run Details for the Flow. Note this will not display until there is at least one run of the Flow.

2. Snapshots page lists all of the Snapshots for a Flow. Every time a Flow is executed, which is called a "run" of the Flow, a Snapshot is created to capture the configuration settings used to create the output for the run. The runs will continue with this Snapshot until any configuration changes are made to the Flow—for example changes to the schedule, notifications, inputs, output settings, etc. Then a new Snapshot is created for the Flow and the new Snapshot captures all of the executed runs with the modified configuration settings. 
Snapshots provide clear audit-ability of the exact state of a Project Flow for each run.
Important: a new Snapshot is not created if datasets are configured to use the latest version from the Library. See the Inputs section for dataset configuration options.
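
One way to picture the relationship between Snapshots and runs is to fingerprint the configuration and start a new Snapshot only when the fingerprint changes. The sketch below is an assumption-laden illustration, not the APF engine: `config_fingerprint`, `FlowHistory` and the configuration layout are hypothetical. Because an input set to "latest" is stored as a setting rather than a version number, a newer dataset version arriving in the Library does not change the fingerprint.

```python
import hashlib
import json

def config_fingerprint(config):
    """Return a stable hash of a Flow configuration dict (hypothetical layout)."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

class FlowHistory:
    """Groups run ids under Snapshots keyed by configuration fingerprint."""

    def __init__(self):
        self.snapshots = []  # list of (fingerprint, [run_id, ...]) in creation order

    def record_run(self, config, run_id):
        fingerprint = config_fingerprint(config)
        # Start a new Snapshot only when the configuration differs from the latest one.
        if not self.snapshots or self.snapshots[-1][0] != fingerprint:
            self.snapshots.append((fingerprint, []))
        self.snapshots[-1][1].append(run_id)

history = FlowHistory()
config = {"schedule": "weekly", "inputs": {"Excel report": "latest"}}
history.record_run(config, run_id=1)
history.record_run(config, run_id=2)   # same settings: same Snapshot
config["schedule"] = "daily"
history.record_run(config, run_id=3)   # changed schedule: new Snapshot
print([runs for _, runs in history.snapshots])  # prints [[1, 2], [3]]
```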

There are two actionable items on this page:

  • The View button opens a read-only view of the APF Configuration Interface where you can view all of the configuration settings for a Snapshot.
  • The Show All Runs button opens the runs list page, which details every run for the Snapshot. See below for details of the Runs page.

3. Run List page captures all details for each individual run under a Snapshot. The number of discrete chores that must be completed to finish the run (for example, publishing a dependency AnswerSet) is listed on the page. Every time a Flow is run, a new run entry displays on this page. Important: if there is no change to the data used to create the Flow (for example, all of the datasets used in the Flow remain exactly the same version as in the previous run), then the APF engine conserves resources and does not re-run the Flow until new data inputs are available. There is one actionable item on this page:

  • The View button opens a read-only view of the APF Configuration Interface where you can view all of the configuration settings associated with a run.
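
The resource-saving behavior described for the Run List page amounts to comparing the input versions of the previous run with the current ones. A minimal sketch, assuming inputs are tracked as a mapping of dataset name to version number (the `should_run` function and the sample data are hypothetical):

```python
def should_run(previous_inputs, current_inputs):
    """Return True on the first run or when any input dataset version changed."""
    return previous_inputs is None or previous_inputs != current_inputs

last_run = {"Product Hierarchy": 4, "Sales Transaction history": 12}
unchanged = {"Product Hierarchy": 4, "Sales Transaction history": 12}
updated = {"Product Hierarchy": 4, "Sales Transaction history": 13}

print(should_run(None, unchanged))      # prints True
print(should_run(last_run, unchanged))  # prints False
print(should_run(last_run, updated))    # prints True
```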


Important: the APF quotas meter displays at the top of the Flows page to indicate your usage. When hovering over any one of the counts for Daily, Weekly or Monthly, a tooltip displays to provide details of your current usage and limit.

Note that quotas are based on "chore" count, and chores are defined as:

  • the running of an individual Project that is required to produce a Flow.
  • performing an import (but not a publish) of any dataset or AnswerSet that is required to produce a Flow.

The sum of all chores ultimately produces the output for your Flow. While a Flow is in the process of running, you will need to refresh your browser to update the quotas meter on the Flow's page. If you need your quotas for chore count increased, please contact your Paxata Administrator or Paxata Customer Success.
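
As a rough illustration of how chore counts relate to a quota, the sketch below counts one chore per Project run and one per import, then estimates how many more full runs fit in the remaining quota. The function names, the sample Project and dataset names, and the quota figures are all hypothetical; actual limits are set for your installation.

```python
def chores_per_run(projects, imports):
    """One chore per Project executed plus one per dataset or AnswerSet import."""
    return len(projects) + len(imports)

def runs_within_quota(quota, used, per_run):
    """Return how many more full runs fit in the remaining chore quota."""
    if per_run <= 0:
        raise ValueError("per_run must be positive")
    return max(quota - used, 0) // per_run

projects = ["Product Hierarchy", "Sales Transactions", "Sales Variance"]
imports = ["Excel report", "Sales Transaction history"]

per_run = chores_per_run(projects, imports)
print(per_run)                              # prints 5
print(runs_within_quota(100, 62, per_run))  # prints 7
```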



Common Actions for Flows

The following actions can be taken for all saved Flows:


Generate a visual graph for Flow
The Graph button generates an APF graph in a new browser window that displays the datasets and how they flow into the individual Projects used to generate a Flow's final output AnswerSet.

Hovering over any dataset or Project in the Flow also displays the corresponding downstream lineage (in pink) and upstream dependencies (in blue). 

For example, when hovering over the dataset for March 2016 Transactions:


When hovering over an intermediate Project in the Flow—in this example Customer Loyalty-Women Members—the upstream dependencies display through the blue lines while the downstream lineage displays through the pink lines.

Notice in both examples that if datasets and Projects do not participate in the portion of the Flow that you've selected, then they are grayed out in the graph.

Note that you may see a dotted line in a graph for some Flows. The dotted line indicates that an AnswerSet was published from a Project in the Flow, and then later consumed again by the same or another Project in the Flow. This is referred to as a looping input and is represented by the dotted line.
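
The hover behavior in the graph can be thought of as two reachability searches from the selected node. The following sketch is illustrative only: `reachable`, `highlight` and the `FEEDS` mapping are hypothetical, and apart from the two names taken from the examples above, the dataset and Project names are invented. The `seen` set keeps the search finite even when a looping input creates a cycle.

```python
def reachable(start, edges):
    """Return every node reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def highlight(node, feeds):
    """Return (upstream, downstream, grayed) node sets for a hovered node."""
    consumes = {}
    for src, dsts in feeds.items():
        for dst in dsts:
            consumes.setdefault(dst, []).append(src)
    downstream = reachable(node, feeds)
    upstream = reachable(node, consumes)
    grayed = (set(feeds) | set(consumes)) - upstream - downstream - {node}
    return upstream, downstream, grayed

# Each item maps to the Projects it feeds into.
FEEDS = {
    "March 2016 Transactions": ["Customer Loyalty-Women Members"],
    "Member List": ["Customer Loyalty-Women Members", "Customer Loyalty-Men Members"],
    "Customer Loyalty-Women Members": ["Loyalty Summary"],
    "Customer Loyalty-Men Members": ["Loyalty Summary"],
}
upstream, downstream, grayed = highlight("Customer Loyalty-Women Members", FEEDS)
print(sorted(upstream), sorted(downstream), sorted(grayed))
```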

Run a Flow manually
There may be times when you want to manually kick off a run of a Flow without having to wait for its scheduled start time. This can be done from the "Actions" drop-down. Select "Run now" and the Flow will be prepared for a run.

Delete a Flow
If you no longer want to keep a saved Flow, you can delete it. This can be done from the "Actions" drop-down. Select "Delete" and you are prompted to confirm your selection. Note that any AnswerSets that were published to the Library as a result of running this Flow will not be deleted when you delete the Flow.


Update a Flow to use the latest Project Versions
Every time an action is taken in your Project—for example adding a Step, removing a Step, re-arranging Steps—a new version of your Project is created. Each version provides an audit trail of the changes you have made to your data during the course of your data prep work. When creating a Project Flow, the Flow is always pinned to the specific Project versions at the time of the Flow's creation. However, you can update a Flow to use the latest version of all Projects. This can be done from the "Actions" drop-down while on the Outputs tab. Select "Update Projects" and you are prompted to confirm your selection. 
Note there are conditions that apply to updating Project versions, and Project versions cannot be updated if any Project in your Flow:

  • adds a new datafile
  • removes an existing datafile
  • replaces a dataset with a different dataset (new versions of the same dataset are permissible)
  • adds a new lens
  • removes a lens (moving a lens to a different Step in a Project is permissible)
  • changes the name of a lens
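
The conditions above all come down to whether the set of datasets or the set of lens names differs between the pinned and the latest Project version. The sketch below shows one way to express that check. It is an assumption, not the product's logic: `blocking_changes` and the version layout are hypothetical, and a rename or replacement simply shows up as one removal plus one addition. Dataset versions are not part of the comparison, so a newer version of the same dataset does not block the update.

```python
def blocking_changes(pinned, latest):
    """List differences that would prevent updating to the latest Project version.

    Each version is a dict with 'datasets' and 'lenses' holding sets of names.
    """
    reasons = []
    for label, key in (("dataset", "datasets"), ("lens", "lenses")):
        reasons += [f"{label} added: {n}" for n in sorted(latest[key] - pinned[key])]
        reasons += [f"{label} removed: {n}" for n in sorted(pinned[key] - latest[key])]
    return reasons

pinned = {"datasets": {"Sales Transactions", "Excel report"}, "lenses": {"Weekly Variance"}}
renamed = {"datasets": {"Sales Transactions", "Excel report"}, "lenses": {"Weekly Variance v2"}}

print(blocking_changes(pinned, pinned))   # prints []
print(blocking_changes(pinned, renamed))
```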

If you want to update only a specific Project's version, instead of all Projects in the Flow, this can be done from the Outputs tab: mouse over the Project for which you want to update the version, then click the blue "Update Project Version" button that displays in the right-hand column.



Glossary of APF Terms

Term | Definition
Chore | A chore is a dataset import or a Project execution. The dataset import chore performs a re-import of your dataset through a data source. The Project execution chore addresses all other tasks required for the Flow, such as publishing an AnswerSet to the Library, export of an AnswerSet, etc.
Flow | A collection of Projects that can be run as a unit. One or more frequency-based schedules can be associated with a Flow, which allows a Flow to run on a recurring basis.
Inputs | Datasets from the Library that are required to run a Flow.
Outputs | The AnswerSets written to the Library generated by the run of a Flow.
Run | The execution of each of the Projects that are required by the Target Project. The run executes all of the Steps from the upstream dependency Projects and then writes the resulting AnswerSet(s) to the Library.
Snapshot | A capture of the configuration settings used to create the output for the runs of a Flow. A new Snapshot is created when the Flow's configuration changes.
Target Project | The Paxata Project from which a Flow is created. Once a Flow is created, all upstream dependencies are automatically calculated by the APF engine.