Column Lineage
Column Lineage enables a lineage mode in the Step panel that allows you to identify Project Steps that affected the selected column.
When lineage mode is enabled, the Step-level transformations that affected the selected column are outlined.
The outlines let you quickly identify the Steps that created the column or changed its data. Steps in the Editor that did not affect the column are grayed out, collapsed, and labeled with the number of Steps that are collapsed.
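One way to picture the collapsing behavior is as a grouping of consecutive Steps. The sketch below is illustrative Python, not Data Prep code: the `collapse` function and its row format are hypothetical, and it assumes that adjacent grayed-out Steps fold into a single labeled group.

```python
from itertools import groupby


def collapse(step_numbers, highlighted):
    """Group a script's Steps into display rows for lineage mode.

    Highlighted Steps stay visible one by one; each run of consecutive
    Steps that did not affect the column becomes one collapsed row
    carrying the number of Steps it hides.
    """
    rows = []
    for is_hit, group in groupby(step_numbers, key=lambda n: n in highlighted):
        group = list(group)
        if is_hit:
            rows.extend(("step", n) for n in group)
        else:
            rows.append(("collapsed", len(group)))
    return rows


# A six-Step script in which only Steps 2 and 5 affected the column.
rows = collapse(range(1, 7), {2, 5})
print(rows)
# [('collapsed', 1), ('step', 2), ('collapsed', 2), ('step', 5), ('collapsed', 1)]
```

Under this assumption, expanding a collapsed row corresponds to replacing it with the individual Steps it stands for.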
Options when working in lineage mode:
Click any grayed-out Step to expand the associated collapsed Steps.
Click the "Show all Steps" link in the orange lineage mode header to expand all collapsed Steps in your script.
Click the (x) button in the lineage mode header to close lineage mode.
Note that lineage mode is automatically closed when you mute a Step in the Steps Editor panel or begin making new transformations in the Project.
Example: a Project has the following six Steps.
1. Import a base dataset for customer contact information that has a column for "int’l cell numbers". In that column, all numbers follow this format: +44-2071838750.
2. Perform a split operation on the dash in the "int’l cell numbers" column to create two new columns: one for the country code and another for the cell number.
3. Rename the first newly created column: "country code".
4. Perform a "find + replace" operation on the "country code" column to remove the preceding (+) character.
5. Rename the second newly created column: "cell number".
6. Use the column tool to hide the original "int’l cell numbers" column.
When you enable lineage mode for the "cell number" column, the second and fifth Steps above are highlighted in the Steps Editor panel because those Steps directly affected the "cell number" column: the second Step is the origin of the column's data, and the fifth Step gives the column its new name. All other Steps are grayed out and collapsed because they did not affect the column.
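The reasoning in this example can be sketched as a backward walk over the script. The Python below is a minimal illustration, not how Data Prep implements lineage. The `Step` class, the `lineage` function, and the placeholder column names "split 1" and "split 2" are all hypothetical. The sketch assumes that a Step affects a column only when it creates, renames, or modifies that column, which is why the import Step is not reported for "cell number".

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    number: int
    description: str
    creates: tuple = ()                          # columns this Step introduces
    renames: dict = field(default_factory=dict)  # old name -> new name
    modifies: tuple = ()                         # columns changed in place


# The six Steps from the example. "split 1" and "split 2" are made-up
# placeholder names for the two columns produced by the split.
STEPS = [
    Step(1, "import base dataset", creates=("int'l cell numbers",)),
    Step(2, "split on dash", creates=("split 1", "split 2")),
    Step(3, "rename", renames={"split 1": "country code"}),
    Step(4, "find + replace '+'", modifies=("country code",)),
    Step(5, "rename", renames={"split 2": "cell number"}),
    Step(6, "hide column", modifies=("int'l cell numbers",)),
]


def lineage(steps, column):
    """Return the numbers of the Steps that affected `column`."""
    current = column
    hits = []
    for step in reversed(steps):
        # If this Step renamed something to the current name, keep
        # following the column under its earlier name.
        old = next((o for o, n in step.renames.items() if n == current), None)
        if old is not None:
            hits.append(step.number)
            current = old
            continue
        if current in step.modifies:
            hits.append(step.number)
        if current in step.creates:
            hits.append(step.number)
            break  # reached the Step where the column originated
    return sorted(hits)


print(lineage(STEPS, "cell number"))  # [2, 5]
```

For "cell number" the sketch returns Steps 2 and 5, matching the Steps highlighted in the example. Results for other columns reflect only the assumptions of this model.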
Note: in addition to lineage mode, a column’s header color indicates the original Data Source for the column’s data. The color of the input Step for the Data Source is used to identify all columns originating from that source. If there is no input Data Source for the column, for example because the column was created by a compute column operation, then the column is color-coded with the Project’s color.
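The color rule amounts to a lookup with a fallback. The short Python sketch below is only an illustration: the `header_color` function, the source name, and the color values are hypothetical and are not taken from the product.

```python
from typing import Dict, Optional


def header_color(
    source: Optional[str],
    input_step_colors: Dict[str, str],
    project_color: str,
) -> str:
    """Pick a column header color.

    A column that originates from a Data Source uses the color of that
    source's input Step; a column with no input Data Source (for example,
    a computed column) uses the Project's color.
    """
    if source is None:
        return project_color
    return input_step_colors[source]


# Hypothetical values for illustration only.
input_step_colors = {"customer contacts": "blue"}
project_color = "gray"

print(header_color("customer contacts", input_step_colors, project_color))  # blue
print(header_color(None, input_step_colors, project_color))                 # gray
```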