The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.
Splitting a column involves taking the values in a single column and distributing them into one or more new columns in the same row. The split is performed either by selecting a character string on which to separate the column or by giving a character count (number of characters) along which the split should occur. There is also a regexoption for more advanced users as well.
In the “Separator” field, enter one or more characters on which the split should occur. This will serve as the pattern against which the match will occur. The character(s) specified will not become part of any of the resulting columns—they will be excluded from the distribution to the new column(s). Note: the text separator value is case-sensitive.
Based on the separator you have chosen for the operation and the number of times that separator exists across the cells of this column, the Min and Max links allow you to quickly select the number of new columns to create after the split. In addition, the Custom link allows you to provide the exact number of columns you want to generate. Adjacent to each field, you can also click the plus and minus buttons to manually add and remove new column fields.
You also have the option to select "Right to left" for the split. By default, the split function parses left to right using the specified separator. The "Right to left" option allows you to parse the text in the column starting on the right. This is particularly useful in cases where you want to separate a string of text, for example a file name, from its directory path. In this case, specify the slash "/" as your Separator, and then select the "Right to left" option.
The “Lengths” field requires one or more numbers, separated by commas. The “Lengths” field is used to determine the number of characters in each of the new columns. Thus, the values “2,3,2” placed in the field will generate one column with the first two characters, create a second column with the subsequent three characters, and create a third column with the next two characters within the original column.
Unlike “Split by Separator,” this type of split does not drop any characters from the middle of the column. However, instead of any unaccounted-for (left-over) characters being placed together in the last column, they are excluded from the new columns completely. To avoid inadvertently truncating a column, it is recommended that the last number specified be large enough to account for the remaining column length.
Note that doing this will not generate a “whitespace” buffer to account if this value is larger than the length of the available character values. This will only ensure that all characters in the column are accounted for in the last column. Extra spaces specified by the “Lengths” parameter that go beyond the number of characters in the column are simply ignored.
Regular Expression (Regex)
If you are familiar withRegular expression (Regex), you can use this option to define a search pattern for locating strings where you want to enforce the split. In the following example, Regex is used to split on alpha characters in the string. The split then results in two new numeric columns:
TheCapture Modeoption allows you to extract strings that match your Regex pattern. In the following example, Regex Capture mode is used to extract only the second set of numbers in the string. The split then results in a new numeric column: