Warning: The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.

Cloudera CDH5 Hive Connector Documentation

User Persona: Paxata Admin - Data Source Admin - IT/DevOps

Availability: This Connector is not available to Paxata SaaS customers.

Note: This document covers all configuration fields available during Connector setup. Some fields may have already been filled out by your Admin at an earlier step of configuration and may not be visible to you. For more information, see the documentation on Paxata’s Connector Framework.

Also: Your Admin may have named this Connector something else in the list of Data Sources.

Configuring Paxata

This Connector allows you to connect to a Cloudera CDH 5.16 Hive instance for import and export. The following fields are used to define the connection parameters. The fields you are required to set up depend on the authentication method you select: Simple, Kerberos, or Hybrid. The authentication type you select applies to all Data Sources that you create based on this Connector configuration.

Note: Configuring this Connector requires file system access on the Paxata Server and a core-site.xml with the Hadoop cluster configuration. Please reach out to your Customer Success representative for assistance with this step.
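For reference, a minimal core-site.xml points Paxata at the cluster NameNode, as sketched below. The hostname and port are placeholders only; the actual file, typically exported from Cloudera Manager, will contain additional cluster-specific properties.

  <configuration>
    <property>
      <name>fs.defaultFS</name>
      <value>hdfs://namenode.example.com:8020</value>
    </property>
  </configuration>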

General

  • Name: Name of the data source as it will appear to users in the UI.
  • Description: Description of the data source as it will appear to users in the UI.

Something to consider: You may connect Paxata to multiple Hive databases, so a descriptive name can be a big help to users in identifying the appropriate data source.

Hadoop Cluster

  • HDFS User: The username on the HDFS cluster used to write files for export to Hive.

Kerberos Configuration

The following parameters are required for Kerberos and Hybrid authentication.

  • Principal: Kerberos Principal.
  • Realm: Kerberos Realm.
  • KDC Hostname: Kerberos Key Distribution Center Hostname.
  • Kerberos Configuration File: Fully-qualified path of the Kerberos configuration file on the web server.
  • Keytab File: Fully-qualified path of the Kerberos Keytab File on the web server.
  • Use Application User: Check this box to read/write as the logged-in application user, or uncheck it to use a proxy user.
  • Proxy User: The proxy user used to authenticate with the cluster. ${user.name} can be entered as the proxy user; it works similarly to selecting Use Application User but allows for more flexibility. For example:
    • To add a domain to the user’s credentials, enter \domain_name\${user.name} in the Proxy User field. Paxata will pass the username and the domain.
      • Example: \Accounts\${user.name} results in \Accounts\Joe (assuming Joe is the username).
    • To apply a text modifier to the username, add .modifier to the key ${user.name}. The acceptable modifiers are: toLower, toUpper, toLowerCase, toUpperCase, and trim.
      • For example, ${user.name.toLowerCase} converts Joe into joe (assuming Joe is the username).

Hive Configuration

  • JDBC URL: The URL used to access Hive for import and registration of external tables. If Kerberos authentication is used, the following string must be added to the URL: ";auth=kerberos;hive.server2.proxy.user=${user.name}" (see the example after this list).
    • If a proxy user is used, the string ${user.name} must be replaced with the proxy username.
  • Hive File Location: The location on the HDFS cluster used to store Hive files for external tables.
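For illustration, a HiveServer2 JDBC URL typically takes the following form; the hostname, port, and database below are placeholders, and your Hive administrator can provide the actual values:

  jdbc:hive2://hiveserver.example.com:10000/default

With Kerberos authentication, the string described above is appended:

  jdbc:hive2://hiveserver.example.com:10000/default;auth=kerberos;hive.server2.proxy.user=${user.name}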

Credentials

  • Hive User: The username used to access Hive for Simple and Hybrid authentication.
  • Hive Password: The password used to access Hive for Simple and Hybrid authentication.

Hive Options

  • Pre-Import SQL: SQL to be executed before the import process. This SQL may execute multiple times (for preview and import) and can consist of multiple newline-delimited SQL statements (see the example at the end of this section).
  • Post-Import SQL: SQL to be executed after the import process. This SQL may execute multiple times (for preview and import) and can consist of multiple newline-delimited SQL statements.

Please Note: Because the Pre- and Post-Import SQL may be executed multiple times throughout the import process, take care when specifying these values in the Connector/Data Source configuration, as they will be executed for every import performed with this configuration.

  • Pre-Export SQL: SQL to be executed before the export process. This SQL will execute once and can consist of multiple newline-delimited SQL statements.
  • Post-Export SQL: SQL to be executed after the export process. This SQL will execute once and can consist of multiple newline-delimited SQL statements.
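As an example, a Pre-Import SQL value might consist of two newline-delimited Hive SET statements; the property values below are placeholders only, so use statements appropriate for your cluster:

  SET mapreduce.job.queuename=paxata
  SET hive.execution.engine=mr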

Data Import Information

Via Browsing

Not Supported

Via SQL Query

Using SQL Select queries
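
For example, a query of the following form imports the result of a Hive SELECT statement; the database, table, and column names below are placeholders:

  SELECT customer_id, order_date, order_total
  FROM sales.orders
  WHERE order_date >= '2019-01-01'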