The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.
User Persona: Paxata Admin - Data Source Admin - IT/DevOps
*Note: This document covers all configuration fields available during Connector setup. Some fields may have already been filled out by your Admin at an earlier step of configuration and may not be visible to you. For more information on Paxata’s Connector Framework, please see here.
Also: Your Admin may have named this Connector something else in the list of Data Sources.
This connector allows you to connect to Amazon Redshift as an import & export source. The following fields are used to define the connection parameters.
Name: Name of the data source as it will appear to users in the UI.
Description: Description of the data source as it will appear to users in the UI.
Something to consider: You may connect Paxata to multiple Redshift data warehouses and having a descriptive name can be a big help to users in identifying the appropriate data source.
JDBC URL: the JDBC connection string; the database name can be included in the URL.
Example JDBC URL: jdbc:redshift://examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439/dev?ssl=true
For more information on connection string options, please see this AWS Documentation.
You can control the database, schemas, and tables that are shown to users when they browse a data source during import. For databases, schemas, and tables, you can choose to:
"Show only" which returns only the databases, schemas or tables that you specify here.
"Hide" which hides the databases, schemas, and tables that you specify here.
"Show all" which is the default setting to display everything in the data source.
When you select the "Show only" or "Hide" options, a field is provided for specifying the databases, schemas or tables on which you want the option enforced.
Important: These settings are not enforced when users query against the data source; query results still return a complete list of matches. For example, if you choose to "hide" a specific database, users can still execute queries that pull data from tables within that database. However, that database will not be displayed to users when they browse the data source.
Pre-import SQL: a SQL statement to execute before beginning import, after the table’s schema is determined.
Post-import SQL: a SQL statement to execute after import completes.
Pre-export SQL: a SQL statement to execute before beginning export, after the table is created if auto-create is enabled.
Post-export SQL: a SQL statement to execute after export completes.
The user authentication can be done through a shared account or an individual account. Depending on your choice, the following fields are required:
User: the username for an individual account used to authenticate to the database.
Password: the password for an individual account used to authenticate to the database.
User: the username for a shared account used to authenticate to the database.
Password: the password for a shared account used to authenticate to the database.
Role: if roles have been implemented for this database, authenticated users with this user role can perform queries after authentication.
Amazon S3 Client Configuration
Export using S3?: This option specifies whether the connector will export to Redshift by uploading data into Amazon S3 and then copying it into Redshift, or by inserting data directly into Redshift.
Export using S3: The connector will upload data into Amazon S3 and copy it into Redshift. This is the recommended approach for larger datasets as it will allow for a more performant export.
Bucket name: An S3 Bucket represents a collection of objects stored in Amazon S3.
Prefix: Limits results to only those keys that begin with the specified prefix.
Socket Timeout Seconds: The number of seconds to wait for a response from an established S3 connection. The default value is 5 minutes and may need to be increased to handle the export of large files.
Export using SQL insert statements: The connector will insert data directly into Redshift. This option will result in slower exports. If you plan to only perform imports from Redshift, you can select this option in order to not have to enter your S3 account details.
Important: The connector requires the s3:ListBucket permission on the bucket. Bucket contents require permissions s3:ListBucket, s3:GetObject, and (for export only) s3:PutObject. In addition, if there is a SourceIP condition block specified in your bucket policy, then you must include the IP addresses for your Paxata server and any servers that are used to run automation jobs.
If you connect to Amazon Redshift through a proxy server, these fields define the proxy details.
Web Proxy: 'None' if no proxy is required or 'Proxied' if connection to the Amazon Redshift REST Endpoint should be made via a proxy server. If a web proxy server is required, the following fields are required to enable a proxied connection.
Proxy host: The host name or IP address of the web proxy server.
Proxy port: The port on the proxy server for Data Source.
Proxy username: The username for the proxy server.
Proxy password: The password for the proxy server. *Leave username & password blank for an unauthenticated proxy connection.
Data Import & Export Information
The browsing experience will vary and is based on the database, schemas, and tables visibility settings selected above along with the user credentials supplied.
Via SQL Query
As noted in the visibility section, the limits to what a user can import via query are only limited to their authorization as determined by the credentials supplied for the connection.