The Data Prep (Paxata) documentation is now available on the DataRobot public documentation site. See the Data Prep section for user documentation and connector information. After the 2021.2 SP1 release, the content on this site will be removed and replaced with a link to the DataRobot public documentation site.
User Persona: Paxata User - Paxata Admin - Data Source Admin
Note: This document covers all configuration fields available during Connector setup. Some fields may have already been filled out by your Admin at an earlier step of configuration and may not be visible to you. For more information, see the documentation on Paxata’s Connector Framework.
Also: Your Admin may have named this Connector something else in the list of Data Sources.
This connector lets you import data from, and export data to, Amazon S3 object storage. The following fields define the connection parameters.
Name: Name of the data source as it will appear to users in the UI.
Description: Description of the data source as it will appear to users in the UI.
Something to consider: You may connect Paxata to multiple S3 buckets, and a descriptive name helps users identify the appropriate data source. If you are a Paxata SaaS customer, please inform Paxata DevOps how you would like this set.
Amazon S3 Client Configuration
Bucket name: An S3 bucket represents a collection of objects stored in Amazon S3. The connector requires the following permissions: s3:ListBucket, s3:GetObject, and (for export only) s3:PutObject. In addition, if there is a SourceIP condition block specified in your bucket policy, then you must include the IP addresses for your Main Core Server and Automation Core Server (if you have one). See AWS S3 Bucket Permission/Policy Details at the bottom of this article for more details.
Prefix: Limits results to only those keys that begin with the specified prefix.
Encryption type: Server-side encryption type to be used. See AWS Encryption Types for more information.
Bucket region: Specifies the region in which the S3 bucket is hosted, or lets the connector determine the region automatically.
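To illustrate how the Prefix field narrows what is visible, here is a minimal Python sketch. It only models the "begins with" rule described above on a hypothetical list of keys; the function name and the sample keys are assumptions for illustration and are not part of the connector.

```python
def keys_under_prefix(keys, prefix):
    """Return only the keys that begin with `prefix` (case-sensitive)."""
    return [key for key in keys if key.startswith(prefix)]


# Hypothetical object keys in a bucket
sample_keys = [
    "sales/2021/january.csv",
    "sales/2021/february.csv",
    "marketing/leads.csv",
]

# With the prefix "sales/", the marketing file is not listed
visible = keys_under_prefix(sample_keys, "sales/")
print(visible)
```

An empty prefix matches every key, which corresponds to leaving the results unrestricted.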
Amazon S3 Authentication
These options specify how to authenticate with S3.
AWS Credentials: The Access Key ID and Secret Key associated with the user’s AWS Access Key. This is the default setting. See AWS Security Credentials for more details.
IAM Cross Account: Enables access to S3 by assuming a role in another AWS account that has access to the configured S3 bucket. See Cross Account Access for more details.
Important: For the Instance Profile (IAM Role) and IAM Cross Account options, Paxata must be installed on your Amazon EC2 hosts.
If you connect to Amazon S3 through a proxy server, these fields define the proxy details.
Web Proxy: 'None' if no proxy is required or 'Proxied' if the connection to the Amazon S3 REST Endpoint should be made via a proxy server. If a web proxy server is required, the following fields are required to enable a proxied connection.
Proxy host: The host name or IP address of the web proxy server.
Proxy port: The port number of the web proxy server.
Proxy username: The username for the proxy server.
Proxy password: The password for the proxy server. Leave the username and password blank for an unauthenticated proxy connection.
Socket Timeout Seconds: The number of seconds to wait for a response from Amazon S3 on an established connection. The default value is 300 seconds (5 minutes). To handle the export of large files, increase the value.
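The rules for the proxy and timeout fields above can be sketched as a small pre-flight check in Python. This is only an illustration under stated assumptions: the dictionary keys (web_proxy, proxy_host, proxy_port, socket_timeout_seconds) and the function name are hypothetical and do not reflect how the connector stores its settings.

```python
DEFAULT_SOCKET_TIMEOUT_SECONDS = 5 * 60  # the 5-minute default, in seconds


def check_connection_settings(settings):
    """Return a list of problems found in a hypothetical settings dict."""
    problems = []
    mode = settings.get("web_proxy", "None")
    if mode not in ("None", "Proxied"):
        problems.append("web_proxy must be 'None' or 'Proxied'")
    elif mode == "Proxied":
        # Host and port are needed for a proxied connection;
        # username and password may stay blank (unauthenticated proxy).
        if not settings.get("proxy_host"):
            problems.append("proxy_host is required for a proxied connection")
        port = settings.get("proxy_port")
        if not isinstance(port, int) or not 1 <= port <= 65535:
            problems.append("proxy_port must be an integer from 1 to 65535")
    timeout = settings.get("socket_timeout_seconds", DEFAULT_SOCKET_TIMEOUT_SECONDS)
    if not isinstance(timeout, int) or timeout <= 0:
        problems.append("socket_timeout_seconds must be a positive integer")
    return problems


# A proxied connection with a longer timeout for large exports
ok = check_connection_settings({
    "web_proxy": "Proxied",
    "proxy_host": "proxy.example.com",  # placeholder host
    "proxy_port": 8080,
    "socket_timeout_seconds": 900,
})

# 'Proxied' selected, but host and port left out
missing = check_connection_settings({"web_proxy": "Proxied"})
print(ok, missing)
```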
Data Import & Export Information
The Connector will present a browsable directory hierarchy starting at the location defined in the Prefix field.
The Connector also supports wildcard and glob importing, which enables users to import multiple S3 data files into Paxata as a single Dataset.
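As a rough illustration of what a glob pattern selects, the following Python sketch matches hypothetical object keys with the standard library's fnmatch module. The connector's own pattern syntax may differ from fnmatch (for example, in how "*" treats "/"), so treat this as an approximation; the keys and the pattern are made up for the example.

```python
from fnmatch import fnmatchcase


def match_keys(keys, pattern):
    """Return the keys that match a shell-style pattern (case-sensitive)."""
    return [key for key in keys if fnmatchcase(key, pattern)]


# Hypothetical object keys
keys = [
    "logs/2021-01.csv",
    "logs/2021-02.csv",
    "logs/readme.txt",
    "archive/2020-12.csv",
]

# Select every 2021 CSV file under logs/ so they can be treated as one set
matched = match_keys(keys, "logs/2021-*.csv")
print(matched)
```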
Via SQL Query
As S3 is a file store, SQL Queries are not supported for this data source. If you would like to directly query AWS S3 data, please reach out to your Customer Success contact regarding Paxata’s AWS Athena Connector.
AWS S3 Bucket Permission/Policy Details
This section reviews the permissions that must be assigned in your S3 bucket policy and what you are required to do if you have a SourceIP condition block specified in your bucket policy.
The AWS S3 connector requires specific permissions in your S3 bucket policy to ensure that you can successfully import data from S3, publish to S3, and automate importing from an S3 source. In summary:
The connector requires the s3:ListBucket permission on the bucket for browsing.
For importing the bucket contents, Paxata requires the s3:GetObject permission.
For exporting to the bucket, Paxata requires the s3:PutObject permission.
Sample bucket policy:
The minimum policy permissions for reading from an S3 bucket are:
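As a hedged sketch, a read-only bucket policy built from the two permissions named in the summary (s3:ListBucket on the bucket, s3:GetObject on its objects) could be generated in Python as follows. The bucket name, account ID, and user ARN are placeholders, and the function name is hypothetical; adapt the Principal to whichever identity the connector authenticates as.

```python
import json


def read_only_bucket_policy(bucket, principal_arn):
    """Build a bucket policy allowing browse (list) and import (get)."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself
                "Sid": "AllowBrowse",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:ListBucket",
                "Resource": bucket_arn,
            },
            {
                # Reading applies to the objects inside the bucket
                "Sid": "AllowImport",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


# Placeholder bucket and principal, for illustration only
read_policy = read_only_bucket_policy(
    "example-bucket", "arn:aws:iam::123456789012:user/example-user"
)
print(json.dumps(read_policy, indent=2))
```

Note that s3:ListBucket is granted on the bucket ARN while s3:GetObject is granted on the object ARN (the bucket ARN followed by /*).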
The minimum policy permissions for writing to an S3 bucket are:
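For exporting, a comparable sketch grants s3:PutObject on the bucket's objects. Including s3:ListBucket here is an assumption based on the summary above, which says the connector needs it for browsing; the bucket name, principal ARN, and helper names are placeholders for illustration.

```python
import json


def allow_statement(sid, action, resource, principal_arn):
    """Build a single Allow statement for a bucket policy."""
    return {
        "Sid": sid,
        "Effect": "Allow",
        "Principal": {"AWS": principal_arn},
        "Action": action,
        "Resource": resource,
    }


def export_bucket_policy(bucket, principal_arn):
    """Build a bucket policy allowing browse (list) and export (put)."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Assumption: browsing is still needed to pick an export location
            allow_statement("AllowBrowse", "s3:ListBucket", bucket_arn, principal_arn),
            # Writing applies to the objects inside the bucket
            allow_statement("AllowExport", "s3:PutObject", f"{bucket_arn}/*", principal_arn),
        ],
    }


# Placeholder bucket and principal, for illustration only
write_policy = export_bucket_policy(
    "example-bucket", "arn:aws:iam::123456789012:user/example-user"
)
print(json.dumps(write_policy, indent=2))
```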
If there is a SourceIP condition block specified in your bucket policy, then you must include the IP addresses of your Paxata cloud servers or Paxata Core Server (depending on your Paxata deployment) in the SourceIP Condition block. In addition, if you have a dedicated Paxata server for automation, you must also include the automation server IP addresses in the SourceIP Condition block.
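The following Python sketch shows one way a SourceIP condition might be attached to each statement of a policy, using the aws:SourceIp condition key with the IpAddress operator. The IP addresses are documentation-only placeholders, not real Paxata server addresses, and the function name is hypothetical. Your existing policy may express the restriction differently (for example, as a Deny statement with NotIpAddress), in which case the same addresses belong in that block instead.

```python
import copy
import ipaddress


def restrict_to_source_ips(policy, addresses):
    """Return a copy of `policy` whose statements only apply to `addresses`.

    Each address is validated and normalized to CIDR notation. Any existing
    Condition on a statement is replaced in this simplified sketch.
    """
    cidrs = [str(ipaddress.ip_network(address)) for address in addresses]
    restricted = copy.deepcopy(policy)
    for statement in restricted["Statement"]:
        statement["Condition"] = {"IpAddress": {"aws:SourceIp": cidrs}}
    return restricted


base_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:user/example-user"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}

# Placeholders standing in for the core server and the automation server
restricted_policy = restrict_to_source_ips(
    base_policy, ["192.0.2.10", "198.51.100.0/24"]
)
print(restricted_policy["Statement"][0]["Condition"])
```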
Please consult with Paxata's Customer Success team to obtain the list of IP addresses for Paxata cloud servers.