Cross-account AWS Glue Data Catalog access with Glue ETL

AWS Glue ETL processes data as either a DataFrame or a DynamicFrame. A DataFrame is similar to a table and supports functional-style operations (map, reduce, filter, and so on) along with SQL operations. The AWS Glue DynamicFrame is similar to a DataFrame, except that each record is self-describing, so no schema is required up front. Glue computes a schema on the fly when one is needed, and explicitly encodes schema inconsistencies using a choice (or union) type.

A DynamicFrame can be created using any of the following GlueContext methods:

  • create_dynamic_frame_from_rdd: created from an Apache Spark Resilient Distributed Dataset (RDD)
  • create_dynamic_frame_from_catalog: created using a Glue catalog database and table name
  • create_dynamic_frame_from_options: created with the specified connection type and format, where the connection type is, for example, Amazon S3, Amazon Redshift, or JDBC
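
As a rough illustration of the three options above, the sketch below wraps each call in a small helper. It assumes `glue_context` is a GlueContext that already exists inside a Glue job; the helper names, the frame name "heroes_from_rdd", the S3 path, and the JSON format are hypothetical placeholders, not values from this post.

```python
# Minimal sketch: one helper per DynamicFrame creation option.
# `glue_context` is assumed to be an awsglue.context.GlueContext.


def frame_from_rdd(glue_context, rdd):
    # Build a DynamicFrame from an existing Spark RDD.
    # "heroes_from_rdd" is a hypothetical name for the new frame.
    return glue_context.create_dynamic_frame_from_rdd(rdd, "heroes_from_rdd")


def frame_from_catalog(glue_context):
    # Build a DynamicFrame from a database and table in the Glue catalog
    # (same-account here; the cross-account case is covered later).
    return glue_context.create_dynamic_frame_from_catalog(
        database="marvel",
        table_name="marvel_superheroes",
    )


def frame_from_options(glue_context, s3_path):
    # Build a DynamicFrame directly from a connection type and format.
    # The S3 path and the JSON format are assumptions for illustration.
    return glue_context.create_dynamic_frame_from_options(
        connection_type="s3",
        connection_options={"paths": [s3_path]},
        format="json",
    )
```

Inside a job, these would be called as, for example, `frame_from_options(glue_context, "s3://example-bucket/heroes/")`, where the bucket name is again only a placeholder.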

This post walks through the steps needed to access a cross-account AWS Glue catalog and create DynamicFrames using the create_dynamic_frame_from_catalog option.

Account A — AWS Glue ETL execution account.
Account B — Data stored in S3 and cataloged in AWS Glue.

1. In Account A

  • Create an IAM role in Account A that can access the catalog in Account B, and attach it to the Glue ETL job. If the job already exists, create a new policy and attach it to the role the job already uses.
  • The policy should grant access to the “marvel” database and all the tables within that database in the AWS Glue catalog of Account B.
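
A policy along these lines could be sketched as follows. The snippet builds the policy document as a Python dictionary and prints it as JSON, ready to paste into the IAM console. The account ID `111122223333`, the region `us-east-1`, the helper name, and the exact list of read actions are assumptions for illustration; adjust them to the real Account B and to what the job actually needs.

```python
import json

ACCOUNT_B_ID = "111122223333"  # hypothetical placeholder for Account B
REGION = "us-east-1"           # assumed region of the Account B catalog


def glue_catalog_read_policy(account_id, region, database):
    """Identity policy for the Account A role: read one database and its tables."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Read-only catalog actions; an assumed minimal set.
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetPartition",
                    "glue:GetPartitions",
                ],
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/{database}",
                    f"{prefix}:table/{database}/*",  # all tables in the database
                ],
            }
        ],
    }


policy = glue_catalog_read_policy(ACCOUNT_B_ID, REGION, "marvel")
print(json.dumps(policy, indent=2))
```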

2. In Account B

  • On the AWS Glue console, under Settings, add a policy for the Glue Data Catalog that grants table and database access to the IAM role from Account A created in step 1.
  • Apply a bucket policy to the S3 bucket that holds the data, granting access to the role created in step 1.
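
The two Account B policies above could look roughly like the documents generated below. This is only a sketch: the role ARN, the account ID, the region, the bucket name `marvel-data-bucket`, and the helper names are hypothetical, and the action lists are an assumed read-only minimum.

```python
import json

# Hypothetical placeholders; replace with the real values.
ACCOUNT_A_ROLE_ARN = "arn:aws:iam::444455556666:role/glue-etl-role"
ACCOUNT_B_ID = "111122223333"
REGION = "us-east-1"
BUCKET = "marvel-data-bucket"


def catalog_resource_policy(role_arn, account_id, region, database):
    """Glue Data Catalog resource policy letting the Account A role read one database."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": [role_arn]},
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetPartition",
                    "glue:GetPartitions",
                ],
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/{database}",
                    f"{prefix}:table/{database}/*",
                ],
            }
        ],
    }


def bucket_read_policy(role_arn, bucket):
    """S3 bucket policy letting the Account A role list and read objects."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": [role_arn]},
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",    # needed for ListBucket
                    f"arn:aws:s3:::{bucket}/*",  # needed for GetObject
                ],
            }
        ],
    }


print(json.dumps(
    catalog_resource_policy(ACCOUNT_A_ROLE_ARN, ACCOUNT_B_ID, REGION, "marvel"),
    indent=2,
))
print(json.dumps(bucket_read_policy(ACCOUNT_A_ROLE_ARN, BUCKET), indent=2))
```

Because S3 access is cross-account as well, the role in Account A would typically also need matching S3 read permissions in its own identity policy, in addition to this bucket policy.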

3. In Account A

  • Create the DynamicFrame using the “from_catalog” option in the Glue ETL job in Account A, reading the data through the Account B catalog.
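
A minimal sketch of that read is shown below. The account ID `111122223333` is a hypothetical placeholder for Account B, and the helper name `read_marvel_superheroes` is made up for this example. The `try`/`except` around the imports is only there so the snippet can also be loaded outside a Glue job, where the `awsglue` library is not available.

```python
try:
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext
except ImportError:  # not running inside an AWS Glue job
    GlueContext = None

# Hypothetical Account B ID. Kept as a string: account IDs are 12 digits
# and may start with 0, which a number would not preserve.
ACCOUNT_B_ID = "111122223333"


def read_marvel_superheroes(glue_context, catalog_id=ACCOUNT_B_ID):
    """Read marvel.marvel_superheroes from the catalog owned by catalog_id."""
    return glue_context.create_dynamic_frame.from_catalog(
        database="marvel",
        table_name="marvel_superheroes",
        catalog_id=catalog_id,  # the account that owns the catalog
        transformation_ctx="datasource0",
    )


if GlueContext is not None:
    # Inside the Glue job in Account A.
    glue_context = GlueContext(SparkContext.getOrCreate())
    datasource0 = read_marvel_superheroes(glue_context)
    datasource0.printSchema()
```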

Here, datasource0 is the DynamicFrame created by reading the data from the “marvel_superheroes” table under the “marvel” database in another AWS account, identified by the “catalog_id” parameter. The value for catalog_id must be passed as a quoted string, not as a number.

To conclude, DynamicFrames in AWS Glue ETL can be created by reading data through a cross-account Glue catalog, provided the IAM permissions and resource policies are correctly defined on both sides.

Avid learner of technology solutions around databases, big-data, Machine Learning. 5x AWS Certified | 5x Oracle Certified. Connect on Twitter @anandp86
