Image by Hebi B. from Pixabay

Data transformation is an important aspect of Data Engineering and can be a challenging task depending on the dataset and the transformation requirements. A bug in data transformation can have a severe impact on the final data set generated leading to data issues. In this blog I am going to share my experience of having missing values in Pandas DataFrame, handling these missing values in Pandas and converting the Pandas DataFrame to Spark DataFrame.

To give a quick background, I was writing a data transformation (ETL) job in AWS Glue using PySpark which was to be executed every 15mins. The…


Image by GraphicMama-team from Pixabay

AWS Glue is a serverless ETL service to process large amount of datasets from various sources for analytics and data processing. Recently I came across “CSV data source does not support map data type” error for a newly created glue job. In a nutshell, the job was performing below steps:

  1. Read the data from S3 using create_dynamic_frame_from_options
  2. Perform some required transformations
  3. Write the transformed data to Amazon Redshift using write_dynamic_frame_from_jdbc_conf

And it was during this write step that the glue job was failing. Lets look into it in little more details -

datasource0 = glueContext.create_dynamic_frame_from_options( connection_type="s3", connection_options = { "paths"…


Image by Peggy und Marco Lachmann-Anke from Pixabay

In this blog post I will discuss following scenarios to connect to databases from AWS Lambda function:

  • Connecting to Amazon Aurora PostgreSQL database in private subnet with public accessibility set to No in same AWS account.
  • Connecting to cross account Amazon Redshift database in public subnet with public accessibility set to Yes.

Connect to Amazon Aurora PostgreSQL database in Private subnet with Public accessibility set to No in the same AWS account

In this setup, Amazon Aurora PostgreSQL database is running in private subnet with public accessibility set to No. The connectivity and security detail are as follows:


Given that you have a partitioned table in AWS Glue Data Catalog, there are few ways in which you can update the Glue Data Catalog with the newly created partitions.

  1. Run MSCK REPAIR TABLE <database>.<table_name> in AWS Athena service.
  2. Rerun the AWS Glue crawler .

Recently, AWS Glue service team has added a new feature (or say parameter for Glue job) using which you can immediately view the newly created partitions in Glue Data Catalog.

To demo this, I will pre-create an empty partitioned table using Amazon Athena Service with target location to S3. I have another S3 location which…


AWS Glue is a fully managed extract, transform, and load (ETL) service to process large amount of datasets from various sources for analytics and data processing. When you add a AWS Glue job, you can choose the job to be either Spark or Spark Streaming or Python shell type.

For one of my use-case I wanted to try the new “ redshift-data “ api service in AWS Glue Python Shell script. The Amazon Redshift Data API can be used to run SQL queries on Amazon Redshift tables. To test the redshift-data API I wrote a simple AWS Glue Python Shell…


Every time with AWS re:Invent around, AWS releases many new features over a period of month. In this blog post I will touch on 3 new features which were introduced for Amazon DynamoDB. DynamoDB is a non-relational managed database with single digit millisecond performance at any scale.

New Features in Amazon DynamoDB -

  1. PartiQL — SQL-compatible query language for Amazon DynamoDB.
  2. Export to S3 — Export Amazon DynamoDB table to S3. In this blog I have added a use-case of deserializing the DynamoDB items, writing it to S3 and query using Athena.
  3. Direct integration of DynamoDB with Kinesis Streams —…


I will admit, AWS Data Wrangler has become my go to package for developing extract, transform, and load (ETL) data pipelines and other day-to-day scripts. AWS Data Wrangler integration with multiple big data AWS services like S3, Glue Catalog, Athena, Databases, EMR, and others makes life simple for engineers. It also provides the ability to import packages like Pandas and PyArrow to help writing transformations.

In this blog post I will walk you through a hypothetical use-case to read data from glue catalog table and obtain filter value to retrieve data from redshift. I would create glue connection with redshift…


In this blog I will walk you through the way timestamp is stored in Parquet file version 1.0 and 2.0, how the timestamp column data is displayed in Athena for each version and how you can cast the timestamp column in Athena to view timestamp for version 2.0 Parquet files.

Using AWS Glue crawler, I crawled few parquet files stored in S3 created by RDS Snapshot to S3 feature. After the crawler completed and added the new table, I used AWS Athena to query the data and the timestamp column displayed the data as below -

To dive deep and…


AWS CloudTrail service captures actions taken by an IAM user, IAM role, APIs, SDKs and other AWS services. By default, AWS CloudTrail is enabled in your AWS account. You can create “trail” to record ongoing events which will be delivered in JSON format to an Amazon S3 Bucket of your choice.


Introduction

In this post, I have penned down AWS Glue and PySpark functionalities which can be helpful when thinking of creating AWS pipeline and writing AWS Glue PySpark scripts.

AWS Glue is a fully managed extract, transform, and load (ETL) service to process large amounts of datasets from various sources for analytics and data processing.

While creating the AWS Glue job, you can select between Spark, Spark Streaming, and Python shell. These jobs can run a proposed script generated by AWS Glue, or an existing script that you provide or a new script authored by you. …

Anand Prakash

Avid learner of technology solutions around databases, big-data, Machine Learning. 5x AWS Certified | 5x Oracle Certified. Connect on Twitter @anandp86

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store