I have worked as a consultant with several companies that needed help with large Hadoop-to-cloud migrations. Here are a few of the lessons I've learned over the years.

One of the first things to do is identify the Hadoop cluster use cases. By this I mean identify the users of the cluster, what they do with it, and how.

We also want to interview the cluster admins, cluster tenants (application engineers, data scientists, etc.), support staff, and any other team or individual that can provide insight.

We even want to talk to people who don't use the cluster, to find out why.

These interviews should be short, non-intrusive, and positive. Together they build a picture of where we are and where we want to go.

We want to accommodate and improve all existing use cases, and also find new ones, since the cloud gives us access to tools and services we didn't necessarily have before.

The biggest thing engineers miss is the greatest thing of all... you now have access to hundreds of tools and services versus the handful your Hadoop distribution provides. This opens the door to building robust products with state-of-the-art user experiences.

Knowing these things will aid in building the migration plan.

On-Prem Hadoop Use Cases

On-Prem Access Patterns

Now we need to look at the migration options. There are basically three ways to do this.

Migration Types

  1. Lift-and-shift: replicate the Hadoop cluster in the cloud, keeping the Hadoop distribution
  2. Conversion: migrate to cloud services that overlap with the on-prem stack
  3. Hybrid: use the information collected from the interviews to decide which products to convert and which to lift-and-shift. Convert products with significant pain points, products still in development, and non-critical products to like services (Hive Metastore → Glue Data Catalog, HBase → DynamoDB, etc.); lift-and-shift the rest.

Now that we have a view of the playing field we can start looking at an example migration.

Example Migration

The customer needed to migrate to the cloud while continuing to support all on-prem use cases. The customer selected Amazon Web Services (AWS) as the cloud provider and wanted to do a conversion migration.

We simply need to map the use cases and access patterns above to AWS services.

Service Selection

HDFS to S3

Data storage on HDFS with Spark/PySpark for compute covers a large portion of our use cases, so we start here. We chose S3 to replace HDFS. Since S3 is an object store and not a file system like HDFS, there are differences in how data is read and written to be aware of.

S3 does not have directories or partitions; everything is an object in the bucket. What look like directories are really just prefixes on the object key. Because all objects sit at the same level, execution engines may end up scanning many objects in the bucket to resolve a "directory". AWS has been addressing this with service enhancements, plugins, and so on, so while there are differences, the cost and elasticity of S3 made it a clear choice.
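
A quick way to internalize the prefix model is to think of "directory listing" as key filtering. This is an illustrative sketch only (made-up keys, no real S3 calls), assuming Hive-style `year=` partition naming:

```python
# Illustrative only: simulates how S3 "directories" are just key prefixes.
# The bucket contents below are made up for the example.
keys = [
    "sales/year=2023/part-0000.parquet",
    "sales/year=2023/part-0001.parquet",
    "sales/year=2024/part-0000.parquet",
    "inventory/part-0000.parquet",
]

def list_prefix(keys, prefix):
    """Return the keys starting with `prefix`, the way a ListObjects
    call with a Prefix parameter narrows the flat key namespace."""
    return [k for k in keys if k.startswith(prefix)]

# "sales/year=2024/" is not a real directory; it just filters the key scan.
print(list_prefix(keys, "sales/year=2024/"))
```

This is why layouts with selective prefixes matter on S3: the prefix is the only structure the store gives you.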

Spark to EMR

We chose EMR to replace the Apache Spark services that lived on the Hadoop cluster. We evaluated Glue but didn't want to migrate to the Glue APIs; we preferred to stay close to the native Spark Python APIs because of our deep Python skill set. The Glue APIs are good and make development easier, so if you are new to Spark it might be worth learning how to use them efficiently. There is an added cost to the Glue API features, so you have to weigh the convenience against the cost. Since we already had a large Spark code base, it didn't make sense to migrate to Glue.

Hive to Lake Formation

We decided to use Lake Formation with role-based access controls (RBAC) as the replacement for Hive/Metastore. Lake Formation feels very familiar to Hive users: you have a catalog of databases and tables, with access controlled by roles. We also get the benefit of row- and column-level access controls. Lake Formation is far more advanced than the catalog you used on-prem, which is good... and sometimes not so good, so make sure you know what you are doing and test access controls anytime access changes. Unlike on-prem, there is probably not an expensive firewall with a team of cybersecurity analysts watching out for you.
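
To make the column-level controls concrete, here is a minimal sketch of a table grant in the shape boto3's Lake Formation `grant_permissions` call expects. The account ID, role ARN, and database/table/column names are placeholders, not our customer's actual setup:

```python
def build_column_grant(principal_arn, database, table, columns):
    """Build kwargs for lakeformation.grant_permissions that grant a role
    SELECT on specific columns -- the column-level controls mentioned above.
    All identifiers passed in are assumed to be placeholders."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }

grant = build_column_grant(
    "arn:aws:iam::123456789012:role/analyst",  # placeholder role
    "sales_db", "orders", ["order_id", "total"],
)
# import boto3
# boto3.client("lakeformation").grant_permissions(**grant)
```

Wrapping grants in a builder like this also makes them easy to review and re-test whenever access changes, which is the habit the paragraph above argues for.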

HBase to DynamoDB

Another large percentage of use cases was our HBase cluster, which supported random-seek use cases for our API teams. We chose DynamoDB to replace HBase. HBase was slightly faster than DynamoDB, but we had years of experience and a highly tuned HBase cluster, so it was not an entirely fair comparison.

DynamoDB offered us a managed NoSQL service, so it was an easy decision. The slight performance loss with DynamoDB did not outweigh its ease of use and the personnel needed to run our own HBase cluster. The total cost of ownership (TCO) was very much on DynamoDB's side.

Migration Process

HDFS to S3

We designed a process that would hook into our existing jobs that load data to Hadoop and load the same data to S3 while it was still in memory. So instead of our batch jobs taking data from a DB2 server and writing it to HDFS via PySpark JDBC jobs, we would write the data to both HDFS and S3 at the same time. This guaranteed that the same data was available on both platforms.
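
The dual-write hook can be sketched like this. The paths, JDBC details, and the helper name are illustrative; the real jobs had retries and auditing around this step:

```python
def write_both(df, hdfs_path, s3_path, fmt="parquet", mode="overwrite"):
    """Write the same in-memory DataFrame to HDFS and S3 so both
    platforms receive identical data from a single source read."""
    for path in (hdfs_path, s3_path):
        df.write.format(fmt).mode(mode).save(path)

# Usage inside a PySpark job (sketch; URLs and table names are placeholders):
# df = (spark.read.format("jdbc")
#       .option("url", db2_url).option("dbtable", "SCHEMA.ORDERS").load())
# write_both(df, "hdfs://namenode:8020/data/orders", "s3a://my-bucket/data/orders")
```

Because the DataFrame is read once and written twice, there is no window where a second extract could pull different rows from the source.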

For data that did not fit the process above, we created a program around the native DistCp (distributed copy) utility provided by the Hadoop platform. There are other AWS and third-party tools and services to handle the data copy, but DistCp was free, already available, and we were very familiar with it, having used it for years.

Hadoop Distributed Copy Utility (DistCp)

DistCp is a native Hadoop utility that copies large datasets intra- or inter-cluster in parallel. Parallelism is controlled by the number of mappers (the `-m` argument). So if a directory in HDFS contains 800 files and you specify 800 mappers, the tool will copy all 800 files in parallel.

DistCp can also compare source and target files to perform CDC-style checks. With its update and delete arguments it will mirror the source and target directories, copying any new or updated files to S3, then deleting any files that are in S3 but not in HDFS.
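
As a sketch, the mirror invocation we wrapped looked roughly like the command built below. The namenode host and bucket are placeholders; `-update`, `-delete`, and `-m` are standard DistCp arguments:

```python
def distcp_mirror_cmd(source, target, num_maps):
    """Build the DistCp command line that mirrors an HDFS directory to S3:
    -m sets the mapper (parallelism) count, -update copies new/changed
    files, and -delete removes target files that no longer exist at the
    source."""
    return [
        "hadoop", "distcp",
        "-update", "-delete",
        "-m", str(num_maps),
        source, target,
    ]

cmd = distcp_mirror_cmd(
    "hdfs://namenode:8020/data/sales",  # placeholder cluster path
    "s3a://my-bucket/data/sales",       # placeholder bucket
    800,
)
# import subprocess
# subprocess.run(cmd, check=True)  # on an edge node with S3 credentials configured
```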

The benefit of this approach is that we made data available in S3 quickly, versus taking the time to analyze, design, and build a historical load process. It also had the added benefit of delivering data months ahead of schedule, which impressed the project management team and our manager, which translates into a better bonus!

Hive Databases to Lake Formation Databases

In Hive, all of our tables were external, meaning the tables are just overlays on the data in HDFS. With Hive external tables we do not load data into a database. We found this concept a bit difficult for native database users to grasp.

So migrating our databases and tables was easy, since there was no database to load.

Now that the data was published to S3, we created Python programs to extract the database and table schemas from our Hive databases, make minor edits for Lake Formation compatibility, and execute them in Lake Formation.
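
The translation was mostly mechanical. Here is a simplified sketch of mapping an extracted Hive column list onto the `TableInput` shape that Glue's `create_table` expects for an external Parquet table; the table name, columns, and bucket are placeholders, and partitions, serde details, and table properties are omitted:

```python
def to_table_input(table_name, columns, s3_location):
    """Map a (name, hive_type) column list onto a Glue TableInput dict
    for an external Parquet table. Simplified sketch: partition keys,
    serde parameters, and table properties are left out."""
    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [{"Name": name, "Type": typ} for name, typ in columns],
            "Location": s3_location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
        },
    }

tbl = to_table_input(
    "orders",
    [("order_id", "bigint"), ("total", "double")],
    "s3://my-bucket/data/orders/",  # placeholder location
)
# import boto3
# boto3.client("glue").create_table(DatabaseName="sales_db", TableInput=tbl)
```

Keeping this mapping in code is what let us store the schemas in a repository and keep the tight control mentioned below.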

The Glue Crawler can be considered here, but we like to store our schemas in a schema repository under tight control. We found that when using the Glue Crawler we ended up editing the schemas anyway, so we skipped that step and handled the schemas ourselves. That said, Glue Crawlers are recommended and work well for most use cases.

HBase to DynamoDB

There are a few ways to go here. There are tools that can take the native HBase HFiles and convert and load the data into DynamoDB. You can also extract HBase data into CSV files and use the DynamoDB bulk loader.

Both of those work well but since we wanted to use the migration as an opportunity to modify our NoSQL table design and architecture we went a different way.

We created a PySpark job to read the data from HBase, translate the records into the new architecture in memory, and save them into DynamoDB tables with the Spark DynamoDB data source.
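
The in-memory translation was essentially a row-level function applied with Spark. A hedged sketch follows: the `customer#date` rowkey convention, the column layout, and the `pk`/`sk` key design are hypothetical stand-ins, not our customer's actual tables:

```python
def to_dynamo_item(rowkey, cells):
    """Translate one HBase row (rowkey plus {family:qualifier: value}
    cells) into a DynamoDB item with an explicit partition/sort key.
    The customer#date rowkey convention here is a made-up example."""
    customer_id, order_date = rowkey.split("#", 1)
    item = {"pk": f"CUST#{customer_id}", "sk": f"ORDER#{order_date}"}
    for column, value in cells.items():
        _family, qualifier = column.split(":", 1)
        item[qualifier] = value
    return item

item = to_dynamo_item("c42#2023-07-01", {"d:total": "19.99", "d:status": "SHIPPED"})
# In the PySpark job this ran per record, e.g.:
# items = hbase_rdd.map(lambda r: to_dynamo_item(r.rowkey, r.cells))
```

Keeping the translation in a plain function made it trivial to unit test outside Spark, which is part of why it became reusable.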

It took a little more time to write and test the code, but it was worth it: this program became a reusable component in our enterprise Python libraries, and other business units benefited when it came time for them to migrate.

Sometimes you need to take the time hit and build something reusable, if the benefits are known.

Be careful, though: I've seen many projects fail when teams become obsessed with making every line of code reusable and covering all possible future use cases instead of building to specifications and delivering the software. Reusability is great, but it's not greater than delivering working software on time and on budget!

Code Migration

Code migration was easier than expected. We had been using Apache Spark for years, and most of our ETL, streaming, and analytics jobs were written in PySpark. Our execution layer was an enterprise scheduler named Autosys.

We created a migration program that applied a list of translations that needed to happen. Things like directory paths (we kept the S3 directory structure as close to HDFS as possible), hostnames, etc. were cleaned up.
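
The translation program was essentially a set of ordered string rewrites applied across the code base. A minimal sketch, where the namenode host, database hostname, and bucket are placeholders:

```python
import re

# Ordered (pattern, replacement) rules; all hostnames and the bucket
# name are placeholders for this sketch.
RULES = [
    (re.compile(r"hdfs://namenode\.corp:8020"), "s3a://my-data-lake"),
    (re.compile(r"\bdb2-prod\.corp\b"), "db2-prod.internal.example.com"),
]

def translate(source):
    """Apply every rewrite rule, in order, to one source file's text."""
    for pattern, replacement in RULES:
        source = pattern.sub(replacement, source)
    return source

line = 'df.write.parquet("hdfs://namenode.corp:8020/data/sales")'
print(translate(line))  # df.write.parquet("s3a://my-data-lake/data/sales")
```

Each time a job failed for a new reason, we added a rule here and re-ran the translation, which is the feedback loop described below.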

We deployed the code from our Git repos to an EFS file system so that it would be available to all of our Lambda functions, notebooks, and EMR clusters. We then started running our jobs to see which ones failed and why, and fed those fixes back into the translation program mentioned above.

There were minimal changes to our PySpark code and many of our jobs completed successfully.


We were heavily reliant on the Autosys enterprise scheduler, so we converted many of those jobs to Step Functions flows.

We also added an architecture where all business units subscribed to EventBridge events and started jobs and/or notified support. This gave the business units a level of transparency they didn't have before.
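
Conceptually, each subscription was just an EventBridge-style pattern over job events. The toy matcher below illustrates the idea with a simplified, flat version of EventBridge's pattern matching (real patterns nest under `detail`); the event fields and values are made up:

```python
def matches(pattern, event):
    """Minimal flat version of EventBridge pattern matching: every field
    named in the pattern must appear in the event with one of the allowed
    values. Real EventBridge patterns are richer (nesting, wildcards)."""
    return all(event.get(field) in allowed for field, allowed in pattern.items())

# Hypothetical subscription and event shapes for illustration.
subscription = {"detail-type": ["JOB_STEP_COMPLETED"], "job": ["daily_sales_load"]}
event = {"detail-type": "JOB_STEP_COMPLETED", "job": "daily_sales_load", "step": "publish_s3"}
# matches(subscription, event) -> True, so this subscriber gets notified
```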

We implemented a UI where business users could see and subscribe to not only any job but any step in any job. With the subscription they could choose email, text, or Teams channel notifications, which was convenient.

We later replaced the event-subscription UI with a chatbot that handles the subscriptions and many other use cases, which I'll discuss later.

This article will keep evolving so keep watching for more information coming soon...

Related Articles

Multi-tenant Hadoop Cluster