In this new series of blog posts called Use Cases, I'm going to go over specific projects we've worked on with clients: the need they addressed, how the project went, and the lessons we took away from the experience. I won't go into client-specific details, but I'll cover the important bits. Hopefully this will be helpful if your organization is considering a similar project. This week, I'll talk about a project where we migrated a large data lake from an on-premises Hadoop infrastructure to AWS.
Before any project can start, the business needs have to be analyzed and a design put together to decide what the solution will be. In this particular case, we weren't involved in the initial decision process, but the solution architect came up with a solid design. The company had around 30 TB of data sitting in an on-premises data lake. Data engineers and developers used Hadoop clusters to run their ETL pipelines, while the data itself lived on a myriad of Windows, Linux and AIX systems, with many scripts handling the processing. The situation wasn't terrible, but having data scattered around was definitely a problem, so the new solution had to be more centralized. The on-premises clusters also caused problems: they were of fixed size, so a lot of resources sat idle as needs varied from day to day. Finally, there was a desire to use certain cloud services that couldn't easily be integrated with the on-premises setup.
The new design saw this data lake moved and completely replaced by a combination of AWS S3, Databricks clusters and RDS databases. We were brought in to help implement this new solution. The first thing we did was create a number of Terraform templates to provision the base environment. These templates live in GitHub Enterprise and are deployed to the AWS cloud. Every resource gets tagged so that cost and usage can easily be tracked. S3 buckets serve as the new data lake, into which all raw data is ingested. This centralizes the data, and any new files land in this single location. Databricks clusters replace Hadoop for running the ETL pipelines; the benefit is that these clusters are cloud based and can be scaled up or down on demand, greatly reducing costs. Finally, processed data is stored in RDS databases. All of the pipelines responsible for moving data, processing it and deploying jobs, along with the monitoring, were built in Jenkins so they could be deployed easily and repeatably.
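To make the shape of these new pipelines a bit more concrete, here's a minimal sketch of what one such job could look like as a PySpark job on a Databricks cluster: read raw files from the S3 data lake, transform them, and write the processed result to an RDS database over JDBC. The bucket name, prefix, table name, host and credentials below are illustrative placeholders (not the client's actual values), and it assumes a PostgreSQL-flavored RDS instance with the JDBC driver available on the cluster.

```python
# Minimal sketch of one ETL job in the new architecture.
# Bucket, table and connection details are placeholders, not real values.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("raw-to-rds-example").getOrCreate()

# Read raw files from the central S3 data lake.
raw = spark.read.json("s3a://example-data-lake/raw/orders/2020/01/")

# Apply whatever transformations the pipeline needs.
processed = (
    raw.filter(F.col("status") == "completed")
       .withColumn("processed_at", F.current_timestamp())
)

# Write the processed result to an RDS (PostgreSQL) database over JDBC.
(processed.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://example-rds-host:5432/analytics")
    .option("dbtable", "orders_completed")
    .option("user", "etl_user")
    .option("password", "REDACTED")  # in practice, pull this from a secret store
    .mode("append")
    .save())
```

A job like this would then be packaged and scheduled through the Jenkins pipelines mentioned above, so that deployments and monitoring stay automated end to end.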
By moving to a cloud solution like this, the client ended up with a much more scalable, cost-efficient and transparent setup. All of their data now lives in a central, designated location, and all of the processing is automated, which makes it far less prone to mistakes. That isn't to say the entire project went off flawlessly. One of the challenges we encountered was moving such a massive number of files. Even with cloud-native tools like the AWS CLI, moving millions of files between buckets can be slow and costly, since every object has to be copied individually. So it's important to design your environment properly from the start to avoid needless operations later on. Another important thing is keeping track of features and their costs. For example, the difference in cost between tagging a bucket and tagging every individual object inside it can be massive when you're talking about billions of individual tags, as the sketch below illustrates. Lastly, leveraging automation tools like Terraform and Jenkins is critical when you're dealing with an environment this large and business critical.
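Here's a small boto3 sketch of that tagging trade-off, using placeholder bucket and tag names. The bucket-level call is a single request that covers the whole bucket for cost-allocation purposes, while the per-object loop is shown only to make the cost concrete: with millions of objects it turns into millions of API calls and billed tag sets, which is exactly what you want to avoid.

```python
# Illustration of bucket-level vs. per-object tagging with boto3.
# Bucket name, prefix and tags are placeholders.
import boto3

s3 = boto3.client("s3")

# One call tags the bucket itself: a single, essentially free operation
# that's enough for cost and usage tracking in most cases.
s3.put_bucket_tagging(
    Bucket="example-data-lake",
    Tagging={"TagSet": [{"Key": "project", "Value": "data-lake-migration"}]},
)

# Tagging objects individually means one request (and one billed tag set)
# per object -- shown here only to illustrate how quickly this adds up.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="example-data-lake", Prefix="raw/"):
    for obj in page.get("Contents", []):
        s3.put_object_tagging(
            Bucket="example-data-lake",
            Key=obj["Key"],
            Tagging={"TagSet": [{"Key": "project", "Value": "data-lake-migration"}]},
        )
```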
Overall, a use case like this is fairly standard, and we see organizations moving more and more workloads to the cloud for the reasons stated above. Still, there are a lot of things to keep in mind: cost projections, backups, performance monitoring, automation, disaster recovery and more. Having a consultant who is experienced with this type of project can often be the difference between success and failure.