Migrating and Scaling Data Pipelines on AWS

By Kunal Agarwal, co-founder and CEO, Unravel Data.

Monday, 25th May 2020

When an enterprise moves environments, whether that’s on-premises to the cloud, between clouds, or between data systems, there’s a set of common questions that come up time and again.


The questions centre around old favourites: How much will this cost us? Which instance types should we run on Amazon for these workloads? Which apps are actually best suited to this particular cloud environment or infrastructure we’re moving to? And how do we know if the migration has been successful?


In most organisations, all of these questions are answered by a very technical process called ‘guesstimating’. When moving to a new environment like Amazon, the decision is usually based on just a few factors, such as the size of the data to be processed; the decision-maker then picks an instance type and a quantity of that instance type.
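
To make the guesstimate concrete, a back-of-the-envelope sizing of this kind might look like the sketch below; the data volume, instance memory, and overhead multiplier are illustrative assumptions, not recommendations.

```python
import math

# A naive "guesstimate" sizing: every number below is an illustrative assumption.
daily_data_gb = 500            # assumed daily volume to process
instance_memory_gb = 64        # assumed memory of the chosen instance type
overhead_multiplier = 3        # rough allowance for shuffle, caching, headroom

instances = math.ceil(daily_data_gb * overhead_multiplier / instance_memory_gb)
print(f"Guesstimated cluster size: {instances} instances")
# Ignores CPU, I/O, concurrency, burstiness, and cost - exactly the gap a more
# scientific, workload-aware approach needs to close.
```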


That technique doesn’t offer the accuracy in cost and performance that an effective, high-performing cloud move really needs. The right strategy is to build a scientific, surgical understanding of the best instance type, the right environment, and how big that environment should be on any given day of the week. One can then use auto-scaling to scale up and down and run workloads in an optimised, cost-effective fashion.
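
As a hedged illustration of what that scaling up and down can look like on AWS, the sketch below attaches a managed scaling policy to an existing EMR cluster with boto3; the cluster ID, region, and capacity limits are placeholder assumptions, not recommendations.

```python
import boto3

# Sketch: attach a managed scaling policy to an existing EMR cluster.
# The cluster ID, region, and capacity limits are placeholder assumptions.
emr = boto3.client("emr", region_name="us-east-1")

emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",           # hypothetical cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 3,     # quiet days
            "MaximumCapacityUnits": 30,    # bursty days
        }
    },
)
```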


What the business really needs is performance management, cost reduction, cluster optimisation, and a carefully controlled migration. These are the ingredients of a successful migration, taken in order.


First and foremost: The current environment

The current environment may be on-premises, or it may be with another provider. Understanding what you are currently running is key, so the business can start to figure out whether it has compatibility and whether it will be able to scale in the new environment. It’s crucial to gather this data from the systems you are currently running: which services you are actually leveraging in your current environment, how many apps you have, how many of those applications are Spark versus Kafka, Hive, or Hadoop, for example, and which users actually access those particular applications.
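
On a Hadoop-based cluster, one place this inventory can come from is the YARN ResourceManager REST API. The sketch below groups applications by type and by user; the ResourceManager host is a placeholder assumption, and a full inventory would also need to cover systems YARN doesn’t track, such as Kafka.

```python
from collections import Counter
import requests

# Sketch: inventory applications on a Hadoop/YARN cluster via the
# ResourceManager REST API. The host below is a placeholder assumption.
RM = "http://resourcemanager.example.com:8088"

apps = requests.get(f"{RM}/ws/v1/cluster/apps", timeout=30).json()["apps"]["app"]

by_type = Counter(app["applicationType"] for app in apps)   # e.g. SPARK, TEZ, MAPREDUCE
by_user = Counter(app["user"] for app in apps)

print("Applications by type:", dict(by_type))
print("Applications by user:", dict(by_user))
```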


Secondly, look at your resource information

One needs to look at how much CPU, memory, containers, and so on are currently being used. All of this is baseline information for understanding the shape of the workload and what the cost in the new environment will start to look like.
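
The same ResourceManager exposes cluster-wide counters that can seed that baseline. The sketch below reads memory, vcore, and container usage at a point in time; the host is again a placeholder assumption, and a real baseline would sample these figures over days or weeks to capture daily and weekly patterns.

```python
import requests

# Sketch: capture a point-in-time resource baseline from YARN's cluster metrics.
# The ResourceManager host is a placeholder assumption; in practice you would
# sample this endpoint repeatedly to see the workload's shape over time.
RM = "http://resourcemanager.example.com:8088"

m = requests.get(f"{RM}/ws/v1/cluster/metrics", timeout=30).json()["clusterMetrics"]

print(f"Memory     : {m['allocatedMB']} / {m['totalMB']} MB allocated")
print(f"VCores     : {m['allocatedVirtualCores']} / {m['totalVirtualCores']} allocated")
print(f"Containers : {m['containersAllocated']} running")
```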


Once you have all that data collected (and that requires an application performance management (APM) solution), you must understand which workloads are best suited for the cloud. For example, you might have a consistently running environment and be trying to get to an auto-scaling, dynamic ecosystem. Some applications will be very bursty in nature, processing 50GB of data one day and 500GB another, and so requiring very different resources from run to run. It’s best to have the power to pick and choose, and only migrate those types of workloads that are best suited for the cloud.
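
One simple, hedged way to flag such bursty candidates is to score each application by how much its daily data volume varies relative to its average, the coefficient of variation; the sample volumes and the 0.5 threshold below are illustrative assumptions.

```python
from statistics import mean, stdev

# Sketch: flag bursty applications by the coefficient of variation of their
# daily processed data volume. The sample data and the 0.5 threshold are
# illustrative assumptions, not measurements or recommendations.
daily_gb_per_app = {
    "nightly-etl":      [48, 52, 50, 49, 51, 50, 47],     # steady
    "campaign-scoring": [50, 500, 40, 620, 55, 480, 60],  # bursty
}

for app, volumes in daily_gb_per_app.items():
    cv = stdev(volumes) / mean(volumes)   # coefficient of variation
    verdict = "bursty: good auto-scaling candidate" if cv > 0.5 else "steady"
    print(f"{app:18s} cv={cv:.2f} -> {verdict}")
```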


It’s important to work smart in such a complex environment, as it will save on costs. As an example, if you have fifteen business groups on the cluster today, you may want to move only the marketing department to the cloud. Having the ability to slice and dice that data could show that you don’t need to move the whole five petabytes. By seeing all the apps the marketing department touches and all the datasets they access, you might decide to move only particular workloads to the cloud and save costs. It’s a more complex calculation, but a more mature and sophisticated approach.
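
As a rough sketch of that slice-and-dice step, the snippet below takes a hypothetical inventory of applications, their owning business group, and the datasets they read, and totals only the data the marketing group would actually need moved; every name and size is invented for illustration.

```python
# Sketch: estimate how much data one business group actually needs migrated.
# The application inventory and dataset sizes below are entirely hypothetical.
apps = [
    {"name": "campaign-scoring", "group": "marketing", "datasets": {"clickstream", "crm"}},
    {"name": "attribution",      "group": "marketing", "datasets": {"clickstream", "ad-spend"}},
    {"name": "fraud-detection",  "group": "risk",      "datasets": {"transactions"}},
]
dataset_size_tb = {"clickstream": 120, "crm": 8, "ad-spend": 2, "transactions": 300}

marketing_datasets = set().union(
    *(a["datasets"] for a in apps if a["group"] == "marketing")
)
to_move_tb = sum(dataset_size_tb[d] for d in marketing_datasets)
print(f"Marketing touches {sorted(marketing_datasets)}; ~{to_move_tb} TB to migrate")
```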


As mentioned, an APM solution simplifies this complexity. It’s too great a challenge to expect the data engineering team to discover everything happening on the cluster by hand and to instantly surface rich information about the applications, datasets, resources used, and all the users in your system. Yet once you have all of this data, you can start to intelligently plan and migrate workloads to the cloud based on cost and performance requirements.


Once you’re in the cloud, the choices really pile up

Once you’re in the cloud, there is a plethora of different services available. With care and planning you can end up with a very sophisticated pipeline, but with multiple distributed systems powering mission-critical applications a lot of things can go wrong: an application fails, or your IoT pipeline lags. That’s when the operations team or developers need to get involved to figure out what is happening and how to solve the problem. It’s not uncommon for a pipeline that was running well one week to suddenly start missing SLAs the next.
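
One hedged way to catch that SLA drift early is to alarm on an end-to-end pipeline-duration metric. The sketch below assumes the pipeline already publishes a custom PipelineDurationSeconds metric to CloudWatch; the namespace, dimension, threshold, and SNS topic are all illustrative assumptions.

```python
import boto3

# Sketch: alarm when an end-to-end pipeline run exceeds its SLA.
# Assumes the pipeline publishes a custom "PipelineDurationSeconds" metric;
# the namespace, dimension, threshold, and SNS topic are illustrative.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="iot-pipeline-sla-breach",
    Namespace="Custom/DataPipelines",                 # assumed custom namespace
    MetricName="PipelineDurationSeconds",
    Dimensions=[{"Name": "Pipeline", "Value": "iot-ingest"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=3600,                                   # one-hour SLA (assumption)
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="breaching",                     # a run that never reports is also a problem
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:data-ops-alerts"],  # placeholder ARN
)
```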


So instead of having to piece together three different CloudWatch dashboards, logs, and metrics pipelines, teams need to be able to come quickly to a definitive answer and implement a solution.
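
For illustration, the sketch below pulls two related signals into a single CloudWatch GetMetricData query rather than reading them off separate dashboards; the Kinesis stream name and the custom pipeline metric (from the earlier sketch) are assumptions about what this particular pipeline emits.

```python
from datetime import datetime, timedelta, timezone
import boto3

# Sketch: pull related signals into one query instead of several dashboards.
# The Kinesis stream name and the custom pipeline metric are assumptions.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.now(timezone.utc)

result = cloudwatch.get_metric_data(
    StartTime=now - timedelta(hours=6),
    EndTime=now,
    MetricDataQueries=[
        {
            "Id": "lag",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Kinesis",
                    "MetricName": "GetRecords.IteratorAgeMilliseconds",
                    "Dimensions": [{"Name": "StreamName", "Value": "iot-events"}],
                },
                "Period": 300,
                "Stat": "Maximum",
            },
        },
        {
            "Id": "duration",
            "MetricStat": {
                "Metric": {
                    "Namespace": "Custom/DataPipelines",
                    "MetricName": "PipelineDurationSeconds",
                    "Dimensions": [{"Name": "Pipeline", "Value": "iot-ingest"}],
                },
                "Period": 300,
                "Stat": "Maximum",
            },
        },
    ],
)

for series in result["MetricDataResults"]:
    print(series["Id"], list(zip(series["Timestamps"], series["Values"]))[:3])
```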


It’s crucial for a high-performing team to make use of an APM so they have that single place where data is brought in and visualised. Such a tool needs to take all of this data, apply AI and machine learning algorithms, and ensure that operations teams, architects, and developers spend their time not on troubleshooting solvable but time-consuming problems, but on building applications and making them reliable and trouble-free.
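
As a toy illustration of the kind of analysis such tooling automates, the sketch below flags an application run whose duration deviates sharply from its own history using a simple z-score; real products use far richer models, and the durations here are invented.

```python
from statistics import mean, stdev

# Toy sketch: flag a run whose duration deviates sharply from its own history.
# Real APM tools use far richer models; these durations are invented.
history_minutes = [42, 45, 44, 41, 43, 46, 44]   # past run durations
latest_minutes = 95                              # today's run

z = (latest_minutes - mean(history_minutes)) / stdev(history_minutes)
if abs(z) > 3:
    print(f"Anomalous run: {latest_minutes} min (z-score {z:.1f}) - investigate before it breaks SLA")
```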