Orchestrating an ML Workflow with Step Functions and EMR

0:00 / 0:00

As we have seen on previous posts (Launching an EMR cluster using Lambda functions to run PySpark scripts, part 1 and 2), EMR is a very popular big data platform, designed to simplify running Big Data frameworks such as Hadoop and Spark. Taking advantage of this platform allows us to scale up and down our compute capacity, leverage cheap spot instances and integrate seamlessly with our Data Lake. When we combine EMR with Lambdas, we can launch our clusters programmatically using serverless FaaS, and react to new data becoming available, or triggering on a cron schedule.

In this post, we propose to take this approach of using Lambdas to trigger EMR clusters one step further, and use AWS Step Functions to orchestrate our ML process. AWS Step Functions is a fully-managed service that makes it easy to build and run multi-step workflows. It allows us to define a series of steps that make up our workflow, and then Step Functions executes them in order. These steps can be run sequentially, in parallel, wait for previous stages to finish, or run conditionally if other workflows fail. The image below shows an example of a Step Function. It is a series of steps, where we can see a Start, Steps, Decisions, Alternatives, and an End. This is ideal for real-life workflows, where different decisions need to be made as the process progresses.

‍

‍

An interesting advantage of using Step Functions for Machine Learning workflows (instead of a single workflow, for example) is that we can spin up independent clusters for each step. Different stages of a Machine Learning pipeline can have different memory requirements and processing power. Being able to easily spin up a cluster for each step allows us to use the right size of instances for each stage of the workflow, thus being more efficient with our time and spend.

‍

Some other advantages to this approach are the seamless integration of StepFunctions with other services, such as Lambda, that allow us to build a complex workflow without having to orchestrate it manually, with easy out-of-the-box monitoring and error handling that StepFunctions provide – simplifying our retry logic and overall process monitoring.

So, hoping I have convinced you to try this new approach, let’s get hands on!

‍

Spinning up our Step Function

The generation of the Step Functions can be done in two different ways: using the Step Functions visual editor, or leveraging CloudFormation. When I began this post, I wanted to use CloudFormation, but I ran into a CloudFormation limit (CloudFormation doesn’t allow us to create StepFunctions with a definition as long as I needed it to). This is why I will use the Step Functions visual editor, combined with Amazon States Language.

From the AWS website:

The Amazon States Language is a JSON-based, structured language used to define your state machine, a collection of states, that can do work (Task states), determine which states to transition to next (Choice states), stop an execution with an error (Fail states), and so on.

Our ML Workflow

We will define a workflow made up of 3 steps: data preprocessing, model training and model evaluation.

A set of examples, coded in Python using PySpark, and step by step to launch our first Step Function!

For this, we will need to create five things:

An S3 bucket where you will store your scripts: preprocessing, training and evaluation, which contain the logic for each of those steps:

‍

preprocessing_script.py

‍

training_script.by

‍

evaluation_script.py

‍

An SNS topic where we’ll send our errors, if they happen
A ServiceRole for EMR
– I used the EMR_DefaultRole, that AWS provides for us
An IAM Role for our EMR clusters (I called it blog-emr-job-role)
– This will need the following trust policy:

‍

– And the following permissions:

‍

And an IAM Role for our Step Function
– With the following trust policy:

‍

And the following permissions:
‍

‍

Once we have all of these elements ready, we’ll need to replace them in our StepFunction definition, which I share below.

At a high level, we define three steps for each stage of our workflow: cluster creation, code execution and cluster termination. We also define a “catch-all” step that will alert via SNS if there is an error on any of the code execution steps. Something interesting to highlight is that StepFunctions have a predefined step type (createCluster, addStep and terminateCluster) for each of the steps we need, which reduces significantly the amount of code we need to write for our orchestration.:

‍

Some interesting aspects we can see – for example:

makes reference to the step above:

‍

which allows us to create the cluster in one step, obtain its id and then use it in another step.

If we wanted to do more complex things, the idea is the same – we have to reference it in the result path or output path. More details about this here

Going back to our stack creation – once we’ve replaced the values, we can author a new Step Function. You can use this link or navigate to the StepFunctions console, select “Create State Machine” and select the option “Write your workflow in code”.

Both options will open the screen we see below, that includes a basic “Hello World” application. You will replace the code in the definition, and refresh the graph.

‍

Once you’ve refreshed the graph, you should see the following:

‍

‍

The next steps are to name your Step Function, and assign it a role (the one we had created above would be ideal) – and create it with the button “Create State Machine”. Allow some minutes for it to be completed, and you’re ready to go!

You’ll find a button to “Start Execution” – which will trigger your steps. Once it starts executing, the flowchart will update with the different statuses – blue for steps in progress, green for successful steps, orange for caught errors, and red for unexpected errors.

It’s important to highlight that the EMR clusters will be available via the usual EMR console – which simplifies the process monitoring for people who are familiar with it. The fact that it was triggered via StepFunctions is completely transparent to the end users.

Going back to our Step Function and error handling, in the image below we can see two different errors. The step “TrainingStep” caught an error (which is part of the expected workflow), but the step “HandleErrors” failed. This allows us to easily find what failed and where.

‍

‍

On this other image, we can see that the workflow failed, but the error handling succeeded.

‍

‍

This is important to highlight because caught errors are part of our workflow and we have prepared for them (e.g. with alerting). Unexpected errors might leave our work in an incomplete state that requires manual intervention.

‍

Conclusions and next steps

Well, thanks for making it this far and I hope you’re excited to try this out!

We can see that spinning up a set of EMR clusters is quite easy with Step Functions, and it allows us to leverage the power or EMR in a simple and straightforward way. As a final recommendation, I’d like to highlight some of the best practices for using these two services together:

Separate cluster creation and job execution
Parametrize your steps
Handle failure scenarios gracefully
Include log destinations
Optimize cluster sizing
Be careful with security and access control

With all of this in mind, onwards to your own experiments! And don’t hesitate to write with any issues or questions – happy to help!

‍

Happy coding! 🙂

‍

Stay ahead of the curve on the latest trends and insights in big data, machine learning and artificial intelligence. Don't miss out and subscribe to our newsletter!

‍

‍

An interesting advantage of using Step Functions for Machine Learning workflows (instead of a single workflow, for example) is that we can spin up independent clusters for each step. Different stages of a Machine Learning pipeline can have different memory requirements and processing power. Being able to easily spin up a cluster for each step allows us to use the right size of instances for each stage of the workflow, thus being more efficient with our time and spend.

‍

So, hoping I have convinced you to try this new approach, let’s get hands on!

‍

Spinning up our Step Function

From the AWS website:

Our ML Workflow

We will define a workflow made up of 3 steps: data preprocessing, model training and model evaluation.

A set of examples, coded in Python using PySpark, and step by step to launch our first Step Function!

For this, we will need to create five things:

An S3 bucket where you will store your scripts: preprocessing, training and evaluation, which contain the logic for each of those steps:

‍

preprocessing_script.py

‍

training_script.by

‍

evaluation_script.py

‍

An SNS topic where we’ll send our errors, if they happen
A ServiceRole for EMR
– I used the EMR_DefaultRole, that AWS provides for us
An IAM Role for our EMR clusters (I called it blog-emr-job-role)
– This will need the following trust policy:

‍

– And the following permissions:

‍

And an IAM Role for our Step Function
– With the following trust policy:

‍

And the following permissions:
‍

‍

Once we have all of these elements ready, we’ll need to replace them in our StepFunction definition, which I share below.

‍

Some interesting aspects we can see – for example:

makes reference to the step above:

‍

which allows us to create the cluster in one step, obtain its id and then use it in another step.

If we wanted to do more complex things, the idea is the same – we have to reference it in the result path or output path. More details about this here

Both options will open the screen we see below, that includes a basic “Hello World” application. You will replace the code in the definition, and refresh the graph.

‍

Once you’ve refreshed the graph, you should see the following:

‍

‍

‍

On this other image, we can see that the workflow failed, but the error handling succeeded.

‍

‍

Conclusions and next steps

Well, thanks for making it this far and I hope you’re excited to try this out!

Separate cluster creation and job execution
Parametrize your steps
Handle failure scenarios gracefully
Include log destinations
Optimize cluster sizing
Be careful with security and access control

With all of this in mind, onwards to your own experiments! And don’t hesitate to write with any issues or questions – happy to help!

‍

Happy coding! 🙂

‍

Stay ahead of the curve on the latest trends and insights in big data, machine learning and artificial intelligence. Don't miss out and subscribe to our newsletter!

‍

Thank you! The file will start to download shortly

Oops! Something went wrong while submitting the form.

Orchestrating an ML Workflow with Step Functions and EMR

Spinning up our Step Function

Our ML Workflow

Conclusions and next steps

Stay ahead of the curve on the latest trends and insights in big data, machine learning and artificial intelligence. Don't miss out and subscribe to our newsletter!

Download youre-book today!

Spinning up our Step Function

Our ML Workflow

Conclusions and next steps

Stay ahead of the curve on the latest trends and insights in big data, machine learning and artificial intelligence. Don't miss out and subscribe to our newsletter!

Related Articles

Related Articles

AI Transformation Challenge

AI Transformation Challenge

Download your
e-book today!