Building a Big Data Processing Pipeline – Chapter 4

Juan Irabedra
.
September 1, 2022
Building a Big Data Processing Pipeline – Chapter 4

First steps towards building a real-time big data processing pipeline with Apache Kafka and Apache Flink with Scala: Cassandra with Docker

In the previous article we focused on getting started with Apache Flink. In order to run our app we do also need some Cassandra instance running. We will set up a Cassandra local cluster with Docker and learn how to query the messages Flink sent. We will also show our pipeline in action.

Remember the pipeline looks like the following:

Setting up a local Cassandra server with Docker

Before getting our Flink app running, we should have a Cassandra table to store our values. Flink offers many connectors! You can check them out here. In this example we chose to show the Apache Cassandra in action.

There are many ways to host Cassandra: a local server, a local containerized server, managed on AWS Keyspaces and so on. One of the simplest ways is to get it running on a Docker container. If you do not have Docker, find out how to get it in the official documentation.  

Open a new terminal tab and run:

no-line-numbers|light|text docker run -p 9042:9042 –rm –name cassandra -d cassandra:3.11

Your Cassandra instance is now running on port 9042.

The easiest way to communicate with your Cassandra server is through cqlsh. CQL stands for Cassandra Query Language (remember Cassandra is not SQL). SH Stands for shell. You can download this tool here.

Once it is installed, the following commands will connect to your Cassandra server, create a Keyspace and a table with a single column: payload.

no-line-numbers|light|bash cqlsh

no-line-numbers|light|sql CREATE KEYSPACE cassandraSink WITH REPLICATION = { ‘class’ : ‘SimpleStrategy’, ‘replication_factor’ : ‘1’ };

USE cassandraSink;

no-line-numbers|light|sql CREATE TABLE IF NOT EXISTS messages ( payload text PRIMARY KEY);

You can get all the records living in the table by querying:

no-line-numbers|light|sql SELECT * FROM messages;

Building and submitting the jar

Now that our code is set up, it is time for the sbt assembly plugin to shine. We will now run the Scala project with the Apache Flink dependency we built in the previous article. Open up a console in your Scala project root folder (where we placed our build.sbt file). Run the command sbt assembly.

Congrats! Your jar file should be on its way. You should be able to find it soon on the /target/scala-2.12 folder.

Open up a new terminal tab on your Flink source directory. To get our Flink app running:

no-line-numbers|light|bash ./bin/flink run [path to your .jar file]

And that is it. Your Flink app should be now running.

Demo

We will now post a value to our SpringBoot API. Assuming the API is now running locally:

no-line-numbers|light|bash curl -X POST -H “Content-Type: application/json” -d ‘{“topic”: “flink-input”, “key” : “someKey”, “value” : “hello, Flink!”}’ “http://localhost:8080/api/messages/”

Let us query our Kafka topic. Open up a terminal tab in your Kafka source directory.

no-line-numbers|light|bash bin/kafka-console-consumer.sh –bootstrap-server localhost:9092 –topic flink-input –from-beginning

We can see:

no-line-numbers|light|bash juanirabedra@MontevideosMBP5 kafka_2.13-3.1.0 % bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic flick-input --from-beginning hello, Flink!

Now, let us query our Cassandra table:

And that is it. Our real time big data processing pipeline is working as expected. Ideally, this pipeline should be hosted in the cloud. This task will be covered in a future article.

Wrapping up

The Apache Software Foundation is working really hard on providing us with amazing open-source big data tools. Apache Kafka is really easy to get started with and has an interesting learning curve. One of the most outstanding things about it is that it provides many connectors for both pulling and pushing records.

Apache Flink provides a customizable environment to develop big data processing applications with many connectors as well. Connectors for Java and Scala using the latest DataSource API are widely available. Also, Flink becomes a really powerful alternative for unbounded stream processing.

This pipeline could be used to feed data to many other applications. Once the data is stored in Apache Cassandra, it could be used by any other application to make decisions (for example a machine learning model).

It is important to consider how robust this pipeline is. Apache Kafka, Apache Flink and Apache Cassandra are built upon distributed architectures. They are resilient and, thanks to Flink, offer an exactly once processing policy, which guarantees both efficiency and consistency.

All these three technologies can easily migrate from on premise setups to fully cloud managed setups. We can say that scalability is another interesting aspect of this pipeline.

Following this last idea, we consider that showing a cloud setup for this pipeline could be an interesting challenge, and could be a great idea for upcoming articles. Stay tuned to get to know more about top-notch practices and technology to make the most out of your data!

Missed any of our previous articles on this pipeline? Check them here:

First steps towards building a real-time big data processing pipeline with Apache Kafka and Apache Flink with Scala: Cassandra with Docker

In the previous article we focused on getting started with Apache Flink. In order to run our app we do also need some Cassandra instance running. We will set up a Cassandra local cluster with Docker and learn how to query the messages Flink sent. We will also show our pipeline in action.

Remember the pipeline looks like the following:

Setting up a local Cassandra server with Docker

Before getting our Flink app running, we should have a Cassandra table to store our values. Flink offers many connectors! You can check them out here. In this example we chose to show the Apache Cassandra in action.

There are many ways to host Cassandra: a local server, a local containerized server, managed on AWS Keyspaces and so on. One of the simplest ways is to get it running on a Docker container. If you do not have Docker, find out how to get it in the official documentation.  

Open a new terminal tab and run:

no-line-numbers|light|text docker run -p 9042:9042 –rm –name cassandra -d cassandra:3.11

Your Cassandra instance is now running on port 9042.

The easiest way to communicate with your Cassandra server is through cqlsh. CQL stands for Cassandra Query Language (remember Cassandra is not SQL). SH Stands for shell. You can download this tool here.

Once it is installed, the following commands will connect to your Cassandra server, create a Keyspace and a table with a single column: payload.

no-line-numbers|light|bash cqlsh

no-line-numbers|light|sql CREATE KEYSPACE cassandraSink WITH REPLICATION = { ‘class’ : ‘SimpleStrategy’, ‘replication_factor’ : ‘1’ };

USE cassandraSink;

no-line-numbers|light|sql CREATE TABLE IF NOT EXISTS messages ( payload text PRIMARY KEY);

You can get all the records living in the table by querying:

no-line-numbers|light|sql SELECT * FROM messages;

Building and submitting the jar

Now that our code is set up, it is time for the sbt assembly plugin to shine. We will now run the Scala project with the Apache Flink dependency we built in the previous article. Open up a console in your Scala project root folder (where we placed our build.sbt file). Run the command sbt assembly.

Congrats! Your jar file should be on its way. You should be able to find it soon on the /target/scala-2.12 folder.

Open up a new terminal tab on your Flink source directory. To get our Flink app running:

no-line-numbers|light|bash ./bin/flink run [path to your .jar file]

And that is it. Your Flink app should be now running.

Demo

We will now post a value to our SpringBoot API. Assuming the API is now running locally:

no-line-numbers|light|bash curl -X POST -H “Content-Type: application/json” -d ‘{“topic”: “flink-input”, “key” : “someKey”, “value” : “hello, Flink!”}’ “http://localhost:8080/api/messages/”

Let us query our Kafka topic. Open up a terminal tab in your Kafka source directory.

no-line-numbers|light|bash bin/kafka-console-consumer.sh –bootstrap-server localhost:9092 –topic flink-input –from-beginning

We can see:

no-line-numbers|light|bash juanirabedra@MontevideosMBP5 kafka_2.13-3.1.0 % bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic flick-input --from-beginning hello, Flink!

Now, let us query our Cassandra table:

And that is it. Our real time big data processing pipeline is working as expected. Ideally, this pipeline should be hosted in the cloud. This task will be covered in a future article.

Wrapping up

The Apache Software Foundation is working really hard on providing us with amazing open-source big data tools. Apache Kafka is really easy to get started with and has an interesting learning curve. One of the most outstanding things about it is that it provides many connectors for both pulling and pushing records.

Apache Flink provides a customizable environment to develop big data processing applications with many connectors as well. Connectors for Java and Scala using the latest DataSource API are widely available. Also, Flink becomes a really powerful alternative for unbounded stream processing.

This pipeline could be used to feed data to many other applications. Once the data is stored in Apache Cassandra, it could be used by any other application to make decisions (for example a machine learning model).

It is important to consider how robust this pipeline is. Apache Kafka, Apache Flink and Apache Cassandra are built upon distributed architectures. They are resilient and, thanks to Flink, offer an exactly once processing policy, which guarantees both efficiency and consistency.

All these three technologies can easily migrate from on premise setups to fully cloud managed setups. We can say that scalability is another interesting aspect of this pipeline.

Following this last idea, we consider that showing a cloud setup for this pipeline could be an interesting challenge, and could be a great idea for upcoming articles. Stay tuned to get to know more about top-notch practices and technology to make the most out of your data!

Missed any of our previous articles on this pipeline? Check them here:

Download your e-book today!