Building a Big Data Processing Pipeline – Chapter 2

Juan Irabedra
August 1, 2022

First steps towards building a real-time big data processing pipeline with Apache Kafka and Apache Flink with Scala: Apache Kafka topic

In the previous article we presented the project we are building. We will now focus on Apache Kafka: we will set up a Kafka cluster, learn how to create a topic, and discuss a couple of ways to populate it.

Remember what the pipeline looks like from the overview diagram in the previous article.

A little Kafka architecture

Explaining the Kafka architecture can become complex and overly theoretical, so in this section we will cover just as much Kafka as we need to build this pipeline.

Kafka is an event streaming platform. Events and messages are two words for the same thing in this context. Events are sent to topics, which are nothing but ordered sequences of events. Topics can be partitioned and replicated.

Kafka servers, or brokers, are responsible for storing the topic logs. Brokers are coordinated by ZooKeeper, which keeps track of the brokers and the topic metadata. We say we have a Kafka cluster when we launch one or more brokers coordinated by ZooKeeper.

Producers are the clients that write events to topics; consumers read events from them. A simplified view of this setup: producers send events to topics hosted on brokers, and consumers read those events back.


And that is it: that is just enough Kafka to get started. Bear in mind that this explanation simplifies the Kafka architecture a lot, and there are plenty of resources that cover it in depth. One of our favorite blog entries to learn about Kafka is this one.

Setting up a local Kafka Cluster

The very first thing to do is install Scala and Kafka. To install Scala, check the official Scala download site; we want Scala 2.13. Installing Kafka is as simple as downloading the compressed file available on the Apache Kafka download site; for this project we want the build for Scala 2.13. Also, Scala needs Java 1.8 or higher to run, so make sure to get an appropriate JDK.

Once we have extracted the compressed Kafka files, two folders are worth inspecting. The first one is bin, which contains a collection of .sh (shell script) files. These scripts are used to run ZooKeeper and the Kafka server, as well as to manage and query topics. There are many scripts we will not use in this project.

The second folder we should pay attention to is config, which contains several files with the .properties extension. These files are passed as arguments when we execute the shell scripts. For example, the ZooKeeper (zookeeper.properties) and Kafka server (server.properties) host and port are set there.
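For reference, these are typical entries in both files. The exact defaults may vary slightly between Kafka versions, so treat the values below as an illustration rather than an exact copy of what ships with your download:

# config/zookeeper.properties
dataDir=/tmp/zookeeper
clientPort=2181

# config/server.properties
broker.id=0
listeners=PLAINTEXT://localhost:9092
log.dirs=/tmp/kafka-logs
zookeeper.connect=localhost:2181

The defaults are enough for a single-node local cluster, which is all we need here.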

The first step in getting Kafka running is launching ZooKeeper. Head to your favorite terminal and change directory to your Kafka folder (the one we extracted moments ago). Get ZooKeeper up and running by executing the following command:

./bin/zookeeper-server-start.sh ./config/zookeeper.properties

We should see a lot of log output on screen. If ZooKeeper launched successfully, the process stays up and keeps listening on its configured port (2181 by default).

The next step is to start a broker node, that is to say, a Kafka server. To do this, execute the following command in a new terminal tab or window:

./bin/kafka-server-start.sh ./config/server.properties

We are now ready to create a Kafka topic on the broker we have just started. Since the previous tab is busy running the Kafka broker, open a new one and run:

./bin/kafka-topics.sh --create --topic flink-input --replication-factor 1 --bootstrap-server localhost:9092

We have called the topic ‘flink-input’, given it a replication factor of 1 and bound it to the broker we created. Note that the replication factor cannot exceed the number of brokers (in this case we only have one broker).
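To double-check that the topic was created as expected, you can describe it:

./bin/kafka-topics.sh --describe --topic flink-input --bootstrap-server localhost:9092

This prints the partition count, the replication factor and which broker leads each partition.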

Populating and debugging the topic

In our local setup we experimented with a Spring Boot based REST API acting as the producer, but building a RESTful API is beyond the scope of this article. To get started with Kafka we recommend getting familiar with the scripts it ships with, so let us now see how to populate and query a Kafka topic from a terminal.
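If you prefer to produce events programmatically instead, here is a minimal Scala sketch, assuming the org.apache.kafka kafka-clients library is on the classpath. The object name and the sample messages are placeholders, not part of the project described here:

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

// Hypothetical producer that writes a few sample string events to flink-input.
object FlinkInputProducer extends App {
  val props = new Properties()
  // Broker we started locally in the previous step.
  props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
  props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
  props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

  val producer = new KafkaProducer[String, String](props)

  // Send five sample events; keys are omitted for simplicity.
  (1 to 5).foreach { i =>
    producer.send(new ProducerRecord[String, String]("flink-input", s"sample event $i"))
  }

  // close() flushes any buffered records before the application exits.
  producer.close()
}

The console scripts below achieve the same thing without writing any code.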

If you cd to your Apache Kafka root folder, you can list your Kafka topics by running:

./bin/kafka-topics.sh --bootstrap-server localhost:9092 --list

It would be interesting to write some messages to our topic, right? Try running:

./bin/kafka-console-producer.sh --topic flink-input --bootstrap-server localhost:9092

The console will now enter interactive mode, and every line you type will be stored as an event in your Kafka topic.
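For example, you could type a few arbitrary messages at the > prompt (the values below are just placeholders):

>hello kafka
>first event
>second event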

Try querying your topic by running the following command:

./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic flink-input --from-beginning

Since we passed --from-beginning, the consumer replays the topic from the start, printing every event we produced to it.

Our Kafka cluster is now up and running! We already know how to use Kafka’s basic features. Easy enough, right?

In the next entry of this series of articles we will move on to Flink. Before that, can you think of a similar implementation using any other technology that supports data streaming? What advantages and disadvantages would it have compared to the Apache Kafka implementation? We will leave some ideas for you to get started:

Missed any of our previous articles on this pipeline? Check them here:
