Setting Up Kafka
Apache Kafka is a high-throughput, low-latency event processing system designed to handle millions of data feeds per second. It has the capability to store, read and analyze streaming data. In an object-based system, what we are interested is the objects and its interaction with one another and the state of each object which we may store in a database. In an event-based system, our interest will be more on events generated by those objects or the system than the objects itself. Event data is stored in a structure called log and log is an ordered sequence of these events. Unlike databases, these logs are easy to scale. And Apache Kafka is a system to manage and process these logs. The main components of Kafka system are
- Kafka connect.
- Kafka streams.
In Apache Kafka terms, these logs are called as Topics, and is stored in a durable way. They are written to disk and are replicated on multiple disks, multiple servers to make sure the data persists even in an event of hardware failure. Topics can be stored for a configurable amount of time ranging from short period of time to forever. Topics can be small or enormous. Each entry of the log item or topics item represent an event in the system. Modern software systems, unlike traditional where we have a single monolithic application, are microservice based where each service/component can be developed and versioned separately. All these services can talk to each other through Kafka topics. Each service can consume a topic, process it, and generate another topic which can be consumed by another service or stored as a topic. If you have a non-topic data, then you should be able to connect that to Kafka system via a configurable software package known as Kafka connect. With Kafka connect, you can read in and out data from and to external systems or database. There are a lot of such connectors in market. Now we have a lot of topics in the system and if you want to do some computation/operation on these topics like aggregation, grouping, filtering, enrichment(joint/connecting topic data), we have an inbuild library/API to do all these and this is called as Kafka streams. The result of it will be another topic.
Fundamental components of Kafka.
- Orchestrator – zookeeper
To get data into Kafka, we need producer. Write program that put data into Kafka. Some potential application knows about events in a system and writes those into Kafka and Kafka acknowledges it. Kafka clusters has brokers. Brokers can be treated as machines in Kafka. Each broker has a disk of its own. If we are working with a fully managed Kafka, you may not worry about these brokers. Calling brokers as machine may not be fair, it may be a container aswell. A broker can run on bare metal hardware, a cloud instance, in a container managed by Kubernetes, in Docker on your laptop, or wherever JVM processes can run. Now at the consumer side, we must write a piece of code which knows how to read data from Kafka. Producer and consumer doesn’t know each other. They are completely decoupled. Producers or consumers can scale. You can add consumers/producers into the system without disturbing the system. They can fail independently they can evolve independently. What orchestrator does: Each broker has the capability of process topics and can store it in its disk. For durability purpose, it is also replicated in other brokers too. When a broker fails, these stored topics needs to be managed. There should be someone who should decide who has the responsibility of these topics. These are the main job of orchestrator. Topics are sequence of events. Since it is a log, you can only write to topic at its end. Many producers can write to a topic, or one producer can write to many topics. Conceptually you can have as many topics as you like.
Topics is a log, it can increase in size over period, it needs to be stored in a broker. Broker is a machine with a disk size, we can scale the machine vertically, but it cannot scale forever. Also to read in and write into it will become costly if it keeps increasing its size. So to handle this topics effectively, Kafka has ways to partition these topics and each partition can be allocated to different brokers. This is key to how Kafka scales. So when we do this, every partition is a log. Each partition is then further broken into segments. SO a partitions on a broker is a set of files with index. You don’t really have to worry about segments but you do have to think about partitioning as you think about how to model data in topics. Kafka makes the decision about where these partitions are going to live. What Kafka doesn’t do is to keep track of the size of those partitions and move them around if one broker gets overloaded as topics gets created and destroyed. These are functionality that we need to add to keep those things balanced.
How Kafka works:
- How to handle schema change.
- How to handle when Kafka node run out of space?
- Non default types and serialization.(GUID as key for a message) – schemaRegistry https://bartwullems.blogspot.com/2020/11/sending-message-through-kafka-value.html
Setting up Kafka cluster on local machine.
- Java JDK must be installed.
- Kafka package must be downloaded
- Configure Kafka. The java used by Kafka is JDK 8 so please download and install JDK 8. Download Kafka package from apache.org and place it in some folder. These are nicely explained at https://medium.com/technofunnel/apache-kafka-in-5-minutes-c92c43ba3f39
How to run more Brokers locally:
Make sure you have separate server configuration for each broker that you want. And on each setting, make sure to change the below values.
- log.dirs=C:\kafka_2.12-2.4.0\Broker1 Start each broker separately.
When going to production:
Update the server server.properties to reflect the settings required for your production environment. Set the proper broker id: broker.id=0 Set host name and port numbers : port=9092 advertised.host.name = localhost Set the topic delete option to specify if you would like to have the topic deleted after certain point in time. delete.topic.enable = true