Setting Up Kafka

Apache Kafka is a high-throughput, low-latency event streaming system designed to handle millions of events per second. It can store, read and analyze streaming data. In an object-based system, we are interested in the objects, their interactions with one another, and the state of each object, which we may store in a database. In an event-based system, our interest is more in the events generated by those objects, or by the system, than in the objects themselves. Event data is stored in a structure called a log, and a log is an ordered sequence of these events. Unlike databases, these logs are easy to scale, and Apache Kafka is a system for managing and processing them. The main components of the Kafka system are:

  • Topics
  • Kafka Connect
  • Kafka Streams

In Apache Kafka terms, these logs are called topics, and they are stored durably: they are written to disk and replicated across multiple disks and multiple servers so that the data persists even in the event of hardware failure. Topics can be retained for a configurable amount of time, ranging from a short period to forever, and they can be small or enormous. Each entry in the log represents an event in the system.

Modern software systems, unlike traditional single monolithic applications, are microservice based, where each service/component can be developed and versioned separately. All these services can talk to each other through Kafka topics: a service can consume a topic, process it, and produce another topic, which in turn can be consumed by another service or simply stored. If you have non-topic data, you can connect it to the Kafka system via a configurable software package known as Kafka Connect. With Kafka Connect you can read data in from, and write data out to, external systems and databases, and there are plenty of such connectors on the market. Once we have many topics in the system and want to perform computations on them, such as aggregation, grouping, filtering or enrichment (joining/connecting topic data), there is a built-in library/API for this called Kafka Streams. Its result is, again, another topic.
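As a rough sketch of what such a Kafka Streams application looks like, the small Java program below filters one topic into another. The topic names orders and large-orders, the placeholder predicate and the localhost broker address are assumptions made for this example, not part of any particular setup.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LargeOrderFilter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "large-order-filter");  // also used as the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read the "orders" topic, keep only some records, write the result to a new topic.
        KStream<String, String> orders = builder.stream("orders");
        orders.filter((key, value) -> value.length() > 100)  // placeholder predicate for "large" orders
              .to("large-orders");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

The output of the filter is itself just another topic, so any other service can consume large-orders without knowing this program exists.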

Fundamental components of Kafka.

  • Producers
  • Consumers
  • Brokers
  • Orchestrator – ZooKeeper

To get data into Kafka, we need producers: programs we write that put data into Kafka. Some application knows about events in the system and writes them into Kafka, and Kafka acknowledges the write. A Kafka cluster consists of brokers. Brokers can be thought of as the machines of Kafka, and each broker has a disk of its own; if you are working with a fully managed Kafka, you may not have to worry about the brokers at all. Calling a broker a machine is not entirely fair, since it may just as well be a container: a broker can run on bare-metal hardware, a cloud instance, a container managed by Kubernetes, Docker on your laptop, or wherever JVM processes can run.

On the consumer side, we write a piece of code that knows how to read data from Kafka. Producers and consumers do not know each other; they are completely decoupled. Either side can scale: you can add consumers or producers without disturbing the system, and they can fail independently and evolve independently.

What the orchestrator does: each broker can process topics and store them on its disk, and for durability they are also replicated to other brokers. When a broker fails, those stored topics need to be managed, and someone has to decide who takes over responsibility for them; that is the main job of the orchestrator. Topics are sequences of events. Since a topic is a log, you can only write to it at its end. Many producers can write to one topic, one producer can write to many topics, and conceptually you can have as many topics as you like.
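To make the producer and consumer roles concrete, here is a minimal sketch using the Java client. The broker address localhost:9092, the topic name events and the group id are placeholders for a local setup.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerConsumerDemo {
    public static void main(String[] args) {
        // Producer side: writes an event to the "events" topic; the broker acknowledges it.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("events", "user-42", "logged-in"));
        }

        // Consumer side: completely decoupled from the producer, it only knows the topic name.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "demo-group");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("key=%s value=%s partition=%d%n",
                    r.key(), r.value(), r.partition()));
        }
    }
}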

Topic-partition

A topic is a log: it can grow over time, and it has to be stored on a broker. A broker is a machine with a finite disk; we can scale the machine vertically, but not forever, and reading from and writing to a single ever-growing log becomes costly. To handle topics effectively, Kafka therefore partitions them, and each partition can be allocated to a different broker. This is key to how Kafka scales. Every partition is itself a log, and each partition is further broken into segments, so a partition on a broker is a set of files with an index. You don't really have to worry about segments, but you do have to think about partitioning as you think about how to model data in topics. Kafka decides where these partitions are going to live. What Kafka doesn't do is keep track of the size of those partitions and move them around if one broker gets overloaded as topics get created and destroyed; that is functionality we need to add ourselves to keep things balanced.
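As a sketch of where partitioning shows up in practice, the Java AdminClient can create a topic with an explicit partition count and replication factor. The topic name, the counts and the broker address below are illustrative; a replication factor of 3 assumes at least three brokers.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePartitionedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions spread across the brokers, each partition replicated 3 times.
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(Collections.singletonList(orders)).all().get();
        }
    }
}

The partition count is the main knob for spreading a topic over brokers; with the default partitioner, records that share a key always land in the same partition, which is why partitioning is part of data modelling.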

How Kafka works:

  1. How to handle schema changes.
  2. How to handle a Kafka node running out of disk space.
  3. Non-default types and serialization (for example, a GUID as the key of a message; see the sketch after this list) – schema registry, https://bartwullems.blogspot.com/2020/11/sending-message-through-kafka-value.html
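For point 3, recent versions of the kafka-clients library ship a UUID serializer/deserializer pair that can be used when a GUID is the message key; non-default value types are usually handled through a schema registry as in the linked post. The following is a minimal sketch with an assumed topic name and a local broker address.

import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.common.serialization.UUIDSerializer;

public class UuidKeyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // assumed local broker
        props.put("key.serializer", UUIDSerializer.class.getName()); // UUID/GUID used as the message key
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<UUID, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", UUID.randomUUID(), "payload"));
        }
    }
}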

Setting up a Kafka cluster on the local machine.

  1. The Java JDK must be installed. The Java used by Kafka here is JDK 8, so download and install JDK 8.
  2. Download the Kafka package from apache.org and place it in some folder.
  3. Configure Kafka. These steps are nicely explained at https://medium.com/technofunnel/apache-kafka-in-5-minutes-c92c43ba3f39

How to run more Brokers locally:

Make sure you have a separate server configuration file for each broker you want to run, and in each configuration change the values below.

  1. broker.id=1
  2. port=9192
  3. log.dirs=C:\kafka_2.12-2.4.0\Broker1

Start each broker separately.

When going to production:

Update server.properties to reflect the settings required for your production environment.

  1. Set the proper broker id: broker.id=0
  2. Set the host name and port numbers: port=9092, advertised.host.name=localhost
  3. Enable topic deletion if you want topics to be deletable through the admin tools: delete.topic.enable=true

Parquet file experiments, findings and recommendations

Parquet is a binary file format designed with big data in mind, where data must be accessed frequently and efficiently. The way it stores files on disk also differs from other file formats: it is a column-oriented data file, and in practice it combines row-based and column-based approaches to get the best of both worlds. Data is encoded on disk, which keeps the size small compared to the raw data, and is then compressed, with the file scanned as a whole and redundant parts cut out. Query/read speed is dramatically faster than with other file formats. Nested data is handled efficiently, which is quite cumbersome to achieve in other file formats. Because of how the data is laid out, a reader does not have to parse the entire file to find data, which makes reading efficient. Parquet works well with data processing frameworks and automatically stores schema information. SQL querying is also possible with this file format.
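As a small sketch of writing such a file from Java, assuming the parquet-avro and Hadoop client dependencies are on the classpath; the schema, field names and file name below are made up for the example.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteDemo {
    public static void main(String[] args) throws Exception {
        // The schema is stored in the file itself, so readers do not need it out of band.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Measurement\",\"fields\":["
              + "{\"name\":\"turbineId\",\"type\":\"string\"},"
              + "{\"name\":\"power\",\"type\":\"double\"}]}");

        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("measurements.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY) // data is encoded, then compressed
                .build()) {
            GenericRecord record = new GenericData.Record(schema);
            record.put("turbineId", "WTG-001");
            record.put("power", 1530.5);
            writer.write(record);
        }
    }
}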

Libish Varghese Jacob

I am a lead engineer at a prominent wind turbine manufacturing firm. My interests span a diverse range, and immersing myself in technology is one of them. This platform serves as my primary knowledge base, where I seek information and insights. Moreover, I utilize this platform to share my experiences and experiments, hoping they may benefit those following a similar path. It's important to note that the suggestions I express here are based on my best knowledge at the time of writing and may not necessarily represent the optimal solutions. I wholeheartedly welcome any comments you may have to improve or refine my ideas. Your feedback is greatly appreciated and valued.