Apache Kafka & Event Streaming
Kafka is a distributed event store and stream-processing platform. It can handle trillions of events a day, which makes it a common backbone of real-time data engineering.
🔗 Learning Resources
- Official Documentation: Apache Kafka Docs
- Introduction to Kafka: Confluent’s Kafka 101 - The best conceptual starting point.
- Interactive Course: Learn Apache Kafka (YouTube)
- Visual Guide: Kafka Visualizer - Great for understanding how offsets and partitions work.
⚡ Apache Kafka Practice Tasks: Real-Time Streaming
Kafka is a distributed event streaming platform. Instead of processing “batches” of data, we process “events” as they happen.
📝 Practice Tasks: Topics, Producers, and Consumers
Task 1: Infrastructure Setup (CLI)
Goal: Understand the core components of a Kafka Cluster.
- Requirement: Start ZooKeeper and a Kafka broker (using Docker or local binaries). Newer Kafka versions can instead run in KRaft mode, which needs no ZooKeeper.
- Challenge:
  1. Create a topic named `grocery-sales` with 3 partitions and a replication factor of 1.
  2. List all topics to verify it exists.
  3. Describe the topic to see the partition distribution.
- Command: `kafka-topics.sh --create --topic grocery-sales --bootstrap-server localhost:9092 ...`
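Assuming a broker listening on `localhost:9092` and the stock CLI scripts on your `PATH`, the three steps might look like this:

```shell
# 1. Create the topic with 3 partitions and a replication factor of 1
kafka-topics.sh --create --topic grocery-sales \
  --partitions 3 --replication-factor 1 \
  --bootstrap-server localhost:9092

# 2. List all topics to verify it exists
kafka-topics.sh --list --bootstrap-server localhost:9092

# 3. Describe the topic to see partition leaders and replicas
kafka-topics.sh --describe --topic grocery-sales \
  --bootstrap-server localhost:9092
```

The describe output shows which broker leads each partition; with a single broker and replication factor 1, all three partitions live on that one broker.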
Task 2: The Console Producer & Consumer
Goal: Send and receive your first messages manually.
- Requirement: Open two terminal windows.
- Challenge:
  1. In Terminal A (Producer), type 3 messages representing sales, e.g. `{"id": 1, "item": "Apple"}`.
  2. In Terminal B (Consumer), read the messages from the beginning (`--from-beginning`).
- Discussion: What happens to the messages in Terminal B if you close it and reopen it? Do they disappear?
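With the same `localhost:9092` broker assumed above, the two terminals could run the stock console tools like so:

```shell
# Terminal A: start a console producer attached to the topic,
# then type one JSON message per line, e.g. {"id": 1, "item": "Apple"}
kafka-console-producer.sh --topic grocery-sales \
  --bootstrap-server localhost:9092

# Terminal B: start a console consumer reading from the earliest offset
kafka-console-consumer.sh --topic grocery-sales --from-beginning \
  --bootstrap-server localhost:9092
```

Because Kafka retains messages on disk regardless of who has read them, restarting Terminal B with `--from-beginning` replays everything: the messages do not disappear when a consumer exits.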
Task 3: Python Integration (confluent-kafka or kafka-python)
Goal: Programmatically stream data.
- Requirement: Write a Python script to act as a Producer.
- Challenge:
  1. Create a loop that sends a new “Sale Event” every 2 seconds to the `grocery-sales` topic.
  2. Each event should have a random `price` and a `timestamp`.
- Verification: Use the Console Consumer to watch the live stream of data appearing in real-time.
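A minimal producer sketch, assuming the `confluent-kafka` package and a broker at `localhost:9092`. The topic name, 2-second interval, and `price`/`timestamp` fields follow the task; the item names and price range are illustrative. The event builder is a plain function so it can be tested without a broker:

```python
import json
import random
import time
from datetime import datetime, timezone

TOPIC = "grocery-sales"

def make_sale_event() -> dict:
    """Build one sale event with a random price and a current timestamp."""
    return {
        "item": random.choice(["Apple", "Bread", "Milk", "Eggs"]),
        "price": round(random.uniform(0.5, 20.0), 2),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

def run_producer(bootstrap_servers: str = "localhost:9092") -> None:
    # Imported here so make_sale_event stays testable without the package.
    from confluent_kafka import Producer

    producer = Producer({"bootstrap.servers": bootstrap_servers})
    while True:
        event = make_sale_event()
        # Keying by item sends events for the same item to the same partition.
        producer.produce(TOPIC, key=event["item"], value=json.dumps(event))
        producer.poll(0)  # serve any pending delivery callbacks
        time.sleep(2)

if __name__ == "__main__":
    run_producer()
```

Run the script in one terminal and the console consumer from Task 2 in another to watch the events arrive live.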
Task 4: Consumer Groups & Scalability
Goal: Understand how Kafka handles high volume.
- Requirement: Start two instances of a Python Consumer script using the same `group.id`.
- Challenge: Observe the output. Does each consumer see every message, or do they split the load?
- Discussion: Why is the number of Partitions the “ceiling” for how many consumers you can have in a group?
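The discussion point can be made concrete with a toy simulation. This is not Kafka's actual assignor, just an assumed round-robin sketch of the same idea: each partition is owned by exactly one consumer in the group, so consumers beyond the partition count sit idle.

```python
def assign_partitions(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Toy round-robin assignment: every partition gets exactly one owner,
    so a group can never have more active consumers than partitions."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment

# 3 partitions, 2 consumers: the load is split, no message is read twice.
print(assign_partitions(3, ["consumer-a", "consumer-b"]))
# 3 partitions, 4 consumers: one consumer receives nothing -- the "ceiling".
print(assign_partitions(3, ["c1", "c2", "c3", "c4"]))
```

With 3 partitions and 2 consumers, each consumer reads only its assigned partitions, which is why the two script instances split the stream rather than both seeing every message.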
🏗️ Mini Project: The “Real-Time Alert” Pipeline
Build a small streaming application that monitors for high-value transactions.
- Producer: A script that generates random sales data ($0 to $100).
- Stream Processor (Python): A consumer script that reads the stream.
- The Logic: If a sale is greater than $80, print a “🚨 ALERT: High Value Sale Detected!” message.
- The Sink: Save only the “Alert” transactions into a local file named `alerts_log.txt`.
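One way to sketch the processor and sink, again assuming `confluent-kafka` and a `localhost:9092` broker. The $80 threshold, alert message, and `alerts_log.txt` filename come from the project spec; the `group.id` is illustrative. The alert check is a pure function so it can be tested without a broker:

```python
import json

ALERT_THRESHOLD = 80.0

def is_alert(sale: dict) -> bool:
    """A sale strictly greater than $80 triggers an alert."""
    return sale.get("price", 0) > ALERT_THRESHOLD

def run_alert_pipeline(bootstrap_servers: str = "localhost:9092") -> None:
    # Imported lazily; requires the confluent-kafka package and a broker.
    from confluent_kafka import Consumer

    consumer = Consumer({
        "bootstrap.servers": bootstrap_servers,
        "group.id": "alert-pipeline",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["grocery-sales"])
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None or msg.error():
                continue
            sale = json.loads(msg.value())
            if is_alert(sale):
                print("🚨 ALERT: High Value Sale Detected!", sale)
                # The sink: append only alert transactions to the log file.
                with open("alerts_log.txt", "a") as sink:
                    sink.write(json.dumps(sale) + "\n")
    finally:
        consumer.close()

if __name__ == "__main__":
    run_alert_pipeline()
```

Run the Task 3 producer alongside this script; only sales above $80 should reach `alerts_log.txt`.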