Data Engineering

Real-Time Data Pipelines with Kafka and Flink: A Production Playbook

April 10, 2026
14 min read

Batch pipelines are no longer enough. When your business needs data in seconds instead of hours, Kafka and Flink are the production-proven combination. Here is how to build it without drowning in complexity.

When Batch Is Not Enough

Most data pipelines run in batch — extract data every hour or every night, transform it, load it into a warehouse, refresh the dashboards. For reporting and analytics, batch is perfectly fine. If your CEO looks at last week's revenue on Monday morning, a nightly batch pipeline delivers exactly what they need.

But some business problems cannot wait.

  • Fraud detection — a fraudulent transaction must be flagged in milliseconds, not discovered in tomorrow's batch report. By then, the money is gone.
  • Real-time personalisation — an e-commerce recommendation engine that updates with every click, not every hour. The difference between "relevant right now" and "relevant an hour ago" is the difference between a conversion and a bounce.
  • Operational monitoring — a logistics company tracking 10,000 delivery vehicles needs live position updates, not hourly snapshots. A stuck vehicle costs money every minute it goes undetected.
  • IoT and sensor data — a manufacturing plant monitoring 500 sensors needs to detect anomalies in seconds. A batch pipeline that catches a temperature spike 30 minutes later is useless if the equipment is already damaged.
  • Live dashboards — a trading desk, a command centre, a live event — any scenario where decisions are made on the current state of the world, not yesterday's state.

If your use case falls into any of these categories, you need a streaming data pipeline. And the production-proven combination is Apache Kafka for the messaging backbone and Apache Flink for the stream processing engine.

Architecture Overview

Apache Kafka: The Event Backbone

Kafka is a distributed event streaming platform. Think of it as a durable, ordered, replayable log of events. Producers write events (an order placed, a sensor reading, a user click), and consumers read those events — in real time or by replaying from any point in history.

Why Kafka

  • Durability — events are persisted to disk and replicated across brokers. You will not lose data even if servers fail.
  • Ordering guarantees — events within a partition are strictly ordered. This matters for use cases where sequence is important (financial transactions, event sourcing).
  • Scalability — Kafka handles millions of events per second. LinkedIn, Netflix, and Uber run Kafka at scales most enterprises will never approach.
  • Decoupling — producers and consumers are independent. Add a new consumer (a new dashboard, a new ML model, a new microservice) without touching the producers.
  • Replay — reprocess events from any point in time. Changed your processing logic? Replay the last 30 days through the new logic without re-ingesting from source systems.
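The ordering guarantee above is per key: Kafka's default partitioner hashes the record key, so every event with the same key lands on the same partition and stays in order. A minimal Python sketch of the idea (illustrative hashing only, not Kafka's actual murmur2 partitioner):

```python
# Illustrative sketch of key-based partition assignment (not Kafka's
# real murmur2 partitioner): records with the same key always map to
# the same partition, which is what preserves per-key ordering.
import hashlib

def assign_partition(key: str, num_partitions: int) -> int:
    """Deterministically map a record key to a partition."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events for one account hash to one partition, so they stay ordered.
events = [("acct-42", "opened"), ("acct-42", "deposit"), ("acct-42", "closed")]
partitions = {assign_partition(key, 6) for key, _ in events}
```

Because the mapping is deterministic, every producer instance routes "acct-42" to the same partition, and consumers see that account's events in write order.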

Managed vs. Self-Hosted

| Option | Pros | Cons | Cost (approximate) |
|---|---|---|---|
| Confluent Cloud | Fully managed, Schema Registry included, connectors marketplace | Premium pricing, vendor lock-in risk | $400-2,000/month for mid-market |
| Amazon MSK | AWS-native, IAM integration, serverless option | Less feature-rich than Confluent, limited connector ecosystem | $200-800/month |
| Redpanda | Kafka-compatible, simpler operations, no JVM | Smaller ecosystem, less battle-tested at extreme scale | Self-hosted: infrastructure cost only |
| Self-hosted Kafka | Full control, no licensing cost | Significant operational burden (broker management, upgrades, ZooKeeper or KRaft coordination) | Infrastructure + 1-2 days/month ops |

Our recommendation: For most teams, Confluent Cloud or Amazon MSK Serverless eliminates operational pain and lets you focus on building pipelines instead of managing brokers. Self-hosted Kafka only makes sense if you have a dedicated platform team with Kafka expertise.

Schema Registry: Non-Negotiable

Every Kafka topic must have a schema. Period. Without schema enforcement, you will inevitably face the "producer changed the event format and broke every consumer" problem.

Use Confluent Schema Registry (works with any Kafka, not just Confluent Cloud) with Avro or Protobuf schemas. Schema Registry enforces compatibility rules — a producer cannot publish a backward-incompatible schema change without explicitly breaking the compatibility contract.

This is the single most important operational decision you will make with Kafka. We have seen teams spend weeks debugging consumer failures that would have been prevented by a schema registry rejecting the bad schema at publish time.
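The compatibility rule at work can be shown with a toy checker. This is a deliberately simplified sketch of backward compatibility (Avro's real schema resolution also covers types, unions, aliases, and more): a new reader schema may add a field only if it carries a default, otherwise events written under the old schema become undecodable.

```python
# Toy illustration of the backward-compatibility rule a schema registry
# enforces. Simplified model: a schema is a dict of field name -> spec.
def is_backward_compatible(old_fields: dict, new_fields: dict) -> bool:
    for name, spec in new_fields.items():
        if name not in old_fields and not spec.get("has_default", False):
            return False  # new required field: old events can't be decoded
    return True

v1 = {"order_id": {"type": "string"}, "total": {"type": "double"}}
v2_ok = {**v1, "currency": {"type": "string", "has_default": True}}   # accepted
v2_bad = {**v1, "currency": {"type": "string"}}                       # rejected
```

A registry applies checks like this at publish time, which is exactly when you want the bad schema rejected — before any consumer sees it.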

Apache Flink: Stateful Stream Processing

Flink is a distributed stream processing engine. It reads events from Kafka (or other sources), processes them — including stateful operations like windowed aggregations, joins, and pattern detection — and writes results to downstream systems.

Why Flink Over Alternatives

  • True stream processing — Flink processes events one at a time as they arrive, not in micro-batches. This gives you lower latency (milliseconds vs. seconds) and simpler semantics.
  • Stateful processing — Flink manages state (counts, windows, session data) internally with exactly-once guarantees. This is the hard part of stream processing, and Flink does it better than any alternative.
  • Event time processing — Flink understands the difference between when an event happened (event time) and when it arrived (processing time). This is critical for handling late-arriving data correctly.
  • SQL support — Flink SQL lets you write streaming queries in SQL. For teams with strong SQL skills, this dramatically lowers the barrier to stream processing.
  • Savepoints and replay — checkpoint the entire processing state, upgrade your job, and resume from exactly where you left off. Zero data loss during deployments.
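Event-time processing rests on watermarks. A sketch of the bounded-out-of-orderness idea (a simplified model, not Flink's actual API): the watermark trails the largest event time seen by a fixed delay, and a window may close once the watermark passes its end.

```python
# Sketch of Flink's bounded-out-of-orderness watermark idea: the
# watermark trails the highest event time seen by a fixed delay, and
# never moves backwards even when late events arrive.
class BoundedOutOfOrdernessWatermark:
    def __init__(self, max_out_of_orderness_ms: int):
        self.delay = max_out_of_orderness_ms
        self.max_event_time = float("-inf")

    def on_event(self, event_time_ms: int) -> float:
        self.max_event_time = max(self.max_event_time, event_time_ms)
        return self.max_event_time - self.delay  # current watermark

wm = BoundedOutOfOrdernessWatermark(max_out_of_orderness_ms=2_000)
wm.on_event(10_000)          # watermark advances to 8_000
late_wm = wm.on_event(9_000) # late event: watermark stays at 8_000
```

The delay is your tolerance for late data: a larger delay waits longer before closing windows (higher latency, fewer dropped late events), a smaller one does the opposite.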

Flink vs. Spark Streaming vs. Kafka Streams

| Criteria | Flink | Spark Structured Streaming | Kafka Streams |
|---|---|---|---|
| Processing model | True streaming | Micro-batch | True streaming |
| Latency | Milliseconds | Seconds (micro-batch interval) | Milliseconds |
| State management | Built-in, exactly-once | Built-in, exactly-once | Built-in, exactly-once |
| SQL support | Excellent (Flink SQL) | Excellent (Spark SQL) | None (Java/Scala API only) |
| Deployment | Dedicated cluster or managed | Spark cluster (often shared) | Embedded in application (no cluster) |
| Scaling | Horizontal, dynamic | Horizontal, requires restart | Horizontal via consumer groups |
| Best for | Complex stateful processing, low latency | Teams already on Spark, batch + stream unification | Simple transformations, microservice-embedded |

When to use each:

  • Flink — complex event processing, pattern detection, low-latency requirements, stateful joins and aggregations.
  • Spark Streaming — you already have Spark infrastructure and want to add streaming without a new system. Accept the latency trade-off.
  • Kafka Streams — lightweight transformations (filter, map, simple aggregations) that can run embedded in your application without a separate cluster.

Production Patterns

Pattern 1: CDC to Lakehouse (Change Data Capture)

The most common starting point. Capture every change from your operational databases and stream it to your data lakehouse in near real-time.

Stack: Debezium → Kafka → Flink (transform) → Delta Lake / Iceberg on S3

Why: Your analytics are always current. No more "the dashboard updates overnight." Your lakehouse reflects the state of your operational systems within seconds.

Gotchas:

  • Schema evolution in the source database must be handled gracefully. Debezium captures DDL changes; your Flink job must handle new columns without crashing.
  • Initial snapshot: when you first connect Debezium to a large table, it snapshots the entire table into Kafka. Plan for this volume.
  • Ordering: Debezium guarantees ordering per-row (same primary key), not across rows. Your Flink job must handle this correctly for joins.
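The per-key ordering guarantee is exactly what makes CDC replay correct. A small sketch (simplified Debezium-style event shape with hypothetical field names) that rebuilds current table state from a change stream:

```python
# Sketch of why per-key ordering matters for CDC: replaying
# Debezium-style change events (create/update/delete, keyed by primary
# key) in order reconstructs the current table state.
def apply_cdc(events):
    state = {}
    for ev in events:
        key, op = ev["key"], ev["op"]
        if op == "d":
            state.pop(key, None)        # delete removes the row
        else:                           # "c" (create) or "u" (update)
            state[key] = ev["after"]    # keep the latest row image
    return state

changes = [
    {"key": 1, "op": "c", "after": {"status": "new"}},
    {"key": 1, "op": "u", "after": {"status": "paid"}},
    {"key": 2, "op": "c", "after": {"status": "new"}},
    {"key": 2, "op": "d"},
]
state = apply_cdc(changes)   # row 1 is "paid", row 2 is gone
```

Reorder the two events for key 1 and you materialise a stale "new" status — which is why a Flink job joining CDC streams must not shuffle events for the same key across parallel instances.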

Pattern 2: Real-Time Aggregations

Compute live metrics from event streams — orders per minute, revenue per hour, active users right now.

Stack: Application events → Kafka → Flink (windowed aggregation) → Redis (cache) → Dashboard (WebSocket)

Why: Operational dashboards that update every second instead of every hour. Command centres, trading floors, live event monitoring.

Flink SQL example:

SELECT
  product_category,
  TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start,
  COUNT(*) AS order_count,
  SUM(order_total) AS revenue
FROM orders_stream
GROUP BY
  product_category,
  TUMBLE(event_time, INTERVAL '5' MINUTE)
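The same aggregation sketched in plain Python, to make the window assignment explicit: each event's timestamp is floored to a 5-minute boundary. (Illustrative only — Flink handles the windowing, event time, and late data internally.)

```python
# Plain-Python equivalent of the tumbling-window aggregation above:
# floor each timestamp to a 5-minute boundary, then count and sum per
# (category, window) pair.
from collections import defaultdict

WINDOW_MS = 5 * 60 * 1000

def aggregate(events):
    windows = defaultdict(lambda: {"order_count": 0, "revenue": 0.0})
    for ev in events:
        window_start = ev["event_time"] - ev["event_time"] % WINDOW_MS
        agg = windows[(ev["product_category"], window_start)]
        agg["order_count"] += 1
        agg["revenue"] += ev["order_total"]
    return dict(windows)

events = [
    {"product_category": "books", "event_time": 60_000,  "order_total": 20.0},
    {"product_category": "books", "event_time": 120_000, "order_total": 30.0},
    {"product_category": "books", "event_time": 360_000, "order_total": 10.0},
]
result = aggregate(events)
```

The first two events share the window starting at 0 ms; the third falls into the window starting at 300,000 ms.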

Pattern 3: Event-Driven Microservices

Decouple your services through events instead of synchronous API calls.

Stack: Service A → Kafka (event) → Service B consumes → processes → Kafka (result event) → Service C consumes

Why: Services are independent. Service B can be deployed, scaled, or even rewritten without touching Service A. If Service B goes down, events queue in Kafka and are processed when it recovers. No data loss.

Pattern 4: Real-Time ML Feature Serving

Compute ML features from streaming data and serve them to models in real time.

Stack: Events → Kafka → Flink (feature computation) → Feature Store (Redis / Feast) → ML Model

Why: Your fraud detection model needs the "number of transactions in the last 5 minutes for this card" as a feature. That feature must be computed in real time from the transaction stream, not looked up from a batch table that is 6 hours stale.
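A sketch of that feature's computation (simplified and in-memory; a real Flink job would keep this state in RocksDB and publish it to the feature store): a per-card deque of timestamps, evicted against a 5-minute window on each transaction.

```python
# Sketch of the "transactions in the last 5 minutes for this card"
# feature: keep a per-card deque of timestamps and evict anything
# older than the window on each new transaction.
from collections import defaultdict, deque

WINDOW_MS = 5 * 60 * 1000

class TxnCountFeature:
    def __init__(self):
        self.by_card = defaultdict(deque)

    def on_transaction(self, card_id: str, ts_ms: int) -> int:
        q = self.by_card[card_id]
        q.append(ts_ms)
        while q and q[0] <= ts_ms - WINDOW_MS:
            q.popleft()                 # evict events outside the window
        return len(q)                   # feature value at this moment

f = TxnCountFeature()
f.on_transaction("card-1", 0)
f.on_transaction("card-1", 60_000)
count = f.on_transaction("card-1", 301_000)  # the first txn has aged out
```

The point of streaming feature computation is that `count` is correct at the moment the model asks for it, not as of the last batch run.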

Operational Realities

Kafka Operations

  • Topic partitioning — choose partition count carefully. Too few partitions limit parallelism. Too many increase broker overhead and rebalancing time. Rule of thumb: 3-6 partitions per topic for most workloads, scale up based on throughput needs.
  • Retention — default to 7 days retention. For event sourcing or audit requirements, use infinite retention or tiered storage (Confluent) to move old data to cheaper storage.
  • Consumer lag monitoring — the most important Kafka metric. If consumers fall behind producers, you have a processing bottleneck. Alert on lag exceeding your SLA threshold.
  • Dead letter queues — events that fail processing must go somewhere. Write them to a DLQ topic with error context, monitor, and reprocess after fixing the issue.
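The DLQ pattern in miniature (illustrative sketch; in production the `dlq` list would be a Kafka producer writing to a dead-letter topic):

```python
# Sketch of dead-letter routing: failed events are wrapped with error
# context and appended to a DLQ (here just a list) instead of being
# dropped or blocking the stream.
import json
import time

def process_with_dlq(events, handler, dlq):
    for ev in events:
        try:
            handler(ev)
        except Exception as exc:
            dlq.append({
                "original_event": ev,
                "error": str(exc),
                "error_type": type(exc).__name__,
                "failed_at": time.time(),
            })

def handler(ev):
    json.loads(ev)   # fails on malformed payloads

dlq = []
process_with_dlq(['{"ok": 1}', "not-json"], handler, dlq)
```

Keeping the original event alongside the error context is what makes reprocessing possible after the bug is fixed.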

Flink Operations

  • Checkpointing — enable checkpointing at 1-5 minute intervals. Checkpoints snapshot Kafka offsets and operator state together, and are how Flink provides exactly-once guarantees. Without checkpointing, a failure loses all operator state and forces reprocessing from whatever Kafka offsets were last committed.
  • State backend — use RocksDB state backend for production. The in-memory backend is faster but limited by JVM heap. RocksDB handles state sizes that exceed available memory by spilling to disk.
  • Parallelism tuning — Flink's parallelism determines how many parallel instances process your stream. Start with the number of Kafka partitions and tune based on throughput and latency metrics.
  • Backpressure handling — Flink propagates backpressure automatically. If a downstream sink is slow, Flink slows down the entire pipeline rather than dropping events. Monitor backpressure in the Flink dashboard and address the bottleneck.
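Why checkpoints must capture offsets and operator state together can be shown with a toy example (a simplified model of Flink's mechanism, not its API): restoring both from one snapshot means no event is counted twice and none is lost.

```python
# Toy model of checkpoint-and-restore: offset and running state are
# snapshotted atomically, so resuming after a crash neither double-
# counts nor skips events.
def run(events, checkpoint, crash_at=None):
    offset, count = checkpoint["offset"], checkpoint["count"]
    for i in range(offset, len(events)):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated crash")  # un-checkpointed work is lost
        count += events[i]
        if (i + 1) % 2 == 0:                       # checkpoint every 2 events
            checkpoint.update(offset=i + 1, count=count)
    return count

events = [1, 2, 3, 4, 5]
ckpt = {"offset": 0, "count": 0}
try:
    run(events, ckpt, crash_at=3)                  # crash mid-stream
except RuntimeError:
    pass
total = run(events, ckpt)                          # resume from the checkpoint
```

If the offset were committed separately from the count (as naive consumer code often does), the crash would leave them inconsistent and the total would be wrong.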

Cost Considerations

Real-time pipelines are more expensive than batch — that is the trade-off for lower latency. Here is what to expect:

| Component | Managed Cost | Self-hosted Cost |
|---|---|---|
| Kafka (Confluent Cloud, Basic) | $400-1,500/month | $500-1,000/month (3-node cluster) |
| Flink (Amazon Managed Service for Apache Flink / Confluent) | $300-1,200/month | $400-800/month (3-node cluster) |
| Schema Registry | Included with Confluent | Free (open-source) |
| Monitoring (Datadog/Grafana) | $100-500/month | $100-300/month |
| Total | $800-3,200/month | $1,000-2,100/month |

Cost optimisation tips:

  • Not everything needs to be real-time. Use streaming for the 20% of data that needs sub-minute latency. Keep the other 80% on batch pipelines (cheaper, simpler).
  • Use Kafka compacted topics for slowly changing dimensions instead of full retention.
  • Right-size Flink parallelism — over-provisioning wastes compute; under-provisioning causes lag.
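Compaction in miniature (an illustrative sketch of the semantics, not of broker internals): a compacted topic eventually keeps only the newest record per key, with a null value acting as a delete tombstone — which is all a slowly changing dimension needs.

```python
# Sketch of log-compaction semantics: retain only the latest record
# per key; a None value is a tombstone that deletes the key entirely.
def compact(log):
    latest = {}
    for key, value in log:           # later records overwrite earlier ones
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]

log = [
    ("cust-1", {"tier": "silver"}),
    ("cust-2", {"tier": "gold"}),
    ("cust-1", {"tier": "gold"}),    # update: older record is compacted away
    ("cust-2", None),                # tombstone: delete the key
]
compacted = compact(log)
```

Storage stays proportional to the number of distinct keys rather than the total event count, which is the cost win over full retention.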

When NOT to Build Real-Time Pipelines

  • Your dashboards refresh daily — if the business consumes data in batch cadence, real-time infrastructure is pure cost overhead.
  • Your data volume is small — under 10,000 events per hour, a simple poll-based approach (query the database every minute) works fine without Kafka or Flink.
  • Your team has no streaming experience — the learning curve is real. Budget 4-6 weeks for a team to become productive with Kafka + Flink. If you need results in 2 weeks, start with batch and add streaming later.
  • The use case is not latency-sensitive — "we want real-time data" is not a business requirement. "We lose $X per minute of delay in detecting Y" is. If you cannot quantify the cost of latency, you probably do not need real-time.

How We Deliver Real-Time Pipelines

Our Data Engineering & Pipeline Development service includes real-time architecture design and implementation:

  1. Assessment (1 week) — identify which data flows actually need real-time processing. Map latency requirements to business value. Often, 2-3 streams need real-time while the rest stay batch.

  2. Architecture Design (1-2 weeks) — select managed vs. self-hosted, design topic structure, define schemas, plan state management, and design the monitoring stack.

  3. Implementation (4-8 weeks) — build the Kafka infrastructure, Flink jobs, schema registry, monitoring, and integration with your existing data lakehouse or warehouse.

  4. Handover (1-2 weeks) — documentation, runbooks, team training, and operational handover. Your team owns it; we provide optional ongoing support.

For teams choosing between streaming architectures, our orchestrator comparison covers how Airflow, Prefect, and Dagster fit alongside streaming pipelines for hybrid batch + real-time architectures.

Book a free architecture review to determine whether real-time pipelines are worth the investment for your specific use case.

Frequently Asked Questions

Do we need Kafka if we just want CDC?

Not necessarily. For simple CDC-to-warehouse, tools like Fivetran or Airbyte handle CDC without Kafka. You need Kafka when: (a) multiple consumers need the same CDC stream, (b) you need sub-minute latency, (c) you want replay capability, or (d) you are building event-driven microservices that consume database changes as events.

Can we use Amazon Kinesis instead of Kafka?

Yes, but with trade-offs. Kinesis is simpler to operate (fully managed, no broker management) but has lower throughput limits per shard, no native schema registry, limited replay (max 365 days), and smaller ecosystem. For AWS-native teams processing under 50,000 events/second, Kinesis is a valid choice. Beyond that, Kafka is the better foundation.

How do we handle exactly-once processing?

Flink provides exactly-once semantics through checkpointing and two-phase commit with Kafka. The key is using Flink's Kafka sink with exactly-once mode enabled and setting the Kafka transaction timeout appropriately. For sinks other than Kafka (databases, APIs), you need idempotent writes — design your sink operations so that processing the same event twice produces the same result.
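An idempotent sink in miniature (illustrative sketch; `event_id` is a hypothetical field name): keying writes by event id turns a redelivered duplicate into a harmless overwrite instead of a double-count.

```python
# Sketch of an idempotent write: the sink upserts by event id, so
# replaying the same event after a failure leaves the store unchanged.
def idempotent_write(store: dict, event: dict) -> None:
    store[event["event_id"]] = event["amount"]   # upsert, not append

store = {}
events = [
    {"event_id": "e1", "amount": 100},
    {"event_id": "e2", "amount": 50},
    {"event_id": "e1", "amount": 100},   # redelivered duplicate
]
for ev in events:
    idempotent_write(store, ev)
total = sum(store.values())              # 150, not 250
```

In a real database sink this is an upsert keyed on the event id (e.g. `INSERT ... ON CONFLICT DO UPDATE`), giving you effective exactly-once results on top of at-least-once delivery.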

What monitoring do we need?

At minimum: Kafka consumer lag (the most critical metric), Flink checkpoint duration and failure rate, Flink backpressure indicators, event throughput (in/out), and end-to-end latency (event timestamp to sink write timestamp). We use Prometheus + Grafana for self-hosted and Confluent Control Center + Datadog for managed deployments.

Can Flink replace our batch pipeline tool (Airflow/dbt)?

Flink can process batch data (it treats batch as a special case of streaming), but it is not a replacement for Airflow or dbt in most architectures. Use Flink for stream processing and real-time transformations. Use dbt for complex SQL transformations in your warehouse. Use Airflow/Dagster to orchestrate both. The hybrid architecture — streaming for latency-sensitive data, batch for everything else — is the pragmatic choice for most organisations.

Not Sure Where to Start?

Book a free 30-minute strategy session with a senior data architect — no pitch, no obligation.

Schedule Your Free Strategy Session


Typically responds within 1 business day · Available for India, US, UK & Canada