"Real-time" gets used loosely. Before architecting anything, we push every team to define what real-time actually means for their use case. Sub-second? Sub-minute? Within five minutes? The answer changes everything downstream.
When Confluent fits
Confluent — and Kafka more broadly — is the right choice when you have multiple producers, multiple consumers, and need durable replay. If only one team produces and only one team consumes, you don't need Kafka. A queue or direct-write pattern is simpler.
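That rule of thumb is simple enough to write down. The following is a minimal sketch, not a sizing tool: `Workload`, its fields, and `recommend_transport` are hypothetical names invented for illustration, and anything between the two clear-cut cases is deliberately left as a judgment call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a data flow; field names are illustrative."""
    producer_teams: int
    consumer_teams: int
    needs_durable_replay: bool


def recommend_transport(w: Workload) -> str:
    """Apply the two clear-cut cases from the text; everything else is a judgment call."""
    # Multiple producers, multiple consumers, and durable replay: Kafka fits.
    if w.producer_teams > 1 and w.consumer_teams > 1 and w.needs_durable_replay:
        return "kafka"
    # One team produces and one team consumes: a queue or direct write is simpler.
    if w.producer_teams <= 1 and w.consumer_teams <= 1:
        return "queue-or-direct-write"
    # The text does not settle the in-between cases, so neither does this sketch.
    return "case-by-case"


if __name__ == "__main__":
    print(recommend_transport(Workload(3, 4, True)))
    print(recommend_transport(Workload(1, 1, False)))
```

The third outcome matters: a single producer with many consumers, for example, is a case where the answer depends on details this check does not capture.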
The lakehouse is the analytics tier
Streams flow through Kafka. They land in the lakehouse for analytics. Trying to make Kafka itself the analytical store is a mistake; trying to make the lakehouse the operational message bus is the same mistake from the other direction.
Patterns that hold up
- CDC into Kafka, then to Delta: Source-of-truth changes, captured via change data capture (CDC), go through Confluent (with schema enforcement). From there, Delta Live Tables reads the topics directly, or Auto Loader picks up files that a sink connector has landed in cloud storage.
- Spark Structured Streaming for transformations: When transformations need to happen in flight, Structured Streaming over Kafka topics with Delta sinks is the path of least resistance.
- Unity Catalog for governance: Whatever lands in the lakehouse from streaming sources still needs the same governance as batch data. Don't carve out a streaming exception.
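To make the Kafka-to-Delta pattern concrete, here is a rough PySpark sketch of a Structured Streaming job that reads a topic, parses it, and appends to a governed table. Several things are assumptions for illustration: the function names, the three-field schema, and JSON-encoded message values. Avro payloads registered in Schema Registry need a different deserialization step, and a real Confluent Cloud connection also needs authentication options that are omitted here.

```python
def kafka_source_options(bootstrap_servers: str, topic: str) -> dict:
    """Options for Spark's built-in Kafka source (authentication omitted)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        # Only applies to the first run; later runs resume from the checkpoint.
        "startingOffsets": "earliest",
    }


def start_orders_stream(spark, bootstrap_servers, topic, target_table, checkpoint_path):
    """Start a streaming query from a Kafka topic into a Delta table.

    `target_table` is expected to be a three-level Unity Catalog name such as
    "main.sales.orders" (a hypothetical example).
    """
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import (
        DoubleType, StringType, StructField, StructType, TimestampType,
    )

    # Assumed payload shape; replace with the real contract for the topic.
    schema = StructType([
        StructField("order_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("updated_at", TimestampType()),
    ])

    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topic))
        .load()
    )

    # Kafka delivers the value as bytes: cast to string, then parse as JSON.
    parsed = (
        raw.select(from_json(col("value").cast("string"), schema).alias("r"))
        .select("r.*")
        .where(col("order_id").isNotNull())  # drop rows that failed to parse
    )

    # The checkpoint location is what lets the query resume after a restart.
    return (
        parsed.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .outputMode("append")
        .toTable(target_table)
    )
```

Writing with `toTable` to a catalog-qualified name, rather than to a bare storage path, is what keeps the streaming output under the same governance as batch tables.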
Trade-offs to plan for
Streaming pipelines are harder to test, harder to debug, and harder to back-fill than batch. The benefit is fresh data. Make sure the freshness is worth the operational cost — and design for graceful degradation when streaming sources hiccup. They will.
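One way to make graceful degradation concrete is to tie it to the freshness target agreed at the start. The sketch below is an assumption about how that could look, not a prescribed design: `choose_read_path`, the `"batch-fallback"` label, and the five-minute budget are all illustrative.

```python
from datetime import datetime, timedelta, timezone


def choose_read_path(last_event_at: datetime, now: datetime,
                     freshness_budget: timedelta) -> str:
    """Pick a read path based on how stale the streaming table is.

    Returns "streaming" while the newest ingested event is within the budget,
    otherwise "batch-fallback" so consumers get older but complete data.
    """
    lag = now - last_event_at
    if lag <= freshness_budget:
        return "streaming"
    return "batch-fallback"


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    budget = timedelta(minutes=5)
    print(choose_read_path(now - timedelta(seconds=30), now, budget))
    print(choose_read_path(now - timedelta(minutes=20), now, budget))
```

A check like this cannot tell a quiet source from a broken one, so in practice it would sit alongside a heartbeat or consumer-lag signal.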