Stream processing is similar to any other data processing: you read the data, apply some transformation, and then push it somewhere.
However, there are three key concepts that are unique to stream processing:
- Time is the most important concept in stream processing, so it is important to have a common understanding of what "time" means in this context.
- Stream-processing systems typically refer to the following notions of time:
- Event Time: This is the time the events we are tracking occurred and the record was created
- Example: Time when an order was placed, time when user clicked on a link
- Event time is usually the time that matters most when processing stream data.
- By default, Kafka producers automatically attach the current time to records at the time they are created.
- If this notion of event time does not match your requirements, for example if you want event time to be the time the event was inserted into the database, then you should add the event time as an explicit field in the record itself.
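As an illustration of the last point, here is a minimal Python sketch (field names are hypothetical, not a Kafka API) of building a record value that carries its own event-time field, independent of the timestamp the producer attaches:

```python
import json


def build_order_record(order_id, amount, db_insert_ts):
    """Build a record value that carries its own event time.

    db_insert_ts is the time the order was written to the database,
    which in this scenario is the event time we actually care about.
    (Illustrative field names; adapt to your own schema.)
    """
    return json.dumps({
        "order_id": order_id,
        "amount": amount,
        # Explicit event-time field, independent of the timestamp
        # Kafka's producer automatically attaches to the record.
        "event_time": db_insert_ts,
    })


record = build_order_record("order-42", 125_000, 1_700_000_000)
```

A downstream consumer would then read `event_time` from the record value instead of relying on the record's Kafka timestamp.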
- Log Append Time: This is the time the event arrived to the Kafka broker and was stored there.
- Kafka brokers will add this time to records if they are configured to do so, or if the records arrive from older producers (versions before 0.10.0) and therefore contain no timestamps.
- It is typically less relevant for stream processing, since we are usually interested in the times the events occurred.
- Processing time: This is the time at which a stream-processing application received the event in order to perform some calculation.
- This time can be milliseconds, hours, or days after the event occurred.
- This can be different for the same event depending on exactly when each stream processing application happened to read the event.
- It can even differ for two threads in the same application
- Therefore, this notion of time is highly unreliable and best avoided.
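To make the distinction between these notions concrete, here is a small Python sketch (hypothetical record fields, not a Kafka API): event time and log-append time are fixed properties of the record, while processing time depends entirely on when a particular consumer happens to run.

```python
import time

# Hypothetical record carrying the two timestamps that are fixed
# once the record is stored: when the event occurred, and when the
# broker appended it to the log.
record = {
    "event_time": 1_700_000_000.0,       # when the order was placed
    "log_append_time": 1_700_000_005.0,  # when the broker stored it
}


def processing_lag(rec):
    """Lag between event time and the moment *this* consumer processes
    the record. It depends on when the consumer runs, which is why
    processing time is unreliable for analytics."""
    processing_time = time.time()  # different for every reader, every run
    return processing_time - rec["event_time"]
```

Two applications (or two threads) calling `processing_lag` on the same record will get different answers, while `event_time` and `log_append_time` stay the same.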
- Stream processing has two kinds of use cases:
- Use cases that look at each event individually, for example: check whether an incoming order is greater than $100,000 and trigger a workflow step.
- Use cases that require information about previous events, for example: finding the total number of orders greater than $100,000 and updating a dashboard. These use cases cannot be handled by looking at each event in isolation; the application needs to know how many events of this type occurred in the past. The information that is stored between events is called state.
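The two kinds of use cases above can be sketched in a few lines of Python (illustrative only, using the $100,000 order example from the text):

```python
def is_large_order(order):
    """Stateless check: each event can be evaluated entirely on its own."""
    return order["amount"] > 100_000


def count_large_orders(orders):
    """Stateful aggregation: the running count is state carried
    between events; no single event is enough to answer the question."""
    count = 0  # this counter is the state
    for order in orders:
        if is_large_order(order):
            count += 1
    return count


orders = [{"amount": 50_000}, {"amount": 150_000}, {"amount": 200_000}]
```

Here `is_large_order` needs no memory of earlier events, while `count_large_orders` must keep `count` alive across the whole stream.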
- Stream processing distinguishes two types of state:
- Internal State/Local State:
- State that is accessible only by a specific instance of the stream-processing application.
- This state is usually maintained and managed with an embedded, in-memory database running within the application.
- This is fast, because the state is maintained in an in-memory database.
- External State:
- This is state maintained in an external datastore.
- An external datastore provides the advantage of virtually unlimited size, and it can be accessed from multiple instances of the application or even from different applications.
- Disadvantages are increased latency and complexity introduced with an additional system.
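As a toy illustration of local state (not a Kafka Streams state store, just an in-memory dict behind a small class), one application instance might keep its running count like this:

```python
class LocalStateStore:
    """Minimal sketch of an embedded, in-memory key-value store, as a
    single instance of a stream-processing application might keep.
    Fast to access, but visible only to this instance."""

    def __init__(self):
        self._store = {}  # state lives in this process's memory

    def get(self, key, default=0):
        return self._store.get(key, default)

    def put(self, key, value):
        self._store[key] = value


# Track a running count of large orders as local state.
store = LocalStateStore()
for amount in (150_000, 50_000, 200_000):
    if amount > 100_000:
        store.put("large_orders", store.get("large_orders") + 1)
```

Swapping `LocalStateStore` for an external datastore would make the count shareable across instances, at the cost of the extra latency and operational complexity mentioned above.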
- Windowing allows you to scope your stream-processing pipelines to a specific time window or range.
- It lets you control how to group records that have the same key for stateful operations such as aggregations or joins into so-called windows.
- Windows are tracked per record key.
- The types of windows supported by Kafka Streams are tumbling windows, hopping windows, sliding windows, and session windows.
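As a language-neutral illustration of the idea (not the Kafka Streams API), here is a minimal Python sketch of a tumbling window: events are grouped per key into fixed-size, non-overlapping time buckets and counted.

```python
def tumbling_window_counts(events, window_size):
    """Group (timestamp, key) events into fixed, non-overlapping
    (tumbling) windows and count events per key per window.
    Windows are tracked per record key, as described above."""
    windows = {}
    for ts, key in events:
        # Align the timestamp down to the start of its window.
        window_start = (ts // window_size) * window_size
        windows[(window_start, key)] = windows.get((window_start, key), 0) + 1
    return windows


# Timestamps 0-9 fall in the first window, 10-19 in the second.
events = [(0, "a"), (3, "a"), (5, "b"), (12, "a")]
counts = tumbling_window_counts(events, window_size=10)
```

A hopping window would differ only in that its windows overlap (each event lands in several buckets), and a session window would be bounded by gaps of inactivity rather than fixed sizes.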
Thanks. Share your comments / feedback.