Stream Processing with Apache Spark: An Introduction with DStreams and Structured Streaming

Understand How Streaming Works in Apache Spark with Examples

Necati Demir
5 min read · Jul 17, 2023

Real-time data processing has shifted from being a luxury to a necessity for many organizations. Businesses require the capability to process and analyze incoming data as soon as possible for a range of applications, including real-time analytics and fraud detection. In this article, I will walk through handling real-time data with Spark Streaming, focusing specifically on DStreams and Structured Streaming.

The Power of Apache Spark

With its in-memory processing capabilities and a comprehensive array of functional programming APIs, Spark is the tool of choice when it comes to processing colossal data sets.

I have been writing about Apache Spark recently and recommend checking that article out.

Diving into Streaming with Apache Spark

When you use streaming with Apache Spark, data can be ingested from various sources like Kafka, Flume, HDFS, and even plain TCP sockets, and processed using complex algorithms expressed with high-level functions such as map, reduce, join, and window. The processed data can then be pushed to file systems, databases, and live dashboards. In the next section, we will give an example of reading from a TCP socket.
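As a quick taste of one of those other sources, here is a minimal sketch of reading a Kafka topic as a streaming DataFrame with the Structured Streaming API introduced later in this article. The broker address and the topic name "events" are placeholders, and the spark-sql-kafka connector package must be available to Spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaIngestSketch").getOrCreate()

# Read a Kafka topic as an unbounded streaming DataFrame (broker and topic are placeholders)
events = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load()

# Kafka delivers binary key/value columns; cast the value to a readable string
messages = events.selectExpr("CAST(value AS STRING) AS message")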

Real-Time Capabilities of Apache Spark

Before diving into examples of DStreams and Structured Streaming, let’s take a look at the features Spark Streaming offers:

  1. Fault-tolerant Streaming: Apache Spark provides a fault-tolerant stream processing system through Spark Streaming. If a node fails during processing, Spark Streaming can recover the lost data and state.
  2. Window Operations: It lets developers apply transformations over a sliding window of data, calculating metrics over the last ’n’ minutes/hours/seconds and updating them at regular intervals (see the sketch after this list).
  3. Integration with MLlib and GraphX: Apache Spark supports seamless integration with MLlib (Machine Learning library) and GraphX (graph computation library). This gives you the power to implement real-time machine learning applications or apply graph processing techniques.
  4. Persistence: It supports data persistence in memory or on disk for faster access, which is especially useful when data is accessed repeatedly.
  5. DStreams and Structured Streaming: It offers two powerful abstractions for real-time processing: DStreams and Structured Streaming.
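To make the window operations in item 2 concrete, here is a minimal sketch of a windowed word count over a TCP socket. The host, port, and checkpoint directory are assumptions for illustration; the checkpoint is required because the inverse-reduce form of reduceByKeyAndWindow is stateful:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "WindowedWordCount")
ssc = StreamingContext(sc, 1)
ssc.checkpoint("/tmp/spark-checkpoint")  # required for stateful window operations

pairs = ssc.socketTextStream("localhost", 9999) \
    .flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1))

# Count words over the last 30 seconds, recomputed every 10 seconds
windowed_counts = pairs.reduceByKeyAndWindow(
    lambda x, y: x + y,  # add counts entering the window
    lambda x, y: x - y,  # subtract counts leaving the window
    30,                  # window length in seconds
    10                   # slide interval in seconds
)
windowed_counts.pprint()

ssc.start()
ssc.awaitTermination()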

DStreams: Discretized Streams

DStream, short for Discretized Stream, is the fundamental data type of Spark Streaming. A DStream represents a continuous stream of data, either received from external systems or generated from data in memory. DStreams are built on RDDs, Spark’s core data abstraction, allowing you to manipulate data streams as collections of data points.

Here’s a basic example of using DStreams in PySpark for a word count application:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and batch interval of 1 second
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
# Create a DStream connected to hostname:port
lines = ssc.socketTextStream("localhost", 9999)
# Split each line into words
words = lines.flatMap(lambda line: line.split(" "))
# Count each word in each batch
word_pairs = words.map(lambda word: (word, 1))
word_counts = word_pairs.reduceByKey(lambda x, y: x + y)
# Print the result to console
word_counts.pprint()
ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate
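To try this out locally, you can start a simple text server with netcat (nc -lk 9999) in one terminal and run the script above with spark-submit in another; anything you type into the netcat session will be split into words and counted in one-second batches.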

Structured Streaming: Unifying Batch and Streaming

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can write your streaming code the same way you would write batch code, and the Spark SQL engine takes care of running it incrementally and continuously, updating the final result as streaming data continues to arrive.

Here’s a simple example of using Structured Streaming in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()

# Create DataFrame representing data in the stream
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
    explode(
        split(lines.value, " ")
    ).alias("word")
)

# Generate word count
wordCounts = words.groupBy("word").count()

# Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()
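Note that outputMode("complete") rewrites the entire aggregated result table on every trigger; Structured Streaming also supports append and update output modes for queries that do not need the full table each time. As with the DStream example, you can feed this query by typing lines into a netcat session started with nc -lk 9999.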

Making the Choice: DStreams vs Structured Streaming

If you’re already familiar with RDD-based APIs and need to maintain legacy code, DStreams can be a good choice. They provide a great deal of control over your streaming computations.

However, if you’re starting a new project, Structured Streaming is the de-facto choice due to its simplicity, expressiveness, powerful fault-tolerance guarantees, and unified APIs for batch and streaming data.

In addition, Apache Spark’s official documentation states that DStreams is the previous generation of Spark’s streaming engine, that it is now a legacy project, and that Spark Streaming (DStreams) will receive no further updates.

Summary & Key Takeaways

We discussed real-time data processing using Apache Spark’s two streaming abstractions: DStreams and Structured Streaming. We presented an introductory guide to both and provided simple examples in PySpark. Here are the key takeaways:

  1. The Power of Apache Spark: Apache Spark is a popular choice for handling large datasets due to its in-memory processing capabilities and robust functional programming APIs.
  2. Spark Streaming: An extension of the core Spark API, it processes live data streams by dividing the incoming data into mini-batches and performing RDD transformations on them. It can ingest data from various sources and supports complex algorithms.
  3. Real-Time Capabilities with Apache Spark: Spark Streaming supports fault-tolerant processing, window operations, seamless integration with MLlib and GraphX, and data persistence.
  4. DStreams vs Structured Streaming: Spark offers two powerful abstractions for real-time processing: DStreams and Structured Streaming. While the former represents a continuous stream of data built on RDDs, the latter is a stream processing engine built on the Spark SQL engine that offers a unified API for batch and streaming data.
  5. Choosing Between DStreams and Structured Streaming: If you are familiar with RDD-based APIs and/or need to maintain legacy code, DStreams is a good choice. If you are starting a new project, Structured Streaming should be your choice. In addition, DStreams is a legacy project and will no longer receive updates.
