Spark Streaming : Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
For more, visit: Spark Streaming
We will write a basic Word Count Program which will be running on port 9999 using nc (netcat utility)
We are going to run spark in local mode by starting spark-shell
Let's get going.
Start the spark-shell
./bin/spark-shell
Import the StreamingContext
import org.apache.spark.streaming._
Create a new StreamingContext using the default available SparkContext (sc)
val ssc = new StreamingContext(sc,Seconds(5))
// The batch interval is set to 5 seconds
Create a DStream that will connect to localhost on port 9999
val lines = ssc.socketTextStream("localhost", 9999)
flatMap is a map followed by a flatten: each input element is mapped to zero or more output elements, and the results are flattened into a single stream. Here each line is split into words on whitespace.
val words = lines.flatMap(_.split(" "))
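flatMap has the same semantics on ordinary Scala collections, which makes it easy to try outside Spark. A quick sketch (plain Scala lists here, not a DStream; the names are illustrative):

```scala
// map keeps one output element per input line (a list of word-lists),
// while flatMap flattens the split results into a single list of words.
val demoLines = List("hello world", "hello spark")

val mapped = demoLines.map(_.split(" ").toList)
// List(List(hello, world), List(hello, spark))

val flattened = demoLines.flatMap(_.split(" "))
// List(hello, world, hello, spark)
```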
Now let us create pairs of each word with 1
val pairs = words.map(word => (word,1))
Use reduceByKey to compute the final count of each word. (reduceByKey combines values per key on each partition before shuffling, which is why it is usually preferred over groupByKey for aggregations like this.)
val wordCounts = pairs.reduceByKey(_+_)
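reduceByKey merges all values of each key with the supplied function; `_+_` sums the 1s per word. The same per-key reduction can be sketched with plain Scala collections (an illustration of the semantics only, not the RDD API; names are made up for the example):

```scala
// Per-key reduction, as reduceByKey(_ + _) does for each batch:
// group the (word, 1) pairs by word, then sum each group's values.
val demoPairs = List(("hello", 1), ("spark", 1), ("hello", 1))

val demoCounts = demoPairs
  .groupBy(_._1)
  .map { case (word, ps) => (word, ps.map(_._2).reduce(_ + _)) }
// Map(hello -> 2, spark -> 1)
```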
Next we will print each word and its count.
wordCounts.print()
Notice that this would not print anything until we start the StreamingContext.
In a different terminal, start nc listening on port 9999 (using the netcat utility here)
nc -lk 9999
Type a few lines into that terminal. In this example I have used the contents from this link.
Now let's start the ssc Context.
ssc.start()
ssc.awaitTermination()
// Note that as soon as we start the StreamingContext, the shell blocks on awaitTermination and we can no longer type commands; we can forcefully stop it by pressing Ctrl+C. (In cluster mode this is not a problem.)
Output:
Next : Coming Soon : Spark Streaming with Kafka.
Stay Tuned !!! ;)



