Friday, August 14, 2015

Spark Streaming WordCount Example


Spark Streaming : Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.

For more visit :  Spark Streaming



We will write a basic word count program that reads text from a socket on port 9999, fed by nc (the netcat utility).

We are going to run Spark in local mode by starting the spark-shell.

Let's get going.

Start the spark-shell

./bin/spark-shell

Import the StreamingContext

import org.apache.spark.streaming._

Create a new StreamingContext using the SparkContext (sc) that spark-shell provides by default

val ssc = new StreamingContext(sc,Seconds(5))

// The time window is set to 5 seconds

Create a DStream that will connect to localhost on port 9999

val lines = ssc.socketTextStream("localhost", 9999)

flatMap is a combination of map and flatten: each line is split on whitespace into an array of words, and the arrays are then flattened into a single stream of words.

val words = lines.flatMap(_.split(" "))
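To see the map-then-flatten behaviour outside Spark, here is a small plain-Scala sketch on an ordinary collection (the sample lines are made up for illustration):

```scala
// flatMap = map followed by flatten: each line maps to an array of words,
// and those arrays are flattened into one collection of words.
val sampleLines = Seq("to be or", "not to be")

val mapped = sampleLines.map(_.split(" "))     // two arrays of words, one per line
val words  = sampleLines.flatMap(_.split(" ")) // a single flat sequence of words

println(words) // List(to, be, or, not, to, be)
```

The DStream version works the same way, just applied to each batch of lines arriving on the socket.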

Now let us pair each word with a count of 1

val pairs = words.map(word => (word,1))

Use reduceByKey to sum up the final count of each word. Also check this to see when to use reduceByKey vs groupByKey.

val wordCounts = pairs.reduceByKey(_+_)
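To make the aggregation concrete, here is a plain-Scala analogue of what reduceByKey computes on a small collection of (word, 1) pairs — this is only an illustration of the result, not the Spark API itself, which additionally combines values on each partition before shuffling:

```scala
// Plain-Scala analogue of summing counts per key (no Spark needed).
val pairs = Seq(("spark", 1), ("streaming", 1), ("spark", 1))

val counts = pairs
  .groupBy(_._1)                                 // group pairs by word
  .map { case (word, ps) => (word, ps.map(_._2).sum) } // sum the 1s per word

println(counts) // Map(spark -> 2, streaming -> 1)
```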

Next we will print each word and its count.

wordCounts.print()


Notice that this will not print anything until we start the StreamingContext.

On a different terminal, start nc listening on port 9999 (using the netcat utility here)

nc -lk 9999



Type a few lines into the terminal. In this example I have used the contents from this link.




Now let's start the StreamingContext.

ssc.start()

ssc.awaitTermination()

// Note: once the StreamingContext is started, the shell blocks on awaitTermination and accepts no further input; press Ctrl+C to stop it forcefully (in cluster mode this is not a problem)

Output:

Next : Coming Soon : Spark Streaming with Kafka.

Stay Tuned !!! ;)
