Leveraging Apache Storm and Spark to handle streaming data
Here we explain the main differences between Apache Storm and Apache Spark Streaming. Briefly, Storm was designed from the start to process streams, while Spark Streaming is an extension that adds stream processing to the batch-oriented Apache Spark.
Storm was originally developed at BackType and open-sourced by Twitter after Twitter acquired that company. Twitter used it to model the constant stream of Tweets so that they could query it and run analytics against it. Storm adds real-time data processing to Hadoop, although it can also write its output to other data stores or simply keep the Storm data model in memory, since the whole point is to produce a moving picture of what is happening now.
Storm produces a graph. This is not a graph like a graph of sales over time from a spreadsheet. When we say graph here we mean a topology, which is a concept from the field of mathematics. An example topology is shown in the graphic below.
The graph provides the answer to some question, but as a living model. Because data is continually streaming into the model, the graph keeps changing all the time. If you were to map this to a dashboard you could visualize that.
In the graph, the circles are nodes, also called vertices. The lines between the circles are links, also called edges. Together they represent the answer to some question posed against the data source, expressed as a topological model.
What does this mean?
Here are some examples to clarify that:
- A graph can model the relationship between people on Facebook. If you loaded all of the billion people on Facebook into a graph, you could navigate the edges to see what friends each person has and which people have friends in common. A graph is an efficient way to process such vast amounts of relationship data, because answering those questions means following edges rather than scanning every record. It would be much faster than other data structures designed for speed, like a Java TreeSet.
- Consider another example, this one described by Wikipedia: the relationship between formulas in a spreadsheet. The graph shows the order in which computations need to be executed. That model tells Excel when to recalculate one cell or a whole range of cells and worksheets.
- Another example is Twitter. You could construct a graph to show all the Tweets related to some topic that is trending. Twitter exposes Tweets through an API that returns JSON objects. You send it a URL formatted in a way that indicates which Tweets you want.
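To make the first two examples concrete, here is a minimal Python sketch. It is only an illustration: the names in `friends`, the cells in `cell_inputs`, and both helper functions are hypothetical, and a real social network or spreadsheet engine would be far more involved. It stores each graph as a dictionary of vertices and edges, finds friends in common with a set intersection, and uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later) to work out a valid recalculation order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical friendship graph: each person (vertex) maps to a set of friends (edges).
friends = {
    "ana": {"ben", "cho", "dev"},
    "ben": {"ana", "cho"},
    "cho": {"ana", "ben", "dev"},
    "dev": {"ana", "cho"},
}


def mutual_friends(graph, a, b):
    """Return the people who are friends with both a and b."""
    return graph[a] & graph[b]


# Hypothetical spreadsheet: each cell maps to the cells its formula reads from.
cell_inputs = {
    "C1": {"A1", "B1"},  # e.g. C1 = A1 + B1
    "D1": {"C1"},        # e.g. D1 = C1 * 2
}


def recalculation_order(dependencies):
    """Return the cells ordered so every input comes before the cell that uses it."""
    return list(TopologicalSorter(dependencies).static_order())


print(mutual_friends(friends, "ben", "dev"))
print(recalculation_order(cell_inputs))
```

Finding friends in common only touches the edges of the two people involved, which is why the graph form scales well for this kind of question.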
Here is how Storm took those abstract concepts from graph theory to build their own topology, shown in the picture below:
graphic source Apache
In this topology there are two types of nodes:
Spout–is the source of a stream, which arrives as data structures known as tuples, e.g. (field 1, field 2, …, field n). The data could come from multiple sources, such as machine sensors, social media, IoT devices, or firewalls.
Bolt–takes its input from a Spout or from another Bolt and runs filters, aggregates data, or joins data sets.
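The Spout and Bolt roles can be sketched with plain Python generators. This is not the Storm API (Storm topologies are normally defined in Java and run on a cluster). It is only an assumed, single-process illustration of how tuples flow from a source through a chain of processing steps. All function names and the sample sentences are hypothetical.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emit one tuple per incoming record."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: turn each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def filter_bolt(stream, stop_words):
    """Bolt: drop tuples whose word is in stop_words."""
    for (word,) in stream:
        if word not in stop_words:
            yield (word,)


def count_bolt(stream):
    """Bolt: aggregate, emitting (word, running_count) after every tuple."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


# Wire the topology: spout -> split -> filter -> count.
source = sentence_spout(["the truck is late", "the truck stopped"])
topology = count_bolt(filter_bolt(split_bolt(source), stop_words={"the", "is"}))

for result in topology:
    print(result)
```

Because each step is a generator, a result is emitted as soon as a tuple arrives rather than after the whole input has been read, which is the essential difference from a batch job.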
Now we provide an example to illustrate how this is put to use.
Hortonworks Trucking Example
Here is an example from Hortonworks, a Hadoop platform vendor.
This application provides a real-time graph of fleet operations for a trucking company. The goal is to analyze data streaming from multiple sources and use that to change driver behavior (such as slowing down or taking another route) or to alert management (the driver is going too fast, or maybe should stop because of road conditions).
The inputs are what you would normally think of when trying to make these kinds of decisions, except that the computer processes all of them, instead of humans, and does so at lightning speed:
- Driver information–sent from IoT network-connected sensors on the vehicle as this tuple: (driver, truck, longitude, latitude, time).
- Weather–this comes from a weather service API. There are several.
- Traffic data–comes from MapQuest. It could also come from the crowdsourced application Waze.
- NYC trucking routes–from the NY DOT. This is in KML format, i.e., Google Earth format.
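As a rough illustration of the kind of rule a Bolt could apply to the driver tuple, the sketch below estimates a truck's speed from two consecutive position reports and yields an alert when it exceeds a limit. This is not taken from the Hortonworks application. The `DriverEvent` type, the 105 km/h limit, the use of epoch seconds for `time`, and the haversine distance estimate are all assumptions made for the example.

```python
import math
from typing import NamedTuple


class DriverEvent(NamedTuple):
    """The tuple (driver, truck, longitude, latitude, time) from the sensors."""
    driver: str
    truck: str
    longitude: float
    latitude: float
    time: float  # assumed: seconds since the epoch


EARTH_RADIUS_KM = 6371.0
SPEED_LIMIT_KMH = 105.0  # assumed threshold, purely illustrative


def distance_km(a, b):
    """Great-circle (haversine) distance between two events, in kilometres."""
    lat1, lat2 = math.radians(a.latitude), math.radians(b.latitude)
    dlat = lat2 - lat1
    dlon = math.radians(b.longitude - a.longitude)
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def speed_kmh(previous, current):
    """Average speed between two reports; 0.0 if no time has passed."""
    hours = (current.time - previous.time) / 3600.0
    if hours <= 0:
        return 0.0
    return distance_km(previous, current) / hours


def speeding_alerts(events, limit_kmh=SPEED_LIMIT_KMH):
    """Yield (driver, truck, speed) when a truck's estimated speed exceeds the limit."""
    last_seen = {}  # most recent event per truck
    for event in events:
        previous = last_seen.get(event.truck)
        if previous is not None:
            speed = speed_kmh(previous, event)
            if speed > limit_kmh:
                yield (event.driver, event.truck, round(speed, 1))
        last_seen[event.truck] = event
```

In a fuller version the limit would not be a constant. It would be joined in from the weather, traffic, and route feeds listed above.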
Apache Spark Streaming
Now, given the design of Storm, what is Apache Spark Streaming and how does that differ from Storm?
graphic source Apache Spark
Spark Streaming output is not a graph, as with Storm, but a DStream (discretized stream), which is a data structure that can be handed off to regular Apache Spark for processing. Internally, a DStream is represented as a continuous series of Apache Spark RDDs (Resilient Distributed Datasets). An RDD is similar to the output of an Apache Hadoop MapReduce program. Spark can also produce a graph, using the graph analytics APIs built into the product.
To understand Spark, you have to understand MapReduce. This is because Apache Spark provides an enormous boost in performance over the batch Apache Hadoop MapReduce function. Because of that improved performance, it is threatening to push MapReduce aside for market position in Big Data processing. (Although both are free, there is much cloud business built around them.)
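The idea of chopping a stream into a series of small batches can be sketched in a few lines of plain Python. This is not Spark code. The `micro_batches` function and the sample records are hypothetical, and the sketch assumes the records arrive in time order. Each list it yields plays the role that one small batch plays in Spark Streaming.

```python
from itertools import groupby


def micro_batches(records, batch_seconds):
    """Group time-ordered (timestamp, value) records into consecutive batches.

    Yields (batch_index, values). Intervals with no records are skipped here,
    a simplification for this sketch.
    """
    def batch_index(record):
        timestamp, _ = record
        return int(timestamp // batch_seconds)

    for index, group in groupby(records, key=batch_index):
        yield index, [value for _, value in group]


# Hypothetical stream of (seconds_since_start, event) records.
stream = [(0.2, "login"), (0.9, "click"), (1.1, "click"), (3.5, "logout")]

for index, batch in micro_batches(stream, batch_seconds=1):
    # Each batch can now be processed with ordinary batch logic.
    print(index, len(batch), batch)
```

Once the stream is split this way, the same batch-style operations can be applied to every small batch in turn, which is how a batch engine can be reused for streaming.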
Hadoop MapReduce, as the name suggests, does two things.
- Map–read the data and emit it as <key, value> pairs. Save that to disk.
- Reduce–combine the values for each key, such as to remove duplicates or count objects. Save that to disk.
The most commonly used example to explain MapReduce is the word count problem:
Consider, for example, any book, say, “The Bible.” Suppose this book is so large that it is stored across a Hadoop Distributed File System (HDFS). To figure out how many times the word “Jonah” occurs in The Bible, you send MapReduce jobs out to scan each chunk of The Bible stored on the nodes (virtual machines with attached storage) of the HDFS.
The first step, Map, is to produce the rather redundant-looking <key, value> pairs (redundant because each value in this case is 1, meaning one occurrence of the word itself), for example <Jonah, 1> and <Jonah, 1>.
Then the Reduce step collapses the <key, value> pairs that share a key to produce the output dataset, in this case <Jonah, 2>.
The <key, value> pair <Jonah, 2> shows that the word Jonah occurs 2 times in the Bible. (Obviously it occurs many more times than that, but this is a simple example.)
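The word count can be sketched in plain Python to show the two phases. This is a single-machine illustration, not Hadoop code. The function names are hypothetical and the two text chunks are made-up placeholders standing in for blocks of the book stored on different nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    for raw in chunk.split():
        word = raw.strip('.,;:!?"')
        if word:
            yield (word, 1)


def reduce_phase(pairs):
    """Reduce: sum the values of all pairs that share the same key."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Placeholder chunks; in Hadoop each would live on a different HDFS node.
chunks = ["Jonah went to sea.", "The sea swallowed Jonah."]

# Run the map phase on every chunk, then reduce all emitted pairs together.
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(pairs)

print(counts["Jonah"])
```

In Hadoop the map calls run in parallel on the nodes that hold each chunk, and the pairs are shuffled so that all pairs with the same key reach the same reducer.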
Spark Map Process
Spark Streaming does the same thing as MapReduce, but works on streaming rather than static data.
The simplest example is a TCP socket. If you know how computer logs are monitored for security or operational events, then you will be familiar with syslog, which listens on a TCP or UDP port. Computers log their events by writing to the port (socket). A log file collector program listens on that port to collect all logs into one central place.
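The collector side of that arrangement can be sketched with Python's standard `socket` module. This is an assumed, minimal illustration rather than a real syslog implementation or Spark code: `read_log_lines` and `collect_errors` are hypothetical helpers, the log lines are invented, and `socket.socketpair()` is used as a local stand-in for a network connection so the example runs without opening a port. In Spark Streaming itself, the streaming context's `socketTextStream` plays the role of the reader.

```python
import socket


def read_log_lines(connection):
    """Yield newline-terminated log lines from a connected stream socket
    until the sender closes the connection."""
    buffer = b""
    while True:
        data = connection.recv(4096)
        if not data:  # sender closed the connection
            break
        buffer += data
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            yield line.decode("utf-8", errors="replace")
    if buffer:  # final line without a trailing newline
        yield buffer.decode("utf-8", errors="replace")


def collect_errors(connection):
    """Keep only the lines that look like errors (a stand-in for real filtering)."""
    return [line for line in read_log_lines(connection) if "ERROR" in line]


# Local stand-in for "a computer writes to the port, a collector listens".
sender, collector = socket.socketpair()
sender.sendall(b"INFO boot ok\nERROR disk full\nINFO user login\n")
sender.close()

errors = collect_errors(collector)
collector.close()
print(errors)
```

The reader handles lines that are split across several `recv` calls by buffering until it sees a newline, which matters because a stream socket does not preserve message boundaries.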
Pinterest uses Apache Spark Streaming to measure user engagement.
Pinterest streams user data to their analytics platform using Apache Kafka, an open source tool developed by LinkedIn. As data flows into the model it uses the geolocation information included in photos and presents that as pins on a map. So the output in that case is a dashboard.
So there you have a brief overview of how Apache Spark Streaming and Apache Storm process streaming data. Given that they do basically the same thing, which one should you use? That probably depends on what you have now and the skillsets of your programmers and data scientists.