Leveraging Apache Storm and Spark to handle streaming data

Here we explain the main differences between Apache Storm and Apache Spark Streaming. Briefly, Storm was designed from the ground up to process streams, while Apache Spark Streaming is an extension that adds stream processing to the batch-oriented Apache Spark engine.

Apache Storm

Storm was created at BackType and open-sourced by Twitter, which acquired the company, as a way to model the constant stream of Tweets so that it could be queried and analyzed. What Storm provides is real-time data processing to complement Hadoop's batch processing, although it can also write its output to other data stores or just keep the Storm data model in memory, since the whole point is to maintain a moving picture of what is happening right now.

Storm produces a graph.  This is not a graph in the spreadsheet sense, like a chart of sales over time.  When we say graph here we mean a graph in the graph-theory sense: a topology of nodes and edges.  An example topology is shown in the graphic below.

The graph provides the answer to some question, but as a living model.  Because data is continually streaming into the model, the graph keeps changing all the time.  Map its output to a dashboard and you can visualize those changes as they happen.

[Graphic: an example Apache Storm topology]

In the graph, the circles are nodes, also called vertices.  The lines between the circles are called links, also called edges. This represents the answer to some question posited against the data source, expressed as a topological model.

What does this mean?

Here are some examples to clarify that:

 

  • A graph can model the relationships between people on Facebook. If you dumped all of Facebook's billion-plus users into a graph, you could navigate the edges to see what friends each person has and which people have friends in common. A graph is an efficient way to traverse relationships in data at that scale; it would be much faster than other data structures designed for speed, like a Java TreeSet.
  • Consider another example, this one described on Wikipedia: the relationships between formulas in a spreadsheet. The graph shows the order in which computations need to be executed.  That model tells Excel when to recalculate one cell or a whole range of cells and worksheets.
  • Another example is Twitter. You could construct a graph to show all the Tweets related to some trending topic. Twitter publishes Tweets through an API that returns a JSON object; you send a URL formatted to indicate which Tweets you want.

Here is how Storm took those abstract concepts from graph theory to build its own topology, shown in the picture below:

[Graphic: Storm topology of spouts and bolts (graphic source: Apache)]

 

In this topology there are two types of nodes:

 

Spout–the source of a stream, which arrives as data structures known as tuples, e.g. (field 1, field 2, …, field n). The data could come from many sources: machine sensors, social media, IoT devices, firewalls, and so on.

 

Bolt–consumes the stream from one or more spouts (or other bolts) and filters it, aggregates data, or joins data sets.
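
Storm topologies themselves are written against Storm's Java API, but the division of labor between spouts and bolts can be sketched in plain Python with no Storm dependency. This is only an illustration of the concept, not Storm code; the sensor readings are invented.

```python
import itertools
import random

def sensor_spout():
    """Spout: emits an endless stream of tuples, here fake sensor readings."""
    while True:
        yield ("sensor-%d" % random.randint(1, 3), round(random.uniform(20.0, 90.0), 1))

def threshold_bolt(stream, limit=75.0):
    """Bolt: filters the stream, passing on only readings above a limit."""
    for sensor, reading in stream:
        if reading > limit:
            yield (sensor, reading)

def count_bolt(stream):
    """Bolt: aggregates the filtered stream into a running count per sensor."""
    counts = {}
    for sensor, _ in stream:
        counts[sensor] = counts.get(sensor, 0) + 1
    return counts

# Wire the "topology" together: spout -> filter bolt -> counting bolt,
# taking just 100 tuples so the example terminates.
print(count_bolt(threshold_bolt(itertools.islice(sensor_spout(), 100))))
```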

 

Now we provide an example to illustrate how this is put to use.

 

Hortonworks Trucking Example

Here is an example from Hortonworks, the Hadoop platform vendor.

 

This application provides a real-time graph of fleet operations for a trucking company.  The goal is to analyze data streaming from multiple sources and use it to change driver behavior (for example, slow down or take another route) or to alert management (the driver is going too fast, or perhaps should stop because of road conditions).

The inputs are what you would normally consider when making these kinds of decisions, except that the computer processes all of it instead of humans, and does so at lightning speed:

  • Driver information–sent from IoT network-connected sensors on the vehicle as this tuple: (driver, truck, longitude, latitude, time).
  • Weather–this comes from a weather service API. There are several.
  • Traffic data–comes from MapQuest. It could also come from the crowdsourced application Waze.
  • NYC trucking routes–from the NY DOT, in KML (Keyhole Markup Language), the format used by Google Earth.
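
To make that concrete, here is a minimal plain-Python sketch (not Storm code) of what one bolt in such a topology might do with the driver tuples listed above: compute the truck's speed from two consecutive position reports and raise an alert when it exceeds a limit. The 65 mph threshold and the sample tuples are invented for illustration.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3956 * 2 * asin(sqrt(a))

def speed_alert_bolt(prev, curr, limit_mph=65.0):
    """Given two (driver, truck, longitude, latitude, time) tuples for the same
    truck, emit an alert if the implied average speed exceeds the limit.
    Times are epoch seconds."""
    driver, truck, lon1, lat1, t1 = prev
    _, _, lon2, lat2, t2 = curr
    hours = (t2 - t1) / 3600.0
    mph = haversine_miles(lat1, lon1, lat2, lon2) / hours if hours > 0 else 0.0
    if mph > limit_mph:
        return ("ALERT", driver, truck, round(mph, 1))
    return None

# Two position reports five minutes apart (made-up coordinates near New York City).
prev = ("driver-17", "truck-8", -73.99, 40.71, 1000000)
curr = ("driver-17", "truck-8", -73.85, 40.75, 1000300)
print(speed_alert_bolt(prev, curr))
```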

 

Apache Spark Streaming

Now, given the design of Storm, what is Apache Spark Streaming and how does it differ?

 

[Graphic: Spark Streaming in a typical architecture (graphic source: Apache Spark)]

Apache Spark Streaming delivers streaming ability to Apache Spark.  The graphic above gives some ideas of how you might use it in your architecture.

Spark Streaming output is not a graph, as with Storm, but a DStream (discretized stream), a data structure that can be handed off to regular Apache Spark for processing.  A DStream is essentially a continuous series of RDDs (Resilient Distributed Datasets), the micro-batches that regular Apache Spark works on; an RDD is similar to the output of an Apache Hadoop MapReduce program.  Spark can also produce a graph from the stream, using the GraphX graph analytics API built into the product.
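
Here is a short sketch of that hand-off using the Python API, assuming a local Spark installation. A DStream is treated as a sequence of micro-batch RDDs, each of which can be processed with ordinary Spark operations; queueStream is used only to fabricate a stream from a few in-memory batches of made-up event names.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamHandoff")
ssc = StreamingContext(sc, 1)  # one micro-batch per second

# Fabricate a DStream from three small in-memory batches.
batches = [sc.parallelize(["login", "click", "click"]),
           sc.parallelize(["click", "purchase"]),
           sc.parallelize(["login"])]
events = ssc.queueStream(batches)

# Each micro-batch arrives as an ordinary RDD, so regular Spark code applies.
def handle_batch(time, rdd):
    counts = rdd.map(lambda e: (e, 1)).reduceByKey(lambda a, b: a + b).collect()
    print(time, counts)

events.foreachRDD(handle_batch)

ssc.start()
ssc.awaitTerminationOrTimeout(5)  # run briefly, then shut down
ssc.stop(stopSparkContext=True)
```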

To understand Spark, you have to understand MapReduce.  Apache Spark performs the same kind of batch processing as Apache Hadoop MapReduce, but dramatically faster, largely by keeping data in memory.  Because of that performance, it is threatening to push MapReduce aside in the Big Data market (although both are free, a great deal of cloud business is built around them).

Hadoop MapReduce, like the name suggests, does two things.

  1. Map–read the data and turn each record into <key, value> pairs. Save the intermediate results to disk.
  2. Reduce–collapse the pairs that share a key, for example to remove duplicates or count objects. Save the result to disk.

The most commonly used example to explain MapReduce is the word count problem:

Consider, for example, any large book, say, “The Bible.”  Suppose this book is so large that it is stored across a Hadoop Distributed File System (HDFS).  To figure out how many times the word “Jonah” occurs in The Bible, you send MapReduce jobs out to scan each chunk of The Bible stored on the nodes (virtual machines with attached storage) of the HDFS.

The Map step first produces the rather redundant-looking <key, value> pairs (redundant because each value at this point is 1, a single occurrence of the word):

<Jonah, 1>

<Mary, 1>

<John,1>

<Jonah,1>

 

Then the Reduce step collapses the <key, value> pairs by key to produce this output dataset:

<Jonah, 2>

<Mary, 1>

<John,1>

The <key, value> pair <Jonah, 2> shows that the word Jonah occurs 2 times in the Bible (obviously it occurs many more times than that, but this is a simple example).
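
The same logic can be written out in a few lines of plain Python. This is a sketch of the two phases in one process, not Hadoop code; a real job distributes the map and reduce work across the cluster.

```python
from collections import defaultdict

text_chunk = "Jonah Mary John Jonah"  # stand-in for one HDFS block of the book

# Map phase: emit a <key, value> pair of (word, 1) for every word seen.
mapped = [(word, 1) for word in text_chunk.split()]
# -> [('Jonah', 1), ('Mary', 1), ('John', 1), ('Jonah', 1)]

# Reduce phase: collapse the pairs by key, summing the values.
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))  # {'Jonah': 2, 'Mary': 1, 'John': 1}
```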

Spark Map Process

Spark Streaming does the same thing as MapReduce, but works on streaming rather than static data.

The simplest example is reading from a TCP socket.  If you know how computer logs are monitored for security or operational events, then you will be familiar with syslog, which sends log messages to a TCP or UDP port.  Computers log their events by writing to that port (socket), and a log collector program listens on it to gather all the logs in one central place.
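
Here is a minimal Spark Streaming sketch of that scenario in Python, following the standard network word count pattern. It listens on localhost port 9999 (an arbitrary port you might feed with a tool like netcat rather than a real syslog source) and counts the words arriving in each one-second micro-batch.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "SocketWordCount")
ssc = StreamingContext(sc, 1)  # 1-second micro-batches

# DStream of text lines arriving on the socket, e.g. log messages.
lines = ssc.socketTextStream("localhost", 9999)

# The classic MapReduce-style word count, applied to each micro-batch.
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()  # print each batch's counts to the console

ssc.start()
ssc.awaitTermination()
```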

Finally, to add analytics to this process and produce a graph or run other algorithms, Spark has its own APIs: MLlib (the machine learning library) and GraphX (the graph processing library).

Spark Example

 

Pinterest uses Apache Spark Streaming to measure user engagement.

Pinterest streams user data to its analytics platform using Apache Kafka, an open-source messaging tool originally developed at LinkedIn. As data flows into the model, it uses the geolocation information included in photos and presents it as pins on a map. So the output in that case is a dashboard.

Summary

So there you have a brief overview of how Apache Spark Streaming and Apache Storm process streaming data. Given that they do basically the same thing, which one should you use?  That probably depends on what you have now and the skillsets of your programmers and data scientists.



Data Streaming & Analytics

Processing streaming data lets a business, organization, machine, or other kind of entity or program change behavior based upon what has happened or, in the case of predictive models, what is about to happen. The tools to process streams of data and draw conclusions are broadly called analytics. In this short blog post, we explain what kinds of analytics might be used for streaming data.

 

Analytics Algorithms

The underlying mathematics and statistics on which analytics tools are based are the same whether you have streaming or static data.  What differs are the algorithms used to read the data and organize it before handing it off to the statistical model.  In other words, a time series analysis is the same whether you are looking at a static data file (with time as one of its fields) or a stream of data.

Consider an example. Suppose you want to model customers in a retail store checkout line.  You are looking at the shoppers lined up at the cash register.  You have written a model that determines, given the length of the queue, whether you should send a cashier to open a new register.  If some customers walk out of the store because the line is too long (a laser counter can count and stream that), then you might have to install additional registers and hire more employees to work them.  You would do so if the cost of doing that is less than the revenue lost when customers walk out.

The movement of customers through cash registers is time-sensitive data, which lends itself to time-sensitive analysis.  Apache Storm, which is one tool for processing streaming data, gathers these events up as tuples.  Think of a tuple as a record.  For time-sensitive analysis, the tuple has to include the time as one of its values.  The tuple for the register transaction in this case is: (register number, time transaction processed, time person entered line).
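
Here is a plain-Python sketch of what the model might do with a batch of those tuples: compute each customer's wait from the two timestamps and decide whether the average wait justifies opening another register. The five-minute threshold and the sample data are invented.

```python
# (register number, time transaction processed, time person entered line),
# with times in epoch seconds -- made-up sample data.
transactions = [
    (1, 1000300, 1000000),
    (1, 1000520, 1000150),
    (2, 1000400, 1000350),
]

WAIT_LIMIT_SECONDS = 300  # hypothetical policy: act on a five-minute average wait

waits = [processed - entered for _, processed, entered in transactions]
avg_wait = sum(waits) / len(waits)

if avg_wait > WAIT_LIMIT_SECONDS:
    print("Average wait %.0f s: open another register" % avg_wait)
else:
    print("Average wait %.0f s: current staffing is fine" % avg_wait)
```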

 

Some Applications for Streaming Data Analytics

When you think of streaming data, you think of something coming in quite fast.  Customers lined up at a grocery store register move along slowly.  Having lots of stores increases how much data there is to process, but it is still nothing like the rate of, say, an image processing application.  Machine sensor data, used in a preventive maintenance application, comes in slowly as well.  A compressor might report its temperature and vibration every 10 minutes. So a stream, contrary to the way that word sounds, does not always mean fast, but it can.  Here are some examples of streaming data and the analytics that could be, or are, used to process them:

Cybersecurity threat detection–web servers, firewalls, network routers, etc. produce reams of data very fast.   Analytics appropriate for processing these would include, for example, data classification (explained below).

Stock trading–mathematicians who work on Wall Street are called quants.  Their models read news feeds (streams) to pick up when bad news or good news is circulating and then buy or sell stocks based on that. Again, this software could use classification and different kinds of machine learning analytics (explained below) to make those decisions.

Credit card fraud–works on the principles of classification (explained here).  Classification takes data points and puts them into sets with known characteristics.  The algorithm processes events by calculating the distance of each event from the nearest set. For financial transactions, this measures the likelihood that a transaction is fraudulent given that, say, the customer is high risk and the transaction is coming from somewhere with a lot of fraud, like over the internet from Russia.
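
Here is a toy sketch of that distance-based idea in Python. Each transaction is reduced to two numeric features (a customer risk score and the historical fraud rate of the transaction's origin), and it is assigned to whichever class centroid it sits closest to. Real fraud models use many more features and calibrated probabilities; all of the numbers here are invented.

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Class centroids learned from past labeled transactions (invented values).
# Features: (customer risk score 0-1, fraud rate of the transaction's origin 0-1).
centroids = {
    "legitimate": (0.2, 0.05),
    "fraud":      (0.8, 0.60),
}

def classify(transaction):
    """Assign the transaction to the class with the nearest centroid."""
    return min(centroids, key=lambda label: dist(transaction, centroids[label]))

# A high-risk customer buying from a high-fraud origin.
print(classify((0.7, 0.5)))  # -> 'fraud'
```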

Preventive maintenance (PM)–machines report their temperature, ambient environment (i.e., the humidity and temperature in the vicinity), and their movement (vibration).  In the case of a compressor this would indicate when it is time to replace the filter as the machine is working harder.  This problem can be solved by a regression model, which is a type of predictive analysis.  A regression model finds the correlation between variables.  In the case of a PM application, the model takes changes in temperature and vibration and the mean-time-to-failure (published by the manufacturer) to calculate the likelihood that a machine will fail within a given time.  Given their threshold for risk (downtime), the company then sends out a technician to repair the machine when it is operating outside norms.
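
A regression model of that kind can be sketched in a few lines of Python with NumPy: fit a line to recent vibration readings over time and extrapolate when vibration will cross the manufacturer's limit, which stands in here for the failure point. The readings and the limit are invented for illustration.

```python
import numpy as np

# Hours of operation and the vibration (mm/s) reported at each reading -- invented data.
hours = np.array([0, 100, 200, 300, 400, 500])
vibration = np.array([2.1, 2.4, 2.8, 3.1, 3.6, 3.9])

VIBRATION_LIMIT = 6.0  # hypothetical manufacturer's limit before failure

# Least-squares fit: vibration = slope * hours + intercept.
slope, intercept = np.polyfit(hours, vibration, 1)

# Extrapolate the hour at which the fitted line reaches the limit.
hours_to_limit = (VIBRATION_LIMIT - intercept) / slope
print("Schedule maintenance before hour %.0f of operation" % hours_to_limit)
```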

Adaptive Algorithms and Machine Learning

A key question in finding the right analytic to apply to streaming data is whether the model needs to adapt to changing data.  Models that do are called learning models, a type of machine learning.  Consider the rather counterintuitive example of weather forecasting.  Those models do not necessarily need to change as wind speed, temperature, and barometric pressure change, because all of the possible conditions (rain, snow, dust storm) are already known.  But an email spam engine needs to change as email flows in.  Spam engines typically use Bayesian statistics: as people report spam, the Bayesian model adds to its list of spam keywords. That is called training the model, which is also a type of machine learning.
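
A toy sketch of that updating behavior in Python: every time users report a message, the per-word counts grow, and future messages are scored against the updated counts. This is a heavily simplified Bayesian-style scorer, not a production spam filter, and the messages are made up.

```python
from collections import Counter

spam_words = Counter()  # word counts from messages users reported as spam
ham_words = Counter()   # word counts from messages users kept

def report(message, is_spam):
    """Training step: each user report updates the word counts."""
    (spam_words if is_spam else ham_words).update(message.lower().split())

def spam_score(message):
    """Crude per-word spam probability, averaged over the message."""
    scores = []
    for word in message.lower().split():
        s, h = spam_words[word], ham_words[word]
        if s + h:
            scores.append(s / (s + h))
    return sum(scores) / len(scores) if scores else 0.5

# The model adapts as reports stream in.
report("win a free prize now", is_spam=True)
report("meeting notes attached", is_spam=False)
print(spam_score("claim your free prize"))   # high score: looks like spam
print(spam_score("notes from the meeting"))  # low score: looks legitimate
```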
Neural Network

A neural network is a type of machine learning loosely modeled on the way the brain works: a problem is represented as layers of neurons connected by synapse-like links.  The most common way to explain this idea is to consider facial recognition software.

Consider the steps used to determine whose face we are looking at and how that might be programmed into a computer.

  1. Neuron 1: Sort through the functions for determining skin tone. For example, one function is designed to work in low light. When the optimal solution (best guess) is found, hand off the answer to the next neuron.
  2. Neuron 2: Now, what color is the person's hair? The neuron says it is blonde.
  3. Neuron 3: Where was this picture taken? (It's a geotagged photo.) Match that up against address and name records.
  4. Neurons …: (Do more steps…)
  5. Last neuron: Check the resulting numerical representation of the face (i.e., the output of the neural network). The network says: “That looks like someone I know. It looks like Fred.  It is Fred!”
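
To make the layered hand-off concrete, here is a tiny forward pass through a two-layer network in Python with NumPy. It is nothing like a real face recognition network, which takes in pixel data and is trained on huge labeled datasets; the weights here are random and only show how each layer transforms the numbers it receives and hands the result to the next layer.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    """Squash each neuron's output into the range 0..1."""
    return 1.0 / (1.0 + np.exp(-x))

# Input features, e.g. numbers summarizing skin tone, hair color, location match.
features = np.array([0.8, 0.3, 0.9])

# Randomly initialized weights; a real network would learn these from labeled photos.
w_hidden = rng.normal(size=(3, 4))  # layer 1: 3 inputs -> 4 hidden neurons
w_output = rng.normal(size=(4, 1))  # layer 2: 4 hidden neurons -> 1 output

hidden = sigmoid(features @ w_hidden)  # the first layer hands its outputs...
match = sigmoid(hidden @ w_output)     # ...to the next layer

print("Confidence this is Fred: %.2f" % match[0])  # meaningless with random weights
```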

 

In our future blogs, we will explore how some known technologies handle streaming data.



Introducing ODE

Today, we are announcing the general availability of ODE 0.1. This is the first release of ODE, the Opallios Distribution of ELSA. For people who are not familiar with ELSA, it stands for Enterprise Log Search and Archive. ELSA is an open-source, enterprise-ready log management system. It leverages syslog-ng for processing incoming syslog data and Sphinx full-text indexing for log searching. ELSA stands out from other well-known log management systems in its high performance and scalability. For more details on ELSA refer to its GitHub site, https://github.com/mcholste/elsa.

Why ODE?

The first version of ELSA was released back in 2010 and has since gone through various updates and releases. Over time it has gained a loyal user group with an active community. We at Opallios first came across ELSA on one of our customers’ projects, where we needed a high-performing syslog data indexer. There were strict requirements on scalability and hardware usage: we needed a system that could ingest syslog data at a rate of 10Mb/sec on a single machine and persist data for over 6 months. Of all the open source log management systems we evaluated, ELSA came out ahead, both in terms of performance and scalability. After some rigorous testing and use, ELSA impressed us a lot, but at the same time we saw some opportunities for improvement, and from that originated the idea of ODE. We have noticed from our customer requests that there are some minor changes that, if included in the open-source edition, would benefit both our customers and the community. As a result, to provide a committed roadmap and delivery schedule for these changes, we decided to fork a branch and initiate ODE (Opallios Distribution of ELSA).

Martin C Holste, the creator of ELSA, has put together an impressive log management system, but, as with any open source project, its success depends on its adoption. We thought ELSA had all the necessary ingredients to be one of the leading log management solutions, but it lacked in some key areas:

  1. Installation Process – Though the installation is pretty straightforward, it is prone to failures in some cases. There have also not been many updates to support newer OS versions.
  2. Data Sources – ELSA uses syslog-ng to process input data. syslog-ng allows you to write parsers for any number of data formats, but out of the box only a limited set of log formats is supported.
  3. Analytical Functions – ELSA’s strong data correlation, along with dynamic full-text search, makes for impressive analytics, but there is room to add more analytical functions that would bring it on par with other leading data analytics engines.
  4. Debugging Tools – ELSA works quite reliably, but there can be instances when we need to troubleshoot unforeseen issues. Improving the debugging facilities to reduce TTI (time to investigate) and TTR (time to resolve) would enhance ELSA’s usability.
  5. Documentation – ELSA has just enough documentation to help a technology expert configure it and get started. However, the community would benefit from documentation such as a user guide and use-case examples explaining different configurations.
  6. Regular Updates – As there are a limited number of committers to ELSA, updates sometimes lag. There are features the community wants to see in future ELSA releases that, with Martin’s approval, we can help bring to market faster, either in the main ELSA branch or in ODE.

After discussion with Martin, we agreed to fork ELSA into ODE to address some of these items. ODE will stay in sync with the main ELSA branch for any updates.

ODE 0.1

The goal behind the ODE 0.1 release is to make the ELSA installation reliable and consistent, with the primary focus on “the out of the box experience”.  As part of ODE 0.1 we provide both Debian and RPM packages. ODE 0.1 uses the latest ELSA codebase from GitHub. Distributing ODE via standard packages has allowed us to have better control over the flow of installation and to support the basic install, remove, and update operations. There are some other slight changes to make the software more reliable, like using Starman in place of the Apache HTTP server. The following OSs have been tested and verified with the ODE 0.1 packages:

  1. Debian package – Ubuntu 12.04, Ubuntu 14.04
  2. RPM package – Red Hat 6.6, CentOS 6.5

We will also provide AWS images for the above-mentioned Linux distributions for easier installs and quick ODE evaluation. The ODE 0.1 packages and images are available for download at the ODE GitHub site, https://github.com/opallios/ode.

Roadmap

We are big supporters of ELSA and are committed to assisting Martin in taking it to the next level. Our goal is to increase its adoption by filling the gaps that exist in its implementation today. In the next few months we will continue to work on the items listed earlier in this blog. We hope users will find ELSA much more appealing as we continue to check off items from our to-do list. The roadmap of ODE will also depend very much on our users’ feedback and wish lists.

 Summary

ELSA is a very well thought out and implemented log management system, but it has limited popularity due to an inconsistent installation process, limited support for various log formats, and minimal documentation and support.  Opallios would like to see better adoption of ELSA and is ready to invest its resources to drive it, as ELSA provides a genuinely low total cost of ownership option in the log management space.  The first step in that direction is to improve the ELSA installation process, which we are doing with the release of ODE 0.1. Over the next few months Opallios will continue to update ODE, adding new features and contributing to ELSA’s growth.