Data Streaming & Analytics

Processing streaming data lets a business, a machine, or a program change its behavior based on what has happened or, in the case of predictive models, what is likely to happen. The tools that process streams of data and draw conclusions from them are broadly called analytics. In this short blog post, we explain what kinds of analytics might be used for streaming data.


Analytics Algorithms

The underlying mathematics and statistics are the same whether the data is streaming or static. What differs is how the data is read and ordered before it is handed off to the statistical model. In other words, a time series analysis is the same whether you are looking at a static data file (with time as one of its fields) or a stream of data.

Consider an example. Suppose you want to model customers in a retail store checkout line. You are looking at the shoppers lined up at the cash register. You have written a model that decides, given the length of the queue, whether to send a cashier to open another register. If some customers walk out of the store because the line is too long (a laser counter can count them and stream that number), then you might have to install additional cash registers and hire more employees to staff them. You would do so if the cost of doing that is less than the revenue lost when customers walk out.
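The cost comparison at the end of that example is simple arithmetic, and a minimal sketch in Python might look like the following. The function name, its parameters, and the dollar figures are all made up for illustration; a real model would also have to estimate how many walkouts are caused by the queue.

```python
def should_add_register(walkouts_per_day, avg_basket_value, register_cost_per_day):
    """Return True when revenue lost to walkouts exceeds the daily cost of another register."""
    lost_revenue_per_day = walkouts_per_day * avg_basket_value
    return register_cost_per_day < lost_revenue_per_day

# Hypothetical numbers: 12 walkouts a day, a $40 average basket,
# and $300 a day for the extra register plus a cashier.
print(should_add_register(12, 40.0, 300.0))
```

With these numbers the lost revenue is $480 a day against a cost of $300, so the function returns True.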

The movement of customers through cash registers is time-sensitive data, which lends itself to time-sensitive analysis. Apache Storm, which is one tool for processing streaming data, represents each event as a tuple. Think of the tuple as a record. In the case of time-sensitive analysis, the tuple has to include the time as one of its values. The tuple for the register transaction in this case is: (register number, time transaction processed, time person entered line).
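To make the tuple concrete, here is a rough sketch in plain Python of how a program might consume such records one at a time. It does not use the Apache Storm API. The field names, the three-event window, and the 300-second threshold are assumptions chosen for illustration.

```python
from collections import deque, namedtuple

# Field names are illustrative; the article only gives the tuple's contents.
Checkout = namedtuple("Checkout", ["register", "processed_at", "entered_line_at"])

class WaitMonitor:
    """Track the average wait over the most recent checkouts."""

    def __init__(self, window=3, max_avg_wait=300.0):
        self.waits = deque(maxlen=window)   # sliding window of wait times, in seconds
        self.max_avg_wait = max_avg_wait    # hypothetical threshold, in seconds

    def observe(self, event):
        """Consume one tuple and report whether another register should open."""
        self.waits.append(event.processed_at - event.entered_line_at)
        avg_wait = sum(self.waits) / len(self.waits)
        return avg_wait > self.max_avg_wait

monitor = WaitMonitor()
stream = [
    Checkout(1, 1000.0, 900.0),   # waited 100 s
    Checkout(2, 1400.0, 1000.0),  # waited 400 s
    Checkout(1, 1900.0, 1300.0),  # waited 600 s
]
decisions = [monitor.observe(event) for event in stream]
print(decisions)
```

The running average wait is 100, 250, and then about 367 seconds, so only the third event crosses the threshold and the script prints [False, False, True]. The same class would work unchanged on a static file of records, which is the point made earlier about the statistics not caring where the data comes from.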


Some Applications for Streaming Data Analytics

When you think of streaming data, you think of something coming in quite fast. Customers lined up at a grocery store register move along slowly. Having lots of stores increases how much data there is to process, but the data still does not arrive nearly as fast as it does in, say, an image processing application. Machine sensor data, used in a preventive maintenance application, comes in slowly as well. A compressor might report its temperature and vibration every 10 minutes. So a stream, contrary to the way the word sounds, is not always fast, though it can be. Here are some examples of streaming data and the analytics that could be, or are, used to process it:

Cybersecurity threat detection–web servers, firewalls, network routers, etc. produce reams of data very fast.   Analytics appropriate for processing these would include, for example, data classification (explained below).

Stock trading–mathematicians who work on Wall Street are called quants.  Their models read newspaper feeds (streams) to pick up when bad news or good news is circulating and then buy or sell stocks based on that. Again this software could use classification and different kinds of machine learning analytics (explained below) to make those decisions.

Credit card fraud–detection works on the principles of classification. Classification takes data points and puts them into sets with known characteristics. The algorithm processes each event by calculating its distance from the closest set. For financial transactions, this measures the likelihood that a transaction is fraudulent given that the customer is, say, high risk and the transaction is coming from a place where there is a lot of fraud, like over the internet from Russia.
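One simple way to read "distance from the closest set" is a nearest-centroid classifier, where each set is summarized by a single representative point. The sketch below assumes that reading. The two features and the centroid values are invented for illustration and are not taken from any real fraud model.

```python
import math

# Hypothetical centroids: (transaction amount in dollars, customer risk score from 0 to 1).
CENTROIDS = {
    "legitimate": (50.0, 0.1),
    "fraud": (900.0, 0.8),
}

def classify(event, centroids=CENTROIDS):
    """Return the label whose centroid is closest to the event."""
    return min(centroids, key=lambda label: math.dist(event, centroids[label]))

print(classify((60.0, 0.2)))
print(classify((850.0, 0.9)))
```

The first event lands nearest the "legitimate" centroid and the second nearest "fraud". Note that the dollar amount dominates the distance here because the features are on very different scales; in practice the features would usually be normalized first.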

Preventive maintenance (PM)–machines report their temperature, ambient environment (i.e., the humidity and temperature in the vicinity), and their movement (vibration). In the case of a compressor, these readings can indicate when it is time to replace the filter, because the machine is working harder. This problem can be solved by a regression model, which is a type of predictive analysis. A regression model estimates the relationship between variables. In the case of a PM application, the model takes changes in temperature and vibration and the mean-time-to-failure (published by the manufacturer) to calculate the likelihood that a machine will fail within a given time. Given its threshold for risk (downtime), the company then sends out a technician to repair the machine when it is operating outside norms.
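As a much simpler stand-in for the preventive maintenance model described above, the sketch below fits a straight line to vibration readings over time and extrapolates when the reading would reach an alarm threshold. It uses a single variable and ignores the manufacturer's mean-time-to-failure, so it shows only the regression idea, not a failure-probability model. The readings and the threshold are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def hours_until(threshold, slope, intercept, now):
    """Extrapolate the fitted line to estimate when the reading hits the threshold."""
    if slope <= 0:
        return None  # reading is not rising, so no crossing is predicted
    return max(0.0, (threshold - intercept) / slope - now)

# Hypothetical vibration readings (mm/s) taken at hours 0 through 4.
hours = [0, 1, 2, 3, 4]
vibration = [2.0, 2.5, 3.0, 3.5, 4.0]
slope, intercept = fit_line(hours, vibration)
print(hours_until(6.0, slope, intercept, now=4))
```

Here the vibration rises by 0.5 mm/s per hour, so the line reaches the 6.0 threshold four hours after the last reading and the script prints 4.0. A company could compare that estimate with its tolerance for downtime when deciding whether to dispatch a technician.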

Adaptive Algorithms and Machine Learning

A key question in finding the right analytic to apply to streaming data is whether the model needs to adapt to changing data. Models that do that are called learning models, and they are a type of machine learning. Consider the rather counterintuitive example of weather forecasting. Those models do not necessarily need to change as wind speed, temperature, and barometric pressure change. That is because all of the possible data conditions (rain, snow, dust storm) are known. But an email spam engine needs to change as email flows in. Spam filters commonly use Bayesian statistics. As people report spam, the Bayesian model adds to its list of spam keywords. Updating the model this way is called training.
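To illustrate a model that keeps learning as mail flows in, here is a minimal naive Bayes sketch that can be updated one message at a time. The class name, the whitespace tokenizer, the add-one smoothing, and the sample messages are all simplifying assumptions; production spam filters are considerably more involved.

```python
import math
from collections import Counter

class SpamFilter:
    """Tiny naive Bayes filter that can be updated one message at a time."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Fold one labelled message into the counts (label is 'spam' or 'ham')."""
        self.message_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def _log_score(self, words, label):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(self.word_counts[label].values())
        total_messages = sum(self.message_counts.values())
        # Log prior, smoothed so a label with no messages never gives log(0).
        score = math.log((self.message_counts[label] + 1) / (total_messages + 2))
        for word in words:
            # Add-one smoothing; the extra 1 in the denominator leaves room
            # for never-seen words and avoids division by zero.
            count = self.word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab) + 1))
        return score

    def is_spam(self, text):
        words = text.lower().split()
        return self._log_score(words, "spam") > self._log_score(words, "ham")

spam_filter = SpamFilter()
spam_filter.train("team lunch at noon", "ham")
spam_filter.train("free parking at the office", "ham")
spam_filter.train("cheap pills online", "spam")
before = spam_filter.is_spam("free prize")

# Two users report new spam, and the model updates its word counts.
spam_filter.train("free prize inside", "spam")
spam_filter.train("claim your free prize", "spam")
after = spam_filter.is_spam("free prize")
print(before, after)
```

Before the reports, "free" has only been seen in legitimate mail, so the message is not flagged. After the two reports, the same message is classified as spam, and the script prints False True. The word counts play the role of the growing list of spam keywords.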

Neural Network

A neural network is a type of machine learning model loosely inspired by the way the human brain works. It models a problem as a series of neurons connected by synapses, much as the brain is constructed. The most common way to explain this idea is to consider facial recognition software.

Consider the steps used to determine whose face we are looking at and how that might be programmed into a computer.

  1. Neuron 1: Sort through the functions for determining skin tone. For example, one function is designed to work in low light. When the optimal solution (best guess) is found, hand off the answer to the next neuron.
  2. Neuron 2: Now, what color is their hair? Suppose in this case the person is blonde. The neuron says it is blonde.
  3. Neuron 3: Where was this picture taken? (It’s a geotagged photo.) Match that up against address and name records.
  4. Neurons …: (Do more steps…)
  5. Last neuron: Check the resulting numerical representation of the face (i.e., the neural network). The network says: “That looks like someone I know. It looks like Fred.  It is Fred!”
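The steps above are a simplified story. In an actual network each neuron computes a weighted sum of its inputs and passes the result on to the next layer. A minimal sketch of that hand-off follows, with made-up features and hand-picked weights; a real face recognizer would learn its weights from many labelled photos and would be far larger.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of its inputs, then applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

# Hypothetical, hand-picked weights; a real network would learn these from data.
HIDDEN_WEIGHTS = [[2.0, -1.0, 0.5], [-1.5, 2.0, 1.0]]
HIDDEN_BIASES = [0.0, -0.5]
OUTPUT_WEIGHTS = [[3.0, -2.0]]
OUTPUT_BIASES = [-0.5]

def looks_like_fred(features):
    """Pass features through both layers and return a score between 0 and 1."""
    hidden = layer(features, HIDDEN_WEIGHTS, HIDDEN_BIASES)
    return layer(hidden, OUTPUT_WEIGHTS, OUTPUT_BIASES)[0]

# Made-up numeric features standing in for skin tone, hair color, and location match.
score = looks_like_fred([0.9, 0.1, 0.8])
print(f"{score:.2f}")
```

A score above 0.5 could be read as "It looks like Fred", though where to put that cutoff is itself a design choice.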


In future blog posts, we will explore how some well-known technologies handle streaming data.
