Fraud Analytics – Financial Industry

Analytics can be used to flag credit card transactions as fraudulent.  Here we give an overview of how that is done.

 

First of all, know that such tools are not perfect, which means card processors and retailers still have to manually check many transactions.  A statistical model that is 99% accurate, or even 99.9% accurate, is by definition not 100% accurate, so the credit card processor has to manually check a certain number of flagged transactions to verify whether they are fraudulent.  There is a cost to doing this, so such systems operate on a cost-of-checking versus financial-cost-of-fraud basis to establish a tolerable threshold for fraud.  In other words, it is not cost effective to pursue a model that has no flaws.

Below we give a brief survey of statistical techniques used for credit card fraud detection and then look briefly at one product, Falcon, which uses neural networks to identify fraudulent transactions.

 

Academic Background

Software for analytics is usually built on ideas developed by academics, the people who discover the techniques and publish the algorithms that make such tools possible.  Here we draw upon the paper “Statistical Fraud Detection: A Review” by Richard J. Bolton and David J. Hand to give background information on how analytics is applied to credit card fraud detection.

 

How much money are we talking about here? This document gives some information about that:

 

  • In 2010, 33% of credit and debit card customers in the world reported fraud on their account in the past 5 years.
  • There was $5.5 billion in credit card fraud in 2012 of which $3.56 billion was in the USA.
  • There were $955 million in losses due to debit card fraud in the USA in 2010.
  • There were $2 trillion in total purchases in the USA in 2012 using Amex, Discover, MasterCard and Visa cards.

 

Bolton and Hand explain that there are two types of analytic techniques:  supervised and unsupervised.

 

Supervised

Supervised fraud analysis means using data from known attacks (a minimal classification sketch follows the list below). It:

 

  • Uses samples of both fraudulent and non-fraudulent records to construct and train models
  • Assigns new observations into one of the two classes (likely fraud, likely legitimate)
  • Only works if this type of fraud has previously occurred
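To make that concrete, here is a minimal Python sketch of the supervised approach using scikit-learn’s logistic regression (essentially the “logistic discrimination” cited later in this post).  The features and the handful of training rows are made up for illustration; a real system would train on millions of labeled transactions.

from sklearn.linear_model import LogisticRegression

# Each row is a transaction: [amount, hour_of_day, distance_from_home_km]
# Labels: 1 = known fraud, 0 = known legitimate (made-up training data)
X_train = [
    [25.00, 14, 2],      # small daytime purchase near home
    [40.00, 11, 5],
    [900.00, 3, 4200],   # large purchase at 3 a.m., far from home
    [650.00, 2, 3800],
]
y_train = [0, 0, 1, 1]

model = LogisticRegression().fit(X_train, y_train)

# Score a new observation: probabilities for (legitimate, fraud)
print(model.predict_proba([[700.00, 4, 3900]]))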

 

Unsupervised

Unsupervised fraud analysis looks for deviations from observed norms in account transactions and customer data.  It looks for outliers or other statistically significant events, with the goal of calculating what is called a “suspicion score.”
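Bolton and Hand discuss several ways to build such scores; here is a deliberately simple sketch of the idea, ours rather than theirs, that scores a new purchase by how many standard deviations it sits from a customer’s past spending.

import statistics

def suspicion_score(past_amounts, new_amount):
    """How many standard deviations the new amount is from this customer's norm."""
    mean = statistics.mean(past_amounts)
    sd = statistics.stdev(past_amounts)
    return abs(new_amount - mean) / sd

past_purchases = [23.50, 41.00, 18.75, 35.20, 29.99, 44.10]   # made-up history
print(suspicion_score(past_purchases, 950.00))   # a very large score: an outlier worth reviewing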

 

Problems with the Models

Bolton and Hand explain that these systems are not 100% accurate.  To explain what that means in practical terms, they give an example: suppose a credit card fraud detection system can:

  • Correctly identify 99% of legitimate transactions as legitimate
  • Correctly identify 99% of fraudulent transactions as fraudulent

 

Now suppose that in actuality 1 in 1,000, or 0.1%, of transactions are fraudulent.

 

This model will flag roughly 10 out of every 1,000 transactions as fraud and pass the rest as legitimate.  But we know from observation that only about 1 in 1,000 actually is fraudulent, so most of the flagged transactions are false positives: the other 9 or so have to be checked by calling each customer or doing some other kind of manual investigation.  That takes time and costs money.  Can the models do any better than that?

Among supervised models, a classification model can be tuned to flag only 0.1% of transactions as fraudulent, which in the example above would match the actual fraud rate exactly.  But suppose instead that only 0.04% of transactions are actually fraudulent.  Then of the 10 in 10,000 flagged by the model, at most 4 are actually fraudulent and at least 6 are legitimate.  There is still a cost to checking all 10, so a cost-weighted acceptable-loss threshold is set.
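To see where the “mostly false positives” problem comes from, here is a small sketch (ours, not Bolton and Hand’s) that applies Bayes’ rule to the 99%-accurate example above.

def precision(sensitivity, specificity, fraud_rate):
    """Probability that a flagged transaction really is fraudulent."""
    true_positives = sensitivity * fraud_rate                 # fraud that gets flagged
    false_positives = (1 - specificity) * (1 - fraud_rate)    # legitimate transactions flagged anyway
    return true_positives / (true_positives + false_positives)

# 99% accurate on both classes, with 1 in 1,000 transactions actually fraudulent
print(precision(0.99, 0.99, 0.001))   # ~0.09: only about 9% of flagged transactions are real fraud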

 

Supervised fraud detection tools

Here are some of the supervised fraud classification techniques.  We just cite them here and do not go into detail about how they work; you are encouraged to investigate them further if you want to dig deeper.

  • linear discriminant analysis
  • logistic discrimination
  • neural networks

 

And then there are the rules based techniques:

 

  • BAYES
  • FOIL
  • RIPPER

Link analysis is another technique.  It uses the techniques of mining social media networks (or any other kind of graph, i.e., vertices and edges) to sort out, for example, whether someone linked to someone else is phoning in credit card transactions using the same fraud technique.

 

Unsupervised fraud detection tools

Unsupervised fraud detection tools are used when there are no prior legitimate or fraudulent observations available upon which to base decisions.

In this case statistics are used to profile transactions and detect outliers.  Some techniques used here are approaches similar to those used in text analysis.

One technique used to detect fraud is an application of the rather esoteric and not-at-all-intuitive Benford’s law.  That law says that the leading digits of numbers drawn from many real-world sets of financial transactions occur at certain known frequencies.  For example, you might expect the digit 1 to lead about one time in nine, since leading digits run from 1 to 9, but on average it leads about 30% of the time.  (You will have to read up on your own to understand why.)  So if a batch of transactions deviates markedly from this pattern, it is worth investigating as possibly fraudulent.
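Here is a small sketch, with made-up dollar amounts, of what a Benford’s-law check looks like: compute the share of each leading digit in a batch and compare it with the frequencies the law predicts.  A batch whose observed shares sit far from the Benford column deserves a closer look.

import math
from collections import Counter

def leading_digit(amount):
    """First non-zero digit of an amount, e.g., 0.047 -> 4 and 1450.00 -> 1."""
    for ch in str(abs(amount)):
        if ch.isdigit() and ch != "0":
            return int(ch)

def observed_shares(amounts):
    counts = Counter(leading_digit(a) for a in amounts if a)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

# Benford's law: P(leading digit = d) = log10(1 + 1/d), so the digit 1 leads ~30.1% of the time
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

sample = [23.50, 118.00, 1450.00, 19.99, 7.25, 160.00, 12.40, 980.00]   # made-up batch
observed = observed_shares(sample)
for d in range(1, 10):
    print(d, round(benford[d], 3), round(observed[d], 3))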

 

Falcon software

Anyone in the USA who has obtained a mortgage or other loan has come to hate the organization FICO.  They assign a risk-based number to loan applicants called a credit score. Since they are in the business of detecting risk, FICO has also acquired Falcon software, which uses neural networks to detect fraud.

 

Here is a graphic from FICO giving a view of the Falcon analytics platform.

[Figure: the FICO Falcon fraud analytics platform]

You can tell by reading their product literature that this is a supervised-learning classification system using neural networks.  Since it plugs into a merchant’s POS cashier terminals, it can be used to detect fraudulent transactions right in the store.  (I am not sure what a sales clerk is supposed to do when someone standing at the register is flagged as a criminal.)  Such a system could also be plugged into the merchant’s e-commerce web system as well.

The product literature also says that FICO keeps credit profiles on card holders.  We already knew about that.  Those profiles help with classification.  They also say that their software includes adaptive analytics, meaning it responds to up-to-the-minute fraudulent activity to update the model.  This, they say, improves the model by 10% as it learns in real time.

The FICO system can be deployed as a cloud solution.  Or they provide their APIs and framework so that a company can build their own fraud detection into their own platform.

So there is a basic overview of how analytics is used to detect credit card fraud.  As you can see, these techniques would have applications to assessing any kind of financial risk.

 

 

 

 



Warranty Analytics – Improving product quality

Warranty Analytics & Preventive maintenance

Here we illustrate Warranty Analytics & Preventive maintenance with an example. Plus we show how you can use the computational engine Wolfram Alpha and the programming language R to solve such problems easily.

In preventive maintenance (PM) there are the concepts of mean time to failure (MTTF) and the failure rate; one is calculated directly from the other. The manufacturer knows the MTTF of their product and prints it on the product label or spec sheet, and they base their product warranty on that.  They pick a warranty period during which the probability of the device breaking down is not so high as to expose them to excess repair costs, yet which satisfies the customer by guaranteeing the product will last a certain amount of time.  That is typically 1, 2, or 5 years.

Factories and people with lots of machines to maintain, or a data center with lots of servers and disks, also do this kind of analysis.  They weigh the cost of replacing a device or its components against the cost of it breaking down within a certain period of time.

PM: the Classic Analytics Problem and major IoT Application

Preventive maintenance is the classic analytics problem. It is the first, and still the dominant, IoT (Internet of Things) application, aside perhaps from health care monitoring.  For example, sensors in truck tires and brakes use cellular networks to phone in temperature and air pressure data so that the trucking company knows when it is time to replace the brakes or tires.

The problem with analytics and PM is knowing which statistical model to use. Pick the wrong probability distribution and the results will be horribly wrong.  That is why you need a data scientist on your staff.  You should not blindly trust your analytics software without understanding its assumptions and how it works.

Example: Predicting the Failure of an Electrical Device

If a machine component fails at a rate that is constant and independent of any other event, then its time to failure follows an exponential distribution.  This often applies to electronic components, which have no moving parts and so do not wear out over time. (Actually, silicon flash memory dies suffer what is called wear fatigue and wear out after roughly 100,000 write/erase cycles; the disk controller knows that and uses wear leveling to prolong the life of the solid state drive.) Brake failure does not follow the exponential distribution: brake deterioration is roughly a linear function of time, i.e., its graph is a straight line rather than a curve, because the brakes do not wear out at an increasing rate over time.

Let’s expand on this example, taken from the internet, and assume we are looking at an electrical device and want to predict when it might break down. The device is made up of electrical components whose failure rates are not at all related to the failure rates of the other components.  These properties allow us to use the exponential distribution to model when the whole device might break down.

Suppose the device has 1,000 components and the daily failure rate is 0.1.  These are not realistic numbers, as they are abnormally large.  But in analytics it is normal to scale large or small numbers (called normalization) to make them small numbers that are easy to graph and work with. For example, if you try to graph something like 0.0000001 over time, the graph will be so scrunched up you cannot easily read it.

Since the failure rate is 0.1 per day and we have 1,000 components, on the first day we would expect about 100 failures.

On average, the device falls apart at this exponential rate:

 

Day   Failures             Remaining
0     0                    1,000
1     0.1 * 1,000 = 100    1,000 - 100 = 900
2     0.1 * 900 = 90       900 - 90 = 810
3     0.1 * 810 = 81       810 - 81 = 729
and so on
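Here is a quick sketch that reproduces the table above; it just applies the 10% daily failure rate to whatever is still working.

rate = 0.1          # expected fraction of surviving components that fail each day
remaining = 1000.0

for day in range(1, 6):
    failures = rate * remaining
    remaining -= failures
    print(day, round(failures, 1), round(remaining, 1))
# day 1: 100.0 fail, 900.0 remain; day 2: 90.0 and 810.0; day 3: 81.0 and 729.0; ...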

The graph of the exponential probability density function is:

f(t) = λ e^(−λt)

where:

λ is the failure rate
e is the constant e (≈ 2.718)
t is time in days

Wolfram Alpha

If you do not know what Wolfram Alpha is, now is the time to see this powerful computational engine in action.  You can generate plots with Wolfram Alpha and solve easy or complicated functions.  Click this link and it will draw the graph of the density function above.

Of course, in this setting we are not interested in time t < 0, so you can ignore anything to the left of t = 0.

To enter that function in Wolfram Alpha you use this syntax, which is similar to what you would type into Microsoft Excel:

(0.1)exp(-t/10)

Probability Density Distribution

Here is another view of the probability density distribution, drawn as a graph.

The area under the curve from t = 0 to any other time t, say t = k, is the cumulative probability: the probability of the device failing at any time less than or equal to k.

Because this curve flattens out slowly as time moves toward 30, 40, and 50 days, the cumulative probability at those points will be close to 99%, i.e., close to certainty.

If you remember your calculus, the area under that curve from 0 to t is the definite integral of the density function:

F(t) = ∫ from 0 to t of λ e^(−λs) ds = 1 − e^(−λt)

which you can also solve with Wolfram Alpha.  Click here to see that.

The definite integral is the indefinite integral evaluated at two points: in this case, t = 0 and whatever value of t you want.

Given that explanation, the probability that the device will fail in t = 10 days or less is:

P(t ≤ 10) = 1 − e^(−0.1 × 10) = 1 − e^(−1) ≈ 0.63, or about 63%

If you want another explanation of what a probability distribution is, watch this short video from Khan Academy.

The R Programming Language

Anyone working with analytics should learn how to use the R programming language.  It is the language of choice for data scientists, and it and its APIs are used in much, if not most, analytics software. Plus you can use it like a calculator.  Some programmers use Python, but R is superior for doing analytics because of its rich function set and short, albeit cryptic, notation.

R is filled with analytics models, one of which is the exponential distribution.  Finding the probability that the device will break down by a given day is as simple as typing:

pexp(day, rate)

into its command-line interpreter, e.g., pexp(10, 0.1).  Using that, here are the cumulative probabilities for different days.

 

Day   Probability of failure on that day or earlier
0     pexp(0, 0.1) = 0 (you would not expect it to break down as soon as you turn it on)
1     pexp(1, 0.1) = 9.5%
10    pexp(10, 0.1) = 63%
20    pexp(20, 0.1) = 86%
30    pexp(30, 0.1) = 95%
50    pexp(50, 0.1) = 99%

Wrapping Up

So the goal of the PM program should be to send technicians out into the field to replace the device before its probability of failure exceeds some threshold.  That threshold is a cost-versus-benefit decision: the cost of letting the machine break down, including the cost of idle time at the plant, versus the cost of going out to fix it.
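To make that threshold concrete, here is a minimal sketch (ours) that inverts the exponential model’s cumulative distribution to find the latest day you can wait before the failure probability crosses the threshold you chose.

from math import log

rate = 0.1          # failures per day, as in the example above
threshold = 0.50    # replace the device before its failure probability reaches 50%

# Invert F(t) = 1 - exp(-rate * t) to solve for t
latest_day = -log(1 - threshold) / rate
print(latest_day)   # ~6.9 days; the equivalent in R is qexp(0.5, 0.1)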



Cybersecurity & Analytics – a Tango

Analytics and Big Data provide a quantum leap forward over earlier means of detecting a hacker attack or the misuse of computing assets by insiders.  Software vendors and open source projects have taken ideas from data scientists and applied those algorithms to streaming data.  Those algorithms, and products like Splunk and Apache Spark together with Hadoop, have made it possible to sort through and process vast amounts of log and network data in a way that would not have been possible just six years ago.

Presumably all of this has provided a large boost to defeating hackers.  But if you read the news you can see that it does not work in all cases, as hackers still seem able to steal data at will.  One problem is that companies generally do not publish data about which attacks they have blocked, so we cannot really tell how well all this works.  What they have done is respond to surveys by groups like the Ponemon Institute about the extent to which they are using analytics and big data as part of their cyber defense.  One would hope that as analytics and cybersecurity come together, and their adoption and understanding become more widespread, companies might be able to avoid repeating some of the cyber disasters of the past.

In this short blog post we look at one approach to classifying events as threats and then mention two cybersecurity analytics products: one from IBM and the other from the US government’s Oak Ridge National Laboratory.

 

Cybersecurity Before Analytics

Consider the way that cybersecurity works when you do not have the tools to do real-time analytics.  A SIEM (Security Information and Event Management) system like ArcSight is programmed to gather up logs and apply rules that raise alerts when, for example (a toy rule sketch follows the list below):

  • A former employee tries to login.
  • Someone tries and fails to login X number of times.
  • Someone who normally works in the office is trying to login remotely.
  • There is network traffic of a type on a subnet where it is not usually seen.
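Here is a toy sketch of the second rule above; it is our simplification, not ArcSight’s rule language, and the log lines are made up.  It just counts failed logins per user and alerts past a threshold.

from collections import Counter

FAILED_LOGIN_LIMIT = 5

# Simplified, made-up log records: (timestamp, user, event)
log_lines = [
    ("2016-03-01T09:00:01", "jsmith", "LOGIN_FAILED"),
    ("2016-03-01T09:00:03", "jsmith", "LOGIN_FAILED"),
    ("2016-03-01T09:00:05", "aturing", "LOGIN_OK"),
    # ... thousands more ...
]

failures = Counter(user for _, user, event in log_lines if event == "LOGIN_FAILED")
for user, count in failures.items():
    if count >= FAILED_LOGIN_LIMIT:
        print("ALERT:", user, "failed to log in", count, "times")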

 

And then it provides dashboard reports like the SANS top 5:

 

  1. Attempts to Gain Access through Existing Accounts
  2. Failed File or Resource Access Attempts
  3. Unauthorized Changes to Users, Groups, and Service
  4. Systems Most Vulnerable to Attack
  5. Suspicious or Unauthorized Network Traffic Patterns

 

The problems with these approaches are:

 

  1. They are not good at solving the signal-to-noise problem (Too many normal events crowding out ones that could indicate a security breach.).
  2. It is too difficult for those tools to correlate logs across systems, as a hacker will use different credentials as they traverse the network, depending on where they log in, so the SIEM cannot easily follow the hacker from one end to the other (e.g., tracking across a hacker’s IP address, VPN credentials, firewall user login, Active Directory credentials, and UNIX credentials).
  3. Users have to respond to alerts manually, most of which are false positives.

At best, a SIEM can be used for forensics, i.e., after the fact, to try to track a hacker’s activity through the system, since it lets one query logs gathered from multiple sources.

 

Algorithms:  Complex Academics, but Simpler Implementation

What is needed are algorithms that can sort through these events in real time.  That is what modern analytics and big data have made possible today.

Here are some of the details for those who want to dig in.  For example, we have PDFs from Stanford used in their course “Data Mining for Cyber Security,” including an excellent, technical presentation by Google on “Data Mining at Google.”  You can also choose to just use cybersecurity products without understanding how they really work, but where’s the fun in that?

 

Anomaly Detection:  A rather Simple Explanation

A way to flag anomalies is to use different kinds of machine learning.  For example, there is the easily understood technique of classification and the more complex technique of graph analysis.

The most common explanation of graph analysis and classification is mapping the relationships of people on Facebook.  These are drawn as a graph, which is a series of nodes and edges, like the example below drawn from the lecture “Mining Massive Datasets,” available as an online course at Coursera.

[Figure: a Facebook friendship graph, from the “Mining Massive Datasets” lecture]

Each of these dots is a person (node) and the lines drawn between nodes (called edges) represent friendships.  A cluster of nodes represents something these people have in common; in this example, the crowded part of the graph is Stanford University students who play soccer.  For cybersecurity, the nodes would be users, botnets, network resources, traffic types, and so on, and the edges would represent relationships between them.  Classification would then be used to flag a series of events as a threat or non-threat.
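Here is a small sketch of that idea using the networkx library; the users, bots, and hosts are made up.  Building the graph and then looking at its connected clusters and unusually well-connected nodes is the flavor of analysis described above.

import networkx as nx

G = nx.Graph()
# Edges are relationships: a user logging in to a host, accounts sharing an address, etc.
G.add_edges_from([
    ("user_a", "host_1"), ("user_a", "host_2"), ("user_b", "host_2"),
    ("bot_x", "host_9"), ("bot_y", "host_9"), ("bot_z", "host_9"),
])

# Clusters of nodes that share relationships
for cluster in nx.connected_components(G):
    print(cluster)

# Nodes with unusually many connections are candidates for a closer look
print(sorted(G.degree, key=lambda pair: pair[1], reverse=True))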

 

Big Data for Cybersecurity

Here is a presentation from the IBM Watson Research Lab, “Big Data for Cybersecurity.”  It is a build-up to selling the IBM QRadar Security Intelligence Platform and the IBM Big Data Platform.  IBM QRadar is a tool that does visualization, unstructured analysis, machine learning, and prediction.

[Figure: IBM streaming analytics deployment, described below]

Graphics Source:  IBM Watson Research Lab

 

The graphic above shows how this approach and product are deployed into a network.  IBM says streaming analytics “detects behaviors by correlating across diverse & massive data streams via Analytics in Motion.”  That replaces the old way of looking at network traffic, which was signatures, i.e., known attacks registered in cloud databases, similar to an antivirus database.  Note that IBM says that their models learn offline with data stored after the fact.  In this case “learning” means training Bayes, neural network, and other algorithms.

 

Protecting the Power Plant

If you have been to Tennessee, you have heard of the TVA, the Tennessee Valley Authority.  It is hard to overlook this network of dams and rivers built across the state during the Great Depression to provide both jobs and power.  Now much of that power comes not from water but from nuclear fuel, and that needs to be protected.  So the Department of Defense has funded Oak Ridge National Laboratory to develop Oak Ridge Cyber Analytics (ORCA), which power companies can use to detect attacks on this critical infrastructure.  Oak Ridge says,  “ORCA is comprised of several components, each of which addresses widespread technology gaps in computer network defense.”  Attack targets, they say, include home gateways, smart meters, substations, and the control room.

 

ORCA uses these algorithms to detect attacks:

 

  • OneR
  • Nearest Neighbor
  • Random Forests
  • Naive Bayes
  • SVM
  • JRip
  • Adaboost + JRip

 

As to what all of this does, we let Oak Ridge explain:

 

“The Temporal Analysis component provides a time-based and connection-based view of the events relative to each other. The Cluster Analysis applies raw text analysis and clustering algorithms to provide a hierarchical organization of the events. The Particle Swarm Optimizer analyzes behavior patterns over time to highlight behaviors that are consistent with known attack patterns. ORCA Categorization automatically classifies alerts produced by the various IDS sources based on two different approaches: text classification using the terms and phrases in the alert text, and semi-supervised machine learning which classifies based on the network statistics”

Although they invite other power companies to participate in the project and call their product a framework, I don’t see where you can download any open source software.  I wrote to Oak Ridge and we will see what they say about that.

So there you have an overview of some of the analytics tools, concepts, algorithms, and products used in cybersecurity. You could write a whole book on this topic, and people have.  If you want to dig really deep into all of this to increase your understanding I would suggest you take the Stanford classes on data mining.

 

We will write more about analytics and big data in the next post.



Leveraging Apache Storm and Spark to handle streaming data

Here we explain the main differences between Apache Storm and Apache Spark Streaming. Briefly, Storm is designed from the ground up to process streams, while Apache Spark Streaming is an extension that adds stream processing to the batch-oriented Apache Spark.

Apache Storm

Storm was developed by Twitter as a way to model the constant stream of Tweets so that they could query and run analytics against it. What Storm does is provide real-time data processing to Hadoop, although it can also write its output to other data stores or just retain the Storm data model in memory, as the whole point is to produce a moving picture of what is happening now.

Storm produces a graph.  This is not a graph like a chart of sales over time from a spreadsheet.  When we say graph here we mean a topology, a concept from the mathematical field of graph theory.  An example topology is shown in the graphic below.

The graph provides the answer to some question, but as a living model.  Because data is continually streaming into the model, the graph keeps changing all the time.  If you were to map this to a dashboard you could visualize that.

[Figure: an example graph (topology) of nodes and edges]

In the graph, the circles are nodes, also called vertices.  The lines between the circles are called links, also called edges.  The graph represents the answer to some question posed against the data source, expressed as a topological model.

What does this mean?

Here are some examples to clarify that:

 

  • A graph can model the relationships between people on Facebook. If you dumped all of the billion-plus people on Facebook into a graph, you could navigate the edges to see which friends each person has and which people have friends in common. Putting that in a graph is the most efficient way to process such vast amounts of data; it would be much faster than other data structures designed for speed, like a Java TreeSet.
  • Consider another example, this one described on Wikipedia: the relationships between formulas in a spreadsheet. The graph shows the order in which computations need to be executed, which tells Excel when to recalculate one cell or a whole range of cells and worksheets. (A minimal sketch of this idea follows the list.)
  • Another example is Twitter. You could construct a graph to show all the Tweets related to some topic that is trending. Twitter publishes Tweets via an API that returns a JSON object; you send them a URL formatted in a way that indicates which Tweets you want.
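Here is a minimal sketch of the spreadsheet example, using the TopologicalSorter in Python’s standard library (3.9+); the cells and formulas are made up.

from graphlib import TopologicalSorter

# Each cell maps to the cells its formula depends on (the graph's edges)
dependencies = {
    "A3": {"A1", "A2"},     # A3 = A1 + A2
    "B1": {"A3"},           # B1 = A3 * 2
    "C1": {"A3", "B1"},     # C1 = A3 + B1
}

# A topological order tells the spreadsheet which cells to recalculate first
print(list(TopologicalSorter(dependencies).static_order()))
# e.g., ['A1', 'A2', 'A3', 'B1', 'C1']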

Here is how Storm took those abstract concepts from graph theory to build their own topology, shown in the picture below:

 

 

[Figure: a Storm topology of spouts and bolts]

Graphic source: Apache

 

In this topology there are two types of nodes:

 

Spout–the source of a stream, which arrives as data structures known as tuples, e.g., (field 1, field 2, …, field n). This could be data from multiple sources, like machine sensors, social media, IoT devices, firewalls, or others.

 

Bolt–takes input from a spout (or another bolt) and filters it, aggregates data, or joins data sets.
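Storm topologies are normally defined through its Java API, so the following is not Storm code; it is just a minimal Python sketch of the spout-and-bolt idea, with made-up sensor readings: a spout emits a stream of tuples and a bolt filters and aggregates them.

from collections import Counter

def sensor_spout():
    """Spout: emits a stream of tuples (device_id, temperature_c)."""
    readings = [("pump-1", 71.0), ("pump-2", 64.5), ("pump-1", 88.2), ("pump-3", 91.7)]
    for reading in readings:
        yield reading

def overheat_bolt(stream, limit=85.0):
    """Bolt: filters the stream and counts overheating events per device."""
    counts = Counter()
    for device, temperature in stream:
        if temperature > limit:
            counts[device] += 1
    return counts

print(overheat_bolt(sensor_spout()))   # Counter({'pump-1': 1, 'pump-3': 1})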

 

Now we provide an example to illustrate how this is put to use.

 

HortonWorks Trucking Example

Here is an example from HortonWorks, the cloud Hadoop platform.

 

This application provides a real-time graph of fleet operations for a trucking company.  The goal is to analyze data streaming from multiple sources and use that to change driver behavior (such as slow down or take another route) or alert management (the driver is going too fast or maybe should stop because of road conditions).

The inputs are what you would normally consider when making these kinds of decisions, except that the computer processes all of it, instead of humans, and does so at lightning speed:

  • Driver information–sent from IoT network-connected sensors on the vehicle as this tuple: (driver, truck, longitude, latitude, time).
  • Weather–this comes from a weather service API. There are several.
  • Traffic data–comes from MapQuest. It could also come from the crowdsourced application Waze.
  • NYC trucking routes–from the NY DOT. This is in KML format, i.e., Google Earth format.

 

Apache Spark Streaming

Now, given the design of Storm, what is Apache Spark Streaming and how does that differ from Storm?

 

[Figure: Apache Spark Streaming in a streaming data architecture]

Graphic source: Apache Spark

Apache Spark Streaming delivers streaming ability to Apache Spark.  The graphic above gives some ideas of how you might use it in your architecture.

Spark Streaming output is not a graph, as with Storm, but a DStream, a data structure that can be handed off to regular Apache Spark for processing.  A DStream mimics the Apache Spark RDD (Resilient Distributed Dataset) format, and an RDD is similar to the output of an Apache Hadoop MapReduce program.  Spark Streaming can also produce a graph, using the graph analytics APIs built into the product.

To understand Spark, you have to understand MapReduce.  This is because Apache Spark is used to provide an enormous boost in performance over the batch Apache Hadoop MapReduce function.  Because of that improved performance, it is threatening to push MapReduce aside in the market for Big Data processing (although both are free, there is much cloud business built around them).

Hadoop MapReduce, like the name suggests, does two things.

  1. Map–read the data and transform it into <key, value> pairs. Save that to disk.
  2. Reduce–aggregate those pairs, such as to remove duplicates or count objects. Save that to disk.

The most commonly used example to explain MapReduce is the word count problem:

Consider, for example, any large book, say, the Bible.  Suppose this book is so large that it is stored across a Hadoop Distributed File System (HDFS).  To figure out how many times the word “Jonah” occurs in the Bible, you send MapReduce jobs out to scan each chunk of the Bible stored on the nodes (virtual machines with attached storage) of the HDFS.

The Map step produces the rather redundant-looking <key, value> pairs (redundant because each value in this case is 1, meaning one occurrence):

<Jonah, 1>

<Mary, 1>

<John,1>

<Jonah,1>

 

Then the Reduce step collapses the <key, value> pairs by key to produce this output dataset:

<Jonah, 2>

<Mary, 1>

<John,1>

The <key, value> pair <Jonah, 2> shows that the word Jonah occurs 2 times in the Bible (obviously it occurs many more times than that, but this is a simple example).
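Here is the same word-count logic as a plain Python sketch, with the map and reduce phases kept separate so you can see the two steps; the “chunks” are stand-ins for blocks stored on HDFS nodes.

from collections import defaultdict

chunks = ["Jonah went to Nineveh", "Mary and John and Jonah"]   # stand-ins for HDFS blocks

# Map: emit a <word, 1> pair for every word in every chunk
pairs = [(word, 1) for chunk in chunks for word in chunk.split()]

# Shuffle/Reduce: group the pairs by key and sum their values
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

print(counts["Jonah"])   # 2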

Spark Map Process

Spark Streaming does the same thing as MapReduce, but works on streaming rather than static data.

The simplest example is a TCP socket.  If you know how computer logs are monitored for security or operational events, then you will be familiar with syslog.  That is a TCP or UDP port: computers log their events by writing to the port (socket), and a log collector program listens on that port to gather all the logs into one central place.
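To show what that looks like in code, here is the classic network word-count sketch written against Spark Streaming’s DStream API.  It assumes something (a syslog forwarder, netcat, or similar) is writing text lines to port 9999 on localhost, and it counts words in 10-second batches.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "SocketWordCount")
ssc = StreamingContext(sc, 10)                      # 10-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)     # the TCP socket described above
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # print each batch's counts

ssc.start()
ssc.awaitTermination()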

Finally, to add analytics to this process and produce a graph or run other algorithms, Spark has its own APIs: MLlib (the machine learning library) and GraphX (the graph processing library).

Spark Example

 

Pinterest uses Apache Spark Streaming to measure user engagement.

Pinterest streams user data to their analytics platform using Apache Kafka, an open source tool developed at LinkedIn.  As data flows into the model, it uses the geolocation information included in photos and presents that as pins on a map.  So the output in that case is a dashboard.

Summary

So there you have a brief overview of how Apache Spark Streaming and Apache Storm process streaming data. Given that they do basically the same thing, which one should you use?  That probably depends on what you have now and the skillsets of your programmers and data scientists.



Data Streaming & Analytics

Processing streaming data lets a business, organization, machine, or other kind of entity or program change behavior based upon what has happened or, in the case of predictive models, what is about to happen. The tools to process streams of data and draw conclusions are broadly called analytics. In this short blog post, we explain what kinds of analytics might be used for streaming data.

 

Analytics Algorithms

The underlying mathematics and statistics on which different analytics tools are based are the same whether you have streaming or static data.  What differs are the algorithms used to read the data and sort it before handing it off to the statistical model.  In other words, a time series analysis is the same whether you are looking at a static data file (with time as one of its fields) or a stream of data.

Consider an example. Suppose you want to model customers in a retail store checkout line.  You are looking at the shoppers lined up at the cash register, and you have written a model that determines, given the length of the queue, whether you should send a cashier to open a new cash register.  If some customers walk out of the store because the line is too long (a laser can count and stream that), then you might have to install additional cash registers and hire more employees to work them.  You would do so if the cost of doing that is less than the revenue lost when customers walk out.

The movement of customers through cash registers is time-sensitive data, which lends itself to time-sensitive analysis.  Apache Storm, which is one tool for processing streaming data, gathers these events up as tuples.  Think of the tuple as a record.  In the case of time-sensitive analysis, the tuple would have to include time as one of its values; the tuple for the register transaction in this case is (register number, time transaction processed, time person entered line).

 

Some Applications for Streaming Data Analytics

When you think of streaming data, you think of something coming in quite fast.  Customers lined up at a grocery store register move along slowly.  If you have lots of stores, that increases how much data there is to process, but it is still not very fast compared with, say, an image processing application.  Machine sensor data, used in a preventive maintenance application, comes in slowly as well; a compressor might report its temperature and vibration every 10 minutes.  So a stream, contrary to the way that word sounds, does not always mean fast, but it can.  Here are some examples of streaming data and which analytics could be or are used to process them:

Cybersecurity threat detection–web servers, firewalls, network routers, etc. produce reams of data very fast.   Analytics appropriate for processing these would include, for example, data classification (explained below).

Stock trading–mathematicians who work on Wall Street are called quants.  Their models read newspaper feeds (streams) to pick up when bad news or good news is circulating and then buy or sell stocks based on that. Again this software could use classification and different kinds of machine learning analytics (explained below) to make those decisions.

Credit card fraud–works on the principles of classification, which we explain now.  Classification takes data points and puts them into sets with known characteristics.  The algorithm processes events by calculating the distance of an event from the closest set.  For financial transactions, this measures the likelihood that a transaction is fraud given that the customer is, say, high risk and the transaction is coming from places where there is a lot of fraud, like over the internet from Russia.

Preventive maintenance (PM)–machines report their temperature, ambient environment (i.e., the humidity and temperature in the vicinity), and their movement (vibration).  In the case of a compressor this would indicate when it is time to replace the filter as the machine is working harder.  This problem can be solved by a regression model, which is a type of predictive analysis.  A regression model finds the correlation between variables.  In the case of a PM application, the model takes changes in temperature and vibration and the mean-time-to-failure (published by the manufacturer) to calculate the likelihood that a machine will fail within a given time.  Given their threshold for risk (downtime), the company then sends out a technician to repair the machine when it is operating outside norms.
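Here is a toy regression sketch of that PM idea, with made-up numbers: fit past failures against temperature and vibration readings, then estimate remaining life for a machine that is currently running hot.

from sklearn.linear_model import LinearRegression

# Made-up history: [average temperature C, vibration mm/s] -> days until that machine failed
X = [[60, 2.0], [65, 2.4], [70, 3.1], [75, 3.9], [80, 4.8]]
y = [400, 330, 250, 170, 90]

model = LinearRegression().fit(X, y)

# Estimate remaining life for a machine now running at 78 C with 4.5 mm/s vibration
print(model.predict([[78, 4.5]]))   # roughly 120 days, on this made-up data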

Adaptive Algorithms and Machine Learning

A key question for finding the right analytic to apply to streaming data is whether the model needs to adapt to changing data.  Models that do are called learning models, a type of machine learning.  Consider the rather counterintuitive example of weather forecasting.  Those models do not necessarily need to change as wind speed, temperature, and barometric pressure change, because all of the possible data conditions (rain, snow, dust storm) are known.  But an email spam engine needs to change as email flows in.  That uses Bayesian statistics: as people report spam, the Bayesian model adds to its list of spam keywords.  That is called training the model, which is also a type of machine learning.
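Here is a minimal sketch of that spam idea using a naive Bayes text classifier from scikit-learn; the messages are made up.  “Training” here is exactly the refit you would do as users report new spam.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "win a free prize now",                 # reported as spam
    "limited offer click here",             # reported as spam
    "meeting moved to 3pm",                 # normal mail
    "please review the attached report",    # normal mail
]
labels = [1, 1, 0, 0]   # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(messages), labels)

# As users report more spam, append it to messages/labels and refit: that is the "training"
new_mail = vectorizer.transform(["click here to win a prize"])
print(model.predict(new_mail))   # [1] -> flagged as spam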
Neural Network

A neural network is a type of machine learning that mimics the way humans think.  It does this by modeling a problem as a series of synapses and neurons, just like the brain is constructed. The most common way to explain this idea is to consider facial recognition software.

Consider the steps used to determine whose face we are looking at and how that might be programmed into a computer.

  1. Neuron 1: Sort through the functions for determining skin tone. For example, one function is designed to work in low light. When the optimal solution (best guess) is found, hand off the answer to the next neuron.
  2. Neuron 2: Now, what color is their hair? The neuron says it is blonde.
  3. Neuron 3: Where was this picture taken (it is a geotagged photo)? Match that up against address and name records.
  4. Neurons …: (more steps…)
  5. Last neuron: Check the resulting numerical representation of the face (i.e., the neural network’s output). The network says: “That looks like someone I know. It looks like Fred.  It is Fred!”

 

In our future blogs, we will explore how some known technologies handle streaming data.