Warranty Analytics – Improving product quality

Warranty Analytics & Preventive Maintenance

Here we illustrate warranty analytics and preventive maintenance with an example. We also show how you can use the computational engine Wolfram Alpha and the programming language R to solve such problems easily.

In preventive maintenance (PM) there are the concepts of mean time to failure (MTTF) and the failure rate; one is calculated directly from the other. The manufacturer knows the MTTF of their product, prints it on the product label or spec sheet, and bases the product warranty on it. They pick a warranty period over which the probability of the device breaking down is not so high as to expose them to excessive repair costs, yet which satisfies the customer by guaranteeing the product will last a certain amount of time. That is typically 1, 2, or 5 years.
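For the constant-failure-rate model used later in this post, the two quantities are simply reciprocals of one another. Here is a quick illustration in R (the language used later in this post; the variable names are mine):

rate <- 0.1       # failure rate: 0.1 failures per day (the value used in the example below)
mttf <- 1 / rate  # mean time to failure = 1 / rate = 10 days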

Factories with lots of machines to maintain, or data centers with lots of servers and disks, do this kind of analysis too. They weigh the cost of replacing a device or its components against the cost of it breaking down within a certain period of time.

PM: The Classic Analytics Problem and a Major IoT Application

Preventive maintenance is the classic analytics problem. It is also the first, and still the dominant, IoT (Internet of Things) application, besides, perhaps, health care monitoring. For example, sensors in truck tires and brakes use cellular networks to phone in temperature and air pressure data so that the trucking company knows when it is time to replace the brakes or tires.

The problem with analytics and PM is knowing which statistical model to use. Pick the wrong probability distribution and the results will be horribly wrong.  That is why you need a data scientist on your staff.  You should not blindly trust your analytics software without understanding its assumptions and how it works.

Example: Predicting the Failure of an Electrical Device

If a machine component fails at a constant rate, independently of its age and of any other event, then its time to failure follows an exponential distribution. This is often applicable in predicting the failure of electronic components, as they have no moving parts and so do not wear out over time. (Actually, silicon flash memory cells do suffer what is called wear fatigue and wear out after about 100,000 write/erase cycles; the disk controller knows that and uses wear leveling to prolong the life of a solid state drive.) Brake failure does not follow the exponential distribution, because brake deterioration is roughly a linear function: its graph is a straight line when plotted against time, not a curve, since brakes do not wear out at an increasing rate over time.

Let’s expand on this example taken from the internet and assume we are looking at an electrical device and want to predict when it might break down. The device is made up of electrical components whose failures are independent of the failures of any other component. Those properties allow us to use the exponential distribution to model when the whole device might break down.

Suppose the device has 1,000 components and the daily failure rate is 0.1. These are not realistic numbers; they are abnormally large. But with analytics it is normal to scale large or small numbers (called normalization) into numbers that are easy to graph and work with. For example, if you try to graph something like 0.0000001 over time, the graph will be so scrunched up that you cannot easily read it.

Since the failure rate is 0.1 per day and we have 1,000 components, on the first day we would expect about 100 failures.

On average, the device falls apart at this exponential rate:

 

day | failures | remaining
0 | 0 | 1,000
1 | 0.1 × 1,000 = 100 | 1,000 − 100 = 900
2 | 0.1 × 900 = 90 | 900 − 90 = 810
3 | 0.1 × 810 = 81 | 810 − 81 = 729
and so on
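If you want to reproduce that table yourself, here is a minimal R sketch (R is introduced later in this post; the variable names are mine) that applies the 0.1 daily failure rate to the components still working:

remaining <- 1000   # components still working
rate <- 0.1         # daily failure rate
for (day in 1:3) {
  failures  <- rate * remaining        # expected failures on this day
  remaining <- remaining - failures    # components left after this day
  cat(day, failures, remaining, "\n")  # prints 1 100 900, then 2 90 810, then 3 81 729
}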

The graph of the exponential probability density function is:

f(t) = λe^(−λt)

where:

λ is the failure rate
e is the mathematical constant e (≈ 2.718)
t is time in days

Wolfram Alpha

If you do not know what Wolfram Alpha is, now is the time to see this powerful computational engine in action. You can generate plots with Wolfram Alpha and solve simple or complicated functions. Click this link and it will draw the graph:

[Wolfram Alpha plot of f(t) = 0.1e^(−0.1t)]

Of course, with statistics we are not interested in time t < 0, so you can ignore anything to the left of t = 0.

To plot that function in Wolfram Alpha you use this syntax, which is similar to what you would type into Microsoft Excel:

(0.1)exp(-t/10)
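If you would rather draw the same curve in R (covered later in this post), base R's curve and dexp functions give an equivalent plot with λ = 0.1; a minimal sketch:

# Plot the exponential density f(t) = 0.1 * exp(-0.1 * t) for t from 0 to 50 days
curve(dexp(x, rate = 0.1), from = 0, to = 50,
      xlab = "t (days)", ylab = "f(t)")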

Probability Density Function

Here is another view of the probability density function drawn as a graph.

The area under the curve from t = 0 to any other time, say t = k, is the cumulative probability: the probability that the device fails at some time less than or equal to k.

Because the curve flattens out as time moves toward 30, 40, and 50 days, the cumulative probability at those points is close to 99%, which is essentially certainty.

[Plot of the exponential probability density, with the cumulative probability shown as the area under the curve]

If you remember your calculus, the area under that curve is the definite integral from 0 to t of the function

∫₀ᵗ λe^(−λs) ds = 1 − e^(−λt)

which you can also solve with Wolfram Alpha.  Click here to see that.

The definite integral is the indefinite integral evaluated at the two endpoints, with the lower value subtracted from the upper; in this case, the endpoints are t = 0 and whatever value of t you want.

Given that explanation, the probability that the device will fail in t = 10 days or less is:

P(T ≤ 10) = 1 − e^(−0.1 × 10) = 1 − e^(−1) ≈ 0.63, i.e., about 63%
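As a sanity check, you can evaluate that same area under the curve numerically in R, either by integrating the density directly or with the built-in pexp function covered in the next section; both give about 0.63:

integrate(function(t) 0.1 * exp(-0.1 * t), lower = 0, upper = 10)  # ≈ 0.632
pexp(10, rate = 0.1)                                               # 0.6321206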

If you want another explanation of what a probability distribution is, watch this short video from Khan Academy.

The R Programming Language

Anyone working with analytics should learn how to use the R programming language. It is the programming language of choice for data scientists, and it and its APIs are used in many if not most analytics products. Plus, you can use it like a calculator. Some programmers use Python, but R is superior for doing analytics because of its rich function set and short, albeit cryptic, notation.

R is filled with analytics models, one of which is the exponential distribution. Finding the probability that the device will break down on or before any given day is as simple as typing:

pexp(day, rate)

into its command line interpreter.  Using that, here are some cumulative percentages for different days.

 

day | probability that the device fails on that day or earlier
0 | pexp(0, 0.1) = 0 (you would not expect it to break down as soon as you turn it on)
1 | pexp(1, 0.1) = 9.5%
10 | pexp(10, 0.1) = 63%
20 | pexp(20, 0.1) = 86%
30 | pexp(30, 0.1) = 95%
50 | pexp(50, 0.1) = 99%
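Because pexp is vectorized, you can generate that whole table with a single call; a minimal sketch:

days <- c(0, 1, 10, 20, 30, 50)
round(pexp(days, rate = 0.1), 3)   # 0.000 0.095 0.632 0.865 0.950 0.993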

Wrapping Up

So the goal of the PM program should be to send technicians out into the field to replace the device before its probability of failure reaches some threshold. That threshold is a cost-versus-benefit decision: the cost of letting the machine break down versus the cost of going out to fix it. The cost of letting it break down includes the cost of idle time at the plant.
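In R, qexp (the inverse of pexp) turns such a threshold into a replacement day. The 50% threshold below is just a hypothetical number to illustrate the calculation:

threshold <- 0.5             # hypothetical acceptable probability of failure
qexp(threshold, rate = 0.1)  # ≈ 6.9 days: replace the device before this day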



Cybersecurity & Analytics – a Tango

Analytics and Big Data provide a quantum leap forward over earlier means of detecting a hacker attack or the misuse of computing assets by insiders. Software vendors and open source projects have taken ideas from data scientists and applied those algorithms to streaming data. Those algorithms, and products like Splunk and Apache Spark together with Hadoop, have made it possible to sort through and process vast amounts of log and network data in a way that would not have been possible just six years ago.

Presumably all of this has provided a large boost to defeating hackers. But if you read the news, you can see that it does not work in all cases, as hackers are still seemingly able to steal data at will. One problem is that companies generally do not publish data about which attacks they have blocked, so we cannot really tell how well all this works. What they have done is respond to surveys by groups like the Ponemon Institute about the extent to which they are using analytics and big data as part of their cyber defense. One would hope that as analytics and cybersecurity come together, and their adoption and understanding become more widespread, companies will be able to avoid some of the cyber disasters of the past.

In this short blog post we look at one approach to classifying events as threats and then mention two cybersecurity analytics products: one from IBM and the other from the US government's Oak Ridge National Laboratory.

 

Cybersecurity Before Analytics

Consider the way cybersecurity works when you do not have the tools to do real-time analytics. A SIEM (Security Information and Event Management) system like ArcSight is programmed to gather up logs and apply rules that fire alerts when, for example:

  • A former employee tries to log in.
  • Someone tries and fails to log in X number of times (a toy sketch of such a rule appears after this list).
  • Someone who normally works in the office is trying to log in remotely.
  • There is a type of network traffic on a subnet that is not what is usually seen there.
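To make the failed-login rule concrete, here is a toy R sketch against a hypothetical data frame of login events (log_df, with a user column and a TRUE/FALSE success column). It illustrates the rule logic only, not how ArcSight itself is configured:

# Count failed logins per user and flag anyone with 5 or more
fails_by_user <- tapply(!log_df$success, log_df$user, sum)
alert_users   <- names(fails_by_user[fails_by_user >= 5])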

 

And then it provides dashboard reports like the SANS top 5:

 

  1. Attempts to Gain Access through Existing Accounts
  2. Failed File or Resource Access Attempts
  3. Unauthorized Changes to Users, Groups, and Services
  4. Systems Most Vulnerable to Attack
  5. Suspicious or Unauthorized Network Traffic Patterns

 

The problems with these approaches are:

 

  1. They are not good at solving the signal-to-noise problem (too many normal events crowd out the ones that could indicate a security breach).
  2. It is too difficult for those tools to correlate logs across systems. A hacker will use different credentials as they traverse the network, depending on where they log in, so the SIEM cannot easily follow the hacker from one end to the other (e.g., tracking from the hacker’s IP address to VPN credentials, firewall user login, Active Directory credentials, and UNIX credentials).
  3. Analysts have to respond to alerts manually, and most alerts are false positives.

At best, a SIEM can be used for forensics, i.e., after the fact, to try to trace a hacker’s activity through the system, since it lets one query logs gathered from multiple sources.

 

Algorithms:  Complex Academics, but Simpler Implementation

What is needed are algorithms that can sort through these events in real time.  That is what modern analytics and big data have made possible today.

Here are some of the details for those who want to dig in. For example, we have PDFs from Stanford used in their course “Data Mining for Cyber Security,” which include an excellent, technical presentation by Google on “Data Mining at Google.” You can also choose to just use cybersecurity products without understanding how they really work, but where’s the fun in that?

 

Anomaly Detection: A Rather Simple Explanation

One way to flag anomalies is to use different kinds of machine learning. For example, there is the easily understood technique of classification and the more complex technique of graph analysis.

The most common illustration of graph analysis and classification is mapping the relationships of people on Facebook. These are drawn as a graph, which is a set of nodes and edges, like this example taken from the lecture “Mining Massive Datasets,” available as an online course on Coursera.

[Graph of Facebook friendships, from the “Mining Massive Datasets” Coursera course]

Each of these dots is a person (node), and the lines drawn between nodes (called edges) represent friendships. A cluster of nodes represents something those people have in common; in this example, the crowded part of the graph is Stanford University students who play soccer. For cybersecurity, the nodes would be users, botnets, network resources, traffic types, and so on, and the edges would represent relationships between them. Classification would then be used to flag a series of events as a threat or non-threat.
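To make the classification idea concrete, here is a minimal R sketch that trains a logistic-regression classifier on past events labeled threat or non-threat and scores new events. The data frames (past_events, new_events) and feature names are hypothetical, and real products use far richer features and models:

# past_events: hypothetical data frame with a 0/1 threat label and some event features
model <- glm(threat ~ failed_logins + bytes_out + off_hours,
             data = past_events, family = binomial)
# Probability that each new event is a threat
predict(model, newdata = new_events, type = "response")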

 

Big Data for Cybersecurity

Here is a presentation from the IBM Watson Research Lab, “Big Data for Cybersecurity.” It’s a build-up to selling the IBM QRadar Security Intelligence Platform and the IBM Big Data Platform. IBM QRadar is a tool that does visualization, unstructured analysis, machine learning, and prediction.

[Diagram of how IBM’s streaming analytics is deployed in a network]

Graphics Source:  IBM Watson Research Lab

 

The diagram above shows how this approach and product are deployed in a network. IBM says streaming analytics “detects behaviors by correlating across diverse & massive data streams via Analytics in Motion.” That replaces the old way of looking at network traffic, which was signatures, or known attacks registered in cloud databases, similar to an antivirus database. Note that IBM says their models learn offline from stored data, after the fact. In this case, “learning” means training Bayes classifiers, neural networks, and other algorithms.

 

Protecting the Power Plant

If you have been to Tennessee, you have heard of the TVA, the Tennessee Valley Authority. It is so large you can hardly overlook this network of dams and rivers built across the state during the Great Depression to provide both jobs and power. Now much of that power comes not from water but from nuclear fuel, and that needs to be protected. So the Department of Defense has funded Oak Ridge National Laboratory to develop Oak Ridge Cyber Analytics (ORCA), which power companies can use to detect attacks on this critical infrastructure. Oak Ridge says, “ORCA is comprised of several components, each of which addresses widespread technology gaps in computer network defense.” Attack targets, they say, include home gateways, smart meters, substations, and the control room.

 

ORCA uses these algorithms to detect attacks (a toy example of one of them, Random Forests, follows the list):

 

  • OneR
  • Nearest Neighbor
  • Random Forests
  • Naive Bayes
  • SVM
  • JRip
  • Adaboost + JRip
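ORCA’s own code is not shown here, but all of these are standard machine-learning classifiers. As a flavor of what training one of them looks like in R, here is a minimal Random Forests sketch on hypothetical labeled network events (using the randomForest CRAN package; the data frames and column names are mine):

library(randomForest)
# labeled_events: hypothetical training data; the label must be a factor for classification
labeled_events$threat <- as.factor(labeled_events$threat)
rf <- randomForest(threat ~ ., data = labeled_events)
predict(rf, newdata = new_events)   # predicted threat / non-threat class for new events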

 

As to what all of this does, we let Oak Ridge explain:

 

“The Temporal Analysis component provides a time-based and connection-based view of the events relative to each other. The Cluster Analysis applies raw text analysis and clustering algorithms to provide a hierarchical organization of the events. The Particle Swarm Optimizer analyzes behavior patterns over time to highlight behaviors that are consistent with known attack patterns. ORCA Categorization automatically classifies alerts produced by the various IDS sources based on two different approaches: text classification using the terms and phrases in the alert text, and semi-supervised machine learning which classifies based on the network statistics”

Although they invite other power companies to participate in the project and call their product a framework, I don’t see where you can download any open source software.  I wrote to Oak Ridge and we will see what they say about that.

So there you have an overview of some of the analytics tools, concepts, algorithms, and products used in cybersecurity. You could write a whole book on this topic, and people have. If you want to dig really deep into all of this and increase your understanding, I would suggest you take the Stanford classes on data mining.

 

We will write more about analytics and big data in the next post.