Cybersecurity & Analytics – a Tango

Analytics and big data provide a quantum leap forward over earlier means of detecting a hacker attack or misuse of computing assets by employees. Software vendors and open source projects have taken ideas from data scientists and applied those algorithms to streaming data. Those algorithms, and products like Splunk and Apache Spark together with Hadoop, have made it possible to sort through and process vast amounts of log and network data in a way that would not have been possible just six years ago.

Presumably all of this has given a large boost to the effort to defeat hackers. But if you read the news, you can see that it does not work in all cases, as hackers still seem able to steal data at will. One problem is that companies generally do not publish data about which attacks they have blocked, so we cannot really tell how well all this works. What they have done is respond to surveys by groups like the Ponemon Institute about the extent to which they use analytics and big data as part of their cyber defense. One would hope that as analytics and cybersecurity come together, and as their adoption and understanding become more widespread, companies might be able to avoid cyber disasters like those of the past.

In this short blog post we look at one approach to classifying events as a threat and then mention two cybersecurity analytics products: one from IBM and the other from the US Government Oak Ridge Labs.


Cybersecurity Before Analytics

Consider the way that cybersecurity works when you do not have the tools to do real-time analytics. A SIEM (Security Information and Event Management) product like ArcSight is programmed to gather up logs and apply rules to flash alerts when, for example:

  • A former employee tries to log in.
  • Someone tries and fails to log in X number of times.
  • Someone who normally works in the office is trying to log in remotely.
  • There is network traffic of a type that is not usually seen on that subnet.
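
Rules like these are simple to express in code. As a minimal sketch of the failed-login rule, assuming a hypothetical event format of (timestamp, user, succeeded) tuples and made-up thresholds (this is not how ArcSight or any particular SIEM is configured), it might look like this in Python:

```python
from collections import defaultdict, deque

# Hypothetical threshold and window; a real SIEM would make these configurable.
MAX_FAILURES = 3
WINDOW_SECONDS = 60


def failed_login_alerts(events, max_failures=MAX_FAILURES, window=WINDOW_SECONDS):
    """Return (user, timestamp) pairs at which a user reaches max_failures
    failed logins within the trailing `window` seconds.

    `events` is an iterable of (timestamp_seconds, user, succeeded) tuples,
    assumed to be sorted by timestamp.
    """
    recent = defaultdict(deque)  # user -> timestamps of recent failures
    alerts = []
    for ts, user, succeeded in events:
        if succeeded:
            continue
        failures = recent[user]
        failures.append(ts)
        # Drop failures that have fallen out of the time window.
        while failures and ts - failures[0] > window:
            failures.popleft()
        if len(failures) >= max_failures:
            alerts.append((user, ts))
    return alerts


# Made-up sample data for illustration only.
sample_events = [
    (0, "alice", False),
    (10, "alice", False),
    (15, "bob", False),
    (20, "alice", False),   # third failure inside 60 seconds -> alert
    (200, "bob", False),    # bob's earlier failure has expired by now
]
print(failed_login_alerts(sample_events))
```

A rule like this fires on every user who mistypes a password three times, which is exactly the kind of noise discussed next.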


And then it provides dashboard reports like the SANS top 5:


  1. Attempts to Gain Access through Existing Accounts
  2. Failed File or Resource Access Attempts
  3. Unauthorized Changes to Users, Groups, and Service
  4. Systems Most Vulnerable to Attack
  5. Suspicious or Unauthorized Network Traffic Patterns


The problems with these approaches are:


  1. They are not good at solving the signal-to-noise problem (too many normal events crowding out the ones that could indicate a security breach).
  2. It is too difficult for those tools to correlate logs across systems, because a hacker will use different credentials as they traverse the network, depending on where they log in. So the SIEM cannot easily follow the hacker from one end to the other (e.g., tracking from a hacker’s IP address to VPN credentials, firewall user login, Active Directory credentials, and UNIX credentials).
  3. Users have to respond to alerts manually, most of which are false positives.

At best SIEM can be used for forensics, i.e., after the fact to try to track a hacker’s activity through the system, since it lets one query logs gathered from multiple sources.


Algorithms:  Complex Academics, but Simpler Implementation

What is needed are algorithms that can sort through these events in real time.  That is what modern analytics and big data have made possible today.

Here are some of the details for those who want to dig into them. For example, we have PDFs from Stanford used in their course “Data Mining for Cyber Security.” These include an excellent technical presentation by Google on “Data Mining at Google.” You can also choose to just use cybersecurity products without understanding how they really work, but where’s the fun in that?


Anomaly Detection:  A Rather Simple Explanation

One way to flag anomalies is to use different kinds of machine learning. For example, there is the easily understood technique of classification and its more complex implementation, graph analysis.

The most common explanation of graph analysis and classification is mapping the relationships among people on Facebook. These are drawn as a graph, which is a series of nodes and edges, like this example drawn from the lecture “Mining Massive Datasets,” available as an online course at Coursera.

[Image: Facebook chart]

Each of these dots is a person (node), and the lines drawn between nodes (called edges) represent friendships. A cluster of nodes represents something these people have in common. In this example, the crowded part of the graph is Stanford University students who play soccer. For cybersecurity, the nodes would be users, botnets, network resources, traffic types, or other entities, and the edges would represent relationships between those. Classification would be used to flag a series of events as a threat or non-threat.
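
To make the nodes-and-edges idea concrete, here is a minimal sketch in Python. It assumes made-up node names for users and hosts, and it uses plain connected components found by breadth-first search as a simple stand-in for clustering. The course lecture and commercial products use more sophisticated community-detection methods.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def connected_clusters(graph):
    """Return clusters (sorted lists of nodes) found by breadth-first search."""
    seen = set()
    clusters = []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        cluster = []
        while queue:
            node = queue.popleft()
            cluster.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        clusters.append(sorted(cluster))
    return clusters


# Hypothetical edges: "this user logged in to this host".
sample_edges = [
    ("user:alice", "host:web01"),
    ("user:bob", "host:web01"),
    ("user:carol", "host:db01"),
    ("user:mallory", "host:db01"),
    ("user:mallory", "host:hr01"),
]
clusters = connected_clusters(build_graph(sample_edges))
for cluster in clusters:
    print(cluster)
```

In this toy data, one account touches both the database host and the HR host, which pulls them into the same cluster. That is the kind of relationship an analyst might want to look at more closely.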


Big Data for Cybersecurity

Here is a presentation from the IBM Watson Research Lab “Big Data for Cybersecurity.” It’s a build up to selling the IBM QRadar Security Intelligence Platform and IBM Big Data Platform. IBM QRadar is a tool that does visualization, unstructured analysis, machine learning, and prediction.

[Image: streaming analytics]

Graphics Source:  IBM Watson Research Lab


The graphic above shows how this approach and product are deployed into a network. IBM says streaming analytics “detects behaviors by correlating across diverse & massive data streams via Analytics in Motion.” That replaces the old way of looking at network traffic, which relied on signatures, or known attacks registered in cloud databases, similar to an antivirus database. Note that IBM says that their models learn offline with data stored after the fact. In this case “learning” means training Bayes, neural networks, and other algorithms.
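
The split between offline learning and scoring live events can be sketched in a few lines of Python. This is only an illustration of the general idea using a small naive Bayes classifier with add-one smoothing. The event tokens and labels are invented, and nothing here reflects how IBM's products are actually implemented.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_events):
    """Offline step: count feature occurrences per label.

    `labeled_events` is a list of (features, label) pairs, where features
    is a list of string tokens describing one stored event.
    """
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in labeled_events:
        label_counts[label] += 1
        for f in features:
            feature_counts[label][f] += 1
            vocabulary.add(f)
    return label_counts, feature_counts, vocabulary


def classify(model, features):
    """Online step: return the most probable label for one new event."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior plus log likelihoods with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total)
        denom = sum(feature_counts[label].values()) + len(vocabulary)
        for f in features:
            score += math.log((feature_counts[label][f] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled history, stored after the fact.
history = [
    (["remote", "night", "failed_login"], "threat"),
    (["remote", "night", "large_upload"], "threat"),
    (["office", "day", "login_ok"], "normal"),
    (["office", "day", "file_read"], "normal"),
    (["remote", "day", "login_ok"], "normal"),
]
model = train_naive_bayes(history)

# New events arriving on the stream are scored against the trained model.
print(classify(model, ["remote", "night", "failed_login"]))
print(classify(model, ["office", "day", "login_ok"]))
```

The training step can run as a batch job over stored data, while the classify step is cheap enough to run on each event as it arrives.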


Protecting the Power Plant

If you have been to Tennessee, you have heard of the TVA, the Tennessee Valley Authority. It’s so large you can hardly overlook this network of dams and rivers, built across the state in the Great Depression to provide both jobs and power. Now, much of that power comes not from water but from nuclear fuel. That needs to be protected. So the Department of Defense has funded the Oak Ridge Lab to develop Oak Ridge Cyber Analytics (ORCA), which power companies can use to detect attacks on this critical infrastructure. Oak Ridge says, “ORCA is comprised of several components, each of which addresses widespread technology gaps in computer network defense.” Attack targets, they say, include home gateways, smart meters, substations, and the control room.


ORCA uses these algorithms to detect attacks:


  • OneR
  • Nearest Neighbor
  • Random Forests
  • Naive Bayes
  • SVM
  • JRip
  • Adaboost + JRip
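
OneR is the simplest algorithm on this list, so it makes a good illustration. It picks the single feature whose values best predict the label and uses that one rule for every prediction. The following Python sketch is a textbook-style version written for this post. The feature names and sample rows are invented, and it is not ORCA's code.

```python
from collections import Counter, defaultdict


def train_one_r(rows, labels):
    """Return (feature_index, rule, default) for the single best feature.

    For each feature, map every observed value to its most common label,
    then keep the feature whose rule makes the fewest training errors.
    """
    default = Counter(labels).most_common(1)[0][0]
    best = None
    for i in range(len(rows[0])):
        by_value = defaultdict(Counter)
        for row, label in zip(rows, labels):
            by_value[row[i]][label] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(rule[row[i]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, i, rule)
    _, index, rule = best
    return index, rule, default


def predict_one_r(model, row):
    """Apply the one rule; fall back to the default label for unseen values."""
    index, rule, default = model
    return rule.get(row[index], default)


# Hypothetical features: (protocol, port_class, payload_size).
rows = [
    ("tcp", "well_known", "small"),
    ("tcp", "well_known", "large"),
    ("udp", "high", "large"),
    ("tcp", "high", "large"),
    ("udp", "well_known", "small"),
    ("tcp", "high", "small"),
]
labels = ["normal", "normal", "attack", "attack", "normal", "attack"]

one_r_model = train_one_r(rows, labels)
print(one_r_model[0], one_r_model[1])
print(predict_one_r(one_r_model, ("udp", "high", "small")))
```

In this toy data the port class alone separates the two labels, so OneR selects it. The other algorithms on the list combine many features and are correspondingly harder to read.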


As to what all of this does, we let Oak Ridge explain:


“The Temporal Analysis component provides a time-based and connection-based view of the events relative to each other. The Cluster Analysis applies raw text analysis and clustering algorithms to provide a hierarchical organization of the events. The Particle Swarm Optimizer analyzes behavior patterns over time to highlight behaviors that are consistent with known attack patterns. ORCA Categorization automatically classifies alerts produced by the various IDS sources based on two different approaches: text classification using the terms and phrases in the alert text, and semi-supervised machine learning which classifies based on the network statistics”

Although they invite other power companies to participate in the project and call their product a framework, I don’t see where you can download any open source software.  I wrote to Oak Ridge and we will see what they say about that.

So there you have an overview of some of the analytics tools, concepts, algorithms, and products used in cybersecurity. You could write a whole book on this topic, and people have.  If you want to dig really deep into all of this to increase your understanding I would suggest you take the Stanford classes on data mining.


We will write more about analytics and big data in the next post.
