Announcing ODE 0.3 release


Today, Opallios is pleased to announce the general availability of ODE 0.3, the Opallios Distribution of ELSA. The first version, ODE 0.1, was released two months ago, on July 14, 2015. Since then our development team has been hard at work on the further improvements to ELSA that we outlined in our first blog post. ODE 0.3 focuses on adding functionality our customers have asked for and on improving ELSA's charting experience. Besides some bug fixes, the changes in 0.3 fall into three main areas:

  1. Improved Charting and Dashboard User Experience
  2. New Aggregate Functions
  3. Integration with Fluentd

ODE 0.3 is available for download as Debian (deb) and RPM packages as well as pre-built AWS images. Currently, the deb package is supported on Ubuntu 12.x and 14.x, and the RPM has been successfully tested on Red Hat 6.5 and CentOS 6.6. For more details on downloading and installation, please refer to ODE's GitHub site, https://github.com/opallios/ode.

Charting

ELSA has leveraged Google Charts for visualization since the beginning. Google Charts, although extensive, lacks the responsiveness and polish of more modern JavaScript-based charting libraries. After discussion with Martin, the creator of ELSA, we agreed that visualization is the area where a change would make the most impact on user experience. Out of the many popular charting libraries available, we narrowed our investigation to two, NVD3 and Chart.js, as replacements for the Flash-based Google Charts.

We found NVD3's SVG rendering very attractive, but the library is not very flexible and is poorly documented. Its chart-reusability requirements also made it hard to integrate with ELSA's existing DOM-based code, which would have required dynamically loading SVG charts. Chart.js, by contrast, is a canvas-based library that is small and highly customizable; it is popular among open source projects and has a large user community. Weighing the pros and cons, we opted for Chart.js to replace Google Charts in ODE 0.3.

Chart.js produces responsive, animated, great-looking charts. Charting plays a big role in ELSA: it is used primarily for aggregated queries (the groupby clause) and for dashboards. With Chart.js we were able to make charts load two to three times faster and look much better. We also fixed some long-standing table and chart alignment issues for aggregated queries. Here are some snapshots:

[Screenshot: ode c1]

[Screenshot: ode c2]

On dashboards, we cleaned up the links for each chart and improved the chart layout. The result is cleaner, more attractive dashboards. Here is one example:

[Screenshot: ode c3]

[Screenshot: ode c4]

Chart.js has a strong, active community around it, which keeps enhancements and releases coming regularly. We will continue to pick up major Chart.js changes in ODE to further improve the user experience.

Analytical Functions

ODE 0.3 adds four new aggregate functions, which are known as transforms in ELSA. Transforms let you pass the results of a query to a backend plugin, and they serve as the backbone of ELSA analytics. The plugins that currently ship with ELSA include whois, dnsdb, and CIF (Collective Intelligence Framework), plus the utility transforms filter, grep, and sum. Users have asked for new plugins covering some of the basic aggregate functions, so ODE 0.3 adds the following four transforms:

  1. min – returns the minimum value of the given field over the query results. For example: “class:xxx | min(eventid)”
  2. max – returns the maximum value of the given field over the query results. For example: “class:xxx | max(eventid)”
  3. avg – returns the average value of the given field over the query results. For example: “class:xxx | avg(bytes)”
  4. median – returns the median value of the given field over the query results. For example: “class:xxx | median(bytes)”

 

A few key things to note:

 

  • Because ELSA queries return only 100 records by default, you have to set the limit directive to apply an aggregation over more than 100 records. For example: “class:xxx nobatch:1 timeout:0 limit:0 | max(eventid)”
  • You can pipe multiple transforms within the same query
  • The field passed to these aggregate functions has to be numeric (the sketch below illustrates what each transform computes)
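
For intuition, here is a small illustrative sketch (plain Python, not ELSA internals) of what these transforms compute over the records a query returns; the field name bytes is just an example:

  # Illustrative only: what min/max/avg/median compute over a query's results.
  # ELSA implements these as transform plugins; this is not ELSA code.
  import statistics

  records = [{"host": "10.0.0.1", "bytes": 512},
             {"host": "10.0.0.2", "bytes": 2048},
             {"host": "10.0.0.3", "bytes": 128},
             {"host": "10.0.0.4", "bytes": 1024}]

  values = [r["bytes"] for r in records]       # the field must be numeric

  print(min(values), max(values))              # min(bytes), max(bytes)
  print(sum(values) / len(values))             # avg(bytes)
  print(statistics.median(values))             # median(bytes)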

Fluentd

ELSA was created as a centralized syslog server and as such has syslog-ng at its heart. With syslog-ng you can collect logs from any source, both structured and unstructured, and process them in near real time. Syslog-ng's powerful db-parser() lets you extract information and tag messages for later classification. Even though db-parser() is generic and supports various log formats, its parsing language can become quite complex depending on the log format. In one of our projects we had to process a large number of nested JSON messages; after quite a bit of experimenting with syslog-ng's JSON parser we were still unable to build parsing that met our needs, and we ended up writing a JSON parser in Java and passing its output through syslog-ng for ELSA to consume.

Syslog-ng is reliable most of the time, but there are cases where its parser comes up a little short. Enter Fluentd: a simple but extensible log collector with over 100 plugins for parsing different log formats. It is widely used in industry and has a strong, active community behind it. We think Fluentd can work well alongside syslog-ng in ODE.

By integrating Fluentd we open ODE up to a wide range of data sources. ODE 0.3 ships with Fluentd configuration for the following log types:

  1. JSON
  2. NetFlow
  3. Apache Log

Users can easily add more Fluentd plugins as needed. The current configuration allows for both stream- and file-based inputs. Fluentd is disabled by default in ODE 0.3 and requires some manual configuration to enable. Messages chosen for Fluentd processing first enter Fluentd and are then processed by syslog-ng before being consumed by ELSA; Fluentd transforms the messages into a form that syslog-ng can easily parse. A sketch of what such a configuration can look like follows below.
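
For illustration, here is a minimal sketch of a Fluentd source for the Apache log case. The file paths, the tag, and the plain file output are assumptions made for this example only; the configuration that actually ships with ODE forwards the parsed records on to syslog-ng and is documented on the ODE site (see below).

  # Sketch only: tail an Apache access log and emit structured records.
  # Paths, tag, and the file output here are hypothetical; ODE's shipped
  # configuration hands the parsed records to syslog-ng instead.
  <source>
    type tail                            # follow the log file as it grows
    path /var/log/apache2/access.log     # assumed log location
    pos_file /var/log/fluentd/apache.pos # remembers the read position
    format apache2                       # Fluentd's built-in Apache access-log parser
    tag ode.apache
  </source>

  <match ode.**>
    type file                            # write parsed records out for this sketch
    path /var/log/fluentd/ode-apache
  </match>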

[Diagram: ode-c6]

Fluentd installation and configuration for ODE is documented in more detail on the ODE site: http://opalliosode.org/documentation/fluentd/.

What’s Next?

Opallios is committed to continuing to improve ELSA. With our first release, ODE 0.1, we made the installation process smoother; with ODE 0.3 we are improving the user experience and adding new functionality. In the coming months we'll continue to work on usability features and add new functions requested by our customers. Here are some of the items we're currently looking at for our next release, ODE 0.5:

  1. Support ODE installation on newer OS versions
  2. Speed up the installation process
  3. More analytical functions
  4. Better documentation with use cases
  5. Tighter integration with Fluentd
  6. Bug fixes

Our users can shape the future roadmap of ODE by getting involved in product development and requesting new features. Drop us a line so we can prioritize our ODE work list based on what's important to you. We appreciate your feedback and comments: http://opalliosode.org/feature-request/.

About Opallios

Opallios provides software consulting for big data analytics and cloud computing. Over the last five years it has helped several organizations simplify their big data projects and develop cloud-based software. Opallios's business model is built on long-term partnerships that help clients lower their software development and maintenance costs by offering expertise customized to each client's needs and requirements.

 



Fraud Analytics – Financial Industry

Analytics can be used to flag credit card transactions as fraudulent.  Here we give an overview of how that is done.

 

First of all, know that such tools are not perfect, which means card processors and retailers still have to check many transactions manually. Consider this: a statistical model that is 99% accurate, or even 99.9% accurate, is by definition not 100% accurate. The credit card processor therefore has to manually check a certain number of flagged transactions to verify whether they are fraudulent, and there is a cost to doing this. So such systems operate on a cost-of-checking versus financial-cost-of-fraud basis to establish a tolerable threshold for fraud. In other words, it is not cost effective to pursue a model that has no flaws.

Below we give a brief survey of statistical techniques used for credit card fraud detection and then look briefly at one product, Falcon, which uses neural networks to identify fraudulent transactions.

 

Academic Background

When someone writes software for analytics, they typically start with ideas developed by academics, the people who discover the techniques and write the algorithms that make such tools possible. Here we draw upon the paper “Statistical Fraud Detection: A Review” by Richard J. Bolton and David J. Hand to give background on how analytics is applied to credit card fraud detection.

 

How much money are we talking about here? This document gives some information about that:

 

  • In 2010, 33% of credit and debit card customers in the world reported fraud on their account in the past 5 years.
  • There was $5.5 billion in credit card fraud in 2012 of which $3.56 billion was in the USA.
  • There were $955 million in losses due to debit card fraud in the USA in 2010.
  • There were $2 trillion total purchases in USA in 2012 using Amex, Discover, MasterCard and Visa cards.

 

Bolton and Hand explain that there are two types of analytic techniques:  supervised and unsupervised.

 

Supervised

Supervised fraud analysis means using data from known fraud cases (a minimal sketch follows the list below). It:

 

  • Uses samples of both fraudulent and non-fraudulent records to construct and train models
  • Assigns new observations into one of the two classes (likely fraud, likely legitimate)
  • Only works if this type of fraud has previously occurred
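
To make the idea concrete, here is a minimal, illustrative sketch of a supervised classifier trained on labeled transactions. The feature choices and the use of scikit-learn's logistic regression are assumptions for this example, not a description of any particular vendor's system.

  # Illustrative sketch only: fit a logistic-regression classifier on labeled
  # transactions (1 = known fraud) and score a new one. Features are made up.
  import numpy as np
  from sklearn.linear_model import LogisticRegression

  # Each row: [amount, hour_of_day, merchant_risk_score]
  X_train = np.array([[12.50, 14, 0.1],
                      [900.0,  3, 0.8],
                      [45.00, 11, 0.2],
                      [750.0,  2, 0.9]])
  y_train = np.array([0, 1, 0, 1])

  model = LogisticRegression()
  model.fit(X_train, y_train)

  # Probability that a new observation belongs to the fraud class.
  new_txn = np.array([[820.0, 4, 0.7]])
  print(model.predict_proba(new_txn)[0, 1])   # closer to 1.0 -> more likely fraud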

 

Unsupervised

Unsupervised fraud analysis looks for deviations of account transactions and customer data from observed norms; that is, it looks for outliers or other statistically significant events. The goal is to calculate what is called a “suspicion score,” as sketched below.
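
A crude way to illustrate the idea, with no claim about any particular product: score each new transaction by how far it deviates from the account's own history, and treat large deviations as a high suspicion score. The amounts and the cutoff below are made up.

  # Illustrative sketch only: a z-score "suspicion score" per transaction,
  # measured against one account's spending history. The 3.0 cutoff is arbitrary.
  import numpy as np

  history = np.array([22.0, 35.5, 18.0, 40.0, 27.5, 31.0])   # past amounts for one account
  mu, sigma = history.mean(), history.std()

  new_amounts = np.array([29.0, 480.0])                      # incoming transactions
  suspicion = np.abs(new_amounts - mu) / sigma               # deviation from the norm

  for amount, score in zip(new_amounts, suspicion):
      flag = "review" if score > 3.0 else "ok"
      print(f"{amount:8.2f}  suspicion={score:5.2f}  {flag}")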

 

Problems with the Models

Bolton and Hand explain that these systems are not 100% accurate. To show what that means in practical terms, they give an example: suppose a credit card risk analysis system can:

  • Correctly identify 99% of legit transactions as legit
  • Correctly identify 99% of fraudulent transactions as fraudulent

 

Now suppose that in actuality 1 in 1,000, or 0.1%, of transactions are fraudulent.

 

Out of 1,000 transactions, then, roughly 999 are legitimate and 1 is fraudulent. The model will almost certainly flag the 1 real fraud, but it will also misclassify about 1% of the 999 legitimate transactions, roughly 10 of them, as fraudulent. So it flags about 11 transactions in total, of which only 1 is actually fraudulent. The other 10 or so have to be checked by calling each customer or doing some other kind of manual investigation. That takes time and costs money. Can the models do any better than that?
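
Here is the quick arithmetic behind those numbers, using the rates assumed in the example above:

  # Expected outcomes for 1,000 transactions under the example's assumptions.
  n = 1000
  fraud_rate = 0.001              # 1 in 1,000 transactions is actually fraudulent
  hit_rate = 0.99                 # 99% of fraud is correctly flagged
  pass_rate = 0.99                # 99% of legitimate transactions correctly pass

  frauds = n * fraud_rate                    # 1 actual fraud
  legits = n - frauds                        # 999 legitimate transactions

  true_flags = frauds * hit_rate             # ~1 real fraud flagged
  false_flags = legits * (1 - pass_rate)     # ~10 legitimate transactions flagged by mistake

  print(true_flags + false_flags)                   # ~11 transactions flagged in total
  print(true_flags / (true_flags + false_flags))    # only ~9% of the flags are real fraud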

Among supervised approaches, a classification model can be tuned to flag only 0.1% of transactions as fraudulent, which in the example above would match the actual fraud rate exactly. But suppose that in a given batch only 0.04% of transactions are actually fraudulent. Then of the 10 in 10,000 flagged by the model, only about 4 are actually fraudulent and 6 are legitimate, and there is still a cost to checking all 10. So a cost-weighted acceptable-loss threshold is set.

 

Supervised fraud detection tools

Here are some of the supervised fraud classification techniques. We only cite them here and do not go into detail about how they work; you are encouraged to investigate further if you want to dig deeper.

  • linear discriminant analysis
  • logistic discrimination
  • neural networks

 

And then there are the rule-based techniques:

 

  • BAYES
  • FOIL
  • RIPPER

Link analysis is another technique. It applies the methods used for mining social media networks (or any other kind of graph, i.e., vertices and edges) to work out, for example, whether someone linked to someone else is phoning in credit card transactions using the same fraud technique.

 

Unsupervised fraud detection tools

Unsupervised fraud detection tools are used when there are no prior labeled legitimate or fraudulent observations available upon which to base decisions.

In this case, statistics are used to profile transactions and detect outliers. Some of the techniques used here are similar to those used in text analysis.

One technique used to detect fraud is an application of the rather esoteric and not-at-all-intuitive Benford's law. That law says that the leading digits of naturally occurring financial figures follow a known frequency distribution. For example, you might expect the leading digit of a dollar amount to be 1 about one time in nine, since there are nine possible leading digits. In fact it is 1 about 30% of the time. (You would have to read up on your own to understand why.) So if a batch of transactions deviates markedly from this pattern, it warrants a closer look for fraud.
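
As a rough sketch of how such a check might be applied (the expected frequencies come from Benford's formula P(d) = log10(1 + 1/d); the transaction amounts below are made up):

  # Illustrative sketch only: compare observed leading-digit frequencies against
  # Benford's law. A real check would use a large batch of transactions and a
  # proper goodness-of-fit test (e.g. chi-squared).
  import math
  from collections import Counter

  amounts = [132.50, 19.99, 1450.00, 27.30, 118.75, 9.60, 1023.00, 310.00, 16.40, 154.20]

  leading = [int(str(a).lstrip("0.")[0]) for a in amounts]   # first significant digit
  observed = Counter(leading)

  for d in range(1, 10):
      expected = math.log10(1 + 1 / d)        # Benford's expected frequency for digit d
      actual = observed.get(d, 0) / len(amounts)
      print(f"digit {d}: expected {expected:.1%}, observed {actual:.1%}")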

 

Falcon software

Anyone in the USA who has obtained a mortgage or other loan has come to hate the organization FICO. They assign loan applicants a risk-based number called a credit score. Since they are in the business of detecting risk, FICO has also acquired the Falcon software, which uses neural networks to detect fraud.

 

Here is a graphic from FICO giving a view of the Falcon analytics platform.

[Graphic: FICO Falcon analytics platform]

You can tell by reading their product literature that this is a supervised-learning classification system using neural networks. Since it plugs into a merchant's POS cashier terminals, it can be used to detect fraudulent transactions right in the store (though I am not sure what a sales clerk is supposed to do when someone standing at the register is flagged as a criminal). Such a system could also be plugged into the merchant's ecommerce web system.

The product literature also says that FICO keeps credit profiles on card holders; we already knew about that, and it helps them do classification. They also say that their software includes adaptive analytics, meaning it responds to up-to-the-minute fraudulent activity to update the model. This, they say, improves the model by 10% as it learns in real time.

The FICO system can be deployed as a cloud solution, or FICO provides its APIs and framework so that a company can build fraud detection into its own platform.

So that is a basic overview of how analytics is used to detect credit card fraud. As you can see, these techniques also have applications to assessing other kinds of financial risk.