Monday, June 22, 2009

Examining PMML 4.0 - Part I: Pre-Processing

You may be wondering what all the fuss is about PMML and its 4.0 version. So, we decided to explore all that PMML 4.0 has to offer in a series of blog posts. In part I, we will be exploring its improved pre-processing capabilities.

All data mining models manipulate the raw data in one way or another before passing it through a neural network, support vector machine, or regression model. Therefore, a language that wants to represent all the computations that go into a model also needs to be able to represent the data transformations that were applied to the raw data before scoring takes place. PMML is this language! It is the Yin and Yang of data mining.

Let's first recap the pre-processing capabilities available in PMML 3.2. This version of PMML allows for the following out-of-the-box data transformations:
  • Normalization of continuous variables: this is accomplished via the NormContinuous element of PMML. It is mostly used to normalize a variable between 0 and 1. See example below (real PMML code) in which two variables are normalized, the first between 0 and 1 and the second between 0 and 4.
  • Normalizing Categorical Inputs: normally used to transform strings into numerical variables. This is accomplished by the element NormDiscrete. In the PMML example below, a categorical variable creates dummy variables that will be assigned values 1 or 0 depending on the category assumed by the input variable.
  • Discretization: this is used to transform continuous variables into strings. This is accomplished by the Discretize element. In the PMML example below, if the input variable is equal to 500, it is transformed to low; if equal to 5000, it is transformed to medium; and if 50,000, it is high.
  • Value Mapping: this is accomplished in PMML by the use of a mapping table and the element MapValues. To make things more interesting, in the PMML example below, we combine elements MapValues and NormDiscrete to group small sets of categorical values. Specifically, we want to find out if the input variable belongs to a specific group of colors. We do that by using MapValues to map different colors to the same number. We then use the element NormDiscrete to create dummy variables which are used to indicate group membership.
  • Arithmetic Expressions: PMML offers a range of arithmetic functions (as well as string and date/time manipulation functions) that can be arranged in different ways to express complex arithmetic expressions. The example below solves the following operation:

  • PMML 4.0 - Boolean Operations: Not only does PMML 4.0 allow Boolean operations to be fully expressed, but it also allows these to be nested into IF-THEN-ELSE logic. These new built-in functions offer a vast new array of possibilities for representing data transformations in PMML. So, we devote the rest of this review to looking at transformations that can now be easily expressed in PMML 4.0.
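Before moving on, here is a rough idea of what the PMML 3.2 transformations listed above compute, sketched in plain Python rather than PMML. The function names mirror the PMML elements, but the bin boundaries, the color table, and the helper is_warm_color are hypothetical values chosen for illustration. The sketch also ignores PMML's own handling of outliers and missing values.

```python
def norm_continuous(x, points):
    """Piecewise-linear interpolation through (original, normalized) pairs,
    which must be sorted by the original value."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (x - x0) * (y1 - y0) / (x1 - x0)
    # Simplification: out-of-range inputs are rejected in this sketch.
    raise ValueError(f"{x} lies outside the normalization range")


def norm_discrete(value, category):
    """Dummy variable: 1.0 if the input equals the category, else 0.0."""
    return 1.0 if value == category else 0.0


def discretize(x, bins):
    """Return the label of the first half-open interval [low, high) holding x."""
    for low, high, label in bins:
        if low <= x < high:
            return label
    return None  # no interval matched


def map_values(value, table, default=None):
    """Look the input up in a mapping table."""
    return table.get(value, default)


# Hypothetical bin boundaries; the post only gives 500, 5000 and 50,000
# as example inputs for low, medium and high.
AMOUNT_BINS = [
    (0, 1000, "low"),
    (1000, 10000, "medium"),
    (10000, float("inf"), "high"),
]

# Hypothetical color table: several colors share one group number.
COLOR_GROUPS = {"red": 1, "orange": 1, "yellow": 1, "blue": 2, "green": 2}


def is_warm_color(color):
    """MapValues followed by NormDiscrete: membership flag for group 1."""
    return norm_discrete(map_values(color, COLOR_GROUPS), 1)


print(norm_continuous(5, [(0, 0.0), (10, 1.0)]))  # scaled into [0, 1]
print(norm_continuous(5, [(0, 0.0), (10, 4.0)]))  # scaled into [0, 4]
print(discretize(5000, AMOUNT_BINS))
print(is_warm_color("orange"), is_warm_color("blue"))
```

Chaining map_values into norm_discrete is the same grouping trick described in the Value Mapping bullet: many categories collapse to one number, and the dummy variable then flags membership in that group.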
We start with the PMML code below which implements the following logical and arithmetic operations:
IF InputVar1 == "Partner" THEN DerivedVar1 = "P" ELSE DerivedVar2 = 2 * InputVar2

Note that it uses the newly defined 4.0 functions: "if", "equal", and "not" as well as function "*".
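As a rough Python sketch of the logic above (not the PMML itself), the two derived variables can be computed as follows. It assumes that a derived variable whose branch is not taken is left missing, represented here by None; the function name is hypothetical.

```python
def derive_first_example(input_var1, input_var2):
    """Return (DerivedVar1, DerivedVar2) for the IF-THEN-ELSE rule above."""
    is_partner = input_var1 == "Partner"  # plays the role of "equal"

    # THEN branch: only DerivedVar1 receives a value.
    derived_var1 = "P" if is_partner else None

    # ELSE branch: "not" of the same test guards DerivedVar2, computed with "*".
    derived_var2 = 2 * input_var2 if not is_partner else None

    return derived_var1, derived_var2


print(derive_first_example("Partner", 10))
print(derive_first_example("Client", 10))
```

Because the two branches write to different variables, each derived variable needs its own condition, which is why "not" appears alongside "equal".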

The PMML code below assumes that both "then" and "else" parts of the "if" use the same derived variable to implement the following operations:
IF InputVar1 == "Partner" THEN DerivedVar1 = 5.1 * InputVar2 ELSE DerivedVar1 = InputVar2 / 3.3
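When both branches feed the same derived variable, the whole rule collapses into a single nested function call. The Python sketch below mimics that nesting. Here pmml_if is a hypothetical stand-in for the PMML "if" function, and the operator module supplies the comparison and arithmetic steps.

```python
import operator


def pmml_if(condition, then_value, else_value=None):
    """Stand-in for a three-argument "if": pick one of two computed values."""
    return then_value if condition else else_value


def derived_var1(input_var1, input_var2):
    """One derived variable, with both branches nested inside a single call."""
    return pmml_if(
        operator.eq(input_var1, "Partner"),  # condition
        operator.mul(5.1, input_var2),       # THEN: 5.1 * InputVar2
        operator.truediv(input_var2, 3.3),   # ELSE: InputVar2 / 3.3
    )


print(derived_var1("Partner", 2.0))
print(derived_var1("Client", 6.6))
```

Note that Python evaluates both branch expressions before pmml_if chooses between them, which is harmless here because neither expression can fail for numeric input.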

Finally, we end our list of PMML pre-processing examples by showing the use of 4.0 functions "isMissing" and "isIn" combined with function "if". The PMML example below implements the following operations:
IF InputVar is missing THEN DerivedVar = 1 ELSE (IF InputVar is in ("Partner", "Associate", "Colleague") THEN DerivedVar = 2 ELSE DerivedVar = 3)
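The nested rule above can be sketched in Python as shown below. This is only an illustration of the logic, assuming a missing input is represented by None; the function name is hypothetical.

```python
PARTNER_LIKE = ("Partner", "Associate", "Colleague")


def derive_from_membership(input_var):
    """Return DerivedVar for the nested missing-value and membership rule."""
    if input_var is None:          # plays the role of "isMissing"
        return 1
    if input_var in PARTNER_LIKE:  # plays the role of "isIn"
        return 2
    return 3                       # inner ELSE branch


for value in (None, "Associate", "Vendor"):
    print(value, derive_from_membership(value))
```

Checking for a missing value first matters: the membership test only runs on inputs that are actually present.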

We finish part I of our PMML tour hoping that this short description of its pre-processing capabilities can help you to easily navigate through all the data transformations available in PMML 4.0.

Tuesday, June 16, 2009

PMML 4.0 is here!

The DMG (Data Mining Group) has just released PMML 4.0, the latest and greatest version of the Predictive Model Markup Language.

Zementis, together with SPSS, SAS, IBM, Open Data Group, Salford Systems, MicroStrategy, and all the other contributing members of the DMG, is proud to be part of the making of PMML, the de facto standard to represent data mining models.

Not only can PMML represent a wide range of statistical techniques, but it can also be used to represent the data transformations necessary to transform raw data into meaningful feature detectors. In this way, PMML offers a standard to represent data manipulation and modeling in a single concise way.

Improved Pre-Processing Capabilities

PMML 4.0 extends the range of pre-processing capabilities supported by older versions by adding a range of boolean operations (e.g., and, or, not, equal, notEqual, greaterOrEqual, ...) to the list of built-in functions. These, combined with an IF-THEN-ELSE function which is also new to PMML, allow for the representation of a wide range of feature detectors.

For examples on how to use these new pre-processing capabilities as well as all the standard PMML transformations, please check the PMML Data Pre-Processing Primer.

Time Series Models

PMML 4.0 also extends the existing standard by allowing for the representation of Time Series Models. In particular, it allows data miners and data mining tools to represent Exponential Smoothing models and offers placeholders for ARIMA, Seasonal Trend Decomposition, and Spectral Analysis, which are to be supported in the near future.

Model Explanation

Other additions are Model Explanation and Multiple Models. Model Explanation allows for evaluation and model performance measures to be part of the PMML file itself. In this way, the file can define not only the data manipulation and the models, but also the associated ROC Graph, Gains/Lift Charts, Confusion Matrix, Field Correlations, Univariate Statistics, and more.

Multiple Models

Multiple Models allows for model composition, ensembles, and segmentation. It replaces the old Model Composition element to offer greater flexibility for combining different model types, such as regression and decision trees.

Extending Existing Elements

Last, but not least, PMML 4.0 offers a range of extensions to existing elements, such as the addition of multi-class classification for Support Vector Machines, improved representation for Association Rules, and the addition of Cox Regression Models.

There is no doubt that PMML is here to stay. The announcement of PMML 4.0 attests to the commitment of the leading data mining vendors to be able to represent their solutions through a single language, a language that can be understood by all. It is our vision that users will be free to share models among many solutions, benefiting from an environment in which interoperability is truly attainable.

For more information on PMML and a list of useful links, please check PMML 101. Also, check the article "PMML: An Open Standard for Sharing Models" just published in The R Journal.

We also invite the entire community to join our on-going PMML discussion at the AnalyticBridge website.

Monday, June 1, 2009

How to Score 300,000,000 Customer Records for $3

This posting has been moved to the Zementis Support Site. You can still access it by clicking HERE.

Copyright © 2009-2014 Zementis Incorporated. All rights reserved.
