Knowledge Base Info

Welcome to the technical support knowledge base for the Zementis ADAPA Predictive Analytics decision engine. Our blogs cover general questions related to predictive models, PMML, and supported functionality of the ADAPA decision engine. Although posting is limited to technical support team members, we highly value your comments. Your feedback will be incorporated into updated/new postings and we will try to respond to your comments directly.

All our blogs can be found below. Please use the FAQ Categories on the left to find the information you are looking for. If you can't find it, feel free to contact us.

© Zementis, Inc. - All Rights Reserved.

Wednesday, July 8, 2009

KDD 2009 Panel Report: Open Standards and Cloud Computing

Leading Experts Debate Emerging Trends for Predictive Analytics and Data Mining.

At KDD 2009 in Paris, the leading conference on Knowledge Discovery and Data Mining, a panel of experts discussed various topics related to open standards and cloud computing, with a particular focus on the practical use of statistical algorithms, reliable production deployment of models and the integration of predictive analytics within other systems.


Moderated by Zementis, the panel consisted of a distinguished group of thought leaders representing key software vendors in the data mining industry, including DMG / Open Data Group, IBM, KNIME, KXEN, MicroStrategy, Pervasive, SAS and SPSS.

The first major focus of the discussion was the Predictive Model Markup Language (PMML). All vendors on the panel strongly support PMML, the de facto standard for model exchange, and will continue to actively improve its features and usability through their products. Addressing enhanced compatibility among vendors, the DMG and Zementis now offer a comprehensive PMML converter to check, validate, and convert PMML models. The panel also coincided with the general release announcement of PMML 4.0, the latest version of the standard.

Turning towards the emerging trend of Cloud Computing, it was evident that all vendors are actively investigating how to leverage the cloud most effectively for predictive analytics and data mining. Several vendors already provide cloud-based solutions, either on a public cloud infrastructure like Amazon EC2 or their own data center.



PMML and Cloud Computing are a reality and available today! There was no doubt that PMML as a standard has been accepted and has evolved into a valuable foundation for the predictive analytics industry. Cloud Computing will deliver additional benefits for various data mining solutions, either through a private or a public cloud infrastructure depending on the nature of the application.

For a more detailed summary of the panel, please review the KDD 2009 Panel Report, which summarizes questions and answers from the discussion.






Tuesday, July 7, 2009

Scoring data in ADAPA via Web Services in SSIS 2008 (SQL Server Integration Services)

Data integration is a big part of putting predictive models to work. ADAPA on the Amazon cloud allows for easy model deployment, but how can you actually move the data to the cloud for scoring? One simple way is by using SQL Server Integration Services (SSIS).

In our last post on this topic, we showed how to use the provided Web Service Task available in SSIS to score data in ADAPA via web services. A limitation of the Web Service Task, however, is that one needs to provide values for the input parameters through the task UI, which is not very useful if the intent is to score several data records.

If that is the case, the answer is to use a Script Component instead. Below, we demonstrate in 10 simple steps how to set up, write and execute our own Script Component so that we can successfully pass multiple data records to ADAPA on the cloud and get a result back for each record (without having to type the inputs in the UI every time).

Step 1: Installing the Zementis Security Certificate

First, we need to have the "Zementis Security Certificate" installed as a trusted third-party root certification authority. A link to the Zementis Security Certificate is available from the ADAPA Control Center window. Look for the link underneath the table of available instances. To install the certificate, follow the instructions posted here on how to install a certificate on a server with Microsoft Management Console (MMC). Instead of importing the certificate into the "Personal" folder under "Certificates", choose "Third-Party Root Certification Authorities". See figure below.



After having the certificate in place, it is time to start working on the SSIS project to have our data scored by ADAPA.

Step 2: Uploading PMML models into ADAPA

In the example below, we launched an ADAPA instance by using the ADAPA Control Center. In this instance, we uploaded two models available in the PMML Examples page of the Zementis website: "Iris_NN.xml" and "Iris_SVM.xml". These are the respective PMML exports of a neural network model and a support vector machine built to solve the Iris classification problem. The figure below shows the ADAPA Console after the two models have been uploaded.



Step 3: Creating the Excel data file

From the same PMML Examples page, we also downloaded the "Iris_NN.csv" file, opened it in Excel, deleted the "Class" column and saved it as an Excel file named "Iris_NN_input.xlsx". We will be passing each data record in this file to ADAPA on the cloud via a web service request in the Script Component.
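
For readers who would rather script this step than edit the file by hand, the column removal could be sketched in Python as follows. This is only an illustration: the sample row is made up, the helper name drop_column is hypothetical, and the sketch produces CSV text only, so saving the result as "Iris_NN_input.xlsx" would still be done in Excel.

```python
import csv
import io

def drop_column(csv_text, column):
    """Return csv_text with the named column removed (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in reader.fieldnames if name != column]
    out = io.StringIO()
    # extrasaction="ignore" lets the writer skip the dropped column in each row.
    writer = csv.DictWriter(out, fieldnames=kept, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

# Made-up sample using the column names of the Iris example.
SAMPLE = (
    "sepal_length,sepal_width,petal_length,petal_width,Class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
)
result = drop_column(SAMPLE, "Class")
print(result)
```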

Step 4: Preparing a package in SSIS for the Script Component

First, let's create a new SSIS package. Then, drag a Data Flow Task from the Control Flow Items tab to the Control Flow design surface. Double-click the Data Flow Task to open the Data Flow design surface. Drag an Excel Source component to the design surface. Double-click this source component and choose to create a new Excel Connection Manager. Select the Excel file created in step 3 above. Make sure to select First row has column names. Click OK and select the appropriate Excel sheet. Click OK again to return to the Data Flow task editor.

Step 5: Adding the Script Component

Drag a Script Component from the Data Flow Transformations tab to the Data Flow design surface. In the dialog box, select Transformations and click OK. Connect the output of the Excel Source component to the Script Component. See figure below.



Step 6: Setting up the Script Component

Double-click the Script Component to open the Script Transformation Editor. On the Input Columns tab, check the Name box to select all input columns (sepal_length, sepal_width, petal_length, and petal_width). On the Inputs and Outputs tab, expand the Output node and select the Output Columns folder. Click the Add Column button. Edit the Name property of the column to be Class (this is the name of the predicted field in the Iris model). Iris types will be strings, so select the Data Type property of the column and use the drop-down list to select the type String [DT_STR]. Finally, select the Script tab and click the Design Script button to open the script editor. We will be using C# in this example.

Step 7: Creating a Web Reference to ADAPA

The script editor will automatically create a C# frame for our script. We will populate it later. For now, let's create a web reference to ADAPA. On the Project Explorer panel on the left side of the script editor, right-click on the References folder and select to Add Web Reference. In the Add Web Reference box, enter the URL for the ADAPA WSDL file. See figure below.



A link to the ADAPA WSDL file can be found in the ADAPA Console. See the ADAPA Web Console figure posted in Step 2 above. For security reasons, SSIS will ask us to enter username and password multiple times before it finishes generating the client code for calling the ADAPA web service. When done, it should look like the figure below.



The username and password refer to the e-mail address and password we used to access the ADAPA Web Console.

Step 8: Handling HTTP authentication

To see the classes generated by SSIS, go to the Project Explorer panel, open the Web References folder and double-click on com.amazonaws.compute.... The Object Browser tab will appear on the right side of the script editor. On this panel, find and select the ...com.amazonaws.compute... folder. This folder should contain a class for the adapaModelsService. The figure below shows the script editor with the Object Browser tab on the right and the Project Explorer panel on the left.



Now, double-click the adapaModelsService class to open the file Reference.cs in the script editor. To the adapaModelsService class, add a method (see below) that overrides the GetWebRequest() method of the System.Web.Services.Protocols.SoapHttpClientProtocol class from which the client code derives. This allows the SSIS-generated classes to handle the basic authentication process required by ADAPA. The new GetWebRequest() method can be obtained from a posting by Mark Michaelis dated March 23, 2005. After copying this method into the adapaModelsService class, the script editor will look like the figure below.



Also add the following two lines of code to the original list of includes:

using System.Text;
using System.Net;


and save the file.
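
For intuition, the effect of pre-authenticating with HTTP basic authentication can be sketched outside of SSIS. The Python snippet below is not the GetWebRequest() method from the posting. It only shows, assuming standard HTTP basic authentication, what the Authorization header sent along with a request looks like. The helper name and the sample credentials are hypothetical.

```python
import base64

def basic_auth_header(username, password):
    """Build the value of an HTTP 'Authorization' header for basic authentication."""
    # Basic authentication sends "username:password" encoded as Base64.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

# Hypothetical credentials, for illustration only.
header = basic_auth_header("user", "pass")
print(header)
```

Note that Base64 is an encoding, not encryption, which is one reason such credentials are normally sent only over a secured connection.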

Step 9: Writing the script (finally!)

Switch tabs now to "main.cs". This is the file containing the code for our C# script, and it is where we write the client call to ADAPA. The figure below shows the script editor after editing it. Note that the client code requires that the network credentials are set and that the PreAuthenticate property is set to true (as per Mark Michaelis's posting).



As in the previous step, make sure to add the following lines of code to the original list of includes:

using System.Net;
using ...com.amazonaws.compute...;


where ...com.amazonaws.compute... is the namespace generated by SSIS for the ADAPA WSDL. Copy it from the Object Browser tab and paste it into "main.cs". All we need to do now is to move to the Project Explorer panel, right-click on the root folder and select Build. We should see the message Build Succeeded in the bottom left corner of the script editor. We can now close the editor and click OK to close the Script Transformation Editor. Our mission to build the script to access ADAPA on the cloud is finished. All we need to do now is to run it.

Step 10: Scoring data

To see the results of all our labor, we will use the Row Count component. Let's first create a variable in the Variables tab and name it "RowCount". Make sure its scope is global by clicking first on the back panel of the Data Flow design surface. Then, switch to the Toolbox tab and drag a Row Count component from the Data Flow Transformations tab to the Data Flow design surface. Connect the output of the Script Component to the Row Count component. Double-click on the Row Count component. Use the drop-down list associated with VariableName of CustomProperties and select the variable just created, which now reads "User::RowCount".

The best way to see the output of the Script Component is to add a data viewer to the Data Flow on the path between the output of the Script Component and the input of the Row Count component. For that, right-click on the connection between the Script Component and Row Count and select Data Viewers. Select Data Viewers again on the Data Flow Path Editor and click Add. Select type Grid on the Configure Data Viewer window and click OK. Then, click OK again on the Data Flow Path Editor. We are now ready to execute our script and see the results generated by ADAPA.



To execute the package, select Start Debugging from the Debug menu. The data viewer window will then appear, containing the data and the results of the web service call to ADAPA and the Iris neural network model. Note that for every single input record, we obtained a result back which identifies the class, i.e. the particular Iris type, the input characteristics belong to. See figure below for results.



If you are interested in learning more about accessing Web Services with SSIS Script, we highly recommend the following list of resources:

Last, but not least, the package we created, named "ADAPAWS.dtsx", can be downloaded by clicking here (this is a zip file that you will need to unzip). Remember to replace "yourname@company.com" and "yourADAPApassword" in "main.cs" with the e-mail and password used to access the ADAPA Web Console.

Thursday, July 2, 2009

I got an error uploading a Neural Network model in ADAPA generated in Clementine. Can you help?

SPSS PASW tools export all kinds of models in PMML. These are usually perfect and ready to be deployed in ADAPA.

We have noticed, however, that the latest version of Clementine (PASW Modeler) exports Neural Network models in PMML with a small inconsistency which may cause ADAPA to generate an error during model uploading.

If you have defined any of your continuous input variables as integer, Clementine will define them correctly in the DataDictionary in the PMML file as integer. See below:

However, in the MiningSchema, Clementine assigns such variables real numbers. See below:

Note that in this case missing value replacement for variable "Age" is 44.5, not an integer.

In such cases, ADAPA will throw an error since a real number is being assigned to a variable of type integer.

To avoid this problem, the solution is to define any integer continuous variables in the DataDictionary as double. See new DataDictionary below:


We plan to add this correction to the PMML converter tool so that it will automatically do the replacement. Note that in this case, the converter is acting as a corrector since you will be passing a PMML file in version 3.2 through it and getting a corrected version 3.2 file back.
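
Until the converter handles this automatically, the manual edit described above could be scripted. The Python sketch below is not part of ADAPA or the PMML converter. The sample PMML fragment, the second field ("Gender"), and the function names are made up for illustration, and the sketch assumes the affected fields can be recognized by optype="continuous" and dataType="integer" in the DataDictionary.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration; a real export contains much more.
SAMPLE = """<PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2">
  <DataDictionary numberOfFields="2">
    <DataField name="Age" optype="continuous" dataType="integer"/>
    <DataField name="Gender" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>"""

def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]

def widen_integer_fields(pmml_text):
    """Redefine continuous integer DataFields as double; return (root, changed names)."""
    root = ET.fromstring(pmml_text)
    changed = []
    for elem in root.iter():
        if (local_name(elem.tag) == "DataField"
                and elem.get("optype") == "continuous"
                and elem.get("dataType") == "integer"):
            elem.set("dataType", "double")
            changed.append(elem.get("name"))
    return root, changed

root, changed = widen_integer_fields(SAMPLE)
print(changed)
```

Categorical fields are left untouched, since the inconsistency described here only concerns continuous variables defined as integer.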

Monday, June 22, 2009

Examining PMML 4.0 - Part I: Pre-Processing

You may be wondering what all the fuss is about PMML and its latest 4.0 version. So, we decided to explore all that PMML 4.0 has to offer in a series of blogs. In part I, we will be exploring its improved pre-processing capabilities.

All data mining models manipulate the raw data in one way or another before passing it through a neural network, support vector machine, or regression model. Therefore, a language that wants to represent all the computations that go into a model also needs to be able to represent the data transformations that were applied to the raw data before scoring takes place. PMML is this language! It is the Yin and Yang of data mining.

Let's first recap the pre-processing capabilities available in PMML 3.2. This version of PMML allows for the following out-of-the-box data transformations:
  • Normalization of continuous variables: this is accomplished via the NormContinuous element of PMML. It is mostly used to normalize a variable between 0 and 1. See example below (real PMML code) in which two variables are normalized. The first between 0 and 1 and the second between 0 and 4.
  • Normalizing Categorical Inputs: normally used to transform strings into numerical variables. This is accomplished by the element NormDiscrete. In the PMML example below, a categorical variable creates dummy variables that will be assigned values 1 or 0 depending on the category assumed by the input variable.
  • Discretization: this is used to transform continuous variables into strings. This is accomplished by the Discretize element. In the PMML example below, if the input variable is equal to 500, it is transformed to low; if equal to 5000, it is transformed to medium; and if 50,000, it is high.
  • Value Mapping: this is accomplished in PMML by the use of a mapping table and the element MapValues. To make things more interesting, in the PMML example below, we combine elements MapValues and NormDiscrete to group small sets of categorical values. Specifically, we want to find out if the input variable belongs to a specific group of colors. We do that by using MapValues to map different colors to the same number. We then use the element NormDiscrete to create dummy variables which are used to indicate group membership.
  • Arithmetic Expressions: PMML offers a range of arithmetic functions (as well as string and date/time manipulation functions) that can be arranged in different ways to express complex arithmetic expressions. The example below solves the following operation:
ResultVar=maximum(round(InputVar1/3.3),2^(1+log(1.3*InputVar2+1)))

  • PMML 4.0 - Boolean Operations: Not only does PMML 4.0 allow Boolean operations to be fully expressed, but it also allows these to be nested into IF-THEN-ELSE logic. These new built-in functions offer a vast new array of possibilities for representing data transformations in PMML. So, we devote the rest of this review to transformations that can now be easily expressed in PMML 4.0.
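
As a recap of the PMML 3.2 transformations listed above, their behavior can be mirrored in a few lines of Python. This is a sketch, not PMML: the function names, the bin boundaries (1000 and 10000) and the color groups are assumptions chosen only to agree with the examples in the text, "log" is assumed to be the natural logarithm, and "round" is assumed to round halves up.

```python
import math

def norm_continuous(x, points):
    """Piecewise-linear mapping through (orig, norm) pairs, in the spirit of NormContinuous."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x1:
            break  # x falls in (or before) this segment
    # Values outside the covered range are extrapolated along the nearest segment.
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

def norm_discrete(value, category):
    """Dummy variable in the spirit of NormDiscrete: 1.0 on a match, else 0.0."""
    return 1.0 if value == category else 0.0

def discretize(x):
    """Map a number to a label; the boundaries 1000 and 10000 are assumed."""
    if x < 1000:
        return "low"
    if x < 10000:
        return "medium"
    return "high"

# Hypothetical mapping table: several colors share one group number.
COLOR_GROUP = {"red": 1, "orange": 1, "yellow": 1, "blue": 2, "green": 2}

def in_color_group(color, group):
    """MapValues followed by NormDiscrete: 1.0 if the color maps to the group."""
    return norm_discrete(COLOR_GROUP.get(color), group)

def result_var(input_var1, input_var2):
    """ResultVar = maximum(round(InputVar1/3.3), 2^(1+log(1.3*InputVar2+1)))."""
    rounded = math.floor(input_var1 / 3.3 + 0.5)  # round half up (assumed)
    return max(rounded, 2 ** (1 + math.log(1.3 * input_var2 + 1)))

print(norm_continuous(5.0, [(0.0, 0.0), (10.0, 1.0)]))
print(discretize(500), discretize(5000), discretize(50000))
```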
We start with the PMML code below which implements the following logical and arithmetic operations:
IF InputVar1 == "Partner" THEN DerivedVar1 = "P" ELSE DerivedVar2 = 2 * InputVar2



Note that it uses the newly defined 4.0 functions: "if", "equal", and "not" as well as function "*".

The PMML code below assumes that both "then" and "else" parts of the "if" use the same derived variable to implement the following operations:
IF InputVar1 == "Partner" THEN DerivedVar1 = 5.1 * InputVar2 ELSE DerivedVar1 = InputVar2 / 3.3

Finally, we end our list of PMML pre-processing examples by showing the use of 4.0 functions "isMissing" and "isIn" combined with function "if". The PMML example below implements the following operations:
IF InputVar is missing THEN DerivedVar = 1 ELSE (IF InputVar is in ("Partner", "Associate", "Colleague") THEN DerivedVar = 2 ELSE DerivedVar = 3)
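
The logic of the three IF-THEN-ELSE examples above can also be written out in Python, which may help when checking a PMML file against the intended behavior. This is only a sketch with hypothetical function names. It assumes that a missing value is represented by None and that, in the first example, each derived variable is left missing when its branch does not apply.

```python
def example_1(input_var1, input_var2):
    """Two derived variables: one set in the 'then' branch, one in the 'else' branch."""
    is_partner = input_var1 == "Partner"
    derived_var1 = "P" if is_partner else None             # if(equal(...))
    derived_var2 = 2 * input_var2 if not is_partner else None  # if(not(equal(...)))
    return derived_var1, derived_var2

def example_2(input_var1, input_var2):
    """One derived variable computed by both branches of the 'if'."""
    if input_var1 == "Partner":
        return 5.1 * input_var2
    return input_var2 / 3.3

def example_3(input_var):
    """Nested 'if' combining the isMissing and isIn checks."""
    if input_var is None:  # isMissing
        return 1
    if input_var in ("Partner", "Associate", "Colleague"):  # isIn
        return 2
    return 3

print(example_1("Partner", 4))
print(example_3(None), example_3("Associate"), example_3("Customer"))
```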


We finish part I of our PMML tour hoping that this short description of its pre-processing capabilities can help you to easily navigate through all the data transformations now available in PMML 4.0.

Tuesday, June 16, 2009

PMML 4.0 is here!

The DMG (Data Mining Group) has just released PMML 4.0, the latest and greatest version of the Predictive Model Markup Language.

Zementis, together with SPSS, SAS, IBM, Open Data Group, Salford Systems, MicroStrategy and all the other contributing members of the DMG, is proud to be part of the making of PMML, the de facto standard to represent data mining models.

Not only can PMML represent a wide range of statistical techniques, but it can also be used to represent the data transformations necessary to transform raw data into meaningful feature detectors. In this way, PMML offers a standard to represent data manipulation and modeling in a single concise way.



Improved Pre-Processing Capabilities

PMML 4.0 extends the range of pre-processing capabilities supported by older versions by adding a range of boolean operations (e.g., and, or, not, equal, notEqual, greaterOrEqual, ...) to the list of built-in functions. These, combined with an IF-THEN-ELSE function which is also new to PMML, allow for the representation of a wide range of feature detectors.

For examples on how to use these new pre-processing capabilities as well as all the standard PMML transformations, please check the PMML Data Pre-Processing Primer.

Time Series Models


PMML 4.0 also extends the existing standard by allowing for the representation of Time Series Models. In particular, it allows for data miners and data mining tools to represent Exponential Smoothing models and offers place holders for ARIMA, Seasonal Trend Decomposition, and Spectral Analysis which are to be supported in the near future.

Model Explanation

Other additions are Model Explanation and Multiple Models. Model Explanation allows for evaluation and model performance measures to be part of the PMML file itself. In this way, a PMML file can define not only data manipulation and models, but also the associated ROC graph, gains/lift charts, confusion matrix, field correlations, univariate statistics, and more.

Multiple Models

Multiple Models allows for model composition, ensembles, and segmentation. It replaces the old Model Composition element to offer greater flexibility for combining different model types, such as regression and decision trees.

Extending Existing Elements

Last, but not least, PMML 4.0 offers a range of extensions to existing elements, such as the addition of multi-class classification for Support Vector Machines, improved representation for Association Rules, and the addition of Cox Regression Models.

There is no doubt that PMML is here to stay. The announcement of PMML 4.0 attests to the commitment of the leading data mining vendors to be able to represent their solutions through a single language, a language that can be understood by all. It is our vision that users will be free to share models among many solutions, benefiting from an environment in which interoperability is truly attainable.

For more information on PMML and a list of useful links, please check PMML 101. Also, check the article "PMML: An Open Standard for Sharing Models" just published in The R Journal.

We also invite the entire community to join our on-going PMML discussion at the AnalyticBridge website.





Copyright © 2009 Zementis Incorporated. All rights reserved.
