Wednesday, November 26, 2008
Talk at the Forum on Analytics - San Diego
In 2008, we gave a talk (11/12/08) at the Forum on Analytics (sponsored by the San Diego Software Industry Council - SDSIC). The talk was entitled "Easy Expression and Execution of Data Mining Models through PMML". Please click here to see the presentation slides. The presentation transcripts follow below (after the abstract).
PMML (Predictive Model Markup Language) is an XML-based language used to define data mining models. It was specified by the Data Mining Group, an independent group of leading technology companies. By providing a uniform standard to represent predictive models, PMML allows for the exchange of predictive solutions between different applications and various vendors. Many statistical packages already support the PMML standard; these include, for example, SAS and SPSS. In an effort to broaden the scientific workbench available to data mining scientists and to support the open source community, Zementis recently contributed code to the R project. In particular, we implemented the export of neural network models built with the nnet R package as well as Support Vector Machines built with the ksvm function of the kernlab R package. The same PMML exporter can also produce decision trees built with rpart and linear regression models built with lm. The PMML exporter package is currently available through CRAN (the Comprehensive R Archive Network).
All of the R exported PMML models are readily available to be uploaded into an execution engine for scoring or classification. For example, the ADAPA engine, which can be used for production deployment of PMML models, is currently available as a service in the Amazon Elastic Compute Cloud (Amazon EC2).
Our aim here is to show how one can quickly build a data mining model in R, such as a Support Vector Machine, and use the PMML exporter to produce a model file which can be uploaded and executed in a different application. We demonstrate how one can use data containing expected results to verify correct model deployment. If all computed and expected values match, the model can be considered ready for production, i.e. available for generating predictions on incoming data as part of an overall enterprise decision management strategy. From R to ADAPA, we use PMML as an effective way to express and execute data mining models.
Our work shows how PMML can be effectively used to allow for model exchange between different applications. Also, it highlights how one can benefit from an open-source statistical package such as R to easily export models into PMML and upload them into ADAPA, a light-weight scoring engine which consumes several PMML models. The ease of model expression and execution allows data mining scientists to concentrate on the important tasks: data analysis and model building. Real-time, scalable execution is handled through software tools which communicate through a common language, PMML.
Below, please find the transcripts of our talk, organized per slide.
Slide 1 - Title
My talk will be divided into three parts, all of which are centered on Open Standards. I will start by talking about the Development of Predictive Models using R. I will then talk about Deployment, and in doing so I will focus on PMML. Finally, I will talk about the real-time execution of PMML files.
So, R is our software of choice for this presentation, given that it is open source and a GNU project. R is available for free over the internet. R allows for data manipulation, calculation, and graphical display. It provides a wide variety of statistical techniques, and it is highly extensible.
But, how to export models out of R? Once you build your models in R, you can easily export them into PMML.
We recently contributed code to the R PMML package which can now export a variety of modeling techniques which include … to name a few:
Support Vector Machines
Great! But, what is PMML? PMML stands for Predictive Model Markup Language. It is an XML-based language which is the de facto standard for exchanging predictive models between compliant applications. For this reason, PMML avoids proprietary issues and incompatibilities.
PMML provides a clear separation of tasks in which model deployment easily follows model development. In this way, PMML frees scientists to focus on model building. PMML eliminates the need for custom model deployment. In doing so, it ensures scalability and reliability.
PMML is a mature standard and is widely supported by the industry. It is developed and maintained by the Data Mining Group, which is a vendor-independent consortium with several major supporters including IBM, Oracle, Microsoft, SAS, SPSS, Fair Isaac and Zementis.
A single PMML file can be used to represent data transformations as well as the model itself. In doing so, PMML brings data transformations and statistical models together.
PMML allows for the definition of a data dictionary which is used to define all the raw data fields coming into the model, including missing value strategy and outlier treatment.
PMML can also express several data transformations, which can be used to extract feature detectors from the raw data.
On the other hand, post-processing of results allows for tailored decisions.
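To make the idea of a data transformation concrete, here is a minimal Python sketch of a piecewise-linear normalization, the kind of mapping that PMML describes with its NormContinuous element. The function name, the (4.0, 8.0) input range, and the choice to raise an error for out-of-range values are assumptions made for illustration; they are not taken from the talk.

```python
def norm_continuous(value, points):
    """Map value through (orig, norm) points by linear interpolation.

    points must be sorted by their orig coordinate. Values outside the
    covered range raise an error in this sketch (an assumption; a real
    engine would follow whatever outlier treatment the model specifies).
    """
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= value <= x1:
            return y0 + (value - x0) * (y1 - y0) / (x1 - x0)
    raise ValueError("value outside the range covered by the points")


if __name__ == "__main__":
    # Hypothetical range: scale a raw measurement from [4.0, 8.0] to [0, 1].
    scale = [(4.0, 0.0), (8.0, 1.0)]
    print(norm_continuous(6.0, scale))
```

Expressing the mapping as a list of points keeps the transformation declarative, which is what allows it to travel inside the model file rather than in custom code.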
So, here it is, R … The R GUI is very simple.
Imagine that we want to build a neural network model to solve the Iris classification problem. In this case, all we need to do is load the R neural network package nnet.
We then assign the data file containing the IRIS data set to the R object we called Iris.
We then call nnet with the right parameters, which include the data set used to train the model as well as the size of the hidden layer. We assign the network to the R object IrisNet.
Once trained, this network can be easily exported into PMML. We need to load the pmml package first and then call its pmml function with our neural network object as a parameter.
Here is your PMML code. Hot from the oven!
Note the data dictionary which contains a description of the four input variables used to train the model.
Also note that in this simple example, there were no data transformations. We are using the input data AS IS.
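For readers without the slide in front of them, the following Python sketch builds a data dictionary of the kind described above, using only the standard library. The field names follow the column names of R's built-in iris data set, and the inclusion of the Species target field is an assumption; the file produced by the R exporter may differ in detail.

```python
import xml.etree.ElementTree as ET

# Assumed field names, modeled on R's built-in iris data set.
INPUT_FIELDS = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]
SPECIES = ["setosa", "versicolor", "virginica"]


def build_data_dictionary():
    """Build a PMML-style DataDictionary element for the Iris problem."""
    dictionary = ET.Element("DataDictionary")
    # Four continuous inputs, used as is (no transformations).
    for name in INPUT_FIELDS:
        ET.SubElement(dictionary, "DataField",
                      name=name, optype="continuous", dataType="double")
    # One categorical target field listing the three possible classes.
    target = ET.SubElement(dictionary, "DataField",
                           name="Species", optype="categorical",
                           dataType="string")
    for value in SPECIES:
        ET.SubElement(target, "Value", value=value)
    # numberOfFields records how many DataField children were added.
    dictionary.set("numberOfFields",
                   str(len(dictionary.findall("DataField"))))
    return dictionary


if __name__ == "__main__":
    print(ET.tostring(build_data_dictionary(), encoding="unicode"))
```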
So, you did all your data analysis, built your model and generated PMML using R … but what now?
What can you do if you need to execute this model in an Iris field and use it in real time?
This is where a predictive analytics scoring engine fits in. We are going to use ADAPA here as an example.
ADAPA allows for data transformations and models to be uploaded and executed in real time via web-services calls.
It is an environment to manage and execute not only one but many predictive models and rule sets.
ADAPA is not a model building environment. We used R for that!
This is a screen-shot of the ADAPA management console.
PMML files are easily uploaded and maintained through this interface.
Note that several models have already been uploaded into ADAPA including the IRIS Neural Network model we just built using R.
Once a model is uploaded, it needs to be validated. This is accomplished via a score matching test.
In this case we uploaded 150 IRIS data records containing the input variables as well as the expected output. ADAPA will then compare and match computed and expected values.
If any mismatches are found, ADAPA allows for complete traceability of its internal decisions.
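The idea behind a score matching test can be sketched in a few lines of Python. This is not ADAPA's implementation: the function names are hypothetical, and the toy threshold rule below merely stands in for the deployed neural network so that the example is self-contained.

```python
def score_matching_test(records, score, tolerance=1e-6):
    """Score each record and return the ones whose output does not match.

    records is a list of (inputs, expected) pairs. Numeric outputs are
    compared within a tolerance, labels by equality.
    """
    mismatches = []
    for index, (inputs, expected) in enumerate(records):
        computed = score(inputs)
        if isinstance(expected, float):
            matched = abs(computed - expected) <= tolerance
        else:
            matched = computed == expected
        if not matched:
            mismatches.append({"record": index, "inputs": inputs,
                               "expected": expected, "computed": computed})
    return mismatches


def toy_score(inputs):
    """Stand-in scorer (an assumption, not the model from the talk)."""
    _sepal_length, _sepal_width, petal_length, petal_width = inputs
    if petal_length < 2.5:
        return "setosa"
    return "versicolor" if petal_width < 1.8 else "virginica"


# Three records in the layout (inputs, expected output).
IRIS_SAMPLE = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
]

if __name__ == "__main__":
    problems = score_matching_test(IRIS_SAMPLE, toy_score)
    print("ready for production" if not problems else problems)
```

Returning the mismatching records, rather than a plain pass or fail, mirrors the point made above: when values disagree, you want to see exactly which inputs caused the disagreement.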
Great! I just showed you how to Easily Express and Execute Data Mining Models using PMML … all in 6 steps.
R allows for Data Analysis, Model Building, and PMML Export.
PMML can then be uploaded into a compliant decision engine.
Data reaches the engine via web-service calls.
And in so doing, model execution is performed in real time.
Slide 14 - Thank you!