Friday, December 9, 2011

In-database Scoring with PMML, Zementis, and Sybase IQ: Big Data Analytics Made Easy

Not all analytic tasks are born the same. If you are confronted with massive volumes of data that need to be scored on a regular basis, in-database scoring is the logical thing to do. In all likelihood, the data in these cases is already stored in a database and, with in-database scoring, there is no data movement. Data and models reside together, so scores and predictions flow at an accelerated pace.

So, wouldn't it be great if you could now benefit from the flexibility of a standard such as PMML combined with in-database scoring? Zementis is offering just such a solution. It is called the Universal PMML Plug-in™ and it is truly amazing!

Here is why: for starters, it is simple to deploy and maintain. Our Universal PMML Plug-in was designed from the ground up to take advantage of efficient in-database execution, and, as its name suggests, it is PMML-based. PMML, the Predictive Model Markup Language, is the standard for representing predictive models and is currently exported by all major commercial and open-source data mining tools. So, if you build your models in SAS, IBM/SPSS, or R, you are ready to start benefiting from in-database scoring right away.

Announcing the Universal PMML Plug-in for Sybase IQ

It is our pleasure to announce, together with Sybase, the availability of the Zementis Universal PMML Plug-in for Sybase IQ 15.4 (Press Release: Sybase Does More Big Data Analytics). This solution allows external predictive models expressed in the PMML standard to be parsed, ingested, and executed in-database in Sybase IQ. This capability is especially appealing to enterprises that use multiple data mining tools or seek to deploy their existing predictive models closer to the data for better performance and broader applicability.


The PMML Plug-in seamlessly embeds models within Sybase IQ. In this way, data scoring requires nothing more than adding a simple function call into your SQL statements. You can score data against one model or against multiple models at the same time. There is no need to code connection weights, regression equations or other more complex calculations in SQL or stored procedures. PMML and our Universal Plug-in can easily take care of that.

PMML execution, combined with Sybase IQ's existing capabilities for text and multimedia analytics, provides enterprises with a breadth of available techniques for analyzing big data.

For more details about the Universal PMML Plug-in for Sybase IQ, contact Zementis, or download the product data sheet.

Friday, December 2, 2011

KNIME PMML Support: Model Import and Export + Pre-processing

What is KNIME? According to knime.com:

KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform.

Yes, KNIME is user-friendly, not only because it offers an intuitive GUI to analyze data, but also because it is open-source. KNIME is also standards-friendly. KNIME 2.0, released in 2008, was the first release to offer PMML support. PMML, the Predictive Model Markup Language, is the de facto standard to represent data mining and predictive analytic models. PMML today is supported by all the top statistical packages, including SAS, IBM SPSS, KXEN, and R.

Since release 2.0, PMML support in KNIME has matured considerably, from the import and export of predictive models all the way to the pre-processing of input variables. KNIME 2.5, released December 1, 2011, offers a series of PMML-enabled pre-processing nodes which can be embedded automatically in the final PMML model. All these features are documented in a paper presented at the KDD 2011 PMML Workshop:

Peer-reviewed article: KDD 2011 – PMML Pre-processing in KNIME

To illustrate some of KNIME's capabilities when it comes to PMML, we describe below a workflow we built in KNIME for training a neural network model for classification of the audit data set. This workflow encapsulates the following high-level tasks:
  • The reading in of the audit data set (this data set is supplied as part of the R Rattle package): This is an artificial data set consisting of fictional clients who have been audited, perhaps for tax refund compliance. For each case an outcome is recorded: whether the taxpayer's claims had to be adjusted or not, which in the data is represented by 0 (no) and 1 (yes).
  • The pre-processing of input variables, which involves the creation of dummy variables for categorical variables and the normalization of numerical variables.
  • The training and testing of a neural network model
  • The exporting of the resulting PMML file which includes all pre-processing steps as well as the neural network model itself.

KNIME Workflow - Step-by-Step

Below we describe in 8 steps how we went about building such a workflow.

Step 1: We start by reading the audit data set from a csv file. We simply use node "CSV Reader" for that. We then use node "Number To String" to tell KNIME that our predicted variable "TARGET_Adjusted" should be treated as a string.


Step 2: Since we do not want to use all variables in the data set for training our neural network, we use the node "Column Filter" to filter out variables such as ID and IGNORE_Accounts.


Step 3: We are now ready to start massaging the remaining data. For that we use the new PMML-enabled node "One2Many" to create dummy variables out of the categorical raw input variables. Note that this node comes with a blue port indicating its PMML capabilities. We also use another "Column Filter" node to remove the original categorical variables from our data.

Step 4: We then add PMML-enabled node "Normalizer" to the workflow. This node normalizes all the numerical variables so that they can be presented to the neural network for training. Note that we linked the blue port from the preceding node to this node. This signals KNIME that we would like to have the PMML representation passed between nodes.
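To make the two pre-processing steps concrete, here is a minimal Python sketch of what dummy-variable creation and normalization compute. It is not KNIME code: the function names and sample records are made up for illustration, and it assumes min-max scaling to [0, 1], which is only one of the ways a numerical variable can be normalized.

```python
def one_to_many(rows, column):
    """Replace a categorical column with one 0/1 dummy column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


def min_max_normalize(rows, column):
    """Rescale a numeric column to the range [0, 1] (assumed scaling method)."""
    values = [row[column] for row in rows]
    low, high = min(values), max(values)
    span = high - low
    return [
        {**row, column: (row[column] - low) / span if span else 0.0}
        for row in rows
    ]


# Hypothetical records; the field names are illustrative only.
records = [
    {"Age": 20, "Employment": "Private"},
    {"Age": 40, "Employment": "Public"},
    {"Age": 60, "Employment": "Private"},
]
prepared = min_max_normalize(one_to_many(records, "Employment"), "Age")
```

In this sketch the original categorical column is dropped inside `one_to_many`, whereas the workflow uses a separate "Column Filter" node for that.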


Step 5: We then use the node "Partitioning" to partition the audit data into two data sets, one for training and another for testing.
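For readers who think in code, partitioning amounts to shuffling the records and cutting the list in two. The sketch below is plain Python, not KNIME; the 80/20 split and the fixed seed are assumptions chosen for illustration, since the split ratio used in the workflow is not stated here.

```python
import random


def partition(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and split it into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = partition(range(10))
```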


Step 6: We can now use node "RProp MLP Learner" to train our neural network model. Note that this node is also PMML-enabled and so we link the blue port from node "Normalizer" to it. This ensures that the PMML equivalent of the pre-processing operations is passed to the neural net learner node.


Step 7: Given that the neural network has been trained, it is time to export the resulting PMML file. For that we use the node "PMML Writer". You can inspect the exported PMML file on your own (see RESOURCES below).


Step 8: As far as PMML is concerned, we are done. But, to complete the model building process, we must evaluate our model against the test data. For that, we connect the test piece of node "Partitioning" to node "MultiLayerPerceptron Predictor". Note that the trained neural network model is communicated from the learner node to the predictor node via a blue PMML port. Finally, we can then visualize the scoring results using node "Interactive Table". With this step, our data workflow is complete.



Putting your model to work

Once you have verified that the model works and that it generalizes over the testing data, you can simply upload the resulting PMML file into ADAPA where it will be made available for execution.

Resources

Learn

Get products and technologies

  • KNIME, a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform. From day one, KNIME has been developed using rigorous software engineering practices and is used by professionals in both industry and academia in over 60 countries.

  • ADAPA is a revolutionary predictive analytics decision management platform, available as a service on the cloud or on site. It provides a secure, fast, and scalable environment to deploy your data mining models and business logic and put them into actual use.

Tuesday, October 25, 2011

Operational Deployment of Predictive Solutions: Lost in Translation? Not with PMML

Traditionally, the deployment of predictive solutions has been, to put it mildly, cumbersome. As shown in the figure below, data mining scientists work hard to analyze historical data and to build the best predictive solutions out of it. Engineers, on the other hand, are usually responsible for bringing these solutions to life by recoding them into a format suitable for production deployment. Given that data mining scientists and engineers tend to inhabit different information worlds, a predictive solution can get lost in translation on its way from the scientist's desktop to production.


Luckily, the advent of PMML (Predictive Model Markup Language) changed this scenario radically. PMML is the de facto standard used to represent predictive solutions. In this way, there is no need for scientists to write a Word document describing the solution. They can just export it as a PMML file. Today, all major data mining tools and statistical packages support PMML. These include IBM SPSS, SAS, R, KNIME, RapidMiner, KXEN, ... Also, tools such as the Zementis Transformations Generator and KNIME allow for easy PMML coding of pre- and post-processing steps.

Great! Once a PMML file exists, it can be easily deployed in production with ADAPA, the Zementis scoring engine. ADAPA even allows for models to be deployed in the Amazon Cloud and be accessed from anywhere via web-services. Zementis also offers in-database scoring via its Universal PMML Plug-in, which is also available for Hadoop. In this way, a process that could take 6 months now takes minutes.


PMML and ADAPA have transformed model deployment forever. If you or your company are still spending time and resources in deploying your predictive analytics the traditional way, make sure to contact us. The secret behind exceptional predictive analytics is out!

Wednesday, April 27, 2011

With PMML, interoperability is truly attainable

Developed by the Data Mining Group (DMG), an independent, vendor-led committee, PMML provides an open standard for representing data mining models. In this way, models can easily be shared between different applications, avoiding proprietary issues and incompatibilities. Currently, all major commercial and open-source data mining tools already support PMML. These include IBM/SPSS, SAS, KXEN, TIBCO, STATISTICA, Microstrategy, R, KNIME, and RapidMiner (for the full list of PMML-compliant tools, see the PMML-powered tools page at DMG.org).

PMML is an XML-based language which follows a very intuitive structure to describe data pre- and post-processing as well as predictive algorithms. Not only does PMML represent a wide range of statistical techniques, but it can also be used to represent input data as well as the data transformations necessary to transform raw data into meaningful features.

The PMML Converter


As part of the Data Mining Group, Zementis is committed to the continual development of PMML. It is our vision for the community that users will be free to share models among many solutions, benefiting from an environment in which interoperability is truly attainable.

In this spirit, Zementis has made available a tool called the PMML Converter which converts older versions of PMML to its latest, Version 4.0. The converter is also used to validate a data mining model against the PMML specification for versions 2.0, 2.1, 3.0, 3.1, 3.2, and 4.0. If validation is not successful, the converter gives back a file containing explanations for why the validation failed (click on the "details" button).

Before actual conversion takes place, the validation phase needs to be successful, i.e. the model file needs to conform to the PMML specification as published by the DMG (for any of the older PMML versions listed above). For known PMML issues (from a variety of sources/vendors), the PMML Converter will actually correct the model file so that it can be converted appropriately.
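As a rough illustration of the first thing any such tool has to do, the Python sketch below reads the version declared on the root element of a PMML document and decides whether conversion to 4.0 would be needed. It is not the PMML Converter and performs no schema validation; the function names and the tiny sample document are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Versions the post lists as accepted by the converter.
SUPPORTED_VERSIONS = {"2.0", "2.1", "3.0", "3.1", "3.2", "4.0"}


def pmml_version(pmml_text):
    """Return the version attribute of the root PMML element, or None."""
    root = ET.fromstring(pmml_text)
    # Strip any XML namespace, e.g. "{http://www.dmg.org/PMML-3_2}PMML".
    tag = root.tag.split("}")[-1]
    return root.get("version") if tag == "PMML" else None


def needs_conversion(pmml_text, target="4.0"):
    """True if the file declares a supported version other than the target."""
    version = pmml_version(pmml_text)
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported or missing PMML version: {version!r}")
    return version != target


# Hypothetical, heavily truncated PMML document used only for this sketch.
sample = '<PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2"><Header/></PMML>'
```

Calling `needs_conversion(sample)` on the 3.2 sample returns `True`; a file that already declares version 4.0 returns `False`.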

The PMML converter currently converts the following model elements to PMML 4.0:
  • Association Rules
  • Clustering Models
  • Decision Trees
  • General Regression Models
  • Naive Bayes Classifiers
  • Neural Networks
  • Regression Models
  • Ruleset Models
  • Support Vector Machines
It will also convert pre- and post-processing PMML elements.

The PMML Converter can be found in the Zementis PMML Tools page.

For more information on how to use the converter, please refer to the how-to guide.

The ADAPA Decision Engine

If you are using the ADAPA Decision Engine, there is no need to use the PMML Converter before uploading your models into the engine. That's because ADAPA encapsulates the PMML Converter. As a result, it understands PMML files generated by different vendors in all the different PMML versions. ADAPA goes a step further than the syntactic validation provided by the PMML Converter: it also validates PMML from a semantic perspective.

And so, once a model is successfully uploaded in ADAPA, it is syntactically and semantically sound.

You can benefit from ADAPA today by signing up for your private ADAPA instance on the Amazon Cloud. You can also sign up for the ADAPA free trial.

Start executing your models right now!

Friday, April 22, 2011

ADAPA 3.4 Released: Association Rules

PMML, the Predictive Model Markup Language, allows for a predictive analytic model to be developed in one application and easily moved to another for production deployment and execution.

Once a predictive model is exported from a PMML-compliant tool such as SAS EM, SPSS/IBM, R, KNIME, RapidMiner, ... it can be uploaded directly into the Zementis ADAPA engine which makes the model available for execution via its console or as a web-service. ADAPA can already import most of the techniques defined by the PMML standard and now, with the release of ADAPA 3.4, we have expanded it even further to cover Association Rules.

Association Rules


Analysts always want to explore rules and relations between variables in large data sets. The learning mechanism of Association Rules serves this purpose. The rules discovered by Association Rules often provide useful information for marketing activities. For example, they can be used for discovering relations between products in transaction data in supermarkets. In this way, an association rule can be found to indicate that if a customer purchases beef and cat food together, he/she is most likely to also buy tuna cans.

An Association Rule Model in PMML is represented by the element "AssociationModel". ADAPA and PMML support two different formats for representing Association Rules. These are "rectangular" and "transactional". To learn more about these two formats, please read our posting: Association Rules in ADAPA.

PMML Converter

In addition, with the release of ADAPA 3.4, we were able to make ADAPA even better when it comes to converting and correcting PMML files. This is yet another big step towards true interoperability. In many cases, even if the model has syntactic or semantic problems, ADAPA automatically corrects known issues for models exported from several model development environments. For that, we analyze PMML files submitted to us by our partners and clients.


If for any reason, your PMML code cannot be converted or corrected automatically, feel free to contact us. We are here to help!

Wednesday, April 20, 2011

Predictions in the Cloud with ADAPA

ADAPA is the first standards-based, real-time predictive decisioning engine available on the market and the first scoring engine accessible on the Amazon Cloud as a service. ADAPA on the Cloud combines the benefits of Software as a Service (SaaS), the scalability of cloud computing and the extensive feature set of ADAPA on Site.

What do you mean by standards-based?

ADAPA executes predictive models represented in PMML (Predictive Model Markup Language). PMML is the standard for representing predictive models currently exported from all major commercial and open-source data mining tools.

With PMML, you can basically build your model in IBM SPSS, SAS, R, KNIME, ... export it as a PMML file and upload it in ADAPA. Once you do that, your model is ready to be used from anywhere via web-services. You can even execute your models directly from within Excel.




Is ADAPA really fast?


ADAPA is very fast. We recently published a study on the ACM SIGKDD Newsletter in which we show that ADAPA can easily score thousands of transactions per second. In the High-CPU Extra-Large instance, ADAPA can score 300 million transactions per hour. FAST!
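To relate the hourly figure to the per-second claim, the quoted throughput can be converted with simple arithmetic (a quick check in Python, using only the number quoted above):

```python
TRANSACTIONS_PER_HOUR = 300_000_000
SECONDS_PER_HOUR = 60 * 60

per_second = TRANSACTIONS_PER_HOUR / SECONDS_PER_HOUR
print(f"{per_second:,.0f} transactions per second")  # prints "83,333 transactions per second"
```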

What kind of models does it support?

Modeling techniques currently supported are:
  • Neural Networks
  • Association Rules
  • Support Vector Machines
  • Naive Bayes Classifiers
  • Ruleset Models
  • Clustering Models (including Two-Step Clustering)
  • Decision Trees
  • Regression Models (including Cox Regression Models)
  • Scorecards

How about data pre- and post-processing?

ADAPA transforms your raw data into meaningful feature detectors before scoring it. It post-processes the output of your predictive model so that it conforms to your requirements. ADAPA supports all the PMML built-in functions and data manipulations (as well as user defined functions). To learn more about how to represent pre- and post-processing operations in PMML, please take a look at our PMML data manipulation primer or simply contact us.

Can I combine predictive analytics with business rules?


ADAPA provides seamless integration of predictive analytics and rules. Simply put, ADAPA allows data-driven insight and expert knowledge to be combined into a single and powerful decision strategy. That is because, in addition to a sophisticated predictive analytics engine, ADAPA also incorporates the full functionality of a rules engine.

How do I pay for it? Is it expensive?

Once you sign up for ADAPA on the Cloud through Amazon.com, ADAPA charges show up on your credit card bill. Amazon handles all the billing. You can even use the same account you use to buy books. ADAPA on the Cloud does not cost an arm and a leg. Check out our pricing! And, the best part, you pay only for what you actually use.

Tuesday, April 19, 2011

Webinar: Deploying Predictive Analytics with PMML, Revolution R, and ADAPA


Presented: Wednesday, April 13th, 2011
Presenters: Alex Guazzelli, Vice President - Analytics, Zementis Inc.
David Smith, Vice President - Marketing, Revolution Analytics

View the on-demand replay of the webinar

Download the webinar presentation

The rule in the past was that whenever a predictive model was built in a particular development environment, it remained in that environment forever, unless it was manually recoded to work somewhere else. This rule has been shattered with the advent of PMML (Predictive Model Markup Language). By providing a uniform standard to represent predictive models, PMML allows for the exchange of predictive solutions between different applications and various vendors.

In this joint webinar from Revolution Analytics and Zementis, you’ll learn:
  • How to use data to create predictive models in the R language, with Revolution R Enterprise
  • The purpose of the PMML standard, and the predictive models it supports
  • How to export predictive models from R using PMML
  • How to score predictive models in PMML using ADAPA, from within Microsoft Excel and in the cloud
This webinar will be suitable for any technology professional who has an interest in predictive models and wishes to learn more about Revolution R, PMML and ADAPA.

Download the whitepaper:
Deploying Advanced Analytics Using R & PMML

Monday, April 18, 2011

ADAPA and PMML Association Rules

An association rule describes a relation between one group of objects and another group of objects. This may be said in another way: "If a condition A is satisfied, then so is condition B". As an example, consider items people purchase in a grocery store. Suppose most people who buy milk also buy juice. Also, most people who buy chicken and beef also buy bread. Then two association rules exist:

[Milk] --> [Juice]
If you buy milk, then you will also buy juice

[Chicken,Beef] --> [Bread]
If you buy chicken and beef, then you will also buy bread
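The two rules above can be written down as data and applied to a shopping basket. The Python sketch below is a deliberately simplified illustration, with names invented for this example: a rule fires when every item of its antecedent is in the basket. It is not a description of how ADAPA evaluates an association model.

```python
# Each rule maps an antecedent item set to a consequent item set.
RULES = [
    (frozenset({"Milk"}), frozenset({"Juice"})),
    (frozenset({"Chicken", "Beef"}), frozenset({"Bread"})),
]


def recommend(basket, rules=RULES):
    """Return items implied by every rule whose antecedent is in the basket."""
    basket = set(basket)
    implied = set()
    for antecedent, consequent in rules:
        if antecedent <= basket:  # all antecedent items were purchased
            implied |= consequent - basket  # skip items already in the basket
    return implied
```

With these rules, `recommend(["Chicken", "Beef"])` yields `{"Bread"}`, while `recommend(["Chicken"])` yields an empty set because only part of the antecedent is present.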



Association Rules in ADAPA


Normally, as in a typical regression model, ADAPA reads one data row (or record) at a time and gives back one output. In particular, it reads one input value for each of the input variables required by the model, which are positioned in different columns but in the same row, and outputs the result by appending the predicted value or score to the last column. For Association rules, on the other hand, ADAPA has to read multiple items of a single transaction before it can produce an output. As suggested by the example above, it needs to read in "Chicken" and "Beef" before producing "Bread" as an output. In the usual data format, the entire transaction will have its unique value in one column. For association rules, ADAPA relies on two different methods to read all the items under a single transaction.

These two methods allow for the data to be expressed either in a "rectangular" format or in a "transactional" format.

Rectangular Format

The rectangular format lists all possible items of a single transaction in a separate column for each row. For the above example, if customers purchase from a list of five possible items: Milk, Juice, Chicken, Beef, and Bread, the input data might be represented as:

Milk,Juice,Chicken,Beef,Bread
1,1,0,0,0
0,0,1,1,1

Note that the first row specifies the header, while the third row, for example, specifies that Chicken, Beef and Bread were purchased together. Of course, it is not clear from the data alone whether chicken and beef imply bread, or whether chicken implies bread and beef; but together with the PMML file, the scoring engine is able to deduce the correct relationships. And so, for a "rectangular" data file, the output is added to the same row as an extra column.
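A short Python sketch shows how a rectangular file maps back to sets of purchased items. It uses only the standard `csv` module; the function name is made up for this example, and treating the flag `1` as "purchased" follows the sample data above.

```python
import csv
import io

RECTANGULAR = """Milk,Juice,Chicken,Beef,Bread
1,1,0,0,0
0,0,1,1,1
"""


def read_rectangular(text):
    """Return one set of purchased items per data row."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {item for item, flag in row.items() if flag == "1"}
        for row in reader
    ]


baskets = read_rectangular(RECTANGULAR)
# baskets[0] is {"Milk", "Juice"}; baskets[1] is {"Chicken", "Beef", "Bread"}
```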

The PMML file for each format is different as well. For a "rectangular" PMML file, all the possible values or items are defined as different fields. And so, these are defined as different "MiningFields" under the "MiningSchema" element. For the example above, instead of a single "MiningField" for the entire purchase, one would have five "MiningFields": Milk, Juice, Chicken, Beef, and Bread, as follows:

  
<MiningSchema>
<MiningField name="Milk" usageType="active"/>
<MiningField name="Juice" usageType="active"/>
<MiningField name="Chicken" usageType="active"/>
<MiningField name="Beef" usageType="active"/>
<MiningField name="Bread" usageType="active"/>
</MiningSchema>

PMML Example - Association Rules in Rectangular Format

For an example of a PMML file and its corresponding data file in rectangular format, click HERE.

Transactional Format

The "transactional" format, on the other hand, allows for the input data to be specified in two columns: the first one is the identifier and the second one contains the possible items. For the example above, the data file might be represented as:

ID,item
1,Milk
1,Juice
2,Chicken
2,Beef
2,Bread

The identifier (column "ID") indicates which items belong together. And so, in this example, ID = 1 specifies that the first two items (Milk and Juice) belong to the same input group or transaction, while ID = 2 indicates that Chicken, Beef and Bread belong to a different group. In this case, for the "transactional" data file, the predicted value is added as an extra column in the first row of each group only.
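Reading transactional data therefore means grouping rows by their identifier. Here is a minimal Python sketch of that grouping, with the sample rows written as (identifier, item) pairs; the function and variable names are illustrative only.

```python
# (transaction ID, item) pairs, in the order they appear in the data file.
ROWS = [
    ("1", "Milk"),
    ("1", "Juice"),
    ("2", "Chicken"),
    ("2", "Beef"),
    ("2", "Bread"),
]


def group_transactions(rows):
    """Collect the items of each transaction, keyed by its identifier."""
    groups = {}
    for transaction_id, item in rows:
        groups.setdefault(transaction_id, []).append(item)
    return groups


transactions = group_transactions(ROWS)
# {"1": ["Milk", "Juice"], "2": ["Chicken", "Beef", "Bread"]}
```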

A "transactional" PMML file defines two "MiningFields". One is of type "group", which indicates which group the items belong to. The second is of type "active", which includes, as in our example, all the possible items that were purchased. Note that it is not necessary to list all items one by one. And so, the "MiningSchema" in a "transactional" PMML file might look like:


<MiningSchema>
<MiningField name="ID" usageType="group"/>
<MiningField name="item" usageType="active"/>
</MiningSchema>

In this case, the rows with the same "ID" belong together: since Milk and Juice in our example both have ID = 1, they both are in the same group. The second column, titled "item" in the data file, lists all the items for that group: Milk and Juice. One can thus read the first group as: “Milk and juice are purchased together”.
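Because the two formats differ in their "MiningSchema", the layout of the input data can be guessed from the usage types alone. The Python sketch below is a heuristic written for illustration, not ADAPA's logic: it assumes schema fragments without an XML namespace and treats the presence of a "group" field as the sign of a transactional file.

```python
import xml.etree.ElementTree as ET

RECTANGULAR_SCHEMA = """<MiningSchema>
<MiningField name="Milk" usageType="active"/>
<MiningField name="Juice" usageType="active"/>
</MiningSchema>"""

TRANSACTIONAL_SCHEMA = """<MiningSchema>
<MiningField name="ID" usageType="group"/>
<MiningField name="item" usageType="active"/>
</MiningSchema>"""


def data_format(schema_xml):
    """Guess the input layout from the usage types in a MiningSchema."""
    schema = ET.fromstring(schema_xml)
    usage_types = [field.get("usageType") for field in schema.iter("MiningField")]
    return "transactional" if "group" in usage_types else "rectangular"
```

Here `data_format(TRANSACTIONAL_SCHEMA)` returns `"transactional"` and `data_format(RECTANGULAR_SCHEMA)` returns `"rectangular"`.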

PMML Example - Association Rules in Transactional Format

For an example of a PMML file and its corresponding data file in transactional format, click HERE.




