Monday, February 22, 2010
Scorecards in PMML: A Primer
Scorecards are extremely popular, since they provide a clear and effective way to predict outcomes in a variety of situations. By clear I mean that the logic behind the scores obtained via a scorecard can be easily understood and appreciated. Scorecards are effective for situations in which you want to predict the probability of someone or something being "bad" or "good". These probabilities can then be readily used for decision making.
Scorecards, like any data mining model, contain a set of input fields which are used to predict a certain target value. This prediction can be seen as an assessment of a prospect, a customer, or a scenario for which an outcome is predicted based on historical data. In a scorecard, input fields, also referred to as characteristics (for example, "Age"), are broken down into attributes (for example, the "20-29" and "30-39" age groups), each with a specific partial score associated with it. These scores represent the influence of the input attributes on the target and are readily available for inspection. For example, a high partial score for a particular attribute could imply a heavy dependence of the target value on that attribute. Partial scores are then summed up so that an overall score can be obtained for the target value (is it good? Or, is it bad?).
ADAPA provides two different ways to represent scorecards. The first is through rules, as described in the ADAPA Scorecard Guide; the second, described here, is through PMML.
Given that PMML does not offer a specific scorecard element, we use a RegressionModel element to implement different score allocation strategies and to compute the overall score. More specifically, we show here how to represent different attributes (categorical or continuous ... and complex) and their corresponding partial scores through data transformations and built-in functions (see tutorial on data processing in PMML).
Score Allocation for Categorical Attributes
Typical score allocation for categorical attributes is done by associating a partial score with each attribute. In the PMML code shown below, input field "var1" may contain one of the following values (or attributes): "positive", "negative", and "neutral", each of which is assigned a partial score (see table below for score allocation details). Note that the code also accounts for missing values. In the PMML example, the resulting partial score is assigned to derived variable "derivedVar1".
Note that for categorical attributes, we simply use the MapValues element to implement score allocation. If the input field consists of a large set of attributes, score allocation can be easily implemented by using the element TableLocator.
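As a minimal sketch of this kind of mapping, the fragment below assigns a partial score to each value of "var1". The partial scores (10, -5, 0) and the column names "attribute" and "score" are placeholders chosen for illustration, not the values from the score allocation table; missing and unlisted inputs are assumed to receive a partial score of 0.

```xml
<!-- Sketch only: the partial scores and column names are illustrative assumptions -->
<DerivedField name="derivedVar1" dataType="double" optype="continuous">
  <!-- mapMissingTo covers a missing var1; defaultValue covers values not listed in the table -->
  <MapValues outputColumn="score" dataType="double" mapMissingTo="0" defaultValue="0">
    <!-- Match input field var1 against the "attribute" column of the table -->
    <FieldColumnPair field="var1" column="attribute"/>
    <InlineTable>
      <row><attribute>positive</attribute><score>10</score></row>
      <row><attribute>negative</attribute><score>-5</score></row>
      <row><attribute>neutral</attribute><score>0</score></row>
    </InlineTable>
  </MapValues>
</DerivedField>
```

Each row pairs one attribute with its partial score, so adding an attribute to the scorecard amounts to adding one row to the table.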
Score Allocation for Continuous Attributes
In the PMML code shown below, continuous input field "var2" has been discretized into three ranges or attributes: "less than 100", "greater or equal to 100 and less than 200", and "greater than 200" (see table for score allocation details). Note that the code also accounts for missing values. In the PMML example, the resulting partial score is assigned to derived variable "derivedVar2a".
Note that for continuous attributes, we simply use the Discretize element to implement score allocation.
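A hedged sketch of such a discretization could look like the fragment below. The bin scores (5, 10, 15) and the score of 0 for missing values are illustrative assumptions, not the values from the score allocation table. The intervals follow the three ranges exactly as worded above, which means a value of exactly 200 falls into no bin and receives the defaultValue.

```xml
<!-- Sketch only: the bin scores are illustrative assumptions -->
<DerivedField name="derivedVar2a" dataType="double" optype="continuous">
  <!-- mapMissingTo covers a missing var2; defaultValue covers values outside every bin -->
  <Discretize field="var2" dataType="double" mapMissingTo="0" defaultValue="0">
    <!-- var2 < 100 -->
    <DiscretizeBin binValue="5">
      <Interval closure="openOpen" rightMargin="100"/>
    </DiscretizeBin>
    <!-- 100 <= var2 < 200 -->
    <DiscretizeBin binValue="10">
      <Interval closure="closedOpen" leftMargin="100" rightMargin="200"/>
    </DiscretizeBin>
    <!-- var2 > 200 -->
    <DiscretizeBin binValue="15">
      <Interval closure="openOpen" leftMargin="200"/>
    </DiscretizeBin>
  </Discretize>
</DerivedField>
```

The closure attribute controls whether each boundary belongs to the bin, so changing the last interval to closure="closedOpen" would make it include 200.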
Score Allocation for Complex Attributes
If the attributes are complex, built-in functions can be used to implement score allocation. The PMML code shown below uses several built-in functions to implement a complex score allocation (see table for details). As in the previous score allocation examples, this also accounts for missing values. In the PMML example, the resulting partial score is assigned to derived variable "derivedVar2b".
Note that we are using the built-in function IF-THEN-ELSE in conjunction with arithmetic operators to implement the necessary logic. Built-in functions in PMML are very powerful and can be used to represent a variety of complex score allocation strategies.
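To make the idea concrete, here is a small sketch that nests the PMML built-in function "if" with "isMissing", "lessThan", and the "*" operator. The rule itself is hypothetical and is not the allocation from the table: a missing "var2" scores 0, a value below 100 scores 0.1 times "var2", and any other value scores a flat 10.

```xml
<!-- Sketch only: the allocation rule and its constants are hypothetical -->
<DerivedField name="derivedVar2b" dataType="double" optype="continuous">
  <Apply function="if">
    <!-- IF var2 is missing THEN 0 -->
    <Apply function="isMissing">
      <FieldRef field="var2"/>
    </Apply>
    <Constant dataType="double">0</Constant>
    <!-- ELSE evaluate a nested if -->
    <Apply function="if">
      <!-- IF var2 < 100 THEN 0.1 * var2 -->
      <Apply function="lessThan">
        <FieldRef field="var2"/>
        <Constant dataType="double">100</Constant>
      </Apply>
      <Apply function="*">
        <Constant dataType="double">0.1</Constant>
        <FieldRef field="var2"/>
      </Apply>
      <!-- ELSE 10 -->
      <Constant dataType="double">10</Constant>
    </Apply>
  </Apply>
</DerivedField>
```

Under this assumed rule, a "var2" of 50 would yield a partial score of 5. The missing-value check comes first so that the comparison and the multiplication are only reached when "var2" has a value.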
Computing the Overall Score
The score allocation examples shown here involve input attributes related either to "var1", which is a categorical field, or to "var2", which is continuous. In each example, a PMML transformation assigns the partial score for the matching attribute to a derived field: "derivedVar1", "derivedVar2a", or "derivedVar2b".
Finally, as shown in the PMML code below, the sum of all partial scores is implemented via a regression table in which all regression coefficients are set to 1. Note also that the score allocation for every attribute is represented as a transformation placed inside the LocalTransformations element.
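The overall structure can be sketched as follows. The model name "ScorecardSketch", the target field name "overallScore", and the intercept of 0 are assumptions made for illustration, and the derived field definitions are left as a placeholder comment, so this is a skeleton rather than a complete model.

```xml
<!-- Sketch only: modelName, the target field name, and the zero intercept are assumptions -->
<RegressionModel modelName="ScorecardSketch" functionName="regression">
  <MiningSchema>
    <MiningField name="var1"/>
    <MiningField name="var2"/>
    <MiningField name="overallScore" usageType="predicted"/>
  </MiningSchema>
  <LocalTransformations>
    <!-- DerivedField elements defining derivedVar1, derivedVar2a, and derivedVar2b
         (one per score allocation) go here -->
  </LocalTransformations>
  <!-- With every coefficient set to 1, the regression output is the sum of the partial scores -->
  <RegressionTable intercept="0">
    <NumericPredictor name="derivedVar1" coefficient="1"/>
    <NumericPredictor name="derivedVar2a" coefficient="1"/>
    <NumericPredictor name="derivedVar2b" coefficient="1"/>
  </RegressionTable>
</RegressionModel>
```

A nonzero intercept could serve as a base score added to every record, if the scorecard calls for one.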
A file containing the full PMML example shown here, as well as data for model verification, can be found on the PMML Examples page of the Zementis website.
There is a whole lot of information posted on different websites about Scorecards, PMML and ADAPA. If you want to learn more about how to represent data processing in PMML, including different ways to perform score allocation for complex attributes, make sure to check our PMML Data Processing Primer.
For a more detailed list of ADAPA features, feel free to take a tour of ADAPA on the Cloud or check what is inside the ADAPA box. If you are still unsure about any of the features, or would like to learn more about them and how ADAPA can represent scorecards using rules, drop us a note or give us a call. You can find our contact information on the contacts page of the Zementis website.