What You've Always Wanted to Know About Modeling Lists but Didn't Ask

Response rates are dropping, lists are suffering from fatigue, old selects just aren't working like they used to and the universe of available names that you can mail profitably is slipping. Many mailers are turning to new ways to combat this decline, and multivariate regression models have become an important device in the marketers' toolbox.

By modeling a prospecting list, you can change unprofitable lists into profitable ones and profitable lists into stellar performers. Many modeling techniques are available. Some are better than others for specific business objectives. Some are completely inappropriate. You also need to know the lingo so you can speak with your statistician in the same language.

Defining business objectives. Analysis begins with a definition of the business challenges or objectives. In direct mail, that's usually (but not always) increased response. The statistician then determines, based upon the objectives, which analytical methods are most appropriate for the job.

In many ways, the statistician's role is similar to that of a physician diagnosing a patient. The physician must evaluate the situation and prescribe the most effective remedy. The statistician must evaluate business objectives, as well as the information available, and prescribe a correct method for achieving these objectives. An incorrect technique can lead to ineffective or even disastrous results.

EDA shows the way. Creating a good model starts with an exploratory data analysis. EDA helps the statistician understand the “personality” of the data to be modeled (in this example, the responses) and helps improve the final model's predictive power.

Comparing the customers generated by your mailing against the lists mailed, for instance, tells you who responded. Matching the records mailed and the responses to demographic and transactional/promotional information, then examining the interplay between the variables that give you the greatest lift in response, is what modeling is all about.

To gain knowledge about your customers' demographic profile and what variables are important in predicting response, a good statistician will apply some combination of the following EDA techniques in developing a model for you:

* Univariate analysis — measures the effect of single variables (such as age, income, number of employees, SIC code) on our target variable. For example, as income increases, does response increase or decrease?

Other factors to evaluate include whether the predictor variable contains enough data to be statistically valid, contains valid information or garbage values, or needs to be transformed into a more “modeler friendly” format. Your statistician may do a frequency distribution and other descriptive measures to see if patterns in the data emerge.
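To make the income question concrete, here is a minimal sketch of a univariate check in Python. The records, the income cutoffs and the names `mailed`, `income_band` and `response_by_band` are all invented for illustration; a real analysis would run against your own mail file and use bands your statistician chooses.

```python
from collections import defaultdict

# Hypothetical mailing records: (household income in dollars, responded?)
mailed = [
    (28_000, False), (31_000, False), (35_000, True), (42_000, False),
    (48_000, False), (55_000, True), (61_000, True), (67_000, False),
    (74_000, True), (88_000, True), (95_000, True), (120_000, False),
]

def income_band(income):
    """Assign an income to an illustrative band; the cutoffs are assumptions."""
    if income < 40_000:
        return "under $40K"
    if income < 70_000:
        return "$40K-$70K"
    return "$70K and up"

def response_by_band(records):
    """Return {band: (pieces mailed, responders, response rate)}."""
    counts = defaultdict(lambda: [0, 0])
    for income, responded in records:
        band = income_band(income)
        counts[band][0] += 1               # pieces mailed in this band
        counts[band][1] += int(responded)  # responders in this band
    return {band: (n, r, r / n) for band, (n, r) in counts.items()}

rates = response_by_band(mailed)
for band in ("under $40K", "$40K-$70K", "$70K and up"):
    n, r, rate = rates[band]
    print(f"{band:>12}: mailed {n}, responded {r}, rate {rate:.0%}")
```

With a dozen records per file, of course, none of these rates would be statistically valid. The point is only the shape of the table: one variable, banded, against the target.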

* Bivariate analysis — this is similar to univariate analysis, except you are doing cross tabulations of two variables (e.g., SIC by number of employees) to see if more complex patterns emerge. This may reveal relationships not detectable through univariate analysis.

* Factor analysis — sometimes referred to as “dimensionality reduction analysis,” factor analysis helps to group similar variables such as home value and income to identify if they are statistically redundant. Redundant variables can be eliminated from the analysis or grouped together into composite measures.

Suppose that within your marketing database you have information on product purchases for more than 100 products. You are having a model developed to predict responsiveness to a particular catalog promotion and you know that product purchase history is likely to be a significant attribute in predicting responsiveness. In theory, each product purchased could be used as a predictor variable. But this would yield more than 100 possible predictors, far too many to include in a single model. Also, many of these products may have an affinity to one another and thus be somewhat redundant in predicting response. Factor analysis may help by grouping the products into a smaller number of more manageable categories, thereby reducing the number of predictor variables and leveraging more information than if the variables were considered separately.
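A full factor analysis needs a statistics package, but the redundancy it looks for can be seen with a plain correlation check. The sketch below is that simpler stand-in, not factor analysis itself: it flags pairs of variables that move together closely enough to be candidates for grouping. The data, the variable names and the 0.9 threshold are assumptions made for the example.

```python
from itertools import combinations
from math import sqrt

# Hypothetical customer attributes; all values are invented for illustration.
variables = {
    "income_k":      [35, 48, 52, 61, 75, 90, 110, 130],
    "home_value_k":  [140, 180, 210, 230, 300, 340, 420, 500],
    "years_on_file": [6, 1, 9, 3, 2, 8, 4, 7],
}

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def redundant_pairs(data, threshold=0.9):
    """List variable pairs whose absolute correlation meets the threshold."""
    pairs = []
    for a, b in combinations(data, 2):
        r = pearson(data[a], data[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs

pairs = redundant_pairs(variables)
for a, b, r in pairs:
    print(f"{a} and {b} look redundant (r = {r})")
```

In this made-up file only income and home value are flagged, which matches the example above: the two could be combined into one composite measure rather than entered separately.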

* Cluster analysis — factor analysis is commonly, though by no means exclusively, used as a precursor to cluster analysis. Cluster analysis evaluates a large, heterogeneous population of businesses or consumers and breaks it into smaller segments with like characteristics. The primary benefit of cluster analysis is that it makes it easier to target these more homogenous, well-defined universes.

Suppose you profile your customer base and learn that the average number of employees at your customer locations is 75. It may be that not one of your customers has 75 or even close to 75 employees. Perhaps roughly one-half of your customer base consists of smaller businesses with 25 employees on average, and one-half of your customer base consists of larger businesses with 125 employees on average. Targeting your customer base as a whole would lead you to think that companies with 75 employees are your ideal target, when in fact they aren't a core market.

Cluster analysis extracts these submarkets within your database, making it easier to target — both from a modeling perspective as well as a creative marketing perspective — these dynamic groups of customers.

Actually, cluster analysis would use several variables (not just number of employees) to determine the various market segments within a database. The greater the number of variables to evaluate, the more clusters you are likely to find.
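The 75-employee trap is easy to reproduce. The sketch below uses k-means, one common clustering method chosen here as an assumption (the technique your statistician picks may differ), and it uses a single variable only to keep the example short. The employee counts and the function name `kmeans_1d` are hypothetical.

```python
# Hypothetical employee counts at ten customer locations.
employees = [15, 20, 25, 30, 35, 105, 115, 125, 135, 145]

def kmeans_1d(values, centers, max_iter=100):
    """Simple one-variable k-means: assign each value to its nearest
    center, move each center to the mean of its group, and repeat
    until the centers stop moving."""
    groups = [[] for _ in centers]
    for _ in range(max_iter):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center if a group ends up empty.
        new_centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

overall_mean = sum(employees) / len(employees)
# Starting centers at the smallest and largest values is an arbitrary choice.
centers, groups = kmeans_1d(employees, [min(employees), max(employees)])

print(f"Average across the whole file: {overall_mean:.0f} employees")
for center, group in zip(centers, groups):
    print(f"Segment averaging {center:.0f} employees: {group}")
```

The file-wide average is 75, yet no customer is within 30 employees of it. The two segments, averaging 25 and 125, are the real markets.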

* Discriminant analysis — discriminant analysis tests whether the groups being used are truly distinct by finding the combination of variables that best separates them.

The idea here is that if the groups perform the same, there's no reason to keep them apart. Discriminant analysis often is used to cross-validate cluster groupings identified within a database or assign new customers into such clusters.

* CHAID analysis — short for “Chi-Square Automatic Interaction Detector,” CHAID is a technique used to segment a population with respect to its relationship to a target variable, such as response. Responders are compared with nonresponders for various attributes to determine whether a statistically significant difference exists.

CHAID can be viewed as a “tree” with branches. The trunk is usually a variable within the database found to have the greatest impact on the target variable. The branches are different variables and combinations of variables that produce an even greater impact on response. As you move up the branches, the impact on response intensifies.
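Choosing the trunk comes down to a chi-square test on each candidate variable. The sketch below shows only that first step, under simplifying assumptions: every candidate has two categories, so the raw chi-square statistics can be compared directly, and all counts and variable names are invented. A full CHAID run also merges categories, adjusts significance levels and repeats the process down each branch.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the variable had no effect.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts. Each row is one category of the variable (yes, no);
# the columns are [responders, nonresponders].
candidates = {
    "prior_buyer": [[60, 440], [20, 480]],
    "homeowner":   [[45, 455], [35, 465]],
    "has_phone":   [[41, 459], [39, 461]],
}

scores = {name: chi_square(table) for name, table in candidates.items()}
best = max(scores, key=scores.get)

CRITICAL_95 = 3.841  # chi-square cutoff at the 5% level, 1 degree of freedom
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    verdict = "significant" if score > CRITICAL_95 else "not significant"
    print(f"{name:>12}: chi-square {score:6.2f} ({verdict})")
print(f"Trunk of the tree: {best}")
```

Here prior purchase is the only variable that separates responders from nonresponders convincingly, so it becomes the trunk. The same test would then be repeated within the prior-buyer and non-buyer groups to grow the branches.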

Choosing the right method. Now that you've done your exploratory data analysis, it's time to pick a modeling method.

It is theoretically possible to get to a model that predicts a fantastic result but does not provide you with sufficient quantity to justify a mailing. This is why you must get involved in the modeling process. You must examine the interplay between variables that gives you the greatest lift for the quantity you must mail to have a profitable business. If you can't scale up your mailings, it won't matter that you've gotten the best response anyone ever had on the list.
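The trade-off between lift and quantity is usually read off a gains table. Here is a rough sketch of one, assuming a 100,000-name list that a model has already ranked into deciles. The response rates, the 60-cent cost per piece and the $40 margin per order are hypothetical numbers chosen only to show the arithmetic.

```python
# All figures below are assumptions for illustration.
LIST_SIZE = 100_000
COST_PER_PIECE = 0.60
MARGIN_PER_ORDER = 40.00
# Response rate in each model decile, best-scoring names first.
decile_response = [0.040, 0.028, 0.021, 0.017, 0.014,
                   0.012, 0.010, 0.008, 0.006, 0.004]

def gains_table(rates, list_size, cost, margin):
    """Cumulative quantity, response rate, lift and profit by mailing depth."""
    per_decile = list_size // len(rates)
    overall = sum(rates) / len(rates)  # response rate if you mail everyone
    rows, cum_orders = [], 0.0
    for depth, rate in enumerate(rates, start=1):
        cum_orders += rate * per_decile
        quantity = depth * per_decile
        cum_rate = cum_orders / quantity
        rows.append({
            "deciles": depth,
            "quantity": quantity,
            "cum_rate": cum_rate,
            "lift": cum_rate / overall,
            "profit": cum_orders * margin - quantity * cost,
        })
    return rows

rows = gains_table(decile_response, LIST_SIZE, COST_PER_PIECE, MARGIN_PER_ORDER)
best = max(rows, key=lambda row: row["profit"])

for row in rows:
    print(f"top {row['deciles']:>2} deciles: {row['quantity']:>7,} names, "
          f"response {row['cum_rate']:.2%}, lift {row['lift']:.2f}, "
          f"profit ${row['profit']:,.0f}")
print(f"Most profitable depth: top {best['deciles']} deciles")
```

Under these assumptions the top decile has the best response by far, but it is only 10,000 names. Profit peaks when you mail the top four deciles, because each of those still responds above the 1.5% break-even rate (60 cents divided by $40).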

There are three basic multivariate modeling techniques commonly used in direct marketing: linear regression, logistic regression and neural net models.

* Linear regression, usually fit by “least squares estimation,” finds the line that minimizes the sum of squared differences between actual and predicted values. Linear regression can be used on anything that can be averaged (such as dollars, lifetime value, average order size).

* Logistic regression uses a technique known as “maximum likelihood estimation,” which finds the model under which the outcomes you actually observed are most probable. Use logistic regression in situations where outcomes fall into categories: “yes” or “no,” response or nonresponse, or “better,” “same” or “worse.”

* Neural net is a technique that is good at finding patterns in the data that are nonlinear. Its strength is that it can pick up interactions among variables that are not easily detectable through traditional statistical methods and use those patterns to predict response.
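The difference between the first two techniques shows up in what each one predicts. The sketch below fits both with one predictor each, using only the Python standard library. The data are invented, and the logistic fit uses plain gradient ascent as a simple stand-in for the iterative solvers that statistical packages actually use; treat it as an illustration of the idea, not a production method.

```python
from math import exp, log

def fit_least_squares(x, y):
    """Slope and intercept minimizing the sum of squared prediction errors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x

def sigmoid(z):
    """Turn any number into a probability between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def log_likelihood(x, y, slope, intercept):
    """Log of the probability the model assigns to the observed outcomes."""
    total = 0.0
    for a, b in zip(x, y):
        p = sigmoid(intercept + slope * a)
        total += log(p) if b else log(1.0 - p)
    return total

def fit_logistic(x, y, rate=0.05, steps=20_000):
    """Maximum likelihood fit by gradient ascent on the log-likelihood."""
    slope = intercept = 0.0
    n = len(x)
    for _ in range(steps):
        errors = [b - sigmoid(intercept + slope * a) for a, b in zip(x, y)]
        intercept += rate * sum(errors) / n
        slope += rate * sum(e * a for e, a in zip(errors, x)) / n
    return slope, intercept

# Linear: hypothetical prior orders vs. average order size in dollars.
prior_orders = [1, 2, 3, 4, 5]
order_size = [52, 58, 71, 75, 84]
slope, intercept = fit_least_squares(prior_orders, order_size)
print(f"Each prior order adds about ${slope:.2f} to average order size")

# Logistic: hypothetical months since last purchase vs. responded (1) or not (0).
months = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
responded = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
b1, b0 = fit_logistic(months, responded)
for m in (1, 6, 12):
    print(f"{m:>2} months since purchase: response probability {sigmoid(b0 + b1 * m):.0%}")
```

The linear model returns a dollar amount; the logistic model returns a probability of response that always stays between 0% and 100%, which is why it suits yes-or-no outcomes.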

Do it yourself? There are a few software programs on the market that claim to do all the different types of regression modeling automatically. The problem is that a model can only be effective if you understand the data, the relationships among its variables and the type of methodology suited to the information you want to uncover.

So be careful, and remember most importantly to clearly communicate your business objectives or challenges. Only then can your statistician prescribe the appropriate medicine for your marketing challenges.

Stevan Roberts is president of Edith Roman Associates Inc., Pearl River, NY, a list brokerage and management, database and Internet marketing company.

Marc Fanelli, director of the analytical services division of infoUSA, Montvale, NJ, contributed to this column.