Every time I make a database marketing presentation, I first try to determine the level of sophistication of the audience. One question I usually ask is “How many of you know how a relational database works?”
Usually, fewer than half of the people raise their hands, even at an event such as the Direct Marketing Association’s biannual National Center for Database Marketing conferences, which attract advanced participants. Everyone knows that relational databases provide the most flexibility and are the preferred architecture for marketing databases, but at least half of them do not know how or why.
One of the most important aspects of creating a marketing database is to design it for efficient storage, data access and query processing, and understanding the relational database model is necessary to do that.
Flat Files
The first database management computer programs placed all the information about specific customers and their relationships into a single computer record called a “flat file.” For example, customers' names and addresses, what they have purchased, plus any other information such as demographics, all are in a single record.
But what if a customer had purchased three times? Flat files would work, but much information would be duplicated. Data elements such as name, customer number, address and ZIP code would be identical in all three records. Thus, the flat file would store the same information three times. From a storage standpoint, this is inefficient.
Scanning that file for information is also inefficient. Suppose you wanted to find and create a list of all the customers who had bought a specific product in a file of this nature. Flat file databases scan all the fields in each record from left to right, looking for “hits” in a search pattern.
Flat files, while easy to construct, result in inefficient storage and inefficient and slow processing. The relational database model solves this.
The Relational Database Model
If we were to reorganize the same data for a relational database, we would use a group of tables to store the information.
First, we would create a customer information table, into which we would place a single record for each customer containing all the information specific to the customer such as customer ID number, name and address.
Next, we would create a purchase table, into which we would place the customer ID number, the product type, the date of purchase and the price. If the customer bought three products, we would create three records in the purchase table to hold that information.
Using this architecture, we would have stored the same information about customers and their purchases in much less space because we would have stored the customers' names and addresses — the bulk of the information — only once. The process of reorganizing this information into separate tables is called “normalization.”
In our relational database, all the records have one common data element: the customer ID number. This number is a unique identifier because it is used only to identify a specific customer. This unique number is the key to making a relational database work.
Suppose we once again wanted to find and create a list of the customers who have purchased specific products. In our relational database, the system would first search the purchase table to find the products and store the customer ID numbers in the qualifying records in temporary memory. The system would then go to the customer table, extract only those records in the table that have the same customer numbers and put the customer information and the account information together. This process of combining the information found in multiple tables is called a “join.”
The same two tables also could be used in a reverse sequence to produce a different kind of report. If we wanted to find and list all of the products bought by a specific customer, the system would first search the customer table to find the customer, then use the customer ID number to locate all the records in the purchase table that belong to the same customer, joining them to create the report.
Keep in mind that the examples above are tremendously simplified. In a real database, the records in both the customer table and the purchase table would contain many more fields. Plus, there would likely be additional tables containing information such as a history of the promotions sent to each customer, a record of customer transactions, customer demographics, perhaps a history of customer service activities and so on. Marketing databases can have dozens of tables.
One of the most important benefits of a relational database is the ability to add or edit data. Another table can be added; another field can be added to an existing table, and values in existing fields can be changed, all without having to rebuild the database. This makes the architecture extremely flexible.
There is a downside to normalization: Joins take time. Relational databases are flexible and efficient in storing information, but if a query or report has to join too many tables, the answers will be slow in coming. For that reason, most smart marketers will try to balance the need for flexibility and efficiency with the need for speed, building the marketing database with as few separate tables as possible. Collapsing a relational database into fewer tables is called “denormalization,” or sometimes referred to as “summarization.”
Indexing
One trick to speed up relational databases is to “index” the fields that are likely to be in common use, a way of creating pointers that can help the system quickly find the records in the database having each of the values to be found in a specific field. The technique can help, but it also has drawbacks.
For example, indexing works well on fields with only a few discrete values such as gender, which would have only three values — male, female and unknown. But there is no speed gain if the field has 100 or more values. So a continually variable field like account balance, which could theoretically have a million or more values, does not benefit from indexing. An even greater problem with indexing is its effect on storage requirements. A fully indexed database can easily balloon up to five times its original size. So indexing, though helpful in some situations, is not always the way to solve performance problems.
Bit Mapping
Sometimes, it is possible to treat data elements as bits, rather than bytes, a data organization referred to as “bit mapping.” A byte consists of eight bits, so storing and accessing data at the bit level can make processing eight times faster.
Inverted Files
Some proprietary vendor database offerings use an “inverted” file structure to speed query processing. In an inverted file, the system would perform a horizontal search of the first record in the database looking for the field it wants, then go through the rest of the file vertically, searching only that field.
This search pattern can produce lightning-speed results, sometimes returning an answer in just a few seconds against a multimillion-record database. And since the query searches only the fields it needs to satisfy the query rather than all the fields in each record, the length of the records or the number of fields in those records has no effect on speed.
So why don’t all databases use an inverted file structure? There are two major drawbacks. First, inverted files don’t offer much in the way of flexibility. If you wanted to add some information to the database, you couldn’t just create or edit a table as you would if the database were relational — you would have to rebuild the entire database.
Second, inverted files are typically not compliant with open database connectivity. The ODBC standard allows applications to share information, a key requirement of today’s analytical programs. So using an inverted database management system usually limits the user to the reporting functions that came bundled with the system. To get data into other analytical applications requires time-consuming exporting of data sets that meet the query criteria.
The most advanced systems sometimes use a combination of these technologies to satisfy user needs. One key to getting the maximum benefit from a marketing database is to understand enough about how the data is processed to make the right judgments about data organization before the database is built.