Mining gives meaning to complex data
New tools, techniques let IT more quickly and cheaply sift through critical data, while significantly improving the quality of the analysis.
- By John K. Waters
"Data mining goes beyond simply digging up new information. It helps you to understand what you've uncovered, and it allows you to take advantage of the patterns and associations you've found."
— Jeff Jones, senior program manager,
IBM Data Management Solutions Group
The enterprise is piling up data like a pack of squirrels on methamphetamines. Thanks to the data warehousing boom in recent years, the data assets of some organizations are now measured in terabytes. That is more than a trillion bytes.
Not surprisingly, organizations want to do something with all of the information they have collected. They want to apply it to better understand and serve their customers. They want to exploit it in order to cross-sell, up-sell and just generally sell more. They want to draw on it for insights into their internal processes. They want to use it to spot market and customer trends, and to discover waste and fraud. In short, they want to marshal their data assets into some kind of competitive business advantage.
Enter data mining, the silver bullet, the killer app, the mission-critical, must-have business process of the moment. Done right, a data mining initiative can provide an organization with the means to extract meaningful information from its data stores, ferret out hidden patterns, discover unseen relationships and turn unwieldy statistics into useful, relevant knowledge.
The roots of data mining technology can be found in the field of artificial intelligence (AI). In particular, three branches of AI research (neural networks, machine learning and genetic algorithms), all of which focused on emulating human perception and learning, have been incorporated into modern data mining tools.
Data mining is a complex process, and it can be expensive; but the current crop of data mining tools and applications is making it simpler and more cost-effective than ever to slice and dice your data. Still, if at the end of the day you are hoping for more than a pile of coleslaw, you have to plan your data mining initiative and invest the money, time and resources necessary to do it right.
|Coming of age
|"When standards begin to happen, it generally means a couple of things," said Jeff Jones, senior program manager, IBM Data Management Solutions Group. "The technology has matured enough for a standard to make sense, and customers and vendors are beginning to clamor for some stability and commonality they can bank on."
If what Jones says is true, the recent formation of The Data Mining Group (DMG) may very well augur the coming of age of the data mining process. The DMG (www.dmg.org/) is an independent, vendor-led association formed to develop data mining standards.
Among the group's recent innovations is the Predictive Model Markup Language (PMML). PMML is a language based on the eXtensible Markup Language (XML) that gives companies a way to define predictive models and to share those models between compliant vendors' applications.
According to the DMG literature, PMML provides applications with a vendor-independent method of defining models so that proprietary issues and incompatibilities "are no longer a barrier to the exchange of models between applications." PMML allows users "to develop models within one vendor's application, and use other vendors' applications to visualize, analyze, evaluate, or otherwise use the models."
Founding members of the organization include: Angoss Software Corp., IBM Corp., NCR Corp., Magnify Inc., Oracle Corp., The National Center for Data Mining (NCDM) at the University of Illinois at Chicago, and SPSS Inc.
The group is actively seeking new members. Interested parties can investigate further at the DMG Web site, or send E-mail to firstname.lastname@example.org for more information.
— John K. Waters
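The model exchange that PMML is meant to enable can be pictured with a small sketch. In the Python below, one function plays the vendor application that builds a model and writes it out as XML, and a second function plays a different application that reads the XML and scores a record without any of the first application's code. The element names follow the PMML vocabulary as commonly documented by the DMG, but the fragment is deliberately incomplete, has not been validated against any PMML schema, and the version number, field names and toy credit-risk tree are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET


def build_model_xml():
    """Producer side: describe a one-split decision tree as XML text."""
    pmml = ET.Element("PMML", version="4.4")  # version is an assumption
    ET.SubElement(pmml, "Header", description="toy credit-risk tree")

    # Declare the fields the model uses (hypothetical names).
    dictionary = ET.SubElement(pmml, "DataDictionary")
    ET.SubElement(dictionary, "DataField", name="income",
                  optype="continuous", dataType="double")
    ET.SubElement(dictionary, "DataField", name="risk",
                  optype="categorical", dataType="string")

    # Root node scores "high"; its child scores "low" when income > 50.
    tree = ET.SubElement(pmml, "TreeModel", functionName="classification")
    root = ET.SubElement(tree, "Node", score="high")
    ET.SubElement(root, "True")
    low = ET.SubElement(root, "Node", score="low")
    ET.SubElement(low, "SimplePredicate", field="income",
                  operator="greaterThan", value="50")
    return ET.tostring(pmml, encoding="unicode")


def score(model_xml, income):
    """Consumer side: read the XML and apply the tree to one income value."""
    root_node = ET.fromstring(model_xml).find("TreeModel/Node")
    for child in root_node.findall("Node"):
        pred = child.find("SimplePredicate")
        if pred.get("operator") == "greaterThan" and income > float(pred.get("value")):
            return child.get("score")
    return root_node.get("score")  # fall back to the root node's score


xml_text = build_model_xml()
print(score(xml_text, 72.0), score(xml_text, 30.0))
```

The point of the sketch is only that the consumer needs the XML text and nothing else; a real PMML document carries considerably more detail than this.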
Data mining (also known as Knowledge Discovery in Databases, or KDD) has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data." In their book, Data Mining Techniques for Marketing, Sales, and Customer Support (Wiley, New York City, 1997), authors Michael J.A. Berry and Gordon Linoff define it more elegantly as "... the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules." The whatis.com Web site defines it simply as "... the analysis of data for relationships that have not previously been discovered."
"Data mining means different things to different people," explained Jeff Jones, senior program manager of IBM's data management solutions group. "In the beginning, it referred to simple queries, OLAP and report writing. Today, it's much more widely viewed as the process of using software algorithms to find information in data that you can't find in traditional ways."
To Jones, data mining is "... the frontier where Ph.D.-caliber technologies are used to find patterns and associations, to classify things and to segment things into groups by similarities you might not find in any other way. It's a very high-end way to extend what humans can do to locate new things in their data."
Whatever your definition is, first and foremost, data mining is about finding something new.
"Data mining goes beyond simply digging up new information," Jones explained. "True data mining yields something that's very difficult to come by with traditional statistical data analysis techniques—you might call it meaning. It helps you to understand what you've uncovered, and it allows you to take advantage of the patterns and associations you've found."
Although they go hand in hand, data mining should not be confused with data warehousing. As savvy readers already know, a data warehouse is a central repository for all or significant parts of an enterprise's collected data. Data warehousing is a process of centralized data management and retrieval. Data warehousing is about capturing data; data mining is about extracting meaningful information from the data.
And remember, in this context, data is not information, which (and I hope I am not driving everyone nuts) is not knowledge. If "data" refers to the raw numbers, and "information" refers to the meaning extracted from those numbers, "knowledge" is the know-how, experience or acumen needed to utilize the information extracted from the data. Whew! (See "The data mining lexicon" for more brain-twisting data mining jargon.)
|The data mining lexicon
|As the practices and processes of data mining have evolved and proliferated, a substantial vocabulary of new and newly applied terminology has emerged. Some of these terms are well known, others are newly minted locutions you will want to add to your IT lexicon.
Associations—An association algorithm creates rules that describe how often events have occurred together. For example, when skateboarders buy helmets, they also buy kneepads 25 percent of the time.
Chi Square Automatic Interaction Detection (CHAID)—A decision-tree technique used for classification of a data set. Provides a set of rules that you can apply to a new (unclassified) data set to predict which records will have a given outcome. Segments a data set by using chi square tests to create multi-way splits. Preceded Classification and Regression Trees (CART), and requires more data preparation.
Classification—The process of finding a rule or formula for organizing data into classes. For example, a bank might want to classify customers seeking loans into groups based on their creditworthiness (high risk, medium risk and low risk). Decision trees and neural networks are examples of classification methods.
Classification and Regression Trees (CART)—A decision-tree technique used for classification of a data set. Provides a set of rules that can be applied to a new (unclassified) data set to predict which records will have a given outcome. Segments a data set by creating two-way splits. Requires less data preparation than Chi Square Automatic Interaction Detection (CHAID).
Cleaning or Cleansing—The process of ensuring that all values in a data set are consistent and correctly recorded. A key step when preparing data for a data mining activity. Obvious data errors are detected and corrected (e.g., improbable dates) and missing data is replaced.
Clustering—The process of dividing a data set into mutually exclusive groups. Like classification, clustering breaks a large database into different subgroups or clusters. It differs from classification because there are no pre-defined classes. The clusters are put together on the basis of similarity to each other, but it is up to the data miners to determine whether the clusters offer any useful insight. Consequently, this process is sometimes referred to as "unsupervised learning."
Decision Tree—A graphical representation of the relationships between a dependent variable and a set of independent variables. Decision Trees are one of the most popular data mining techniques in use today. They utilize techniques derived from statistical and artificial intelligence research to find correlations and groupings in data automatically.
Genetic Algorithm—A computer-based method of generating and testing combinations of possible input parameters to find the optimal output. It uses processes based on natural evolution concepts such as genetic combination, mutation and natural selection.
Model—In data mining, can be descriptive or predictive. A descriptive model helps in understanding underlying processes or behavior (for example, consumer behavior). A predictive model is an equation or set of rules that makes it possible to predict an unseen or unmeasured value (the dependent variable or output) from other, known values (independent variables or input).
Nearest neighbor method (k-nearest neighbor technique)—A technique that classifies a point by calculating the distances between the point and points in the training data set. It then assigns the point to the class that is most common among its k-nearest neighbors, where k is an integer.
Neural Network—Non-linear predictive model that learns through training and resembles biological neural networks in structure. Neural networks are based on a simplified model of how the human brain works. Sometimes referred to as Artificial Intelligence (AI).
Online Analytical Processing (OLAP)—Refers to array-oriented database applications that allow users to view, navigate through, manipulate and analyze multidimensional databases.
Pattern—A non-causal relationship between two variables. Data mining techniques include automatic pattern discovery that makes it possible to detect complicated non-linear relationships in data.
Rule Induction—The extraction of useful if-then rules from data based on statistical significance.
Visualization—Visualization tools display data graphically to facilitate better understanding of its meaning. Graphical capabilities range from simple scatter plots to complex, multidimensional representations.
Sources: Two Crows Corp. (www.twocrows.com), KDnuggets (www.kdnuggets.com/) and the Australian Academy of Science.
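To make one of the lexicon entries concrete, here is a minimal Python sketch of the nearest neighbor method as the entry describes it: measure the distance from a new point to every point in the training set, then assign the class most common among the k closest. The loan-applicant data, the two features and the risk labels are invented for illustration, and plain Euclidean distance is an assumption; a real tool would typically scale the features first.

```python
from collections import Counter
import math


def knn_classify(point, training, k=3):
    """Return the majority class among the k training points nearest to point."""
    # training is a list of (features, label) pairs
    by_distance = sorted(training, key=lambda row: math.dist(point, row[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical applicants: (income in $1,000s, years at current job) -> class
training = [
    ((20, 1), "high risk"),
    ((25, 2), "high risk"),
    ((30, 1), "high risk"),
    ((60, 8), "low risk"),
    ((70, 10), "low risk"),
    ((80, 12), "low risk"),
]

print(knn_classify((28, 2), training, k=3))
print(knn_classify((65, 9), training, k=3))
```

With this toy data the first applicant lands among the high-risk neighbors and the second among the low-risk ones. Choosing k is a judgment call: a small k follows local quirks in the data, while a large k smooths them away.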
Data mining tools
The nuts and bolts of data mining are not really new. Organizations have been using powerful computers to sift through piles of market research and grocery store scanner numbers for years. What is new is the accuracy of the analysis tools and the cost of the processes involved. As computing power continues to increase and disk storage space continues to grow, data mining becomes cheaper, easier and, some argue, better.
"You have to have a certain aptitude to dive into this stuff," said IBM's Jones. "You can't just turn on your computer and start pushing buttons. This is very sophisticated data analysis. But the tools and applications are doing more and more of it for you, guiding you through the process. The tools are definitely making the process easier."
Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of data mining tools are available today, including statistical products, machine-learning tools and neural networks.
Many of today's tools have graphical user interfaces (GUIs) and wizards that are much less intimidating than the old Unix command-line options. Still, according to David Nance, CIO at Kore, a Newport Beach, Calif.-based professional services provider, new data miners need training.
"It's not so much that they need to learn which buttons to push in particular software packages," commented Nance. "They need to understand the principles and best practices for successful data mining. You need people who can get their minds around the fundamental concepts." (See "Data mining tools" for a sampling of some current data mining tools.)
|Data mining tools
|The KDnuggets Web site (http://www.kdnuggets.com/), a veritable data-mining vein of gold, lists nearly 200 data mining products and service providers. Although not nearly as comprehensive, the following list should provide a sampling of some of the more popular data mining tools currently available.
Darwin, Oracle Corp., Redwood Shores, Calif.
A set of tools oriented toward classification and regression.
DataBase Mining Marksman, HNC Software Inc., San Diego, Calif.
Targeted at database marketing applications and sold as a combination of hardware and software. The hardware component is a standard PC with an accelerator board containing 16 parallel processors, allowing Marksman to quickly and automatically build many neural nets with different architectures in order to select the best. The product discovers relationships between attributes by computing relationship strengths between all pairs of fields.
DataCruncher, DataMind Corp., Redwood City, Calif.
Designed to detect customer attrition or "churn" problems, with a particular emphasis on the telecommunications industry. DataCruncher is a client/server tool that uses a proprietary model-building technique called Agent Network Technology to build tree-like classification models on complex data sets. A graphical user interface (GUI) that runs under Microsoft Excel automates much of the model-building process. The product uses Excel graphs to help visualize the results of a model. (Red Brick Systems has integrated a version of DataCruncher into its Red Brick Warehouse database.)
Data Mining Solution, SAS Institute Inc., Cary, N.C.
A SAS System module for data mining analysis. SAS provides a GUI with an extensive set of options for building the model. The current version includes the SAS Neural Network Application, and the SAS Decision Tree Application for building Chi Square Automatic Interaction Detection (CHAID)-based decision trees. Before generating the model, you can explore the data using the SAS/Insight visualization tool. Future versions are expected to include association discovery.
4Thought, Cognos Inc., Ottawa, Ontario
Designed to build regression and time-series models, but may be used for classification. The product uses neural nets to build these models, and utilizes a spreadsheet-style interface. Extensive deployment and model analysis capabilities are included. Predicted values can be exported in a number of formats, and a model can also be exported as a function to Excel, Lotus 1-2-3 and the SPSS statistical product.
KnowledgeSeeker, Angoss Software Corp., Toronto, Ontario
A desktop or client/server tool that uses decision trees for predictive models. A version of Chi Square Automatic Interaction Detection (CHAID) is used to predict categorical variables, and Classification and Regression Trees (CART) are used for continuous variables. Angoss provides a GUI for building the model and interactive facilities that let users explore the data by splitting a selected node in the tree or even forcing a particular split that might be of interest.
Intelligent Miner, IBM Corp., Somers, N.Y.
A comprehensive set of data mining tools for classification, association and sequence discovery, time series, clustering and regression. IBM supplies multiple technologies for classification (decision tree and neural net) and clustering (demographic and neural net); most of the algorithms have been 'parallel-ized' for scalability. Models can be built using either a GUI or an API. Not surprisingly, this product is tightly coupled with DB2, which must also be installed, but it does support input from sources such as ASCII files.
MineSet, Silicon Graphics Inc. (SGI), Mountain View, Calif.
A set of data mining tools that combines classification and association algorithms with visualization. Integrates data mining analytic tools with high-end visualization tools for user exploration and navigation of data sets and mining results. Data mining tools include the Association Rule Generator, Decision Tree Inducer for classification, Evidence Inducer, and Column Importance determination utility. All the algorithms, except association, are based on MLC++ technology developed at Stanford University.
— John K. Waters
As of press time, data miners can expect to pay anywhere from $10,000 for a solution that runs on a single platform to as much as several hundred thousand dollars for a multiplatform enterprise data mining package—and even up to $1 million per terabyte for the very largest systems. Clearly, data mining is big business, and it is still growing. IDC, Framingham, Mass., has predicted that market demand for data mining tools will grow from a reported $259 million in 1998 to $1.78 billion by 2003.
The types of data typically mined today include operational or transactional data, such as sales, cost, inventory, payroll and accounting; so-called non-operational data, such as industry sales, forecast data and macro-economic data; and meta data, which is data about the data itself, such as logical database design or data dictionary definitions.
The types of results you can expect the data mining process to yield include:
- Associations (associative mining)—one event can be correlated to another event (ice cream purchases and chocolate syrup).
- Sequences—instances of one event leading to another (the purchase of a bedspread followed by the purchase of a set of pillows). Results anticipate behavior patterns and trends.
- Classification—the recognition of patterns and a resulting new organization of data (customer profiles and buying behavior).
- Clustering—this term refers to finding and visualizing groups of facts that were not previously known. The information is grouped according to logical relationships or consumer preferences (market segments or consumer affinities).
- Forecasting—discovering patterns in the information that can lead to predictions about the future.
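The first result type, associations, lends itself to a short worked example. The Python sketch below counts how often two items appear together and reports two common measures for the rule "ice cream leads to chocolate syrup": support (the share of all baskets containing both items) and confidence (the share of ice cream baskets that also contain syrup). The baskets are made up, and the function name `rule_stats` is hypothetical rather than taken from any product mentioned here.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    total = len(transactions)
    with_antecedent = [t for t in transactions if antecedent in t]
    with_both = [t for t in with_antecedent if consequent in t]
    support = len(with_both) / total
    # Guard against dividing by zero when the antecedent never occurs.
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical grocery baskets
baskets = [
    {"ice cream", "chocolate syrup"},
    {"ice cream", "chocolate syrup", "cones"},
    {"ice cream", "cones"},
    {"ice cream"},
    {"bread", "milk"},
    {"chocolate syrup", "milk"},
    {"bread"},
    {"milk", "cones"},
]

support, confidence = rule_stats(baskets, "ice cream", "chocolate syrup")
print(f"support={support:.2f} confidence={confidence:.2f}")
```

Here two of eight baskets hold both items and two of the four ice cream baskets include syrup, so the rule has a support of 0.25 and a confidence of 0.50. An association tool runs this kind of count over every candidate pairing and keeps the rules that clear chosen thresholds.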
Cutter Consortium recently surveyed 94 companies worldwide on their adoption of data warehousing, as well as their overall data mining and data warehouse administration and management. Overall, 36 percent of the firms surveyed are currently using data mining tools or planning to deploy them in the next year. Fifty-two percent of the respondents rate their data mining efforts as successful or very successful.
Data mining strategies
Modern data mining tools have automated and simplified the process considerably, but a data mining initiative will not run itself. Anyone embarking on such a project will increase his or her chances of success by asking and answering some critical questions.
1. Has everyone bought into the project? To succeed, a data mining initiative must have the support of both the decision-makers (the line-of-business or public sector executives) and the information technology organization. In most successful data mining projects, the decision-maker is both the champion for and the leader of the project, said Kore's Nance.
"Data mining has to be embraced at the highest level," he added. "Your organization has to make the budgetary commitment to sustain the project."
2. Does your staff have the specific experience and skill set you need? The fundamentals of data mining have been around for a while, but modern data mining is new stuff. Chances are, your organization will not have many people with much data mining experience. And data warehousing experience does not count.
"If you don't have the people on staff, you'll want to consider outsourcing to get the data mining expertise you need," said Nance.
3. Does your organization have the technology infrastructure needed to support the kind of data mining you have in mind? Enterprise-wide applications generally range in size from 10 gigabytes to more than 11 terabytes. Despite the range of application sizes, data mining tools are serious, big computing applications, said IBM's Jones.
"Data mining tools are not desktop software," he explained. "This process takes parallel processing-style Unix servers, clusters of NT servers or mainframes. This is big compute stuff; you need the right hardware and all the attendant skill sets."
A critical factor here, Jones noted, is the size of the database: The more data you want to process and maintain, the more powerful your system needs to be. Another factor is query complexity. Complex and numerous queries will also up the system requirements.
Relational database storage and management technology is adequate for many data mining applications of less than 50GB, Jones explained. But the infrastructure needs to be enhanced significantly to support larger applications.
The bottom line is that data mining involves enormous calculations, typically on enormous amounts of data, so high compute power is essential.
4. Do you have a data warehouse? If you want to mine it, you have to have data—structured data, that is. "Build a clean data warehouse with updated information," said Kore's Nance. "If you don't have a data warehouse, chances are you won't be able to consider a data mining solution."
5. Is your data clean? "By clean data, I mean data that you've looked at, and from which you've removed data that might have invalid values in certain fields," said IBM's Jones.
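Cleaning of the kind Jones describes, together with trimming each record down to the handful of fields that matter for the mining run, can be sketched in a few lines of Python. The field names, the sample sales records and the rule for what counts as an invalid value (a missing customer ID, a date outside an assumed 1990 to 2000 window, a negative amount) are all assumptions for illustration. Note that this sketch simply drops bad records, whereas some cleansing processes replace missing values instead.

```python
from datetime import date

KEEP_FIELDS = ("customer_id", "sale_date", "amount")  # hypothetical field names


def is_valid(record):
    """Reject records with missing or improbable values in the fields we mine."""
    sale_date = record.get("sale_date")
    amount = record.get("amount")
    if record.get("customer_id") is None or sale_date is None or amount is None:
        return False
    # Assumed rule: dates outside 1990-2000 or negative amounts are invalid.
    return date(1990, 1, 1) <= sale_date <= date(2000, 12, 31) and amount >= 0


def prepare(records):
    """Drop invalid records, then keep only the fields of interest."""
    return [{f: r[f] for f in KEEP_FIELDS} for r in records if is_valid(r)]


raw = [
    {"customer_id": 1, "sale_date": date(1999, 5, 3), "amount": 42.5,
     "clerk": "A7", "register": 3},
    {"customer_id": 2, "sale_date": date(1899, 1, 1), "amount": 10.0,
     "clerk": "B2", "register": 1},   # improbable date
    {"customer_id": None, "sale_date": date(1999, 6, 1), "amount": 5.0,
     "clerk": "A7", "register": 2},   # missing customer ID
    {"customer_id": 4, "sale_date": date(2000, 2, 14), "amount": 19.99,
     "clerk": "C1", "register": 3},
]

clean = prepare(raw)
print(len(clean), clean[0])
```

Two of the four sample records survive, each reduced to three fields. Keeping the validity rule in one small function makes it easy to tighten or loosen as the team learns more about the data.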
"You can do some data preparation. Maybe the sales records have 59 fields and only three or four of them contain data that you care about in the data mining. Go ahead and strip all that other stuff off and create new short records that you're going to mine that contain just what you want," he said.
6. Have you made the business case for your data mining initiative? "The process really begins from the business side," explained Kore's Nance. "You have to ask, 'What is the business case that supports the expense of making this stuff truly valuable?' I think that the business case for a mid-sized company is definitely there."
But he is not sure the case can be made for larger organizations.
"When you start getting into multiple connection layers and multiple terabytes of spread out data, that's when things begin to get expensive," he said. "Once you start managing massive amounts of information, you don't get the true value, because there is the expense of storing the information. It's kind of a catch-22. We've found that you tend not to get the value. You don't create a critical mass. In the Fortune 10 companies with 50 databases, I don't know that the business case is there."
Nance believes that the most successful and cost-effective data mining initiatives start small and focus on a critical organizational issue, such as retaining customers.
Jan Mrazek agrees that making your business case is essential, but he probably would not agree that data mining is cost-effective only for medium-sized organizations. Mrazek is senior manager of business intelligence solutions at the Bank of Montreal, and he spearheaded the bank's data mining initiative. His department started about four years ago with a staff of three and has since grown to 60. "Most of those people are responsible for the transformation of data and for building the underlying systems," he said. "A relatively small group of people is responsible for building the [data mining] models."
The bank had been collecting and analyzing customer-related data in disparate data warehouses for many years before it initiated the project. Mrazek and his team used IBM's Intelligent Miner product to provide insights into customer behaviors and to predict the likelihood that a customer might be interested in a new bank product.
"Data mining is one of the corporate weapons that can give you a competitive advantage," Mrazek noted. "But even if you don't get the advantage, if you don't do it, you get behind. It's like being at the movies in a crowded theater and someone stands up. The [person] behind has to stand up just to see the screen, and pretty soon, you'll have to stand up, too."
|Techniques of data mining are typically useful for exploiting so-called structured data—that is, data in databases. But industry analysts have estimated that 80 percent of corporate information exists as unstructured data. A wealth of potentially useful information resides in customer letters, Web pages, E-mail, online news services and other documents not found in databases.
Simple text is the most prevalent form of unstructured data. A technique called "text mining" has emerged to help organizations plumb the textual content of such documents in order to analyze and classify that content for the same purpose as data mining: the discovery of new information.
Text-mining tools can search stored documents, but they can also provide more data-mining-like features, such as:
- Feature extraction—finding the key single or multi-word concepts in a document or a collection of documents;
- Clustering—discovering predominant themes in a document collection; and
- Classification of documents.
— John K. Waters
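The feature extraction described in the text mining sidebar can be illustrated with a deliberately crude Python sketch: count the words across a collection of documents, discard common filler words, and report the most frequent survivors as candidate key terms. The customer letters and the stopword list are invented, the function name `key_terms` is hypothetical, and the sketch handles single words only; commercial text-mining tools go well beyond raw counts and also pick out multi-word concepts.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption; real lists are far longer).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "was",
             "my", "i", "for", "on", "it", "with", "your"}


def key_terms(documents, top_n=3):
    """Return the top_n most frequent non-stopword terms across documents."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_n)]


# Hypothetical customer letters
letters = [
    "My statement was late and the statement fees were wrong.",
    "The fees on my account are too high.",
    "I was charged fees for a late statement.",
]

print(key_terms(letters, top_n=2))
```

In this toy collection the words "statement" and "fees" dominate, which hints at what customers are writing in about. The same counts could feed the clustering and classification steps the sidebar lists.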
Data mining and CRM
The process of data warehousing goes hand in glove with data mining, noted IBM's Jones. Another perfect fit for data mining processes, he said, is customer relationship management (CRM).
"In the data mining world, you'll hear lots of talk about CRM," Jones said. "This is a broad term that means that you desperately want to know as much as you can, within the boundaries of privacy laws, about your customers. You want to market directly to them and attract them to products and services you have a high probability of succeeding with."
Companies are beginning to build data mining algorithms into the latest customer relationship management applications, customer tracking and performance monitoring tools to analyze customer behavior. "Data mining is there, but users never see it," said Phillip Russom, service director and analyst at the Hurwitz Group, Framingham, Mass. "Customer analysis is the killer app of data mining."
IBM's Jones emphasizes that implementation of a data mining strategy does not mean that you have to throw out "older" processes, such as OLAP and query.
"Actually," he said, "all three complement each other. You can do a data-mining run and discover something new, and then do some iterative querying or some OLAP analysis on what you've discovered. This way, you can refine the data you're mining, and then you can do it again. It's an iterative aspect of data mining that's very important: Run, refine, run, refine, until you get to the sorts of things you can take action on."
|Data mining books
|Interest in data mining has never been higher, and book publishers have taken notice during the past few years. A veritable library of books on the subject has been published recently. Here are some of the more popular titles:
Berry, Michael J.A., and Gordon Linoff. Data Mining Techniques: For Marketing, Sales, and Customer Support. New York: Wiley, 1997.
Berry, Michael J.A., and Gordon Linoff. Mastering Data Mining: The Art and Science of Customer Relationship Management. New York: Wiley Computer Pub., 2000.
Berson, Alex, Stephen Smith and Kurt Thearling. Building Data Mining Applications for CRM. New York: McGraw-Hill, 2000.
Fayyad, Usama M., et al., (Eds.). Advances in Knowledge Discovery and Data Mining. Cambridge, Mass.: MIT Press, 1996.
Groth, Robert. Data Mining: Building Competitive Advantage. Upper Saddle River, N.J.: Prentice Hall PTR, 2000.
Mattison, Rob, and Brigitte Kilger-Mattison (Eds.). Web Warehousing and Knowledge Management. New York: McGraw-Hill, 1999.
Meña, Jesus. Data Mining Your Website. Boston: Digital Press, 1999.
Pyle, Dorian. Data Preparation for Data Mining. San Francisco, Calif.: Morgan Kaufmann Publishers, 1999.
Westphal, Christopher, and Teresa Blaxton. Data Mining Solutions: Methods and Tools for Solving Real-World Problems. New York: Wiley, 1998.
Witten, Ian H., and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco, Calif.: Morgan Kaufmann, 2000.
— John K. Waters