### Mining for data tools

Data mining is hot, and companies want it. But problems arise when it comes time to select a data mining tool: often, companies do not know how to choose a product suited to their specific requirements. Data mining techniques and algorithms are complex; in fact, they probably represent some of the most complex technology available today. But their complexity should not intimidate users.

A recent market intelligence report from Software Productivity Group's Analyst Services division, Natick, Mass., shows strong trends in the data mining market. Over the last three years there has been explosive product growth, with well over 100 product offerings to date. And over the last two years, an SPG survey found, corporate data mining projects have increased an incredible 450%.

Still, most organizations do not understand what data mining is, how it works and how to differentiate between the product offerings currently available. To best utilize data mining's capabilities, managers must first understand the various technologies that underlie data mining tools and the types of problems each technology is best suited to solve.

Data mining technologies are based on a combination of mathematical and heuristic algorithms that borrow heavily from statistical analysis and artificial intelligence. While mathematical models and statistics are well-suited to analyzing many business activities, they perform poorly when modeling systems that exhibit non-linear or chaotic behavior (for example, weather forecasting or stock market predictions). Simplifying assumptions can be made that remove the non-linearities and better support mathematical modeling, but these simplifications all too often remove the very behavior one hoped to model. Heuristic algorithms are based not on math but rather on a set of analytical steps to be followed, where each successive step is not predetermined but is selected based on the results of the previous step. Heuristic techniques such as neural networks and genetic algorithms are very good at accurately modeling many complex, non-linear business problems.

The myriad data mining offerings on the market employ a wide variety of technologies, ranging from statistical and regression analysis (decision trees) to heuristic techniques such as neural networks, genetic algorithms, fuzzy logic, rules induction and case-based reasoning. However, most of the major data mining products available are based on decision trees, neural networks or genetic algorithms.

**The data mining market: product approaches**

DataMind from DataMind Corp. is one of the few products based on polymorphic hybrid techniques. The company calls its approach "agent technology," and it has characteristics of both neural nets and decision trees. Data mining and OLAP offer very complementary functionality, and most of the major OLAP vendors have data mining offerings.

**Trees for decisions**

Decision trees represent a specific class of machine learning algorithms, also known as recursive partitioning algorithms, which are used for classification and prediction. Typically based on statistical algorithms such as CHAID (chi-squared automatic interaction detection) and CART (classification and regression trees), decision trees try to describe specific populations of the dataset in terms of certain characteristics, or variables, that influence the data. For example, "market churn" is a major issue in the telecommunications business. It would be important for a telecommunications company to understand and predict which customers are most likely to remain loyal, so it can target its products, services and marketing to that desirable customer group. A decision tree might separate its customer data based on the likelihood that a customer will remain loyal, examining variables such as customer age, geographic location, income and average monthly bill.

In this example, the question that is being studied -- customer loyalty (or, conversely, customer churn) -- is known as the dependent variable. The variables that may affect the dependent variable (age, location, income and average bill) are known as independent variables. A sample dataset represents historic information about customer loyalty. In the sample data, 14 of 30 customers remained loyal, while 16 switched carriers. The algorithm determined that the most significant independent variable describing customer loyalty is income. By making the first split at $40,000 of income, the algorithm creates two smaller subsets of data with less entropy (or, conversely, greater purity).

The dataset's entropy is a measure of how heterogeneous the data population is. In our sample data, where 14 of 30 customers remained loyal while 16 did not, there is a nearly random distribution, and therefore a high entropy (or level of disorder). At each subsequent level of the tree, the algorithm will test each independent variable, and partition with the one that results in the lowest entropy. In a well-partitioned decision tree, the level of entropy should decrease as one reads down the tree. In our example, the final partitions have relatively low levels of entropy.
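The entropy measure described above can be sketched in a few lines of Python. The root distribution (14 of 30 loyal) is from the article's example, but the per-subset counts after the income split are invented for illustration; the article does not give them.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution, e.g. [loyal, switched]."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# The article's sample: 14 of 30 customers remained loyal, 16 switched.
root = entropy([14, 16])          # nearly random mix -> entropy close to 1 bit

# After splitting on income at $40,000, suppose the two subsets look like this
# (illustrative counts -- the article does not give the exact split):
high_income = entropy([11, 3])    # mostly loyal -> lower entropy
low_income = entropy([3, 13])     # mostly switched -> lower entropy

# A split is judged by the weighted average entropy of the subsets it creates;
# the algorithm partitions on whichever variable drives this number lowest.
after_split = (14 / 30) * high_income + (16 / 30) * low_income
```

A 50/50 population has the maximum entropy of 1 bit; a perfectly pure subset has an entropy of 0.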

This makes sense if one considers the interpretation of the decision tree. From the example, one could state that customers who earn more than $40,000 and are over 31 years old tend to be very loyal, whereas customers who earn less than $40,000 with average monthly bills of $50 or more usually switch carriers. This can be stated with some degree of confidence because the entropy of these final partitions is quite low. If entropy was high and the partition was heterogeneous, few conclusions could be drawn. This example also demonstrates how different independent variables can be used at a given level in the tree. For the customers earning more than $40,000, age was the next best descriptor. For the customers earning less than $40,000, average monthly bill was the next best descriptor.

While successive partitioning (known as growing the tree) will typically create finer and more accurate groupings with lower entropy, at some point the partitioning will be too complex to be readily interpretable. Building an effective decision tree requires a balance between accuracy (from more partitions) and explainability (limiting the number of partitions). The other risk of successive partitioning is that the tree will be "overfit" to the data. The goal of classification analysis is to abstract an understanding of causal characteristics from the data. If the dataset is too finely partitioned relative to the size of the dataset, the tree describes this particular dataset and its inherent noise, as opposed to the types of characteristics this dataset represents.

The best practice is to grow the tree all the way out, at the risk of overfitting. Then the tree is pruned back one node at a time. With each pruning, the overall entropy of the system is measured. The tree is pruned back until entropy increases significantly (significant purity is lost). In this way, the partitions that offer the most information are kept, while further partitions that increase the model's complexity without offering much more accuracy are dropped.

Decision trees are well-suited for classification and prediction problems where accuracy is not as important as explainability. They are very easy to understand and explain (assuming the tree is not overgrown), but are not as subtle (and accurate) at partitioning as neural networks. They scale well for larger datasets, and can tolerate noisy and anomalous data, but require large datasets in order to avoid overfitting.

**A neural way**

Neural network algorithms mimic neurological structure and function in an effort to model complex problems involving subtle data relationships. Basic animal neurology shows us that with as few as three layers of simple nerves, or neurons, very complex outputs can be derived from very simple input signals. In a neural network, each node plays the part of a neuron. Nodes can send or receive signals based on the input they each receive and their relationship with the nodes around them.

Model complexity quickly grows as more layers of nodes are introduced into the neural model. In the trivial case, one input node connects to one output node. This would represent a simple causal effect (for example if this happens, then do this). Multiple input nodes can affect multiple possible outcomes, depending on the signals sent by each of the input nodes. This structure can model more complex behavior and, if the input signals were linear, would in essence perform a linear regression. A hidden layer separates input and output layers in a typical neural net. The hidden layer offers a much richer integration and interpretation of the original input signals and allows the model to represent very complex non-linear behavior quite accurately.

In a neural net, each node in a given layer can signal nodes in the successive layer through connections that are referred to as weights. Weights indicate how strong the signal is between the two connected nodes. Nodes perform two functions when incoming signals are received. First, all the incoming weighted signals are summed -- note that signals may be negative (inhibitory) as well as positive (excitatory). Then a transfer function (or activation function) is used to determine how strong the output signal should be, based on the inputs it has received. If, for instance, the transfer function is linear, then the stronger the total input, the stronger the output signal. If instead the transfer function is bell-shaped, then the strongest output signals will occur for summed inputs in the middle of the range of possible inputs.
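A single node's two-step computation -- sum the weighted signals, then apply a transfer function -- might be sketched as follows. The sigmoid is included alongside the article's linear and bell-shaped examples because it is the most common choice in practice; the specific function names are this sketch's own.

```python
import math

def node_output(inputs, weights, transfer="linear"):
    """One node's computation: sum the weighted signals, then apply a transfer.

    Negative weights act as inhibitory connections, positive as excitatory.
    """
    total = sum(x * w for x, w in zip(inputs, weights))
    if transfer == "linear":
        return total                            # stronger input -> stronger output
    if transfer == "sigmoid":
        return 1.0 / (1.0 + math.exp(-total))   # squashes output into (0, 1)
    if transfer == "bell":
        return math.exp(-total * total)         # peaks when summed input is near 0
    raise ValueError(f"unknown transfer: {transfer}")
```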

Back propagation algorithms are the most frequently used neural net technique for supervised learning problems (where an initial dataset, the training data, is used to train the model before it makes predictions against the test data). In a back propagation model, average weights are first assigned to each nodal connection. Then the training inputs are run through the model, and the net creates its output response. Because this is a supervised learning technique, the output response is then compared to the correct result for the training data, and an error term is calculated (in other words, the difference between the output response and the correct result). The net then analyzes each node's contribution to the error term, and adjusts the node's weights accordingly (the backward propagation). This process is repeated until the error term is satisfactorily small and the net is ready to predict based on the test dataset.

Neural nets can be overtrained on a particular set of training data, much as decision trees can encounter overfitting. When overtrained, the model has essentially memorized the actual data, rather than learned the patterns in the data. While an overtrained net can achieve high accuracy on the training data it "knows," accuracy tends to be very poor on other datasets. One way to measure whether the net is overtrained is to reserve a small subset of training data until the net is trained. Then the reserved training data is run through the net, and the output results are compared with actual results for that reserved dataset. If the error term for this test is small, then the net has not overtrained and can be applied to the test data with confidence. This method of testing does require sufficient data for a representative training sample as well as a reserved training sample. Generally, overtraining is the result of too small a training dataset or too many iterations of the training process.
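The reserved-sample setup is simple to sketch. The 20% reserve fraction and the fixed shuffle seed below are illustrative choices, not prescriptions from the article; the point is only that the reserved rows are set aside before training begins.

```python
import random

def reserve_split(rows, fraction=0.2, seed=7):
    """Shuffle the data and set aside a fraction before training begins.

    Returns (training set, reserved set). The reserved rows are never shown
    to the net during training; they exist only to check for overtraining.
    """
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[cut:], shuffled[:cut]
```

If the net's error on the reserved rows is much worse than its error on the training rows, the net has memorized rather than learned.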

Neural nets can offer very high accuracy regardless of the complexity of the problem being studied. This, however, is offset by the fact that it is nearly impossible to understand how the results were reached. This is why neural nets are sometimes referred to as black box systems: they provide answers but, as if by magic, you cannot see how they were derived.

**Genetic algorithms**

Genetic algorithms, like neural networks, borrow from nature to solve problems -- in this case the techniques are borrowed from genetics rather than neurology. Genetic algorithms specialize in optimization problems -- that is, identifying the best, or optimal, solution to a particular problem. In business, these problems range from finding the best product configurations to match specific customer needs, to routing and delivery problems, to staffing and scheduling.

Optimization problems are trivial if the range of possible solutions is small; in these cases an exhaustive search of all the potential solutions is sufficient. However, as the number of problem variables grows, the range of potential solutions grows exponentially, and a more efficient technique is required. For instance, to determine the shortest delivery route between three cities, simple inspection of the six possible routes would suffice. But to consider routes among 15 different cities, an exhaustive search would have to evaluate 15!, or 1,307,674,368,000, different solutions.
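The exhaustive approach is easy to demonstrate for the three-city case; the city names and distances below are invented for illustration.

```python
import math
from itertools import permutations

def shortest_route(cities, dist):
    """Exhaustively score every ordering of the cities -- feasible only for small n."""
    best = None
    for route in permutations(cities):
        length = sum(dist[a][b] for a, b in zip(route, route[1:]))
        if best is None or length < best[1]:
            best = (route, length)
    return best

# Three cities: only 3! = 6 orderings to inspect.
dist = {"A": {"B": 10, "C": 25}, "B": {"A": 10, "C": 12}, "C": {"A": 25, "B": 12}}
route, length = shortest_route("ABC", dist)

# Fifteen cities would require 15! evaluations -- far too many to enumerate.
n_routes = math.factorial(15)     # 1,307,674,368,000
```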

The basic elements of genetic algorithms are chromosomes, which are made up of individual genes. Each chromosome represents one potential solution to a given problem; the genes each represent a variable in the solution. For example, suppose a Web-based realty service wants to automatically match a home buyer's criteria against the realtor's database of available houses on the market and recommend the best three houses to the home buyer. Let's assume in this example that the criteria include the number of bedrooms, whether or not the house has a fireplace and a garage, and the acreage of the lot. Each chromosome would include a gene representing the number of bedrooms, a gene representing the existence of a fireplace, a gene for the garage and a gene for acreage.

Using a chromosome structure to encode possible solutions, the genetic algorithm generates an initial population of solutions. This is the starting place for the algorithm. The chromosomes are then evaluated by a fitness function that quantifies how good a solution to the specific problem each chromosome represents. In our example, if a potential home buyer wanted more than two bedrooms, a fireplace and a garage, and at least two acres of land, then a chromosome that matched on bedrooms, fireplace and acreage but lacked a garage would still be a fairly good solution, depending on how important the garage was to the buyer. The fitness function knows how to weigh the values of each gene to quantify the goodness of the solution.

One of three things can then happen to each chromosome. The algorithm first must decide whether or not the chromosome will be selected against (in the truest Darwinian sense). The best chromosomes will survive to the next generation (be kept), and the poorest chromosomes will die (be discarded). For those surviving chromosomes, there is then the chance of crossover or mutation. Crossover mixes the genes from two chromosomes resulting in two new chromosomes. Mutation randomly changes the value of one of the genes in a chromosome. As in nature, crossover and mutation enrich the gene pool. In effect, selection ensures the algorithm is working with the best solutions to date. Crossover and mutation explore slight variations on good solutions to try and find better ones. The results of selection, crossover, and mutation create the next generation of chromosomes, and the process is repeated.
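One full generation cycle -- selection, crossover, mutation -- can be sketched for the house-matching example. The gene layout, fitness weights, population size and mutation rate below are all this sketch's own assumptions; the article specifies none of them.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Gene layout (an assumption): [bedrooms, fireplace (0/1), garage (0/1), acres]
def random_chromosome():
    return [rng.randint(1, 5), rng.randint(0, 1), rng.randint(0, 1), rng.randint(0, 4)]

def fitness(chromosome):
    """Score a house against the buyer's criteria (weights are invented)."""
    bedrooms, fireplace, garage, acres = chromosome
    score = 3 if bedrooms > 2 else 0
    score += 2 * fireplace
    score += 1 * garage              # the garage matters least to this buyer
    score += 3 if acres >= 2 else 0
    return score                     # a perfect match scores 9

def next_generation(pop, keep=10, p_mutate=0.2):
    # Selection: the fittest survive, the poorest die.
    survivors = sorted(pop, key=fitness, reverse=True)[:keep]
    children = []
    while len(children) < len(pop) - keep:
        a, b = rng.sample(survivors, 2)
        cut = rng.randint(1, 3)
        child = a[:cut] + b[cut:]                  # crossover: mix two parents' genes
        if rng.random() < p_mutate:                # mutation: randomly change one gene
            i = rng.randrange(4)
            child[i] = random_chromosome()[i]
        children.append(child)
    return survivors + children

population = [random_chromosome() for _ in range(20)]
for _ in range(15):
    population = next_generation(population)
best = max(population, key=fitness)
```

Because the best survivors are always kept, the top fitness in the population never decreases from one generation to the next.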

Genetic algorithms are very effective at finding good solutions quickly. While they may not always arrive at the single best answer, they usually offer several "near-best" answers to choose from. The solutions themselves are easy to interpret, but because the algorithmic process is essentially trial and error with refinement, the path by which a solution was reached is not very explainable.

Another appealing characteristic of genetic algorithms is that, as long as the fitness function can recognize a good solution when it sees one, the model doesn't have to understand anything about how to solve the particular problem. Given a representative initial population, the model simply runs and generates the optimal, or near-optimal, solutions.

**Hybrid techniques**

Each data mining technique has its own characteristic strengths and weaknesses, and while appropriate for solving some problems, may be totally unsuited to others. Artificial intelligence research is developing a new class of problem-solving techniques, known as polymorphic hybrid techniques, or intelligent hybrid systems. These hybrid algorithms borrow from other techniques under different modeling conditions, combining techniques such as genetic algorithms and neural nets, or fuzzy logic and expert systems. Because combinations of techniques can vary, it is difficult to describe this class of techniques as a whole. When examining the behavior of a specific polymorphic hybrid algorithm, the characteristics of the embedded techniques that the hybrid is based on should be evident.

**Is a Ph.D. necessary?**

Informed product selection is the only way to ensure that a particular product is well-suited to the problem to be solved. Unfortunately, the marketing and sales efforts of most data mining vendors gloss over the technical details. Even worse, some go so far as to claim that the underlying technology is unimportant because, even if the technique a product is based on is not the most appropriate for a customer's problem, the customer will still achieve some benefit from using it!

Correct product selection requires two things. First, the business problem must be clearly identified and understood. Second, the technique a product is based on must be understood, and it must be decided whether that technique is appropriate for the business problem. A Ph.D. is not a prerequisite for an informed selection of a data mining product. But a general knowledge of data mining techniques and their appropriate application, together with a familiarity with the terms and jargon, can demystify the process and greatly improve the chances of an appropriate purchase.

**Data mining is hot, companies want it. But problems arise when it comes
time to select a data mining tool. Often, companies do not know how to select a product suitable to their specific
requirements. Data mining techniques and algorithms are complex. In fact they probably represent some of the most
complex technology available today. But their complexity should not intimidate users.
**A recent market intelligence report from Software Productivity Group's Analyst Services division, Natick, Mass.,
shows strong trends in the data mining market. Over the last three years there's been explosive product growth,
with well over 100 product offerings to date. Over the last two years, an SPG survey uncovered, corporate data
mining projects have increased an incredible 450%.

Still, most organizations do not understand what data mining is, how it works and how to differentiate between the product offerings currently available. To best utilize data mining's capabilities, managers must first understand the various technologies that underlie data mining tools and the types of problems each technology is best suited to solve.

Data mining technologies are based on a combination of mathematical and heuristic algorithms that borrow heavily from statistical analysis and artificial intelligence. While mathematical models and statistics are well-suited to analyze many business activities, they perform poorly when modeling systems that exhibit non-linear or chaotic behavior (for example, weather forecasting or stock market predictions). Simplifying assumptions can be made that remove the non-linearities and better support mathematical modeling, but these simplified assumptions all too often remove the very behavior hoped to be modeled. Heuristic algorithms are not based on math, but rather a set of analytical steps to be followed, where each successive step is not predetermined, but is selected based on the results from the previous step. Heuristic techniques such as neural networks and genetic algorithms are very good at accurately modeling many complex, non-linear business problems.

The myriad of data mining market offerings employ a wide variety of technologies ranging from statistical and regression analysis (decision trees) to heuristic techniques such as neural networks, genetic algorithms, fuzzy logic, rules induction and case-based reasoning. However, most of the major data mining products available are based on decision tree techniques, neural networks or genetic algorithms.

The Data Mining Market -- ## Product Approaches
DataMind from DataMind Corp. is one of the few products based on polymorphic hybrid techniques. They call their approach "agent technology," and it has characteristics of both neural nets and decision trees. Data mining and Olap offer very complimentary functionality, and most of the major Olap vendors have data mining offerings. |

**Trees for decisions**

Decision trees represent a specific class of machine learning algorithms, also known as recursive partitioning algorithms, which are used for classification and prediction. Typically based on statistical algorithms such as CHAID (chi-squared automatic interaction detection) and CART (classification and regression trees), decision trees try to describe specific populations of the dataset in terms of certain characteristics or variables that influence the data. For example, "market churn" is a major issue in the telecommunications business. It would be important for a telecommunications company to understand and predict what set of customers are more likely to remain loyal so it can target its product services and marketing to that desirable customer group. A decision tree might separate out its customer data based on the likelihood the customer will remain loyal. It might examine variables such as customer age, geographic location, income and average monthly bill.

In this example, the question that is being studied -- customer loyalty (or, conversely, customer churn) -- is known as the dependent variable. The variables that may affect the dependent variable (age, location, income and average bill) are known as independent variables. A sample dataset represents historic information about customer loyalty in the past. In the sample data, 14 out of 30 customers remained loyal, while 16 switched carriers. The algorithm determined that the most significant independent variable describing customer loyalty is income. By making the first split where income is less than or greater than $40,000, the algorithm creates two smaller subsets of data with less entropy (or, conversely, greater purity).

The dataset's entropy is a measure of how heterogeneous the data population is. In our sample data, where 14 of 30 customers remained loyal while 16 did not, there is a nearly random distribution, and therefore a high entropy (or level of disorder). At each subsequent level of the tree, the algorithm will test each independent variable, and partition with the one that results in the lowest entropy. In a well-partitioned decision tree, the level of entropy should decrease as one reads down the tree. In our example, the final partitions have relatively low levels of entropy.

This makes sense if one considers the interpretation of the decision tree. From the example, one could state that customers who earn more than $40,000 and are over 31 years old tend to be very loyal, whereas customers who earn less than $40,000 with average monthly bills of $50 or more usually switch carriers. This can be stated with some degree of confidence because the entropy of these final partitions is quite low. If entropy was high and the partition was heterogeneous, few conclusions could be drawn. This example also demonstrates how different independent variables can be used at a given level in the tree. For the customers earning more than $40,000, age was the next best descriptor. For the customers earning less than $40,000, average monthly bill was the next best descriptor.

While successive partitioning (known as growing the tree) will typically create finer and more accurate groupings with lower entropy, at some point the partitioning will be too complex to be readily interpretable. Building an effective decision tree requires a balance between accuracy (from more partitions) and explainability (limiting the number of partitions). The other risk of successive partitioning is that the tree will be "overfit" to the data. The goal of classification analysis is to abstract an understanding of causal characteristics from the data. If the dataset is too finely partitioned relative to the size of the dataset, the tree describes this particular dataset and its inherent noise, as opposed to the types of characteristics this dataset represents.

The best practice is to grow the tree all the way out, at the risk of overfitting. Then the tree is pruned back one node at a time. With each pruning, the overall entropy of the system is measured. The tree is pruned back until entropy increases significantly (significant purity is lost). In this way, the partitions that offer the most information are kept, while further partitions that increase the model's complexity without offering much more accuracy are dropped.

Decision trees are well-suited for classification and prediction problems where accuracy is not as important as explainability. They are very easy to understand and explain (assuming the tree is not overgrown), but are not as subtle (and accurate) at partitioning as neural networks. They scale well for larger datasets, and can tolerate noisy and anomalous data, but require large datasets in order to avoid overfitting.

**A neural way**

Neural network algorithms mimic neurological structure and function in an effort to model complex problems involving subtle data relationships. Basic animal neurology shows us that with as few as three layers of simple nerves, or neurons, very complex outputs can be derived from very simple input signals. In a neural network, each node plays the part of a neuron. Nodes can send or receive signals based on the input they each receive and their relationship with the nodes around them.

Model complexity quickly grows as more layers of nodes are introduced into the neural model. In the trivial case, one input node connects to one output node. This would represent a simple causal effect (for example if this happens, then do this). Multiple input nodes can affect multiple possible outcomes, depending on the signals sent by each of the input nodes. This structure can model more complex behavior and, if the input signals were linear, would in essence perform a linear regression. A hidden layer separates input and output layers in a typical neural net. The hidden layer offers a much richer integration and interpretation of the original input signals and allows the model to represent very complex non-linear behavior quite accurately.

In a neural net, each node in a given layer can signal nodes in the successive layer through connections that are referred to as weights. Weights indicate how strong the signal is between the two connected nodes. Nodes perform two functions when incoming signals are received. First, all the incoming weighted signals are summed -- note that signals may be negative (inhibitory) as well as positive (excitatory). Then a transfer function (or activation function) is used to determine how strong the output signal should be; based on the inputs it has received. If, for instance, the transfer function is linear, then the stronger the total input, the stronger the output signal. If instead, the transfer function is bell-shaped, then the strongest output signals will occur for summed inputs in the middle of the range of possible inputs.

Back propagation algorithms are the most frequently used neural net technique for supervised learning problems (where an initial dataset, the training data, is used to train the model before it makes predictions against the test data). In a back propagation model, average weights are first assigned to each nodal connection. Then the training inputs are run through the model, and the net creates its output response. Because this is a supervised learning technique, the output response is then compared to the correct result for the training data, and an error term is calculated (in other words, the difference between the output response and the correct result). The net then analyzes each node's contribution to the error term, and adjusts the node's weights accordingly (the backward propagation). This process is repeated until the error term is satisfactorily small and the net is ready to predict based on the test dataset.

Neural nets can be overtrained on a particular set of training data, much as decision trees can encounter overfitting. When overtrained, the model has essentially memorized the actual data, rather than learned the patterns in the data. While an overtrained net can achieve high accuracy on the training data it "knows," accuracy tends to be very poor on other datasets. One way to measure whether the net is overtrained, is to reserve a small subset of training data until the net is trained. Then the reserved training data is run through the net, and the output results are compared with actual results for that reserved dataset. If the error term for this test is small, then the net has not overtrained and can be applied to the test data with confidence. This method of testing does require sufficient data for a representative training sample, as well as a reserved training sample. Generally overtraining is a result of too small a training dataset, and too many iterations of the training process.

Neural nets can offer very high accuracy regardless of the complexity of the problem being studied. This however, is offset by the fact that it is nearly impossible to understand how the results were reached.

This is why neural nets are sometimes referred to as black box systems. They provide answers, but as if like magic, you cannot see how it was done.

**Genetic algorithms**

Genetic algorithms, like neural networks, borrow from nature to solve problems -- in this case the techniques are borrowed from genetics, as opposed to neurology. Genetic algorithms specialize in solving optimization problems, for example helping to identify the best, or optimal solution to a particular problem. In business, these problems range from finding the best product configurations to map to specific customer needs, to routing and delivery problems, to staffing and scheduling management.

Optimization problems are trivial, if the range of possible solutions is small. In these cases an exhaustive search of all the potential solutions would be sufficient. However, as the number of problem variables grows, the range of potential solutions grows exponentially, and a more efficient technique is required. For instance, if you wanted to determine the shortest delivery route between three cities, simple inspection of the six possible combinations of routes would be sufficient. However, if you had to consider routes between 15 different cities, an exhaustive search would have to evaluate 15! or 1,307,674,368,000 different solutions.

The basic elements of genetic algorithms are chromosomes, which are made up of individual genes. Each chromosome represents one potential solution to a given problem. The genes each represent a variable in the solution. For example, a Web-based realty service wants to automatically match a home buyer's criteria against the realtor's database of available houses on the market, and recommend the best three houses to the home buyer. Let's assume in this example that the criteria includes the number of bedrooms, whether or not the house has a fireplace and a garage, and the acreage of the lot. Each chromosome would include a gene that represented the number of bedrooms, a gene that represented the existence of a fireplace, a gene for the garage, and a gene for acreage.

Using a chromosome structure to encode possible solutions, the genetic algorithm generates an initial population of solutions. This is the starting place for the algorithm. The chromosomes are then evaluated by a fitness function that quantifies how good a solution each chromosome represents for the specific problem. In our example, if a potential home buyer wanted more than two bedrooms, a fireplace and a garage, and at least two acres of land, then a chromosome representing a three-bedroom house with a fireplace but no garage on two-plus acres would still be a good potential solution, depending on how important the garage was to the buyer. The fitness function knows how to weigh the values of each gene to quantify the goodness of the solution.
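A minimal sketch of such a fitness function, assuming the buyer's criteria from the example (the weights are illustrative assumptions, not values from any real product):

```python
# Hypothetical weights expressing how much the buyer cares about each gene.
WEIGHTS = {"bedrooms": 0.4, "fireplace": 0.2, "garage": 0.2, "acreage": 0.2}

def fitness(house):
    """Score a chromosome (here a dict of genes) between 0.0 and 1.0."""
    score = 0.0
    score += WEIGHTS["bedrooms"] * min(house["bedrooms"] / 3, 1.0)   # wants > 2 bedrooms
    score += WEIGHTS["fireplace"] * house["fireplace"]               # wants a fireplace
    score += WEIGHTS["garage"] * house["garage"]                     # wants a garage
    score += WEIGHTS["acreage"] * min(house["acreage"] / 2.0, 1.0)   # wants >= 2 acres
    return score

# A house with no garage can still score well if the other genes match.
house = {"bedrooms": 3, "fireplace": True, "garage": False, "acreage": 2.5}
print(round(fitness(house), 2))  # → 0.8
```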

One of three things can then happen to each chromosome. The algorithm first must decide whether or not the chromosome will be selected against (in the truest Darwinian sense). The best chromosomes will survive to the next generation (be kept), and the poorest chromosomes will die (be discarded). For the surviving chromosomes, there is then the chance of crossover or mutation. Crossover mixes the genes from two chromosomes, resulting in two new chromosomes. Mutation randomly changes the value of one of the genes in a chromosome. As in nature, crossover and mutation enrich the gene pool. In effect, selection ensures the algorithm is working with the best solutions to date, while crossover and mutation explore slight variations on good solutions to try to find better ones. The results of selection, crossover, and mutation create the next generation of chromosomes, and the process is repeated.
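The selection-crossover-mutation cycle can be sketched in a few lines of Python. This is a toy illustration on bit-string chromosomes, not any vendor's algorithm; real genetic algorithms typically use more sophisticated selection schemes such as tournament or roulette-wheel selection:

```python
import random

random.seed(0)  # reproducible run

def one_generation(population, fitness, crossover_rate=0.7, mutation_rate=0.05):
    """One GA step: selection keeps the fitter half, then crossover and
    mutation refill the population with variations on the survivors."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: len(population) // 2]       # selection: best are kept
    children = []
    while len(survivors) + len(children) < len(population):
        a, b = random.sample(survivors, 2)
        if random.random() < crossover_rate:         # crossover: mix two parents
            cut = random.randrange(1, len(a))
            child = a[:cut] + b[cut:]
        else:
            child = a[:]
        if random.random() < mutation_rate:          # mutation: flip one gene
            i = random.randrange(len(child))
            child = child[:i] + [1 - child[i]] + child[i + 1:]
        children.append(child)
    return survivors + children

# Toy problem: maximize the number of 1-bits in an 8-gene chromosome.
pop = [[random.randint(0, 1) for _ in range(8)] for _ in range(20)]
for _ in range(30):
    pop = one_generation(pop, fitness=sum)
print(max(sum(c) for c in pop))
```

Because selection always carries the best chromosome forward, the top fitness score never decreases from one generation to the next.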

Genetic algorithms are very fast and effective at finding good solutions quickly. While they may not always arrive at the single best answer, they usually offer several "near-best" answers to choose from. The solutions themselves are easy to interpret, but because the algorithmic process is essentially trial and error with refinement, the path by which a solution was reached is difficult to explain.

Another characteristic of genetic algorithms is that as long as the fitness function knows how to recognize a good solution when it sees one, the model doesn't have to understand anything about how to solve the particular problem. Given a representative initial population, the algorithm generates near-optimal solutions on its own.

**Hybrid techniques**

Each data mining technique has its own characteristic strengths and weaknesses, and while appropriate for solving some problems, may be totally unsuited for others. Artificial intelligence research is developing a new class of problem-solving techniques, known as polymorphic hybrid techniques, or intelligent hybrid systems. These hybrid algorithms borrow from other techniques under different modeling conditions, combining techniques such as genetic algorithms and neural nets, or fuzzy logic and expert systems. Because the combinations of techniques can vary, it is difficult to describe this class as a whole. When examining the behavior of a specific polymorphic hybrid algorithm, the characteristics of the embedded techniques the hybrid is based on should be evident.

**Is a Ph.D. necessary?**

Informed product selection is the only way to ensure that a particular product is well-suited to the problem to be solved. Unfortunately, the marketing and sales efforts of most data mining vendors gloss over the technical details. Even worse, some go so far as to claim that the underlying technology is not important -- that even if the technique a product is based on is not the most appropriate for a customer's problem, the customer will still achieve some benefit from using that product!

Correct product selection requires two things: First, the business problem must be clearly identified and understood. Second, the technique that a product is based on must be understood, and it must be decided whether or not that technique is appropriate for the business problem. A Ph.D. is not a prerequisite for an informed selection of a data mining product. But a general knowledge of data mining techniques and their appropriate application, together with a familiarity with the terms and jargon, can demystify the process and greatly improve the chances of an appropriate purchase.