In-Depth
Mining for data tools
- By Julie Hahnke
- July 12, 2001
Data mining is hot, and companies want it. But problems arise when it comes time to select a data mining tool: companies often do not know how to select a product suited to their specific requirements. Data mining techniques and algorithms are complex. In fact, they probably represent some of the most complex technology available today. But their complexity should not intimidate users.
A recent market intelligence report from Software Productivity Group's Analyst Services division, Natick, Mass.,
shows strong trends in the data mining market. Over the last three years there's been explosive product growth,
with well over 100 product offerings to date. And over the last two years, according to an SPG survey, corporate data mining projects have increased an incredible 450%.
Still, most organizations do not understand what data mining is, how it works and how to differentiate between
the product offerings currently available. To best utilize data mining's capabilities, managers must first understand
the various technologies that underlie data mining tools and the types of problems each technology is best suited
to solve.
Data mining technologies are based on a combination of mathematical and heuristic algorithms that borrow heavily
from statistical analysis and artificial intelligence. While mathematical models and statistics are well-suited
to analyze many business activities, they perform poorly when modeling systems that exhibit non-linear or chaotic
behavior (for example, weather forecasting or stock market predictions). Simplifying assumptions can be made that
remove the non-linearities and better support mathematical modeling, but these simplifications all too often remove the very behavior one hoped to model. Heuristic algorithms are based not on math, but rather on a set of analytical
steps to be followed, where each successive step is not predetermined, but is selected based on the results from
the previous step. Heuristic techniques such as neural networks and genetic algorithms are very good at accurately
modeling many complex, non-linear business problems.
The myriad data mining offerings on the market employ a wide variety of technologies, ranging from statistical and regression
analysis (decision trees) to heuristic techniques such as neural networks, genetic algorithms, fuzzy logic, rules
induction and case-based reasoning. However, most of the major data mining products available are based on decision
tree techniques, neural networks or genetic algorithms.
The Data Mining Market --
Product Approaches
Most of the data mining products on the market employ
some combination of decision trees, neural networks and, to a lesser degree, genetic algorithms. However, there
are also products available based on techniques such as fuzzy logic, rules induction and case-based reasoning.
While some products are built around a single technique (for example, Angoss's KnowledgeSEEKER uses decision tree techniques), others offer several different techniques (for example, Thinking Machines' Darwin offers a suite of
products using neural nets, decision trees and genetic algorithms).
DataMind from DataMind Corp. is one of the few products based on polymorphic hybrid techniques. The company calls its approach "agent technology," and it has characteristics of both neural nets and decision trees.
Data mining and Olap offer very complementary functionality, and most of the major Olap vendors
have data mining offerings.
Trees for decisions
Decision trees represent a specific class of machine learning algorithms, also known as recursive partitioning
algorithms, which are used for classification and prediction. Typically based on statistical algorithms such as
CHAID (chi-squared automatic interaction detection) and CART (classification and regression trees), decision trees
try to describe specific populations of the dataset in terms of certain characteristics or variables that influence
the data. For example, "market churn" is a major issue in the telecommunications business. It would be
important for a telecommunications company to understand and predict which customers are most likely to remain loyal so it can target its products, services and marketing to that desirable customer group. A decision tree might
separate out its customer data based on the likelihood the customer will remain loyal. It might examine variables
such as customer age, geographic location, income and average monthly bill.
In this example, the question that is being studied -- customer loyalty (or, conversely, customer churn) -- is
known as the dependent variable. The variables that may affect the dependent variable (age, location, income and
average bill) are known as independent variables. A sample dataset represents historical information about customer loyalty. In the sample data, 14 out of 30 customers remained loyal, while 16 switched carriers. The
algorithm determined that the most significant independent variable describing customer loyalty is income. By making
the first split where income is less than or greater than $40,000, the algorithm creates two smaller subsets of
data with less entropy (or, conversely, greater purity).
The dataset's entropy is a measure of how heterogeneous the data population is. In our sample data, where 14 of
30 customers remained loyal while 16 did not, there is a nearly random distribution, and therefore a high entropy
(or level of disorder). At each subsequent level of the tree, the algorithm will test each independent variable,
and partition with the one that results in the lowest entropy. In a well-partitioned decision tree, the level of
entropy should decrease as one reads down the tree. In our example, the final partitions have relatively low levels
of entropy.
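A quick calculation may help make the entropy measure concrete. The sketch below computes the entropy of the article's 14-loyal/16-switched sample and of two subsets after an income split; the subset counts (11/3 and 3/13) are assumptions chosen for illustration, not figures from the article.

# Entropy sketch for the churn example. The 11/3 and 3/13 subset counts
# after the income split are assumed for illustration only.
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Whole dataset: 14 loyal vs. 16 switched -- nearly random, so high entropy.
print(entropy([14, 16]))   # roughly 1.0 bit

# Hypothetical subsets after splitting on income at $40,000: each is purer.
print(entropy([11, 3]))    # e.g. income > $40,000: about 0.75 bits
print(entropy([3, 13]))    # e.g. income < $40,000: about 0.70 bits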
This makes sense if one considers the interpretation of the decision tree. From the
example, one could state that customers who earn more than $40,000 and are over 31 years old tend to be very loyal,
whereas customers who earn less than $40,000 with average monthly bills of $50 or more usually switch carriers.
This can be stated with some degree of confidence because the entropy of these final partitions is quite low. If entropy were high and the partition heterogeneous, few conclusions could be drawn. This example also demonstrates
how different independent variables can be used at a given level in the tree. For the customers earning more than
$40,000, age was the next best descriptor. For the customers earning less than $40,000, average monthly bill was
the next best descriptor.
While successive partitioning (known as growing the tree) will typically create finer and more accurate groupings
with lower entropy, at some point the partitioning will be too complex to be readily interpretable. Building an
effective decision tree requires a balance between accuracy (from more partitions) and explainability (limiting
the number of partitions). The other risk of successive partitioning is that the tree will be "overfit"
to the data. The goal of classification analysis is to abstract an understanding of causal characteristics from
the data. If the data is too finely partitioned relative to the size of the dataset, the tree describes this
particular dataset and its inherent noise, as opposed to the types of characteristics this dataset represents.
The best practice is to grow the tree all the way out, at the risk of overfitting. Then the tree is pruned back
one node at a time. With each pruning, the overall entropy of the system is measured. The tree is pruned back until
entropy increases significantly (significant purity is lost). In this way, the partitions that offer the most information
are kept, while further partitions that increase the model's complexity without offering much more accuracy are
dropped.
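The grow-then-prune loop can be sketched in a few lines of Python. The Node structure, the weighted-leaf entropy measure and the 0.05 threshold below are illustrative assumptions, not the implementation of any particular product.

# A minimal grow-then-prune sketch. Node, tree_entropy() and the threshold
# are assumptions for illustration; real tools use their own tree structures.
from math import log2

class Node:
    def __init__(self, counts, children=None):
        self.counts = counts              # e.g. [loyal, switched] reaching this node
        self.children = children or []    # an empty list marks a leaf

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def tree_entropy(root):
    # Overall entropy: leaf entropies weighted by the records each leaf holds.
    leaves = [n for n in walk(root) if not n.children]
    total = sum(sum(leaf.counts) for leaf in leaves)
    return sum(sum(l.counts) / total * entropy(l.counts) for l in leaves)

def prune(root, max_increase=0.05):
    """Collapse bottom-most splits until doing so would lose too much purity."""
    changed = True
    while changed:
        changed = False
        bottom = [n for n in walk(root)
                  if n.children and all(not c.children for c in n.children)]
        for node in bottom:
            before = tree_entropy(root)
            saved, node.children = node.children, []   # tentatively collapse to a leaf
            if tree_entropy(root) - before > max_increase:
                node.children = saved                  # too much purity lost: undo
            else:
                changed = True                         # keep the simpler tree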
Decision trees are well-suited for classification and prediction problems where accuracy is not as important as
explainability. They are very easy to understand and explain (assuming the tree is not overgrown), but are not
as subtle (and accurate) at partitioning as neural networks. They scale well for larger datasets, and can tolerate
noisy and anomalous data, but require large datasets in order to avoid overfitting.
A neural way
Neural network algorithms mimic neurological structure and function in an effort to model complex problems involving
subtle data relationships. Basic animal neurology shows us that with as few as three layers of simple nerves, or
neurons, very complex outputs can be derived from very simple input signals. In a neural network, each node plays
the part of a neuron. Nodes can send or receive signals based on the input they each receive and their relationship
with the nodes around them.
Model complexity quickly grows as more layers of nodes are introduced into the neural model. In the trivial case,
one input node connects to one output node. This would represent a simple causal effect (for example, if this happens,
then do this). Multiple input nodes can affect multiple possible outcomes, depending on the signals sent by each
of the input nodes. This structure can model more complex behavior and, if the input signals were linear, would
in essence perform a linear regression. A hidden layer separates input and output layers in a typical neural net.
The hidden layer offers a much richer integration and interpretation of the original input signals and allows the
model to represent very complex non-linear behavior quite accurately.
In a neural net, each node in a given layer can signal nodes in the successive layer through connections that are
referred to as weights. Weights indicate how strong the signal is between the two connected nodes. Nodes perform
two functions when incoming signals are received. First, all the incoming weighted signals are summed -- note that
signals may be negative (inhibitory) as well as positive (excitatory). Then a transfer function (or activation
function) is used to determine how strong the output signal should be, based on the inputs it has received. If,
for instance, the transfer function is linear, then the stronger the total input, the stronger the output signal.
If instead, the transfer function is bell-shaped, then the strongest output signals will occur for summed inputs
in the middle of the range of possible inputs.
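The node computation just described fits in a few lines. The sketch below sums the weighted inputs and passes the total through a transfer function; the example weights and the Gaussian form used for the "bell-shaped" function are assumptions made for illustration.

# Forward computation of a single node: weighted sum, then transfer function.
# The example weights and the Gaussian bell shape are illustrative assumptions.
from math import exp

def weighted_sum(inputs, weights):
    # Signals may be inhibitory (negative) or excitatory (positive).
    return sum(x * w for x, w in zip(inputs, weights))

def linear(total):
    return total                        # stronger total input, stronger output

def bell(total, center=0.0, width=1.0):
    # Strongest output when the summed input lands near the middle of the range.
    return exp(-((total - center) / width) ** 2)

total = weighted_sum([0.8, -0.3, 0.5], [0.4, 0.9, -0.2])
print(total, linear(total), bell(total))   # -0.05, -0.05, ~0.998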
Back propagation algorithms are the most frequently used neural net technique for supervised learning problems
(where an initial dataset, the training data, is used to train the model before it makes predictions against the
test data). In a back propagation model, initial weights (typically small random values) are first assigned to each nodal connection. Then the
training inputs are run through the model, and the net creates its output response. Because this is a supervised
learning technique, the output response is then compared to the correct result for the training data, and an error
term is calculated (in other words, the difference between the output response and the correct result). The net
then analyzes each node's contribution to the error term, and adjusts the node's weights accordingly (the backward
propagation). This process is repeated until the error term is satisfactorily small and the net is ready to predict
based on the test dataset.
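A toy back propagation loop, sketched below, shows the cycle just described: forward pass, error term, backward weight adjustment. The XOR-style training data, the layer sizes and the learning rate are assumptions for illustration, not the algorithm of any particular product.

# Back propagation sketch for a tiny one-hidden-layer net (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # training inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # correct results

W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros((1, 4))  # input -> hidden
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros((1, 1))  # hidden -> output
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(20000):
    # Forward pass: run the training inputs through the net.
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)

    # Error term: difference between the net's response and the correct result.
    error = y - output

    # Backward propagation: apportion the error and adjust weights accordingly.
    d_out = error * output * (1 - output)
    d_hid = (d_out @ W2.T) * hidden * (1 - hidden)
    W2 += lr * hidden.T @ d_out
    b2 += lr * d_out.sum(axis=0, keepdims=True)
    W1 += lr * X.T @ d_hid
    b1 += lr * d_hid.sum(axis=0, keepdims=True)

print(np.round(output, 2))   # typically converges toward [[0], [1], [1], [0]]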
Neural nets can be overtrained on a particular set of training data, much as decision trees can encounter overfitting.
When overtrained, the model has essentially memorized the actual data, rather than learned the patterns in the
data. While an overtrained net can achieve high accuracy on the training data it "knows," accuracy tends
to be very poor on other datasets. One way to measure whether the net is overtrained is to reserve a small subset
of training data until the net is trained. Then the reserved training data is run through the net, and the output
results are compared with actual results for that reserved dataset. If the error term for this test is small, then
the net has not overtrained and can be applied to the test data with confidence. This method of testing does require
sufficient data for a representative training sample, as well as a reserved training sample. Generally overtraining
is a result of too small a training dataset, and too many iterations of the training process.
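The holdout check described above can be sketched as follows. The train_network() and error_on() callables are hypothetical placeholders standing in for whatever training and scoring routines a given tool provides.

# Holdout sketch: reserve part of the training data, train on the rest,
# then compare error on the reserved set. train_network() and error_on()
# are hypothetical placeholders for a tool's own training and scoring calls.
import random

def holdout_check(records, train_network, error_on, reserve_fraction=0.2):
    records = list(records)
    random.shuffle(records)
    cut = int(len(records) * reserve_fraction)
    reserved, training = records[:cut], records[cut:]

    net = train_network(training)
    train_error = error_on(net, training)
    reserved_error = error_on(net, reserved)

    # A reserved-set error close to the training error suggests the net has
    # learned general patterns; a much larger one suggests overtraining.
    return train_error, reserved_error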
Neural nets can offer very high accuracy regardless of the complexity of the problem being studied. This, however, is offset by the fact that it is nearly impossible to understand how the results were reached. This is why neural nets are sometimes referred to as black box systems: they provide answers but, as if by magic, you cannot see how it was done.
Genetic algorithms
Genetic algorithms, like neural networks, borrow from nature to solve problems -- in this case the techniques are
borrowed from genetics, as opposed to neurology. Genetic algorithms specialize in solving optimization problems, that is, in identifying the best, or optimal, solution to a particular problem. In business, these problems
range from finding the best product configurations to map to specific customer needs, to routing and delivery problems,
to staffing and scheduling management.
Optimization problems are trivial if the range of possible solutions is small. In these cases an exhaustive search
of all the potential solutions would be sufficient. However, as the number of problem variables grows, the range
of potential solutions grows exponentially, and a more efficient technique is required. For instance, if you wanted
to determine the shortest delivery route between three cities, simple inspection of the six possible combinations
of routes would be sufficient. However, if you had to consider routes between 15 different cities, an exhaustive
search would have to evaluate 15! or 1,307,674,368,000 different solutions.
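A quick calculation confirms how fast the number of ordered routes grows with the number of cities:

# The number of ordered routes through n cities grows as n factorial.
from math import factorial

for n in (3, 10, 15):
    print(n, factorial(n))
# 3 6
# 10 3628800
# 15 1307674368000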
The basic elements of genetic algorithms are chromosomes, which are made up of individual genes. Each chromosome
represents one potential solution to a given problem. The genes each represent a variable in the solution. For
example, a Web-based realty service wants to automatically match a home buyer's criteria against the realtor's
database of available houses on the market, and recommend the best three houses to the home buyer. Let's assume
in this example that the criteria include the number of bedrooms, whether or not the house has a fireplace and
a garage, and the acreage of the lot. Each chromosome would include a gene that represented the number of bedrooms,
a gene that represented the existence of a fireplace, a gene for the garage, and a gene for acreage.
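As a rough sketch, the realty chromosome might be encoded as a simple list of four genes; the gene order and value ranges below are assumptions chosen for illustration.

# Chromosome encoding sketch for the realty example (illustrative assumptions).
import random

# Gene order: [bedrooms, has_fireplace, has_garage, acreage]
def random_chromosome():
    return [
        random.randint(1, 6),                 # number of bedrooms
        random.choice([0, 1]),                # fireplace (0 = no, 1 = yes)
        random.choice([0, 1]),                # garage
        round(random.uniform(0.1, 5.0), 1),   # lot size in acres
    ]

# The initial population is simply a set of randomly generated chromosomes.
population = [random_chromosome() for _ in range(50)]
print(population[0])   # e.g. [3, 1, 0, 2.4]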
Using a chromosome structure to encode possible solutions, an initial population of
solutions is generated by the genetic algorithm. This is the starting place for the algorithm. The chromosomes
are then evaluated by a fitness function that quantifies how good a solution to the specific problem each chromosome
represents. In our example, if a potential home buyer wanted more than two bedrooms, a fireplace and a garage, and at least two acres of land, then a chromosome that met every criterion except, say, the garage would still be a good potential solution, depending on how important the garage was to the buyer. The fitness function knows how to weigh the values of each gene to quantify the goodness
of the solution.
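A fitness function for this example might look like the sketch below; the scoring rules and the half weight given to the garage are assumptions standing in for whatever priorities a real buyer would supply.

# Fitness sketch for the realty example (weights and rules are assumptions).
def fitness(chromosome, wants_bedrooms=3, wants_fireplace=True,
            wants_garage=True, wants_acres=2.0, garage_weight=0.5):
    bedrooms, fireplace, garage, acres = chromosome
    score = 0.0
    score += 1.0 if bedrooms >= wants_bedrooms else 0.0
    score += 1.0 if bool(fireplace) == wants_fireplace else 0.0
    score += garage_weight if bool(garage) == wants_garage else 0.0  # garage matters less
    score += 1.0 if acres >= wants_acres else 0.0
    return score

print(fitness([3, 1, 0, 2.5]))   # 3.0: meets every criterion except the garage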
One of three things can then happen to each chromosome. The algorithm first must decide whether or not the chromosome
will be selected against (in the truest Darwinian sense). The best chromosomes will survive to the next generation
(be kept), and the poorest chromosomes will die (be discarded). For those surviving chromosomes, there is then
the chance of crossover or mutation. Crossover mixes the genes from two chromosomes resulting in two new chromosomes.
Mutation randomly changes the value of one of the genes in a chromosome. As in nature, crossover and mutation enrich
the gene pool. In effect, selection ensures the algorithm is working with the best solutions to date. Crossover
and mutation explore slight variations on good solutions to try to find better ones. The results of selection,
crossover, and mutation create the next generation of chromosomes, and the process is repeated.
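One generation of selection, crossover and mutation can then be sketched as below, assuming the random_chromosome() and fitness() helpers from the earlier sketches; the survival fraction and the crossover and mutation rates are illustrative assumptions.

# One GA generation: selection, crossover, mutation (illustrative parameters).
# Assumes the random_chromosome() and fitness() sketches defined earlier.
import random

def next_generation(population, keep_fraction=0.5,
                    crossover_rate=0.6, mutation_rate=0.1):
    # Selection: the fittest chromosomes survive, the poorest are discarded.
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:max(2, int(len(ranked) * keep_fraction))]

    children = list(survivors)
    while len(children) < len(population):
        a, b = random.sample(survivors, 2)
        if random.random() < crossover_rate:
            # Crossover: splice the gene tails of two parents at a random cut.
            cut = random.randint(1, len(a) - 1)
            child = a[:cut] + b[cut:]
        else:
            child = list(a)
        if random.random() < mutation_rate:
            # Mutation: randomly change the value of one gene.
            i = random.randrange(len(child))
            child[i] = random_chromosome()[i]
        children.append(child)
    return children

# Repeat for a fixed number of generations or until fitness stops improving.
for _ in range(20):
    population = next_generation(population)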
Genetic algorithms are very effective at finding good solutions quickly. While they may not always arrive
at the single best answer, they usually offer several "near-best" answers to choose from. The solutions
are easy to interpret, but because the algorithmic process is essentially trial and error with refinement, the
path by which the solution was reached is not very explainable.
Another characteristic of genetic algorithms is that as long as the fitness function knows how to recognize a good
solution when it sees one, the model doesn't have to understand anything about how to solve a particular problem.
With a representative initial population, the model does the rest, generating optimal or near-optimal solutions.
Hybrid techniques
Each data mining technique has its own characteristic strengths and weaknesses, and while appropriate for solving
some problems, may be totally unsuited for other problems. Artificial intelligence research is developing a new
class of problem solving techniques, known as polymorphic hybrid techniques, or intelligent hybrid systems. These
hybrid algorithms borrow from different techniques under different modeling conditions, combining techniques such as
genetic algorithms and neural nets, or fuzzy logic and expert systems. Because combinations of techniques can vary,
it is difficult to describe this class of techniques as a whole. When examining the behavior of a specific polymorphic
hybrid algorithm, the characteristics of the embedded techniques that the hybrid is based on should be evident.
Is a Ph.D. necessary?
Informed product selection is the only way to ensure that a particular product is well-suited to the problem to
be solved. Unfortunately, the marketing and sales efforts of most of the data mining vendors gloss over the technical
details. Even worse, some go so far as to claim that the underlying technology is not important, because even if
the technique the product is based on is not the most appropriate for a customer's problem, they will still achieve
some benefit from using that product!
Correct product selection requires two things: First, the business problem must be clearly identified and understood.
Second, the technique that a product is based on must be understood, and it must be decided whether or not that
technique is appropriate for the business problem. A Ph.D. is not a prerequisite for an informed selection of a
data mining product. But a general knowledge of data mining techniques and their appropriate application, together with familiarity with the relevant terms and jargon, can demystify the process and greatly improve the chances of an appropriate
purchase.