Data Mining

Authors: Mehmed Kantardzic
Available samples and also future cases may have values missing. Some of the data-mining methods accept missing values and satisfactorily process data to reach a final conclusion. Other methods require that all values be available. An obvious question is whether these missing values can be filled in during data preparation, prior to the application of the data-mining methods. The simplest solution for this problem is the reduction of the data set and the elimination of all samples with missing values. That is possible when large data sets are available, and missing values occur only in a small percentage of samples. If we do not drop the samples with missing values, then we have to find values for them. What are the practical solutions?
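    The reduction approach can be illustrated with a minimal sketch. It assumes that a missing value is represented by None and that each sample is a list of feature values; the names missing_fraction and drop_incomplete are hypothetical helpers introduced only for this illustration.

```python
# Missing values are represented by None (an assumption of this sketch).
samples = [
    [1.0, 2.0, 3.0],
    [4.0, None, 6.0],
    [7.0, 8.0, 9.0],
    [None, 1.0, 2.0],
]

def missing_fraction(rows):
    """Fraction of samples that contain at least one missing value."""
    incomplete = sum(1 for row in rows if any(v is None for v in row))
    return incomplete / len(rows)

def drop_incomplete(rows):
    """Keep only the samples in which every feature value is present."""
    return [row for row in rows if all(v is not None for v in row)]

# Check how much data would be lost before deciding to drop anything.
lost = missing_fraction(samples)
complete = drop_incomplete(samples)
```

    In this toy data set half of the samples would be discarded, which is exactly the situation in which reduction is a poor choice; the approach is reasonable only when the fraction returned by missing_fraction is small.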
    First, a data miner, together with the domain expert, can manually examine samples that have missing values and enter a reasonable, probable, or expected value based on domain experience. The method is straightforward for small numbers of missing values and relatively small data sets. But if there is no obvious or plausible value for each case, the miner is introducing noise into the data set by manually generating a value.
    The second approach gives an even simpler solution for eliminating missing values. It is based on a formal, often automatic replacement of missing values with some constants, such as:
    1. replace all missing values with a single global constant (a selection of a global constant is highly application dependent);
    2. replace a missing value with its feature mean; and
    3. replace a missing value with its feature mean for the given class (this approach is possible only for classification problems where samples are classified in advance).
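    The three replacement rules listed above can be sketched for a single feature as follows. This is an illustration under stated assumptions, not a definitive implementation: missing values are represented by None, the feature is a plain list, and the function names are hypothetical.

```python
from statistics import mean

def fill_with_constant(values, constant):
    """Rule 1: replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Rule 2: replace missing values with the mean of the observed values."""
    feature_mean = mean(v for v in values if v is not None)
    return [feature_mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Rule 3: replace missing values with the observed mean of the same class.

    Raises statistics.StatisticsError if a class has no observed value.
    """
    class_means = {}
    for label in set(labels):
        observed = [v for v, l in zip(values, labels)
                    if l == label and v is not None]
        class_means[label] = mean(observed)
    return [class_means[l] if v is None else v
            for v, l in zip(values, labels)]

# Hypothetical feature column with two missing values and known class labels.
feature = [1.0, None, 3.0, 10.0, None, 14.0]
classes = ["a", "a", "a", "b", "b", "b"]

by_constant = fill_with_constant(feature, 0.0)
by_mean = fill_with_mean(feature)
by_class_mean = fill_with_class_mean(feature, classes)
```

    With these data the feature mean is 7.0, while the class means are 2.0 for class "a" and 12.0 for class "b", so the third rule keeps the replaced values much closer to the other samples of their own class.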
    These simple solutions are tempting. Their main flaw is that the substituted value is not the correct value. By replacing the missing value with a constant or changing the values for a few different features, the data are biased. The replaced value or values will homogenize the cases with missing values into a uniform subset directed toward the class with the most missing values (an artificial class). If missing values are replaced with a single global constant for all features, an unknown value may be implicitly made into a positive factor that is not objectively justified.
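    One measurable symptom of this homogenization is that the spread of a feature shrinks after mean replacement, because every filled-in case receives the same value. The sketch below uses hypothetical data with None marking a missing value and compares the population variance before and after replacement.

```python
from statistics import mean, pvariance

# Hypothetical feature column; None marks a missing value.
feature = [2.0, None, 4.0, None, 6.0, 8.0]

observed = [v for v in feature if v is not None]
feature_mean = mean(observed)

# Every missing case receives the same substituted value.
imputed = [feature_mean if v is None else v for v in feature]

variance_before = pvariance(observed)  # spread of the values actually measured
variance_after = pvariance(imputed)    # spread after mean replacement
```

    The mean is unchanged by the replacement, but the variance drops, so any method that relies on the spread of the feature sees data that look more uniform than the measurements justify.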
    One possible interpretation of missing values is that they are “don’t care” values. In other words, we suppose that these values do not have any influence on the final data-mining results. In that case, a sample with the missing value may be extended to the set of artificial samples, where, for each new sample, the missing value is replaced with one of the possible feature values of a given domain. Although this interpretation may look more natural, the problem with this approach is the combinatorial explosion of artificial samples. For example, if one 3-D sample X is given as X = {1, ?, 3}, where the second feature’s value is missing, the process will generate five artificial samples for the feature domain [0, 1, 2, 3, 4]: X1 = {1, 0, 3}, X2 = {1, 1, 3}, X3 = {1, 2, 3}, X4 = {1, 3, 3}, and X5 = {1, 4, 3}.
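    The “don’t care” expansion can be sketched in a few lines. The sketch assumes that a missing value is represented by None and that the domain of every feature is known as a finite list; expand_dont_care is a hypothetical helper name.

```python
from itertools import product

def expand_dont_care(sample, domains):
    """Replace each missing value with every value from its feature domain.

    domains[i] lists the possible values of feature i; it is consulted
    only where sample[i] is None.
    """
    choices = [domains[i] if v is None else [v]
               for i, v in enumerate(sample)]
    return [list(combo) for combo in product(*choices)]

domain = [0, 1, 2, 3, 4]

# X = {1, ?, 3}: one missing value gives five artificial samples.
one_missing = expand_dont_care([1, None, 3], [domain] * 3)

# Two missing values in the same sample already give 5 * 5 = 25 samples.
two_missing = expand_dont_care([None, None, 3], [domain] * 3)
```

    The number of artificial samples is the product of the domain sizes of the missing features, which is the combinatorial explosion mentioned above.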

    Finally, the data miner can generate a predictive model to predict each of the missing values. For example, if three features A, B, and C are given for each sample, then based on samples that have all three values as a training set, the data miner can generate a model of correlation between features. Different techniques, such as regression, Bayesian formalism, clustering, or decision-tree induction, may be used depending on data types (all these techniques are explained later in Chapters 5, 6, and 7). Once you have a trained model, you can present a new sample that has a value missing and generate a “predictive” value. For example, if values for features A and B are given, the model generates the value for the feature C. If a missing value is highly correlated with the other known features, this process will generate the best value for that feature. Of course, if you can always predict a missing value with certainty, this means that the
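    The predictive approach can be sketched with ordinary least-squares regression, one of the techniques named above. The sketch is an assumption-laden illustration: it uses hypothetical data in which C depends linearly on A and B, represents a missing value by None, and fits the model only on complete samples. All function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Raises ZeroDivisionError if the matrix is singular.
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_linear_model(rows):
    """Least-squares fit of C = w0 + w1*A + w2*B via the normal equations."""
    design = [[1.0, a, b] for a, b, _ in rows]
    target = [c for _, _, c in rows]
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)]
           for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(design, target)) for i in range(3)]
    return solve_linear_system(xtx, xty)

def predict_c(weights, a, b):
    """Predicted value of feature C for known A and B."""
    return weights[0] + weights[1] * a + weights[2] * b

# Hypothetical samples (A, B, C); the last one has C missing.
samples = [
    (1.0, 2.0, 8.0),
    (2.0, 1.0, 7.0),
    (3.0, 4.0, 18.0),
    (4.0, 3.0, 17.0),
    (5.0, 5.0, 25.0),
    (2.0, 2.0, None),
]

training = [s for s in samples if None not in s]  # complete samples only
weights = fit_linear_model(training)
a, b, _ = samples[-1]
predicted = predict_c(weights, a, b)
```

    Because the hypothetical data follow C = 2A + 3B exactly, the predicted value for the incomplete sample is very close to 10; with real data the quality of the prediction depends on how strongly C is correlated with the known features.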
