Data Mining

Authors: Mehmed Kantardzic
places, rounding the values to the given precision could be a simple smoothing algorithm for a large number of samples, where each sample has its own real value. If the set of values for the given feature F is {0.93, 1.01, 1.001, 3.02, 2.99, 5.03, 5.01, 4.98}, then it is obvious that smoothed values will be F_smoothed = {1.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 5.0}. This simple transformation is performed without losing any quality in a data set, and, at the same time, it reduces the number of different real values for the feature to only three.
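    The rounding step above can be sketched in a few lines of Python. This is only an illustration: the function name smooth_by_rounding and its decimals parameter are assumptions, not part of the text. The smoothed values quoted above correspond to rounding to the nearest whole number, so decimals is set to 0 here.

```python
# Illustrative sketch of smoothing by rounding (names are hypothetical).
def smooth_by_rounding(values, decimals=0):
    """Round every value to the given number of decimal places."""
    return [round(v, decimals) for v in values]


F = [0.93, 1.01, 1.001, 3.02, 2.99, 5.03, 5.01, 4.98]
F_smoothed = smooth_by_rounding(F, decimals=0)
distinct = sorted(set(F_smoothed))

print(F_smoothed)  # [1.0, 1.0, 1.0, 3.0, 3.0, 5.0, 5.0, 5.0]
print(distinct)    # [1.0, 3.0, 5.0]
```

    With decimals=1 the first value would become 0.9 rather than 1.0, so the choice of precision determines how many distinct values survive.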
    Some of these smoothing algorithms are more complex, and they are explained in Section 3.2. Reducing the number of distinct values for a feature means reducing the dimensionality of the data space at the same time. Reduced values are particularly useful for logic-based methods of data mining, as will be explained in Chapter 6. Smoothers in this case can be used to discretize continuous features into a set of features with binary true–false values.
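    One possible way to turn a smoothed feature into binary true-false features is sketched below. The helper binarize and the generated feature names (for example "F_is_1.0") are hypothetical and chosen only for illustration; the text does not prescribe a particular encoding.

```python
# Illustrative sketch: one binary feature per distinct smoothed value.
def binarize(smoothed_values, prefix="F_is_"):
    """Map each sample to a dict of true/false features, one per level."""
    levels = sorted(set(smoothed_values))
    return [
        {f"{prefix}{level}": value == level for level in levels}
        for value in smoothed_values
    ]


rows = binarize([1.0, 3.0, 5.0, 1.0])
print(rows[0])  # {'F_is_1.0': True, 'F_is_3.0': False, 'F_is_5.0': False}
```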
    2.3.3 Differences and Ratios
    Even small changes to features can produce significant improvement in data-mining performance. The effects of relatively minor transformations of input or output features are particularly important in the specification of the data-mining goals. Two types of simple transformations, differences and ratios, could make improvements in goal specification, especially if they are applied to the output features.
    These transformations sometimes produce better results than the simple, initial goal of predicting a number. In one application, for example, the objective is to move the controls for a manufacturing process to an optimal setting. But instead of optimizing the absolute magnitude of the output s(t + 1), it is more effective to set the goal as a relative move from the current value to the final optimal value, that is, the difference s(t + 1) − s(t). The range of values for the relative moves is generally much smaller than the range of values for the absolute control setting. Therefore, for many data-mining methods, a smaller number of alternatives will improve the efficiency of the algorithm and will very often give better results.
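    A minimal sketch of the difference transformation follows. The series s is made-up illustrative data, and the helper names differences and span are assumptions; the point is only that the target becomes s(t + 1) − s(t) rather than s(t + 1).

```python
# Illustrative sketch: replace absolute targets with relative moves.
def differences(series):
    """Return s(t + 1) - s(t) for every consecutive pair in the series."""
    return [nxt - cur for cur, nxt in zip(series, series[1:])]


def span(values):
    """Width of the range covered by the values."""
    return max(values) - min(values)


s = [100.0, 102.0, 101.0, 105.0, 110.0, 112.0]  # hypothetical control settings
moves = differences(s)

print(moves)        # [2.0, -1.0, 4.0, 5.0, 2.0]
print(span(s))      # 12.0
print(span(moves))  # 6.0
```

    In this small example the relative moves cover half the range of the absolute settings; how large the reduction is in practice depends on the data.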
    Ratios are the second simple transformation of target or output features. Using s(t + 1)/s(t) as the output of a data-mining process, instead of the absolute value s(t + 1), means that the process models the level of increase or decrease in the values of a feature, which may also improve the performance of the entire mining process.
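    The ratio transformation can be sketched the same way. The helper ratios and the sample series are hypothetical; rejecting a zero denominator is a design choice made for this sketch, since the text does not say how such cases should be handled.

```python
import math


# Illustrative sketch: use s(t + 1) / s(t) as the target instead of s(t + 1).
def ratios(series):
    """Return s(t + 1) / s(t) for every consecutive pair in the series."""
    result = []
    for cur, nxt in zip(series, series[1:]):
        if cur == 0:
            # Assumption: a zero current value makes the ratio undefined.
            raise ValueError("ratio undefined when s(t) is zero")
        result.append(nxt / cur)
    return result


s = [100.0, 110.0, 99.0]  # hypothetical values of one feature over time
r = ratios(s)

# A ratio above 1 is an increase; a ratio below 1 is a decrease.
print(math.isclose(r[0], 1.1), math.isclose(r[1], 0.9))  # True True
```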
    Difference and ratio transformations are useful not only for output features but also for inputs. They can be used as changes in time for one feature or as a composition of different input features. For example, in many medical data sets, there are two features of a patient (height and weight) that are taken as input parameters for different diagnostic analyses. Many applications show that better diagnostic results are obtained when an initial transformation is performed using a new feature called the body mass index (BMI), which is the ratio of weight to the square of height. This composite feature is better than the initial parameters at describing some of the characteristics of the patient, such as whether or not the patient is overweight.
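    As an illustration of composing one input feature from two, the sketch below computes BMI using the standard definition, weight in kilograms divided by the square of height in meters. The function names and the overweight threshold of 25 are assumptions added here (the threshold follows the common convention and is not stated in the text).

```python
# Illustrative sketch: a composite input feature built from two raw features.
def bmi(weight_kg, height_m):
    """Body mass index: weight (kg) divided by height (m) squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def is_overweight(weight_kg, height_m, threshold=25.0):
    """Assumed convention: a BMI at or above the threshold counts as overweight."""
    return bmi(weight_kg, height_m) >= threshold


print(round(bmi(70.0, 1.75), 2))  # 22.86
print(is_overweight(70.0, 1.75))  # False
print(is_overweight(90.0, 1.75))  # True
```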
    Logical transformations can also be used to compose new features. For example, sometimes it is useful to generate a new feature that will determine the logical value of the relation A > B between existing features A and B. But there are no universally best data-transformation methods. The lesson to be learned is that a major role remains for human insight while defining the problem. Attention should be paid to composing features, because relatively simple transformations can sometimes be far more effective for the final performance than switching to some other techniques of data mining.
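    A logical feature such as A > B can be derived with a one-line comparison. The helper name greater_than_feature and the sample values below are hypothetical and serve only to illustrate the idea.

```python
# Illustrative sketch: a new true/false feature encoding the relation A > B.
def greater_than_feature(a_values, b_values):
    """Return one boolean per sample, true where A exceeds B."""
    if len(a_values) != len(b_values):
        raise ValueError("features A and B must have the same number of samples")
    return [a > b for a, b in zip(a_values, b_values)]


A = [3.0, 1.5, 4.0]
B = [2.0, 1.5, 5.0]
print(greater_than_feature(A, B))  # [True, False, False]
```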
    2.4 MISSING DATA
    For many real-world applications of data mining, even when there are huge amounts of data, the subset of cases with complete data may be relatively small.
