Data Mining

Free Data Mining by Mehmed Kantardzic

Book: Data Mining by Mehmed Kantardzic Read Free Book Online
Authors: Mehmed Kantardzic
applications depend on types of data, amounts of data, and general characteristics of the data-mining task.
    2.3.1 Normalizations
    Some data-mining methods, typically those that are based on distance computation between points in an n-dimensional space, may need normalized data for best results. The measured values can be scaled to a specific range, for example, [−1, 1], or [0, 1]. If the values are not normalized, the distance measures will overweight those features that have, on average, larger values. There are many ways of normalizing data. The following are three simple and effective normalization techniques.
    Decimal Scaling.
    Decimal scaling moves the decimal point but still preserves most of the original digit value. The typical scale maintains the values in a range of −1 to 1. The following equation describes decimal scaling, where v(i) is the value of the feature v for case i and v′(i) is a scaled value

    for the smallest k such that max (|v′(i)|) < 1.
    First, the maximum |v′(i)| is found in the data set, and then the decimal point is moved until the new, scaled, maximum absolute value is less than 1. The divisor is then applied to all other v(i). For example, if the largest value in the set is 455, and the smallest value is −834, then the maximum absolute value of the feature becomes .834, and the divisor for all v(i) is 1000 (k = 3).
    Min–Max Normalization.
    Suppose that the data for a feature v are in a range between 150 and 250. Then, the previous method of normalization will give all normalized data between .15 and .25, but it will accumulate the values on a small subinterval of the entire range. To obtain better distribution of values on a whole normalized interval, for example, [0,1], we can use the min–max formula

    where the minimum and the maximum values for the feature v are computed on a set automatically, or they are estimated by an expert in a given domain. Similar transformation may be used for the normalized interval [−1, 1]. The automatic computation of min and max values requires one additional search through the entire data set, but computationally, the procedure is very simple. On the other hand, expert estimations of min and max values may cause unintentional accumulation of normalized values.
    Standard Deviation Normalization.
    Normalization by standard deviation often works well with distance measures but transforms the data into a form unrecognizable from the original data. For a feature v, the mean value mean (v) and the standard deviation sd (v) are computed for the entire data set. Then, for a case i, the feature value is transformed using the equation

    For example, if the initial set of values of the attribute is v = {1, 2, 3}, then mean (v) = 2, sd (v) = 1, and the new set of normalized values is v* = {−1, 0, 1}.
    Why not treat normalization as an implicit part of a data-mining method? The simple answer is that normalizations are useful for several diverse methods of data mining. Also very important is that the normalization is not a one-time or a one-phase event. If a method requires normalized data, available data should be initially transformed and prepared for the selected data-mining technique, but an identical normalization must be applied in all other phases of data-mining and with all new and future data. Therefore, the normalization parameters must be saved along with a solution.
    2.3.2 Data Smoothing
    A numeric feature, y, may range over many distinct values, sometimes as many as the number of training cases. For many data-mining techniques, minor differences among these values are not significant and may degrade the performance of the method and the final results. They may be considered as random variations of the same underlying value. Hence, it can be advantageous sometimes to smooth the values of the variable.
    Many simple smoothers can be specified that average similar measured values. For example, if the values are real numbers with several decimal

Similar Books

A Baby in His Stocking

Laura marie Altom

The Other Hollywood

Legs McNeil, Jennifer Osborne, Peter Pavia

Children of the Source

Geoffrey Condit

The Broken God

David Zindell

Passionate Investigations

Elizabeth Lapthorne

Holy Enchilada

Henry Winkler