prepro | Macro command for preprocessing |
-d <data> | the name of the data |
-dout <dataout> | result data |
[-e] | min/max based equalization of fields |
[-edev] | variance-based equalization of fields |
[-n] | normalization of data records |
[-d2 <equdata>] | statistical values to be used for equalization |
[-md <value>] | missing value to be skipped |
The purpose of preprocessing is to eliminate statistical properties of the data that are caused by data coding. These properties often have an undesired effect on dimensionality reduction algorithms. prepro is a macro operation comprising several suboperations for preprocessing the data. It takes a data frame as input and produces a new data frame as output. The suboperations are selected with the flags as follows:
If <equdata> is not given, then min and max are defined according to the range of the source data (of the field x), and the target range is set to [0,1]. In this case, the scaling formula can be simplified to the form: x' = (x - min) / (max - min)
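As an illustration only (not part of NDA), the min/max-based equalization can be sketched in NumPy. The function name equalize_minmax and its parameter names are hypothetical:

```python
import numpy as np

def equalize_minmax(x, src_min=None, src_max=None, tgt_min=0.0, tgt_max=1.0):
    """Scale values linearly from [src_min, src_max] to [tgt_min, tgt_max].

    When src_min/src_max are omitted they are taken from the data itself,
    mirroring the case where <equdata> is not given and the target
    range defaults to [0, 1].
    """
    x = np.asarray(x, dtype=float)
    if src_min is None:
        src_min = x.min()
    if src_max is None:
        src_max = x.max()
    return (x - src_min) / (src_max - src_min) * (tgt_max - tgt_min) + tgt_min

field = np.array([2.0, 4.0, 6.0, 10.0])
print(equalize_minmax(field))  # scaled to [0, 1] field by field
```

With the default target range the formula reduces to (x - min) / (max - min), so the smallest value maps to 0 and the largest to 1.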
If <equdata> contains only two fields, then these fields are used as the source range (equdata 1 in the figure below), and the values min (1st field) and max (2nd field) are read from <equdata> (the index of the field in the source data corresponds to the index of the data record in <equdata>). The fields in <equdata> should have a length matching the number of fields in the data to be processed. If <equdata> contains four fields (see equdata 2 in the figure below), then all the scaling information is read from this data. In this case, the order of the fields in <equdata> is: source min, source max, target min, target max.
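A NumPy sketch (again illustrative, not NDA itself) of the four-field case, assuming the equdata columns are ordered source min, source max, target min, target max, with one equdata record per field of the data:

```python
import numpy as np

# data: 3 records x 2 fields
data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]])

# equdata: one record per field of `data`; the column order
# [src_min, src_max, tgt_min, tgt_max] is an assumption here
equdata = np.array([[1.0,   3.0,   0.0, 1.0],    # ranges for field 0
                    [100.0, 300.0, -1.0, 1.0]])  # ranges for field 1

smin, smax, tmin, tmax = equdata.T
scaled = (data - smin) / (smax - smin) * (tmax - tmin) + tmin
print(scaled)  # field 0 mapped to [0, 1], field 1 to [-1, 1]
```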
Similarly to the min/max-based equalization, the scales can be computed directly from the source data or read from a frame. In the first case, the average value and standard deviation are computed for each field before the field is scaled. In the second case, avg and dev can be given in <equdata> (see the figure below). The first field in this data should contain avg and the second field dev. It is also possible to specify the target average and deviation through <equdata> (see equdata 2 in the figure below). In that case, data fields are scaled as follows: x' = (x - avg_src) / dev_src * dev_tgt + avg_tgt
where avg_src and dev_src contain the statistics of the source data, and avg_tgt and dev_tgt specify the target average and deviation of the data fields.
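The deviation-based scaling can be sketched the same way (illustrative NumPy code; the function name equalize_dev is hypothetical):

```python
import numpy as np

def equalize_dev(x, src_avg=None, src_dev=None, tgt_avg=0.0, tgt_dev=1.0):
    """Shift and scale values so that avg_src maps to tgt_avg and
    dev_src maps to tgt_dev: x' = (x - avg_src) / dev_src * dev_tgt + avg_tgt.

    When src_avg/src_dev are omitted they are computed from the data,
    mirroring the case where no <equdata> frame is given.
    """
    x = np.asarray(x, dtype=float)
    if src_avg is None:
        src_avg = x.mean()
    if src_dev is None:
        src_dev = x.std()
    return (x - src_avg) / src_dev * tgt_dev + tgt_avg

field = np.array([2.0, 4.0, 6.0, 8.0])
scaled = equalize_dev(field)
print(scaled.mean(), scaled.std())  # ~0.0 and ~1.0
```

With the default targets (average 0, deviation 1) this is ordinary standardization of the field.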
Example (ex4.1): Typically, prepro is used with the SOM operations somtr and somcl. This example demonstrates how the data is preprocessed and a SOM is trained with the preprocessed data.
...
NDA> prepro -d boston -dout predata -e -n
NDA> somtr -d predata -sout som1 -l 4
...
Example: If a SOM is used as a classifier, it is important that new, unseen data records are preprocessed in the same way as the training data. Thus, it is necessary to save the statistics of the original data so that the same equalization can be performed later. The following commands demonstrate this for the min/max-based equalization, but the same procedure can also be applied to the deviation-based equalization. In that case, the averages and deviations should be computed instead of the minimum and maximum values.
# Load the Boston data and take a sample for training a SOM
NDA> load boston.dat
NDA> selrec -d boston -dout sample1 -expr 'boston.rm > 5';
# Compute min/max info, save it and use it in preprocessing
NDA> fldstat -d sample1 -dout mminfo -min -max
NDA> save mminfo -o mminfo.dat
NDA> prepro -d sample1 -dout predata -d2 mminfo -e -n
NDA> somtr -d predata -sout som1 -l 4
...
NDA> save som1 -o som1.som
...
# Shut down the NDA and start it again for matching
# the whole data to our sample
NDA> load boston.dat
NDA> load mminfo.dat
NDA> load som1.som
# Preprocess the data and classify it with the SOM
NDA> prepro -d boston -dout predata -d2 mminfo -e -n
NDA> somcl -d predata -s som1 -cout cld
NDA> clstat -d boston -dout sta -c cld -hits -avg -min -max
# Continue visualization for exploring the mapped data
...
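The essential idea of the session above, saving the training statistics and reapplying them to unseen data, can be sketched in NumPy (illustrative only; the in-memory buffer stands in for mminfo.dat):

```python
import io
import numpy as np

# Training sample: 3 records x 2 fields. Compute per-field min/max,
# analogous to `fldstat ... -min -max` followed by `save`.
train = np.array([[1.0, 10.0],
                  [3.0, 30.0],
                  [5.0, 50.0]])
mn, mx = train.min(axis=0), train.max(axis=0)

buf = io.StringIO()           # stands in for mminfo.dat
np.savetxt(buf, np.vstack([mn, mx]))

# Later session: reload the statistics and apply the SAME scaling
# to unseen records, instead of recomputing min/max from them.
buf.seek(0)
mn2, mx2 = np.loadtxt(buf)
new = np.array([[2.0, 20.0],
                [6.0, 60.0]])
scaled = (new - mn2) / (mx2 - mn2)
print(scaled)
```

Note that unseen records outside the training range map outside [0, 1], which is expected: the point is that identical inputs always map to identical scaled values across sessions.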