
The basic preprocessing

   


The purpose of the preprocessing is to eliminate statistical properties of the data that are caused by data coding. Such properties often have an undesired effect on a reduction algorithm. The command prepro is a macro operation involving several suboperations to preprocess the data. It takes data as input and produces a new data frame as output. The suboperations are selected according to the flags as follows (a short Python sketch of all three suboperations is given after the list):

-e:
Equalization based on min/max ranges. Each value in the data field x is scaled according to the following formula:

\[ x' = \frac{x - x^{s}_{\min}}{x^{s}_{\max} - x^{s}_{\min}} \, (x^{t}_{\max} - x^{t}_{\min}) + x^{t}_{\min} \]

If <equdata> is not given, then $x^{s}_{\min}$ and $x^{s}_{\max}$ are defined according to the range of the source data (from the field x), and the target range $[x^{t}_{\min}, x^{t}_{\max}]$ is set to [0,1]. The scaling formula can then be simplified to the form:

\[ x' = \frac{x - x^{s}_{\min}}{x^{s}_{\max} - x^{s}_{\min}} \]

If <equdata> includes two fields, these fields are used as the source range (equdata 1 in the figure below): the values $x^{s}_{\min}$ (1st field) and $x^{s}_{\max}$ (2nd field) are read from <equdata>, and the index of the field corresponds to the index of the items in <equdata>. The fields in <equdata> should have lengths matching the number of fields in the data to be processed. If <equdata> includes four fields (see equdata 2 in the figure below), then all the scaling information is read from this data. The order of the fields in <equdata> is $x^{s}_{\min}$, $x^{s}_{\max}$, $x^{t}_{\min}$, $x^{t}_{\max}$.

[Figure: layout of the <equdata> frames for min/max equalization: equdata 1 with two fields (source range) and equdata 2 with four fields (source and target ranges)]

-edev:
1) Equalization based on the standard deviation and 2) centering of the data on its average. Values in the data fields (x) are scaled according to the following formula:

\[ x' = \frac{x - \mathrm{avg}^{s}}{\mathrm{dev}^{s}} \]

Similar to the min/max-based equalization, the scales can be computed directly from the source data or read from a file. In the first case, for each field, the average value and the standard deviation are computed before the field is scaled. In the second case, $\mathrm{avg}^{s}$ and $\mathrm{dev}^{s}$ can be given in <equdata> (see the figure below): the first field in this data should include $\mathrm{avg}^{s}$ and the second field $\mathrm{dev}^{s}$. It is also possible to specify the target averages and deviations through <equdata> (see equdata 2 in the figure below), in which case the data fields are scaled as follows:

\[ x' = \frac{x - \mathrm{avg}^{s}}{\mathrm{dev}^{s}} \, \mathrm{dev}^{t} + \mathrm{avg}^{t} \]

where $\mathrm{avg}^{s}$ and $\mathrm{dev}^{s}$ are the statistics of the source data, and $\mathrm{avg}^{t}$ and $\mathrm{dev}^{t}$ specify the target average and deviation of the data fields.

[Figure: layout of the <equdata> frames for deviation-based equalization: equdata 1 with the source avg and dev fields, equdata 2 including also the target avg and dev]

-n:
Normalization of data records. Each component of a data record (vector) is divided by the Euclidean length of the record, as follows:

\[ x'_{i} = \frac{x_{i}}{\sqrt{\sum_{j} x_{j}^{2}}} \]
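
The following sketch illustrates, with NumPy, what the three suboperations do to a data matrix whose columns correspond to the data fields. It is a minimal sketch, not the NDA implementation; the function names and the handling of the scaling statistics are illustrative assumptions.

# Minimal NumPy sketch of the prepro suboperations (not the NDA code;
# the names and the handling of the equdata statistics are assumptions).
import numpy as np

def equalize_minmax(x, src_min=None, src_max=None, tgt_min=0.0, tgt_max=1.0):
    # -e: scale each field from [src_min, src_max] to [tgt_min, tgt_max].
    # If the source range is not given, it is taken from the data itself.
    x = np.asarray(x, dtype=float)
    src_min = x.min(axis=0) if src_min is None else src_min
    src_max = x.max(axis=0) if src_max is None else src_max
    return (x - src_min) / (src_max - src_min) * (tgt_max - tgt_min) + tgt_min

def equalize_dev(x, src_avg=None, src_dev=None, tgt_avg=0.0, tgt_dev=1.0):
    # -edev: center each field on its average and scale by its deviation,
    # optionally mapping to a target average and deviation.
    x = np.asarray(x, dtype=float)
    src_avg = x.mean(axis=0) if src_avg is None else src_avg
    src_dev = x.std(axis=0) if src_dev is None else src_dev
    return (x - src_avg) / src_dev * tgt_dev + tgt_avg

def normalize_records(x):
    # -n: divide each data record (row) by its Euclidean length.
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Roughly what 'prepro -e -n' does: equalize to [0,1], then normalize records.
data = np.array([[1.0, 200.0], [2.0, 400.0], [4.0, 800.0]])
predata = normalize_records(equalize_minmax(data))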

Example (ex4.1): Typically, the command prepro is used together with the SOM operations somtr and somcl. This example demonstrates how data is preprocessed and a SOM is trained with the preprocessed data.

...
NDA> prepro -d boston -dout predata -e -n
NDA> somtr -d predata -sout som1 -l 4
...

Example: If a SOM is used as a classifier, it is important that new, unseen data records are preprocessed in the same way as the training data was. Thus, it is necessary to save the statistics of the data so that the same equalization can be repeated later. The following commands demonstrate this for the min/max-based equalization; the same procedure can also be applied to the deviation-based equalization, in which case the averages and deviations are computed instead of the minimum and maximum values.

# load the Boston data and take a sample for training a SOM
NDA> load boston.dat
NDA> selrec -d boston -dout sample1 -expr 'boston.rm > 5';
# compute min/max info, save it, and use it in preprocess
NDA> fldstat -d sample1 -dout mminfo -min -max
NDA> save mminfo -o mminfo.dat
NDA> prepro -d sample1 -dout predata -d2 mminfo -e -n
NDA> somtr -d predata -sout som1 -l 4
NDA> save som1 -o som1.som
...
# shutdown the NDA and start it again for matching
# the whole data to our sample
NDA> load boston.dat
NDA> load mminfo.dat
NDA> load som1.som
#
# preprocess the data and classify it by the SOM
#
NDA> prepro -d boston -dout predata -d2 mminfo -e -n
NDA> somcl -d predata -s som1 -cout cld
NDA> clstat -d boston -dout sta -c cld -hits -avg -min -max
# continue visualization for exploring the mapped data
....
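
For comparison, the same save-and-reuse pattern as a NumPy sketch (the file name and the layout of the stored statistics are illustrative assumptions, not NDA conventions): the per-field minimum and maximum are computed from the training sample, stored, and later applied unchanged to the unseen data.

# Sketch of the save-and-reuse pattern above (not the NDA implementation;
# the file name and the layout of mminfo are illustrative assumptions).
import numpy as np

train = np.array([[1.0, 200.0], [2.0, 400.0], [4.0, 800.0]])  # training sample

# Compute and store the per-field min/max (the role of 'fldstat ... -min -max').
mminfo = np.vstack([train.min(axis=0), train.max(axis=0)])
np.save("mminfo.npy", mminfo)

def preprocess(x, mminfo):
    # Equalize to [0,1] with the stored range, then normalize the records.
    lo, hi = mminfo
    x = (np.asarray(x, dtype=float) - lo) / (hi - lo)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Later session: reload the statistics and preprocess unseen data identically.
unseen = np.array([[3.0, 500.0]])
predata = preprocess(unseen, np.load("mminfo.npy"))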


