
The basic preprocessing

   


The purpose of the preprocessing is to eliminate statistical properties of the data that are caused by data coding. Such properties often have an undesired effect on a reduction algorithm. The command prepro is a macro operation involving several suboperations to preprocess the data. It takes data as input and produces a new data frame as output. The suboperations are selected according to the flags as follows (a short Python sketch of all three suboperations is given after the list):

-e:
Equalization based on min/max ranges. Each value in the data field x is scaled according to the following formula:

\[ x' = \frac{x - x^{s}_{\min}}{x^{s}_{\max} - x^{s}_{\min}} \, (x^{t}_{\max} - x^{t}_{\min}) + x^{t}_{\min} \]

If <equdata> is not given, then $x^{s}_{\min}$ and $x^{s}_{\max}$ are defined according to the range of the source data (from the field x), and the target range $[x^{t}_{\min}, x^{t}_{\max}]$ is set to [0,1]. The scaling formula can then be simplified to the form:

\[ x' = \frac{x - x^{s}_{\min}}{x^{s}_{\max} - x^{s}_{\min}} \]

If <equdata> includes two fields, these fields are used as the source range (equdata 1 in the figure below): the values $x^{s}_{\min}$ (1st field) and $x^{s}_{\max}$ (2nd field) are read from <equdata>, and the index of the field corresponds to the index of the items in <equdata>. The fields in <equdata> should have lengths matching the number of fields in the data to be processed. If <equdata> includes four fields (see equdata 2 in the figure below), then all the scaling information is read from this data. The order of the fields in <equdata> is $x^{s}_{\min}$, $x^{s}_{\max}$, $x^{t}_{\min}$, $x^{t}_{\max}$.

[Figure: layout of the <equdata> frames for min/max equalization: equdata 1 with two fields (source range) and equdata 2 with four fields (source and target ranges)]

-edev:
1) Equalization based on the standard deviation and 2) centering of the data on its average. Values in the data fields (x) are scaled according to the following formula:

\[ x' = \frac{x - \mathrm{avg}^{s}}{\mathrm{dev}^{s}} \]

Similar to the min/max-based equalization, the scales can be computed directly from the source data or read from a file. In the first case, for each field, the average value and the standard deviation are computed before the field is scaled. In the second case, $\mathrm{avg}^{s}$ and $\mathrm{dev}^{s}$ can be given in <equdata> (see the figure below): the first field in this data should include $\mathrm{avg}^{s}$ and the second field $\mathrm{dev}^{s}$. It is also possible to specify the target averages and deviations through <equdata> (see equdata 2 in the figure below), in which case the data fields are scaled as follows:

\[ x' = \frac{x - \mathrm{avg}^{s}}{\mathrm{dev}^{s}} \, \mathrm{dev}^{t} + \mathrm{avg}^{t} \]

where $\mathrm{avg}^{s}$ and $\mathrm{dev}^{s}$ are the statistics of the source data, and $\mathrm{avg}^{t}$ and $\mathrm{dev}^{t}$ specify the target average and deviation of the data fields.

[Figure: layout of the <equdata> frames for deviation-based equalization: equdata 1 with the source avg and dev fields, equdata 2 including also the target avg and dev]

-n:
Normalization of data records. Each component of a data record (vector) is divided by the Euclidean length of the record, as follows:

\[ x'_{i} = \frac{x_{i}}{\sqrt{\sum_{j} x_{j}^{2}}} \]
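
The following sketch illustrates, with NumPy, what the three suboperations do to a data matrix whose columns correspond to the data fields. It is a minimal sketch, not the NDA implementation; the function names and the handling of the scaling statistics are illustrative assumptions.

# Minimal NumPy sketch of the prepro suboperations (not the NDA code;
# the names and the handling of the equdata statistics are assumptions).
import numpy as np

def equalize_minmax(x, src_min=None, src_max=None, tgt_min=0.0, tgt_max=1.0):
    # -e: scale each field from [src_min, src_max] to [tgt_min, tgt_max].
    # If the source range is not given, it is taken from the data itself.
    x = np.asarray(x, dtype=float)
    src_min = x.min(axis=0) if src_min is None else src_min
    src_max = x.max(axis=0) if src_max is None else src_max
    return (x - src_min) / (src_max - src_min) * (tgt_max - tgt_min) + tgt_min

def equalize_dev(x, src_avg=None, src_dev=None, tgt_avg=0.0, tgt_dev=1.0):
    # -edev: center each field on its average and scale by its deviation,
    # optionally mapping to a target average and deviation.
    x = np.asarray(x, dtype=float)
    src_avg = x.mean(axis=0) if src_avg is None else src_avg
    src_dev = x.std(axis=0) if src_dev is None else src_dev
    return (x - src_avg) / src_dev * tgt_dev + tgt_avg

def normalize_records(x):
    # -n: divide each data record (row) by its Euclidean length.
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Roughly what 'prepro -e -n' does: equalize to [0,1], then normalize records.
data = np.array([[1.0, 200.0], [2.0, 400.0], [4.0, 800.0]])
predata = normalize_records(equalize_minmax(data))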

Example (ex4.1): Typically, the command prepro is used together with the SOM operations somtr and somcl. This example demonstrates how data is preprocessed and a SOM is trained with the preprocessed data.

...
NDA> prepro -d boston -dout predata -e -n
NDA> somtr -d predata -sout som1 -l 4
...

Example: If a SOM is used as a classifier, it is important that new, unseen data records are preprocessed in the same way as the training data was. Thus, it is necessary to save the statistics of the data so that the same equalization can be repeated later. The following commands demonstrate this for the min/max-based equalization; the same procedure can also be applied to the deviation-based equalization, in which case the averages and deviations are computed instead of the minimum and maximum values.

# load the Boston data and take a sample for training a SOM
NDA> load boston.dat
NDA> selrec -d boston -dout sample1 -expr 'boston.rm > 5';
# compute min/max info, save it, and use it in preprocess
NDA> fldstat -d sample1 -dout mminfo -min -max
NDA> save mminfo -o mminfo.dat
NDA> prepro -d sample1 -dout predata -d2 mminfo -e -n
NDA> somtr -d predata -sout som1 -l 4
NDA> save som1 -o som1.som
...
# shutdown the NDA and start it again for matching
# the whole data to our sample
NDA> load boston.dat
NDA> load mminfo.dat
NDA> load som1.som
#
# preprocess the data and classify it by the SOM
#
NDA> prepro -d boston -dout predata -d2 mminfo -e -n
NDA> somcl -d predata -s som1 -cout cld
NDA> clstat -d boston -dout sta -c cld -hits -avg -min -max
# continue visualization for exploring the mapped data
....
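
For comparison, the same save-and-reuse pattern as a NumPy sketch (the file name and the layout of the stored statistics are illustrative assumptions, not NDA conventions): the per-field minimum and maximum are computed from the training sample, stored, and later applied unchanged to the unseen data.

# Sketch of the save-and-reuse pattern above (not the NDA implementation;
# the file name and the layout of mminfo are illustrative assumptions).
import numpy as np

train = np.array([[1.0, 200.0], [2.0, 400.0], [4.0, 800.0]])  # training sample

# Compute and store the per-field min/max (the role of 'fldstat ... -min -max').
mminfo = np.vstack([train.min(axis=0), train.max(axis=0)])
np.save("mminfo.npy", mminfo)

def preprocess(x, mminfo):
    # Equalize to [0,1] with the stored range, then normalize the records.
    lo, hi = mminfo
    x = (np.asarray(x, dtype=float) - lo) / (hi - lo)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Later session: reload the statistics and preprocess unseen data identically.
unseen = np.array([[3.0, 500.0]])
predata = preprocess(unseen, np.load("mminfo.npy"))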


