Basic preprocessing

   

prepro Macro command for preprocessing
-d <data>          name of the input data frame
-dout <dataout>    name of the result data frame
[-e]               min/max-based equalization of fields
[-edev]            deviation-based equalization of fields
[-n]               normalization of data records
[-d2 <equdata>]    statistical values to be used for the equalization
[-md <value>]      missing value to be skipped

The purpose of preprocessing is to eliminate statistical properties of the data that are caused merely by the data coding, since such properties often have an undesired effect on dimensionality reduction algorithms. prepro is a macro operation that combines several suboperations for preprocessing the data. It takes a data frame as input and produces a new data frame as output. The suboperations are selected with the following flags:

-e
Equalization based on min/max ranges. Each value in a data field x is scaled from the source range $[\min_s, \max_s]$ to the target range $[\min_t, \max_t]$ according to the following formula:

\[ x' = \frac{x - \min_s}{\max_s - \min_s}\,(\max_t - \min_t) + \min_t \]

If <equdata> is not given, then $\min_s$ and $\max_s$ are taken from the range of the source data (from the field x), and the target range $[\min_t, \max_t]$ is set to [0,1]. In this case, the scaling formula simplifies to:

\[ x' = \frac{x - \min_s}{\max_s - \min_s} \]

If <equdata> contains only two fields, these fields are used as the source range (equdata 1 in the figure below): the values $\min_s$ (first field) and $\max_s$ (second field) are read from <equdata>, so that the index of a field in the source data corresponds to the index of a data record in <equdata>. The fields of <equdata> should therefore have a length matching the number of fields in the data to be processed. If <equdata> contains four fields (see equdata 2 in the figure below), then all of the scaling information is read from this data; in that case, the order of the fields in <equdata> is $\min_s, \max_s, \min_t, \max_t$.

[Figure: example <equdata> layouts for -e: equdata 1 with two fields (source min and max), equdata 2 with four fields (source min and max, target min and max).]
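A minimal Python sketch (not part of NDA) of what the min/max-based equalization computes for a single field; the default [0,1] target range follows the description above, while the function name and interface are invented for illustration:

def equalize_minmax(values, src_min=None, src_max=None, tgt_min=0.0, tgt_max=1.0):
    """Scale one data field from [src_min, src_max] to [tgt_min, tgt_max]."""
    # If no source range is supplied (no <equdata>), take it from the field itself.
    if src_min is None:
        src_min = min(values)
    if src_max is None:
        src_max = max(values)
    span = src_max - src_min          # assumed non-zero in this sketch
    return [(v - src_min) / span * (tgt_max - tgt_min) + tgt_min for v in values]

# With the defaults the field is scaled to [0, 1]:
print(equalize_minmax([2.0, 4.0, 6.0]))   # [0.0, 0.5, 1.0]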

-edev
Equalization based on the standard deviation, combined with centering the data to its average. Values in a data field x are scaled according to the following formula:

\[ x' = \frac{x - \mathrm{avg}}{\mathrm{dev}} \]

Similarly to the min/max-based equalization, the scales can be computed directly from the source data or read from a frame. In the first case, the average value and standard deviation of each field are computed before the field is scaled. In the second case, $\mathrm{avg}$ and $\mathrm{dev}$ can be given in <equdata> (see the figure below): the first field of this data should contain $\mathrm{avg}$ and the second field $\mathrm{dev}$. It is also possible to specify the target average and deviation through <equdata> (see equdata 2 in the figure below). In that case, the data fields are scaled as follows:

\[ x' = \frac{x - \mathrm{avg}_s}{\mathrm{dev}_s}\,\mathrm{dev}_t + \mathrm{avg}_t \]

where $\mathrm{avg}_s$ and $\mathrm{dev}_s$ are the statistics of the source data, and $\mathrm{avg}_t$ and $\mathrm{dev}_t$ specify the target average and deviation of the data fields.

[Figure: example <equdata> layouts for -edev: equdata 1 with the source avg and dev fields, equdata 2 additionally containing the target avg and dev.]
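Correspondingly, a small Python sketch of the deviation-based equalization; whether NDA uses the population or the sample deviation is not stated above, so the population form here is an assumption, and the function name is invented:

import statistics

def equalize_dev(values, src_avg=None, src_dev=None, tgt_avg=0.0, tgt_dev=1.0):
    """Center one data field to tgt_avg and scale its deviation to tgt_dev."""
    # If no statistics are supplied (no <equdata>), compute them from the field.
    if src_avg is None:
        src_avg = statistics.mean(values)
    if src_dev is None:
        src_dev = statistics.pstdev(values)   # population standard deviation (assumption)
    return [(v - src_avg) / src_dev * tgt_dev + tgt_avg for v in values]

# With the defaults this is the plain (x - avg) / dev scaling:
print(equalize_dev([2.0, 4.0, 6.0]))   # approximately [-1.22, 0.0, 1.22]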

-n
Normalization of data records. Each component of a data record (vector) is divided by the Euclidean length of the record, as follows:

\[ x_i' = \frac{x_i}{\sqrt{\sum_j x_j^2}} \]
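A one-function Python sketch of this normalization, assuming the length above means the Euclidean norm of the record:

import math

def normalize_record(record):
    """Divide each component of a data record by the record's Euclidean length."""
    length = math.sqrt(sum(x * x for x in record))
    return [x / length for x in record]

print(normalize_record([3.0, 4.0]))   # [0.6, 0.8]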

Example (ex4.1): Typically, prepro is used together with the SOM operations somtr and somcl. This example demonstrates how the data is preprocessed and a SOM is then trained with the preprocessed data.

...
NDA> prepro -d boston -dout predata -e -n
NDA> somtr -d predata -sout som1 -l 4
...

Example: If a SOM is used as a classifier, it is important that new, unseen data records are preprocessed in the same way as the training data. Thus, it is necessary to save the statistics of the original data so that the same equalization can be performed later. The following commands demonstrate this for the min/max-based equalization, but the same procedure can also be applied to the deviation-based equalization; in that case, the averages and deviations are computed instead of the minimum and maximum values.

# Load the Boston data and take a sample for training a SOM
NDA> load boston.dat
NDA> selrec -d boston -dout sample1 -expr 'boston.rm > 5;'
# Compute min/max info, save it and use it in preprocess
NDA> fldstat -d sample1 -dout mminfo -min -max
NDA> save mminfo -o mminfo.dat
NDA> prepro -d sample1 -dout predata -d2 mminfo -e -n
NDA> somtr -d predata -sout som1 -l 4
...
NDA> save som1 -o som1.som
...
# Shut down NDA and start it again for matching
# the whole data set to our sample
NDA> load boston.dat
NDA> load mminfo.dat
NDA> load som1.som
# Preprocess data and classify it with the SOM
NDA> prepro -d boston -dout predata -d2 mminfo -e -n
NDA> somcl -d predata -s som1 -cout cld
NDA> clstat -d boston -dout sta -c cld -hits -avg -min -max
# Continue visualization for exploring the mapped data
...
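The key point of the session above, reusing the training statistics when preprocessing new data, can also be sketched in a few lines of Python (illustrative only; the helper names are invented and this code does not use NDA):

def minmax_per_field(data):
    """Collect (min, max) for each field (column) of a list of records."""
    return [(min(col), max(col)) for col in zip(*data)]

def apply_minmax(data, stats):
    """Scale every field with previously saved (min, max) statistics."""
    return [[(x - lo) / (hi - lo) for x, (lo, hi) in zip(row, stats)] for row in data]

train = [[1.0, 10.0], [3.0, 30.0]]
new   = [[2.0, 20.0], [4.0, 40.0]]

stats = minmax_per_field(train)   # saved after training, like mminfo.dat above
print(apply_minmax(new, stats))   # the new data is scaled with the training ranges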

