
Basic preprocessing

   

prepro Macro command for preprocessing
-d <data> the name of the data
-dout <dataout> result data
[-e] min/max-based equalization of fields
[-edev] standard-deviation-based equalization of fields
[-n] normalization of data records
[-d2 <equdata>] statistical values to be used for equalization
[-md <value>] missing value to be skipped

The purpose of preprocessing is to eliminate statistical properties of the data that are caused only by the way the data is coded. Such properties often have an undesired effect on dimensionality reduction algorithms. prepro is a macro operation that combines several suboperations for preprocessing the data. It takes a data frame as input and produces a new data frame as output. The suboperations are selected with the following flags:

-e
Equalization based on [min, max] ranges. Each value in the data field x is scaled according to the following formula:

\[ x' = \frac{x - \min_x}{\max_x - \min_x} \, (\max_{x'} - \min_{x'}) + \min_{x'} \]

If <equdata> is not given, then min_x and max_x are taken from the range of the source data (the field x), and the target range [min_x', max_x'] is set to [0,1]. In this case, the scaling formula simplifies to:

\[ x' = \frac{x - \min_x}{\max_x - \min_x} \]

If <equdata> contains only two fields, these fields are used as the source range (equdata 1 in the figure below), and the values min_x (1st field) and max_x (2nd field) are read from <equdata> (the index of a field in the source data corresponds to the index of a data record in <equdata>). The fields in <equdata> should have a length matching the number of fields in the data to be processed. If <equdata> contains four fields (see equdata 2 in the figure below), then all the scaling information is read from this data. In this case, the order of the fields in <equdata> is min_x, max_x, min_x', max_x'.

[Figure: the two accepted formats of <equdata> for min/max-based equalization: equdata 1 with fields (min_x, max_x); equdata 2 with fields (min_x, max_x, min_x', max_x')]
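
To make the scaling concrete, here is a short Python sketch (not part of NDA; the function name and the NumPy-based implementation are illustrative assumptions). It performs the same min/max-based equalization field by field, covering both the default [0,1] target range and explicit source/target ranges as in equdata 2:

import numpy as np

def minmax_equalize(data, src_range=None, tgt_range=None):
    """Scale each field (column) from its source range to a target range."""
    data = np.asarray(data, dtype=float)
    # Source range: computed from the data (no -d2) or given per field (equdata 1)
    lo, hi = (data.min(axis=0), data.max(axis=0)) if src_range is None else src_range
    # Target range: defaults to [0, 1], or given per field (equdata 2)
    tlo, thi = (0.0, 1.0) if tgt_range is None else tgt_range
    return (data - lo) / (hi - lo) * (thi - tlo) + tlo

x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
print(minmax_equalize(x))    # each column now runs from 0 to 1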

-edev
1) Equalization based on standard deviation and 2) centering of the data on its average. Values in the data field x are scaled according to the following formula:

\[ x' = \frac{x - \mathrm{avg}_x}{\mathrm{dev}_x} \]

As with the min/max-based equalization, the scales can be computed directly from the source data or read from a frame. In the first case, the average value and standard deviation of each field are computed before the field is scaled. In the second case, avg_x and dev_x can be given in <equdata> (see the figure below). The first field in this data should contain avg_x and the second field dev_x. It is also possible to specify the target average and deviation through <equdata> (see equdata 2 in the figure below). In that case, the data fields are scaled as follows:

\[ x' = \frac{x - \mathrm{avg}_x}{\mathrm{dev}_x} \, \mathrm{dev}_{x'} + \mathrm{avg}_{x'} \]

where avg_x and dev_x are the statistics of the source data, and avg_x' and dev_x' specify the target average and deviation of the data fields.

[Figure: the two accepted formats of <equdata> for deviation-based equalization: equdata 1 with fields (avg_x, dev_x); equdata 2 with fields (avg_x, dev_x, avg_x', dev_x')]
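
Under the same caveat, a minimal Python sketch of the deviation-based equalization (the function name and defaults are assumptions, not NDA code) could look like this; with no target statistics given, each field ends up with average 0 and standard deviation 1:

import numpy as np

def dev_equalize(data, src_stats=None, tgt_stats=None):
    """Centre each field on its average and scale it by its standard deviation."""
    data = np.asarray(data, dtype=float)
    # Source statistics: computed from the data or given per field (equdata 1)
    avg, dev = (data.mean(axis=0), data.std(axis=0)) if src_stats is None else src_stats
    # Target statistics: default (0, 1), or given per field (equdata 2)
    tavg, tdev = (0.0, 1.0) if tgt_stats is None else tgt_stats
    return (data - avg) / dev * tdev + tavg

x = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
z = dev_equalize(x)
print(z.mean(axis=0), z.std(axis=0))    # approximately [0 0] and [1 1]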

-n
Normalization of data records. Each component of a data record (vector) is divided by the Euclidean length of the record:

\[ x'_i = \frac{x_i}{\lVert x \rVert} = \frac{x_i}{\sqrt{\sum_j x_j^2}} \]
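
As an illustration only (again a Python sketch, under the assumption that the record length means the Euclidean norm), the normalization can be written as:

import numpy as np

def normalize_records(data):
    """Divide every data record (row) by its Euclidean length."""
    data = np.asarray(data, dtype=float)
    return data / np.linalg.norm(data, axis=1, keepdims=True)

x = np.array([[3.0, 4.0], [1.0, 0.0]])
print(normalize_records(x))    # [[0.6 0.8], [1. 0.]]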

Example (ex4.1): Typically, prepro is used together with the SOM operations somtr and somcl. This example demonstrates how the data is preprocessed and a SOM is trained with the preprocessed data.

...
NDA> prepro -d boston -dout predata -e -n
NDA> somtr -d predata -sout som1 -l 4
...

Example: If a SOM is used as a classifier, it is important that new, unseen data records are preprocessed in the same way as the training data. Thus, it is necessary to save the statistics of the original data so that the same equalization can be performed later. The following commands demonstrate this for the min/max-based equalization, but the same procedure can also be applied to the deviation-based equalization. In that case, the averages and deviations should be computed instead of the minimum and maximum values.

# Load the Boston data and take a sample for training a SOM
NDA> load boston.dat
NDA> selrec -d boston -dout sample1 -expr 'boston.rm > 5;'
# Compute min/max info, save it and use it in preprocess
NDA> fldstat -d sample1 -dout mminfo -min -max
NDA> save mminfo -o mminfo.dat
NDA> prepro -d sample1 -dout predata -d2 mminfo -e -n
NDA> somtr -d predata -sout som1 -l 4
...
NDA> save som1 -o som1.som
...
# Shut down NDA and start it again to match
# the whole data set against our sample
NDA> load boston.dat
NDA> load mminfo.dat
NDA> load som1.som
# Preprocess data and classify it with the SOM
NDA> prepro -d boston -dout predata -d2 mminfo -e -n
NDA> somcl -d predata -s som1 -cout cld
NDA> clstat -d boston -dout sta -c cld -hits -avg -min -max
# Continue visualization for exploring the mapped data
...
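
The point of reusing mminfo can be illustrated with a small Python sketch (hypothetical values, not NDA output): unseen records are scaled with the statistics saved from the training sample, so they stay on the same scale as the preprocessed training data even if their values fall outside the original range.

import numpy as np

# Min/max saved from the training sample (the role of mminfo above)
train = np.array([[1.0, 10.0], [5.0, 50.0]])
lo, hi = train.min(axis=0), train.max(axis=0)

# A new record is scaled with the saved statistics; values outside the
# training range may land outside [0, 1], but they remain directly
# comparable with the preprocessed training data.
new = np.array([[2.0, 60.0]])
print((new - lo) / (hi - lo))    # [[0.25 1.25]]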


