The purpose of the preprocessing is to eliminate statistical properties of the data that are caused by data coding. These properties often have an undesired effect on a reduction algorithm. The command prepro is a macro operation involving several suboperations to preprocess the data. It takes data as input and produces a new data frame as output. The suboperations are selected according to the flags as follows:
If <equdata> is not given, then x_min and x_max are defined according to the range of the source data (from the field x), and the target range is set to [0,1]. The scaling formula can then be simplified to the form:

x' = (x - x_min) / (x_max - x_min)
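In Python terms, this default equalization can be sketched as follows (an illustrative reimplementation, not part of NDA; the function name is hypothetical):

```python
def minmax_scale(values):
    """Scale a list of numbers to the target range [0, 1] using the
    minimum and maximum of the values themselves, i.e. the default
    behaviour when no <equdata> is given."""
    x_min, x_max = min(values), max(values)
    return [(x - x_min) / (x_max - x_min) for x in values]

print(minmax_scale([2.0, 4.0, 6.0]))  # -> [0.0, 0.5, 1.0]
```
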
If <equdata> includes two fields, then these fields are used as the source range ("equdata 1" in the figure below), and the values x_min (1st field) and x_max (2nd field) are read from <equdata> (the index of the field corresponds to the index of the items in <equdata>). The fields in <equdata> should have lengths matching the number of fields in the data to be processed. If <equdata> includes four fields (see "equdata 2" in the figure below), then all the scaling information is read from this data. The order of the fields in <equdata> is then: x_min, x_max, t_min, t_max (source minimum, source maximum, target minimum, target maximum).
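The four-field case corresponds to the general scaling formula with an explicit source range [x_min, x_max] and target range [t_min, t_max]. A minimal Python sketch of that formula (names are hypothetical; this is not NDA code):

```python
def scale_range(values, x_min, x_max, t_min, t_max):
    """Map values from the source range [x_min, x_max] onto the
    target range [t_min, t_max], as when <equdata> supplies all
    four scaling fields."""
    return [(x - x_min) / (x_max - x_min) * (t_max - t_min) + t_min
            for x in values]

# Map the source range [2, 6] onto the target range [-1, 1].
print(scale_range([2.0, 4.0, 6.0], 2.0, 6.0, -1.0, 1.0))  # -> [-1.0, 0.0, 1.0]
```
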
Similar to the min/max-based equalization, the scales can be computed directly from the source data or read from a file. In the first case, for each field, the average value avg and standard deviation std are computed before the field is scaled. In the second case, avg and std can be given in <equdata> (see the figure below); the first field in this data should include avg and the second field std. It is also possible to specify the target averages avg_t and deviations std_t through <equdata> (see "equdata 2" in the figure below), in which case the data fields are scaled as follows:

x' = (x - avg) / std * std_t + avg_t
where avg and std are the statistics of the source data, and avg_t and std_t specify the target averages and deviations of the data fields.
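The deviation-based scaling can be illustrated with a short Python sketch (an illustrative reimplementation, not NDA code; it assumes the population standard deviation, which may differ from NDA's choice):

```python
import math

def dev_scale(values, avg_t=0.0, std_t=1.0):
    """Scale values so that their average becomes avg_t and their
    standard deviation becomes std_t, i.e.
    x' = (x - avg) / std * std_t + avg_t."""
    n = len(values)
    avg = sum(values) / n
    std = math.sqrt(sum((x - avg) ** 2 for x in values) / n)  # population std
    return [(x - avg) / std * std_t + avg_t for x in values]

# Rescale to target average 10 and target deviation 2.
out = dev_scale([1.0, 2.0, 3.0], avg_t=10.0, std_t=2.0)
```

With the defaults avg_t = 0 and std_t = 1 this reduces to ordinary standardization of the field.
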
Example (ex4.1): Typically, the command prepro is used with the SOM operations somtr and somcl. This example demonstrates how data is preprocessed and how a SOM is trained with the preprocessed data.
...
NDA> prepro -d boston -dout predata -e -n
NDA> somtr -d predata -sout som1 -l 4
...
Example: If a SOM is used as a classifier, it is important that new, unseen data records are preprocessed in the same way as the training data was. Thus it is necessary to save the statistics of the data so that the same equalization can be repeated later. The following commands demonstrate this for the min/max-based equalization; the same procedure can also be applied to the deviation-based equalization, in which case the averages and deviations are computed instead of the minimum and maximum values.
# load the Boston data and take a sample for training a SOM
NDA> load boston.dat
NDA> selrec -d boston -dout sample1 -expr 'boston.rm > 5';
# compute min/max info, save it, and use it in preprocessing
NDA> fldstat -d sample1 -dout mminfo -min -max
NDA> save mminfo -o mminfo.dat
NDA> prepro -d sample1 -dout predata -d2 mminfo -e -n
NDA> somtr -d predata -sout som1 -l 4
NDA> save som1 -o som1.som
...
# shut down the NDA and start it again for matching
# the whole data to our sample
NDA> load boston.dat
NDA> load mminfo.dat
NDA> load som1.som
#
# preprocess the data and classify it by the SOM
#
NDA> prepro -d boston -dout predata -d2 mminfo -e -n
NDA> somcl -d predata -s som1 -cout cld
NDA> clstat -d boston -dout sta -c cld -hits -avg -min -max
# continue visualization for exploring the mapped data
...
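The save-and-reuse pattern in this example can be mimicked in Python to show why storing the training statistics matters (hypothetical helper names, with a data frame modeled as a dict of column lists; this is not NDA code):

```python
def fldstat_minmax(data):
    """Compute per-field (min, max), the analogue of saving
    'fldstat ... -min -max' output for later runs."""
    return {field: (min(col), max(col)) for field, col in data.items()}

def prepro_minmax(data, stats):
    """Scale each field to [0, 1] using PREVIOUSLY saved statistics,
    so unseen records are equalized exactly like the training data."""
    out = {}
    for field, col in data.items():
        lo, hi = stats[field]
        out[field] = [(x - lo) / (hi - lo) for x in col]
    return out

train = {"rm": [4.0, 6.0, 8.0]}
stats = fldstat_minmax(train)           # saved alongside the trained SOM
new = {"rm": [5.0, 7.0]}                # unseen records, same scaling applied
print(prepro_minmax(new, stats)["rm"])  # -> [0.25, 0.75]
```

Recomputing min/max on the new records instead would place them on a different scale than the map was trained on, which is exactly what reloading mminfo.dat avoids.
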