prepro | Macro command for preprocessing |
-d <data> | the name of the data |
-dout <dataout> | result data |
[-e] | min/max based equalization of fields |
[-edev] | variance-based equalization of fields |
[-n] | normalization of data records |
[-d2 <equdata>] | statistical values to be used for equalization |
[-md <value>] | missing value to be skipped |
The purpose of preprocessing is to eliminate statistical properties of the data that are caused by data coding. These properties often have an undesired effect on dimensionality reduction algorithms. prepro is a macro operation comprising several suboperations for preprocessing the data. It takes a data frame as input and produces a new data frame as output. The suboperations are selected with the flags as follows:
If <equdata> is not given, then min and max are defined according to the range of the source data (of the field x), and the target range is set to [0,1]. In this case, the scaling formula can be simplified to the form: x' = (x - min) / (max - min)
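As an illustration only (not part of NDA), the min/max-based equalization can be sketched in NumPy. The function name equalize_minmax and its parameter names are hypothetical:

```python
import numpy as np

def equalize_minmax(x, src_min=None, src_max=None, tgt_min=0.0, tgt_max=1.0):
    """Scale values linearly from [src_min, src_max] to [tgt_min, tgt_max].

    When src_min/src_max are omitted they are taken from the data itself,
    mirroring the case where <equdata> is not given and the target
    range defaults to [0, 1].
    """
    x = np.asarray(x, dtype=float)
    if src_min is None:
        src_min = x.min()
    if src_max is None:
        src_max = x.max()
    return (x - src_min) / (src_max - src_min) * (tgt_max - tgt_min) + tgt_min

field = np.array([2.0, 4.0, 6.0, 10.0])
print(equalize_minmax(field))  # scaled to [0, 1] field by field
```

With the default target range the formula reduces to (x - min) / (max - min), so the smallest value maps to 0 and the largest to 1.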
If <equdata> contains only two fields, then these fields are used as the source range (equdata 1 in the figure below), and the values min (1st field) and max (2nd field) are read from <equdata> (the index of the field in the source data corresponds to the index of the data record in <equdata>). The fields in <equdata> should have a length matching the number of fields in the data to be processed. If <equdata> contains four fields (see equdata 2 in the figure below), then all the scaling information is read from this data. In this case, the order of the fields in <equdata> is: source min, source max, target min, target max.
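A NumPy sketch (again illustrative, not NDA itself) of the four-field case, assuming the equdata columns are ordered source min, source max, target min, target max, with one equdata record per field of the data:

```python
import numpy as np

# data: 3 records x 2 fields
data = np.array([[1.0, 100.0],
                 [2.0, 200.0],
                 [3.0, 300.0]])

# equdata: one record per field of `data`; the column order
# [src_min, src_max, tgt_min, tgt_max] is an assumption here
equdata = np.array([[1.0,   3.0,   0.0, 1.0],    # ranges for field 0
                    [100.0, 300.0, -1.0, 1.0]])  # ranges for field 1

smin, smax, tmin, tmax = equdata.T
scaled = (data - smin) / (smax - smin) * (tmax - tmin) + tmin
print(scaled)  # field 0 mapped to [0, 1], field 1 to [-1, 1]
```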
Similarly to the min/max-based equalization, the scales can be computed directly from the source data or read from a frame. In the first case, the average value and standard deviation are computed for each field before the field is scaled. In the second case, avg and dev can be given in <equdata> (see the figure below). The first field in this data should contain avg and the second field dev. It is also possible to specify the target average and deviation through <equdata> (see equdata 2 in the figure below). In that case, data fields are scaled as follows: x' = (x - avg_src) / dev_src * dev_tgt + avg_tgt
where avg_src and dev_src contain the statistics of the source data, and avg_tgt and dev_tgt specify the target average and deviation of the data fields.
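The deviation-based scaling can be sketched the same way (illustrative NumPy code; the function name equalize_dev is hypothetical):

```python
import numpy as np

def equalize_dev(x, src_avg=None, src_dev=None, tgt_avg=0.0, tgt_dev=1.0):
    """Shift and scale values so that avg_src maps to tgt_avg and
    dev_src maps to tgt_dev: x' = (x - avg_src) / dev_src * dev_tgt + avg_tgt.

    When src_avg/src_dev are omitted they are computed from the data,
    mirroring the case where no <equdata> frame is given.
    """
    x = np.asarray(x, dtype=float)
    if src_avg is None:
        src_avg = x.mean()
    if src_dev is None:
        src_dev = x.std()
    return (x - src_avg) / src_dev * tgt_dev + tgt_avg

field = np.array([2.0, 4.0, 6.0, 8.0])
scaled = equalize_dev(field)
print(scaled.mean(), scaled.std())  # ~0.0 and ~1.0
```

With the default targets (average 0, deviation 1) this is ordinary standardization of the field.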
Example (ex4.1): Typically, prepro is used with the SOM operations somtr and somcl. This example demonstrates how the data is preprocessed and a SOM is trained with the preprocessed data.
...
NDA> prepro -d boston -dout predata -e -n
NDA> somtr -d predata -sout som1 -l 4
...
Example: If a SOM is used as a classifier, it is important that new, unseen data records are preprocessed in the same way as the training data. Thus, it is necessary to save the statistics of the original data so that the same equalization can be performed later. The following commands demonstrate this for the min/max-based equalization, but the same procedure can also be applied to the deviation-based equalization. In that case, the averages and deviations should be computed instead of the minimum and maximum values.
# Load the Boston data and take a sample for training a SOM
NDA> load boston.dat
NDA> selrec -d boston -dout sample1 -expr 'boston.rm > 5';
# Compute min/max info, save it and use it in preprocessing
NDA> fldstat -d sample1 -dout mminfo -min -max
NDA> save mminfo -o mminfo.dat
NDA> prepro -d sample1 -dout predata -d2 mminfo -e -n
NDA> somtr -d predata -sout som1 -l 4
...
NDA> save som1 -o som1.som
...
# Shut down the NDA and start it again for matching
# the whole data to our sample
NDA> load boston.dat
NDA> load mminfo.dat
NDA> load som1.som
# Preprocess the data and classify it with the SOM
NDA> prepro -d boston -dout predata -d2 mminfo -e -n
NDA> somcl -d predata -s som1 -cout cld
NDA> clstat -d boston -dout sta -c cld -hits -avg -min -max
# Continue visualization for exploring the mapped data
...
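The essential idea of the session above, saving the training statistics and reapplying them to unseen data, can be sketched in NumPy (illustrative only; the in-memory buffer stands in for mminfo.dat):

```python
import io
import numpy as np

# Training sample: 3 records x 2 fields. Compute per-field min/max,
# analogous to `fldstat ... -min -max` followed by `save`.
train = np.array([[1.0, 10.0],
                  [3.0, 30.0],
                  [5.0, 50.0]])
mn, mx = train.min(axis=0), train.max(axis=0)

buf = io.StringIO()           # stands in for mminfo.dat
np.savetxt(buf, np.vstack([mn, mx]))

# Later session: reload the statistics and apply the SAME scaling
# to unseen records, instead of recomputing min/max from them.
buf.seek(0)
mn2, mx2 = np.loadtxt(buf)
new = np.array([[2.0, 20.0],
                [6.0, 60.0]])
scaled = (new - mn2) / (mx2 - mn2)
print(scaled)
```

Note that unseen records outside the training range map outside [0, 1], which is expected: the point is that identical inputs always map to identical scaled values across sessions.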