
Motivation for categorization of answers

All of the one-to-many coding schemes produce a large number of new variables, so the dimensionality of the training set can become a problem. Figure 2 illustrates how little a change in a single variable value contributes to the Euclidean distance between two otherwise similar vectors. It compares the distance obtained when one or a few components of a binary vector are complemented with the distance obtained when the whole vector is complemented.

Figure 2: Dimensionality reduces the importance of a single binary variable in Euclidean distance calculation.
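
The effect can be reproduced numerically. The following is a minimal sketch, not taken from the original work: the function name flipped_distance_ratio and the choice of an all-zero starting vector are assumptions made for illustration. It complements k of n binary components and divides the resulting distance by the distance to the fully complemented vector, which works out to sqrt(k/n).

```python
import math


def flipped_distance_ratio(n, k):
    """Distance after complementing k of n binary components,
    relative to the distance after complementing all n of them."""
    original = [0] * n  # assumed starting vector; any binary vector gives the same ratio
    partly = [1 - v if i < k else v for i, v in enumerate(original)]
    fully = [1 - v for v in original]
    return math.dist(original, partly) / math.dist(original, fully)


# One flipped component matters less and less as the dimension grows.
for n in (4, 16, 64, 256):
    print(n, flipped_distance_ratio(n, 1))
```

With a single flipped component the ratio falls from 0.5 at 4 dimensions to 0.0625 at 256 dimensions, so a one-answer difference becomes nearly invisible to the distance measure.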

The more data fields there are, the more neurons are needed to represent all possible combinations of their values. A good way to reduce the dimension, and to control what kind of clusters will emerge, is to divide the original data fields into semantically meaningful categories. No single category can meet the analysis objectives alone, so the individual analyses must then be combined to get a general picture of all the information.

Semantic categories often exist in the form of prior knowledge. When they do not, the data fields can be categorized with factor analysis or other similar methods.
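
When the categories are known in advance, dividing a coded record into per-category vectors is straightforward. The sketch below only illustrates the idea: the field names, the two categories, and the helper split_by_category are all hypothetical and do not come from the original data.

```python
def split_by_category(record, categories):
    """Return one sub-vector per category, in the field order the category lists."""
    return {name: [record[field] for field in fields]
            for name, fields in categories.items()}


# Hypothetical questionnaire fields, already coded as binary variables.
categories = {
    "background": ["age_under_30", "age_30_plus", "urban"],
    "opinions": ["agrees_q1", "agrees_q2"],
}
record = {"age_under_30": 1, "age_30_plus": 0, "urban": 1,
          "agrees_q1": 0, "agrees_q2": 1}

parts = split_by_category(record, categories)
print(parts)
```

Each sub-vector can then serve as input to its own lower-dimensional analysis.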



Anssi Lensu
Tue Nov 3 11:38:53 EET 1998