
Motivation for categorization of answers

All of the one-to-many coding schemes produce a large number of new variables, so the dimensionality of the training set can become a problem. As dimensionality increases, the significance of a single variable value diminishes rapidly in the calculation of the Euclidean distance between two similar vectors. That is, if one or a few components of a long binary vector are complemented, the resulting vector is still very similar to the original one when compared with a completely complemented data vector.
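The effect can be illustrated with a small numerical sketch. The helper names below (euclidean, relative_flip_distance) are hypothetical and chosen only for this illustration. The sketch compares the distance caused by complementing a few components of a binary vector with the distance to the completely complemented vector. For k complemented components in an n-dimensional binary vector the ratio is sqrt(k/n), so it shrinks as n grows.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def relative_flip_distance(dim, flips=1):
    """Distance after complementing `flips` components, relative to the
    distance between the vector and its complete complement.

    The all-zero vector is used as the original; for binary vectors the
    result does not depend on this choice.
    """
    original = [0] * dim
    flipped = [1] * flips + [0] * (dim - flips)  # only `flips` components changed
    complement = [1] * dim                       # every component changed
    return euclidean(original, flipped) / euclidean(original, complement)


if __name__ == "__main__":
    # The ratio falls as the dimensionality grows.
    for dim in (4, 16, 64, 256):
        print(dim, relative_flip_distance(dim))
```

With a single complemented component the ratio is 0.5 for 4 dimensions and 0.25 for 16 dimensions, which shows how quickly one variable loses weight in the distance measure.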

The larger the number of data fields, the more neurons are needed to represent all possible combinations. A good approach for reducing dimension, and for controlling what kind of clusters will emerge, is to divide the original data fields into semantically meaningful categories and to analyze each category separately. No single category can meet the analysis objectives alone, so these individual analyses must then be combined to get a general picture of all the information.

Often semantic categories exist in the form of prior knowledge. When they do not, the data fields can be categorized with factor analysis or other similar methods.



Anssi Lensu
Tue Nov 3 12:18:16 EET 1998