Calculating distances between data records



From data records to data records

This command calculates a new data frame containing the distances from a group of data records in one frame to a group of data records in another (or the same) frame. The distance measure can be a) euclidean, b) Hamming or c) Levenshtein distance. The result is a new data frame whose fields are named `0', `1', `2' and so on. Each field name is the index of a data record in the first group, and each data record number in the result is the index of a data record in the second group.

Euclidean distance is calculated in the usual way, $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$, where $x_i$ and $y_i$ are the values of field $i$ in the two data records. Hamming distance compares two data records, and the resulting distance is the number of field values in which they differ. When Hamming distance is used, the original data frames may contain STRING fields. In that case the distance is the number of character positions that differ in the evaluated strings. Levenshtein (or EDIT) distance is the minimum number of character addition, deletion or update operations required to change the string in the first frame into the string in the second frame. In this case a single field name needs to be specified. The same field must be available in both original frames, and it must be of type STRING.
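The three measures described above can be sketched in Python. This is an illustration of the standard definitions only, not the NDA implementation; the function names are hypothetical, and raising an error for sequences of unequal length in the Hamming case is an assumption, since the text does not say how NDA handles that situation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hamming(a, b):
    """Number of positions in which two equal-length sequences differ.

    Works for records (tuples of field values) and for strings.
    Assumption: unequal lengths are rejected; NDA may behave differently.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def levenshtein(s, t):
    """Minimum number of single-character additions, deletions or
    updates needed to change string s into string t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # addition
                            prev[j - 1] + cost))  # update or match
        prev = curr
    return prev[-1]
```

For instance, `levenshtein("kitten", "sitting")` needs two updates and one addition, while `hamming("karolin", "kathrin")` counts the three character positions that differ.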

Subsets of original data records can be specified using classes <class1> and <class2>. Omission of <class1> (and <class2>) results in the evaluation of all data records. If <data2> is not specified, <data1> is used instead. Minimum, maximum and average distances can be calculated and stored using switches -dmin, -dmax and -davg. All results are FLOATs for euclidean distance and INTs otherwise.
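The layout of the result can be pictured with a small Python sketch. It follows the description above: one field per data record of the first group (named `0', `1', ...) and one result record per data record of the second group. The function names are hypothetical, and the assumption that minimum, maximum and average are taken over the fields of each result record is ours; the text does not state exactly how -dmin, -dmax and -davg aggregate.

```python
import math


def distance_table(group1, group2):
    """Euclidean distances laid out as described in the text:
    fields are indices of group1, rows are indices of group2."""
    names = [str(i) for i in range(len(group1))]
    table = []
    for rec2 in group2:
        row = {name: math.dist(rec1, rec2)
               for name, rec1 in zip(names, group1)}
        table.append(row)
    return table


def summarize(row):
    """Minimum, maximum and average of one result record.

    Assumption: the aggregation runs over the fields of a single row.
    """
    values = list(row.values())
    return min(values), max(values), sum(values) / len(values)


# Two records in the first group, one record in the second group.
table = distance_table([(0, 0), (3, 4)], [(0, 0)])
d_min, d_max, d_avg = summarize(table[0])
```

The values are floats here, which matches the statement that results are FLOATs for euclidean distance.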

dist Calculate euclidean, Hamming or Levenshtein distances between data records

-d1 <data1> name of the first original data frame

-dout <dataout> distances between data vectors

[-cl1 <class1>] name of the first class specification

[-d2 <data2>] name of the second original data frame

[-cl2 <class2>] name of the second class

[-dmin <do-min>] field name for located minimum distance

[-dmax <do-max>] field name for maximum distance

[-davg <do-avg>] field name for average distance

[-ham] calculate Hamming distance instead of euclidean

[-lev <lev-fld>] calculate Levenshtein (or EDIT) distance for specified field

This command creates new fields in a data frame, containing distances from some data vectors to other data vectors within the same or a different data set. The distance measure can be euclidean, Hamming or Levenshtein. The located minimum, average and maximum distances can be stored for later use.

Example: The following commands calculate a) the euclidean distance between all data records of data1 and data2, b) the Hamming distance between data3 and data4 and c) the Levenshtein distance between data3.str and data4.str.

# List the contents of name space
NDA> ls -l -fr
 fr  d   /data1
 int f   /data1.x
 flt f   /data1.y
 fr  d   /data2
 int f   /data2.x
 flt f   /data2.y
 fr  d   /data3
 int f   /data3.x
 flt f   /data3.y
 str f   /data3.str
 fr  d   /data4
 int f   /data4.x
 flt f   /data4.y
 str f   /data4.str
# Calculate distances from data records to others
NDA> dist -d1 data1 -d2 data2 -dout dists1
NDA> dist -d1 data3 -d2 data4 -dout dists2 -ham
NDA> dist -d1 data3 -d2 data4 -dout dists3 -lev str

Calculating the euclidean distance between data3 and data4 results in an error message, because both frames contain a STRING field (str):

NDA> dist -d1 data3 -d2 data4 -dout dists
 Returned error -450: Type does not match

From data records to group representatives

This command calculates a new data frame containing the distances from all data records of a frame to several groups identified by classified data. The distance can be measured using a) single linkage or b) group-average linkage (see figure below). The single linkage method locates the closest group member. The group-average method first evaluates the average point of each group and then calculates the distances between these group averages and all data records. The result is a new data frame whose fields are named after the groups identified in the classified data.

The computational cost of group-average linkage can be much lower than that of single linkage, but it is unsuitable for certain types of data distributions. Part c) of the following figure depicts a data distribution that is problematic for group-average linkage.
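The difference between the two linkage methods can be illustrated with a short Python sketch. The function names and the data are hypothetical, and the ring-shaped group is only one example of a distribution that defeats group-average linkage; it is not necessarily the case shown in the figure.

```python
import math


def single_linkage(record, group):
    """Distance from a record to the closest member of a group."""
    return min(math.dist(record, member) for member in group)


def group_average_linkage(record, group):
    """Distance from a record to the average point of a group."""
    centroid = tuple(sum(coords) / len(group) for coords in zip(*group))
    return math.dist(record, centroid)


# Hypothetical data: a ring-shaped group surrounding a compact group.
# Both groups have the same average point, (0, 0).
ring = [(2, 0), (-2, 0), (0, 2), (0, -2)]
core = [(0.5, 0), (-0.5, 0)]
record = (2, 0.1)  # lies right next to a member of the ring

closest_by_single = min(("ring", "core"),
                        key=lambda g: single_linkage(record, {"ring": ring, "core": core}[g]))
gap_by_average = abs(group_average_linkage(record, ring)
                     - group_average_linkage(record, core))
```

Single linkage assigns the record to the ring, because a ring member is its nearest neighbour. Group-average linkage cannot separate the two groups at all, since their average points coincide. On the other hand, group-average linkage needs only one distance per group, whereas single linkage needs one distance per group member.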

There is also a local distance model, in which all distances are limited to the located minimum group-to-group distances. These minimum group distances are evaluated using single or group-average linkage, according to the chosen method.

cldist Calculate euclidean distances between data records and groups

-d <datain> name of the original data to be used

-c <cldata> classified data containing groupings

-dout <dataout> distances from each vector to groups

[-dmax <maxfld>] field name to use for maximum distance

[-gavg] group-average linkage instead of single

[-local] use locally limited distances

This command creates new fields in a data frame, containing distances from data vectors to groups of data vectors within the same data set. The groupings are identified by a classified data structure. The fields in the output data are named after the groups. The located maximum distance can be stored for later use.

Example: See grpms (section 4.10).


Anssi Lensu
Tue Jul 23 11:58:18 EET DST 2002
