Next: Generating data Up: Preprocessing data Previous: Equalize values within records

Treatment for missing data

  

msd                     Replace missing data
  -d <datain>           source data frame
  -dout <dataout>       target data frame
  [-ddist <distdata>]   distance data (if different from source)
  [-md <value>]         the missing value (default -9999.0)
  [-num <maxmf>]        maximum number of missing fields / data record
  [-ver]                verbose reporting of replacements and discarded records
  [-stat]               report number of original, completed and discarded data records, minimum and maximum located distance etc.

msdbycen                Replace missing data with prototypes
  -d <data>             source data frame
  -d2 <proto-data>      prototype data
  -c <cldata>           connections from prototypes
  -dout <dataout>       the name of the replaced data frame
  -md <value>           the missing value
  [-ver]                verbose

Operation msd replaces all missing values in the source data with the field values of the nearest data record. The nearest data record is found using a normalized distance, computed according to the following formula:

[Two display equations defining the distance and its normalization are missing from this copy.]

Here the missing-component count is the number of components whose value is missing in r(a) or in r(b), and N is the total number of data fields. The distance computation takes the number of missing values into account and scales the distances so that pairs of data records with many missing values get larger distances. The nearest data record that is selected must not have missing values in the same fields as the record being completed.
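The nearest-record replacement described above can be sketched in Python. This is only an illustration, not the NDA implementation: the function names are hypothetical, the Euclidean base distance and the scaling factor N / (N - n_missing) are assumptions chosen so that pairs with more missing components get larger distances, and the donor check reflects one reading of the last sentence above.

```python
import math

MISSING = -9999.0  # default missing-value marker from the option list


def normalized_distance(a, b, missing=MISSING):
    """Distance over fields present in both records, scaled by missing count.

    ASSUMPTION: Euclidean distance and the factor N / (N - n_missing) are
    illustrative; the manual's exact formula is not reproduced here.
    """
    n_total = len(a)
    shared = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    n_missing = n_total - len(shared)  # missing in a or in b
    if not shared:
        return math.inf  # no common fields to compare
    raw = math.sqrt(sum((x - y) ** 2 for x, y in shared))
    return raw * n_total / (n_total - n_missing)


def complete_record(record, candidates, missing=MISSING):
    """Fill the gaps of `record` from the nearest usable candidate.

    Returns the completed record, or None if no candidate can fill every gap.
    """
    gaps = [i for i, v in enumerate(record) if v == missing]
    if not gaps:
        return list(record)
    best, best_dist = None, math.inf
    for cand in candidates:
        if cand is record:
            continue
        # A donor needs a value in every field that is to be filled.
        if any(cand[i] == missing for i in gaps):
            continue
        d = normalized_distance(record, cand, missing)
        if d < best_dist:
            best, best_dist = cand, d
    if best is None:
        return None
    filled = list(record)
    for i in gaps:
        filled[i] = best[i]
    return filled
```

A record for which `complete_record` returns None corresponds to a record that cannot be completed.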

If the maximum number of missing fields is specified (-num), all data records with more missing fields are discarded from the output data frame. Records that cannot be completed are also discarded. The missing data value can be set with -md, and a separate data frame can be given as the distance data with -ddist. Statistics (-stat) and verbose reporting of the data records containing missing fields (-ver) can be requested.
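The discarding rule can be illustrated with a short Python sketch. The function name and the list-of-lists record layout are hypothetical; only the rule itself (drop records with more than the allowed number of missing fields) comes from the text.

```python
MISSING = -9999.0  # default missing-value marker from the option list


def split_by_missing_count(records, max_missing, missing=MISSING):
    """Separate records that stay from records that exceed the limit.

    A record is discarded when it has MORE than `max_missing` missing fields.
    """
    kept, discarded = [], []
    for rec in records:
        n_missing = sum(1 for v in rec if v == missing)
        if n_missing > max_missing:
            discarded.append(rec)
        else:
            kept.append(rec)
    return kept, discarded


records = [
    [1.0, 2.0, 3.0],          # complete
    [1.0, MISSING, 3.0],      # one missing field: kept with max_missing=1
    [MISSING, MISSING, 3.0],  # two missing fields: discarded
]
kept, discarded = split_by_missing_count(records, max_missing=1)
print(f"src {len(records)}, discarded {len(discarded)} => trg {len(kept)}")
```

The counts relate in the same way as in the -stat report: the number of target records is the number of source records minus the number of discarded ones.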

The operation msdbycen replaces missing values with the values of prototypes given in a separate data frame (-d2). The classified data frame (-c) should indicate which data records are connected to which prototypes. See also somcl (section 5.1.2) about the missing data flag.
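Prototype-based replacement can be sketched as follows. The representation is an assumption made for illustration: `connections[i]` is taken to be the index of the prototype connected to record `i`, and the function name is hypothetical. NDA's own classified data frame may be organized differently.

```python
def replace_with_prototypes(records, prototypes, connections, missing):
    """Fill each missing field with the same field of the record's prototype.

    ASSUMPTION: connections[i] is the prototype index for records[i].
    """
    completed = []
    for rec, proto_idx in zip(records, connections):
        proto = prototypes[proto_idx]
        completed.append(
            [p if v == missing else v for v, p in zip(rec, proto)]
        )
    return completed


MISSING = -9.0  # missing-value marker used in this illustration
records = [[1.0, MISSING], [MISSING, 4.0]]
prototypes = [[0.5, 2.5], [3.5, 4.5]]
connections = [0, 1]
result = replace_with_prototypes(records, prototypes, connections, MISSING)
print(result)
```

Fields that already hold a value are left unchanged; only the gaps take the prototype's value.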

Example: Replace missing fields (marked with -9) in the source data, discarding records that have more than 4 missing fields.

# Load data containing missing fields
NDA> load kv.dat -n km
# Perform missing data removal
NDA> msd -d km -dout kv -md -9 -num 4 -stat
 distances: min   16.6/  515.4, max   92.2/  847.7
 - data rec: src 1120, corrected 130, discarded 28 => trg 1092



Anssi Lensu
Tue Jul 23 11:58:18 EET DST 2002