共查询到20条相似文献,搜索用时 31 毫秒
1.
《Intelligent Data Analysis》1998,2(1-4):265-286
The main problem considered in this paper consists of binarizing categorical (nominal) attributes having a very large number of values (204 in our application). A small number of relevant binary attributes are gathered from each initial attribute. Let us suppose that we want to binarize a categorical attribute v with L values, where L is large or very large. The total number of binary attributes that can be extracted from v is 2L−1− 1, which in the case of a large L is prohibitive. Our idea is to select only those binary attributes that are predictive; and these shall constitute a small fraction of all possible binary attributes. In order to do this, the significant idea consists in grouping the L values of a categorical attribute by means of an hierarchical clustering method. To do so, we need to define a similarity between values, which is associated with their predictive power. By clustering the L values into a small number of clusters (J), we define a new categorical attribute with only J values. The hierarchical clustering method used by us, AVL, allows to choose a significant value for J. Now, we could consider using all the 2L−1− 1 binary attributes associated with this new categorical attribute. Nevertheless, the J values are tree-structured, because we have used a hierarchical clustering method. We profit from this, and consider only about 2 × J binary attributes. If L is extremely large, for complexity and statistical reasons, we might not be able to apply a clustering algorithm directly. In this case, we start by “factorizing” v into a pair (v2, v2), each one with about √L(v) values. For a simple example, consider an attribute v with only four values m1,m2, m3,m4. Obviously, in this example, there is no need to factorize the set of values of v, because it has a very small number of values. Nevertheless, for illustration purposes, v could be decomposed (factorized) into 2 attributes with only two values each; the correspondence between the values of v and (v2, v2) would be
υ | (υ1, | υ2) |
---|---|---|
m1 | 1 | 1 |
m2 | 1 | 2 |
m3 | 2 | 1 |
m4 | 2 | 2 |