Clustering and Vector Quantization

Cluster analysis (CA) is an important instrument in many scientific disciplines. Its applications cover several fields such as financial investment, data mining, pattern recognition, texture and image segmentation, boundary detection and surface approximation, magnetic resonance imaging, handwritten character recognition, computer vision, information retrieval, and machine learning. According to Jain et al., CA is the organization of a collection of patterns (usually represented as a vector of measurements, or a point in a multidimensional space) into clusters based on similarity. Intuitively, patterns within a cluster are more similar to each other than they are to patterns belonging to different clusters. Pattern proximity is usually measured by a distance function defined on pairs of patterns.

In many applications in telecommunications and other fields where the compression and/or the transmission of data is involved, techniques based on algorithms for vector quantization (VQ) are often used. In this case, too, patterns are subdivided into groups (or cells) based on similarity, as measured by a distance function. Each cell is represented by a vector (called a codeword) approximating all of its elements. The set of codewords is called the codebook.

In the wide scientific community using techniques for CA and/or VQ, a widespread opinion is that they are different. CA is generally conceived as the problem of identifying, with an unsupervised approach, the clusters possibly present in the multi-dimensional data set to be analyzed. By contrast, a VQ algorithm is not intended to find clusters, but to represent the data by a reduced number of elements that approximate the original data set as well as possible. In spite of these differences, it is possible to show that, in many cases, CA and VQ are practically equivalent. Summarizing, from an operative point of view the two approaches often execute roughly the same operations: grouping the data into a certain number of groups so that a loss (or error) function is minimized. We will therefore use the symbols and terminology typical of VQ, and we will also use the term Unsupervised Learning (UL) to indicate either or both of the two approaches.

A widely accepted classification scheme subdivides these techniques into two main groups: hard (crisp) and soft (fuzzy). The difference between them lies mainly in the degree of membership of each vector to the clusters. During the construction of the codebook, in the hard case each vector belongs to only one cluster with a unitary degree of membership, whereas in the fuzzy case each vector can belong to several clusters with different degrees of membership. Another classification scheme distinguishes two further groups: "K-means" and "competitive learning". The clustering techniques belonging to the first group try to minimize the average distortion through a suitable choice of the codewords. In the second case, the codewords are obtained as the outcome of a process of competition between them. It is worthwhile to underline that, in many algorithms for unsupervised learning, the quality of the final result depends on the choice of the initial conditions and of the configuration parameters. Considerable work has been performed by Giuseppe Patanè to eliminate or, at least, reduce such dependences as much as possible.
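As a concrete picture of the hard, K-means-style quantization just described, the following minimal sketch (Python with NumPy; all names are illustrative and not taken from the algorithms discussed in these pages) performs one LBG/Lloyd-type iteration: each vector is assigned to its nearest codeword and each codeword is then re-estimated as the centroid of its cell, so that the average distortion decreases.

```python
import numpy as np

def vq_iteration(data, codebook):
    """One hard (crisp) quantization step: nearest-codeword assignment
    followed by a centroid update. Illustrative sketch only."""
    # Squared Euclidean distance between every input vector and every codeword.
    dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)                        # cell of each vector
    distortion = dists[np.arange(len(data)), labels].mean()
    new_codebook = codebook.copy()
    for i in range(len(codebook)):
        cell = data[labels == i]
        if len(cell) > 0:                                # empty cells left untouched here
            new_codebook[i] = cell.mean(axis=0)
    return new_codebook, labels, distortion

# Example: 1000 two-dimensional vectors quantized with a codebook of 8 codewords.
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 2))
codebook = data[rng.choice(len(data), size=8, replace=False)]
for _ in range(20):
    codebook, labels, distortion = vq_iteration(data, codebook)
```

The random draw of the initial codebook in this example is exactly the kind of choice on which the final result of this family of algorithms can depend.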
The good results obtained are presented in the pages describing the related algorithms. Giuseppe Patanè's interest in this field dates back to his master's thesis. The topic can be subdivided into four distinct lines:
Static techniques of clustering

The research in this field has led to the development of a new and original algorithm for vector quantization called ``The Enhanced LBG'' (ELBG) [MyNNsELBG], [MyKES00], [MyIWANN99ELBG], [MyWILF99ELBG].

ELBG

ELBG belongs to the hard and K-means vector quantization groups and derives directly from the simpler LBG. The basic idea developed is the concept of the utility of a codeword. Even though Fritzke had already introduced this term, its definition and meaning are totally different in the context of ELBG. The utility index used in ELBG is a powerful instrument for overcoming some of the greatest drawbacks of LBG and other VQ algorithms, first of all the dependence of the final result on the choice of the initial codebook. The utility makes it possible to identify situations of local minimum; moreover, it allows badly positioned codewords to be recognized and gives useful indications about the regions where they should be placed. ELBG looks for a codebook to which each codeword contributes in the same way, i.e. one in which the utility of all the codewords is the same. Essentially, ELBG is LBG with the addition of some further steps. Starting from the mathematical properties of the utility index and other considerations, several sub-optimal solutions have been developed, which keep the computational overhead of ELBG with respect to LBG low. The experimental results show that ELBG is able to find better codebooks than previous works, that it is practically independent of the initial conditions, and that its computational complexity is virtually the same as that of the simpler LBG algorithm.
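The utility index can be pictured, roughly, as the total distortion of a cell normalized by the mean cell distortion: codewords whose cells contribute much less than average to the total error are candidates to be moved towards the cells that contribute much more. The following sketch (Python with NumPy) is a simplified reading of this idea, not the exact ELBG definition or the published procedure.

```python
import numpy as np

def utility_index(data, codebook, labels):
    """Per-codeword utility: total distortion of each cell normalized by the
    mean cell distortion (a simplified reading of the index used in ELBG)."""
    k = len(codebook)
    cell_distortion = np.zeros(k)
    for i in range(k):
        cell = data[labels == i]
        if len(cell) > 0:
            cell_distortion[i] = ((cell - codebook[i]) ** 2).sum()
    return cell_distortion / cell_distortion.mean()

# Codewords with utility well below 1 contribute little to the total error and are
# candidates to be shifted towards the cells whose utility is well above 1.
```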
Incremental techniques of clustering

This topic concerns clustering algorithms in which, iteration by iteration, the system is optimized not only with respect to the positions of the codewords, but also with respect to their number. Such techniques are used for both supervised and unsupervised learning. This research, too, has led to the development and implementation of a new algorithm, called the Fully Automatic Clustering System (FACS) [MyFACS].

FACS

FACS is able to automatically find a codebook of the right size once the input data set, the distortion measure and the desired error (or target) are fixed. It develops through a sequence of iterations like ELBG but, unlike ELBG, FACS evaluates the error at the end of each iteration and, according to whether it is above or below the target, decides whether to perform another iteration with the same number of codewords or, if necessary, to increase or decrease it (a schematic sketch of this decision loop appears further below). The results obtained are very good as regards both the number of iterations required and the number of codewords needed to reach the desired target. In fact, in a number of iterations comparable with that required by a static clustering algorithm (like LBG or ELBG), FACS is able to obtain a codebook that has practically the same quantization error as a codebook calculated by ELBG with the same number of codewords found by FACS. Although it is a clustering algorithm, i.e. a technique for unsupervised learning, FACS can also be used for typical supervised learning problems.

Techniques for complexity reduction in clustering problems

At present, some algorithms for reducing the amount of computation required in clustering problems are under investigation. By exploiting ``neighborhood relations'' between codewords, they try to dramatically reduce the number of distance calculations between vectors, which constitute the most time-consuming operation in all of the clustering algorithms considered. Several techniques of this kind exist in the literature; basically, they rely on the triangle inequality and similar properties, which allow the calculations to be reduced without worsening the quality of the final result (a sketch of this kind of pruning appears further below). The goal of the strategy under investigation, instead, is a larger reduction at the cost of a small deterioration of the final result.

Distributed clustering

The employment of clustering techniques can be a very resource-demanding task when complex problems with large codebooks and large data sets have to be faced. In that case, very powerful and expensive computers (including multi-processor ones) are required. An intermediate solution, investigated by Giuseppe Patanè, is represented by commodity supercomputers, a family of powerful and inexpensive computing systems. This line of research concerns the design and implementation of distributed techniques for vector quantization employing the MULTISOFT machine (described in detail in the related section and in [MyParClustering]). The first results are reported in [MyIJCNN00]. They concern two software tools implementing the distributed versions of LBG and ELBG, respectively. Both of them use PVM for inter-process communication. In the case of easy problems the results are poor, and the lack of a true broadcast function in the communication system makes the situation worse. On the contrary, when the complexity of the task increases, i.e. when a heavy computational effort is required and the response time of a mono-processor machine is high, the results obtained are very interesting.
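The FACS decision step mentioned above can be pictured as a control loop wrapped around a static quantizer: after each pass the current error is compared with the target and the codebook size is kept, increased, or decreased. The sketch below is only a schematic illustration of that idea; the growth and shrink rules it uses are naive placeholders chosen to keep the example runnable, not the rules published for FACS.

```python
import numpy as np

def quantize(data, codebook):
    """Nearest-codeword assignment; returns labels and mean squared error."""
    dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    return labels, dists[np.arange(len(data)), labels].mean()

def adapt_codebook_size(data, codebook, target_error, max_iters=100, tolerance=0.02):
    """Schematic grow/shrink control loop around a static quantizer
    (an illustration of the idea, not the FACS algorithm itself)."""
    codebook = codebook.copy()
    for _ in range(max_iters):
        labels, error = quantize(data, codebook)
        for i in range(len(codebook)):               # one static centroid update
            cell = data[labels == i]
            if len(cell) > 0:
                codebook[i] = cell.mean(axis=0)
        counts = np.bincount(labels, minlength=len(codebook))
        if error > target_error * (1 + tolerance):
            # Grow: duplicate (with a small perturbation) the codeword of the most populated cell.
            codebook = np.vstack([codebook, codebook[np.argmax(counts)] + 1e-3])
        elif error < target_error * (1 - tolerance) and len(codebook) > 1:
            # Shrink: drop the codeword with the smallest cell.
            codebook = np.delete(codebook, np.argmin(counts), axis=0)
        else:
            break                                     # target met with the current size
    return codebook
```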
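The kind of pruning mentioned for the complexity-reduction line can be illustrated with the classical triangle-inequality bound: if the distance between two codewords satisfies d(c_best, c_j) >= 2 d(x, c_best), then c_j cannot be closer to x than c_best, so d(x, c_j) need not be computed at all. The sketch below applies this exact-pruning idea (in the spirit of Elkan-style accelerations); it is not the specific approximate strategy under investigation, which accepts a small loss of quality in exchange for a larger reduction.

```python
import numpy as np

def assign_with_pruning(data, codebook):
    """Nearest-codeword search that skips distances ruled out by the
    triangle inequality: if cc[best, j] >= 2 * d(x, best), then c_j
    cannot beat the current best codeword."""
    # Precompute the codeword-to-codeword distances once per codebook.
    cc = np.sqrt(((codebook[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2))
    labels = np.empty(len(data), dtype=int)
    for n, x in enumerate(data):
        best = 0
        best_d = np.sqrt(((x - codebook[0]) ** 2).sum())
        for j in range(1, len(codebook)):
            if cc[best, j] >= 2.0 * best_d:
                continue                      # pruned: d(x, c_j) >= cc[best, j] - best_d >= best_d
            d = np.sqrt(((x - codebook[j]) ** 2).sum())
            if d < best_d:
                best, best_d = j, d
        labels[n] = best
    return labels
```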
The results obtained with the distributed versions of LBG and ELBG allow us to say that cheap supercomputers can be very well suited to very complex problems of unsupervised learning.

PAUL

The analysis of the results obtained with the distributed versions of LBG and ELBG has given useful information for improving the implementation of the distributed clustering algorithms proposed. In particular, it made it clear that, in the distributed version of ELBG, a strong limitation on the speed-up was imposed by the serial execution of a part of the algorithm: the ELBG-block (the block grouping together all of the operations constituting the enhancement with respect to LBG). Such a solution was necessary because of the intrinsically serial nature of the operations it performs. The idea, therefore, was to develop a new algorithm that, taking ELBG as its starting point, could be implemented entirely on a parallel or distributed architecture. The experiments performed led to a new technique, called the ``Parallel Algorithm for Unsupervised Learning'' (PAUL) [MyPAUL], satisfying all of the requirements. Although PAUL only approximates ELBG, the results obtained are very close to it as regards the quantization error and clearly superior as regards the speed-up, especially when treating complex problems with large data sets and codebooks. For example, on a problem of large-sized image compression, PAUL reaches a speed-up of 12.55 when using 16 processors and 15.14 when using 20 processors.

LBG Super-cluster (LBGS)

This work [MyLBGS] presents a distributed algorithm whose results are very good when facing complex problems, i.e. when the cardinality of both the data set and the codebook is high. As its name suggests, LBGS is derived from the well known (serial) LBG algorithm and is made more suitable for implementation on parallel and distributed systems. In practice, in LBGS the inter-process communication is minimized and the tasks involved in the quantization process are almost independent of each other. Moreover, the goal of reducing the computations has been pursued at the cost of some deterioration of the result with respect to traditional serial techniques; nevertheless, the user can tune the compromise between precision and speed. The target was reached and the quality of the results is really remarkable. For a particularly complex problem of image compression, speed-ups of about 200 (two hundred) were obtained with only 16 processors, while the price to be paid in terms of quantization error was kept within a few percentage points (less than 5%). The special expedients briefly presented make LBGS very suitable for implementation on a cluster of personal computers. However, it is useful to underline that the techniques developed for LBGS are valid in general: if more powerful computing systems are available, it is possible to obtain lower computing times or, for the same computing time, lower errors.
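A common way to obtain the data-parallel behaviour described for the distributed tools above is to let each process quantize its own shard of the data against the current codebook and send back only partial sums, so that the per-iteration traffic does not grow with the data-set size. The sketch below uses Python's multiprocessing in place of PVM and is only an illustration of this partial-sum scheme, not a reconstruction of the distributed LBG/ELBG, PAUL or LBGS implementations.

```python
import numpy as np
from multiprocessing import Pool

def partial_stats(args):
    """Worker: assign a data shard to the nearest codewords and return
    per-cell vector sums, counts and distortion (small messages only)."""
    shard, codebook = args
    dists = ((shard[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    k, dim = codebook.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k)
    np.add.at(sums, labels, shard)
    np.add.at(counts, labels, 1)
    return sums, counts, dists[np.arange(len(shard)), labels].sum()

def distributed_iteration(shards, codebook, pool):
    """Master: merge the partial results and update the codebook."""
    results = pool.map(partial_stats, [(s, codebook) for s in shards])
    sums = sum(r[0] for r in results)
    counts = sum(r[1] for r in results)
    distortion = sum(r[2] for r in results) / sum(len(s) for s in shards)
    codebook = codebook.copy()
    nonempty = counts > 0
    codebook[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return codebook, distortion
```

On a single multi-core machine a Pool created under an `if __name__ == "__main__":` guard is enough to exercise the scheme; on a commodity cluster the same small partial-sum messages would travel over the network instead.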