Greedy Heuristics for the Choice of the Radius of Local Concentrations in FOREL-2 Algorithm

Abstract

The authors examine the problem of choosing the search radius for local concentrations in the FOREL-2 clustering algorithm with a given initial number of clusters. Our approach is aimed at improving the accuracy and stability of the result in applications such as identifying homogeneous batches of industrial products. We examined the k-means and FOREL-2 algorithms on test values normalized by the standard deviation and by the allowable parameter values, for the problem of automatic classification of objects in a multi-dimensional space of measured parameters. For such problems, we apply greedy heuristic procedures to select the radius of local concentrations in the FOREL-2 algorithm. According to the obtained Rand index, the approach using the FOREL-2 algorithm demonstrated the best accuracy, with a larger value of the objective function, in comparison with the k-means algorithm. The accuracy and speed of the software implementation of the algorithm are quite acceptable for solving the problem of clustering electronic radio products based on test data. Using greedy heuristics to choose the search radius for local concentrations in the FOREL-2 clustering algorithm with a specified number of clusters gives an advantage in the speed of exact clustering over the k-means algorithm that uses greedy heuristics for choosing centroids.

Keywords: FOREL-2, k-means, clustering, greedy heuristics, electronic radio products

Introduction

MacQueen (1967) proposed one of the classical and most common problem-oriented clustering algorithms. In the field of clustering data on electronic radio products, an algorithm of this kind is the k-means algorithm, originally proposed by Lloyd (1982). Kazakovtsev et al. (2015) continued research in this area on the choice of the data normalization method, the distance metric and the centroid initialization method. The algorithm of Milligan and Cooper (1985) requires the number of clusters to be given a priori and finds all groups to cluster at the same time. In the terminology of Klosgen and Zytkow (1996), the clustering process consists of grouping n test-case observations into k groups. In the algorithms of Anderberg (1973) and Bobrowski and Bezdek (1991), a representative observation of each cluster is specified by a centroid given as a d-dimensional feature vector. Optimal clustering is achieved at the minimum value of the objective function, which, for a set of observations (x1, x2, …, xn) divided into k clusters C = {C1, C2, …, Ck}, is defined by the following formula:

$\arg\min_{C} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$, (1)

where $\mu_i$ is the centroid (mean vector) of cluster $C_i$.
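The objective in formula (1) can be evaluated directly. The following sketch (using NumPy, with illustrative data) computes the sum of squared distances of each observation to its cluster centroid:

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Sum of squared distances of each point to its cluster centroid,
    the quantity minimized in formula (1)."""
    total = 0.0
    for i, mu in enumerate(centroids):
        members = X[labels == i]          # points assigned to cluster C_i
        total += np.sum((members - mu) ** 2)
    return total

# toy example with two well-separated clusters
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
print(kmeans_objective(X, labels, centroids))
```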

The algorithm of Li and Wu (2012) effectively raises the speed of convergence by improving the way of selecting the initial cluster focal point. Hossain et al. (2019) also improved the k-means algorithm: if the Euclidean distance between two data points is less than or equal to a threshold value, these two points are placed in the same group. The criterion of Pérez-Ortega et al. (2018) enables us to balance effort and benefit at the convergence stage. In the work of Na et al. (2010), the algorithm saves running time by avoiding frequent recomputation of the distance of each data object to the cluster centres. According to Wu et al. (2008), k-means is one of the most influential data mining algorithms in the research community.

Problem Statement

The FOREL-2 algorithm by Zagoruiko (1999) is a modification of the basic FOREL algorithm that was proposed by Zagoruiko and Yolkina (Zagoruiko, 1999). If we compare it with the k-means algorithm, we find much in common: clusters obtained by both algorithms have a spherical shape, and the results of both algorithms depend on the data normalization method.

In the algorithm of Zhuravleva, the features of the objects are initially normalized in such a way that the values of all parameters lie in the range from zero to one. In comparison with k-means, the FOREL-2 algorithm has a range of specific features. The number of clusters depends on the radius of the sphere: the larger the radius, the fewer clusters are obtained.
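The normalization described above, mapping every feature into the range from zero to one, corresponds to min-max scaling. A minimal sketch (the function name and data are illustrative):

```python
import numpy as np

def minmax_normalize(X):
    """Scale every feature column into the range [0, 1], as required
    before running FOREL-2 (a sketch of min-max normalization)."""
    mins = X.min(axis=0)
    spans = X.max(axis=0) - mins
    spans[spans == 0] = 1.0   # guard against constant features
    return (X - mins) / spans

X = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
print(minmax_normalize(X))
```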

In the general statement of the problem, the FOREL algorithm searches for a partition of m objects into k clusters such that the criterion F of similarity to the centre, summed over all clusters, is minimal. Let $C_j$ denote the coordinates of the centre of the j-th cluster. The sum of distances $\rho(C_j, a_i)$ between the centre of a sphere with radius $R_0$ and all $m_j$ points $a_i$ of this cluster is $p_j = \sum_{i=1}^{m_j} \rho(C_j, a_i)$, and the sum of such internal distances over all k clusters is $F = \sum_{j=1}^{k} p_j$.

For the modified FOREL-2 algorithm, the clustering quality function F uses an additional criterion f corresponding to the value of the exact number of clusters

$F = f(k_i) \sum_{j=1}^{k} p_j$, (2)

where f(ki)=1, if ki=k, or f(ki)=∞, if ki≠k. The best option corresponds to the minimum value F.
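The penalized quality function (2) is straightforward to evaluate. A sketch, where `cluster_sums` holds the intra-cluster distance sums $p_j$:

```python
import math

def penalized_objective(cluster_sums, k_required):
    """Formula (2): F = f(k_i) * sum of intra-cluster distance sums p_j,
    where f = 1 if the obtained number of clusters equals k, else infinity."""
    k_obtained = len(cluster_sums)
    f = 1.0 if k_obtained == k_required else math.inf
    return f * sum(cluster_sums)

print(penalized_objective([2.5, 3.0, 1.5], k_required=3))  # exact k -> 7.0
print(penalized_objective([2.5, 3.0], k_required=3))       # wrong k -> inf
```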

The condition for joining a point $a_i$ to the cluster $C_j$ is determined by the formula

$\rho(C_j, a_i) \le R_0 \le R$, (3)

where R is the radius of the search for local concentrations.

For the modified FOREL-2 algorithm, it is necessary to obtain an exactly specified number of clusters. In all variations of the FOREL algorithm, the basic procedure coincides with the k-means algorithm at steps 1, 2, 3 and 4. Step 5 differs: all observations of test trials falling inside the sphere are selected according to (3), and convergence of the objective function is determined according to (2).
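The basic FOREL procedure of moving a sphere to the centroid of the points it covers can be sketched as follows. This is a simplified single-sphere illustration; the starting point and stopping details are assumptions, not the authors' exact implementation:

```python
import numpy as np

def forel_sphere(X, R, start_idx=0, max_iter=100):
    """One FOREL local-concentration search: place a sphere of radius R,
    move its centre to the centroid of the points inside (condition (3)),
    and repeat until the centre stops moving."""
    center = X[start_idx].copy()
    inside = X
    for _ in range(max_iter):
        inside = X[np.linalg.norm(X - center, axis=1) <= R]
        new_center = inside.mean(axis=0)
        if np.allclose(new_center, center):
            break
        center = new_center
    return center, inside

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.2], [5.0, 5.0]])
center, members = forel_sphere(X, R=1.0)
print(center)        # stabilizes near the centroid of the three close points
print(len(members))  # -> 3
```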

Research Questions

The following questions were posed during the study:

  • How to choose the search radius for local concentrations in the FOREL-2 clustering algorithm with an initial number of clusters?
  • How to improve the accuracy and stability of the result, such as identifying homogeneous batches of industrial products?

Purpose of the Study

The answers to these questions will lead us to find the best accuracy, according to the obtained Rand index with a larger value of the objective function in comparison with the k-means algorithm.

At the same time, being compared to the classical implementation of the k-means algorithm, the use of greedy heuristics in the FOREL-2 algorithm shows greater efficiency.

Research Methods

Orlov and Fedosov (2016) used data from tests conducted at a test centre in order to isolate presumably homogeneous batches with some accuracy. For two types of integrated circuits (ICs), 140UD25A and 140UD26A, belonging to different homogeneous batches, the initial data is a combination of parameters and consists of 18 batches (9 batches of each of the two IC types). Products (devices) in each batch are described by 18 measured input parameters. For the first IC type, the data vectors of batches 1, 3, 8 and 9 were selected, in the amount of n=807. For the second IC type, the data vectors of batches 4, 5, 7 and 9 were selected, in the amount of n=532. The data set was then decimated by discarding every 2nd, 3rd and 4th data point of the test results, so that the numbers became n=201 and n=132 for the first and second IC types, respectively. In the work of Ahmatshin and Kazakovtsev (2020), the selected test data were normalized to the standard deviation and to the allowable values of the parameter.
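The two normalization variants mentioned above, by standard deviation and by allowable parameter values, can be sketched as follows. The allowable limit values used here are hypothetical placeholders, not the specification limits of the actual ICs:

```python
import numpy as np

def normalize_by_std(X):
    """Centre each feature and divide by its standard deviation
    (a sketch of the first normalization variant mentioned above)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def normalize_by_allowable(X, lower, upper):
    """Scale each feature by its allowable (specification) limits;
    the limit values passed in are hypothetical."""
    return (X - lower) / (upper - lower)

X = np.array([[1.0, 4.0], [3.0, 8.0]])
print(normalize_by_std(X))
print(normalize_by_allowable(X, lower=np.array([0.0, 0.0]),
                             upper=np.array([10.0, 10.0])))
```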

The FOREL-2 algorithm was compared with the k-means algorithm on a system with 8 GB of RAM, an Intel Core i3-3220 processor with a frequency of 3.3 GHz, and an Ubuntu 18.04.6 LTS operating system. We conducted 30 experiments for each algorithm. All results are presented with an average of 30 runs. Each launch lasted 1, 10 and 60 seconds.

The average results of clustering on the speed of the algorithms are given in Table 1: maximum (max), minimum (min), median value (median), average value (avr) and standard deviation (st.dev) for the Rand index and objective function. For the objective function, the coefficient of variation (var) and the span factor (spn) are also calculated.
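The reported statistics can be reproduced with the standard library. Note that the exact definitions of the coefficient of variation (var) and the span factor (spn) are not spelled out in the text, so the ratios used below (st.dev/avr and range/avr) are assumptions:

```python
import statistics

def summary_stats(values):
    """Statistics of the kind reported in Table 1; the definitions of
    var and spn here are assumptions (st.dev/average and range/average)."""
    avr = statistics.mean(values)
    st_dev = statistics.pstdev(values)
    return {
        "max": max(values), "min": min(values),
        "median": statistics.median(values),
        "avr": avr, "st.dev": st_dev,
        "var": st_dev / avr,                       # assumed definition
        "spn": (max(values) - min(values)) / avr,  # assumed definition
    }

runs = [10.0, 12.0, 11.0, 13.0]   # illustrative run results
print(summary_stats(runs))
```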

Table 1 - Accuracy of clustering 1, 3, 8, and 9 batches of IСs 140UD25A data

Findings

The approach using the FOREL-2 algorithm with a greedy heuristic for choosing the radius of the search for local concentrations demonstrated the best accuracy according to the obtained Rand index, with a larger value of the objective function, in comparison with the k-means algorithm.
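The Rand index used for this comparison counts the fraction of object pairs on which two labelings agree (both in the same cluster, or both in different clusters). A minimal sketch:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Rand index: fraction of point pairs on which the two labelings agree."""
    agree = 0
    pairs = list(combinations(range(len(labels_true)), 2))
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 1]))  # identical labelings -> 1.0
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```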

For solving the problem of clustering electronic radio products based on test data, the accuracy and speed of the software implementation of the algorithm are quite acceptable. Compared to the classical implementation of the k-means algorithm, the use of greedy heuristics in the FOREL-2 algorithm showed greater efficiency.

Conclusion

When solving clustering problems, the FOREL-2 algorithm often gives a result that is very far from the optimal solution. In this research, we aimed at developing not only a fast but also a highly accurate algorithm, based on greedy heuristics for selecting the radius of local concentrations, for solving problems of clustering electronic radio products based on test data using the FOREL-2 algorithm.

Computational experiments show that using greedy heuristics for selecting the radius of local concentrations in the FOREL-2 algorithm improves the accuracy and speed of obtaining results when identifying electronic radio products. Moreover, the best results were shown by the FOREL-2 algorithm compared to the k-means algorithm that uses medoids as intermediate centres of spheres.

Acknowledgments

The work was supported by the Grant of the President of the Russian Federation for state support of Leading Scientific Schools (NSh-421.2022.4).

References

  • Ahmatshin, F. G., & Kazakovtsev, L. A. (2020). Impact of data normalization methods and clustering model in the problem of automatic grouping of industrial products. Journal of Physics: Conference Series, 1679(3), 032085.

  • Anderberg, M. R. (1973). Cluster Analysis for Applications. Academic Press.

  • Bobrowski, L., & Bezdek, J. C. (1991). c-means clustering with the l1 and l∞ norms. IEEE Transactions on Systems, Man, and Cybernetics, 21(3), 545-554. https://doi.org/10.1109/21.97475

  • Hossain, M. Z., Akhtar, M. N., Ahmad, R. B., & Rahman, M. (2019). A dynamic K-means clustering for data mining. Indonesian Journal of Electrical Engineering and Computer Science, 13(2), 521-526.

  • Kazakovtsev, L. A., Antamoshkin, A. N., & Masich, I. S. (2015). Fast deterministic algorithm for EEE components classification. IOP Conference Series: Materials Science and Engineering, 94, 012015.

  • Klosgen, W., & Zytkow, J. M. (1996). Knowledge discovery in databases terminology. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining (pp. 573-592). AAAI Press/The MIT Press.

  • Li, Y., & Wu, H. (2012). A clustering method based on K-means algorithm. Physics Procedia, 25, 1104-1109.

  • Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137.

  • MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability (pp. 281-297). University of California Press.

  • Milligan, G. W., & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2), 159-179.

  • Na, S., Xumin, L., & Yong, G. (2010). Research on k-means clustering algorithm: An improved k-means clustering algorithm. In 2010 Third International Symposium on intelligent information technology and security informatics (pp. 63-67). IEEE.

  • Orlov, V. I., & Fedosov, V. V. (2016). ERC clustering dataset. http://levk.info/data1526.zip

  • Pérez-Ortega, J., Almanza-Ortega, N. N., & Romero, D. (2018). Balancing effort and benefit of K-means clustering algorithms in Big Data realms. PLOS ONE, 13(9), e0201874.

  • Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., & Steinberg, D. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1-37.

  • Zagoruiko, N. G. (1999). Applied methods of data and knowledge analysis. Novosibirsk: Institute of Mathematics SD RAS.

Copyright information

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

About this article

Publication Date

27 February 2023

eBook ISBN

978-1-80296-960-3

Publisher

European Publisher

Volume

1


Edition Number

1st Edition

Pages

1-403

Cite this article as:

Ahmatshin, F. G., & Kazakovtsev, L. A. (2023). Greedy Heuristics for the Choice of the Radius of Local Concentrations in FOREL-2 Algorithm. In P. Stanimirovic, A. A. Stupina, E. Semenkin, & I. V. Kovalev (Eds.), Hybrid Methods of Modeling and Optimization in Complex Systems, vol 1. European Proceedings of Computers and Technology (pp. 366-371). European Publisher. https://doi.org/10.15405/epct.23021.45