Abstract
The authors examine the problem of choosing the search radius for local concentrations in the FOREL2 clustering algorithm with a given initial number of clusters. Our approach aims to improve the accuracy and stability of the result in problems such as identifying homogeneous batches of industrial products. We examined the k-means and FOREL2 algorithms on test values normalized by the standard deviation and by the valid parameter values for the problem of automatic classification of objects in a multidimensional space of measured parameters. For such problems, we apply greedy heuristic procedures to select the radius of local concentrations in the FOREL2 algorithm. According to the obtained Rand index, the approach based on the FOREL2 algorithm demonstrated the best accuracy, with a larger value of the objective function, in comparison with the k-means algorithm. The accuracy and speed of the software implementation of the algorithm are quite acceptable for the problem of clustering electronic radio products based on test data. The use of greedy heuristics for choosing the search radius for local concentrations in the FOREL2 clustering algorithm with a specified number of clusters has an advantage in the speed of exact clustering over the k-means algorithm that uses greedy heuristics for choosing centroids.
Keywords: FOREL2, k-means, clustering, greedy heuristics, electronic radio products
Introduction
MacQueen (1967) proposed one of the classical and most widely used problem-oriented clustering algorithms. In the field of clustering data on electronic radio products, an algorithm of this kind is the k-means algorithm, originally proposed by Lloyd (1982). Kazakovtsev et al. (2015) continued research in this area concerning the choice of the data normalization method, the distance metric and the centroid initialization method. The algorithm of Milligan and Cooper (1985) requires the number of clusters to be given a priori and finds all cluster groups at the same time. In the terminology of Klosgen and Zytkow (1996), the clustering process consists of grouping n test-case observations into k groups. In the algorithms of Anderberg (1973) and Bobrowski and Bezdek (1991), a representative observation is specified by a centroid described by a d-dimensional feature vector. Optimal clustering is achieved at the minimum value of the objective function. The objective function takes into account the set of observations (x_{1}, x_{2}, …, x_{n}) that are divided into k clusters C = {C_{1}, C_{2}, …, C_{k}} by the following formula:
$\underset{C}{\arg\min}\sum_{i=1}^{k}\sum_{x\in C_{i}}\left\Vert x-\mu_{i}\right\Vert^{2}$, (1)
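As an illustration (not part of the original study), the value of the objective function (1) for a given partition can be computed as follows; the function name and the toy data are our own:

```python
import numpy as np

def kmeans_objective(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid, as in (1)."""
    total = 0.0
    for i, mu in enumerate(centroids):
        members = points[labels == i]          # observations assigned to cluster C_i
        total += np.sum(np.linalg.norm(members - mu, axis=1) ** 2)
    return total
```

The k-means algorithm seeks the partition minimizing this quantity; any candidate clustering can be scored with it.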
The algorithm of Li and Wu (2012) effectively raises the speed of convergence by improving the way the initial cluster focal points are selected. Hossain et al. (2019) also improved the k-means algorithm: if the Euclidean distance between two data points is less than or equal to a threshold value, these two points are placed in the same group. The criterion of Pérez-Ortega et al. (2018) makes it possible to balance effort and benefit at the convergence stage. In the work of Na et al. (2010), the algorithm saves running time by avoiding the frequent recomputation of the distance from each data object to the cluster centres. According to Wu et al. (2008), k-means is one of the most influential data mining algorithms in the research community.
Problem Statement
The FOREL2 algorithm by Zagoruiko (1999) is a modification of the basic FOREL algorithm proposed by Zagoruiko and Yolkina (Zagoruiko, 1999). Compared with the k-means algorithm, it has much in common: clusters obtained by both algorithms have a spherical shape, and the results of both depend on the data normalization method.
In the algorithm of Zhuravleva, the features of the objects are first normalized so that the values of all parameters lie in the range from zero to one. In comparison with k-means, the FOREL2 algorithm has a number of specific features. The number of clusters depends on the radius of the sphere: the larger the radius, the fewer clusters are obtained.
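The normalization to the range from zero to one mentioned above is ordinary min-max scaling; a minimal sketch (our own code, not taken from the paper) is:

```python
import numpy as np

def minmax_normalize(X):
    """Scale each feature (column) of X to the range [0, 1]."""
    mn = X.min(axis=0)
    rng = X.max(axis=0) - mn
    rng[rng == 0] = 1.0  # constant features: avoid division by zero
    return (X - mn) / rng
```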
In the general statement of the problem, the FOREL algorithm searches for a partition of m objects into k clusters such that the criterion F of similarity to the centre, taken over all clusters, is minimal. Let us denote the coordinates of the centre of the j-th cluster by C_{j}. The sum of the distances p(C_{j}, a_{i}) between the centre of a sphere with radius R_{0} and all m_{j} points a_{i} of this cluster is p_{j} = ∑p(C_{j}, a_{i}), where i = 1…m_{j}, and the sum of such internal distances over all k clusters is F = ∑p_{j}, j = 1…k.
For the modified FOREL2 algorithm, the clustering quality function F uses an additional criterion f that enforces the exact number of clusters:
$F=f\left(k_{i}\right)\sum_{j=1}^{k}p_{j}$, (2)
where f(k_{i}) = 1 if k_{i} = k, and f(k_{i}) = ∞ if k_{i} ≠ k. The best partition corresponds to the minimum value of F.
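The quality function (2) with its infinite penalty can be sketched directly (our own illustration; `cluster_sums` holds the values p_j):

```python
import math

def quality_F(cluster_sums, k_obtained, k_required):
    """Clustering quality (2): sum of within-cluster distance sums p_j,
    with an infinite penalty when the cluster count differs from k."""
    if k_obtained != k_required:
        return math.inf  # f(k_i) = infinity: partition is rejected
    return sum(cluster_sums)  # f(k_i) = 1
```

Any partition with the wrong number of clusters thus scores infinitely worse than any partition with exactly k clusters.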
The condition for joining a point to a cluster x∈K_{j} is determined by the formula
$p\left(C_{j},a_{i}\right)\le R_{0}=R$, (3)
where R is the radius of the search for local concentrations.
For the modified FOREL2 algorithm, it is necessary to obtain an exactly specified number of clusters. All variations of the FOREL algorithm share a basic procedure that coincides with the k-means algorithm at steps 1, 2, 3 and 4. Step 5 differs, because all observations of test trials lying inside the sphere are selected according to (3), as does the method for determining the convergence of the objective function (2).
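The basic sphere-moving procedure underlying all FOREL variants can be sketched as follows. This is our own simplified illustration for a single fixed search radius, not the authors' modified FOREL2 with the greedy heuristic for choosing R:

```python
import numpy as np

def forel_clusters(points, radius, tol=1e-6):
    """Basic FOREL step: repeatedly centre a sphere of the given radius on a
    local concentration of points and carve out the points inside it."""
    unassigned = list(range(len(points)))
    labels = np.full(len(points), -1)
    cluster = 0
    while unassigned:
        center = points[unassigned[0]]           # seed the sphere at any free point
        while True:
            inside = [i for i in unassigned
                      if np.linalg.norm(points[i] - center) <= radius]
            new_center = points[inside].mean(axis=0)  # move sphere to local centroid
            if np.linalg.norm(new_center - center) < tol:
                break                            # sphere has stopped moving
            center = new_center
        for i in inside:                         # points inside form one cluster
            labels[i] = cluster
        unassigned = [i for i in unassigned if labels[i] == -1]
        cluster += 1
    return labels
```

Running this for different radii changes the resulting number of clusters, which is exactly the degree of freedom the greedy heuristic for R exploits.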
Research Questions
The following questions were posed during the study:
- How to choose the search radius for local concentrations in the FOREL2 clustering algorithm with an initial number of clusters?
- How to improve the accuracy and stability of the result in problems such as identifying homogeneous batches of industrial products?
Purpose of the Study
Answering these questions should allow us to achieve the best accuracy, according to the obtained Rand index, with a larger value of the objective function in comparison with the k-means algorithm.
At the same time, compared to the classical implementation of the k-means algorithm, the use of greedy heuristics in the FOREL2 algorithm shows greater efficiency.
Research Methods
Orlov and Fedosov (2016) used data from tests conducted at a test centre in order to isolate presumably homogeneous batches with some accuracy. For two types of integrated circuits (ICs), 140UD25A and 140UD26A, belonging to different homogeneous batches, the initial data are a combination of parameters and consist of 18 batches (9 batches of the first and second IC types, respectively). Products (devices) in each batch are described by 18 input measured parameters. For the first IC type, the data vectors of batches 1, 3, 8 and 9 were selected, in the amount of n = 807. For the second IC type, the data vectors of batches 4, 5, 7 and 9 were selected, in the amount of n = 532. The data set was then decimated by discarding every 2nd, 3rd and 4th data point of the test results, giving n = 201 and n = 132 for the first and second IC types, respectively. In the work of Ahmatshin and Kazakovtsev (2020), the selected test data were normalized to the standard deviation and to the allowable values of the parameters.
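Normalization to the standard deviation, one of the two normalizations mentioned above, is the usual z-score transformation; a minimal sketch (our own code, not the authors' implementation) is:

```python
import numpy as np

def zscore_normalize(X):
    """Normalize each measured parameter (column) to zero mean and unit
    standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # constant parameters: avoid division by zero
    return (X - mean) / std
```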
The FOREL2 algorithm was compared with the k-means algorithm on a system with 8 GB of RAM, an Intel Core i3-3220 processor with a frequency of 3.3 GHz, and an Ubuntu 18.04.6 LTS operating system. We conducted 30 experiments for each algorithm. All results are presented as averages over 30 runs. Each launch lasted 1, 10 and 60 seconds.
The average clustering results and the running speed of the algorithms are given in Table 1: the maximum (max), minimum (min), median (median), average (avr) and standard deviation (st.dev) for the Rand index and the objective function. For the objective function, the coefficient of variation (var) and the span factor (spn) are also calculated.
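The Rand index used to compare the algorithms measures the fraction of point pairs on which two labelings agree; it can be computed directly from its definition (our own sketch, not the paper's code):

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index: share of point pairs that are grouped consistently
    (together in both labelings, or apart in both)."""
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total
```

A value of 1.0 means the two partitions are identical up to cluster relabeling.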
Findings
The approach using the FOREL2 algorithm with a greedy heuristic for choosing the search radius for local concentrations demonstrated the best accuracy according to the obtained Rand index, with a larger value of the objective function, in comparison with the k-means algorithm.
For solving the problem of clustering electronic radio products based on test data, the accuracy and speed of the software implementation of the algorithm are quite acceptable. Compared to the classical implementation of the k-means algorithm, the use of greedy heuristics in the FOREL2 algorithm showed greater efficiency.
Conclusion
When solving clustering problems, the FOREL2 algorithm often gives a result that is very far from the optimal solution. In this research, we aimed to develop not only a fast but also a highly accurate algorithm for clustering electronic radio products based on test data, using the FOREL2 algorithm with greedy heuristics for selecting the radius of local concentrations.
Computational experiments show that the use of greedy heuristics for selecting the radius of local concentrations in the FOREL2 algorithm improves the accuracy and speed of obtaining results when identifying electronic radio products. Moreover, the FOREL2 algorithm can show better results than the k-means algorithm that uses medoids as intermediate centres of spheres.
Acknowledgments
The work was supported by the Grant of the President of the Russian Federation for state support of Leading Scientific Schools, NSh-421.2022.4.
References
Ahmatshin, F. G., & Kazakovtsev, L. A. (2020). Impact of data normalization methods and clustering model in the problem of automatic grouping of industrial products. Journal of Physics: Conference Series, 1679(3), 032085.
Anderberg, M. R. (1973). Cluster Analysis for Applications. Academic Press.
Bobrowski, L., & Bezdek, J. C. (1991). c-means clustering with the l_1 and l_infinity norms. IEEE Transactions on Systems, Man, and Cybernetics, 21(3), 545-554. https://doi.org/10.1109/21.97475
Hossain, M. Z., Akhtar, M. N., Ahmad, R. B., & Rahman, M. (2019). A dynamic K-means clustering for data mining. Indonesian Journal of Electrical Engineering and Computer Science, 13(2), 521-526.
Kazakovtsev, L. A., Antamoshkin, A. N., & Masich, I. S. (2015). Fast deterministic algorithm for EEE components classification. IOP Conference Series: Materials Science and Engineering, 94, 012015.
Klosgen, W., & Zytkow, J. M. (1996). Knowledge discovery in databases terminology. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining (pp. 573-592). AAAI Press/The MIT Press.
Li, Y., & Wu, H. (2012). A clustering method based on K-means algorithm. Physics Procedia, 25, 1104-1109.
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In 5th Berkeley Symposium on Mathematical Statistics and Probability (pp. 281-297). University of California Press.
Milligan, G. W., & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2), 159-179.
Na, S., Xumin, L., & Yong, G. (2010). Research on k-means clustering algorithm: An improved k-means clustering algorithm. In 2010 Third International Symposium on Intelligent Information Technology and Security Informatics (pp. 63-67). IEEE.
Orlov, V. I., & Fedosov, V. V. (2016). ERC clustering dataset. http://levk.info/data1526.zip
Pérez-Ortega, J., Almanza-Ortega, N. N., & Romero, D. (2018). Balancing effort and benefit of K-means clustering algorithms in Big Data realms. PLOS ONE, 13(9), e0201874.
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G. J., Ng, A., Liu, B., Yu, P. S., Zhou, Z.-H., Steinbach, M., Hand, D. J., & Steinberg, D. (2008). Top 10 algorithms in data mining. Knowledge and Information Systems, 14(1), 1-37.
Zagoruiko, N. G. (1999). Applied Methods of Data and Knowledge Analysis. Institute of Mathematics SD RAS.
Copyright information
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License
About this article
Publication Date
27 February 2023
eBook ISBN
978-1-80296-960-3
Publisher
European Publisher
Volume
1
Edition Number
1st Edition
Pages
1-403
Subjects
Hybrid methods, modeling and optimization, complex systems, mathematical models, data mining, computational intelligence
Cite this article as:
Ahmatshin, F. G., & Kazakovtsev, L. A. (2023). Greedy Heuristics for the Choice of the Radius of Local Concentrations in FOREL2 Algorithm. In P. Stanimirovic, A. A. Stupina, E. Semenkin, & I. V. Kovalev (Eds.), Hybrid Methods of Modeling and Optimization in Complex Systems, vol 1. European Proceedings of Computers and Technology (pp. 366-371). European Publisher. https://doi.org/10.15405/epct.23021.45