Abstract
The paper addresses a practically important problem related to the training of artificial neural networks. An approach to its solution based on the idea of random search is proposed. An original training algorithm implementing Boltzmann annealing has been developed, and its convergence in probability to the global optimum has been proved. It is also shown that the proposed algorithm can be easily modified to train any artificial neural network, so it holds good promise for solving applied problems with neural network technologies in general. Experimental studies were carried out in which, using the problem of compressing color raster images as an example, the proposed algorithm was compared with the well-known adaptive moment algorithm, one of the best gradient methods for training neural networks. Image compression was performed using an ensemble of n Gauss-Bernoulli restricted Boltzmann machines. The use of an ensemble of n machines, combined with a specially developed parallelization procedure, made it possible to reduce the computational complexity of the training process and increase the speed of the proposed algorithm. The experiments showed that the proposed approach is not inferior to gradient methods in terms of speed. Moreover, the developed training algorithm turned out to be more than twice as effective as the adaptive moment algorithm in terms of the quality of the solution obtained.
Keywords: Annealing method, gradient descent method, training, neural networks, restricted Boltzmann machine
Introduction
Modern society is moving from the post-industrial to the informational stage of its development. Huge arrays of numerical data are generated daily, which stimulates the development of information and computer technologies (Chen et al., 2022). Recently, neural network technology for data processing has become widespread. The technology is based on a flexible mathematical model, the artificial neural network, with the help of which a wide range of applied problems is solved. To tune a neural network to a specific subject area, it is necessary to train it. Training is a typical optimization problem: an objective function describing the solution quality is given, together with the data on which the optimal value of that function must be achieved. At the initial stage of neural network development, the gradient approach was, as a rule, used for training. This approach became widespread due to the high convergence rate of the methods that implement it (Nakamura et al., 2021). However, with the development of computer technology the situation has changed dramatically, and convergence rate has ceased to be the determining factor. This, in turn, has made it possible to develop other approaches to training.
The paper considers an alternative approach to training neural networks based on the idea of random search. An algorithm implementing one of the variants of the annealing method is proposed, and its efficiency is studied using the example of the color image compression problem.
Problem Statement
Consider the problem of color image compression, which can be described as follows.
Let a bitmap color image be given by a set of N color pixels (each containing red, green and blue components). Each color component takes a value from 0 to 255, that is, it is an 8-bit number. A single pixel can therefore be represented as a Cartesian product of three 8-bit sets, and an image as a Cartesian product of 3N 8-bit sets. As a result of image compression, a bit vector of, in a certain sense, minimum dimension should be constructed.
The problem of color images compression can be formally written as follows:
Let a set of color images X be given. It is necessary to construct an algorithm A such that:
$\left\{\begin{array}{l}A:\{0,\dots ,255\}^{3N}\to \{0,1\}^{k}\\ {A}^{-1}:\{0,1\}^{k}\to \{0,\dots ,255\}^{3N}\\ k\to \mathrm{min}\\ {\sum}_{x\in X}{\sum}_{i=1}^{3N}{\left({x}_{i}-{y}_{i}\right)}^{2}\to \mathrm{min},\end{array}\right.$
where $x$ is the original image; $y$ is the image restored from the compressed one; $N$ is the number of pixels in the image.
The inverse mapping ${A}^{-1}$ may not be unique, but it is required to compress images as much as possible ($k\to \mathrm{min}$), while the squared error over the whole image set must not exceed a manually set threshold.
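The error constraint of the problem statement can be checked directly once an encoder-decoder pair is available. A minimal sketch, where `encode` and `decode` are placeholder callables standing in for $A$ and $A^{-1}$, not the paper's actual compressor:

```python
import numpy as np

def reconstruction_error(images, encode, decode):
    """Total squared error over an image set.

    Each image is a vector of 3N components in {0, ..., 255};
    encode maps it to a bit vector, decode maps the bits back.
    """
    total = 0.0
    for x in images:
        y = decode(encode(x))  # round-trip through the compressed form
        total += float(np.sum((x.astype(np.float64) - y.astype(np.float64)) ** 2))
    return total
```

An identity pair gives zero error; any lossy pair can then be compared against the manually chosen threshold.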
Lower-dimensional representations can improve performance on many tasks, such as image compression, reconstruction and clustering (Liu et al., 2019). Color image compression occurs in many information retrieval problems (Dewi et al., 2021; Khanna et al., 2019; Knop et al., 2016). In such tasks, the restricted Boltzmann machine (RBM) is often used, which makes it possible to extract informative features that are later used for image classification.
Research Questions
In the course of the study, the following questions were raised:
 Is it possible to build an efficient RBM training algorithm implementing the annealing method?
 How efficient is an algorithm based on the annealing method compared with the gradient descent algorithm?
Purpose of the Study
The answers to the questions raised above will help achieve the goal and contribute in the future to the development of neural network training algorithms, which should increase solution quality in a wide range of applied problems.
Research Methods
The RBM belongs to the class of recurrent neural network models (architectures). The parameters defining the properties of the connections between the neurons of the layers are called weights, and the parameters of the neurons are called their characteristics. The characteristics depend on the type of distribution that the network generates. For Gauss-Bernoulli RBMs, often used for data compression, neurons have three characteristics: the biases $b$ and $c$ are set for the visible and hidden layers, respectively, as well as the dispersion parameters σ of the neurons. As a result, the RBM architecture can be described by four types of parameters: the set of weights $W$, the biases $b$ and $c$ of the visible and hidden layers, and the variances σ of the neurons in the visible layer. That is, an RBM can be formally described as a parametric family $(W, b, c, σ)$. By fixing certain elements in each of the sets, different subtypes of architectures can be specified.
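The four parameter groups can be sketched as follows, using the conventional RBM symbols (the dictionary layout, shapes and initialization scheme are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_rbm(n_visible: int, n_hidden: int) -> dict:
    """Four parameter groups of a Gauss-Bernoulli RBM (illustrative)."""
    return {
        "W": rng.normal(0.0, 0.01, size=(n_visible, n_hidden)),  # weights
        "b": np.zeros(n_visible),     # visible-layer biases
        "c": np.zeros(n_hidden),      # hidden-layer biases
        "sigma": np.ones(n_visible),  # visible-unit dispersions
    }

rbm = init_rbm(12, 4)
# Total parameter count: n1*n2 + 2*n1 + n2, as in the next paragraph.
n_params = sum(p.size for p in rbm.values())
```

Fixing some of these groups (for example, freezing `sigma` at 1) yields the architecture subtypes mentioned above.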
Consider an RBM containing ${n}_{1}$ neurons in the input layer and ${n}_{2}$ neurons in the hidden layer. Then arbitrary parameter values $x=({x}_{1},{x}_{2},\dots ,{x}_{{n}_{1}{n}_{2}+2{n}_{1}+{n}_{2}})$, ${x}_{i}\in \mathfrak{R}$, $i=1,\dots ,{n}_{1}{n}_{2}+2{n}_{1}+{n}_{2}$, define a specific instance of a neural network of this type.
It is proposed to use an approach based on the idea of a random search to train the RBM. The optimization problem in this case can be formulated as follows.
Let Ω be the set of feasible solutions. In the case of neural networks of the type described above, it can be represented as a Cartesian product of subsets of admissible parameter values for each group of network parameters ${\mathrm{\Omega}}_{j}$ (the Gauss-Bernoulli RBM has four groups of parameters). In this case, for each group of parameters, the set of feasible values is determined by the formula

${\mathrm{\Omega}}_{j}={\left[{l}_{j},{u}_{j}\right]}^{{s}_{j}},$

where ${s}_{j}$ is the number of parameters in the group; ${u}_{j}$, ${l}_{j}$ are the upper and lower bounds of the values in the group.
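Keeping candidate solutions inside these per-group boxes can be done by clipping each group to its bounds. A sketch with hypothetical group names and bounds:

```python
import numpy as np

def project_to_feasible(params: dict, low: dict, up: dict) -> dict:
    """Clip every parameter group to its admissible box of values."""
    return {name: np.clip(p, low[name], up[name]) for name, p in params.items()}

# Example: keep weights in [-1, 1] and dispersions in [0.01, 10].
params = {"W": np.array([-2.0, 0.5, 3.0]), "sigma": np.array([0.0, 1.0])}
low = {"W": -1.0, "sigma": 0.01}
up = {"W": 1.0, "sigma": 10.0}
feasible = project_to_feasible(params, low, up)
```

Because every group stays inside its box, any candidate produced this way remains a feasible solution.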
Suppose that the objective function $F$ is defined on the set of feasible solutions Ω, and for each element $x\in \mathrm{\Omega}$ there is a set of neighboring elements $N\left(x\right)\subset \mathrm{\Omega}$. Then the optimization problem can be formally specified as a triple (Ω, $F$, $N$). The set of neighboring elements determines the optimization algorithm; almost all of random search theory is built on several restrictions imposed on this set.
Let us consider the possibility of solving the problem described above using Boltzmann annealing.
For this variant of the annealing method, a temperature sequence ${T}_{0},{T}_{1},{T}_{2},\dots$ is specified, whose elements are related by:
${T}_{k}=\frac{{T}_{\mathrm{0}}}{\mathrm{l}\mathrm{n}(k+\mathrm{2})},k>\mathrm{0}$ (1)
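Schedule (1) is easy to tabulate; note how slowly the logarithm drives the temperature down, which is exactly what the convergence theory of Boltzmann annealing relies on:

```python
import math

def boltzmann_temperature(t0: float, k: int) -> float:
    """Cooling schedule (1): T_k = T_0 / ln(k + 2)."""
    return t0 / math.log(k + 2)

# Even after a million iterations the temperature has dropped only
# about twenty-fold relative to its starting value:
temps = [boltzmann_temperature(10.0, k) for k in (0, 10, 1000, 10**6)]
```

The sequence decreases monotonically to zero but never faster than $1/\ln k$, which is the price paid for the probabilistic guarantee discussed below.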
Using these values, the probability of a transition from the current solution $x$ to a new solution $y$ is determined by the formula:
$P(y\mid x)=\mathrm{min}\left(1,\mathrm{exp}\left(\frac{F\left(x\right)-F\left(y\right)}{{T}_{k}}\right)\right)$ (2)
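The acceptance rule (2) for a minimization problem can be sketched as follows (function names are illustrative): improving moves are always accepted, worsening moves with an exponentially decaying probability.

```python
import math
import random

def acceptance_probability(f_x: float, f_y: float, t_k: float) -> float:
    """Probability (2) of moving from x to y when minimizing F."""
    if f_y <= f_x:                      # improving moves are always taken
        return 1.0
    return math.exp((f_x - f_y) / t_k)  # worse moves decay with temperature

def accept(f_x: float, f_y: float, t_k: float) -> bool:
    """Draw the transition decision according to probability (2)."""
    return random.random() < acceptance_probability(f_x, f_y, t_k)
```

At high temperature almost any candidate is accepted; as ${T}_{k}$ falls, the rule increasingly behaves like a greedy descent.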
An algorithm that implements Boltzmann annealing (Kirkpatrick et al., 1983) is proposed below.
Parameters initialization.
Step 0. Setting initial values for the problem parameters $x=({x}_{1},\dots ,{x}_{{n}_{1}{n}_{2}+2{n}_{1}+{n}_{2}})$ and the temperature ${T}_{0}$.
General $k$-th iteration.
Step 1. Four random variables ${n}_{1},{n}_{2},{n}_{3},{n}_{4}$ are generated according to the formula

${n}_{i}=\lfloor R\left[0;{m}_{i}\right]\rfloor ,\ i=\overline{1,4}$

where $R[a;b]$ is a realization of a uniformly distributed random variable on the segment $[a;b]$, $a,b\in \mathfrak{R}$; ${m}_{i}$ is the number of parameters in group $i$. The values ${n}_{1},{n}_{2},{n}_{3},{n}_{4}$ are the numbers of parameters to change in each group.
Step 2. Random permutations ${p}_{1},{p}_{2},{p}_{3},{p}_{4}$ of lengths ${m}_{1},{m}_{2},{m}_{3},{m}_{4}$, respectively, are generated; their first ${n}_{1},{n}_{2},{n}_{3},{n}_{4}$ elements specify the indexes of the parameters to be changed in each of the parameter groups.
Let, for example, ${J}_{1}=\{{j}_{11},{j}_{12},\dots ,{j}_{1{n}_{1}}\}$, ${J}_{2}=\{{j}_{21},{j}_{22},\dots ,{j}_{2{n}_{2}}\}$, ${J}_{3}=\{{j}_{31},{j}_{32},\dots ,{j}_{3{n}_{3}}\}$, ${J}_{4}=\{{j}_{41},{j}_{42},\dots ,{j}_{4{n}_{4}}\}$, ${J}_{1},{J}_{2},{J}_{3},{J}_{4}\subseteq \mathrm{\Omega}$ be the sets of parameters to be changed.
Step 3. A new solution $y=({y}_{1},{y}_{2},\dots ,{y}_{{n}_{1}{n}_{2}+2{n}_{1}+{n}_{2}})$ is generated according to the formula:

${y}_{k}=\left\{\begin{array}{l}{x}_{k},\ k\notin \left({J}_{1}\cup {J}_{2}\cup {J}_{3}\cup {J}_{4}\right)\\ {x}_{k}+{a}_{ik},\ k\in {J}_{i},\ i=\overline{1,4}\\ {a}_{ik}=R\left[-{l}_{i};{l}_{i}\right],\ i=\overline{1,4}\\ k=\overline{1,{n}_{1}{n}_{2}+2{n}_{1}+{n}_{2}}\end{array}\right.$
where ${l}_{1},{l}_{2},{l}_{3},{l}_{4}$ are algorithm parameters that determine the size of the set of neighboring elements. This size is critical: it influences both the convergence speed and the final solution quality.
Step 4. Calculating the objective function value $F(y)$.
Step 5. The new solution is accepted with the probability given by (2).
Step 6. Checking the stopping criterion. If the time for training has expired, the algorithm terminates; otherwise the value of $k$ is increased by one and the algorithm goes to Step 1.
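Steps 0 through 6 can be sketched as a single loop. The `objective` and `neighbor` callables below are problem-specific placeholders standing in for $F$ and the neighbor-generation of Steps 1-3; they are not the paper's RBM code.

```python
import math
import random

def anneal(x0, objective, neighbor, t0, max_iters):
    """Sketch of the Boltzmann annealing training loop (Steps 0-6)."""
    x, f_x = x0, objective(x0)          # Step 0: initial solution
    best, f_best = x, f_x
    for k in range(max_iters):
        t_k = t0 / math.log(k + 2)      # temperature schedule (1)
        y = neighbor(x)                 # Steps 1-3: random neighbor
        f_y = objective(y)              # Step 4: evaluate candidate
        # Step 5: accept with the Metropolis probability (2)
        if f_y <= f_x or random.random() < math.exp((f_x - f_y) / t_k):
            x, f_x = y, f_y
        if f_x < f_best:                # remember the best solution seen
            best, f_best = x, f_x
    return best, f_best
```

On a toy one-dimensional objective the loop drifts steadily toward the minimum while occasionally accepting worse moves at higher temperatures, which is what lets it escape local minima.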
It is easy to show that the developed training algorithm is correct in the sense that it never makes transitions to infeasible solutions.
An important characteristic of this type of algorithm is the convergence property. For this variant of the annealing method, the following convergence theorem is known.
Theorem (Hajek, 1988).
1. For any element $x$ that is not a local minimum,

$\lim_{k\to +\infty}P\left({x}_{k}=x\right)=0$
2. If $B$ is the set of local minima of depth $d$, then for any $x$ from $B$

$\lim_{k\to +\infty}P\left({x}_{k}=x\right)=0$

if and only if

$\sum_{k=1}^{+\infty}\mathrm{exp}\left(-d/{T}_{k}\right)=+\infty$
3. Let ${\mathrm{\Omega}}^{*}$ be the set of global minima and ${d}^{*}$ the maximum of the depths of local minima that do not coincide with any of the global ones. Then

$\lim_{k\to +\infty}P\left({x}_{k}\in {\mathrm{\Omega}}^{*}\right)=1$

if and only if

$\sum_{k=1}^{+\infty}\mathrm{exp}\left(-{d}^{*}/{T}_{k}\right)=+\infty$ (3)
According to the (Hajek, 1988) theorem, the algorithm must satisfy the following conditions, which were formulated in the form of statements and proved in (Krasnoproshin & Matskevich, 2022):
1. The problem (Ω, $F$, $N$) is irreducible and has the weak reversibility property.
2. The temperature sequence (1) decreases monotonically to zero and satisfies the constraints (3).
The proposed training algorithm can be easily modified for neural networks of any architecture. To do this, one needs to determine the number of network parameter groups and set the corresponding algorithm parameters in Step 3.
It should be noted that this algorithm allows the use of parallelization procedures, including those developed earlier to speed up the training of neural networks.
The theoretical guarantee of convergence makes it possible to obtain a global optimum, but when solving an applied problem it is important to know how fast the algorithm converges. Therefore, to study the effectiveness of the proposed approach, experimental studies were carried out using the example of the color image compression problem.
The well-known CIFAR-10 dataset (2020), which contains 60,000 color raster images with a resolution of 32×32 pixels, was chosen as experimental data. Each image contains exactly one object belonging to one of ten classes. In addition to the object, the images contain background, which greatly complicates compression.
For the experimental studies, 8-fold, 16-fold and 32-fold compression ratios were chosen. Higher compression ratios lead to excessive losses, while lower ones have no practical application in neural networks.
An ensemble of RBMs was used as the compression tool. For 8-fold compression, 256 Gauss-Bernoulli RBMs were used; 128 machines were used for 16-fold compression, and 64 machines for 32-fold compression.
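The per-machine dimensions implied by these ensemble sizes can be worked out with simple arithmetic, assuming (hypothetically, since the paper does not spell out the split) that each machine compresses an equal share of the image:

```python
def ensemble_layout(width: int, height: int, n_machines: int, fold: int):
    """Visible/hidden units per machine under an even split (assumption)."""
    bits_total = 24 * width * height             # 24 bits per RGB pixel
    bits_compressed = bits_total // fold         # k in the problem statement
    visible_per_machine = 3 * width * height // n_machines  # Gaussian units
    hidden_per_machine = bits_compressed // n_machines      # Bernoulli units
    return visible_per_machine, hidden_per_machine

# The three configurations from the text, for 32x32 CIFAR-10 images:
layouts = [ensemble_layout(32, 32, m, f)
           for m, f in [(256, 8), (128, 16), (64, 32)]]
```

Under this reading every machine keeps the same 12-bit hidden code, while the visible patch grows with the compression ratio, which also explains why the machines can be trained in parallel.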
One of the best modifications of the gradient method, the adaptive moment algorithm (Hamis et al., 2019), was used for comparison. Many modifications exist for estimating the gradient: CD-N (Li et al., 2021), PCD (Oswin et al., 2018), NPTM (Brugge et al., 2013). To speed up training by the gradient method and achieve maximum quality (by mean square error), the CD-1 algorithm was chosen.
Training and validation were carried out on 4,000 images. The remaining 52,000 images were used as a test set to check the quality of the resulting solution. The quality was measured with the MSE, PSNR, PSNR_HVS and SSIM functionals (Temel & AlRegib, 2019).
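The first two of these quality functionals are straightforward to compute; a minimal sketch for 8-bit images (PSNR_HVS and SSIM require perceptual weighting and are omitted here):

```python
import numpy as np

def mse(x: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error between an image and its reconstruction."""
    return float(np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2))

def psnr(x: np.ndarray, y: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in decibels (higher is better)."""
    return float(10.0 * np.log10(peak ** 2 / mse(x, y)))

x = np.zeros((2, 2, 3), dtype=np.uint8)
y = np.full((2, 2, 3), 10, dtype=np.uint8)
assert mse(x, y) == 100.0  # every component differs by 10
```

Lower MSE and higher PSNR both indicate a more faithful reconstruction, which matches the direction of the comparisons reported in the Findings.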
The experiments were carried out on a computer running Lubuntu 20.04, with a 4-core Intel i7-4770K CPU, 16 GB of 1600 MHz RAM, and an NVIDIA RTX 3070 GPU with 5888 cores. The training time was measured using the gettimeofday function.
The results are presented in Table 1.
Findings
Based on the experimental results, the following conclusion can be drawn. Thanks to the use of a special parallelization procedure, it was possible to achieve parity in performance with gradient methods. In terms of the MSE functional (the lower the value, the better), the proposed algorithm outperformed its opponent by more than a factor of two; in terms of the PSNR and PSNR_HVS functionals (the higher the value, the better), by about 50%; and in terms of the SSIM functional (higher is better), by more than a factor of two.
Conclusion
The paper proposes an approach to training neural networks based on random search and develops an original algorithm that implements Boltzmann annealing. The algorithm convergence in probability to the global optimum is proved, and it is shown that the proposed algorithm can be easily modified to train any artificial neural network.
As part of experimental studies, using the example of the problem of compressing color bitmap images, the proposed algorithm was compared with the adaptive moment algorithm.
As a result of experiments, it was shown that the proposed approach is not inferior to gradient methods in terms of speed. Moreover, according to various metrics, the developed algorithm turned out to be more efficient than the adaptive moment algorithm in terms of the quality of the solution obtained.
Thus, the proposed training approach, based on the idea of random search, has good prospects for solving applied problems using neural network technologies in general.
Acknowledgments
This work was supported by the Ministry of Science and Higher Education of the Russian Federation (Grant № 075-15-2022-1121).
References
Brugge, K., Fischer, A., & Igel, C. (2013). The flip-the-state transition operator for restricted Boltzmann machines. Machine Learning, 93(1), 53-69.
Chen, H., He, X., Yang, H., Qing, L., & Teng, Q. (2022). A Feature-Enriched Deep Convolutional Neural Network for JPEG Image Compression Artifacts Reduction and its Applications. IEEE Transactions on Neural Networks and Learning Systems, 33(1), 430-444.
CIFAR-10 dataset. (2020). Retrieved on 04 March 2022, from https://www.cs.toronto.edu/~kriz/cifar.html
Dewi, C., Chen, R.-C., Hendry, & Hung, H.-T. (2021). Experiment Improvement of Restricted Boltzmann Machine Methods for Image Classification. Vietnam Journal of Computer Science, 08(03), 417-432. https://doi.org/10.1142/s2196888821500184
Hajek, B. (1988). Cooling Schedules for Optimal Annealing. Mathematics of Operations Research, 13(2), 311-329.
Hamis, S., Zaharia, T., & Rousseau, O. (2019, June). Image compression at very low bitrate based on deep learned super-resolution. In 2019 IEEE 23rd International Symposium on Consumer Technologies (ISCT) (pp. 128-133). IEEE.
Khanna, M. T., Ralekar, C., Goel, A., Chaudhury, S., & Lall, B. (2019). Memorability-based image compression. IET Image Processing, 13, 1490-1501.
Kirkpatrick, S., Gelatt, C. D., Jr., & Vecchi, M. P. (1983). Optimization by Simulated Annealing. Science, 220(4598), 671-680.
Knop, M., Kapurscirnski, T., Mleczko, W. K., & Angryk, R. (2016). Neural Video Compression Based on RBM Scene Change Detection Algorithm. Artificial Intelligence and Soft Computing, Springer, 660-669.
Krasnoproshin, V. V., & Matskevich, V. V. (2022). Random search in neural networks training. Proceedings of the 13th International Conference "Computer Data Analysis and Modeling" - CDAM'2022, Minsk, 96-99.
Li, X., Gao, X., & Wang, C. (2021). A Novel Restricted Boltzmann Machine Training Algorithm With Dynamic Tempering Chains. IEEE Access, 9, 21939-21950. https://doi.org/10.1109/access.2020.3043599
Liu, W., Meng, F. Y., Liang, Y. S., Yang, H. X., & Wang, C. W. (2019). Loss Function Optimization Based on Adversarial Networks. In Fuzzy Systems and Data Mining V (pp. 619-634). IOS Press.
Nakamura, K., Derbel, B., Won, K.-J., & Hong, B.-W. (2021). Learning-Rate Annealing Methods for Deep Neural Networks. Electronics, 10(16), 2029. MDPI AG.
Oswin, K., Fischer, A., & Igel, C. (2018). Population-Contrastive-Divergence: Does consistency help with RBM training? Pattern Recognition Letters, 102, 1-7.
Temel, D., & AlRegib, G. (2019). Perceptual image quality assessment through spectral analysis of error representations. Signal Processing: Image Communication, 70, 37-46.
Copyright information
This work is licensed under a Creative Commons AttributionNonCommercialNoDerivatives 4.0 International License
About this article
Publication Date: 27 February 2023
eBook ISBN: 978-1-80296-960-3
Publisher: European Publisher
Volume: 1
Edition: 1st Edition
Pages: 1403
Subjects: Hybrid methods, modeling and optimization, complex systems, mathematical models, data mining, computational intelligence
Cite this article as:
Matskevich, V. V., & Stasiuk, V. A. (2023). An Efficient Training Algorithm of Restricted Boltzmann Machines. In P. Stanimorovic, A. A. Stupina, E. Semenkin, & I. V. Kovalev (Eds.), Hybrid Methods of Modeling and Optimization in Complex Systems, vol 1. European Proceedings of Computers and Technology (pp. 296303). European Publisher. https://doi.org/10.15405/epct.23021.36