Software For Sense Compatibility Analysis Of Educational Texts


One area of innovative development in pedagogy is the search for ways to make educational texts more accessible as students' reading competence declines. Overcoming the fragmented perception of educational texts requires drawing on the results of modern research across several sciences. A preliminary analysis showed that the perception and assimilation of information can be made more effective by engaging the unconscious component of the psyche, fixing attention on the most frequently repeated terms and concepts. One solution is to develop a mathematical model and a software product for didactic analysis of the compatibility of educational information by defining an expanded thesaurus, which, for educational information, means the terminology that serves as the main carrier of its meanings. The software was developed on the Django framework; the mathematical model is implemented in Python 3.7 using the scikit-learn library. The tool is relevant in the context of lifelong education for analyzing and processing educational texts that were written without regard to their semantic compatibility or to the patterns by which educational information is perceived. A gradual increment of the thesaurus, kept within the necessary level of perception, should help offset negative trends in the quality of students' perception of information.

Keywords: Educational information, subject thesaurus, semantic compatibility of educational texts, software, mathematical model


Modern challenges dictate the need for the formation of didactic analysis tools for the sources of educational information, which include the classic traditional textbook in its paper embodiment, as well as electronic and distance learning resources. If the educational literature for school education has always been under the scrutiny of methodologists, the educational literature of other levels of education is more often required only to match the content with the current level of development of the corresponding field of knowledge. At the same time, disputes about the quality of school textbooks and the ambiguity of approaches to the formation of their substantive part do not cease.

The amount of meaning embedded in educational content can be assessed from different perspectives. For academic disciplines that represent independent branches of knowledge, however, their specificity is reflected most fully in the terminological component. In interdisciplinary study, concepts may be interpreted differently, with terms taking on additional semantic load (Klochkov et al., 2008; Novikov, 2019; Popova, 2012). Mastering terminology in this case requires a smooth transition between levels of complexity (Metcalfe, 2017; LeCun et al., 2015; Wilson et al., 2019 and others). To reduce the influence that the objective decline in reading competence (Feldstein, 2013b) has on the assimilation of material and on the ability to perceive the meanings of large amounts of text, it is necessary to engage the unconscious component of the students' psyche, taking into account modern interdisciplinary trends in its study (Chernigovskaya, 2015; Chernigovskaya et al., 2016; Damasio, 2018; Dean, 2018; Filippova, 2006; Klementovich et al., 2016; Klochkov et al., 2008; Kryukova et al., 2017; Morozov & Spiridonov, 2019; Pervushina & Osetrin, 2017; Verbitskaya, 2019). This becomes possible through the gradual increment of the special thesaurus of disciplines, which can establish continuity, or compatibility, between educational texts. Realizing this possibility requires a tool for analyzing information from a set of educational texts (collections of textbooks, a sequence of sections, disciplines, wordings), which is quite feasible with computer processing methods.

Problem Statement

The main research issues are formed by the following aspects:

  • the need to involve the unconscious component of the psyche in the process of perceiving educational information;

  • the choice of a method for analyzing the transfer of meanings embedded in an array of educational information;

  • development of software for didactic analysis of semantic compatibility of texts.

The main task in this context is to engage the unconscious component of the psyche in the perception of information through a systematic increment of the thesaurus, which ensures text compatibility. Software built around this task provides a tool for analyzing texts for semantic compatibility and for highlighting the most significant terms by their frequency in the texts, so that the texts can subsequently be corrected.

The need to involve the unconscious component of the psyche in the process of perceiving educational information

The problem of effective perception of educational information has not lost its urgency over the years. It changes under objective social, technical, and biological influences, including new ideas about the process of perception, the role of the unconscious, and the characteristics of the brain, as recorded in the works of outstanding teachers, psychologists, neurobiologists, and representatives of some other sciences (Anokhin & Velichkovsky, 2011; Chernigovskaya, 2010; Chernigovskaya et al., 2016; Damasio, 2018; Dean, 2018; Feldstein, 2013a, etc.).

In the works of Feldstein (2013a, 2013b), a generalization of a large body of research made it possible to identify the main problems in the ability of recent generations of students to perceive and assimilate educational information. The most significant problems include a decline in reading, manifested in pupils' and students' preference for short text forms and in their inability to perceive holistically the common semantic boundaries of larger amounts of information. All this happens against the background of reduced motivation for learning in general and a decline in the authority of teachers, mentors, and parents. The reason may be quite simple: at the current level of information technology, any information is readily available and easy to obtain. In practice, however, this leads to a loss of time at different stages of the development and formation of the child’s brain, when traditional methods of development are replaced by technical devices with a so-called “smart interface”. One negative factor of their use, as shown by Morozova and Novikova (2013), is not only the strain on the organs of vision caused by the specific adjustment of the visual apparatus to the pixel image of screens, but also the involvement of a large number of different brain areas in this process, which causes the child to tire quickly. This does not contribute to the effective, age-appropriate development of the brain, and the missed opportunities further hinder the development of the ability to perceive and process new information. According to Feldstein (2013a), the search for new ways of presenting knowledge that can offset these negative trends is one of the directions of methodological and pedagogical research for the near future.
One such method is to use the patterns of the unconscious component of the psyche in the process of perceiving educational information (Chernigovskaya, 2015; Chernigovskaya et al., 2016; Damasio, 2018; Dean, 2018; Filippova, 2006; Klementovich et al., 2016; Klochkov et al., 2008; Kryukova et al., 2017; Morozov & Spiridonov, 2019; Pervushina & Osetrin, 2017; Schenk, 2012; Verbitskaya, 2019).

The choice of a method for analyzing the transmission of meanings embedded in an array of educational information

For compatibility of educational information, taking into account its semantics, a system-generating contradiction is expressed in contrasting its logical coherence and isolation. Certain fields of knowledge are naturally associated with others, but their fragmentation in the process of cognition leads to some fragmentation of textbooks among themselves in the branches of the corresponding sciences. This determines the objective need to analyze the logical connectivity and sequence of educational information on the one hand, and isolation, on the other ( Klochkov et al., 2019; Rybakova et al., 2015; Rybakova et al., 2017).

The main property of educational information considered in the framework of this work is the semantic compatibility of its individual blocks in the sequence of perception provided for by the curriculum. The composition in this case can be considered as a set of iconic associations of various levels, presented in the form of a finished text; structure - as relations between them ( Klochkov et al., 2019; Novikov, 2019; Rybakova et al., 2015; Rybakova et al., 2017; Tuchkova, 2019).

To determine the amount of information at the level of its semantic content (the semantic level), the thesaurus measure is used (Groot et al., 2016; Lagutina et al., 2016; LeCun et al., 2015; Mai et al., 2017; Mai et al., 2018; Wilson et al., 2019). This characteristic determines semantic properties through the student's ability to accept (perceive and assimilate) the information received (Chernigovskaya et al., 2016; Kiselev, 2018; Popova, 2012). The concept of a thesaurus measure rests on the concept of a thesaurus, understood as the totality of information available to the student or system. As applied to educational information, a thesaurus can be understood as the terminological component, usually expressed as words or word combinations that concentrate the content, which means that information compatibility can be determined up to the level of sentence analysis (Klochkov et al., 2019; Rybakova et al., 2015; Rybakova et al., 2017).

The educational information functions in the communication system between the knowledge carrier ( subject ) and its listener, those for whom it is intended as the ultimate addressee (subject). Information itself ( object ) is the link between the two entities and therefore carries the signs corresponding to its understanding by different parties. The knowledge carrier, laying a certain meaning (content) in the information, transforms its form in accordance with pedagogical goals, the main of which is to convey meaningful content with minimal distortion in the mind of the recipient ( Klochkov et al., 2019; Rybakova et al., 2015; Rybakova et al., 2017).

Development of software for didactic analysis of semantic compatibility of texts

The main task of the software being developed is to obtain an expanded thesaurus and to analyze publications for the compatibility of the information in them. This task requires selecting a number of algorithm parameters and therefore calls for a well-developed interface that supports such experimentation.

In modern conditions, services placed on the Internet are becoming more and more relevant, so the web-model of the application was chosen. Python 3.7 was chosen as the service language for the language model and text analysis. The interface was implemented using the Django ( 2020) framework.

Research Questions

During the development of the educational text processing program, the following questions were raised:

  • choice of language model;

  • what stages of pre-processing are relevant for the problem being solved;

  • selection of a vectorization method for the documents of the corpus that allows the task to be solved;

  • development of an effective algorithm for extracting a thesaurus;

  • selection of parameters for the algorithm for extracting the subject thesaurus.

Purpose of the Study

The solution of the research problems reflected in these questions forms the main goal of the study: the development of a mathematical model and software for determining the extended thesaurus and analyzing documents for the compatibility of the information in them.

Research Methods

To extract the thesaurus, two main algorithms need to be implemented: a stemming algorithm and the TF-IDF algorithm.

Text model and vector weighting methods

As a rule, the subject of a document is well described by the composition of the dictionary used in this document, as well as by the frequencies of words, and not by the semantic links between them. Therefore, for the task of highlighting a thesaurus containing subject vocabulary, models are usually used that work with particular frequency characteristics ( Andreeva & Ushakov, 2019; Borodaschenko et al., 2015; Golitsyna et al., 2016; Kiselev et al., 2018; Metcalfe, 2017; Shenhav, 2017; Tsatsaronis et al., 2009). Models of this kind are conventionally referred to as “bag of words” ( Bondarchuk, 2015).

They describe each analyzed document by a high-dimensional vector containing the weights of the words in the document. To extract the thesaurus, two word-weighting methods were chosen: the frequency of occurrence of words and the TF-IDF statistical measure.

In frequency vectorization, weight is the number of times a word is used in a document normalized to the L2 norm (or Euclidean norm) to smooth out the influence of the length of the document:

$$nw_i = \frac{w_i}{\sqrt{\sum_j w_j^2}},$$

where $w_i$ is the frequency of the $i$-th word and $nw_i$ is its normalized weight.
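The L2-normalized frequency weighting can be illustrated with a minimal sketch in plain Python. The function name `l2_normalized_frequencies` and the toy token list are illustrative assumptions, not part of the described software, which relies on scikit-learn.

```python
import math
from collections import Counter


def l2_normalized_frequencies(tokens):
    """Return {word: count / sqrt(sum of squared counts)} for one document."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # empty document: nothing to weight
    return {word: c / norm for word, c in counts.items()}


# Hypothetical toy document, already tokenized and normalized.
doc = ["market", "price", "market", "demand", "market", "price"]
weights = l2_normalized_frequencies(doc)
```

After normalization the squared weights of a document sum to one, so documents of different lengths become comparable.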

The TF-IDF statistical measure is used to analyze the significance of a word in the context of the information contained in a text document that is part of a text array (collection, set of training texts) (Borodaschenko et al., 2015; Golitsyna et al., 2016; Kiselev et al., 2018; Tsatsaronis et al., 2009).

$$\mathrm{TF}_{w,d} = \frac{\mathrm{WordCount}(w, d)}{\mathrm{Length}(d)},$$

$$\mathrm{IDF}_{w,c} = \frac{\mathrm{Size}(c)}{\mathrm{DocCount}(w, c)},$$

$$\mathrm{TF\_IDF}(w, d, c) = \mathrm{TF}_{w,d} \times \mathrm{IDF}_{w,c},$$

where: WordCount(w, d) is the frequency of word w in document d; Length(d) is the length of document d; Size(c) is the size of corpus c; DocCount(w, c) is the number of documents in c that contain word w.

Words that occur frequently in a given document and at the same time rarely in the other documents of the collection (Borodaschenko et al., 2015; Golitsyna et al., 2016; Kiselev et al., 2018; Tsatsaronis et al., 2009 and others) receive a high TF-IDF weight, which makes it possible to balance frequency against information content.
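The TF-IDF weighting can be sketched in plain Python as follows. The sketch follows the ratio form of IDF given in the text and assumes that the length of a document is its number of tokens and that the size of the corpus is its number of documents. Many library implementations, including scikit-learn's `TfidfVectorizer`, use a logarithmic IDF instead, so the numbers below need not match what the described software computes. The function name `tf_idf` and the toy corpus are hypothetical.

```python
from collections import Counter


def tf_idf(corpus):
    """Weight every word of every document as TF * IDF, with
    TF = WordCount(w, d) / Length(d) and IDF = Size(c) / DocCount(w, c)."""
    size = len(corpus)  # assumption: corpus size = number of documents
    doc_count = Counter()
    for doc in corpus:
        doc_count.update(set(doc))  # count each word once per document
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        length = len(doc)  # assumption: document length = number of tokens
        weights.append(
            {w: (n / length) * (size / doc_count[w]) for w, n in counts.items()}
        )
    return weights


# Hypothetical toy corpus of three tokenized documents.
corpus = [
    ["market", "price", "market", "demand"],
    ["price", "law", "state", "law"],
    ["state", "price", "society", "law"],
]
weights = tf_idf(corpus)
```

In the first toy document, "market" occurs twice and appears in no other document, so it outweighs "price", which occurs in all three documents.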

Approaches to the formation of a subject thesaurus

The preliminary stage when using the “bag of words” model is tokenization (extraction of tokens, that is, words) and normalization (reduction of words to their normal form), which significantly reduces the dimension of the vectors. In this work, normalization is performed using a lemmatizer, which is justified for the Russian language even for more complex models (Kutuzov & Kuzmenko, 2019).

As expected, frequency vectorization followed by sorting the words by frequency yields a distribution that follows Zipf's law, a probability distribution describing the relationship between the frequency of an event and the number of events with such a frequency (Moreno-Sánchez et al., 2016; Piantadosi, 2014; Qiu et al., 2017; Yatsko, 2015; Zipf, 1936).

The most frequent words in the corpus are usually the least informative and rarely useful for any word processing tasks (Andreeva & Ushakov, 2019; Lagutina et al., 2019). Low-frequency words, in turn, are highly informative but unreliable as factors in decision-making. Thus, to extract the subject thesaurus, it is necessary to take the middle part of the “rank-frequency” distribution.
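Taking the middle part of the rank-frequency distribution can be sketched as below. Expressing the thresholds as numbers of ranks to skip at each end is an assumption made for illustration; the function name `middle_band` and the toy tokens are hypothetical.

```python
from collections import Counter


def middle_band(tokens, skip_top, skip_bottom):
    """Rank words by descending frequency and drop the skip_top most
    frequent and the skip_bottom least frequent ones."""
    counts = Counter(tokens)
    # Ties are broken alphabetically so that the ranking is deterministic.
    ranked = [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    end = len(ranked) - skip_bottom
    return ranked[skip_top:end] if end > skip_top else []


# Hypothetical toy document: one very frequent word, one rare word.
tokens = ["text"] * 5 + ["market"] * 3 + ["price"] * 3 + ["demand"] * 2 + ["tariff"]
band = middle_band(tokens, skip_top=1, skip_bottom=1)
```

Here the most frequent word ("text") and the rarest word ("tariff") are cut off, and the remaining words form the candidate thesaurus.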

Using the TF-IDF measure gives a different distribution: the most frequent words appear in the tail of the “TF-IDF” distribution. To select the thesaurus, one needs to select the words with the highest weights (Golitsyna et al., 2016; Kiselev et al., 2018; Tsatsaronis et al., 2009).

This work proposes combining these two methods of obtaining the subject thesaurus, that is, selecting words on the basis of a weighted average of the two assessments, after which the increment of information and text compatibility can be analyzed.
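One possible reading of the weighted-average combination is sketched below. The text does not specify how the two assessments are scaled or weighted, so everything here is an assumption: each score set is divided by its maximum, the mixing weight `alpha` and the cut-off `top_k` are hypothetical parameters, and the input scores are invented toy values.

```python
def combined_thesaurus(band_scores, tfidf_scores, alpha=0.5, top_k=3):
    """Rank words by alpha * scaled band score + (1 - alpha) * scaled TF-IDF."""

    def scaled(scores):
        # Divide by the largest score so both methods share the range [0, 1].
        top = max(scores.values(), default=0.0)
        return {w: (s / top if top else 0.0) for w, s in scores.items()}

    f, t = scaled(band_scores), scaled(tfidf_scores)
    combined = {
        w: alpha * f.get(w, 0.0) + (1 - alpha) * t.get(w, 0.0)
        for w in set(f) | set(t)
    }
    ranked = sorted(combined, key=lambda w: (-combined[w], w))
    return ranked[:top_k]


# Hypothetical scores: weights of words kept from the middle frequency band,
# and TF-IDF weights of the same document.
band_scores = {"market": 0.6, "price": 0.4, "demand": 0.2}
tfidf_scores = {"market": 1.5, "demand": 0.9, "tariff": 0.75, "price": 0.25}
thesaurus = combined_thesaurus(band_scores, tfidf_scores)
```

With `alpha=1.0` the result reduces to the frequency-based ranking, and with `alpha=0.0` to the TF-IDF ranking, which makes the parameter convenient for the expert review stage.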

The compatibility of training materials implies that the growth of the thesaurus on transition to a new topic does not exceed a threshold value (Golitsyna et al., 2016). This value is usually set at 20%; that is, the growth of the thesaurus in the next chapter or document should not exceed 15-20% (Metcalfe, 2017; Shenhav et al., 2017; Wilson et al., 2019).
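The threshold check can be sketched as follows. The text does not give a formula for thesaurus growth, so the sketch assumes it is the number of terms new to the reader divided by the size of the thesaurus already mastered. The function names and the toy term sets are hypothetical.

```python
def thesaurus_growth(known_terms, next_terms):
    """Share of new terms in the next text, relative to the known thesaurus."""
    known, nxt = set(known_terms), set(next_terms)
    if not known:
        return float("inf") if nxt else 0.0
    return len(nxt - known) / len(known)


def is_compatible(known_terms, next_terms, threshold=0.20):
    """True when the thesaurus grows by no more than the threshold."""
    return thesaurus_growth(known_terms, next_terms) <= threshold


# Hypothetical thesauri of three chapters.
chapter1 = {"market", "price", "demand", "supply", "money",
            "tax", "state", "law", "society", "labor"}
chapter2 = {"market", "price", "tax", "inflation", "tariff"}
chapter3 = {"law", "norm", "court", "statute", "contract"}

growth_12 = thesaurus_growth(chapter1, chapter2)  # 2 new terms out of 10 known
growth_13 = thesaurus_growth(chapter1, chapter3)  # 4 new terms out of 10 known
```

Under these assumptions the second chapter stays within the 20% threshold, while the third introduces too many new terms at once.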


Software algorithm

In the course of work on the tasks set, the following general software operation algorithm was developed:

  • pre-processing of the document: tokenization (splitting into words); lemmatization (performed using the mystem parser from Yandex); removal of stop words that carry no semantic load;

  • vectorization and weighting using the normalized word frequency, obtaining the “rank - frequency” distribution, and forming the thesaurus as the middle part of that distribution (threshold values are set by the researcher);

  • vectorization and weighting with TF-IDF, obtaining the “rank - TF-IDF” distribution, and forming a thesaurus as the left side of that distribution (the threshold value is set by the researcher);

  • the expanded thesaurus is formed as a weighted average of the two resulting thesauri, and an expert assessment is carried out;

  • an analysis of the subject compatibility with other documents of the corpus is performed.
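The pre-processing step that opens the list above can be sketched in plain Python. The described software performs lemmatization with the mystem parser; this sketch does not call it and substitutes a toy lemma dictionary as a placeholder. The stop-word list, the dictionary, and the function names are all hypothetical.

```python
import re

# Hypothetical stand-ins: a tiny stop-word list and a toy lemma dictionary.
STOP_WORDS = {"the", "of", "and", "a", "in", "is", "are"}
TOY_LEMMAS = {"markets": "market", "prices": "price", "rises": "rise"}


def toy_lemmatize(token):
    """Placeholder for a real lemmatizer; unknown tokens pass through."""
    return TOY_LEMMAS.get(token, token)


def preprocess(text, lemmatize=toy_lemmatize, stop_words=STOP_WORDS):
    """Tokenize, lemmatize, and drop stop words, in that order."""
    # Runs of letters only; the pattern is Unicode-aware, so Cyrillic works too.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    lemmas = [lemmatize(t) for t in tokens]
    return [t for t in lemmas if t not in stop_words]


tokens = preprocess("The prices of goods in the markets are rising.")
```

Passing the lemmatizer as a parameter keeps the rest of the pipeline unchanged when the placeholder is replaced by a real morphological analyzer.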

Software description

The software is built on the Django framework; the mathematical model is implemented in Python 3.7 using the scikit-learn library (2020).

The program interface allows loading a corpus of documents (chapters of one textbook or a set of textbooks on related disciplines).

After that, for each document of the corpus, the subject thesaurus is automatically selected with the specified selection parameters (selection boundaries from the frequency and TF-IDF distributions) and a list of words is formed - the subject thesaurus. At this stage, it is possible to review the thesaurus obtained and perform an expert assessment of the selection.

A separate module performs compatibility analysis of any two documents of the corpus, produces distribution plots, and exports the results to a CSV file for further work and analysis.
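Exporting pairwise compatibility results to CSV might look like the sketch below, which uses only the standard `csv` module. The column names, the growth formula (new terms divided by the size of the source thesaurus), and the toy thesauri are assumptions; the actual export format of the software is not described in the text.

```python
import csv
import io
from itertools import permutations


def compatibility_csv(thesauri):
    """Return CSV text with the thesaurus growth for every ordered pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["from", "to", "new_terms", "growth"])
    for src, dst in permutations(thesauri, 2):
        known, nxt = thesauri[src], thesauri[dst]
        new_terms = sorted(nxt - known)
        growth = len(new_terms) / len(known) if known else 0.0
        writer.writerow([src, dst, " ".join(new_terms), f"{growth:.2f}"])
    return buffer.getvalue()


# Hypothetical thesauri of two documents of the corpus.
thesauri = {
    "chapter1": {"market", "price", "demand", "supply"},
    "chapter2": {"market", "price", "tax"},
}
report = compatibility_csv(thesauri)
```

The ordered pairs matter because moving from the first chapter to the second adds one new term, while the reverse order adds two.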


The software product was developed to analyze the compatibility of information in educational texts and is intended to be used as a tool for didactic analysis, both within a single document (text, file, textbook, manual) and for a combination of them (set, collection), which makes it possible to adjust the presentation of material and the sequence in which sections and disciplines are studied, taking into account the rate of thesaurus increment.

Since a large proportion of educational texts for levels of education other than school is compiled by subject specialists who lack sufficient knowledge of the methodology and methods of compiling textbooks, such a tool helps to address this problem. A smooth increment of the thesaurus within the required level of perception (Metcalfe, 2017; Shenhav et al., 2017; Wilson et al., 2019) will make it possible to neutralize the negative trends in the quality of perception of information (Feldstein, 2013a, 2013b), which is especially important during digital transformation in the framework of continuing education (Verbitskaya, 2019).

Testing of the program for processing the collection of educational texts (on the example of educational materials on social studies and related disciplines) showed that the developed service copes with this task.

Due to the flexibility of the program's algorithms, compatibility analysis can also be performed on non-educational texts where such analysis is required.

Plans for the project include developing additional modules for the existing system, for example to automate the selection of model parameters, to use distributional semantics methods for vectorizing texts, and to optimize the speed of the algorithms.


Copyright information

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

About this article

Publication Date

21 October 2020


European Publisher




Edition Number

1st Edition




Economics, social trends, sustainability, modern society, behavioural sciences, education

Cite this article as:

Rybakova, G. R., Andreeva, A. Y., Krotova, I. V., & Kamoza, T. L. (2020). Software For Sense Compatibility Analysis Of Educational Texts. In I. V. Kovalev, A. A. Voroshilova, G. Herwig, U. Umbetov, A. S. Budagov, & Y. Y. Bocharova (Eds.), Economic and Social Trends for Sustainability of Modern Society (ICEST 2020), vol 90. European Proceedings of Social and Behavioural Sciences (pp. 800-808). European Publisher.