Media Security Protection: New Methodology For Authorship Examination Of The Internet Discourse

Abstract

This article deals with the forensic authorship examination of Internet discourse texts as a means of ensuring media security. The existing forensic authorship examination methods are either unsuitable for identification and diagnostic tasks or need to be updated and improved, since modern communication in the Internet environment has new properties that differ from those of purely oral or written speech. The authors seek to reveal the specific mixed nature of this new type of forensic object (a combination of oral and written speech) and propose new approaches to the identification and diagnostic authorship examination of Internet speech products. In the authors’ opinion, the most rational approach to developing a methodology for Internet discourse authorship examination is to combine the methods of identifying a speaker based on oral speech (in terms of linguistic analysis within the framework of the “Dialect” technique) and on written speech (the methods proposed by Vul and the Ministry of Internal Affairs of Russia). The article serves as a basis for future research in forensic authorship examination of the Internet discourse and in forensic speech science in general. The authors conclude that it is necessary to improve or develop not only authorship identification methods but also authorship diagnostic methods, which make it possible to determine the properties of a questioned text and/or of an author’s idiostyle, electronic communication skills, etc., which are needed to determine the author’s social and demographic characteristics.

Keywords: Authorship examination, forensic linguistics, information security, media security, speech product, speech trace

Introduction

The wide spread of digital data transmission technologies (digitalization) has led to the formation of a new digital environment in which information circulates in digital form. Most such information products are either texts or polycode materials containing a verbal component. Language change is accelerating as computer technologies enter everyday life and are actively used for communication. There is no unified term for this new type of communication; the terms in use include computer-mediated communication, computer discourse, Internet discourse, communication in the electronic (digital) environment, electronic communication, etc.

Various socially dangerous acts related to the distribution of textual information can be committed in the Internet environment: publishing offensive or slanderous materials; extortion; blackmail; threats; the dissemination of extremist materials; propaganda of drugs, hatred, violence, or murder (including mass murder); involvement in destructive subcultures, etc. Such speech actions pose a threat to the media security of Internet communication. The use of electronic means of communication determines the peculiarities of investigation in this category of cases: the digital form of information requires special methods for detecting, collecting, and examining evidence.

Problem Statement

In proceedings on cases related to ensuring media security, it is often necessary not only to study the speech products themselves (their semantics and pragmatics) but also to identify their author (the author and the distributor of a text do not always coincide). Traditionally, an expert in the relevant specialty (in such cases, a forensic authorship expert) is invited to resolve issues requiring special knowledge. However, forensic expert support for the examination of Internet discourse speech products is not easy to provide, since the existing forensic methods have not been updated for a significant period of time and need theoretical rethinking and revision that takes into account the mixed nature of Internet speech products.

The main hypothesis of the research was that the methodology of authorship examination must be adapted for the examination of speech products of the Internet discourse.

Research Questions

The research aimed to answer the following questions:

  • What approaches to forensic authorship examination of the Internet discourse exist?
  • What are the key features of typical objects of forensic authorship examinations?
  • What are the promising directions in the sphere of forensic authorship examination of the Internet discourse and forensic speech science in general?

To consider these issues, the authors relied on the following fundamentals:

  • Forensic authorship examination of the Internet discourse is a compulsory component of media security support.
  • The media security system serves to balance freedom of speech against citizens' right to a safe information space.
  • The activities of forensic authorship experts have to correspond to the activities of state organizations and must comply with the basic principles of information security and information ethics.

Purpose of the Study

The purpose of the study was to develop theoretical provisions and specific practical recommendations on how to use forensic linguistic (authorship) knowledge to ensure media security in the digital environment.

Research Methods

The research is interdisciplinary in nature and is based on the provisions of forensic expertology and forensics (forensic science), on the one hand, and of forensic speech science (including forensic linguistics) and applied linguistics, on the other.

Findings

Electronic media communication, like any other phenomenon, has advantages and disadvantages. Its advantages include potentially unlimited possibilities for speech activity and the lack of ties to a specific territory (several people from remote parts of the world can work together on a project by means of electronic communication; people with certain health problems can also use this type of communication). Its disadvantages include the relative anonymity of speech products (it is not always possible to identify the author of a speech product). It is worth noting that some of these advantages, such as the potentially unlimited possibilities for speech activity, can also turn into disadvantages (for example, when an anonymous person sends or publishes negative (defamatory) information about another person).

The criteria for the suitability of objects for forensic authorship examination differ from those applied in other types of forensic speech examination (forensic linguistic expertise, forensic phonoscopic expertise). In forensic linguistic examination (the main purpose of which is to establish semantic content), only speech products with signs of textuality (a common concept, theme, and structure; logical and stylistic unity; grammatical and semantic connectivity of components) are suitable objects. In forensic authorship examination, the main criterion for the suitability of an object is that it expresses the author's individual speech-thinking skills; the object need not be a whole text, but its volume must not fall below a certain threshold value. If the text is stereotyped, stripped of individual expression, or technical, then the signs that characterize the author's speech-thinking skills are most likely not expressed, and such objects are therefore unsuitable. At the same time, the presence of signs of textuality in the object of forensic authorship examination is secondary and does not carry the same weight as in forensic linguistic examination.

The properties of the Internet discourse differ from the properties of purely oral or written communication. The wide possibilities for coding information provided by electronic data transmission devices lead to the transformation of the natural language and a change in the rules for handling it. The text of Internet communication, expressed in digital form and having a physical medium, has common features with the standardly organized written text, but it also has features of oral communication (this is especially clearly manifested in the texts of blogs and social networks).

The speech of the global network displays an abundance of deviations from existing language norms and exists mainly in the format of forums and chats, whose pragmatics is as close as possible to that of colloquial speech and consists in openness, striving for mutual understanding, mutual interest in communication, and emotional liberation (Proshakova, 2008). Communication on forums and in chats is also brought closer to the pragmatics of colloquial speech by emotional connotations, which communicants express quite actively through graphic non-verbal symbols, as well as symbols already provided by the programmed capabilities of a particular site (netiquette):

  • smilies;
  • stickers;
  • emoji;
  • capital letters / their absence;
  • excess / absence of punctuation marks, etc.

The presence of the above signs describes the nature of the author's communicative intentions and serves as a marker of a certain emotional state.
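
The markers listed above lend themselves to simple counting. The following is a minimal sketch in Python and not part of any method discussed in this article: the function name `expressive_markers` and the deliberately small regular expressions are hypothetical, and a real examination would need validated inventories of emoticons, emoji, and stickers.

```python
import re

# Hypothetical, intentionally narrow patterns (assumptions for illustration only).
EMOTICON_RE = re.compile(r"[:;=][-o^']?[)(DPp]")                # e.g. :) ;-) =D
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")   # common emoji blocks
REPEATED_PUNCT_RE = re.compile(r"[!?.]{2,}")                    # e.g. !!! ?! ...
WORD_RE = re.compile(r"[^\W\d_]+")                              # runs of letters, any script


def expressive_markers(message: str) -> dict:
    """Count simple graphic markers of emotional expression in one message."""
    words = WORD_RE.findall(message)
    # Words of two or more letters written entirely in capitals.
    caps_words = [w for w in words if len(w) > 1 and w.isupper()]
    return {
        "emoticons": len(EMOTICON_RE.findall(message)),
        "emoji": len(EMOJI_RE.findall(message)),
        "all_caps_words": len(caps_words),
        "repeated_punctuation": len(REPEATED_PUNCT_RE.findall(message)),
        # True when the message has words but no capital letter at all.
        "has_no_capitals": bool(words) and not any(ch.isupper() for ch in message),
    }


if __name__ == "__main__":
    print(expressive_markers("NO WAY!!! you did that :) \U0001F600"))
```

Such counts only register that a marker is present; interpreting it as a sign of a particular communicative intention or emotional state remains the expert's task.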

A key feature of traditional forms of verbal communication (oral and written) is the paired property of spontaneity / preparedness. As a rule, speaking is characterized by spontaneity, while writing is characterized by preparedness. However, this is not true in all cases: oral speech can be prepared (e.g., when speaking in public), and written speech can be spontaneous (e.g., texts of electronic communication). Establishing the grade of preparedness of an oral or written text is important for the correct assessment and interpretation of the features that differ between a questioned speech product and the samples during the identification process. The existing methods of identifying the author of a written text of the Internet discourse need to be updated, as spontaneity implies different levels of organization in such texts and different norms for their construction.

Another factor influencing the possibility of differentiating a spontaneous or prepared text is its functional and stylistic affiliation. The features characterizing the spontaneous or prepared generation of a text can have a different degree of stability under the conditions of various functional styles. On the one hand, they can be consciously (or unconsciously) omitted; on the other hand, they can be intensified due to certain components of oral or written discourse, implemented in a certain communication situation.

Thus, the main approaches to improving the methods of author identification for electronic communication products include the study of the features of spontaneous written texts and of the stylistics of Internet discourse products. The digital transformation of the language leads to the emergence of new speech norms. That is why the formal methods that were developed in the last third of the 20th century (those of Vul (2007), the Ministry of Internal Affairs of Russia (Rubczova et al., 2007), etc.) and are used to identify the author of a prepared written Russian-language text are not suitable for the examination of this type of speech product.

Electronic communication is characterized by ‘economy’ of the author's speech means. Traditional norms of spelling and punctuation are not relevant in the Internet discourse. Nowadays, speech errors made, for example, in messengers are not treated as errors: in the Internet discourse they are regarded not as a sign of the author’s illiteracy but as an objective way to save time. Whereas compliance with the rules of the literary language was previously considered a sign of high linguistic competence, the main metrics of linguistic competence are now the range of vocabulary and the ability to actively take up new generative constructions borrowed from other languages, which can be accompanied by selective use of the rules of grammar and punctuation. Assessing the linguistic competence of communicators in the Internet environment by the conformity or nonconformity of their speech products with language norms does not make it possible to find a correct set of identification signs, since electronic communication presupposes non-observance of language rules. Taking into account the differences listed above between electronic communication and traditional forms of communication (oral and written), it can be concluded that the author’s competence in communicating in the Internet environment should be examined instead of his/her language competence.

Should the research be limited to the modernization of existing methods of author identification, or is it necessary to develop new methods from scratch? Identification signs of written speech can be divided into general signs (the grade of development of grammatical, lexico-phraseological, and stylistic skills) and specific signs (features of the structure of language skills, expressed in stable deviations from language norms). This systematization is based on qualitative and quantitative indicators of the extent of a person's command of language skills.

The general features of written speech are characterized by the following parameters (Romanchenko, 2013):

  • level of proficiency in written speech;
  • grade of development of grammatical skills;
  • grade of development of stylistic skills;
  • features of the skills of using language means:
      • predominance of linguistic means of a certain style of speech: scientific, colloquial, epistolary, journalistic, official-business;
      • length of sentences, presence of paragraphs;
      • prevailing types of sentences (simple, complex);
      • the predominant nature of the syntactic connection (with or without conjunctions);
      • phraseological means;
  • features of the skills of the presentation architectonics;
  • lexical skills and vocabulary volume.

In an authorship identification examination, the forensic expert analyzes the questioned object and the comparative samples of written speech separately and then selects the coinciding and differing general and specific signs of language skills expressed in the questioned text and in the samples. In practice, to obtain a high-quality expert examination, certain requirements are imposed on the samples, and failure to meet them can lead to incorrect expert conclusions. Samples for comparative examination should be comparable with the questioned speech product: they should correspond to the questioned text in terms of time of production, style of speech, addressee, the nature of speech communication with him/her, etc.

The issue of the required number of samples for the comparative stage of the examination is no less important. The required number of samples is determined by the volume of the questioned text: the generally accepted volume of comparative material (samples) should exceed the volume of the questioned material by at least 10-15 times, and the questioned written text should contain at least 500 words. Given the specifics of Internet discourse texts, the volume of, for example, a post or a message on a social network may be far below this minimum threshold. That is why forensic experts have difficulty detecting significant identification signs of written speech in the Internet discourse. Electronic texts also change their properties significantly (including extralinguistic ones) as they pass through the ‘filter’ of text editors, which in most cases include spellchecking systems (with autocorrection), fast typing tools, and other technical means of correcting texts. The presence of such text-altering technologies creates additional obstacles to the analysis of the author's competencies.
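
The volume requirements stated above can be expressed as a simple check. The sketch below is illustrative only: the function names `word_count` and `check_material` are hypothetical, words are counted naively by whitespace, and the default ratio of 10 is taken as the lower bound of the 10-15 range mentioned in the text.

```python
from typing import Iterable


def word_count(text: str) -> int:
    """Naive word count: whitespace-separated tokens (an assumption)."""
    return len(text.split())


def check_material(questioned: str, samples: Iterable[str],
                   min_words: int = 500, min_ratio: float = 10.0) -> dict:
    """Compare the volumes of the questioned text and the comparative samples."""
    q = word_count(questioned)
    s = sum(word_count(sample) for sample in samples)
    ratio = s / q if q else 0.0
    return {
        "questioned_words": q,
        "sample_words": s,
        "ratio": ratio,
        # The questioned text should contain at least `min_words` words.
        "questioned_long_enough": q >= min_words,
        # The samples should exceed the questioned text at least `min_ratio` times.
        "samples_sufficient": q > 0 and ratio >= min_ratio,
    }


if __name__ == "__main__":
    post = "word " * 40                      # a short social-network post
    samples = ["word " * 300, "word " * 200]
    print(check_material(post, samples))
```

For a typical short post the ratio requirement may be satisfied while the 500-word minimum is not, which is the difficulty described above.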

Venčkauskas et al. (2015), who deal with the methodological support of forensic authorship identification, have summarized the existing methodological approaches to text attribution in various languages. With regard to the Russian language, they noted that the main methods of identifying the author are formal methods based on the detection of various characteristics (the average number of words in a sentence or syllables in a word, the frequency of occurrence of various parts of speech, etc.), chief among which are sequences of letter pairs and word pairs. However, the applicability of these methods to texts of the Internet discourse has not yet been studied with regard to the peculiarities of electronic communication, for example, the use of non-linguistic symbols (emoticons, etc.), transliterated writing, etc.
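
To make the formal characteristics named above concrete, the following minimal Python sketch computes the average number of words per sentence and the most frequent letter pairs and word pairs. It is not the procedure of any cited method: the function name `formal_features`, the punctuation-based sentence split, and the letter-only tokenization are assumptions made for illustration.

```python
import re
from collections import Counter


def formal_features(text: str, top_n: int = 5) -> dict:
    """Compute a few simple formal characteristics of a text."""
    # Naive sentence split on runs of ., ! or ? (an assumption; emoticons
    # and unpunctuated chat messages would break it).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lower-cased runs of letters in any script, including Cyrillic.
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Letter pairs are counted inside words only.
    letter_pairs = Counter(w[i:i + 2] for w in words for i in range(len(w) - 1))
    # Adjacent word pairs; for simplicity they may cross sentence boundaries.
    word_pairs = Counter(zip(words, words[1:]))
    return {
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        "top_letter_pairs": letter_pairs.most_common(top_n),
        "top_word_pairs": word_pairs.most_common(top_n),
    }


if __name__ == "__main__":
    print(formal_features("The cat sat. The cat ran!", top_n=1))
```

The naive sentence split already shows why applicability to Internet texts is an open question: a message without final punctuation, or one that contains emoticons, distorts the sentence-length estimate.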

Revealing the main features of text attribution, Vinogradov (1961) noted eleven factors characterizing questioned texts, six of which are objective, although the vast majority are only weakly formalizable. One of these factors, according to the scholar, is the linguistic stylistic factor. Researchers examine quantitative indicators of style.

All of the above suggests that forensic authorship examination of texts of the Internet discourse must be based not only on formal but also on qualitative research methods (for example, the method of quasi-synonyms). In addition, a promising direction in the development of new attribution methods is the study of the properties of an author's idiostyle.

The term ‘idiostyle’ overlaps with the term ‘idiolect’, and the distinction drawn between them depends on the views of a particular researcher. In general, it can be summarized as follows: an idiolect is understood as the entire set of texts created by a certain author in their original chronological sequence, while an idiostyle is understood as the set of deep text-generating dominants and constants of a certain author that determined the appearance of these texts in that order.

Idiolect (from the Greek idio - own, peculiar, special and dialect) is a set of formal and stylistic features inherent in the speech of an individual speaker of a particular language. It is a designation of the individual characteristics of text generation. In the narrow sense, an idiolect is the specific speech characteristics of a particular native speaker. In a broad sense, an idiolect is the implementation of language in the speech products of a particular individual, that is, a set of texts generated by a speaker or writer.

The concepts of idiostyle and idiolect are closely related to the concept of ‘linguistic personality’, which covers a complex way of describing the linguistic ability of an individual and connects the systemic representation of the language with the functional analysis of texts. Karaulov (2019) classified speech skills and abilities. In his opinion, the structure of a linguistic personality consists of three levels: verbal-semantic (grammatical), thesaurus (cognitive), and motivational-pragmatic. The verbal-semantic level includes units of the lexical and grammatical structure of the language (word, morpheme, word form, derivative, synonym, phrase, syntaxeme, etc.). The units of the cognitive level, which form an individual's linguistic picture of the world, are the denotatum, the extensional and intensional components of a concept, the frame, the generalized utterance (i.e. aphorism, proverb, etc.), the phraseological unit, the metaphor, and the visual image. The units of the pragmatic level, which reflect the target attitudes of the author, his/her active position in the world and, accordingly, the dynamics of his/her picture of the world, include presupposition, deixis, elements of reflection, assessment, ‘keywords’, precedent texts, methods of argumentation, ‘scenarios’, and plans and programs of behavior.

So, the individuality of an author's style lies in a set of stylistic marks and is distinguished by a certain principle of selection and combination of linguistic means and their transformations. In other words, it is one of the possible versions of the linguistic representation of the sense that the author wants to convey. With regard to the forensic authorship examination of the Internet discourse, the analysis of the author's style covers quasi-synonyms, ordinary synonymous series, etc., since the author's preferences (not only textual ones but also those concerning punctuation, as well as the presence of emoticons, infographics, graphic highlights, etc.) are of forensic significance for identification and diagnosis. Therefore, a new methodology for the examination of Internet speech products should take the listed features into account and entail a new understanding of identification and diagnostic features in relation to new objects in which oral and written speech are mixed (since the inherently oral process of generating text in an online format is displayed in writing). In such cases, forensic authorship experts assess not the linguistic competence of an individual (his/her knowledge of the rules that make up the norm of the literary language) but his/her electronic communication skills and speech habits (including brevity, economy of speech efforts, etc.), taking into account the new speech norms inherent in the Internet discourse.

One of the important directions of the forensic authorship examination of Internet discourse products is determining the grade of their spontaneity / preparedness. Resolving this diagnostically significant issue can, in particular, support the conclusion that the author of the text intended to commit an offense. Based on the results of such an examination, an expert can assign the questioned text to one of two groups of text-generation situations. The first group covers copying a previously known or unknown text; compiling a previously learned written text; written speech under dictation; presentation of a written text; and written citation of a written text. The second group covers drawing up a written text according to a template or filling out a form according to a sample; composing a written text on a predetermined topic according to a plan; drawing up a written text with partial reproduction of someone else's speech, or verbatim and non-verbatim quotation of someone else's or his/her own text; composing a text on a topic that is not known in advance but is familiar; composing a text on an unknown and unfamiliar topic; writing answers to questions posed in advance; and writing spontaneously generated text in monologue or dialogue form (Galyashina, 2003).

However, the solution to this issue is not only of diagnostic forensic significance. As indicated above, electronic communication products (even those presented in written form) also contain signs of oral speech. In this regard, we propose, for the purpose of forensic authorship examination of Internet discourse products, to combine the approaches to identifying a speaker based on oral speech (in terms of linguistic analysis within the framework of the “Dialect” technique for forensic phonoscopic expertise) and on written speech (the authorship examination methods proposed by Vul (2007) and the MIA of Russia (Rubczova et al., 2007)). This will help to update the provisions of these techniques and, at the same time, to take into account the features of electronic communication that coincide with those of oral communication.

Even if identification is impossible for objective reasons (for example, in a case without a suspect), the involvement of a forensic authorship expert seems desirable, as a diagnostic examination that establishes the author's social and demographic characteristics (gender, age, ethnic / religious affiliation, education, profession, etc.) can narrow the circle of suspects.

The characterization of the author is also an important area of forensic authorship diagnostics. The fundamental possibility and necessity of determining the demographic characteristics of authors (including age and gender) were demonstrated by the research of Koppel et al. (2002), which correctly established the gender of the author in 79.5% of cases for men and 82.6% of cases for women. Johannsen et al. (2015) also established correlations in the features of the written language of men and women, as well as of different age groups. McDonald et al. (2013) also conducted research in this sphere. Hovy (2015) identified the most common categories of topics typical of authors of a particular gender (men and women).

Therefore, scholars see the possibilities of using forensic authorship diagnostics in judicial practice (to determine the true identity of Internet users) (Schler et al., 2006). For example, some individuals make attempts to conceal their true identity by creating profiles on social networks with false personal data (last name, first name, age, gender, location) in order to persecute their victims (Peersman et al., 2011).

Conclusion

The existing forensic authorship examination methods are either unsuitable for identification and diagnostic tasks or need to be updated and improved, since modern communication carried out in the Internet environment has new properties that differ from those of purely oral or written speech. The methodology of forensic authorship examination needs to be revised to adapt it to the small volume of questioned texts and their online nature.

It is also necessary to develop new forensic methods based on the new approach to the concept of the digital communication idiostyle. The most rational way to develop a methodology for Internet discourse authorship examination is, in our opinion, to combine methods of speaker identification based on oral speech (in terms of linguistic analysis within the framework of the “Dialect” technique) and on written speech (the methods proposed by Vul (2007) and the Ministry of Internal Affairs of Russia (Rubczova et al., 2007)).

Formal methods of author identification intended for prepared written texts are not suitable for the examination of online Internet texts. In forensic authorship expertise, signs are considered from two points of view:

  • norm-error. Everything that is a mistake counts as a sign indicating either the grade of linguistic competence (literacy) or knowledge / lack of knowledge of the language (when the language in which the text is generated is not native). Conformity with the norms of the language is not forensically significant precisely because it is normative (its manifestation has low identification significance);
  • norm in terms of functional style. The norm of business speech and the norm of colloquial speech are different. Electronic communication products do not correspond to the norms of the Russian literary language for the most part (the norm for them is non-compliance with the literary norm, economy of speech means). Moreover, the understanding of a norm (and what is not a norm) in relation to electronic communication products needs to be rethought. Therefore, forensic methods that focus on the qualitative indicators of the questioned text and allow determining the competence of the author's electronic communication skills seem to be suitable. A promising area of research is the development of new mixed (formal-qualitative) methods.

It is necessary to improve or develop not only authorship identification methods but also authorship diagnostic methods. Forensic authorship diagnostics makes it possible to determine the properties of a questioned text and/or of an author’s idiostyle, electronic communication skills, etc., which are needed to determine the author’s social and demographic characteristics. When identification of an author is impossible, a diagnostic approach with subsequent narrowing of the number of suspects seems to be an effective way for a forensic expert to assist a law enforcement officer in ensuring media security in the digital environment.

Acknowledgments

The reported study was funded by RFBR, project number 20-011-00190 ‘Conceptualization of countering information threats in the Internet environment using special legal and forensic linguistic knowledge’.

References

  • Galyashina, E. I. (2003). Preliminaries in Forensic Speech Science. STENCY Publ.

  • Hovy, D. (2015). Demographic Factors Improve Classification Performance. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing: vol. 1. (pp. 752–762).

  • Johannsen, A., Hovy, D., & Søgaard, A. (2015). Cross-Lingual Syntactic Variation over Age and Gender. In: Proceedings of the 19th Conference on Computational Natural Language Learning. (pp. 103-112). CoNLL, Association for Computational Linguistics.

  • Karaulov, Yu. N. (2019). The Russian Language and A Linguistic Personality. URSS Publ.

  • Koppel, M., Argamon, S., & Shimoni, A. (2002). Automatically Categorizing Written Texts by Author Gender. Literary and Linguistic Computing, 17(4), 401-412.

  • McDonald, R., Nivre, J., Quirmbach-Brundage, Y., Goldberg, Y., Das, D., Ganchev, K., Hall, K., Petrov, S., Zhang, H., Täckström, O., Bedini, C., Castelló, N. B., & Lee, J. (2013). Universal Dependency Annotation for Multilingual Parsing. ACL, 92-97.

  • Peersman, C., Daelemans, W., & Van Vaerenbergh, L. (2011). Predicting Age and Gender in Online Social Networks. SMUC 2011: Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents (pp. 37-44).

  • Proshakova, E. Yu. (2008). Forensic Authorship Examination in Criminal Cases involving Internet Communication. Actual Problems of Russian Law, 8(3), 463-466.

  • Romanchenko, T. N. (2013). Methods of Attribution in Author Relation Text Expertise. Saratov State Law Academy Bulletin, 91(2), 228-233.

  • Rubczova, I. I., Ermolova, E. I., & Bezrukova, A. I. (2007). Comprehensive Methodology for Authorship Examination: Methodical Recommendations. Forensic Center of the Ministry of Internal Affairs of the Russian Federation Publ.

  • Schler, J., Koppel, M., Argamon, S., & Pennebaker, J. (2006). Effects of Age and Gender on Blogging. AAAI Spring Symposium - Technical Report, SS-06-03, 191-197.

  • Venčkauskas, A., Damaševičius, R., Marcinkevičius, R., & Karpavičius, A. (2015). Problems of Authorship Identification of the National Language Electronic Discourse. Information and Software Technologies. ICIST. Communications in Computer and Information Science, 538, 415-432.

  • Vinogradov, V. V. (1961). The Problem of Authorship and Theory of Styles. Gosudarstvennoe izdatelstvo hudozhestvennoj literatury.

  • Vul, S. M. (2007). Forensic Authorship Identification. Methodical Foundations. KHNIISE Publ.

Copyright information

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

About this article

Publication Date

02 December 2021

eBook ISBN

978-1-80296-117-1

Publisher

European Publisher

Volume

118

Edition Number

1st Edition

Pages

1-954

Subjects

Linguistics, cognitive linguistics, education technology, linguistic conceptology, translation

Cite this article as:

Galyashina, E. I., Nikishin, V. D., & Bogatyrev, K. M. (2021). Media Security Protection: New Methodology For Authorship Examination Of The Internet Discourse. In O. Kolmakova, O. Boginskaya, & S. Grichin (Eds.), Language and Technology in the Interdisciplinary Paradigm, vol 118. European Proceedings of Social and Behavioural Sciences (pp. 601-610). European Publisher. https://doi.org/10.15405/epsbs.2021.12.74