Users of a linguistic corpus must be aware that a search can return erroneous data. The first mistake concerns word frequency in the diachronic perspective: a special formula is needed to calculate a word’s frequency correctly. The second mistake concerns grammatical search, where it is difficult to set correct search parameters. The third mistake is connected with lexical homonyms, which must always be treated with caution. The fourth mistake concerns searching for combinations of semantic features: if you "play" with semantic feature search, you can get ambiguous results. The user of a linguistic corpus should not rely entirely on the search results. There is always some "noise" in search results, and sometimes a shortage of contexts. It is necessary to evaluate the data received on the basis of one's own language competence and to select and set the search parameters correctly. An improperly compiled search query can distort the real results by exaggerating or understating them.
Keywords: linguistic corpus, search query, word entry, context shortage, unremoved homonymy
Describing the user's work in the linguistic corpus, many linguists use the Latin expression
The relevance of this study lies in analyzing the mistakes made in using a linguistic corpus in order to prevent them in the future.
The subject of this analysis is the most frequent mistakes identified in using the National Corpus of Russian Language. The object of the study is typical mistaken search parameters and the incorrect results obtained in using the linguistic corpus.
Purpose of the Study
The purpose of this article is to consider the most frequent mistakes made in using a linguistic corpus.
The methods of using linguistic corpora are still developing. Each researcher independently develops methods for compiling searches in accordance with his or her scientific interests.
The first error is related to the calculation of the word frequency in the diachronic perspective.
It is necessary to calculate token frequency correctly. Let’s calculate, for example, the token frequency of the ancient Russian word перст / finger. The corpus search results give the following data: 141 tokens before 1800 and 1069 tokens after 1900. Taken at face value, this cannot be correct, because an ancient word cannot have become more frequent in recent times. The relative frequency must be calculated from the absolute count using a formula: the number of tokens found is divided by the total number of tokens in the texts and multiplied by 1 000 000. There are 12 636 732 tokens available before 1800, which gives a frequency of about 11 per million. There are 208 953 262 tokens available after 1900, which gives a frequency of about 5 per million. The relative frequency of the word has therefore fallen, although the absolute count has grown.
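As a rough illustration of the formula described above, the calculation can be sketched in Python. The function name `instances_per_million` is an assumption introduced for this example; the counts are the ones quoted in the text. With those counts the formula returns roughly 11.2 and 5.1 instances per million tokens.

```python
def instances_per_million(hits: int, corpus_size: int) -> float:
    """Relative frequency: hits divided by corpus size, scaled to one million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return hits / corpus_size * 1_000_000


# Counts quoted in the text for the word перст / finger.
before_1800 = instances_per_million(141, 12_636_732)
after_1900 = instances_per_million(1069, 208_953_262)

print(f"before 1800: {before_1800:.2f} per million")
print(f"after 1900:  {after_1900:.2f} per million")

# The absolute count grows (141 -> 1069), but the relative frequency falls.
print("relative frequency decreased:", after_1900 < before_1800)
```

Normalizing by subcorpus size is what makes the two periods comparable: the later subcorpus is more than sixteen times larger, so a raw count alone says little about how common the word was.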
Let's consider the frequency of the ancient Russian word
The second error is related to grammatical search. D. Beiber and S. Conrad explored the use of corpus materials in teaching English grammar and demonstrated their effectiveness.
In their opinion, three types of results are important for teaching grammar: 1) information about frequency; 2) fixing comparisons; 3) associations between grammatical structures and words (Beiber, Conrad, 2012). Let's search, for example, for sentences containing Future tense predicates in combination with time interval circumstances containing the preposition
(1)Ларка, это письмо я
There are no examples with Future tense predicates in combination with time interval circumstance including preposition
We tried to specify parameters. We used the searches with three words: an auxiliary verb
When working with a linguistic corpus, you cannot trust it completely. It is necessary to evaluate the data received on the basis of your language competence and to select and set the search parameters correctly (Hunston, Francis, 2000).
The main task of a linguistic corpus is not merely to contain texts, but to contain correctly annotated texts. For example, a morphologically annotated corpus indicates the part of speech of every word, which allows the user to quickly find a word of the desired part of speech. However, over time the corpus volume increased significantly, which led to a drop in search performance. The main problem was that different grammatical analyses attributed to the same word began to mix because of morphological homonymy. So, for example, the word
A linguistic corpus can be imperfect, so the user needs to remember and understand its features and limitations. There can always be omissions in the corpus, because in any living language, language play and word creation produce lexemes, word forms and new meanings, which are called occasionalisms.
In the oral speech subcorpus you can find many examples of occasional or innovative word forms that are used to attract the interlocutor’s attention and to make a statement informal or ironic. In this case, the annotation is more transparent and clear so that the user can easily find the necessary word form. Therefore, when compiling a subcorpus, it is necessary to take potential users into account and to facilitate their interaction with the information source.
The "buyer" of the corpus, taking advantage of an "alien" corpus, consciously takes a risk: the corpus annotation may be imperfect. The corpus creators make a certain compromise between theoretical knowledge and the possibilities of computer implementation.
Therefore, on the one hand, the user should be cautious about the search results received; on the other hand, the user should not forget that "not everything that seems, at first glance, is a mistake of the corpus".
The third mistake concerns lexical homonyms. It is necessary to be cautious whenever you meet lexical and grammatical homonyms. You should not forget that unremoved homonymy can produce ambiguous results (Kutter, Kantner, 2012).
There is a search example of a word
(4) Ну/ там был такой боевой бизнесмен из
(6) Знаете/ что такое арбалет?
In the example (4) we see the proper name, the location name
The fourth mistake is related to searching for a combination of semantic features. Here, too, you need to be careful and attentive with the corpus. If you "play" with semantic feature search, you can get ambiguous results.
For example, if you specify semantic feature
(8) Что означает «
In the example (7) we see the word
The linguistic corpus user should not rely entirely on the corpus search results, because to some extent there will always be "noise" or a shortage of contexts.
Example of "noise": NCRL contains a large number of texts in which homonymy has not been removed. Therefore, the search results for one word form include contexts with a homonymous token. For example, if you are looking for the imperative verb form рой / dig, the search results will also include the noun рой / swarm (for example, пчелиный рой / "bee swarm").
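The effect of unremoved homonymy on search results can be sketched with a toy model. Everything in this example is an assumption made for illustration: the `Token` class, the tag strings `V,imper` and `S,nom`, and the four invented contexts do not reproduce the actual NCRL annotation or query interface. The sketch only shows why a token that still carries several grammatical analyses is returned by a search for any one of them and therefore has to be checked by the user.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str
    context: str
    analyses: frozenset  # every grammatical analysis still attached to the token


IMPERATIVE = "V,imper"  # illustrative tag, not the corpus's real tag set
NOUN = "S,nom"          # illustrative tag, not the corpus's real tag set

# Invented mini-corpus for the word form "рой".
tokens = [
    Token("рой", "рой глубже", frozenset({IMPERATIVE})),          # disambiguated verb
    Token("рой", "пчелиный рой", frozenset({IMPERATIVE, NOUN})),  # homonymy not removed
    Token("рой", "рой яму", frozenset({IMPERATIVE, NOUN})),       # homonymy not removed
    Token("рой", "рой пчёл", frozenset({NOUN})),                  # disambiguated noun
]


def search(corpus, form, tag):
    """Return every token with this form whose analyses include the requested tag."""
    return [t for t in corpus if t.form == form and tag in t.analyses]


hits = search(tokens, "рой", IMPERATIVE)
reliable = [t for t in hits if len(t.analyses) == 1]
needs_review = [t for t in hits if len(t.analyses) > 1]

print("hits:", len(hits))
print("reliable:", [t.context for t in reliable])
print("to check by hand:", [t.context for t in needs_review])
```

In this toy data the noun context пчелиный рой is returned for the imperative query only because its analyses were never disambiguated, which is the kind of "noise" the user has to filter out with his or her own language competence.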
Example of "shortage": if we search only for exact word forms, or if we build a lexico-grammatical search incorrectly, the results will show only a part of the possible uses rather than all of them. Let's try to find uses of the quotation рукописи не горят / "manuscripts do not burn". A search for the exact quotation gives us 51 examples. A lexico-grammatical search with the distance between words from 1 to 4 gives us 60 citations in the texts. There are some interesting interpretations among them.
(10) Я, как дурачок, сидел и плакал над ними, не веря, что
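The difference between the exact search and the search with a distance of 1 to 4 between words, described above for the quotation рукописи не горят, can be sketched as follows. This is a minimal sketch under stated assumptions: the function `find_with_distance` and the three sample sentences are invented for illustration, and matching is done on plain word forms, whereas a real lexico-grammatical search also works with lemmas and grammatical features.

```python
def find_with_distance(tokens, words, min_dist=1, max_dist=4):
    """Check whether `words` occur in order, each consecutive pair being
    between min_dist and max_dist token positions apart (1 means adjacent)."""

    def search_from(prev_pos, idx):
        if idx == len(words):
            return True
        last = min(prev_pos + max_dist, len(tokens) - 1)
        for pos in range(prev_pos + min_dist, last + 1):
            if tokens[pos] == words[idx] and search_from(pos, idx + 1):
                return True
        return False

    return any(
        tokens[i] == words[0] and search_from(i, 1)
        for i in range(len(tokens))
    )


query = ["рукописи", "не", "горят"]

# Invented sentences, already split into tokens.
sentences = [
    "рукописи не горят".split(),
    "рукописи всё равно не горят".split(),
    "горят рукописи".split(),
]

exact_hits = sum(find_with_distance(s, query, 1, 1) for s in sentences)
distance_hits = sum(find_with_distance(s, query, 1, 4) for s in sentences)

print("exact search:", exact_hits)
print("distance 1 to 4:", distance_hits)
```

The exact search misses the second sentence, where other words stand between the parts of the quotation, while the distance-based search finds it. This mirrors the "shortage" of contexts that appears when the search parameters are too strict.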
An example of competent corpus use in phonetic research is the work of A. Piperski and A. Kukhto (Piperski, Kukhto, 2016). In their article, the authors automatically analyze a selected subcorpus of poems by ten poets from the National Corpus of Russian Language. They searched for word forms with stress variability. The number of such word forms ranges from 30 to 200 for different parts of speech. The article proposes a quantitative measure for estimating the overall variability of stress that is independent of corpus size.
Thus, we find that an incorrectly compiled search query can distort the real results by exaggerating or understating them. The corpus user should be aware of this.
- Baranov, A. N. (2003). Corpus linguistics.
- Beiber, D., Conrad, S. (2012). Corpus Linguistics and Grammar Teaching. Retrieved from http://longmanhomeusa.com/content/pl_biber_conrad_monograph5_lo.pdf
- Hunston, S., Francis, G. (2000). Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English. Philadelphia:
- Kutter, A., Kantner, C. (2012). Corpus-based Content Analysis: A Method for Investigating News Coverage for War and Intervention. Retrieved from http://www.uni-stuttgart.de/soz/ib/forschung/IRWorkingPapers/IROWP_Series_2012_1_Kutter_Kantner_Corpus-Based_Content_Analysis.pdf
- National Corpus of Russian Language (NCRL). (2005).
- Piperski, A., Kukhto, A. (2016). Intra-speaker stress variation in Russian: A corpus-driven study of Russian poetry. Retrieved from http://www.dialog-21.ru/media/3419/piperskiachkukhtoav.pdf
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
About this article
30 April 2018
Sociolinguistics, linguistics, semantics, discourse analysis, translation, interpretation
Cite this article as:
Lutfullina, G. (2018). Some Mistakes Of Linguistic Corpus Users. In I. V. Denisova (Ed.), Word, Utterance, Text: Cognitive, Pragmatic and Cultural Aspects, vol 39. European Proceedings of Social and Behavioural Sciences (pp. 96-101). Future Academy. https://doi.org/10.15405/epsbs.2018.04.02.14