Regional and Urban Data Science Projects for Citizen and Youth Engagement

This article considers the psychological and pedagogical implications of data analysis practices for students' decisions about their personal and professional future and the life of local communities. Using the example of a 10-day "Data Campus" Data Analysis Bootcamp, it explores how students without specific mathematical training can learn to perform data research following the CRISP-DM cycle. The method of research was an ascertaining experiment involving 600 students aged 14 to 18. The research showed an emotional and activity-based engagement of students in projects with regional and urban data relevant to their interests, real-life situations, or the self-determination tasks they face. While reproductive activities related to technical programming skills are relatively easy for students to master, productive solutions to research tasks can be somewhat difficult due to a lack of experience in analyzing inter-subject relationships and representing the object under research in a multidimensional feature space. When implementing educational programs that teach data analysis methods in social sciences and humanities through a project approach, students should be provided with appropriate methodological assistance in the form of project mentoring, giving them the opportunity to conceptualize the project before they start working with data.
2672-815X © 2022 Published by European Publisher.


Data as a tool for innovation
Regional and urban data refers to data related to infrastructural, economic, environmental, social, and other aspects of regional and urban life (Law & Legewie, 2021). Typically, such data is generated by the activities of people living in the city. It covers topics such as traffic environment, safety, energy consumption, and air and water pollution, as well as the social and human practices of urban communities: their interests, social connection structures, patterns of behavior, etc. Urban data can be collected through a variety of sensors, smart meters, satellite images, security cameras, or cell phones, or derived from surveys of residents.
A data-literate individual is able to understand external data as well as his/her own, and is sensitive to issues of privacy and ethical use of data. In addition to the data literacy skills of a data consumer, a data-literate creator possesses a number of specialized competencies. These include knowing how to find and use open data, how to generate and use data such as data from sensors, and how to integrate data into various products such as websites and mobile apps (Celino et al., 2012).
The use and generation of urban data by citizens is particularly important for urban innovation. Citizens can drive change from the bottom up: they better understand local problems and offer solutions that are more responsive to their needs (Schelings & Elsen, 2019; Tanenbaum et al., 2013). The transition from top-down to bottom-up data-driven innovation shifts citizens from being passive consumers of technology and providers of data to active participants who consciously generate, provide, and research data both to identify problems that can be solved with data (Fotopoulou, 2020) and to innovate urban development (Wolff et al., 2015).

Psychological and pedagogical implications of data analysis
We consider urban data and, in particular, geodata as a tool for placing a student's Self and Identity, and relevant topics, problems, and prospects on a map of the city, and, in a broader perspective, on a map of his/her "possible future", including projecting possible trajectories and conditions for his/her social and geographical mobility. The answer to the question of the prospects and capitalization of an individual's personal opportunities is impossible without an analysis of the trends in the development or degradation of certain territories and areas of activity, knowledge of which can be provided by data research.
The mastering and instrumental use of this knowledge is possible when the student sees his/her own intentions, propensities, and prospects for implementing personal opportunities expressed in relevant data and projected onto a map of the city, region, country, or world, and their individual layers: economic, educational, and many others, all of which describe his/her present, future, or probable habitat.
"Data Campus" is an ongoing educational program that, since 2019, has organized student project activities in data analysis, machine learning, and modeling of socioeconomic, sociocultural, and engineering objects. A notable part of Data Campus student projects is the analysis of urban, regional, and country data. In a sense, such data projects can be considered a platform for students to model their personal, educational, and career futures. The main objectives of the program are:
• to ensure that participants learn the basics of data analysis and machine learning and have a professional tryout in this field;
• to provide participants with the basics of modern spatial-analytical thinking as a tool for designing their own future and the future of the territory under study;
• to develop rational, data-driven attitudes among participants toward global processes that affect their settlement area and the entire country;
• to foster participants' self-determination in relation to various professional and social groups and to modern forms of social, educational, and professional mobility.
https://doi.org/10.15405/epes.22043.4 Corresponding Author: Andrey A. Deryabin. Selection and peer-review under responsibility of the Organizing Committee of the conference. eISSN: 2672-815X

Problem Statement
Although the didactic merits of applying data to learning in not only natural science but also humanities and social science subjects are obvious to many professionals (Gibson & Mourad, 2018; Glukhov et al., 2021; Locke, 2017), one often encounters the objection that students who are unfamiliar with the relevant parts of mathematics (in particular, mathematical statistics) are hardly capable of any meaningful practical application of data science methods, especially for researching such complex sociocultural constructs as "city" or "region." We believe, however, that outside professional education in Data Science, pedagogical efforts may focus not on the technical aspects of working with data but on professional tryouts and on developing students' understanding of how data analysis can help solve specific problems of cities and communities. In this case, it would be more accurate to speak not of students achieving craft perfection in programming for data analysis or professional status, but of data literacy as one of the modern digital competencies and of the data-literate individual as a rational and informed citizen (D'Ignazio, 2017; Fotopoulou, 2020; Pangrazio & Selwyn, 2019).
In terms of the educational challenge of developing an analytical approach for students to explore such complex objects and relate them to their own self-determination, a lack of proficiency in mathematical methods or programming is not an insurmountable obstacle. The part of the data research cycle that relies on these skills can be done at the reproductive level. Of much greater interest was students' productivity in formulating a research topic and hypothesis and in selecting data to test it, activities consistent with the initial phases of the standard data research process: understanding goals, initial data investigation, and data preparation (Chapman et al., 2000).
Reproductive activities here refer to activities performed by students using a previously mastered algorithm, and reproductive competencies refer to those that make it possible to understand the conditions for applying that algorithm and to apply it. Reproductive competencies are assessed with the help of tasks that have a solution given in advance. A productive activity refers to an activity in a new situation, where a previously learnt algorithm is not applicable. A productive action in cultural-historical psychology is a two-fold event of overcoming one's habitual way of acting and presenting the emerging result of action (a product) to others. As such, productive action is an act of one's development (Elkonin, 2019). In the course of productive action, the ability of a person to act in a situation of uncertainty acquires special significance, and the question of whether the subject has certain productive competencies becomes pressing. Productive competencies are assessed in the course of solving problematic situations in which the tasks are yet to be defined by the actors, which also requires the application of knowledge and certain ways of acting (Glukhov, 2016).

Research Questions
The hypothesis of the research was that students aged 15-18 are able to implement the data research cycle in a short time: some of its phases ("data understanding", "initial data investigation", and "data preparation") at the productive level, relying only on existing knowledge and thinking skills, and the phases that require special mathematical and software tools ("modeling", "evaluation", "implementation") at the reproductive level.

Purpose of the Study
The research objective was to test the "Data Campus" educational program on data analysis and machine learning, involving participants in collaborative project activities.

Research Methods
The method of the research was an ascertaining experiment in which 600 students in grades 8-11 from schools in Siberian cities, aged 14 to 18, were trained in the "Data Campus" bootcamp between June and November 2020. The participants underwent 8 days of training in the basics of Data Science and Machine Learning with lectures and master classes. In addition, they carried out a team data project. The students were asked to set their own research objective with approximately the following instructions: "Each team should formulate the topic of its project, the problem it solves, and define its goals and objectives. The project can be either exploratory or applied. Both suggested data sets and any data from the Internet can be used to implement the project." Teams submitted their results in a form that is standard in the data analysis and machine learning industry, namely as Jupyter Notebook files containing the developed program code with comments, sample datasets, infographics, results, and conclusions, and presented their work publicly.
Collaborative programming was carried out in the Google Colab environment. All required content, data sets, as well as didactic and test materials, were provided via cloud drives and Google Classroom.
Because the students carried out research projects, the quality of their mastery of the concepts, categories, and data used in their work was of considerable importance. Accordingly, the evaluation of the educational result had a three-level structure: 1) the conceptual part concerns, on the one hand, the student's understanding of the data he/she deals with and, on the other hand, the understanding and interpretation of the concepts used in the subject field; at this level, a student is expected to understand both and be able to explain them; 2) the analytical part concerns the description of the chosen object or process of analysis; a student can apply the concepts and analyze the object with their help; 3) the modeling part is associated with the elaboration of scenarios for the behavior of the objects and processes in question; a student readily uses a given concept in connection with other concepts and can, on this basis, build models and discuss predictions and scenarios within the conceptual framework.

A generalized correspondence of the didactic phases of the project to the CRISP-DM data research methodology is given in Table 1.

Table 1. Project work phases and phases of the CRISP-DM data research cycle

| Project work phases and evaluation parameters | Phases of the CRISP-DM data research cycle |
| --- | --- |
| Conceptual (mastering the concepts of a particular subject area) | data understanding, initial data investigation |
| Analytical (multidimensional description of a chosen data-driven subject) | initial data investigation, data preparation |
| Predictive (interpretation of results, identification of analytical models, forecasts and scenarios) | modeling, evaluation, implementation (presentation of results) |

Here is an example of one of the projects: "Representation of the region in the federal media".
Conceptual phase: students may ask themselves the question "How is our region represented in the news agenda published by news agencies and news websites?" and are introduced to: (1) basic communication models, principles, and metaphors of media functioning, such as the media as a "mirror" of reality, a "window," a "filter," a "pointer," a "forum," or a "barrier"; (2) the social functions of media (to inform, coordinate, reproduce, entertain, mobilize). The data set can be an array of texts from news agencies or other media.
Analytical phase: students perform standard textual data processing operations (stemming, lemmatization, etc.), conduct quantitative analysis using frequency methods (counting the frequency of words, phrases, etc.), and compare these metrics across geographical regions and time periods (years). This stage is equivalent to the Exploratory Data Analysis stage in data research and can by itself lead to some insights about the objects and processes the data represent.
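The frequency analysis described for the analytical phase can be illustrated with a minimal sketch. It is not the code used by the participants: the three-field record layout, the toy headlines, the stop-word list, and the function names `tokenize` and `frequencies_by_group` are all assumptions made for illustration, and stemming and lemmatization are omitted because they would normally rely on an external NLP library.

```python
import re
from collections import Counter

# Hypothetical toy corpus: (region, year, headline). Invented for illustration only.
NEWS = [
    ("Region A", 2019, "New bridge opens; bridge traffic expected to ease"),
    ("Region A", 2020, "Hospital funding approved for regional hospital"),
    ("Region B", 2019, "Wildfire season starts early, wildfire crews deployed"),
    ("Region B", 2020, "Wildfire damage assessed as crews return"),
]

# A tiny illustrative stop-word list (an assumption; real projects use fuller lists).
STOP_WORDS = {"new", "to", "for", "as", "the", "of", "and"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def frequencies_by_group(records):
    """Return {(region, year): Counter of token frequencies}."""
    groups = {}
    for region, year, text in records:
        groups.setdefault((region, year), Counter()).update(tokenize(text))
    return groups


freqs = frequencies_by_group(NEWS)
for (region, year), counter in sorted(freqs.items()):
    # Show the two most frequent tokens for each region and year.
    print(region, year, counter.most_common(2))
```

Grouping by the pair (region, year) keeps both comparison axes mentioned above, so the same counters can be compared either across regions within a year or across years within a region.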
Modeling phase: in this case, it is possible to apply the "topic modeling" technique, which identifies hidden topics in the entire text array, that is, sequences of words that occur together and most frequently in the sample. A necessarily creative task here is selecting the optimal number of computer-generated topics, which are derived from data yet must remain interpretable, and then interpreting and naming them. The same actions can be performed not only on the regional but also on the federal data sample for subsequent comparison. Further, it is possible to analyze the temporal and geographical distribution of the identified topics and their variations in relation to known socio-political, economic, and other events. Finally, it is possible to compare regions by similarity in their news agenda and identify clusters of thematically similar regions, which may be geographically very distant from each other. The results can also be visualized by overlaying topic tags on a computer map in the form of layers.
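The last comparison step, finding regions with a similar news agenda, can be sketched as follows. This is not topic modeling itself: as a simplifying assumption, raw word counts stand in for the topic weights that a topic model from an external library would produce, and cosine similarity is used as one possible similarity measure. The region names, the counts, and the helper names `cosine_similarity` and `most_similar` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical word counts per region (invented for illustration only).
# In a real project these would be topic weights produced by a topic model.
REGION_COUNTS = {
    "Region A": Counter({"wildfire": 4, "forest": 3, "budget": 1}),
    "Region B": Counter({"wildfire": 3, "forest": 2, "road": 1}),
    "Region C": Counter({"budget": 4, "election": 3, "road": 2}),
}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(region, counts):
    """Return the other region whose vector is closest to that of `region`."""
    others = (name for name in counts if name != region)
    return max(others, key=lambda name: cosine_similarity(counts[region], counts[name]))


for name in REGION_COUNTS:
    print(name, "->", most_similar(name, REGION_COUNTS))
```

Because the measure depends only on the vectors and not on geography, two regions far apart on the map can still come out as nearest neighbours, which is the effect described above.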
Obviously, the results of such task-activity practices in some respects are different from traditional disciplinary education aimed at reproducing a previously known and didactically "correct" result. In this case, much attention is paid to ensuring that students independently determine the goals and objectives of the project, cultivating their active subject position.
Such organization of project activities helps students reconstruct the presumed areas and places where their interests and intentions can be realized. Working with data and machine learning models in this case provides a basis for choosing a preferred space for education and career, and for considering desirable and rejected places of residence and self-actualization with a critical attitude, allowing students to make an informed choice.

Findings
Examples of student projects related to regional or urban issues are given in Table 2.
Table 2. Examples of "Data Campus" participants' projects on regional analytics and data-urbanism.
Most of the teams used the datasets prepared for them by the organizers. However, a number of teams that formulated topics related to local urban or regional issues needed to find or create their own datasets. Of particular interest is the result of one team, which organized an online social media survey on perceived public transportation waiting times in different cities and thereby quickly collected a dataset of 7 attributes and 30 thousand observations. The team linked fares to public transportation waiting times and interpreted the relationship between satisfaction with public transportation and the access of private carriers to providing services in the city. This example shows how the use of authentic data (Kjelvik & Schultheis, 2019) that participants generate independently has the greatest educational effect (Wolff et al., 2019) through their "appropriation" of this data. This educational situation allows for productive student activities and keeps students highly motivated.

Summary
Students who had a good command of Python programming successfully applied the methods and code examples given to them in Data Science classes to their project, even though their knowledge of statistics and the mathematical foundations of the algorithms behind machine learning models was shallow. This allows us to conclude that reproductive activity in this area was relatively easy for them.
From the experts' perspective, the conceptual work on some projects was more difficult, although manageable in most cases. It required students to define or elaborate concepts, operationalize hypotheses or tasks in terms of available data, understand inter-subject connections, and represent the object under research in a multidimensional space of attributes.

Conclusion
The research showed an emotional and activity-based engagement of students in projects with regional and urban data relevant to the participants' place of living, interests, life situations, or self-determination tasks. In this respect, working with such data has great educational potential. While reproductive activities related to technical programming skills were relatively easy for students to master without deep knowledge of the mathematical foundations of machine learning algorithms and mathematical statistics, students' productive research activities were somewhat impeded.
This was especially noticeable when they analyzed sociocultural objects, which often required students (i) to decompose research objects into lower-level entities that can be represented in the data, and (ii) to draw on their knowledge of Social Science, History, and Economics. In addition, research activities required students to know the basics of scientific research methodology and to be able to apply them in a new situation.
When implementing educational programs that teach data analysis methods in social sciences and humanities in a project mode, students should be provided with methodological assistance through mentoring in the form of pre-project consultations, giving them the opportunity to conceptualize the research project before they start working with data.