Prediction of PM10 Concentrations Using Logistic Regression Analysis: Case Study in Jerantut

Abstract

Particulate matter (PM 10) in the environment can cause serious negative health effects in humans. It is therefore important to forecast its concentration so that the risk of exposure can be reduced. Secondary data on the concentrations of PM 10, sulphur dioxide (SO 2), nitrogen dioxide (NO 2), ground level ozone (O 3) and carbon monoxide (CO), along with temperature and relative humidity, at the Jerantut monitoring station between 2010 and 2012 were obtained from the Department of Environment. The main objective of this study is to describe the relationship between PM 10 and other gases and weather conditions using correlation. It also aims to determine the best prediction category and to find a model for predicting the concentration of PM 10 using logistic regression. PM 10 and O 3 at the Jerantut monitoring station were found to have a strong positive correlation. The best logistic regression model was obtained for 2010, with an R 2 value of 0.565. The best-predicted category at the Jerantut monitoring station was healthy, with a correct prediction percentage of more than 85% in both the overall and the annual analyses for 2010 to 2012.

Keywords: Air pollution, particulate matter, prediction model

Introduction

Air pollution refers to the contamination of the indoor or outdoor environment by any type of agent that modifies the natural characteristics of the atmosphere ( World Health Organization, 2014). Hanapi and Din ( 2012) described air pollution as coming from many sources, such as waste products, construction work, factory emissions and vehicles. Pollutants at ground level are caused by both human activities and natural events. Acson International, in their Healthy Air Booklet, stated that the main sources of air pollution in Malaysia are industrial fuel burning, motor vehicles, domestic fuel burning, power stations and the burning of industrial and municipal waste ( Acson Malaysia Sales and Service Sdn Bhd, 2012). In Malaysia, the Air Pollution Index (API) is currently used as an indicator of air quality ( Hanapi & Din, 2012). According to the Department of Environment Malaysia (2013), the API is calculated from five major types of air pollutants measured at the air pollution monitoring stations belonging to the Department of Environment Malaysia: PM 10, sulphur dioxide, nitrogen dioxide, ground level ozone and carbon monoxide. An API value of 50 or below is classified as good, 51-100 as moderate, 101-200 as unhealthy, 201-300 as very unhealthy and more than 300 as hazardous, whereas an API value above 500 is classified as an emergency.

Particulate matter (PM), also known as fine dust, is a complex mixture of liquid droplets and extremely small particles. It is made up of several components, including organic chemicals, acids, dust particles, soil and metals. PM with an aerodynamic diameter of less than 10 μm (PM 10) is one of the major air pollutants in Malaysia and in most cities in Southeast Asia ( Afroz, Hassan, & Ibrahim, 2003). The yearly average ambient concentrations of PM 10 in Malaysia between 1999 and 2013 were generally within the Malaysian Ambient Air Quality Guidelines (MAAQG) value of 50 μg/m³. The highest concentration was 50 μg/m³, recorded in 2002, whereas the lowest was 39 μg/m³, recorded in 2010. The concentration of PM at a specific location depends on many factors, such as local and regional particulate matter sources, the geographical situation and meteorological conditions ( Titos, Lyamani, Pandilfi, Alastuey, & Alados-Arboledas, 2014). The main source of air pollutants, especially PM, is traffic exhaust emissions ( Bycenkiene, Plauskaite, Dudoitis, & Ulevicius, 2014).

The Department of Occupational Safety and Health Malaysia (2014) stated that PM 10 can negatively affect human health if the API value exceeds 100. Environment Statistics Time Series Malaysia (2013) summarised that unhealthy events caused by transboundary heavy particulate matter were recorded between 2002 and 2013, with a maximum of 3 days recorded in 2005. According to a report on heavy particulate matter pollution in the New Straits Times ( 2014), the standard operating procedure is for schools to close when the API exceeds 200, at which point the air quality is at a “very unhealthy” level. However, the number of days a school stays closed depends on the duration of the high particulate event. This creates uncertainty for the public, who cannot know how long the closure will last, and makes it difficult for them to plan or schedule outdoor social activities. An appropriate prediction method is needed to reduce these difficulties.

Problem Statement

According to Pascal et al. ( 2014), exposure to PM 10 has been consistently associated with serious health outcomes, resulting in an increase in mortality and hospital admissions predominantly related to cardiovascular and respiratory disease. Many studies have linked PM 10 to a series of significant health problems, including aggravated asthma, increased respiratory symptoms such as coughing and difficulty breathing, chronic bronchitis, decreased lung function and premature death. One of the unhealthy events in Malaysia is the presence of heavy particulate matter caused by uncontrolled forest fires originating from the Indonesian province of Sumatra during the burning season ( Norela, Saidah, & Mahmud, 2013). Fire is normally used for land preparation and forest clearance by people involved in farming. Unfortunately, these fires can develop into uncontrollable wildfires. This usually happens between June and November, coinciding with drier weather conditions ( Salinas et al., 2013). Because of these issues, there is a need to provide an early warning to those who may be affected. Short-term prediction is therefore relevant for providing information about PM 10 concentrations.

Research Questions

Is logistic regression suitable for prediction of PM 10 concentration?

Purpose of the Study

The main objective of this study is to describe the relationship between PM 10 and other gases and weather conditions using correlation. It also aims to determine the best prediction category. Furthermore, this research aims to develop a model for predicting the concentration of PM 10 using logistic regression.

Research Methods

Logistic regression is a method for predicting a discrete outcome from a set of variables that may be continuous, discrete, dichotomous or a mixture of these. It can be used to answer the same questions as discriminant analysis. Unlike discriminant analysis, however, logistic regression makes no assumption about the distribution of the independent variables. Typical applications include predicting the success or failure of a new product, determining which credit risk category a person will fall into and predicting whether a firm will be successful.

In statistical analysis, the main objectives of logistic regression are to correctly predict the outcome category for individual cases and to measure the relationship between the categorical dependent variable and the independent variables. To accomplish this, a model must be created that includes useful and relevant independent variables.

Logistic regression does not require the assumption of normality. However, the sample size must be large enough: at least 100 observations, with a ratio of 20 observations for each independent variable. Because the outcome is a probability, a log transformation is needed to link it to an ordinary linear regression equation. This transformation of p, called the logit of p or logit(p), is defined as:

$$\operatorname{logit}(p) = \log_e\left(\frac{p}{1-p}\right) = \ln\left(\frac{p}{1-p}\right) \tag{1}$$

Logit(p) is the natural logarithm (base e) of the odds p/(1 - p). From Equation (1), p must lie between 0 and 1, so logit(p) ranges from negative infinity to positive infinity. The graph of logit(p) is symmetric about p = 0.5. From Equation (1), the logistic regression equation takes the form:

$$\operatorname{logit}(p) = \ln\left(\frac{p}{1-p}\right) = \alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k \tag{2}$$

Equation (2) shows that the logistic regression equation behaves like a linear model on the logit scale. The model uses maximum likelihood, rather than least squares deviation, as the criterion for finding the best fit. The value of p can be calculated with the following formula:

$$p = \frac{e^{\alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k}}{1 + e^{\alpha + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k}} \tag{3}$$

where p is the probability of the outcome of interest, here the probability that the PM 10 concentration falls in a given category, e is the base of the natural logarithm (approximately 2.718), α is the constant coefficient and β is the coefficient of each independent variable (temperature, relative humidity, NO 2, SO 2, O 3 and CO). The logistic regression model has three possible outcomes for the PM 10 level: healthy (Y=1), moderate (Y=2) and unhealthy (Y=3). These categories are defined according to the relationship between PM 10 concentration and the Air Pollution Index in Malaysia, as shown in Table 1.
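As a minimal sketch of Equations (1) to (3), the functions below compute the logit of a probability and recover a probability from a linear predictor. The coefficient and predictor values are hypothetical and chosen only for illustration; they are not estimates from this study.

```python
import math


def logit(p: float) -> float:
    """Equation (1): natural log of the odds p / (1 - p), for 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def probability(alpha: float, betas: list, xs: list) -> float:
    """Equation (3): probability implied by the linear predictor of Equation (2)."""
    z = alpha + sum(b * x for b, x in zip(betas, xs))
    # Numerically stable evaluation of e^z / (1 + e^z).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients and predictor values (assumptions, not study results).
alpha = -1.0
betas = [0.8, -0.5]
xs = [2.0, 1.0]

p = probability(alpha, betas, xs)
print(round(p, 4))         # probability for linear predictor 0.1
print(round(logit(p), 4))  # recovers the linear predictor, 0.1
```

Applying `logit` to the output of `probability` returns the linear predictor, which is the sense in which Equation (3) inverts Equation (2).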

Table 1 -

60% of the data was used as training data to obtain the logistic regression model. The remaining 40% was used for validation. When the percentage of correct predictions for the training data is the same as or higher than that for the validation data, the model is considered good and suitable for prediction.
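The 60/40 split and the acceptance rule described above can be sketched as follows. The paper does not state how observations were assigned to each set, so the random shuffle, the function names and the placeholder records are assumptions made for illustration.

```python
import random


def split_train_validation(records, train_fraction=0.6, seed=0):
    """Shuffle a copy of the records and split it into training and validation sets.

    Random assignment is an assumption; the study does not describe its procedure.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def passes_acceptance_rule(train_pct_correct, validation_pct_correct):
    """Rule stated in the text: training accuracy is at least validation accuracy."""
    return train_pct_correct >= validation_pct_correct


records = list(range(1000))  # placeholder for the monitoring records
train, validation = split_train_validation(records)
print(len(train), len(validation))         # 600 400
print(passes_acceptance_rule(97.2, 97.0))  # True for the overall figures in the Findings
```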

Data and Area of Study

In this research, the secondary data used was recorded between 2010 and 2012. The data set consists of data on the air pollutants PM 10, CO, SO 2, NO 2 and O 3, together with meteorological data on temperature and relative humidity. The secondary data was obtained from the Air Quality Division of the Department of Environment Malaysia (DoE). The data was collected and monitored by Alam Sekitar Malaysia Sdn. Bhd. (ASMA), which is the authorized agency for the DoE ( Azid et al., 2014). The data was subjected to standard quality control and quality assurance procedures that follow the standards outlined by the United States Environmental Protection Agency (USEPA) ( Latif et al., 2014).

Findings

Based on the descriptive statistics provided in Table 2, the PM 10 concentration readings do not exceed the 24-hour Malaysian Ambient Air Quality Guidelines (MAAQG) value of 150 μg/m³. The highest average PM 10 concentration was recorded in 2011 (39.09 μg/m³), compared with 37.00 μg/m³ in 2010 and 37.49 μg/m³ in 2012. These averages are lower than 50 μg/m³ (the annual MAAQG value), indicating that the PM 10 concentration in the Jerantut area still meets the standard set by the DoE. Both the standard deviation and the coefficient of variation were lowest in 2011. The kurtosis (-0.10) and skewness (0.77) in 2010 were the lowest of the three consecutive years, indicating that the distribution of the 2010 data was the closest to a normal distribution.
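The descriptive statistics discussed above (mean, standard deviation, coefficient of variation, skewness and kurtosis) can be sketched as below. The input values are hypothetical, and the moment-based skewness and excess kurtosis formulas are one common choice; statistical packages often apply sample-size corrections, so their output may differ slightly from this sketch.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a list of concentrations."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    # Central moments used for moment-based skewness and excess kurtosis.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "std_dev": sd,
        "coef_variation_pct": 100.0 * sd / mean,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,  # 0 for a normal distribution
    }


# Hypothetical PM10 readings in ug/m3 (not data from this study).
stats = describe([30.0, 35.0, 40.0, 45.0, 50.0])
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Skewness and excess kurtosis close to zero are what the text means by a distribution being close to normal.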

Table 2 -

Correlation between PM 10, Other Gases and Meteorological Parameters

Pearson correlation analysis was used to study the correlation between the gases (SO 2, NO 2, O 3 and CO), PM 10 and the meteorological parameters. The correlations for the Jerantut monitoring station are shown in Table 3.

Table 3 -

Table 3 shows a strong correlation of 0.614 between PM 10 and O 3, while the correlation between PM 10 and SO 2 was weak, with a value of 0.004. The strong correlation between PM 10 and O 3 at Jerantut station indicates that higher concentrations of O 3 tend to be accompanied by higher concentrations of PM 10. No negative correlation was recorded between PM 10 and the other gaseous parameters.

A significant positive correlation between PM 10 and temperature is expected, as higher temperature leads to greater evaporation and resuspension of particles in ambient air. The negative correlation between relative humidity and PM 10 was also expected, because humidity and rainfall reduce the amount of particulate matter in the air through the wash-out process ( Mahiyuddin et al., 2013). High temperature tends to bring lower humidity and hot weather, which in turn promotes local and regional biomass burning that subsequently increases the quantity of particles in the air ( Latif et al., 2014).
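For reference, the Pearson correlation coefficient used in this subsection can be computed as in the sketch below. The three short series are hypothetical values constructed to show a perfectly positive and a perfectly negative relationship; they are not measurements from Jerantut.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical series (assumptions for illustration only).
pm10 = [30.0, 35.0, 40.0, 45.0, 50.0]
temperature = [24.0, 25.0, 26.0, 27.0, 28.0]
humidity = [90.0, 85.0, 80.0, 75.0, 70.0]

print(pearson_r(pm10, temperature))  # positive: rises with PM10
print(pearson_r(pm10, humidity))     # negative: falls as PM10 rises
```

A coefficient near +1 or -1 indicates a strong linear association, while a value near 0, such as the 0.004 reported for SO 2, indicates almost none.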

Logistic Regression Analysis

Logistic regression analysis was conducted to determine the best-fitting model describing the relationship between the dependent variable, whose categories are healthy, moderate and unhealthy, and a set of independent explanatory variables: temperature, relative humidity, SO 2, NO 2, O 3 and CO. The R 2 value and the percentage of correct group classification were also calculated for 2010 to 2012 to find the best-fitting model.
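To illustrate how a logistic regression model is fitted by maximum likelihood, the sketch below estimates the coefficients by gradient ascent on the log-likelihood. It is a simplified stand-in, not the procedure used in this study: it handles only a binary outcome (for example, healthy versus not healthy) with one predictor, and the data are synthetic values generated from assumed coefficients.

```python
import math
import random


def sigmoid(z):
    """Numerically stable e^z / (1 + e^z)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, learning_rate=0.1, iterations=2000):
    """Estimate alpha and betas by gradient ascent on the mean log-likelihood."""
    n, k = len(X), len(X[0])
    alpha, betas = 0.0, [0.0] * k
    for _ in range(iterations):
        grad_alpha, grad_betas = 0.0, [0.0] * k
        for row, target in zip(X, y):
            z = alpha + sum(b * x for b, x in zip(betas, row))
            error = target - sigmoid(z)  # derivative of the log-likelihood w.r.t. z
            grad_alpha += error
            for j in range(k):
                grad_betas[j] += error * row[j]
        alpha += learning_rate * grad_alpha / n
        betas = [b + learning_rate * g / n for b, g in zip(betas, grad_betas)]
    return alpha, betas


# Synthetic binary data from assumed "true" coefficients (illustration only).
rng = random.Random(42)
true_alpha, true_beta = -0.5, 2.0
X = [[rng.uniform(-2.0, 2.0)] for _ in range(200)]
y = [1 if rng.random() < sigmoid(true_alpha + true_beta * row[0]) else 0 for row in X]

alpha, betas = fit_logistic(X, y)
predicted = [1 if sigmoid(alpha + betas[0] * row[0]) >= 0.5 else 0 for row in X]
accuracy = sum(p == t for p, t in zip(predicted, y)) / len(y)
print(alpha, betas, accuracy)
```

A three-category outcome such as the one in this study would require a multinomial or ordinal extension of the same idea, which statistical packages provide directly.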

The overall and yearly regression models and R 2 values between 2010 and 2012 at Jerantut station are shown in Table 4. The overall R 2 value of the model was 0.956. The R 2 values of the yearly models for 2010, 2011 and 2012 were 0.565, 0.349 and 0.296, respectively.

Table 4 -

Table 5 and Table 6 show the overall percentage of correct group classification for the training data and the validation data, respectively. The training data had a correct classification percentage of 97.2%, while the validation data had 97.0%. This indicated that the model was good, because the training data had a higher percentage of correct predictions than the validation data. The healthy group was correctly predicted in 100% of cases. However, the percentage of correct predictions for the moderate group was 0.0%, due to the small number of PM 10 observations in the moderate category.
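The per-group and overall percentages in a classification table can be computed as in the sketch below. The labels are hypothetical and deliberately imbalanced: they show how a model that always predicts the majority category can reach a high overall percentage while scoring 0.0% on a rare category, which is the pattern described above for the moderate group.

```python
from collections import Counter


def percent_correct_by_group(actual, predicted):
    """Return ({group: percent correct}, overall percent correct)."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    per_group = {group: 100.0 * hits[group] / totals[group] for group in totals}
    overall = 100.0 * sum(hits.values()) / len(actual)
    return per_group, overall


# Hypothetical labels: 97 healthy days and 3 moderate days, with a model that
# always predicts "healthy" (assumption for illustration, not study data).
actual = ["healthy"] * 97 + ["moderate"] * 3
predicted = ["healthy"] * 100

per_group, overall = percent_correct_by_group(actual, predicted)
print(per_group)  # {'healthy': 100.0, 'moderate': 0.0}
print(overall)    # 97.0
```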

Table 5 -
Table 6 -

Table 7 and Table 8 show the percentage of correct group classification for the training data and the validation data in 2010, respectively. The training data had a correct classification percentage of 97.3%, whereas the validation data had 94.5%. This indicated that the model was good, because the training data had a higher percentage of correct predictions than the validation data. The healthy group was correctly predicted in 100% of cases for the training data and 98.1% for the validation data. For the moderate group, the percentage of correct predictions was 25.0% for the validation data and 0.0% for the training data.

Table 7 -
Table 8 -

Table 9 and Table 10 show the percentage of correct group classification for the training data and the validation data in 2011, respectively. The training data had a correct classification percentage of 96.6%, whereas the validation data had 96.2%. This indicated that the model was good, because the training data had a higher percentage of correct predictions than the validation data. The healthy group was correctly predicted in 99.6% of cases for the training data and 98.4% for the validation data. For the moderate group, the percentage of correct predictions was 40.0% for the validation data and 0.0% for the training data.

Table 9 -
Table 10 -

Table 11 and Table 12 show the percentage of correct group classification for the training data and the validation data in 2012 at Jerantut station, respectively. The training data had a correct classification percentage of 98.6%, whereas the validation data had 95.9%. This indicated that the model was good, because the training data had a higher percentage of correct predictions than the validation data. The healthy group was correctly predicted in 100.0% of cases for the training data and 99.3% for the validation data. For the moderate group, the percentage of correct predictions was 0.0% for both the training and the validation data.

Table 11 -
Table 12 -

Conclusion

The secondary data obtained from the DoE was analysed using descriptive statistics and correlation. The results show that the maximum concentration of PM 10 at Jerantut station from 2010 to 2012 was below the limit set by the Malaysian Ambient Air Quality Guidelines (MAAQG). The correlation analysis between PM 10 and the other gases and meteorological parameters at Jerantut station showed a strong correlation of 0.614 between PM 10 and O 3. The logistic regression analysis gave a correct classification percentage of more than 90% for both training and validation data in every year. Moreover, the best logistic regression model at Jerantut station was obtained for 2010, with an R 2 value of 0.565. The best-predicted category was healthy, with a correct prediction percentage of more than 85% in both the overall and the yearly analyses.

Acknowledgments

Special thanks to Universiti Sains Malaysia for the funding with a Short-term Grant (PJJAUH/6315089).

References

About this article

Publisher: European Publisher

First Online: 30.03.2020

DOI: 10.15405/epsbs.2020.03.03.92

Online ISSN: 2357-1330