{"componentChunkName":"component---src-templates-blog-post-js","path":"/predicting-forest-fires","result":{"data":{"markdownRemark":{"html":"<p><img src=\"https://user-images.githubusercontent.com/10103699/152072241-11c1c563-44d3-444f-8284-58040d77a57e.jpg\" alt=\"intro-img\"></p>\n<h6><em>Photo by <a href=\"https://unsplash.com/@thematthoward?utm_source=unsplash&#x26;utm_medium=referral&#x26;utm_content=creditCopyText\">Matt Howard</a> on <a href=\"https://unsplash.com\">Unsplash</a></em></h6>\n<p>Forest fires are among the major natural catastrophes. </p>\n<p>They can endanger both human life and wildlife, and can severely damage the environment.</p>\n<p>This is why predicting forest fires accurately is important.  </p>\n<p>In this project, I used data to explore forest fires that occurred in the Montesinho Natural Park in Portugal.\nMy goal was to see how data can be used to predict the amount of danger caused by the fires. </p>\n<p>I used the R language for the data analysis. The codebase is shared on <a href=\"https://github.com/MalshaL/forest-fires\">GitHub</a>.</p>\n<p>This dataset was collected by the researchers <a href=\"http://www3.dsi.uminho.pt/pcortez/fires.pdf\">Cortez and Morais</a>\nbetween January 2000 and December 2003. You can find the complete dataset on the <a href=\"https://archive.ics.uci.edu/ml/datasets/forest+fires\">UCI Machine Learning\nRepository</a>.</p>\n<h3>Where is Montesinho Natural Park?</h3>\n<p>Now, you may be interested to know about the park the data comes from. It's one of the largest natural parks in\nPortugal, with an area of 74,230 hectares. 
</p>\n<p><img style=\"max-width: 50%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"1\" src=\"https://user-images.githubusercontent.com/10103699/151633402-f81f27af-1de7-45e8-a490-6dca9aa50cb2.png\"/></p>\n<h6><em>Location of Montesinho Natural Park</em></h6>\n<p>Montesinho has a diverse natural habitat with 240 species of animals.\nThe annual temperature of the park varies from 8 to 12°C, although the\ntemperature in summer can reach up to 40°C. </p>\n<p><img src=\"https://upload.wikimedia.org/wikipedia/commons/0/04/Montesinho.jpg\" alt=\"2\"></p>\n<h6><em>Montesinho Natural Park (Image by <a href=\"https://commons.wikimedia.org/wiki/File:Montesinho.jpg\">Elisha.wolf</a>, <a href=\"https://creativecommons.org/licenses/by-sa/4.0\">CC BY-SA 4.0</a>, via Wikimedia Commons)</em></h6>\n<h3>About the Dataset</h3>\n<p>Each row in the dataset describes a fire that occurred in the park. There are 517 entries, described using 13 variables. </p>\n<p>The image below summarises the 13 variables in the dataset. They fall into three groups: categorical variables,\nfire weather indexes and weather conditions. The area burnt by the fire is used as the target variable.</p>\n<p><img src=\"https://user-images.githubusercontent.com/10103699/152072554-c69defc7-2e03-40be-8192-57b662a1db70.png\" alt=\"2\"></p>\n<h6><em>Figure 1: Variables in the dataset</em></h6>\n<p>The fire weather indexes FFMC, DMC, DC and ISI in the above variables are defined in the Fire Weather Index (FWI) system. They\nindicate the danger of the fire based on weather conditions. The higher these indexes are, the more dangerous the fire could be. </p>\n<p>FFMC (Fine Fuel Moisture Code) indicates the fuel moisture content in forest litter, while DMC (Duff Moisture Code) indicates the fuel moisture of decomposed\norganic material. 
DC (Drought Code) reflects long-term moisture conditions, and ISI (Initial Spread Index) estimates the speed of fire spread.</p>\n<h3>Problem to solve</h3>\n<p>The purpose of this project is to predict the burnt area of a fire in Montesinho Park\nbased on variables such as location, weather conditions and fire indexes.</p>\n<h3>Data Transformations</h3>\n<p>Before any data analysis, the target variable 'burnt area' needed to be transformed. After plotting the target variable,\nwe can see that more than 47% of the values are zero (first histogram below). To reduce this skew,\nI took the logarithm of the values (second histogram). As the logarithm of zero is undefined and the transformed target should stay non-negative (since area cannot be\nnegative), the <em>log(area+1)</em> transformation was used (third histogram). </p>\n<p><img style=\"max-width: 70%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"3\" src=\"https://user-images.githubusercontent.com/10103699/152499846-33e7ff4f-6f6e-433b-b019-37ee30ce2fcb.png\"/></p>\n<h6><em>Figure 2: Distribution of target variable before and after transformation</em></h6>\n<h3>Exploring Data</h3>\n<h4>Linear correlation</h4>\n<p>To start, I looked at the linear relationship and correlation among the variables in the dataset. The\nscatter plots below show the distribution of the target variable against the 12 independent variables. The plots don't\nshow any clear indication of a linear relationship among the variables. 
When examining the Pearson correlation for\nnumerical variables, the variable 'DMC' had the highest positive correlation with the target variable 'Burnt area', at 0.067,\nwhich indicates a very weak linear relationship.</p>\n<p><img src=\"https://user-images.githubusercontent.com/10103699/152502929-10f1601e-e343-4365-917c-1e21521d4c8c.png\" alt=\"4\"></p>\n<h6><em>Figure 3: Scatter plots of target variable against independent variables</em></h6>\n<p>Because there is little evidence of any linear correlation among the variables, I next looked at the dataset-specific\naspects. The 4 categorical variables (x and y coordinates, Month and Day) capture geographical\nand temporal information in the dataset. </p>\n<h4>Fire intensity by location</h4>\n<p>First, I used the x and y coordinates to generate a heatmap and identify the areas that were more\nprone to fires. The generated heatmap below shows the areas outside the park in green, while the park is in heatmap colours.\nAs we can see in the map, most of the heavy fires were confined to the edges of the park, while the middle of\nthe park faced less severe fires. However, as the leftmost edge of the park has areas with less severe fires\nclose to highly burnt areas, we can infer that fires did not spread rapidly across the park. </p>\n<p><img style=\"max-width: 70%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"5\" src=\"https://user-images.githubusercontent.com/10103699/152502962-8ac2ed61-2288-4349-a338-e8b92aca80e7.png\"/></p>\n<h6><em>Figure 4: Heatmap of burnt areas in the park using location coordinates</em></h6>\n<h4>Fire intensity by season</h4>\n<p>Next, I used the month attribute to check whether fires showed any seasonal variation. The scatter plot below shows the\noccurrence of fires across the four seasons. As expected, we can see that there were more fires during summer\nand autumn compared to the low frequency of fires in winter and spring. 
However, it's difficult to identify a clear\npattern between burnt area and season, as we can see many fires with an area of zero during summer, while some fires\nin winter had burnt areas with values of 2 and 4.</p>\n<p><img style=\"max-width: 70%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"6\" src=\"https://user-images.githubusercontent.com/10103699/152504483-1a9bd41c-a9d2-4c1e-b2e3-799575f65a93.png\"/></p>\n<h6><em>Figure 5: Variation of burnt area by season</em></h6>\n<h3>Data Analysis using non-linear regression</h3>\n<p>Since the dataset didn't show any evident linear correlation, I used non-linear regression with the numerical variables.</p>\n<p>To further explore the nature of the relationships among variables, I used the pair plot below. </p>\n<p><img style=\"max-width: 70%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"7\" src=\"https://user-images.githubusercontent.com/10103699/153519473-ad1eb487-7a32-46be-bed8-f22f278a62c7.png\"/></p>\n<h6><em>Figure 6: Pair plots for numerical variables</em></h6>\n<p>The last row of the above pair plot shows the variation of Area against each predictor variable. We can see that\nTemperature has an evident curve in its graph, indicating a possible non-linear relationship. Similarly, if we\ndisregard the outlier points in the graphs of FFMC and ISI, we can see that the majority of data points have a\ncurved distribution. Considering these behaviours, I added these variables to the model as polynomial terms\nof order 2.</p>\n<p>So far we have:</p>\n<p><code class=\"language-text\">Area = ... + Temperature^2 + ISI^2 + FFMC^2 + ...</code></p>\n<p>Then, I considered the relationships among variables to add the interaction terms to the model. 
If two predictors\ndid not show any evident relationship with each other, they were added as interaction terms to the model.</p>\n<p>For instance, the Temperature and Wind variables show an evenly spread set of points, which does not show a clear\nconnection. Therefore, the combination of Temperature and Wind was added to the model as an interaction term. Another\nexample is the plot of Wind against ISI, which shows that ISI has a curved relationship with Wind. Therefore, an\ninteraction term with a polynomial term was added to the model as ISI<sup>2</sup> * Wind.</p>\n<p>After training the model, variables that had a low significance (high P-value) such as <code class=\"language-text\">ISI</code> and <code class=\"language-text\">RH</code> were removed\nto improve the model.</p>\n<p>The model became: </p>\n<div class=\"gatsby-highlight\" data-language=\"text\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-text line-numbers\"><code class=\"language-text\">Area = FFMC + DMC + DC + Temperature + Wind + Rain + Temperature^2 + ISI^2 + FFMC^2 + DMC^2 + Wind^2 + DC*ISI + \n        FFMC*ISI + RH*Wind*Rain + Temperature*Wind + Temperature*DMC + ISI^2*Wind</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span></span></pre></div>\n<div class=\"gatsby-highlight\" data-language=\"shellscript\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-shellscript line-numbers\"><code class=\"language-shellscript\">nolin_model1 &lt;- lm(Area ~ . + I(Temperature^2) + I(ISI^2) + I(FFMC^2) + I(DMC^2) + I(Wind^2) + DC*ISI + RH*Wind*Rain \n                + Temperature*Wind + I(ISI^2)*Wind + FFMC*ISI\n                + Temperature*DMC - ISI - RH, data = fire_norm) \n\nsummary(nolin_model1)</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 
0;\"><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p><img style=\"max-width: 60%; display: block; margin-left: auto; margin-right: auto\" \nalt=\"8\" src=\"https://user-images.githubusercontent.com/10103699/153526586-b0545e3b-6185-488a-b32b-ab017c384dcd.png\"/></p>\n<h6><em>Figure 7: Summary of initial non-linear model</em></h6>\n<p>This model had an R<sup>2</sup> value of 0.05345, while the adjusted R<sup>2</sup> of the model was only 0.01513. The gap between\nthese values suggests that many of the terms add little explanatory power and that the model may overfit the data. We can see this in the low\nsignificance of the majority of the variables in the model.</p>\n<h3>Improving the model</h3>\n<p>Removing variables with high P-values (low significance) helped to improve the significance of the model's terms. The\nresulting model was as follows:</p>\n<div class=\"gatsby-highlight\" data-language=\"shellscript\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-shellscript line-numbers\"><code class=\"language-shellscript\">Area = FFMC + DC + Temperature + Temperature^2 + FFMC^2 + Wind^2 + DC*ISI + RH*Rain + ISI^2*Wind</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span></span></pre></div>\n<div class=\"gatsby-highlight\" data-language=\"shellscript\"><pre style=\"counter-reset: linenumber NaN\" class=\"language-shellscript line-numbers\"><code class=\"language-shellscript\"># simplified model\nnolin_model2 &lt;- lm(Area ~ . + I(Temperature^2) + I(FFMC^2) \n                + I(Wind^2) + DC*ISI + RH*Rain + I(ISI^2)*Wind\n                - ISI - RH - DMC - Rain - Wind, data = fire_norm) \nsummary(nolin_model2)</code><span aria-hidden=\"true\" class=\"line-numbers-rows\" style=\"white-space: normal; width: auto; left: 0;\"><span></span><span></span><span></span><span></span><span></span></span></pre></div>\n<p><img style=\"max-width: 60%; display: block; margin-left: auto; margin-right: 
auto\" \nalt=\"9\" src=\"https://user-images.githubusercontent.com/10103699/153526586-b0545e3b-6185-488a-b32b-ab017c384dcd.png\"/></p>\n<h6><em>Figure 8: Summary of improved non-linear model</em></h6>\n<p>This model gave an R<sup>2</sup> value of 0.04602, which was lower than the R<sup>2</sup> of 0.05345 from the initial non-linear model.\nHowever, the improved model had more significant terms, of which Temperature<sup>2</sup> had the highest significance with three stars\nand a P-value of 0.000907. Temperature and DC were also significant, with two-star and one-star ratings respectively.\nThe interaction term RH*Rain had a significance rating of one dot (.), with a P-value of 0.063176. The adjusted R<sup>2</sup> was\n0.02709, which narrowed the gap between the R<sup>2</sup> and adjusted R<sup>2</sup> values. </p>\n<p>To sum up, the implemented model suggests that weather conditions such as temperature, rain, relative humidity and Drought\nCode (DC) play a role in determining the severity of fires, although the low R<sup>2</sup> values show that it explains only a small part of the variation in burnt area. </p>\n<p>To find more interesting effects of each parameter, we can cluster data to find similar data points and build separate\nregression models for each group.</p>","frontmatter":{"date":"September 20, 2020","path":"/predicting-forest-fires","title":"How much danger can a forest fire cause?","tags":["Project","Machine Learning","Data Analytics","R","Regression"]}}},"pageContext":{}},"staticQueryHashes":["3649515864"]}