This post extends our earlier post. Please read that one first to better follow the additional information below.
Since the last release of the model we have worked on three things: better understanding our learning variable (the total number of houses damaged), adding new algorithms, and finding the minimal set of variables that still gives a reasonably good prediction.
The first run of the model was trained on the UN OCHA dataset for Typhoon Haiyan, covering around 300 municipalities in the Philippines for which the number of houses damaged was available [see note 1]. Later on we were able to obtain data for 4 typhoons (Haiyan included) in which the damage was split into completely damaged and partially damaged houses.
Note 1: In the rest of the post we discuss the number of houses damaged, even though the learning is always done on the percentage of houses damaged with respect to the total in the municipality. This does not affect any of our reasoning but makes things easier to explain.
We soon realized that the number of houses damaged can be misleading as a measure of the severity of a typhoon. For example, two municipalities (A and B) can have the same number of houses damaged, say 10000, but different splits between completely and partially damaged. Municipality A may have 9000 houses completely damaged whereas municipality B has only 1000, and still they have the same grand total.
This means that the original model treats the two municipalities as equally affected, even though the severity of the damage differs widely between them.
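Therefore we came up with two possible solutions: learning the completely damaged and the partially damaged houses separately, or learning a weighted sum of the two.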
From the first approach it became clear that the model learns more easily from the completely damaged houses. Our best model in this case reaches an accuracy of 80%, meaning that in 80% of the cases it correctly predicts the percentage damage class of the municipality (for example, a damage of 23% falls in the 20%-30% damage class).
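To make the accuracy figure concrete, here is a minimal sketch of how a class-based accuracy of this kind could be computed. It assumes equal classes 10 percentage points wide, which matches the 20%-30% example but is otherwise our assumption; the function names and the sample numbers are hypothetical and are not taken from the model code.

```python
def damage_class(percent_damaged, bin_width=10):
    """Return the (lower, upper) bounds of the damage class for a percentage.

    Assumption: classes are equal-width bins; 10 points wide by default.
    """
    if not 0 <= percent_damaged <= 100:
        raise ValueError("percentage must be between 0 and 100")
    # Keep exactly 100% in the last class instead of opening a new one.
    lower = min(int(percent_damaged // bin_width) * bin_width, 100 - bin_width)
    return (lower, lower + bin_width)


def class_accuracy(observed, predicted, bin_width=10):
    """Fraction of municipalities whose predicted class equals the observed class."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(
        damage_class(o, bin_width) == damage_class(p, bin_width)
        for o, p in zip(observed, predicted)
    )
    return hits / len(observed)


# Hypothetical percentages of damaged houses for five municipalities.
observed = [23, 5, 61, 48, 90]
predicted = [27, 12, 65, 41, 79]

print(damage_class(23))                     # (20, 30)
print(class_accuracy(observed, predicted))  # 0.6
```

Under this definition a prediction of 27% for an observed 23% counts as correct, while 12% for an observed 5% does not, even though the second error is smaller in absolute terms than some correct cases near a class boundary could be.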
On the other hand, when trying to predict the number of partially damaged houses, even the best model cannot exceed an accuracy of 53%.
This divide-and-conquer approach can be justified if the two types of damage are caused by different factors, which would make the sum of the two harder to predict. If this is the case, a model trained on the sum is trying to find two relations instead of one and therefore settles for an unsatisfactory middle ground.
The second approach instead tries to still combine the two types of damage, but in a more reasonable fashion. The best model that we can build using the total number of houses damaged (partially + completely) reaches only a 53% accuracy, whereas the best model with the weighted sum of the damage (partially/4 + completely) reaches a 70% accuracy.
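As a small illustration of the difference between the two targets, the sketch below applies both to municipalities A and B from the example earlier in the post (10000 damaged houses each, of which 9000 and 1000 completely damaged). The 1/4 weight is the one quoted above; the function names are hypothetical.

```python
PARTIAL_WEIGHT = 0.25  # from the text: partially/4 + completely


def total_damage(completely, partially):
    """Unweighted target: every damaged house counts the same."""
    return completely + partially


def weighted_damage(completely, partially, partial_weight=PARTIAL_WEIGHT):
    """Weighted target: a partially damaged house counts for a fraction."""
    return completely + partial_weight * partially


# Municipalities A and B: same grand total, different splits.
municipalities = {
    "A": {"completely": 9000, "partially": 1000},
    "B": {"completely": 1000, "partially": 9000},
}

for name, damage in municipalities.items():
    print(name, total_damage(**damage), weighted_damage(**damage))
# A 10000 9250.0
# B 10000 3250.0
```

With the unweighted total the two municipalities are indistinguishable, while the weighted sum ranks A well above B.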
In conclusion, it appears that even with the weighted sum of the two types of damage we cannot quite reach the accuracy of the model that learns from only the completely damaged houses. On the other hand, the weighted sum of the two types of damage should give a more comprehensive picture of the severity of the damage in a given municipality (for two municipalities with the same number of completely damaged houses, the one with more partially damaged houses should have a higher priority).
Since the last update we have added two algorithms that the model can select from (based on performance): a Gradient Boosting Tree (GBT) and a multi-layer perceptron (a neural network).
These two additions give us more options when choosing the best model. The GBT in particular seems to outperform the random forest when the available feature set is reduced.
We also started to explore the minimal set of features that still gives us a good prediction. This is important for a future deployment of the model in other regions, where many of our learning variables may not be available (roof and wall type, for example).
Our preliminary findings suggest that when we predict the weighted sum of the damaged houses (see previous section), a GBT model can do fairly well with only event-specific features.
Namely, a GBT algorithm with only information about wind speed, precipitation and the path of the typhoon [see note 2] reaches an accuracy of 70% (the same as the best model that uses the full set of learning variables), at the cost of a small decrease in the r2 score, which measures how well the model describes the variance of the data (from 0.75 to 0.73).
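For readers unfamiliar with the r2 score, the following is a minimal sketch of the standard coefficient of determination. We assume this is the definition behind the figures quoted above; the function name and sample values are for illustration only.

```python
def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    1.0 means the predictions explain all the variance in the observations;
    0.0 means they do no better than always predicting the observed mean.
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    mean_observed = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("r2 is undefined when all observations are equal")
    return 1 - ss_res / ss_tot


print(r2_score([1, 2, 3], [1, 2, 3]))  # 1.0 (perfect prediction)
print(r2_score([1, 2, 3], [2, 2, 2]))  # 0.0 (always predicting the mean)
```

Unlike the class accuracy, this score is computed on the continuous predictions, so it can move slightly even when the share of correctly predicted classes stays the same.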
The same seems to hold qualitatively when we use the other learning variables (see previous section). It is important to note that almost always only the GBT handles the reduced subset of variables reasonably well. More analysis is needed to understand why.
Note 2: the path of the typhoon is now used to compute the distance of the municipality from the path, the distance from the first impact and the projected distance from the first impact (assuming that the typhoon travels at constant speed, this should give an estimate of when the typhoon passed over the municipality, i.e. at the beginning or at the end of the typhoon).
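The three path features in note 2 can be sketched as follows. This is a simplified illustration, not the model's code: it assumes planar coordinates in kilometres and a path approximated by a single straight line from the first impact to a later point on the track, whereas a real typhoon track is a sequence of segments in latitude and longitude. All names are hypothetical.

```python
import math


def path_features(municipality, first_impact, later_track_point):
    """Return the three path-based features for one municipality.

    Assumptions: points are (x, y) tuples in km in a flat coordinate system,
    and the path is the straight line through the two track points.
    """
    px, py = municipality
    ax, ay = first_impact
    bx, by = later_track_point

    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        raise ValueError("the two track points must differ")
    ux, uy = dx / length, dy / length  # unit vector along the path

    rx, ry = px - ax, py - ay  # municipality relative to the first impact
    return {
        # Perpendicular distance from the municipality to the path line.
        "distance_to_path": abs(rx * uy - ry * ux),
        # Straight-line distance from the point of first impact.
        "distance_from_first_impact": math.hypot(rx, ry),
        # Distance travelled along the path before the typhoon is abreast of
        # the municipality; at constant speed this is proportional to time.
        "projected_distance": rx * ux + ry * uy,
    }


features = path_features((30, 40), first_impact=(0, 0), later_track_point=(100, 0))
print(features)
```

A small projected distance then corresponds to a municipality hit early in the typhoon's passage over land, and a large one to a municipality hit towards the end.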
To help with the analysis of the prediction error, we have published an interactive dashboard that gives insight into the error for different runs of the model. The results of the different models we tried are available here.
If you have any questions about our methodology, want to play around with the data, or have any suggestions for improvement, make sure to get in touch or leave a comment below.
An initiative of the Netherlands Red Cross. We want to shape the future of humanitarian aid by converting data into understanding, and put it in the hands of humanitarian relief workers, decision makers and people affected, so that they can better prepare for and cope with disasters and crises. Among our data scientists are many volunteers and their input to our work is highly appreciated.
Want to join us and have an impact in humanitarian aid through the use of data? Contact us.