A priority index for earthquakes

Identifying aid priority areas by learning from past events

In the first few days following a severe earthquake, reliable information about the impact on local communities is typically scarce. In this situation, decision makers in the humanitarian aid sector struggle to get an overview of how the severity of the impact is spatially distributed. Time and relief resources are limiting factors, so they look for an optimal schedule to effectively target the most affected communities. By learning from data on past events, Priority Index Models can rapidly produce an estimate of a disaster’s impact, which can help decision makers identify aid priority areas.

You might have read our blog posts about the initial and improved priority index model for typhoons. In the meantime, we have started to develop a similar model for seismic hazards. In the first blog post we highlighted the importance of impact information covering the full affected area for successfully identifying priority areas. During the first days following a natural disaster, when the impact has not yet been assessed in the field, a rapidly produced impact estimate can serve as a stopgap to support important aid coordination decisions.

Additionally, we believe that maps in general are a very powerful medium. Estimated impact maps can help aid workers form a mental image of the situation they work in, which can help them distribute relief resources more effectively.

The weighting applied in current earthquake severity index models is often subjective, which can reduce the accuracy of their output. By learning from past events, priority index models enable empirically based decision support.

A Case Study for Nepal

On 25 April 2015, a 7.8 Mw earthquake hit the central northern part of Nepal. Because of the extensive amount of assessment data produced after this event, it was selected as the first case on which to build an earthquake priority index model.

During the search for relevant model input data, an important requirement was availability at a low geographic administrative level, in order to distinguish different levels of severity within the most affected area, and not only between heavily and moderately affected areas. This would preferably be administrative level 4, which is the lowest level possible. At this level there are around 3,150 VDCs (Village Development Committees) in Nepal (see figure below)1. Output at this finer resolution was considered highly preferable over administrative level 3 (district).

Administrative levels 3 (District) and 4 (VDC) in Nepal.


Open data on 27 potential predictor variables was collected, mainly from the national population census (2011) and through platforms such as the Humanitarian Data Exchange. The predictor variables were divided over four categories: hazard predictors (mean macroseismic intensity derived from USGS ShakeMaps), exposure predictors (total population and population density), physical vulnerability predictors (slope and various foundation, wall and roof building materials) and socio-economic vulnerability predictors (literacy rate, household size, school attendance, drinking water source quality and toilet presence).

Concerning the response variable, we looked for data that could accurately indicate the distinct levels of aid-neediness across multiple geographical entities. Damage to residential buildings was selected as the most suitable response variable, because it is a relatively objective measure and it is relevant for multiple aid clusters (Shelter, WASH, Health and Food Security). An Initial Rapid Assessment performed by volunteers of the Nepal Red Cross Society provided damage numbers for 517 VDCs in the most affected area.2 This type of assessment is typically done by having volunteers assess and report impact soon after a new rapid-onset crisis, to provide an overview of how the population has been affected.

Number of completely damaged houses

Model Fitting and Validating

Different statistical models with different response variables were compared in order to provide solid argumentation for model decisions. Because of its earlier success in the typhoon priority index model, we applied a random forest regression model; the typhoon Haima priority index blog post explains more about this machine learning method. Additionally, a multivariate linear regression model was fitted to the data to gain insight into the nature of individual indicators’ relations to the response variables. Two response variables were tested: the absolute number of completely damaged houses, and a composite damage variable referred to as the house damage factor:

house damage factor = 0.75 × (completely damaged houses) + 0.25 × (partially damaged houses)
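As a minimal sketch (the damage counts and column names below are hypothetical), the composite variable can be computed directly from a damage table:

```python
import pandas as pd

# Hypothetical damage counts per VDC (column names are illustrative)
damage = pd.DataFrame({
    "vdc": ["A", "B", "C"],
    "completely_damaged": [1200, 300, 50],
    "partially_damaged": [400, 900, 20],
})

# Composite response variable: weighted sum of the two damage categories
damage["house_damage_factor"] = (
    0.75 * damage["completely_damaged"] + 0.25 * damage["partially_damaged"]
)
```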


Statistically, the random forest model predicting the composite variable of partially and completely damaged houses performed best, with an R-squared of 0.63 on an independent test dataset. However, despite its lower accuracy, we favour the random forest model predicting only completely damaged houses, because its output is more intuitive: a non-composite measure is easier for aid workers to interpret. Another reason to favour this model is that its output can be extended; for example, dividing it by the total number of houses yields the relative damage per VDC, which is not possible with a composite measure. The R-squared of this model was not much lower, at 0.60. Two-thirds of the highest-priority areas were identified correctly (VDCs were divided over five equal-sized priority levels). The standard deviation of the unexplained variance (root mean squared error) was 626 houses per VDC.3
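The fit-and-validate workflow above can be sketched with scikit-learn. The data below is a synthetic stand-in for the real predictor matrix and damage counts (which are not reproduced in this post), so the resulting scores are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the 27-predictor matrix (one row per VDC)
X = rng.normal(size=(500, 27))
# Synthetic response: absolute number of completely damaged houses
y = np.abs(300 * X[:, 0] + 150 * X[:, 1] + rng.normal(scale=100, size=500))

# Hold out an independent test set, as described in the post
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

pred = rf.predict(X_test)
print("R-squared:", round(r2_score(y_test, pred), 2))
print("RMSE (houses):", round(mean_squared_error(y_test, pred) ** 0.5, 1))
```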

By interpreting the coefficients resulting from the multivariate linear regression model, we found that some relationships in the model might actually be unique to this particular case or area. More heavily damaged VDCs correlated with better-quality building foundations, higher toilet presence and higher school attendance.

The graph below shows the increase in mean squared error, reported on a 0% to 100% scale, when the predictor variable of interest is permuted (randomly shuffled).

Relative importance plot of the favoured random forest regression model. (Variable codes: thatch_roof = number of households with thatch/straw roofs; mi = macroseismic intensity; mud_found = households with mud-bonded brick/stone foundations; mud_wall = households with mud-bonded brick/stone outer walls; pop = population; tap_water = percentage of households with tap water as their main drinking water source; tile_roof = households with tile/slate roofs; galv_roof = households with galvanized iron roofs; bamboo_wall = households with bamboo outer walls; slope = mean slope value (%); cem_wall = households with cement-bonded brick/stone outer walls; school = school attendance among 5–25-year-olds (%); wood_found = households with wooden pillar foundations; rcc_roof = households with RCC roofs; literacy_rate = literacy rate; no_toilet = percentage of households without a toilet facility; pop_dens = population density; cem_found = households with cement-bonded brick/stone foundations; wood_wall = households with wood/plank outer walls; rcc_found = households with RCC-with-pillar foundations; wood_roof = households with wood/plank roofs; hhsize = average household size; unbaked_wall = households with unbaked brick outer walls; mud_roof = households with mud roofs.)
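The permutation procedure described above (shuffle one predictor, measure how much the error grows) can be reproduced with scikit-learn's permutation_importance. The data here is synthetic, with signal deliberately placed in the first two predictors, so the ranking is illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)

# Synthetic data: only the first two of five predictors carry signal
X = rng.normal(size=(400, 5))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=400)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Shuffle each predictor in turn and record the resulting increase in MSE
result = permutation_importance(
    rf, X, y, scoring="neg_mean_squared_error", n_repeats=10, random_state=0
)

for i, imp in enumerate(result.importances_mean):
    print(f"predictor {i}: MSE increase {imp:.2f}")
```

Predictors carrying no signal barely change the error when shuffled, which is why they end up at the bottom of an importance plot like the one above.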

The mean macroseismic intensity (mi) and total population (pop) were among the most important predictors in all models and are therefore considered indispensable model components. The importance of the population variable was expected, since the model predicts the absolute number of houses damaged per VDC, which is logically closely related to the total number of inhabitants per VDC. The high importance of the macroseismic intensity variable proves that it is a good quantitative indicator for seismic hazards. Individual building material variables also scored relatively high importance, which stresses the importance of collecting and preparing datasets on building quality in earthquake-prone areas.

Visualization of model prediction for the complete study area

Would the model apply to other events and areas?

At this point, with the model trained on one case only, it is unlikely that it can produce useful output for a future event in another country. This is mostly due to the presence of case- and country-specific relationships in the model, such as the positive relations between school attendance and damage, or toilet presence and damage. By training the model on more cases in different environments, these relationships can be eliminated over time. Besides that, we found that universality could be improved by excluding secondary hazard susceptibility variables, finding an alternative uniform socio-economic vulnerability variable and using composite building quality variables.

Keeping model input data simple seems to be the only way to create a model that can produce useful output for seismic events in different areas around the world. Apart from that, the successful application of priority index models in general stands or falls with data preparedness. Data collection and pre-processing are time-consuming tasks. Therefore, to be able to run the model within the first hours after initial impact, it is necessary to have data on as many predictor variables as possible prepared in a structured data matrix. Data on the earthquake parameters themselves can, of course, only be added after initial impact.

What is New?

An important change compared to our typhoon models is that we decided to no longer use a composite variable as the response variable; instead, we favoured the number of completely damaged houses. Besides the reasons mentioned above, this also avoids reliability issues with the reported numbers for ‘partially damaged’ houses, which can differ based on individual judgement of what counts as partially damaged.

What lies ahead?

The results of the model are quite promising, and we expect that a more universally applicable model can be built on it. In fact, the model is already operationally useful for the study area in Nepal. We are currently training the model on events in different places around the world; this will show whether the model can indeed produce useful estimates for aid decision-makers in a post-earthquake situation.

We believe that rapid impact estimation models trained on data from past events can support the humanitarian aid sector in the near future, and we will continue to research and develop them. For the details of the earthquake priority index model and the associated research, we refer you to the thesis.



2 Nepalese Red Cross Society, 2015a. Initial Rapid Assessment.

3 For more details on the predictive accuracy of the model, see the thesis.

Outliers and missing data in datasets

In the previous blog post on data verification, see: [link], we mentioned the need to identify outliers in the data. This blog post looks at some of the datasets and formats used in humanitarian aid, the techniques we apply to identify outliers, and how to deal with them effectively. The remainder of the post focuses on the occurrence of missing data and possible reasons for it.


Outliers are data points whose values are much lower or much higher than the rest of the data points. We need to identify them, as they may affect predictive accuracy and model fit when applying simple or multivariate regression analyses. If there are many outliers among the higher values, the model will likely underestimate them; similarly, if there are many outliers among the lower values, the model may overestimate them.

Outliers can be classified as either “valid” or “invalid”, depending on the underlying cause(s). For example, in the case of a survey, an observation may have been entered incorrectly, or numerical values may have been inserted where descriptive text was expected. Other causes may be the inconsistent use of zero or “non-applicable”. If a CSV file (“comma-separated values”) or a TSV file (“tab-separated values”) does not have the proper encoding (e.g. UTF-8), then rendering its contents on a UTF-8-encoded website may result in specific characters being replaced by black squares or question marks. As a side effect of these encoding errors, table entries may be shifted and end up in the wrong columns. If the table is “scraped” from a website using scripting, the encoding errors will also appear in the downloaded data.
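As a small illustration of the encoding pitfall (the file contents and place names below are made up), declaring the file's actual encoding before parsing avoids the replacement characters:

```python
import csv
import io

# A CSV exported with Latin-1 encoding; "São" and "José" contain non-ASCII bytes
raw_bytes = "vdc,pop\nSão Paulo,100\nJosé,200\n".encode("latin-1")

# Wrong: decoding with a mismatched codec turns those bytes into U+FFFD (�)
garbled = raw_bytes.decode("utf-8", errors="replace")

# Right: declare the actual encoding, then parse
rows = list(csv.reader(io.StringIO(raw_bytes.decode("latin-1"))))
print(rows[1][0])  # São Paulo
```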

Identifying outliers

In the case of large structured datasets such as surveys, manually identifying outliers is cumbersome. To accelerate the identification, we use a number of methods:

  • Visual inspection methods, such as histograms (Figure 1) or box-and-whisker plots (Figure 2), help us to ‘see’ and interpret the distribution of the data.
  • Non-visual inspection methods, using spreadsheet functions or scripts, help us to identify empty entries, inconsistencies between numerical values and their textual entries, etc.
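The box-and-whisker rule in Figure 2 can also be applied non-visually. A minimal sketch using the common 1.5 × IQR whisker criterion (the threshold is a widely used convention; the damage counts below are made up):

```python
import statistics

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], the usual whisker rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# Hypothetical 'completely damaged houses' counts; one VDC is far off
damaged_houses = [12, 15, 14, 13, 16, 12, 400, 15, 14, 13]
print(iqr_outliers(damaged_houses))  # the extreme VDC stands out
```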

Figure 1: An example of a histogram, showing the number of completely damaged houses due to the Gorkha earthquake in Nepal.

Figure 2: An example of a box-and-whisker plot showing outliers below the “minimum value” mark.

Handling outliers

Once the outliers have been identified, the next step is to determine what to do with them:

  • Retaining outliers that appear to be valid data.
  • Replacing outliers with a known (or derived) entry from related datasets.
  • Deleting outliers from the dataset.

In the early stages after a natural disaster, when detailed data is still scarce, humanitarian aid organizations often publish high-level data, which then gets updated over time. As more data becomes available, we can use triangulation: outliers are retained if different sources report similar or equal values, and removed if the values differ across two or more of the datasets.
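A sketch of such a triangulation rule (the sources, counts and the 10% tolerance are all hypothetical): a value is retained when two sources agree within a tolerance, and dropped otherwise:

```python
# Hypothetical damage counts per VDC from two independent sources
source_a = {"VDC_A": 1200, "VDC_B": 300, "VDC_C": 55}
source_b = {"VDC_A": 1180, "VDC_B": 900, "VDC_C": 54}

def triangulate(a, b, tolerance=0.1):
    """Retain entries where both sources agree within a relative tolerance."""
    retained = {}
    for key in a.keys() & b.keys():
        if abs(a[key] - b[key]) <= tolerance * max(a[key], b[key]):
            # Sources agree: keep the averaged value
            retained[key] = round((a[key] + b[key]) / 2)
    return retained

print(triangulate(source_a, source_b))  # VDC_B is dropped: the sources disagree
```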

Missing data

In other instances, values may be missing from the dataset, either intentionally (subjects refusing to answer survey questions) or unintentionally (data was corrupted, or subjects were no longer available to complete a survey). Assessing the extent to which missing data impacts further analyses starts with determining the type of missingness:

  • Missing Completely At Random (MCAR): data are missing independently of both observed and unobserved data. An example of this would be: entire surveys that, at random, were not submitted, leading to missing values.
  • Missing At Random (MAR): given the observed data, data are missing independently of unobserved data. An example of this would be: collecting data about a subject’s profession where it is known that certain professions are more likely not to share their income. Within subgroups of the profession, missing incomes will be random.
  • Missing Not At Random (MNAR): missingness is related to the values of the unobserved data themselves. An example of this would be: people with a low income are less likely to report their income on a data collection form.

We can ignore missing data (i.e. omit the missing observations) if the data are MCAR or MAR.
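Omitting missing observations amounts to complete-case analysis, i.e. dropping any row with a missing value. A minimal pandas sketch with hypothetical survey columns (mean imputation is shown as a common alternative, not a recommendation from this post):

```python
import numpy as np
import pandas as pd

# Hypothetical survey with missing income and household-size entries
survey = pd.DataFrame({
    "municipality": ["A", "B", "C", "D"],
    "income": [25000, np.nan, 31000, np.nan],
    "hh_size": [4, 5, np.nan, 3],
})

# Complete-case analysis: keep only fully observed rows
complete = survey.dropna()

# Alternative: impute a column with its observed mean before modelling
survey["income_imputed"] = survey["income"].fillna(survey["income"].mean())
```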

In a 2016 survey on a competitiveness index for municipalities in the Philippines, only 1,245 out of more than 1,600 municipalities were ranked, and no reason was given for the 400 or so missing municipalities. In such instances, it is advisable to reach out to the researchers to understand why not all data was published.

After the Gorkha earthquake in Nepal, the number of damaged houses was reported at the lowest administrative level (level 4, the Village Development Committee). Each VDC was associated with an identification label. The p-coding of the document was done automatically by an algorithm searching for matching letters in the label. Unfortunately, as administrative borders in Nepal are rather dynamic, several VDCs did not have an associated p-code.

Figure 3: A visualisation of the number of completely damaged houses in Nepal due to the Gorkha earthquake.


Our champions

510 is an initiative of the Netherlands Red Cross. We want to shape the future of humanitarian aid by converting data into understanding and putting it in the hands of humanitarian relief workers, decision makers and affected people, so that they can better prepare for and cope with disasters and crises. Among our data scientists are many volunteers, and their input to our work is highly appreciated.

Want to join us and have an impact in humanitarian aid through the use of data? Contact us.

Netherlands Red Cross, a humanitarian aid organization.