In the previous blog post on data verification ([link]), we mentioned the need to identify outliers in the data. This post looks at some of the humanitarian aid datasets and formats, the techniques we apply to identify outliers, and how to deal with them effectively. The remainder of the post focuses on the occurrence of missing data and possible reasons for it.
Outliers are data points whose values are much lower or much higher than the rest of the data. We need to identify them because they can harm predictive accuracy and model fit in simple or multivariate regression analyses. If there are many outliers among the higher values, the model is likely to underestimate those values; similarly, if there are many outliers among the lower values, the model can overestimate them.
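The effect can be illustrated with a tiny synthetic example (invented numbers, not real survey data): fitting a least-squares line to five points, one of which is replaced by an extreme high value, pulls the slope upward yet still leaves the extreme point underestimated.

```python
def ols(xs, ys):
    """Simple least-squares fit y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

xs = [1, 2, 3, 4, 5]
ys_clean = [2, 4, 6, 8, 10]    # exactly y = 2x
ys_outlier = [2, 4, 6, 8, 50]  # one extreme high value

print(ols(xs, ys_clean))    # (0.0, 2.0)
print(ols(xs, ys_outlier))  # (-16.0, 10.0): slope pulled up, yet the fit at
                            # x=5 predicts 34, still well below the actual 50
```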
Outliers can be classified as either “valid” or “invalid”, depending on the underlying cause(s). In a survey, for example, an observation may have been entered incorrectly, or numerical values may have been inserted where descriptive text was expected. Other causes are the inconsistent use of zero or “not applicable”. If a CSV (“comma-separated values”) or TSV (“tab-separated values”) file is not stored in the expected encoding (e.g. UTF-8), rendering its contents on a UTF-8 encoded website may result in specific characters being replaced by black squares or question marks. As a side effect of such encoding errors, table entries may be shifted and end up in the wrong columns. If the table is then “scraped” from the website using scripting, the encoding errors carry over into the downloaded data.
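One pragmatic way to avoid such mojibake is to try a short list of likely encodings before parsing a file. A minimal sketch (the list of candidate encodings is an assumption; in practice it depends on where the data came from):

```python
import tempfile

def read_text_robust(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Try a list of likely encodings; return the text and the encoding that worked."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")

# Demo: a file saved as cp1252 ("café") is not valid UTF-8, so the
# first attempt fails and the cp1252 fallback succeeds.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write("caf\u00e9".encode("cp1252"))

text, used = read_text_robust(tmp.name)
print(text, used)  # café cp1252
```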
In the case of large, structured datasets such as surveys, manually identifying outliers is cumbersome. To accelerate the identification we use a number of (semi-)automated methods, such as summary statistics, box plots, and statistical rules based on the interquartile range or standard deviation.
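As a minimal sketch of one such rule (not the exact pipeline we use), Tukey's interquartile-range rule flags values far outside the middle 50% of the data:

```python
def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # simple quantile estimate via linear interpolation
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Household sizes with one implausible entry
print(iqr_outliers([10, 12, 11, 13, 12, 95]))  # [95]
```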
Once the outliers have been identified, the next step is to determine what to do with them: correct them where the error is clear, remove them if they are invalid, or retain them if they represent valid observations.
In the early stages after a natural disaster, when detailed data is still scarce, humanitarian aid organizations often publish high-level data that is then updated over time. As more and more data becomes available, we can also use triangulation: outliers are retained if different sources report similar or equal values, and removed if the values differ across two or more of the datasets.
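The triangulation rule above can be sketched as follows (the district names and counts are hypothetical):

```python
def triangulate(reports, tol=0):
    """Retain a location's value only when independent sources agree
    (within tol); drop it when the datasets disagree."""
    retained = {}
    for location, values in reports.items():
        if len(values) >= 2 and max(values) - min(values) <= tol:
            retained[location] = values[0]
    return retained

# Hypothetical damaged-house counts, one value per source
reports = {
    "District A": [120, 120],  # sources agree -> retained
    "District B": [95, 400],   # sources conflict -> dropped
}
print(triangulate(reports))  # {'District A': 120}
```

A tolerance (`tol`) greater than zero lets "similar" rather than strictly equal values count as agreement.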
In other instances, values may be missing from the dataset, either intentionally (subjects refusing to answer survey questions) or unintentionally (data got corrupted, or subjects were no longer available to complete a survey). Assessing the impact of missing data on further analyses starts with determining the type of missingness:
- Missing Completely At Random (MCAR): the probability of a value being missing is unrelated to both observed and unobserved data.
- Missing At Random (MAR): the probability of a value being missing depends only on other, observed values.
- Missing Not At Random (MNAR): the probability of a value being missing depends on the unobserved value itself.
If the data are MCAR, we can simply ignore the missing data (i.e. omit the incomplete observations) without biasing the results; under MAR, omitting observations can introduce bias, and techniques such as multiple imputation are generally preferred.
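Omitting incomplete observations (listwise deletion) is straightforward; a minimal sketch with hypothetical survey records:

```python
def listwise_delete(rows, required):
    """Drop any record with a missing value in a required field
    (unbiased only under MCAR)."""
    return [r for r in rows if all(r.get(f) is not None for f in required)]

rows = [
    {"hh_id": 1, "size": 4, "income": 1200},
    {"hh_id": 2, "size": None, "income": 900},  # missing household size
    {"hh_id": 3, "size": 5, "income": None},    # income question refused
]
print(listwise_delete(rows, ["size", "income"]))  # keeps only hh_id 1
```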
In a 2016 survey on a competitiveness index for municipalities in the Philippines, only 1,245 out of more than 1,600 municipalities were ranked, and no reason was given for the roughly 400 that were missing. In such instances it is advisable to reach out to the researchers to understand why not all data was published.
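Before reaching out, it helps to know exactly which records are absent. A simple set difference against a master list does the job (the municipality names below are placeholders, not the survey's actual gaps):

```python
# Hypothetical master list vs. municipalities present in the published ranking
master = {"Quezon City", "Davao City", "Cebu City", "Baguio"}
ranked = {"Quezon City", "Cebu City"}

missing = sorted(master - ranked)
print(missing)  # ['Baguio', 'Davao City']
```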
After the Gorkha earthquake in Nepal, the number of damaged houses was reported at the lowest administrative level (level 4, the Village Development Committee). Each VDC was associated with an identification label, and the p-coding of the document was done automatically by an algorithm searching for matching letters in the label. Unfortunately, as administrative borders in Nepal are rather dynamic, several VDCs ended up without an associated p-code.
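A letter-matching step like the one described can be sketched with the standard library's `difflib`; the VDC labels and p-codes below are illustrative placeholders, not the real lookup table:

```python
import difflib

# Hypothetical VDC label -> p-code lookup (real P-codes differ)
pcodes = {"Barpak": "NP-EX-001", "Laprak": "NP-EX-002"}

def match_pcode(label, lookup, cutoff=0.8):
    """Match a (possibly misspelled) VDC label to a known p-code,
    or return None when no label is similar enough."""
    hits = difflib.get_close_matches(label, lookup.keys(), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None

print(match_pcode("Barpaak", pcodes))  # close spelling still resolves
print(match_pcode("Unknown VDC", pcodes))  # None -> flag for manual review
```

VDCs whose labels fall below the similarity cutoff (for instance after a boundary change) come back as `None`, which mirrors the manual follow-up the automatic matching required.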
An initiative of the Netherlands Red Cross. We want to shape the future of humanitarian aid by converting data into understanding, and put it in the hands of humanitarian relief workers, decision makers and people affected, so that they can better prepare for and cope with disasters and crises. Among our data scientists are many volunteers and their input to our work is highly appreciated.
Want to join us and have an impact in humanitarian aid through the use of data? Contact us.