Unavailable, unreliable, or incomplete household maps or lists are common challenges to epidemiologic research, particularly in resource-limited settings. The time, effort, and money required to develop or acquire such maps or lists have often compelled researchers to opt for variants of the random walk,1,2 but the outputs of such activities are not necessarily true probability-based samples.3,4 A probability sample must rely on a map or list from which households may be randomly sampled (the sampling frame).5 Satellite and aerial imagery that is available freely or at a modest price provides a possible alternative to traditional maps and lists. One may employ geospatial mapping of all potential residential-associated (PRA) roofs on satellite or aerial images within a chosen geographic area to develop a map for simple random sampling of households. This method has been used in resource-poor areas with inadequate maps, census data, and infrastructure, or for surveying vulnerable populations in insecure environments.6–16 It has required advanced mapping and spatial software, such as ArcGIS or AutoCAD, to identify PRA roofs in overhead images obtained using Quickbird, Google Earth, or IKONOS.6,7,9,12–14 To estimate coverage using overhead imagery, ground teams are often sent with global positioning system (GPS) devices to pinpoint the identified roofs on-site and verify their status as residences or non-residential structures.7–9,11–16 In two studies for which time frames were provided,7,13 residential structures made up approximately 95% of the roofs that were successfully located on the ground using overhead images <3 years old.
While such success in locating residential structures is encouraging, few studies have estimated the proportion of residential structures not captured in the images.16 Non-coverage arises when residential structures are not included in a particular sampling frame and have no chance of being sampled. Consequently, their absence may affect the representativeness of the study. When using overhead imagery, non-coverage likely occurs for several reasons and may vary by location if (1) new roofs have been added or existing roofs have been removed since the images were taken, (2) roofs have been obscured in the images (e.g., by foliage), (3) roofs are outside the designated PRA-roof size range or are thought to be non-residential and are not mapped, or (4) the image resolution is inadequate to distinguish multiple roofs (and therefore potentially multiple households) from single roofs (e.g., in high-density urban areas).
The extent of non-coverage bias associated with overhead imagery has not been fully explored. Therefore, we conducted a sub-study in the municipio of Nueva Santa Rosa in Guatemala in August–September 2010 to evaluate the non-coverage bias and sensitivity of an aerial photograph methodology used to generate the sampling frame for a larger cross-sectional survey of diarrhea and soil-transmitted helminthiasis prevalence and associated water, sanitation, and hygiene risk factors.17
METHODS
Site
Nueva Santa Rosa (NSR) is a municipio (municipality) in the departamento (state) of Santa Rosa, approximately 45 kilometers southeast of Guatemala City, the capital of Guatemala. NSR is a sparsely populated mountainous area with three main urban areas linked by paved roads; the remainder of NSR is served by dirt roads. In 2010, NSR had an estimated population of 31,044 people18 and 5,918 households,19 as determined by the Instituto Nacional de Estadística. Further vital statistic details for Nueva Santa Rosa in 2010 can be found at the following link (https://www.ine.gob.gt/sistema/uploads/2013/12/10/MVdhUf5YNLubC3ZikAABJekA0ettQNw1.pdf).
Maps
We obtained high-resolution, georeferenced, geometrically corrected (orthorectified) aerial photographs of NSR, taken in 2006, from Guatemala’s Instituto Geográfico Nacional. The photograph pixel resolution was approximately 0.16m2, at a scale of 1:10,000. We formatted all photographs with ArcGIS and overlaid them with the 2002 NSR census tract maps to identify the municipio boundaries. We created a grid of 200m x 200m cells (0.04km2) to cover NSR, overlaid a topographic map with GPS coordinates (Figure 1, panel A), then discarded all cells that did not touch the municipio (Figure 1, panel B).
Identifying potential residential-associated roofs
We defined PRA roofs as human constructions 16–150m2 that might be residential, similar to a previous study of household enumeration using overhead imagery, in which structures of 9–330m2 were selected for assessment.20 Based on investigator knowledge, we believed this was the most appropriate size range for NSR roofs that could represent residential structures such as entire houses, or buildings used for separate household functions such as kitchens, dining rooms, living rooms, and bedrooms. If part of a roof was hidden, we could still measure and categorize it as a PRA roof if two diagonal corners or three corners were visible. To distinguish multi-story buildings, we compared the length of the shadow cast by a building with the shadows of other buildings of similar roof size. Once identified, we marked PRA roofs with a red dot on the digitized aerial photographs and gave each dot a GPS coordinate. We excluded the following non-residential roofs (≥16m2) by shape or based on previous knowledge: churches, community halls, supermarkets, health centers, schools, government buildings, police stations, factories, warehouses, barns, wineries, and circuses. Digitization of the aerial photographs and manual identification, marking, and geo-location of the PRA roofs took 60 person-days of full-time work.
Sampling
We hypothesized that non-coverage bias varied by location. High-density urban NSR areas were generally organized into blocks or segments of compact structures and posed the greatest challenge in separating one roof from the next in the photographs. To ensure we evaluated these challenging urban areas, we sub-divided NSR by population density and chose to set our cut-off at a very high population density of ≥5,000 persons/km2. With a cell size of 0.04km2, an estimated average Santa Rosa household size of 4.8 persons per household,19 and an estimated average urban roof-to-household ratio of 1.25 based on a priori investigator opinions, we predicted that cells with very high-density populations would contain ≥52 PRA roofs. Using this cut-off, we categorized each cell containing ≥52 PRA roofs as a very-high-density (VHD) cell and the rest as non-VHD cells; empty cells were excluded from the sampling frame (Figure 1, panel C).
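The cut-off of 52 roofs can be reproduced from the figures given above. The minimal Python sketch below does only that arithmetic; the function and parameter names are hypothetical and are not part of the study's workflow, and Python is used purely for illustration.

```python
def roofs_per_cell_cutoff(persons_per_km2, cell_area_km2,
                          persons_per_household, roofs_per_household):
    """Expected number of PRA roofs in one grid cell at a given population density."""
    persons = persons_per_km2 * cell_area_km2        # people living in the cell
    households = persons / persons_per_household     # households in the cell
    return households * roofs_per_household          # roofs serving those households


# Values stated in the text: 5,000 persons/km2, 0.04 km2 cells,
# 4.8 persons per household, 1.25 roofs per household.
cutoff = roofs_per_cell_cutoff(5000, 0.04, 4.8, 1.25)
print(round(cutoff, 1))  # 52.1, i.e. a threshold of >=52 roofs per cell
```

Because the threshold scales linearly with each input, a different a priori roof-to-household ratio or household size would shift the cut-off proportionally.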
To achieve a fairly even distribution of PRA roofs between the two groups, we selected twice as many non-VHD cells as VHD segments to account for the higher density of roofs in the VHD segments. We randomly selected 10 non-VHD cells from a list of all non-VHD cells. VHD cells underwent a two-step selection process. First, we randomly selected five VHD cells from a list of all VHD cells. Next, we further divided these five VHD cells into blocks or segments, each containing approximately equivalent but <20 PRA roofs. We used natural divisions like roads or rivers to guide segmentation, where possible, and included the entire area within each VHD cell in the segmentation process (Figure 2). We randomly selected one segment to represent each of the five VHD cells. We could only evaluate 10 non-VHD cells and five VHD segments with the available time and manpower.
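The two-step selection described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure's actual implementation: the cell identifiers, the `toy_segmenter` helper, and the seed are all hypothetical, and in the study the segmentation of each selected VHD cell was done by hand along roads and rivers.

```python
import random


def two_stage_selection(non_vhd_cells, vhd_cells, segment_cell, seed=None):
    """Pick 10 non-VHD cells, then 5 VHD cells and one segment within each."""
    rng = random.Random(seed)
    non_vhd_sample = rng.sample(non_vhd_cells, 10)   # simple random sample of cells
    vhd_sample = rng.sample(vhd_cells, 5)            # first stage for VHD cells
    segment_sample = {}
    for cell in vhd_sample:
        segments = segment_cell(cell)                # second stage: split the cell
        segment_sample[cell] = rng.choice(segments)  # one segment represents the cell
    return non_vhd_sample, segment_sample


# Hypothetical identifiers, for illustration only.
non_vhd = [f"N{i}" for i in range(1, 201)]
vhd = [f"V{i}" for i in range(1, 21)]


def toy_segmenter(cell):
    """Stand-in for manual segmentation into blocks of fewer than 20 PRA roofs."""
    return [f"{cell}-seg{k}" for k in range(1, 5)]


cells, segments = two_stage_selection(non_vhd, vhd, toy_segmenter, seed=2010)
```

Only the selected VHD cells are segmented, which mirrors the order of operations in the text and avoids segmenting cells that are never visited.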
In the selected non-VHD cells and VHD segments, we marked all non-PRA roofs that were either too small (4m2–<16m2), too large (>150m2), or otherwise excluded (e.g., known to be a church) with green dots (Figure 3). No green dots were given to roofs <4m2 under the assumption that they were too small to be stand-alone houses and that their associated households would therefore be represented in the sampling frame by other, larger roofs.
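The marking rules can be summarized as a small decision function. The sketch below is a hypothetical encoding of the size thresholds stated in the text (the name `dot_colour` and its arguments are illustrative); it does not handle partially hidden roofs whose size could not be measured, and green dots were assigned only within the selected cells and segments.

```python
def dot_colour(area_m2, known_non_residential=False):
    """Return 'red' (PRA roof), 'green' (mapped non-PRA roof) or None (not mapped)."""
    if area_m2 < 4:
        return None        # too small to be a stand-alone house; not marked
    if known_non_residential:
        return "green"     # e.g. a church, excluded by shape or prior knowledge
    if area_m2 < 16 or area_m2 > 150:
        return "green"     # outside the 16-150 m2 PRA size range
    return "red"           # within the PRA size range


print(dot_colour(3), dot_colour(10), dot_colour(80), dot_colour(200))
# None green red green
```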
Ground work
Using personal digital assistants (PDAs) and copies of the aerial photographs, we attempted to locate red-dot and green-dot roofs by their GPS coordinates and positions on the photographs. On the ground, we generated GPS coordinates for roofs we found and to which we had access that had no corresponding images on the photographs (i.e., no-dot roofs), either because they were newly built since the photographs were taken or were obscured in the photographs.
Statistical analyses
Non-coverage was defined as one minus coverage, where coverage was the percent of residential roofs found during the ground work that were red-dot PRA roofs and therefore in the sampling frame. We compared residential proportions between non-VHD cells and VHD segments for green-dot roofs and no-dot roofs using Chi-square tests or Fisher’s exact tests, where appropriate. Because resource constraints prevented us from visiting all red-dot PRA roofs, we imputed the residential status (yes or no) of each non-visited red-dot PRA roof before estimating non-coverage and its confidence interval. Each imputation was a draw from a Bernoulli distribution in which the probability of being residential-associated was the estimated residential rate for the stratum where the non-visited roof was located (non-VHD cell or VHD segment), i.e., a weighted coin-flip approach. We then resampled with replacement the set of residences, both observed and imputed, and calculated the non-coverage proportion for that sample. This process of imputing and resampling was repeated 10,000 times to obtain an average non-coverage proportion and a 95% empirical confidence interval based on the 2.5th and 97.5th percentiles of the calculated non-coverage proportions, thus capturing both sampling variability and imputation variability within our confidence interval estimates. We compared the non-coverage proportions of non-VHD cells and VHD segments using the difference in resampled proportions, and estimated a 95% confidence interval using the 2.5th and 97.5th percentiles of the resampled distribution of that difference. We set statistical significance at 0.05 and performed the analysis using SAS 9.4 (SAS Institute Inc., Cary, NC) and R 4.0.5.21
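The impute-then-resample procedure can be sketched in a few lines of Python. This is an illustrative reconstruction, not the SAS or R code used for the analysis. The function name and seed are hypothetical, and the example inputs are the non-VHD counts reported in the Results (55 residential red-dot roofs, 53 missed residential roofs, 36 non-visited red-dot roofs). The residential rate of 55/66 is an assumption: it takes the rate to be the share of looked-for red-dot roofs (102 minus 36) that proved residential.

```python
import random


def noncoverage_bootstrap(n_res_framed, n_res_missed, n_unvisited, p_res,
                          reps=10_000, seed=0):
    """Mean non-coverage and 95% percentile interval from imputation plus resampling."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        # Weighted coin flip for each non-visited red-dot roof.
        imputed = sum(rng.random() < p_res for _ in range(n_unvisited))
        # 0 = residence in the sampling frame, 1 = residence missed by the frame.
        residences = [0] * (n_res_framed + imputed) + [1] * n_res_missed
        resample = rng.choices(residences, k=len(residences))  # with replacement
        estimates.append(sum(resample) / len(resample))
    estimates.sort()
    mean = sum(estimates) / reps
    lower = estimates[round(0.025 * reps)]
    upper = estimates[round(0.975 * reps) - 1]
    return mean, lower, upper


# Non-VHD stratum; p_res = 55/66 is an assumption (see the paragraph above).
mean_nc, lo_nc, hi_nc = noncoverage_bootstrap(55, 53, 36, 55 / 66)
print(f"{mean_nc:.3f} ({lo_nc:.3f}-{hi_nc:.3f})")
```

Under these assumptions the estimate should land near the reported non-VHD non-coverage; exact values depend on the random seed and on how the residential rate was actually defined. Re-imputing inside every replicate is what lets the interval reflect imputation uncertainty as well as sampling variability.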
We repeated these analyses using a different population density definition to re-classify cells from a very high density of ≥5000 persons/km2 to ≥1000 persons/km2, a figure some have used to define urban areas.22 This changed the proportion of roofs in high versus low-density cells to allow us to further evaluate the effect of population density on non-coverage.
Finally, we evaluated the sensitivity and specificity of the PRA-roof 16–150m2 size range used to generate the sampling frame. Because of the 4-year interval between the photographs and the fieldwork, we excluded no-dot roofs (i.e., roofs discovered in the field that were not identified in the photographs), many of which were likely new construction during that interval. An unbiased analyst retrospectively categorized green-dot roofs into “too large” (>150m2), “too small” (4m2–<16m2), or “size unknown” based on visual size estimates in comparison to red-dot PRA roof sizes, using the same aerial photographs used by staff who generated the original maps. The sensitivity of the residential size range used for classification was calculated as the percent of residential roofs within the size range, and the specificity was calculated as the percent of non-residential roofs outside of the size range; confidence intervals were calculated using the efficient-score method corrected for continuity.23
RESULTS
Non-coverage analysis
A total of 3,848 cells covered NSR across 144.3 km2. We identified 10,770 red-dot PRA roofs in NSR. The non-coverage sub-study included 193 PRA roofs: 102 in the 10 non-VHD cells and 91 in the five VHD segments (Table 1). One randomly selected non-VHD cell in a very remote mountainous area could not physically be reached; we replaced it with the next non-VHD cell in the sampling frame list.
We looked for a convenience sample of 122 (63.2%) of 193 red-dot PRA roofs and found 102 (83.6%), of which 87 (85.3%) were residentially associated; we did not look for 71 (36.8%) PRA roofs due to personnel and time constraints. Of 36 PRA roofs not visited in non-VHD cells, 29 (80.6%) were from two cells; similarly of 35 PRA roofs not visited in VHD segments, 31 (88.6%) were from two segments. The structure types for residential and non-residential buildings are listed in Tables 2 and 3, respectively.
We also mapped 102 green-dot roofs either 4m2–<16m2 or >150m2. We searched on the ground for all green-dot roofs and found 81 (79.4%). We found 159 no-dot roofs (Table 1). Of the 240 non-PRA roofs located in the field (81 green-dot roofs + 159 no-dot roofs), we determined 94 (39.2%) to be residential structures, which served a variety of functions and varied in sizes (Tables 4 and 5). Seven of these belonged to households that already had red-dot PRA roofs included in the sampling frame, based on homeowner confirmation. Of the remaining 87 roofs, 53 were in non-VHD cells and 34 were in VHD segments. These roofs were not known to be associated with households that had PRA roofs already in the sampling frame, and we assumed all 87 roofs belonged to separate households. Of these 87 missed residential structures, 30 (34.5%) were within the 16–150m2 range, 37 (42.5%) were <16m2, 11 (12.6%) were >150m2, and 9 (10.3%) were of unknown size. Correcting for the seven green-dot and no-dot roofs already associated with red-dot PRA roofs, 57.1% (32/56) of green-dot roofs were residential structures in non-VHD cells compared to 54.5% (12/22) of green-dot roofs in VHD segments (P=0.84); 23.6% (21/89) of no-dot roofs were residential structures in non-VHD cells, compared to 33.3% (22/66) of no-dot roofs in VHD segments (P=0.18).
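The two between-stratum comparisons can be checked from the counts in the preceding paragraph. The source does not say which test produced each P value; the sketch below assumes an uncorrected Pearson chi-square test on a 2x2 table, and the function name is illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(stat / 2))  # upper tail of chi-square with 1 df
    return stat, p_value


# Rows: non-VHD cells, VHD segments; columns: residential, non-residential.
_, p_green = chi_square_2x2(32, 56 - 32, 12, 22 - 12)   # green-dot roofs
_, p_no_dot = chi_square_2x2(21, 89 - 21, 22, 66 - 22)  # no-dot roofs
print(round(p_green, 2), round(p_no_dot, 2))  # 0.84 0.18
```

Both values agree with the reported P values, which suggests that no continuity correction was applied.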
We determined 87 red-dot PRA roofs to be residential, 63.2% (55/87) in non-VHD cells and 36.8% (32/87) in VHD segments, and found a total of 87 missed green-dot and no-dot residential structures, 60.9% in non-VHD cells and 39.1% in VHD segments. In addition, 71 red-dot PRA roofs were not visited so their residential status could not be determined. Using imputation and resampling techniques, we estimated the coverage proportion for non-VHD cells to be 61.6%, and the non-coverage proportion to be 38.4% (95% confidence interval, CI=30.4–46.8). The coverage proportion for VHD segments was 60.4%, and the non-coverage proportion was 39.6% (95% CI=29.1–50.6). The difference in non-coverage proportion between non-VHD cells and VHD segments was not statistically significant (Difference = 1.2%, 95% CI=-12.1 to 14.6).
Population density sensitivity analysis
To evaluate the effect of population density, we used an alternative high-density definition of ≥1000 persons/km2; five non-VHD cells were now considered high density. Of the 87 red-dot residential roofs, 82.8% (72/87) were in high-density cells and 17.2% (15/87) were in low-density cells. Of the 87 green-dot/no-dot residential roofs, 90.8% (79/87) were in high-density cells and 9.2% (8/87) were in low-density cells. All 71 PRA roofs that we did not visit were located in high-density cells. The estimated non-coverage proportion in high-density cells was 39.4% (95% CI=32.4–46.4), and the estimated non-coverage proportion in low-density cells was 34.8% (95% CI=17.4–52.2). The difference in non-coverage proportion between high- and low-density cells was not statistically significant (Difference = 4.8%, 95% CI=-16.6 to 24.8).
Roof size range sensitivity and specificity
We determined that 47 of 102 green-dot roofs either 4m2–<16m2 or >150 m2 were residential roofs, although three belonged to households that already had red-dot PRA roofs included in the sampling frame. Comparing these green-dot roofs to the 122 red-dot PRA roofs we looked for, 87 of which were residential, we calculated the sensitivity of the 16–150m2 size range used for red-dot PRA roofs to be 66.4% (95% CI=57.6–74.2) and the specificity to be 69.4% (95% CI=54.4–81.3).
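The sensitivity and specificity can be recomputed from counts given elsewhere in the Results. The cell counts below are a reconstruction, not a table from the source: 87 residential and 15 non-residential roofs among the 102 red-dot roofs found, and 44 residential (47 minus the three already linked to red-dot roofs) and 34 non-residential roofs among the 78 remaining green-dot roofs found. The interval function is the Wilson score interval with continuity correction, which is the efficient-score method named in the Methods; its name and signature here are illustrative.

```python
import math


def wilson_cc_interval(successes, n, z=1.959964):
    """95% Wilson (efficient-score) interval with continuity correction."""
    p = successes / n
    q = 1 - p
    denom = 2 * (n + z ** 2)
    lower = (2 * n * p + z ** 2 - 1
             - z * math.sqrt(z ** 2 - 2 - 1 / n + 4 * p * (n * q + 1))) / denom
    upper = (2 * n * p + z ** 2 + 1
             + z * math.sqrt(z ** 2 + 2 - 1 / n + 4 * p * (n * q - 1))) / denom
    return max(0.0, lower), min(1.0, upper)


# Reconstructed counts (assumptions, see the paragraph above).
res_in_range, res_out_of_range = 87, 44
nonres_in_range, nonres_out_of_range = 15, 34

sensitivity = res_in_range / (res_in_range + res_out_of_range)
specificity = nonres_out_of_range / (nonres_out_of_range + nonres_in_range)
sens_ci = wilson_cc_interval(res_in_range, res_in_range + res_out_of_range)
spec_ci = wilson_cc_interval(nonres_out_of_range,
                             nonres_out_of_range + nonres_in_range)
print(round(100 * sensitivity, 1), round(100 * specificity, 1))  # 66.4 69.4
```

With these counts the point estimates match the reported values, and the intervals should come out close to the reported ones, allowing for small rounding differences between software implementations.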
DISCUSSION
Using overhead images to build a sampling frame has several advantages. It permits simple random sampling of roofs and associated households, improving study power and reducing reliance on cluster sampling. Consequently, this methodology could eliminate the subjectivity and convenience bias associated with selecting households in the field. It also creates a sampling frame for subsequent research and surveillance activities and provides geographic coordinates for the exploration of spatial relationships. With increasing use of overhead imagery, one must consider how much is missed with the usage of that technology.
Our study also revealed disadvantages of overhead imagery, some of which may not be apparent from the way overhead imagery is presented and analyzed in the literature. Although we were able to find a roof on the ground for 102 of the 122 selected PRA roofs (84%), only 87 of the 122 (71%) were associated with residences. This proportion of roofs associated with residential structures was lower than the 94%–97% found in other studies.7,9,11 Some of this discrepancy was anticipated given the age of the photographs we used (4 years in our study vs. <3 years in others), which was a significant limitation of the study. However, it has been shown previously that images of that age (>4 years) may not introduce significant geographic bias when constructing a random sample of households.24 This provides further justification for using aerial photographs despite a gap between the time the images were captured and the time the ground study was conducted.
The high rates of identification reported in the literature may be over-estimates because they discount the residential structures not seen on the images and, therefore, not searched for in the field. Through our non-coverage assessment, we identified 87 non-PRA (green-dot/no-dot) residential roofs that were missed because of outdated photographs, new construction, and the specified PRA-roof size range. The sensitivity was 66.4% and specificity 69.4%. Therefore, future studies in Guatemala using overhead imagery might consider expanding both ends of the PRA-roof size range. Population density or urbanization did not seem to play a significant role in non-coverage error, as there were no statistical differences in the non-coverage error between non-VHD cells and VHD segments or between high- and low-density cells.
Our sampling frame was restricted by limits in photograph resolution, which could have resulted in bias against selecting smaller roofs. Failure to identify small roofs may have had a cascade of effects, such as potentially introducing a socioeconomic bias or reducing the chances of household selection (which was proportional to the number of roofs in the sampling frame). However, only 12% (5/41) of roofs <16m2 covered small houses versus stand-alone rooms (e.g., kitchens, bedrooms) (Table 4), suggesting that the resolution of the photographs may not have substantially biased against poor families who might only have afforded one dwelling with a single small roof versus more affluent families who could afford to have separate buildings for different household functions (e.g., one building for the kitchen, a separate building for the bedrooms, etc.).
Resource constraints restricted our time and choices, resulting in several study limitations. As noted, the age of the photographs at the time of ground work (4 years) was a significant limitation. In addition, although the cells and segments were selected at random, the numbers of each were set based on limited resources and an unmeasured assumption that roof density would be twice as great in urban versus rural areas. This may have introduced bias towards urban or rural areas, although no significant difference was observed. We also imputed data for 71 PRA roofs we could not search for on the ground. There were 94 green-dot and no-dot residential roofs found on the ground; seven were associated with PRA roofs already in the sampling frame, but there were no resources or time left to interview the owners of the other 87 structures. Those interviews could have changed the numbers used in our calculations, potentially decreasing non-coverage and altering the sensitivity and specificity values. Nevertheless, quantifiable data were generated from this sub-study from which other researchers can draw conclusions and lessons when designing their own sampling frames using overhead imagery.
Finally, time and cost were important considerations. Geocoding of roofs has the advantage that field staff are not required to be present at the study site to perform the initial household identification and sampling, although overhead imagery is needed. The GPS coordinates can be used by interviewers to navigate to the correct destinations, which reduces the bias towards selecting and visiting households with good access to roads. Although it took 60 person-days of full-time work to digitize photographs and geocode red-dot roofs, it was possible to initiate this process five months before any interviewer went to the field. Because manual photograph digitization was a limitation, it should be noted that more advanced methods for object detection and pixel classification can now be handled via geospatial deep learning within ArcGIS.25 Artificial intelligence (AI) could be applied to reduce the amount of time spent geocoding roofs, thereby enhancing our approach for estimating non-coverage. In addition, although aerial photography was used for this study, future researchers could consider quantifying non-coverage with satellite imagery. Households for the larger cross-sectional survey were selected before the pilot study began, which greatly facilitated logistical planning for the study. It also reduced costs compared to sampling in the field, which would have required our staff of 20 interviewers, supervisors, drivers, and vehicles to extend their time in the field because of the time needed for this activity and the time needed to travel across NSR. This study and its use of GIS highlight the need to incorporate this technology to enhance public health approaches in low-income countries.
CONCLUSIONS
To our knowledge, this is the first study providing data to assess non-coverage bias associated with the use of overhead images to develop sampling frames for public health research. In this sub-study evaluating the use of 4-year-old aerial photographs to generate a simple random sample of PRA roofs, the non-coverage proportion was 38.4% for non-VHD cells and 39.6% for VHD segments. The sensitivity for the PRA-roof size definition of 16–150m2 was 66.4% with a specificity of 69.4%. Although these values are less than ideal, we conclude that there appears to be no substantial bias in coverage with regard to population density or socioeconomic status. Therefore, we believe we can proceed with analyzing other study results based on data relying on random sampling from this sampling frame without concern about significant systematic sampling bias.
We have presented just one of what we believe are multiple protocols whereby overhead imagery can be used to develop a multi-use sampling frame in resource-poor regions. This study is not a definitive evaluation of one such method but rather we hope it will stimulate further assessments of non-coverage bias in a variety of locations and situations to make these increasingly popular aerial-imagery methodologies more statistically robust and generate geographically based recommendations and best practices for identifying residential structures on overhead images.
ACKNOWLEDGEMENTS
We thank the many people involved in organizing and carrying out this study, including the Guatemala Ministry of Public Health and Social Welfare in Guatemala City and Nueva Santa Rosa; the leadership in the city and towns of Nueva Santa Rosa; Allen Hightower formerly of the Centers for Disease Control and Prevention; and most especially the gracious families in Nueva Santa Rosa who opened their homes for this survey and whose participation made this evaluation possible. We would also like to thank Kelly Squires for her assistance with secondary data analyses.
ETHICS APPROVAL
The protocol for this sub-study was embedded within the protocol for a larger cross-sectional study and this larger protocol was submitted for ethical review. The protocol was approved by the Ethics Committee of the Universidad del Valle de Guatemala (UVG) in Guatemala City, Guatemala (protocol 038-04-2010, approval date July 19, 2010) and the Institutional Review Board of the Centers for Disease Control and Prevention (CDC) in Atlanta GA, USA (protocol 5936, approval date June 18, 2010).
FUNDING
This project was a collaboration between the Centro de Estudios en Salud, Universidad del Valle de Guatemala, the Centers for Disease Control and Prevention Regional Office for Central America and Panama, specifically the International Emerging Infections Program, and the Guatemala Ministry of Public Health and Social Welfare. This study was funded by the Centers for Disease Control and Prevention (CDC) Global Disease Detection Program.
AUTHORSHIP CONTRIBUTIONS
KL, PJ, and SR conceived of the study; AT, BL, FM, GL, KL, LR, MA, PJ, SR, VC, and WA designed the study protocol; FM, GL, and JR developed the aerial image maps and sampling frames and managed the data; AT, BL, FM, GL, JP, KL, LR, MA, PJ, and VC carried out the field work; JS analyzed and interpreted the data for this sub-study with input from AT, FM, GD, SR, and VC; JS and SR drafted this manuscript with input from AT, FM, GD, and VC; all authors read and approved the final manuscript. GD, JS, and SR are the guarantors of the paper.
COMPETING INTERESTS
The authors completed the Unified Competing Interest form at www.icmje.org/coi_disclosure.pdf (available upon request from the corresponding author), and declare no conflicts of interest.
Correspondence to:
Jeffrey M. Switchenko, Ph.D.
Address: 1518 Clifton Road NE, Atlanta GA, 30322;
Tel: 404-778-4157;
Email: [email protected]