Variables and Data Sources
The dependent variable for this study was the rate of overweight ninth grade students in California public schools in 2007. The data were obtained from the California Department of Education, which administers a physical fitness test (FITNESSGRAM®) between February and May of each year. The test is required by state law for public school students in 5th, 7th and 9th grade, and in 2007 approximately 90% of schools participated [28]. The test reports students' body composition, as measured by skinfold (preferred method), body mass index (weight in kilograms divided by height in meters squared), or bioelectric impedance analyzers. These three measurement options are provided to ensure broad participation, and while all are subject to error (typically 3 to 6%), comparisons of different methods report high test-retest reliability [29, 30]. The definition of overweight differs slightly from others, such as that of the Centers for Disease Control and Prevention. Classification in this category is determined by criterion-referenced gender-, age- and test-specific cut-offs (e.g. skinfold >32 for a 15 year old female; body mass index >24.5 for a 14 year old male). These standards were established by a national advisory panel convened by the Cooper Institute of Dallas, Texas [29].
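As a rough illustration of the body mass index formula and a criterion-referenced cut-off check, the following Python sketch may help. The cut-off table is hypothetical and holds only the two example values quoted above; the function names are ours, and the rule that a value above the cut-off counts as overweight is an assumption, not a description of the FITNESSGRAM software.

```python
# Hypothetical cut-off table holding only the two examples quoted in the text.
# Keys are (measure, sex, age); values are the cut-off for that group.
CUTOFFS = {
    ("skinfold", "F", 15): 32.0,  # skinfold-based measure, units as reported in the text
    ("bmi", "M", 14): 24.5,       # kilograms per square meter
}


def body_mass_index(weight_kg, height_m):
    """Weight in kilograms divided by height in meters squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def is_overweight(measure, sex, age, value):
    """Assumption: a value above the cut-off places the student in the overweight category."""
    return value > CUTOFFS[(measure, sex, age)]


# Made-up student: 70 kg and 1.65 m gives a BMI of roughly 25.7.
bmi = body_mass_index(70.0, 1.65)
flag = is_overweight("bmi", "M", 14, bmi)
```

In this made-up case the BMI is above the 24.5 cut-off for a 14 year old male, so `flag` is true.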
Our analysis is at the school level because the physical fitness test results are publicly available only at this level of aggregation; individual results are confidential. The original dataset included all public schools in California that reported physical fitness test results. To obtain more stable rates of overweight students, we excluded schools with fewer than 100 students from the analysis. We focused on ninth graders because they are expected to have greater mobility, and greater financial ability to purchase food from nearby retailers, than 7th and 5th graders.
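The exclusion rule and the school-level rate can be sketched in a few lines of Python. The school records below are made up, the field names are hypothetical, and applying the threshold to the number of students tested (rather than enrolled) is an assumption on our part.

```python
MIN_STUDENTS = 100  # exclusion threshold stated in the text


def overweight_rate(n_overweight, n_tested):
    """Percentage of tested students classified as overweight."""
    return 100.0 * n_overweight / n_tested


# Hypothetical school records; names and counts are made up for illustration.
schools = [
    {"school": "A", "n_tested": 400, "n_overweight": 120},
    {"school": "B", "n_tested": 60, "n_overweight": 30},
    {"school": "C", "n_tested": 100, "n_overweight": 25},
]

# Keep only schools at or above the threshold and attach their rate.
retained = [
    {"school": s["school"], "rate": overweight_rate(s["n_overweight"], s["n_tested"])}
    for s in schools
    if s["n_tested"] >= MIN_STUDENTS
]
```

The threshold matters because in a school with 60 students tested a single student shifts the rate by about 1.7 percentage points, compared with 0.25 points in a school of 400.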
The physical fitness dataset also provided information on the gender and ethnic composition of students tested in each school. We used this information to construct a variable for gender composition, with percentage of male students as the reference category. We also constructed variables for student composition by major ethnic groups, with percentage non-Hispanic white as the reference category, and additional variables for the percentage of students that were Hispanic/Latino, African-American, Asian, Native American (including Alaskan native), Pacific Islander, and those who declined to state their ethnicity.
The California Department of Education was the source for several additional datasets, all of which are also publicly available. One contained the street addresses of schools in the study, which we used in constructing the retailer proximity variables described below. We obtained two other datasets to construct independent variables for 1) percentage of all students (not just ninth graders) in each school receiving free and reduced-price meals in 2007, and 2) urban or non-urban school location. The subsidized meal variable was used as a proxy for school-level socioeconomic status, which at the individual level is known to be strongly associated with overweight. We defined urban schools as those located in large or mid-size cities (based on U.S. Census Bureau classifications) and compared them to schools that were not in these areas, defined as non-urban.
Another publicly available dataset was purchased from Environmental Systems Research Institute, Inc. (ESRI). It was used to construct variables for the presence/absence of three classes of retailers near schools: 1) fast food restaurants, 2) convenience stores, and 3) supermarkets. ESRI provided geocoded information for data that were first collected by the marketing firm InfoUSA in 2007. We selected retailers based initially on the following 8-digit North American Industry Classification System (NAICS) codes: Limited Service Restaurants (72221105), Convenience Stores (44512001), and Supermarkets and Grocery Stores (44511001). As reported in another study [31], however, the NAICS codes were inconsistently applied. For instance, many recognizable convenience stores were identified with the NAICS code for Grocer Retail (44511003). To reduce misclassification, retail locations in this category that contained the terms 'Quick-,' 'Mini-,' and 'Liquor-' in the company name were recoded into the "convenience stores" category (n = 668). Supermarkets were defined as retailers in the Supermarket and Grocery Stores category with $2 million or more in annual sales, based on the Food Marketing Institute's definition [32]. Because Limited Service Restaurants (72221105) was also an overly broad category, those with five or more locations were selected as "fast food restaurants," that is, business chains that provide low-price meals without table service. This category included all of the major fast food chains (e.g. McDonald's, Burger King, Taco Bell, Domino's). For the entire state of California we identified 3,646 supermarkets, 4,069 convenience stores, and 20,668 fast food restaurants.
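The selection rules in this paragraph can be summarized as a small decision function. This is only a sketch under stated assumptions: the function and field names are hypothetical, a name matches if it contains any one of the three terms (case-insensitive substring match), and the order in which the rules are checked is our choice rather than something the text specifies.

```python
# NAICS codes quoted in the text.
FAST_FOOD_NAICS = "72221105"       # Limited Service Restaurants
CONVENIENCE_NAICS = "44512001"     # Convenience Stores
SUPERMARKET_NAICS = "44511001"     # Supermarkets and Grocery Stores
GROCER_RETAIL_NAICS = "44511003"   # Grocer Retail

CONVENIENCE_TERMS = ("quick", "mini", "liquor")
SUPERMARKET_MIN_SALES = 2_000_000  # dollars per year
FAST_FOOD_MIN_LOCATIONS = 5


def classify_retailer(naics, name, annual_sales, chain_locations):
    """Return the retailer class used in the study, or None if the record is not selected."""
    lowered = name.lower()
    if naics == CONVENIENCE_NAICS:
        return "convenience store"
    # Assumption: any one of the terms in the company name triggers the recode.
    if naics == GROCER_RETAIL_NAICS and any(term in lowered for term in CONVENIENCE_TERMS):
        return "convenience store"
    if naics == SUPERMARKET_NAICS and annual_sales >= SUPERMARKET_MIN_SALES:
        return "supermarket"
    if naics == FAST_FOOD_NAICS and chain_locations >= FAST_FOOD_MIN_LOCATIONS:
        return "fast food restaurant"
    return None


# Made-up records for illustration.
example = classify_retailer("44511003", "Quick-Stop Market", 300_000, 1)
```

A single-location restaurant coded 72221105 returns `None` here, which mirrors the exclusion of limited service restaurants with fewer than five locations.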
GIS analysis
The point locations of the schools were geocoded with the Streetmap USA (2006) dataset provided by ESRI, based on street addresses. We validated the accuracy of geocoded locations by spot-checking 10% of the schools. We found the addresses of these schools through web searches and re-geocoded them using Mapquest, an online geocoding service owned by America Online, Inc. [33]. Our spot-checking indicated more than 99% accuracy in our geocoded school locations.
Most previous studies of food proximity have not incorporated the actual pedestrian walking network into their analyses of proximity or food access. Instead they rely on Euclidean (straight-line, circular radius) buffers that do not account for the street network, sidewalks, and other elements of the areas surrounding schools where walking is most feasible [24, 31, 34, 35]. The result is that the area covered by Euclidean buffers can be substantially more than the area covered by equivalent distance network buffers. This can lead to erroneous or misleading results. To improve accuracy in this study we utilized network buffers along the actual street network for the entire state of California.
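The difference between the two buffer types can be seen in a toy example. The Python sketch below is not the ESRI workflow used in the study; it uses a hypothetical three-node street graph with made-up coordinates in meters and a plain Dijkstra shortest-path search to show a retailer that falls inside an 800 m straight-line buffer but outside an 800 m network buffer.

```python
import heapq
import math


def network_distances(graph, source):
    """Dijkstra shortest-path lengths in meters from source over an undirected street graph."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, length in graph[node]:
            candidate = d + length
            if candidate < dist.get(neighbor, math.inf):
                dist[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return dist


# Hypothetical layout: the only route to the retailer turns at a street corner.
coords = {"school": (0, 0), "corner": (600, 0), "retailer": (600, 400)}
edges = [("school", "corner"), ("corner", "retailer")]

graph = {node: [] for node in coords}
for a, b in edges:
    length = math.dist(coords[a], coords[b])
    graph[a].append((b, length))
    graph[b].append((a, length))

BUFFER_M = 800
straight = math.dist(coords["school"], coords["retailer"])  # about 721 m
walking = network_distances(graph, "school")["retailer"]    # 1000 m along the streets

in_euclidean_buffer = straight <= BUFFER_M
in_network_buffer = walking <= BUFFER_M
```

Here the Euclidean buffer would count the retailer as near the school, while the network buffer would not.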
We used ESRI's Network Analyst extension to ArcView 9.1 to create 800 m network buffers around the final geocoded school points. This distance was selected because it is approximately a ten-minute walk, and is commonly used in other studies of retail food access near schools [24, 31, 35, 36]. These network buffers included both the street lines making up the 800 m street networks around each school, as well as polygons that contained the area encompassing these networked buffers. Once the network buffers were created, we performed a spatial join to calculate the number of fast food, convenience and supermarket retailers located within each of these network buffers. The final result of our GIS analysis was a matrix denoting the presence or absence of each of three classes of retailers within 800 m along the street network for each school. These data were exported out of the GIS for statistical analysis.
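The spatial join and the resulting presence/absence row can be sketched as follows. This is an illustration only, not the GIS procedure itself: the buffer polygon, retailer coordinates, and names are made up, and the point-in-polygon test is a standard ray-casting routine standing in for the GIS spatial join.

```python
CLASSES = ("fast_food", "convenience", "supermarket")


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon given as a list of vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def presence_row(buffer_polygon, retailers):
    """Count retailers of each class inside the buffer, then reduce counts to 0/1 indicators."""
    counts = {c: 0 for c in CLASSES}
    for kind, x, y in retailers:
        if point_in_polygon(x, y, buffer_polygon):
            counts[kind] += 1
    return {c: int(counts[c] > 0) for c in CLASSES}


# Hypothetical buffer polygon around a school at the origin (coordinates in meters).
buffer_polygon = [(800, 0), (0, 800), (-800, 0), (0, -800)]
retailers = [
    ("fast_food", 300, 300),
    ("fast_food", -100, 200),
    ("convenience", 600, 400),
    ("supermarket", 0, -900),
]

row = presence_row(buffer_polygon, retailers)
```

Stacking one such row per school gives the school-by-retailer-class matrix described above.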
Statistical analysis
We analyzed all data with SPSS (version 15.0.1). Our analysis included descriptive statistics, correlations (Kendall's tau-b), and linear regression. Our first model regressed the dependent variable of school rate of overweight ninth grade students on the three types of nearby retailers assessed in the GIS analysis. Our second model included the additional school-level variables (ethnic, socioeconomic and gender composition, and urban/non-urban location) for comparison.
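For readers unfamiliar with Kendall's tau-b, the statistic can be computed directly from concordant and discordant pairs, with a correction for ties. Ties are common here because the retailer variables are binary. The Python sketch below is an illustration with made-up data, not the SPSS procedure used in the study.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length sequences, using a simple pairwise count."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: excluded from both denominator terms
            elif dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denominator = math.sqrt(
        (concordant + discordant + tied_x_only) * (concordant + discordant + tied_y_only)
    )
    if denominator == 0:
        raise ValueError("tau-b is undefined when a variable is constant")
    return (concordant - discordant) / denominator


# Made-up data: presence (1) or absence (0) of a retailer class and school overweight rates.
presence = [0, 0, 1, 1, 1]
rates = [20.0, 25.0, 22.0, 30.0, 35.0]
tau = kendall_tau_b(presence, rates)
```

In this toy sample there are 5 concordant pairs, 1 discordant pair, and 4 pairs tied on presence only, so tau-b is 4 divided by the square root of 60, or about 0.52.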
Prior to regression, we applied logarithm transformations to school ethnic composition variables, with the exceptions of the two largest groups (Hispanic/Latino and White, non-Hispanic), in order to meet assumptions of normality. We also employed multiple imputation as a strategy for dealing with missing data for two independent variables: percentage of students receiving free & reduced price meals, and urban/non-urban school location. This involved utilizing information from the existing data to generate plausible values for missing data, while also representing uncertainty by generating five different data sets with imputed values. These values were imputed with the software Amelia II (version 1.2), and combined for analysis using the procedures detailed by King et al. [37].
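The combining step can be illustrated with the standard multiple-imputation rules, which we assume are the procedures referred to here: the pooled coefficient is the mean of the per-dataset estimates, and its variance is the average within-dataset variance plus the between-dataset variance inflated by a factor of (1 + 1/m). The Python sketch below uses made-up coefficients and standard errors for m = 5 imputed datasets; the function name is ours.

```python
from statistics import mean, variance


def combine_imputations(estimates, std_errors):
    """Pool one coefficient across m imputed datasets; returns (estimate, standard error)."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching estimates and standard errors from at least 2 datasets")
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average sampling variance
    between = variance(estimates)                  # sample variance across datasets (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5


# Made-up regression coefficients and standard errors from five imputed datasets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
std_errors = [0.5, 0.5, 0.5, 0.5, 0.5]
pooled_estimate, pooled_se = combine_imputations(estimates, std_errors)
```

With these numbers the pooled standard error (about 0.53) is larger than any single-dataset standard error (0.5), reflecting the added uncertainty from the missing values.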