Sampling plan in health surveys, city of São Paulo, Brazil, 2015

ABSTRACT OBJECTIVE To evaluate the sampling plan of the Health Survey of the City of São Paulo (ISA-Capital 2015) regarding the precision of estimates and the conformation of domains of study by the Health Coordinations of the city of São Paulo, Brazil. METHODS We described the population, domains of study, and sampling procedures of the Health Survey of the City of São Paulo, 2015, including stratification, calculation of sample size, and random selection of sample units. The precision of the estimates of proportions was analyzed using the coefficient of variation and the design effect. Coefficients of variation below 30% at the regional level and below 20% at the city level, and design effects below 1.5, were considered suitable. The strategy of establishing the Health Coordinations as domains was considered suitable if, within the coordinations, the estimates of proportions for the age and sex groups had the minimum acceptable precision. The estimated parameters were related to the subjects of use of services, morbidity, and self-assessment of health. RESULTS A total of 150 census tracts were randomly selected, 30 in each Health Coordination; 5,469 households were randomly selected and visited, and 4,043 interviews were conducted. Of the 115 estimates made for the domains of study, 97.4% presented coefficients of variation below 30%, and 82.6% were below 20%. Of the 24 estimates made for the total of the city, 23 presented coefficients of variation below 20%. More than two-thirds of the estimates of the design effect were below 1.5, the value assumed in the sample size calculation, and 88% were below 2.0. CONCLUSIONS The ISA-Capital 2015 sample generated estimates at the predicted levels of precision at both the city and regional levels. The decision to establish the regional health coordinations of the city of São Paulo as domains of study was adequate.


INTRODUCTION
Knowledge of the sampling plans used in epidemiological surveys, together with the evaluation of the alternatives applied, is important for improving the practice of household surveys. There are few publications on this subject in the Brazilian literature to support new experiences [1][2][3][4][5][6]. Providing input to improve sampling designs is particularly valuable in time trend studies, which are based on data from successive surveys. More such studies have been carried out in recent years [7][8][9][10].
In cities in the State of São Paulo, Brazil, health surveys called ISA have been carried out since 2001. The objective is to evaluate the health status of the population living in the city, according to their living conditions, addressing aspects related to lifestyle, acute and chronic morbidities, preventive practices, and use of health services 11 . In these surveys, probabilistic sampling is used, always seeking inferences to the study population based on measures of precision. Although they are similar, the sampling plans used in the different years of the ISA-Capital differ in some aspects. The changes were motivated by the desire to improve the process of data collection based on acquired experience, while preserving the possibility of comparison between the different editions.
The planning of the 2015 survey was based on the interest in producing information on smaller areas of the city, which are more homogeneous in relation to the epidemiological profile. Consistent with this objective, the City Health Department, the main funder of the project, intended to reinforce the use of results by regional managers. This confluence of interests culminated in the definition of the regional Health Coordinations of the city of São Paulo as domains of study in the ISA-Capital 2015.
The objective of this study was to evaluate the sampling plan of the ISA-Capital 2015 regarding the precision of estimates and the conformation of the domains of study by the Health Coordinations of the city of São Paulo, Brazil.

METHODS
Below, we describe the sampling plan of ISA-Capital 2015, highlighting the following aspects: population and domains of study, and sampling procedures, including calculation of sample size and random selection of sample units. In addition, we present the results of the application of the sampling plan, considering the households visited and the interviews obtained.
The estimates obtained with the ISA-Capital 2015 sample for the parameters of interest were analyzed for precision using the coefficient of variation. Estimates with coefficients below 20% at the city level and below 30% at the regional level were considered sufficiently precise. Thus, the establishment of the Health Coordinations as domains of study would be considered suitable if, within the coordinations, the estimates of proportions for the age and sex domains had the minimum acceptable precision, indicated by coefficients of variation below 30%.
We also evaluated the design effect, widely used as a measure of the efficiency of complex sampling designs 12,13 . Design effects below 1.5, the value adopted in the planning of the sample, were considered suitable. We also verified the frequency of estimates below 2.0, a value frequently adopted in sampling plans 3,4,6,14 .
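As a minimal sketch of the precision criterion, the coefficient of variation of an estimated proportion can be computed as its standard error divided by the estimate and compared with the 20% and 30% limits. The function names and the example values are illustrative assumptions, not results of the survey.

```python
def coefficient_of_variation(estimate: float, standard_error: float) -> float:
    """Return the coefficient of variation (CV) as a percentage."""
    if estimate <= 0:
        raise ValueError("estimate must be positive")
    return 100.0 * standard_error / estimate


def is_precise(cv_percent: float, level: str) -> bool:
    """Apply the limits adopted in the study: 20% (city), 30% (regional)."""
    limits = {"city": 20.0, "regional": 30.0}
    return cv_percent < limits[level]


# Hypothetical example: prevalence of 0.25 with a standard error of 0.06.
cv = coefficient_of_variation(0.25, 0.06)
print(round(cv, 1), is_precise(cv, "regional"), is_precise(cv, "city"))
```

In this hypothetical case the estimate would be acceptable for a Coordination but not for the city as a whole.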
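For illustration, the design effect of an estimated proportion can be sketched as the ratio between the variance obtained under the complex design and the variance that a simple random sample of the same size would give. The use of p(1-p)/n without a finite population correction, the function names, and the example values are assumptions of this sketch.

```python
def design_effect(p: float, n: int, se_design: float) -> float:
    """Design-based variance divided by the simple random sampling variance."""
    var_srs = p * (1.0 - p) / n  # finite population correction ignored
    return se_design ** 2 / var_srs


def classify_deff(deff: float) -> str:
    """Place a design effect in the bands used in the evaluation."""
    if deff < 1.5:
        return "below 1.5 (planning value)"
    if deff < 2.0:
        return "below 2.0"
    return "2.0 or above"


# Hypothetical values: p = 0.30, n = 800 interviews, design-based SE = 0.02.
deff = design_effect(0.30, 800, 0.02)
print(round(deff, 2), classify_deff(deff))
```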
The parameters estimated in this study were the prevalence of persons who reported the following: use of health service in the last 30 days, hospitalization in the last year, visit to the dentist in the last year, hypertension, allergy, health problem in the last 15 days, and excellent or good self-assessment of health. These parameters were related to the following subjects: use of services, morbidity, and self-assessment of health, usually studied in health surveys. The reference to allergy was selected because it was the only morbidity in which the estimates for adolescents were greater than 10% for the most part.
The reference population of the ISA-Capital 2015 consisted of individuals aged 12 years or more living in permanent private households in the urban area of the city of São Paulo (Table 1, block 1). To delimit the population, the survey used the census tracts classified in the 2010 Census as being in an urban situation (urbanized area, non-urbanized area, or isolated urbanized area) and as being of the 'common' or 'special subnormal' types.
Stratified sampling was used and clusters were selected in two stages: census tracts and households.
The strata were formed by the five Health Coordinations of the city of São Paulo: North, Central-West, Southeast, South, and East, which were domains of study. For the sample planning, we also considered the age and sex groups as domains: adolescents (12 to 19 years), male adults (men aged 20 to 59 years), female adults (women aged 20 to 59 years), and older adults (60 years or more). We defined 20 domains of study, both geographic and demographic.
For operational reasons, the total sample size was set at 4,250 persons. In order for the Health Coordinations to have the same potential for data analysis, 850 persons were assigned to each one. The sample would have the distribution presented in Table 1 (block 2) if the distribution by the age and sex domains were proportional to the population of these domains in each Coordination. However, the participation of the "adolescent" and "older adult" groups was changed in the sample to obtain more precise estimates in these domains. A 50% larger adolescent population was considered, as well as a 100% larger older population, and a new distribution of the sample was carried out. The number of interviews was increased to 150 for two domains: adolescents of the Central-West and Southeast Coordinations. This number allows the estimation of proportions of 0.50 with a sampling error of 0.10, considering a 95% confidence level and a design effect of 1.5. The calculation was based on the algebraic expression that determines the minimum sample size to estimate proportions under complex samples 13,15 : $n = \dfrac{P(1-P)}{(d/z)^{2}} \times deff$, where $n$ is the sample size, $P$ is the parameter to be estimated, $z = 1.96$ is the value of the standard normal distribution corresponding to the 95% confidence level of the confidence intervals, $d$ is the sampling error, and $deff$ is the design effect.
The expected mean number of persons per household (ratio between persons and households) was calculated in each domain from the 2010 Census data. However, in order to reach the minimum number of interviews in the presence of non-response (vacant or closed households, refusals, or households with a resident unable to respond), the inclusion of a larger number of households in the sample was planned (Table 1, block 6). A non-response rate of 40% and a percentage of vacant households of 10% were considered.
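The sample size expression can be checked numerically with the values given in the text (P = 0.50, d = 0.10, z = 1.96, deff = 1.5). The function name is an illustrative assumption.

```python
import math


def min_sample_size(p: float, d: float, z: float = 1.96, deff: float = 1.0) -> float:
    """Minimum sample size to estimate a proportion p with sampling error d."""
    return p * (1.0 - p) / (d / z) ** 2 * deff


n = min_sample_size(p=0.50, d=0.10, z=1.96, deff=1.5)
print(round(n, 2), math.ceil(n))
```

The result is about 144 interviews, which is consistent with the minimum of 150 interviews adopted for the domains.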
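A rough sketch of how the number of households to be selected could be derived from a target number of interviews is given below. The text states only the rates considered (40% non-response and 10% vacancy); the way they are combined here (dividing by the product of the complements) and the value of 0.35 persons per household are assumptions made for illustration.

```python
import math


def households_to_select(target_interviews: int,
                         persons_per_household: float,
                         nonresponse_rate: float = 0.40,
                         vacancy_rate: float = 0.10) -> int:
    """Inflate the number of households to allow for vacancy and non-response."""
    # Households needed if every selected household were occupied and responded.
    needed = target_interviews / persons_per_household
    # Assumed combination of losses: vacancy first, then non-response.
    inflated = needed / ((1.0 - nonresponse_rate) * (1.0 - vacancy_rate))
    return math.ceil(inflated)


# Hypothetical domain: 150 interviews, 0.35 persons of the domain per household.
print(households_to_select(150, 0.35))
```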
The interviewees were randomly selected using two-stage sampling. In the first stage, 30 census tracts were randomly selected in each Coordination, with probability proportional to size, measured by the number of permanent private households counted in the 2010 Census, sorted by the average per capita income of the households in the tract.
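The first-stage selection can be sketched as a systematic selection with probability proportional to size over a list of tracts sorted by income, which works as an implicit stratification. The frame below is simulated, and the function and variable names are assumptions; the source does not describe the software used for the selection.

```python
import random


def pps_systematic(sizes, n_select, rng):
    """Systematic PPS selection; returns indices of units in list order."""
    total = sum(sizes)
    interval = total / n_select
    start = rng.random() * interval  # random start in [0, interval)
    points = [start + k * interval for k in range(n_select)]
    selected, cumulative, idx = [], 0.0, 0
    for unit, size in enumerate(sizes):
        cumulative += size
        # A unit is selected once for each selection point in its size range.
        while idx < n_select and points[idx] < cumulative:
            selected.append(unit)
            idx += 1
    return selected


rng = random.Random(2015)
# Simulated frame: (tract id, households in the Census, mean per capita income).
frame = [(i, rng.randint(150, 400), rng.uniform(300, 5000)) for i in range(200)]
frame.sort(key=lambda tract: tract[2])  # sort by income before selecting
chosen = pps_systematic([tract[1] for tract in frame], 30, rng)
sample_tracts = [frame[i][0] for i in chosen]
print(len(sample_tracts))
```

In this simulated frame every tract is smaller than the selection interval, so no tract can be selected more than once.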
In the second stage, the households were selected using two different procedures. In tracts classified by the IBGE as "common", the households were systematically selected from the list of households compiled in the field. In the census tracts classified as "special subnormal" (which correspond to favela slums in the city of São Paulo), segments of households were created (mean size of six households). These segments were the second-stage selection units, and the random selection of six segments per tract was planned. In each tract, the households were first randomly selected for the rarest domain (adolescents in the Central-West region and older adults in the other four Coordinations); this sample was called the main sample. From the main sample, subsamples were randomly selected with the sizes defined for the other age and sex domains (Table 1, block 7). This type of random selection is equivalent to obtaining four concomitant samples, one for each of the four age and sex domains.
There was no intra-household random selection. All persons belonging to the domain for which the household was selected were included in the sample. The interviewers' data collection devices indicated the domains to be searched for in each household of the sample.
The overall sampling fractions in each Coordination were expressed in terms of the following quantities: $M_i$ is the number of households in tract i (data from the 2010 Census), $M$ is the total number of households in the Coordination (data from the 2010 Census), $b$ is the number of households in the rarest domain, i.e., the main sample, and $b_{domain}$ is the number of households required for each of the three less rare domains.
The second-stage sampling fraction was fixed, which increased (or decreased) the number of households randomly selected in relation to what was planned if the census tract had grown (or shrunk) since the 2010 Census. With this option, the second-stage sampling fraction can be rewritten in terms of $M_i'$, the number of households in tract i obtained in the listing of households performed in the field.
In order to compensate for the differences between the selection probabilities of the individuals in the sample, design weights were introduced in the data analysis step, expressed as the inverse of the sampling fractions, F = 1/f (Table 1, block 8) 16 . This weight can be interpreted as the number of persons in the population "represented" by each person randomly selected.
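The idea of a main sample with nested subsamples can be sketched as follows. For simplicity, this sketch uses simple random selection within the tract instead of the systematic selection described in the text, and the tract size, the domain labels, and the sample sizes are hypothetical.

```python
import random


def nested_samples(household_ids, main_size, domain_sizes, rng):
    """Draw a main sample and, from it, one subsample per remaining domain."""
    main = rng.sample(household_ids, main_size)
    subsamples = {name: rng.sample(main, size)
                  for name, size in domain_sizes.items()}
    return main, subsamples


rng = random.Random(7)
listed = list(range(1, 301))  # hypothetical tract with 300 listed households
# Main sample for the rarest domain; subsamples for the other three domains.
main, subs = nested_samples(
    listed, 40,
    {"adolescents": 25, "adult_men": 12, "adult_women": 10},
    rng,
)
print(len(main), {name: len(ids) for name, ids in subs.items()})
```

Every household in a subsample also belongs to the main sample, so a single visit can serve more than one domain.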
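The weight F = 1/f can be illustrated with a small sketch. The product form used for f below is the textbook expression for two-stage sampling with probability proportional to size, built from the symbols defined in the text; it is an assumption of this sketch and not necessarily the exact expression used in the survey. All numeric values are hypothetical.

```python
def overall_fraction(n_tracts, M_i, M, b, b_domain=None):
    """Assumed overall sampling fraction for a household in tract i."""
    first_stage = n_tracts * M_i / M   # PPS selection of the tract
    second_stage = b / M_i             # main sample within the tract
    f = first_stage * second_stage
    if b_domain is not None:
        f *= b_domain / b              # subsample drawn from the main sample
    return f


def design_weight(f):
    """Design weight as the inverse of the sampling fraction, F = 1/f."""
    return 1.0 / f


# Hypothetical Coordination: 30 tracts, M = 750,000 households,
# tract with M_i = 250 households, main sample of b = 40 households.
f_main = overall_fraction(30, 250, 750_000, 40)
f_sub = overall_fraction(30, 250, 750_000, 40, b_domain=10)
print(round(design_weight(f_main)), round(design_weight(f_sub)))
```

Under this assumed form, M_i cancels out and the fraction is the same for every tract of a Coordination, so each person of a given domain would carry the same design weight.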

RESULTS
The fieldwork of ISA-Capital began in the second half of 2014, but 80.0% of the interviews were conducted in 2015, between January and December. A total of 5,942 households were effectively selected and visited. Of these, 8.0% were vacant, which left 5,469 occupied households (Table 2). Information about the residents and the presence of persons belonging to the age and sex groups of interest could be obtained in 76.4% of the occupied households. In these households, 73.4% of the eligible residents were interviewed.
The number of households randomly selected was higher than the number considered necessary to obtain the interviews provided for in the sampling plan (n = 4,831). Nevertheless, the number of interviews was smaller than planned. The minimum of 150 interviews was not reached in two domains (adolescents and adult males in the Central-West Coordination). The Coordinations that fell short of the planned number of interviews were North (9.0% smaller) and Central-West (28.0% smaller). For the city as a whole, the target number of interviews was reached for adolescents and older adults, and the largest shortfall was in the group of adult males (18.0% smaller).
Of the 115 estimates made for the domains of study, 97.4% presented coefficients of variation below 30.0%, and 82.6% were below 20.0% (Tables 3 and 4). Of the few estimates that did not reach the desired level of precision (three estimates), one was obtained with a small sample of interviews, below 150, and one estimate was below 0.10, which means an event of very small frequency. All prevalence estimates above 0.30 showed low coefficients of variation; the opposite occurred for the estimates below 0.10: none reached the desired levels of precision. Of the 24 estimates made for the city, almost all (23 estimates) showed a coefficient of variation below 20%.
More than two-thirds (69.0%) of the estimates of the design effect were below 1.5, the value assumed in the sample size calculation, and 88.0% were below 2.0.
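The fieldwork figures can be combined in a quick arithmetic check. Multiplying the household-level and resident-level rates to obtain an overall response rate is an assumption of this sketch; the text reports the two rates separately.

```python
selected_households = 5942
occupied_households = 5469

# Share of selected households found vacant (reported as 8.0%).
vacancy_rate = 1 - occupied_households / selected_households

household_response = 0.764  # occupied households with information obtained
resident_response = 0.734   # eligible residents interviewed in those households

# Assumed overall response rate: product of the two stages.
overall_response = household_response * resident_response

print(round(vacancy_rate, 3), round(overall_response, 3))
```

Under this assumption the overall response is about 56%, that is, a non-response of about 44%, above the 40% considered in the planning.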
The mean number of interviews per tract for the age and sex groups for the set of Coordinations ranged from 5.7 to 8.1.

DISCUSSION
The ISA-Capital 2015 sample generated estimates at the predicted levels of precision at both the city and regional levels, which indicates that the decision to establish the regional health coordinations of the city of São Paulo as domains of study was adequate.
There is no single, universally adopted criterion for setting a limit on the values of the coefficient of variation. Several factors must be considered. Judging whether a particular coefficient of variation is too high or too low requires experience with similar data 17 .
The Fundação Sistema Estadual de Análise de Dados (SEADE), responsible for several surveys in the State of São Paulo, bases its decisions on the frequency of the survey and the nature of the phenomenon under study. It does not adopt a single policy for the dissemination of the results of the research it carries out. Thus, different limits for the coefficient of variation were stipulated in the various surveys conducted. When disclosing the results of the Household Expenditure Survey of 2015/2016, the National Statistical Institute of Portugal proposed that estimates with coefficients of variation between 20% and 30% should be used with caution and those with coefficients above 30% should be disregarded. These limits, as well as those proposed in other health studies 3,6 , coincide with those adopted in our study.
The number of households effectively selected was higher than planned. The use of constant fractions in the random selection in the second sampling stage may be responsible for this result. With this strategy, the 38% increase in the number of households between the Census and the survey data collection was reflected in the number of households sampled. The equiprobability of the sample was maintained by the selection with probability proportional to size, at the cost of control over its final size.
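The effect of a fixed second-stage fraction can be illustrated for a single tract. The planned sample of 40 households and the Census count of 250 households are hypothetical; the 38% growth is the figure mentioned in the text, applied here to one tract only as an example.

```python
def households_selected(b_planned: float, M_census: int, M_listed: int) -> float:
    """Expected households selected when the fraction b/M_census is kept fixed."""
    fraction = b_planned / M_census
    return fraction * M_listed


# Hypothetical tract: 250 households in the Census, 345 in the field listing
# (a 38% increase), with 40 households planned for the sample.
expected = households_selected(40, 250, 345)
print(round(expected, 1), round(expected / 40, 2))
```

The number of households selected grows in the same proportion as the tract, which keeps the selection probability constant but not the sample size.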
In addition, sampling fractions were changed in the tracts not yet visited when the follow-up of the field work detected that the non-response rates were greater than expected. This further increased the number of households randomly selected. These increases were offset by the use of weights in the data analysis.
The follow-up of the field work by the team responsible for the survey was carried out through spreadsheets, whose models were improved throughout the various editions of the ISA project. The detailing of the response rates at the household and resident levels by census tract together with the interviews allowed problems to be detected as soon as they occurred. This helped the introduction of adjustments in the sampling plan.
The number of interviews was lower than planned, which shows that population participation in the survey was lower than expected. All households were visited at least three times, at different times and days, which did not prevent high non-response rates.
Although the sample size of the ISA-Capital 2015 is similar to previous editions, the field work was extended for a longer period, mainly due to the greater number of census tracts selected (150 in 2015, 80 in 2008, and 60 in 2003). This was a necessity created by the option of adopting the Health Coordinations as domains of study, setting the number of tracts to 30 in each one. It can be understood as the cost of obtaining regional estimates in this edition of the ISA.
The increase in the number of census tracts meant a smaller number of interviews per tract: 5.7 to 8.1, on average, by age and sex domain. These numbers are far from the optimal number of interviews in each primary sampling unit. This optimal number seeks a balance between precision and cost, considering the ratio between the costs of including a new cluster and a new household in the sample, in addition to the degree of intra-cluster homogeneity 18 . Cluster sampling often increases the variance of the estimates according to the intraclass correlation, which is a characteristic of the population that cannot be changed by the sampling process. However, the inclusion of fewer elements per cluster in the sample can reduce the impact of intraclass correlation on variance, leading to smaller estimates for the design effect.
The random selection adopted in the ISA-Capital, in which the four samples related to the four age and sex domains are obtained simultaneously, lessens the importance of the increase in cost. The number of interviews per tract, considering the four domains, was between 20.3 and 32.5, depending on the Coordination.
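The relation between the number of interviews per cluster and the design effect is often approximated by Kish's expression deff = 1 + (m - 1)ρ, where m is the number of elements per cluster and ρ is the intraclass correlation. The sketch below applies it to the mean numbers of interviews per tract reported in the text; the value ρ = 0.05 is hypothetical, and the approximation ignores the effect of weighting.

```python
def cluster_deff(m: float, rho: float) -> float:
    """Approximate design effect due to clustering: 1 + (m - 1) * rho."""
    return 1.0 + (m - 1.0) * rho


RHO = 0.05  # hypothetical intraclass correlation

for m in (5.7, 8.1, 20.0):
    print(m, round(cluster_deff(m, RHO), 3))
```

With the same ρ, 5.7 to 8.1 interviews per tract give a design effect below 1.5, whereas 20 interviews per tract would bring it close to 2.0.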
One of the consequences of using weights in the data analysis step is the increase of the estimates of the design effect, and the increase is proportional to the variation between applied weights 20 . In the first ISA editions, sample size was the same for all age and sex domains. This resulted in very different weights, which impacted the design effect when more than one domain was analyzed together. The 2015 survey sought a distribution of the sample closer to proportional across the age and sex domains in each Coordination, avoiding the previously observed discrepancy between weights.
One of the characteristics common to all editions of the ISA-Capital is the absence of intra-household selection. In terms of efficiency, the strategy of interviewing all residents belonging to the age and sex group of interest is superior to that in which only one of the residents of the household is randomly selected for the interview 21 . To apply it, the ISA randomly selects a main sample and, from it, obtains subsamples of households according to the need of each domain, defined based on the mean number of persons per household indicated in the Census. With the adequate number of households for each domain, there is no need for an intra-household selection.
Based on data from previous editions of the ISA, Alves et al. 22 have shown that it is particularly advantageous to use segments as an alternative to full address listing when applied to the random selection of households in favela slums. In the ISA-Capital 2015, in addition to being used in favela slums, this strategy was also applied in the last tracts to speed up the field work. Among the advantages associated with the use of segments, we can highlight the speed in locating and identifying households.
The estimates of prevalence within Health Coordinations, for the most part, were considered suitable for all age and sex groups defined as domains in the ISA-Capital 2015. This result allows the use of data from the survey by health managers in the city of São Paulo, who will have regional information that is sufficiently precise to assess issues related to reported morbidity and the use of services. However, caution is needed when using results related to rare events, especially when they are based on small samples.
Disaggregating the data of this edition of ISA-Capital by Health Coordinations proved to be a good choice. The comparison of the results of different regions of the city can help in the understanding of the determinants of the epidemiological situation of the resident population and aspects related to the use of health services available in the area. The government of the city of São Paulo has prepared reports that analyze the data related to the various subjects addressed in the research. This production shows the potential contribution of the survey to the analysis of health problems of the population of the city and to the assessment of the adequacy of the coping strategies adopted. The repetition of surveys in the city meets the interest of studying trends in several measures related to the health of the population living in it.
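The effect of unequal weights on the design effect is commonly approximated by Kish's expression n Σw² / (Σw)², which equals 1 plus the squared coefficient of variation of the weights. The sketch below contrasts equal and unequal weights; the weight values are hypothetical.

```python
def kish_weighting_deff(weights):
    """Approximate design effect due to unequal weighting (Kish)."""
    n = len(weights)
    total = sum(weights)
    return n * sum(w * w for w in weights) / (total * total)


equal = [625.0] * 8                # same weight for every respondent
unequal = [1.0] * 4 + [4.0] * 4    # two domains with very different weights

print(kish_weighting_deff(equal), round(kish_weighting_deff(unequal), 2))
```

Equal weights give a factor of 1.0, while the fourfold difference between the two hypothetical groups raises it to 1.36, which illustrates why a sample closer to proportional reduces this component of the design effect.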