Cadernos de Saúde Pública
ISSN 1678-4464
Vol. 37, No. 8
Rio de Janeiro, August 2021
THEMATIC SECTION
Sampling plan of the Brazilian National Survey on Child Nutrition (ENANI2019): a population-based household survey
Maurício Teixeira Leite de Vasconcellos, Pedro Luis do Nascimento Silva, Inês Rugani Ribeiro de Castro, Cristiano Siqueira Boccolini, Nadya Helena Alves-Santos, Gilberto Kac
http://dx.doi.org/10.1590/0102-311X00037221
Infant; Preschool Child; Statistical Models; Sampling; Methods
Introduction
The Ministry of Health funded the Brazilian National Survey on Child Nutrition (ENANI2019) (call for projects CNPq/MS/SCTIE/DECIT/SAS/DAB/CGAN n. 11/2017). ENANI2019 is structured in three domains: assessment of breastfeeding and dietary intake; anthropometric assessment of nutritional status; and assessment of micronutrient deficiencies in children under five years of age, by major geographic region, sex, and age group.
The data were obtained through a probabilistic household sample survey, with geographic stratification and clustering by census enumeration areas (CEAs), using sampling methods similar to those adopted by official statistical institutes in their human population surveys ^{1}. This allowed ENANI2019 to estimate, reproducibly and on a scientific basis, the population parameters required to reach its objectives. The basic idea when sampling human populations is to reach individuals through their households, which are grouped in CEAs, which in turn are grouped according to their situation (urban versus rural) into subdistricts, districts, municipalities, and so on. The approach rests on the concepts of household and resident (the latter ensures that individuals with more than one residential address are not more likely to enter the sample) and on the selection of areas.
ENANI2019 provides a unique opportunity to elucidate the various aspects of the nutritional assessment of children and to support public health policies for this vulnerable age group. Thus, this manuscript aims to describe the methodological aspects of defining the study population, sampling plan, sample weighting, and effective sample of ENANI2019.
Study population
The study population for ENANI2019 was defined as the set of children under five years of age residing in permanent private households throughout Brazil with at least one child under five years of age on the date of the survey interview. Therefore, the study population did not include: (1) children residing in collective households (hotels, boarding houses, orphanages, shelters, detention centers, barracks, hospitals, etc.), improvised private households, and permanent private households without children; (2) indigenous children living in villages; (3) foreign children living in households where Portuguese was not spoken; and (4) children with conditions that prevented them from undergoing anthropometric measurement.
Ethical aspects
The Institutional Review Board of the Clementino Fraga Filho University Hospital of the Federal University of Rio de Janeiro (UFRJ) approved the study under number CAAE 89798718.7.0000.5257. Data were collected after the child's parents or guardians signed two copies of the free and informed consent form. The methods used in the development of ENANI2019 have been described in detail in specific publications ^{2,3,4,5}.
Sampling plan
The sampling plan of ENANI2019 used stratification and clustering, incorporating two or three selection stages. The population's stratification for sampling purposes was guided by the study's objectives and the definition of the five major geographic regions in Brazilian territory as target domains for publication of results.
The primary sampling units (PSUs) were the municipalities or the CEAs, and the elementary sampling units were always the households. In each selected household, all residents were enrolled, and the study's target data were recorded for all the resident children under five years of age.
Stratification
Strata were formed through the allocation of Brazilian municipalities (according to the territorial base used by the Brazilian Institute of Geography and Statistics, IBGE, in the population estimates for July 1, 2016) ^{6} in two blocks: (1) each of the state capitals plus the Federal District (27 strata) and each of the 20 municipalities with more than 500,000 inhabitants (20 strata) and (2) the other municipalities in each major geographic region (5 strata).
Table 1 Projection of the Brazilian population under five years of age according to major geographic regions and sample selection strata. Brazilian National Survey on Child Nutrition (ENANI2019).

The data for the total population and the population of children under five years of age were estimated for July 1, 2016, for each of the 5,570 Brazilian municipalities using the linear trend method ^{7}, the same used by the IBGE in the elaboration of the population estimates used by the Federal Accounts Court to determine their share of the participatory fund for municipalities ^{6}.
In the 47 strata formed by each of the municipalities included with certainty in the sample (block 1), the PSU was the CEA (IBGE), and the secondary sampling unit (SSU) was the eligible household (with children from the study population). In the other strata (block 2), the PSU was the municipality, the SSU was the CEA, and the tertiary sampling unit (TSU) was the eligible household.
Calculation of the sample size
Calculation of the sample size was guided by the project's budget parameters, the blood sample collection logistics, and the experience with similar surveys conducted by the Society for the Development of Scientific Research (Science).
Considering the target domain (major geographic region), the minimum proportion was specified as P_{min} = 2%. The estimated relative margin of error should be a maximum of d_{R} = 35%, with a confidence coefficient of (1 − α) = 95%. According to Cochran ^{8} and assuming simple random sampling without replacement (SRS), the necessary sample size to estimate proportions equal to or greater than P_{min} with a relative error no greater than d_{R} with a level of confidence 1 − α is calculated by:
n = z_{α/2}^{2} (1 − P_{min}) / (d_{R}^{2} P_{min})    (1)
where z_{α/2} is the (1 − α/2) quantile of the standard normal distribution.
Since the sample design is complex (stratified and clustered), it is necessary to consider the design effect when calculating the sample size. Pessoa & Silva ^{9} recommend multiplying the sample size obtained by Expression 1 by an estimate of the design effect (deff) for the key survey variable. A deff of 1.95 was set for calculating the sample size since there were no data on deff from previous household surveys on the topic. Although this value is arbitrary, selecting a deff greater than one is preferable to making no adjustment to the sample size for the expected effects of clustering under the adopted sampling design. Data from the study showed that the deff for the estimates of the proportion of children that did not receive breastmilk on the day before the interview, by sex and age group, varied from 2.3 to 5.7, and the proportions varied from 12.8% (girls under six months of age) to 97.4% (girls four years old). Similar ranges for deff were observed for estimates by sex and age of the children's average weight (2.9 to 7.3) and height (2.7 to 6.6). These results suggest that the value used in calculating the sample size was too small. For future sample size calculations for the same population, the data from this study can be used to estimate deff values for other key survey variables.
The sample size of households to be interviewed for each major geographic region was thus calculated by Expression 2:
n_{deff} = n × deff    (2)
Since there are five estimation domains, the total sample size was calculated at 14,990 (= 5 x 2,998) households.
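As a rough check of the figures above, the calculation can be sketched in Python. The sketch assumes the usual large-population form of the SRS formula for a proportion with a relative margin of error (no finite population correction) and assumes that each step is rounded up; both are assumptions, as are the function and variable names.

```python
import math
from statistics import NormalDist


def srs_sample_size(p_min: float, rel_error: float, confidence: float) -> int:
    """SRS sample size for a proportion >= p_min with a relative margin of error.

    Large-population form (no finite population correction), rounded up.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * (1 - p_min) / (rel_error ** 2 * p_min))


# Parameters quoted in the text: P_min = 2%, d_R = 35%, confidence = 95%.
n_srs = srs_sample_size(p_min=0.02, rel_error=0.35, confidence=0.95)
n_region = math.ceil(n_srs * 1.95)  # inflate by the assumed design effect
n_total = 5 * n_region              # five major geographic regions
```

Under these assumptions the per-region and total sizes agree with the 2,998 and 14,990 households quoted in the text.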
It was also determined that ten eligible households would be interviewed for each selected CEA, which led to a sample of m = 1,500 CEAs, 300 in each major geographic region. This definition also resulted from the accumulated experience with samples from household surveys by the Science team and from the evidence of the effects of CEA sample size on the estimates' precision and data collection costs. The number ten could be considered small compared to that adopted in other household surveys, such as the Brazilian Continuous National Household Sample Survey (PNAD Contínua), which selects 14 households (eligible or not) per CEA ^{10}. However, in ENANI2019, it would be difficult to reach 14 eligible households per CEA. Based on an average CEA size of 300 households, considering that the proportion of children under five years of age in 2016 was estimated at 7.2%, besides assuming that each household would have a maximum of one child under five years of age, there would be an expected number of 21.6 eligible households per CEA. Since the CEA sizes vary considerably (above and below the average number of 300 households), and since the above estimate is optimistic, dependent on the hypothesis of one eligible child per household, the target sample size of ten eligible households per CEA appeared reasonable and was adopted.
Allocation of the CEA sample in the selection strata
There are various ways of allocating the CEA sample size among the selection strata. At one extreme, there is proportional allocation, which ensures that the sample size in each stratum is proportional to its population, with the disadvantage of concentrating the sample in the strata with the largest population. The other extreme is equal allocation (as among the major geographic regions), which ensures that the margin of error (or sampling precision) is similar across strata, but only recommendable when the strata are estimation domains. Finally, between these extremes, there is power allocation, which ensures a certain proportionality between the sample size in the stratum and a power p (0 < p < 1) of its population. The larger the power p, the more closely power allocation approximates proportional allocation, and the smaller the power p, the closer it gets to equal allocation.
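Expression 3 presents the form of power allocation ^{11} used to define the sample size of CEAs for each selection stratum h within each major geographic region:
m_{h} = m_{R} × pop_{h}^{p} / \sum_{h' ∈ R} pop_{h'}^{p}    (3)
where m_{h} is the number of CEAs allocated to stratum h; m_{R} = 300 is the number of CEAs allocated to the major geographic region R that contains stratum h; and pop_{h} represents the population under five years of age in stratum h, estimated for July 1, 2016, as previously indicated in Table 1.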
The Science experience in household sample surveys led to the use of a power allocation with p = 1/3, which displays a certain proportionality with the stratum's population, without allowing excessive concentration in the more highly populated strata.
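A minimal sketch of power allocation with p = 1/3 follows. The stratum populations are hypothetical, and the largest-remainder rounding rule is an assumption made here so that the counts add up to the regional total; the source does not say how fractional allocations were rounded.

```python
def power_allocation(populations: dict[str, float], total: int, power: float) -> dict[str, int]:
    """Allocate `total` units across strata proportionally to population ** power.

    Rounding uses the largest-remainder rule (an assumption of this sketch).
    """
    scores = {h: pop ** power for h, pop in populations.items()}
    score_sum = sum(scores.values())
    exact = {h: total * s / score_sum for h, s in scores.items()}
    alloc = {h: int(x) for h, x in exact.items()}
    leftover = total - sum(alloc.values())
    # Hand the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda h: exact[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc


# Hypothetical under-five populations for three strata of one region.
example = {"capital_A": 120_000, "capital_B": 40_000, "other_municipalities": 900_000}
cea_by_stratum = power_allocation(example, total=300, power=1 / 3)
```

With a power below one, the largest stratum receives a smaller share of the 300 CEAs than its share of the population, which is the damping effect the text describes.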
For the strata of “other municipalities” in the five major geographic regions, the definition of the number of CEAs to select in each municipality determined the number of municipalities to select in each of these strata. The decision was to select five CEAs per municipality in all the major geographic regions, except in the North, where eight CEAs were selected per municipality. This larger number of CEAs per municipality in the stratum “other municipalities” in the North of Brazil allowed reducing the number of selected municipalities. The North of Brazil has huge difficulties involving access and traveling time from the municipalities to their respective state capitals. In most municipalities, the traveling time could prevent taking blood samples and increase the study's costs.
Table 2 Size of sample of census enumeration areas and households for Brazil and according to major geographic regions, selection strata, and municipalities. Brazilian National Survey on Child Nutrition (ENANI2019).

Sample selection methods in the various stages
When the municipality was the PSU (block 2, strata “other municipalities” of the major geographic regions), it was selected by systematic sampling with probabilities proportional to size (PPS), using as the measure of size the population under five years of age in the municipality, estimated for July 1, 2016.
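Systematic PPS selection can be sketched as follows. This is a generic textbook version, not the study's actual routine: the function name, the random start, and the requirement that no unit be larger than the sampling interval are assumptions of the sketch.

```python
import random


def systematic_pps(sizes: list[float], n: int, rng: random.Random) -> list[int]:
    """Return the indices selected by systematic PPS sampling.

    Assumes every size is smaller than the sampling interval sum(sizes) / n,
    so that no unit can be selected more than once.
    """
    step = sum(sizes) / n
    start = rng.random() * step
    selected, cumulative, idx = [], 0.0, 0
    for k in range(n):
        point = start + k * step
        # Advance to the unit whose cumulated size interval contains the point.
        while idx < len(sizes) - 1 and cumulative + sizes[idx] <= point:
            cumulative += sizes[idx]
            idx += 1
        selected.append(idx)
    return selected


# Hypothetical under-five populations of eight municipalities in one stratum.
sizes = [10, 20, 30, 40, 50, 60, 70, 80]
sample = systematic_pps(sizes, n=3, rng=random.Random(1))
```

Each unit's chance of selection is n times its share of the total size, which is the inclusion probability used later when the weights are built.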
Since lower-income CEAs were expected to have more eligible children than the higher-income CEAs, care was taken for the sample to cover the range of the population's income in the selected municipalities, guaranteeing different children's feeding patterns in the study population. Thus, before the CEAs' selection, an additional stratification was performed, based on quartiles of the distribution of the average head-of-household's income in each CEA, according to the 2010 Population Census. Next, the numbers of CEAs to be selected in each income stratum were allocated. Finally, within each municipality and income stratum, CEAs were selected by Pareto's PPS sampling ^{12,13}. The size measure for CEA sampling was the number of children under five years of age in the CEA, based on the 2010 Population Census, the most recent source of information available per CEA at the time of the survey.
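For readers unfamiliar with Pareto PPS sampling, a minimal sketch is given below. It uses the standard ranking-variable formulation of this order-sampling scheme; the source names the method but does not spell out the formula, so the details, the function name, and the example sizes are assumptions.

```python
import random


def pareto_pps(sizes: list[float], n: int, rng: random.Random) -> list[int]:
    """Select n indices by Pareto PPS (order) sampling.

    Assumes positive sizes and target inclusion probabilities
    n * size / total strictly below 1 for every unit.
    """
    total = sum(sizes)
    targets = [n * s / total for s in sizes]
    if any(t >= 1 for t in targets):
        raise ValueError("a unit is large enough to be taken with certainty")
    ranked = []
    for i, lam in enumerate(targets):
        u = rng.random()
        # Ranking variable: odds of the random number over odds of the target.
        ranked.append(((u / (1 - u)) / (lam / (1 - lam)), i))
    # The n units with the smallest ranking values form the sample.
    return sorted(i for _, i in sorted(ranked)[:n])


# Hypothetical counts of children under five in eight CEAs of one income stratum.
cea_sizes = [10, 20, 30, 40, 50, 60, 70, 80]
cea_sample = pareto_pps(cea_sizes, n=3, rng=random.Random(2))
```

Unlike systematic PPS, this scheme always yields a fixed sample size from independent random numbers, and its inclusion probabilities match the targets only approximately.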
The adopted selection scheme prioritized the CEAs' stratification by income and did not consider stratification by the urban-versus-rural situation. In this sense, the participation of rural CEAs in the sample would be approximately proportional to that observed in the municipalities. However, due to the logistic difficulty of household blood sample collection and the samples' transportation to the local laboratory for processing, 46 rural CEAs which were more than two hours' travel time from the municipal center (time interval greater than allowed by the study's protocol for collection and transportation of blood samples) were replaced by closer CEAs. Later, as a function of the blood sample logistics, another 11 rural CEAs were also replaced during data collection. The implication of these operational restrictions of the blood sample collection and processing was the small presence of rural CEAs in the sample (only 32 rural CEAs among the 1,392 CEAs with data collected), resulting in estimates with a low level of precision for this setting.
The selection of eligible households within each selected CEA used inverse sampling ^{14,15,16} during the data collection operation.
The collection began with the identification of selected CEAs (maps, descriptions, limits, and areas of exclusion, and the list of addresses in the National Registry of Addresses for Statistical Purposes, CNEFE, all available on the IBGE website). This was followed by updating the registry of addresses per CEA via the Census Tract Address Updating System (SAES), an app developed by Science and operated via the mobile data collection device (MDC). At this time, the interviewers canvassed each selected CEA, conducting the confirmations, corrections, inclusions, and exclusions of addresses for the buildings found along the way. Each identified building was classified as either a household (private or collective) or an establishment.
In each selected CEA, having concluded the update of the address registry, the SAES numbered the addresses classified as private households (PH) sequentially, starting with one, according to the order of the path taken by the interviewer in the CEA. Then, selection tables were used to generate a random permutation of the PH by blocks of ten for the CEA's addresses (in each block, the ten PH were placed in the order of the path to facilitate the interviewer's movement). The interviewer's MDC displayed the first 20 addresses (in random order) to be visited to define the household's eligibility and obtain (if eligible) the family's consent to conduct the interview.
For each selected household in which the visit and contact did not result in an interview (ineligible PH, vacant PH, refusal, etc.), the data control app installed in the MDC added a new address to the list of PH addresses to be visited. This procedure ended when ten complete interviews had been obtained in the CEA or when all PH in the CEA had been visited. Thus, in each eligible interviewed household, information was collected on all the resident children under five years of age.
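The stopping rule described above can be illustrated with a simplified simulation. The sketch assumes a plain random order of addresses and treats every eligible household as a completed interview; the block-of-ten permutation, refusals, and closed households are deliberately left out, and all names are hypothetical.

```python
import random


def inverse_sample(eligible_flags: list[bool], target: int, rng: random.Random) -> tuple[int, int]:
    """Visit addresses in random order until `target` interviews are completed
    or the address list is exhausted. Returns (addresses visited, interviews).
    """
    order = list(range(len(eligible_flags)))
    rng.shuffle(order)
    visited = interviews = 0
    for address in order:
        if interviews == target:
            break
        visited += 1
        if eligible_flags[address]:
            interviews += 1
    return visited, interviews


# A hypothetical CEA with 300 private households, 22 of them eligible.
flags = [True] * 22 + [False] * 278
visited, interviews = inverse_sample(flags, target=10, rng=random.Random(3))
```

The number of addresses visited is random while the number of interviews is fixed in advance, which is why both counts enter the household selection probabilities.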
Probabilistic sampling scheme
The probability of inclusion in the sample of municipality i in stratum h, represented by P(M_{hi}), depends on whether it was included with certainty in the sample (making it a selection stratum) or was a PSU in one of the “other municipalities” strata, as indicated in Expression 4:
P(M_{hi}) = 1, if municipality i was included in the sample with certainty; P(M_{hi}) = t_{h} × pop_{hi} / \sum_{i'=1}^{T_{h}} pop_{hi'}, if municipality i belongs to one of the “other municipalities” strata    (4)
where pop_{hi} represents the population under five years of age in municipality M_{hi}, estimated for July 1, 2016, by the linear trend method ^{7}; T_{h} represents the total number of municipalities in stratum h; and t_{h} represents the size of the sample of municipalities in stratum h.
The conditional probability of inclusion in the sample of CEA j in municipality i in stratum h, conditioned by the selection or inclusion of municipality M_{hi}, represented by P(S_{hij} | M_{hi}), is indicated by Expression 5:
P(S_{hij} | M_{hi}) = t_{hi} × dom_{hij} / \sum_{j'=1}^{T_{hi}} dom_{hij'}    (5)
where dom_{hij} represents the number of households in CEA S_{hij} according to the 2010 Population Census; T_{hi} represents the total number of CEAs in income stratum g, to which CEA j of municipality M_{hi} belongs; and t_{hi} represents the sample size of CEAs in income stratum g, to which CEA j of municipality M_{hi} belongs.
The sum of households in the CEAs was calculated in the set of CEAs belonging to each income stratum g in the municipality.
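The first two stages can be sketched numerically as below. The sketch assumes the usual PPS form (sample size times the unit's share of the size measure) and combines the stages by multiplication; the function names and the numbers are hypothetical.

```python
def municipality_prob(pop_hi: float, stratum_pops: list[float], t_h: int, certainty: bool = False) -> float:
    """Inclusion probability of a municipality: 1 if taken with certainty, else PPS."""
    if certainty:
        return 1.0
    return t_h * pop_hi / sum(stratum_pops)


def cea_conditional_prob(dom_hij: float, income_stratum_doms: list[float], t_hi: int) -> float:
    """Conditional inclusion probability of a CEA within its income stratum (PPS)."""
    return t_hi * dom_hij / sum(income_stratum_doms)


def cea_prob(p_municipality: float, p_cea_given_municipality: float) -> float:
    """Unconditional inclusion probability of a CEA: product of the two stages."""
    return p_municipality * p_cea_given_municipality


# Hypothetical stratum with three municipalities and an income stratum with three CEAs.
p_m = municipality_prob(1000, [1000, 3000, 6000], t_h=2)
p_s = cea_conditional_prob(300, [100, 300, 600], t_hi=1)
p_cea = cea_prob(p_m, p_s)
```

For a municipality included with certainty the first factor is one, so the CEA's probability reduces to its conditional probability within the municipality.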
Thus, the probability of inclusion in the sample of CEA S_{hij} is expressed by:
P(S_{hij}) = P(M_{hi}) × P(S_{hij} | M_{hi})    (6)
In CEA S_{hij}, the conditional probability of interviewing household D_{hijk} is expressed by:
where represents the number of private households in CEA S_{hij} obtained after updating the CEA's address registry, performed at the time of the study; v_{hij} is the total number of eligible private households selected and visited in CEA S_{hij}; and e_{hij} represents the total number of households interviewed in CEA S_{hij}.
Thus, the probability of inclusion in the sample of household D_{hijk} is expressed by:
P(D_{hijk}) = P(S_{hij}) × P(D_{hijk} | S_{hij})    (8)
Sample weighting
The objective of this stage was to calculate and assign sampling weights to the children to allow estimating target parameters in the study population as a whole and for specific target analyses. Good sampling weights allow unbiased estimation for the target population parameters, compensating for nonresponse effects (of units) and estimating with efficiency (small margin of error). The guide proposed by Valliant & Dever ^{17} was followed in the elaboration of the study's final sampling weights.
Since the study sample was stratified and clustered with unequal selection probabilities, it was necessary to calculate and use sampling weights for each of the households interviewed to allow unbiased estimation of target parameters in the population. The sampling weights were calculated in three or four stages, depending on the set of target information. The sampling weights were all calibrated to known population totals, seeking to correct typical biases in household samples and biases resulting from potential differential nonresponse or due to other difficulties faced while conducting the study.
Basic sampling weights were obtained in the first stage, corresponding to the inverse probabilities of inclusion of interviewed households. The basic weights for the households were calculated with Expression 9:
w_{hijk} = 1 / P(D_{hijk})    (9)
To better control the estimates' variability, the basic weights received upper truncation at 10,000 (that is, weights greater than 10,000 were trimmed to this value). This type of treatment is frequently used when the basic weights vary widely ^{18}.
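The weight construction and trimming step can be sketched in a few lines. The cap of 10,000 is the value quoted in the text; the function name and the example probabilities are hypothetical.

```python
def basic_weight(inclusion_prob: float, cap: float = 10_000.0) -> float:
    """Inverse-probability weight with upper truncation at `cap`."""
    if not 0 < inclusion_prob <= 1:
        raise ValueError("inclusion probability must lie in (0, 1]")
    return min(1.0 / inclusion_prob, cap)


# Hypothetical household inclusion probabilities.
weights = [basic_weight(p) for p in (0.25, 0.0005, 0.00002)]
```

Trimming trades a small bias for a reduction in the variance that a few very large weights would otherwise induce.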
The household's basic weight is applied to all the data obtained since no selection is made among the resident children. Therefore, the basic weight for all the children was set equal to their household's basic weight. The basic weight calculated with Expression 9 underwent two or three adjustment stages, depending on the set of target variables for the analysis.
The study's data collection was interrupted on March 17, 2020, due to the adoption of social distancing measures in response to the COVID-19 pandemic. As a result, the sample of CEAs was not collected in its entirety: collection was concluded in most strata and PSUs, but not in all of them. The basic weights were therefore adjusted by Expression 10, in which A_{hi} is the set of CEAs sampled in PSU i in stratum h, and C_{hi} is the set of CEAs collected in PSU i in stratum h.
To simplify the presentation of the following stages of the weights' adjustment, we change the notation and omit the stratum, municipality, and CEA indices, which are not needed to understand the expressions and calculations of the adjustment factors in the subsequent stages.
In the absence of nonresponse, the population total for a study variable y, denoted Y = \sum_{k ∈ U} y_{k} (the sum over all units k in the study population U), could be estimated without bias using the Horvitz-Thompson estimator ^{19}, as given by Expression 11:
\hat{Y} = \sum_{k ∈ s} w_{k} y_{k}    (11)
where w_{k} is the adjusted basic weight of unit k, obtained by Expression 10 at the end of stage 1, and s is the set of units in the sample.
Likewise, the population average \bar{Y} = Y / N, where N is the population's size, would be estimated using the Hájek estimator ^{20}, as shown in Expression 12:
\hat{\bar{Y}} = \sum_{k ∈ s} w_{k} y_{k} / \sum_{k ∈ s} w_{k}    (12)
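The two estimators can be sketched directly from their definitions: the weighted sum for a total, and the ratio of the weighted sum to the sum of weights for a mean. The function names and the toy data are hypothetical.

```python
def horvitz_thompson_total(values: list[float], weights: list[float]) -> float:
    """Weighted sum of the sampled values: estimates the population total."""
    return sum(w * y for w, y in zip(weights, values))


def hajek_mean(values: list[float], weights: list[float]) -> float:
    """Weighted total divided by the sum of weights: estimates the population mean."""
    return horvitz_thompson_total(values, weights) / sum(weights)


# Toy sample of three units with their sampling weights.
y = [2.0, 4.0, 6.0]
w = [10.0, 20.0, 30.0]
total_hat = horvitz_thompson_total(y, w)
mean_hat = hajek_mean(y, w)
```

The Hájek form does not require the population size N, since the sum of the weights serves as its estimate.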
As in any study, the ENANI2019 sample presented both unit and item nonresponse that need to be compensated for in the analyses. Therefore, imputation was used to compensate for item nonresponse for most of the variables.
Laboratory analyses of blood samples showed unit nonresponse (lack of measures for all the biomarkers) and item nonresponse (lack of measures for some subset of biomarkers), as observed in Castro et al. ^{5}. Considering the nature of these measurements, we decided that it was not possible to compensate for nonresponse in this set of variables using imputation. To compensate for this nonresponse, adjustments were made to the children's basic weights via the following steps.
Step 1: 25 groups of children were created with available responses for different subsets of variables in the blood biomarker data.
Table 3 Number of children 6 to 59 months of age with results of blood biomarkers, according to age group. Brazilian National Survey on Child Nutrition (ENANI2019).

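Step 2: for each group r, using the dummy variable that indicates an available response in the group, a logistic regression model was fitted for the probability of response, defined in Expression 13:
p_{rk} = P(R_{rk} = 1 | x_{k}) = exp(x_{k}'θ) / [1 + exp(x_{k}'θ)]    (13)
where R_{rk} is the dummy variable indicating whether child k has an available response in group r, x_{k} is a vector with selected predictive variables for explaining the propensity to respond, and θ is a vector of parameters to be estimated.
The fitted model was used to obtain estimates of response probabilities in group r, as shown in Expression 14:
\hat{p}_{rk} = exp(x_{k}'\hat{θ}) / [1 + exp(x_{k}'\hat{θ})]    (14)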
The predictive variables considered in the fitted models in all the response groups were the same and are listed in Box 1. The selection of variables for inclusion in these models was based on a set of potentially relevant predictors for explaining the pattern of responses to groups of blood biomarkers, including characteristics of the region, households, and children. Next, initial models were fitted to the data, followed by step-by-step inclusion of new predictors until reaching the set of variables with significant and relevant main effects. No models were tested for interactions between predictors.
Box 1 Predictive variables used to model the probability of response for each group of blood biomarkers. Brazilian National Survey on Child Nutrition (ENANI2019).

Step 3: for each group of records with an available response, the inverse estimated probability of response in the group was used as a factor to correct the child's basic weight, obtaining adjusted weights according to Expression 15:
w_{rk} = w_{k} / \hat{p}_{rk}    (15)
where \hat{p}_{rk} is the estimated probability of response of child k in group r.
Since 25 groups of children were formed with different sets of available variables in the section on blood biomarkers, there are 25 sets of weights adjusted for nonresponse. In addition to the basic weight recommended for all the other analyses, each child will have a specific weight for each of these 25 sets. For each set of variables in which the child presents a complete response in all the variables, the corresponding weight is positive. It is null in case of nonresponse in at least one variable in the set of target variables. The data analysts will be responsible for selecting the adequate weights for the analyses that include blood biomarkers.
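The adjustment in steps 2 and 3 can be sketched for a single child and a single response group. The coefficient vector is taken as already fitted and its values are hypothetical, as are the function names; fitting the logistic model itself is outside this sketch.

```python
import math


def response_probability(x: list[float], theta: list[float]) -> float:
    """Logistic response propensity for predictor vector x and coefficients theta."""
    linear = sum(xi * ti for xi, ti in zip(x, theta))
    return 1.0 / (1.0 + math.exp(-linear))


def adjusted_weight(basic: float, responded: bool, x: list[float], theta: list[float]) -> float:
    """Basic weight divided by the estimated propensity; null for nonrespondents."""
    if not responded:
        return 0.0
    return basic / response_probability(x, theta)


# Hypothetical fitted coefficients (intercept plus one predictor) and one child.
theta_hat = [0.0, 0.0]
w_respondent = adjusted_weight(100.0, True, [1.0, 2.0], theta_hat)
w_nonrespondent = adjusted_weight(100.0, False, [1.0, 2.0], theta_hat)
```

Children with a low estimated propensity receive larger adjusted weights, so that respondents stand in for similar children whose biomarker results are missing.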
The final stage in the adjustment of the basic weights was calibration. The basic idea of calibration is to estimate factors f_{k} (called calibration factors) that multiply basic weights to generate the calibrated weights. These factors have the property of eliminating differences between estimates obtained with the calibrated weights and the corresponding population totals (known from other sources) for a set of ancillary calibration or post-strata variables ^{21,22}. Calibration helps compensate for children's total nonresponse, seeking to mitigate the effects of differential nonresponse that can affect estimates derived from the sample.
Calibration in ENANI2019 employed total populations of children for 60 post-strata defined by cross-classifying the following variables: major geographic region (5 classes), sex (2 classes), and age (6 classes: 0 to 5 months; 6 to 11 months; 1 year; 2 years; 3 years; 4 years).
The subdivision of children under one year in two age classes for calibration purposes was necessary given the rules for applying part of the questionnaire and collecting blood samples: only children six months of age or older had blood samples drawn. Therefore, to avoid the need to use different population totals for calibration of the principal weights and the weights for groups of blood biomarker variables, the calibrations of all the weights to the two age classes for children under one year were considered separately. The totals used for the calibration are population projections by IBGE for January 1, 2020, disaggregated by major geographic region, sex, and five groups of individual ages in years. To obtain the totals for the two age groups under one year, the IBGE projections for children under one year were divided by two.
In calibration of weights, the objective is to minimize the distance (Expression 16):
between the calibrated weights (f_{k}w_{k}) and the weights one wishes to calibrate (w_{k}), simultaneously complying with two sets of restrictions:
and
where U represents the set of children in the study population; C represents the set of children in the available effective sample; H represents a household with two or more children interviewed; x_{k} is the vector with values for the variables that identify the post-stratum to which the children belong (indicators of cells in the table obtained by cross-classifying major geographic region x sex x age group); \hat{X}_{C} is the estimated total with calibrated weights f_{k}w_{k} for the post-strata; and X is the vector of population totals for the post-strata according to the respective population projections.
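To give a feel for what calibration to post-stratum totals does, the sketch below computes simple post-stratification factors, that is, the known population total of each cell divided by the sum of weights in that cell. It is a deliberate simplification: it ignores the restriction that children of the same household share a factor, and it does not reproduce the study's distance function. Names and numbers are hypothetical.

```python
from collections import defaultdict


def poststratification_factors(weights: list[float], cells: list[str],
                               population_totals: dict[str, float]) -> list[float]:
    """Factor per unit: known cell total divided by the weighted count in the cell."""
    weighted = defaultdict(float)
    for w, c in zip(weights, cells):
        weighted[c] += w
    return [population_totals[c] / weighted[c] for c in cells]


# Three children in two hypothetical post-strata.
w = [100.0, 100.0, 50.0]
cells = ["North/girl/1y", "North/girl/1y", "North/boy/1y"]
totals = {"North/girl/1y": 300.0, "North/boy/1y": 100.0}
factors = poststratification_factors(w, cells, totals)
calibrated = [f * wk for f, wk in zip(factors, w)]
```

After the adjustment, the calibrated weights in each cell add up to the cell's population total, which is the first of the two restrictions described in the text.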
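The estimator using calibrated weights for totals is expressed as:
\hat{Y}_{C} = \sum_{k ∈ C} f_{k} w_{k} y_{k}
and the corresponding estimator for population means is expressed as:
\hat{\bar{Y}}_{C} = \sum_{k ∈ C} f_{k} w_{k} y_{k} / \sum_{k ∈ C} f_{k} w_{k}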
The calibrated weights should be used in all the analyses, not only with the children's data but also with other data, such as those of the households, the children's parents or guardians. Calibration of weights in the way described here is called “integrated household weighting”, ensuring that all the units (children, etc.) from the same household have equal weights ^{23}.
This statement applies to basic weights but not to weights for the groups of blood biomarker variables. In this case, if the household has children under six months and children over six months of age, the former will have null weights since they did not participate in this part of the survey. As mentioned, for groups of blood biomarker variables, it will always be up to the data analyst to select the adequate weight for each analysis.
To estimate variances, it is recommended to use a combination of the ultimate cluster and linearization methods ^{9}, as implemented, for example, in the survey package ^{24} of R software (http://www.r-project.org).
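The source points to the R survey package for this task. Purely to illustrate what the ultimate cluster idea computes, the sketch below estimates the variance of an estimated total from the weighted PSU totals within each stratum, under the usual with-replacement approximation. It covers totals only (means and ratios would additionally need linearization), and the function name and data layout are assumptions.

```python
from collections import defaultdict


def ultimate_cluster_variance(records: list[tuple[str, str, float, float]]) -> float:
    """Variance of an estimated total from (stratum, psu, weight, value) records.

    Uses the between-PSU variability of weighted PSU totals within strata.
    """
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, weight, value in records:
        psu_totals[stratum][psu] += weight * value
    variance = 0.0
    for totals in psu_totals.values():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        mean = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((z - mean) ** 2 for z in totals.values())
    return variance


# One hypothetical stratum with two PSUs; PSU "a" has two sampled children.
data = [("h1", "a", 1.0, 4.0), ("h1", "a", 1.0, 6.0), ("h1", "b", 2.0, 7.0)]
var_total = ultimate_cluster_variance(data)
```

Only the first-stage units enter the formula, so the variability of later stages is captured implicitly through the PSU totals.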
Effective sample
Even samples with optimal planning may undergo adjustments during data collection for various reasons. Although such adjustments are potential sources of bias, in practice they are unavoidable. In the specific case of the ENANI2019 sample, the main problem during data collection was the interruption on March 17, 2020, due to the COVID-19 pandemic. Before the interruption, there was a need to make substitutions and inclusions of CEAs in the sample, as described below. Even one entire municipality, Jataí (Goiás State), had to be replaced by the municipality of Luziânia (Goiás State), since it was not possible to find a clinical laboratory that could perform the blood sample collection in Jataí.
As presented in
In the total sample of CEAs, 37 (2.5% of the total planned sample of CEAs) had to be replaced due to difficulties during data collection, either because of the distance to the municipal center (preventing the blood sample collection) or because of difficulties in access to the CEA resulting from civil unrest (drug trafficking, land disputes, etc.). Besides these replacements, 18 CEAs were added to the sample to compensate for 18 CEAs in which data collection exhausted all the PH addresses without producing any interview with an eligible household.
As for the sample of households, 12,524 (83.5%) households were obtained, compared to the expected total of 15,000 eligible households. In addition, data were obtained from 14,558 children, with a loss of only 3% in relation to the expected total of 15,000.
One technique used in the survey was the inverse sampling of households, which functions as “sample screening”. In a sampling process that seeks to locate households with members of a specific population, as in the current survey (children under five years of age), a standard alternative procedure would be to use complete or “census screening”, visiting all the households in the selected CEAs and attempting to determine whether they contained members of the target population. This alternative would involve a higher cost in updating the registry of addresses in the CEAs. In addition, it would represent a stage of creation of a complete registry of eligible households in each CEA. This stage would involve an increase not only in costs but also in time.
By adopting inverse sampling, the screening for eligible households was carried out by sampling and allowed the sample selection and approach for interviewing to occur during the same process of visiting the CEA to locate and visit the selected households. The necessary cost and time for data collection were thus much smaller. An effect of this approach is the selection of large numbers of addresses that lack an eligible household. Still, this cost is much lower than with the alternative approach of registering all the households in each CEA with a visit to verify eligibility.
Overall, it was necessary to select and visit 193,212 addresses in the selected CEAs, resulting in an average of 140 households visited per CEA, with an average of 9 eligible households interviewed per selected CEA. Of all the selected addresses, 75.1% were ultimately ineligible for various reasons, mostly households without children under five years of age.
Table 4 Number of addresses visited, according to the result of visit. Brazilian National Survey on Child Nutrition (ENANI2019).

In addition to the ineligible households, 14.2% of all the selected and visited households were classified as closed at the end of the operation.
ENANI2019 experienced a refusal rate of 35.8%, considering only selected eligible households that were contacted successfully.
Final remarks
ENANI2019 is the first nationwide household survey in Brazil that jointly investigated breastfeeding and complementary feeding practices, individual dietary intake, anthropometric nutritional status, and micronutrient deficiencies in children under five years of age. Determination of the sample size and the methodology used in allocating CEAs in the selection strata allowed the representation of the target population in each major geographic region. The study's fieldwork presented good results compared to other household surveys using the highest sampling standards in Brazil. The results will allow comparisons with previous studies and support strategic decisions on implementing public policies for under-five children.
Acknowledgments
To the field staff and participating families who made this study possible. To the Brazilian Ministry of Health/Brazilian National Research Council (CNPq), process 440890/2017-9.
References
This is an open-access article distributed under the terms of the Creative Commons Attribution License.