Cadernos de Saúde Pública
ISSN 1678-4464
Vol. 33, No. 6
Rio de Janeiro, June 2017
METHODOLOGICAL ISSUES
Correction of measures of association for day-to-day variation in food intake: performance assessment by simulation
Eliseu Verly-Jr, Rosely Sichieri, Valéria Troncoso Baltar
http://dx.doi.org/10.1590/0102311X00173216
Diet Surveys; Food Consumption; Regression Analysis; Public Health Nutrition
Introduction
Studies in nutritional epidemiology frequently aim to describe diet-outcome associations. Estimating the measure of association that describes the relationship between diet and outcome requires knowing each individual's disease status and usual dietary intake. Traditionally, usual intake in large cohorts has been collected with the food frequency questionnaire.
One way of dealing with this error is to apply the regression calibration methodology, which estimates the measure of association using each individual's predicted intake as the dietary exposure variable.
The current study aims to assess the performance of regression calibration for correcting measures of association, in different scenarios, through a simulation study. The simulation is based on a previous study conducted by the authors that collected 20 days of 24hR in a sample of 302 persons. That study provides the necessary parameters for the simulation, allowing the generation of populations with the desired size and number of data collection days.
Methods
Regression calibration
The following notation is used for consistency with the international literature: for individual i (i = 1, …, n) on day j (j = 1, …, k), Rij represents 24hR intake (R: reported intake); Ti represents the usual intake of individual i (T: true intake; unbiased intake measured over a long period of time); and Yi is an outcome associated with Ti. A hypothetical association between Yi and Ti can be described by the following linear regression model:

$$m_1\{E(Y_i \mid T_i, Z_i)\} = \beta_0 + \beta_T T_i + \beta_Z^{\top} Z_i \quad (1)$$
In which Zi = (Zi1, …, Zip) is a vector of covariates, measured without error, for each individual i, and m1 is the link function (in this study, the identity). Since usual intake (Ti) is generally not known, the model uses the mean of a few 24hR for each individual i (Ri) as the dietary exposure variable to obtain the measure of association between the food and the outcome Yi, which leads to a biased (attenuated) estimate of βT. Calibration consists of predicting usual individual intake with a two-part mixed-effects model that uses the intake obtained from the 24hR as the dependent variable and the same set of variables used for adjusting the diet-outcome model (1) as the independent variables. Predicted individual intakes then substitute for Ri to obtain a de-attenuated estimate of the measure of association between the food and the outcome Yi. The complete description of the model can be found in Kipnis et al.
This study applies regression calibration to estimate corrected linear regression coefficients for the relationship between Ri (with information on intake from two or more 24hR for each individual) and Yi obtained by model (1). The corrected coefficients are compared to the real coefficient (βT, with information on intake from 200 24hR for each individual) in each study scenario (described later).
Predicted intake depends on the amounts reported on the selected collection days: for example, prediction using the first and second days differs from prediction using the first and third days. For this reason, 300 combinations of two or more 24hR were selected per individual. For each combination, we performed regression calibration to obtain the corrected measure of association and its confidence interval; these were compared to βT (set at 1.0 in the simulation).
The regression calibration was performed with the mixtran and indivint macros available for SAS (SAS Inst., Cary, USA).
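The attenuation and its correction can be illustrated with a minimal Python sketch. It assumes a much simpler setting than the two-part mixed-effects model fitted by the SAS macros: a food consumed every day, classical additive within-person error, and no covariates. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import random
import statistics as st

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    mx, my = st.fmean(x), st.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate(n=2000, k=2, beta_t=1.0, sd_between=1.0, sd_within=2.0, seed=1):
    """Simulate true intake T_i, k error-prone recalls per person, and outcome Y_i."""
    rng = random.Random(seed)
    t = [rng.gauss(0.0, sd_between) for _ in range(n)]
    recalls = [[ti + rng.gauss(0.0, sd_within) for _ in range(k)] for ti in t]
    y = [beta_t * ti + rng.gauss(0.0, 1.0) for ti in t]
    return t, recalls, y

def calibrate(recalls):
    """Predict usual intake by shrinking each person's mean toward the grand mean."""
    k = len(recalls[0])
    means = [st.fmean(r) for r in recalls]
    grand = st.fmean(means)
    # Within-person variance: average of each person's sample variance
    var_w = st.fmean([st.variance(r) for r in recalls])
    # Between-person variance: variance of person means minus the error share
    var_b = max(st.variance(means) - var_w / k, 0.0)
    lam = var_b / (var_b + var_w / k)   # estimated attenuation factor
    return [grand + lam * (m - grand) for m in means], lam

t, recalls, y = simulate()
naive = ols_slope([st.fmean(r) for r in recalls], y)   # attenuated slope
predicted, lam = calibrate(recalls)
corrected = ols_slope(predicted, y)                    # de-attenuated slope
print(round(naive, 2), round(lam, 2), round(corrected, 2))
```

In this simplified setting the corrected slope equals the naive slope divided by the estimated attenuation factor, so the correction moves the point estimate toward the true value without adding information about any individual's intake.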
Data simulation
Individual intake
A population of 1,000 individuals was simulated with information on intake for 200 days for each individual. The simulation assumed that Ti (the usual intake of individual i) is the product of the mean amount consumed on the intake days (Ai) and the probability of the individual consuming the food (Pi). The amount consumed for each intake day was generated by the equation:
Where Ai ~ Normal(µ, σ²_b); the daily consumption indicator follows a Bernoulli(Pi) distribution; Rij represents the intake of individual i on day j; εij ~ Normal(0, σ²_w); and σ²_b and σ²_w are the inter- and intrapersonal variances, respectively. The random variables (Ai, Pi) were generated from a bivariate distribution, so that the correlation between the amount consumed and the probability of intake was taken into account in the simulation. Since Ai on the original scale is generally right-skewed, the parameters µ and σ²_b of the normal distribution were generated on the Box-Cox scale, chosen because it is the most widely used in food intake studies with skewed data, and subsequently back-transformed to the original scale. To define the days with and without intake, a random variable with Bernoulli distribution was generated using the intake probability defined for each individual (Pi), since the probability of consuming the food varies between persons according to the observed distribution of probabilities in the population. A right-skewed distribution was used because it is the one most frequently observed in real data. The parameters (µ, σ²_b, Pi, and σ²_w), the distributions, and the lambda of the Box-Cox transformation were obtained from data collected in the baseline study. The mean of the 200 intake days was calculated for each individual and taken as the individual's usual intake (Ti).
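The data-generating process described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the daily amount is taken to be the person-level amount plus within-person noise on the Box-Cox scale, back-transformed and set to zero on non-consumption days; the person-level probability is obtained from a logistic transform of a normal variable correlated with the amount. All parameter values, the logistic transform, and the function names are hypothetical and do not come from the baseline study.

```python
import math
import random
import statistics as st

def inv_boxcox(y, lam):
    """Back-transform from the Box-Cox scale; values below the support map to 0."""
    if lam == 0:
        return math.exp(y)
    base = lam * y + 1.0
    return base ** (1.0 / lam) if base > 0 else 0.0

def simulate_intake(n=1000, days=200, mu=8.0, sd_between=2.0, sd_within=3.0,
                    lam=0.3, rho=0.3, seed=42):
    """Return one list of daily intakes (original scale) per simulated person."""
    rng = random.Random(seed)
    population = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        # z2 is standard normal with correlation rho to z1
        z2 = rho * z1 + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        a_i = mu + sd_between * z1                    # person-level amount, Box-Cox scale
        p_i = 1 / (1 + math.exp(-(0.5 + 1.5 * z2)))   # person-level consumption probability
        day_values = []
        for _ in range(days):
            consumed = rng.random() < p_i             # Bernoulli(p_i) indicator for the day
            amount = inv_boxcox(a_i + rng.gauss(0, sd_within), lam)
            day_values.append(amount if consumed else 0.0)
        population.append(day_values)
    return population

# Usual intake taken as each person's mean over all simulated days
usual_intake = [st.fmean(d) for d in simulate_intake(n=200, days=50)]
```

Drawing the amount and the probability from correlated normal variables is one simple way to induce the dependence between how much and how often a food is eaten; other bivariate constructions would serve equally well.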
Covariates
For this population, we generated age values (in years), assuming a normal distribution (mean = 25; standard deviation [SD] = 5), with a correlation of 0.3 with usual intake. Sex distribution was defined as 50% for each sex, in which mean usual intake and mean age were 20 g and 2 years higher, respectively, for men (sex and age were assumed to be error-free).
Outcome
Next, the outcome (Yi) was simulated; its relationship with usual intake (Ti) was specified by the following linear regression model:

$$Y_i = \beta_0 + \beta_T T_i + \beta_Z^{\top} Z_i + \varepsilon_i \quad (3)$$

Where: Yi is the simulated outcome, normally distributed with mean 25 and SD 3 (arbitrarily chosen); β0 is the intercept; the coefficient βT was set at 1.0 for the relationship with usual intake on the original scale, with the sample's power set at 80% to detect βT ≠ 0; βZ is the vector with the effects of the covariates in Zi (1 and 5 for sex and age, respectively); and εi ~ Normal(0, 1). Data were analyzed in the Stata v.13 statistical package (StataCorp LP, College Station, USA).
Scenarios assessed
The following scenarios were tested:
a) Different percentages of the study population answering a second 24hR: 100%, 60%, 40%, and 20%. In each of the 300 combinations of intake days, we selected an intake day for the entire population and a second intake day only for the previously defined percentages.
b) Different numbers of 24hR for each individual in the study population: j = 2, 3, 4, and 5. In each of the 300 combinations of intake days, specific numbers of 24hR were selected for each individual. We also tested a scenario in which different percentages of the population answer different numbers of 24hR: 40% with four 24hR, 30% with three 24hR, and 20% with two 24hR.
c) Different population sizes: 1,000, 600, 300. This item also included a scenario with the necessary sample size calculated to obtain coefficients statistically different from zero: n = 2,400. This size was obtained by simulation, with the lowest value that guaranteed that at least 2.5% of the coefficients were different from zero.
Finally, we compared corrected and uncorrected coefficients and their confidence intervals for the scenario with n = 1,000 and 100% of the sample with the second 24hR.
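The selection of recall days in scenario (a) can be sketched in Python as below. The sketch assumes that, within one combination, the same day indices are drawn for the whole population and that the individuals with a repeated recall are chosen at random; the text does not spell out either detail. The function name and its arguments are hypothetical.

```python
import random

def draw_recall_days(n_people, n_days, k=2, frac_repeat=1.0, rng=None):
    """Pick the day indices observed for each person in one combination.

    Everyone gets the first drawn day; a random `frac_repeat` share of the
    population gets the full set of k days.
    """
    rng = rng or random.Random()
    days = tuple(rng.sample(range(n_days), k))     # k distinct days for this combination
    n_repeat = round(frac_repeat * n_people)
    repeaters = set(rng.sample(range(n_people), n_repeat))
    return [days if i in repeaters else days[:1] for i in range(n_people)]

# 300 combinations for 1,000 people, 200 available days, 40% with a second 24hR
combinations = [
    draw_recall_days(1000, 200, k=2, frac_repeat=0.4, rng=random.Random(c))
    for c in range(300)
]
```

Each element of `combinations` would then feed one run of the calibration model, giving one corrected coefficient and one confidence interval per combination.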
Simulation parameters
All the parameters used in the food intake simulation were obtained from a longitudinal study with 302 participants in the city of Rio de Janeiro, Brazil, in which each participant answered 20 non-consecutive 24hR. The study used a snowball sampling strategy in which the interviewers (23 undergraduate nutrition students) were selected first and later chose the interviewees. To guarantee adherence to the data collection, the interviewees were preferably from the same social circle as the interviewers or lived close to them, and had expressed their willingness to remain in the study and provide detailed information on their food intake 20 times. Although the sample was not random, the participants were dispersed all across the city. Data were collected from March 2013 to April 2014, with a mean follow-up length of three months for each individual. The multiple pass method was used to collect the 24hR intake data ^{15}. During the fieldwork, the interviewers took the first recalls applied to each participant for the initial data check. The reported foods were entered into the Brasil Nutri program, which is based on foods, serving sizes, and preparations reported in a national food acquisition survey.
The study was approved by the Institutional Review Board of the Institute of Social Medicine, State University of Rio de Janeiro.
Results
In the simulated population, mean usual intake was equal to the mean one-day intake (78 g). The SDs were 75 and 136 for usual and one-day intake, respectively. The distribution of usual intake was right-skewed (skewness = 1.39, kurtosis = 5.35), with 9.4% of the population consisting of usual non-consumers (intake equal to zero in the mean of the 200 intake days).
Figure 1 Distribution of simulated usual intake.

Figure 2 Corrected linear regression coefficients and 95% confidence intervals (95%CI) for the study population with 100%, 60%, 40%, and 20% with repetition of the 24-hour recalls (24hR).

Figure 3 Corrected linear regression coefficients and 95% confidence intervals (95%CI) for the study population with 2, 3, 4, and 5 24-hour recalls (24hR) for each individual.

Figure 4 Corrected linear regression coefficients and 95% confidence intervals (95%CI) for populations with 2,400, 1,000, 600, and 300 individuals.

Figure 5 Corrected and uncorrected linear regression coefficients and 95% confidence intervals (95%CI) for the population with 1,000 individuals with 2 24-hour recalls (24hR) for all the individuals.

Discussion
The purpose of calibration is to correct measures of association attenuated by random error, which occurs when few days of 24hR data are collected for each individual in the study population ^{12}. It is thus expected that the corrected measure of association will be as close as possible to the true measure of association, i.e., the measure that would be obtained if each individual's usual intake were known. However, since different combinations of recall days can generate different measures of association, and thus different corrected coefficients, comparison of the corrected and true measures of association should take into account a large number of possibilities of combinations of recall days.
In this sense, the mean of the coefficients obtained from the combinations of two days of recall should be close to the real coefficient; higher or lower means indicate a tendency towards over- or underestimation of the corrected coefficients. Meanwhile, the dispersion of the coefficients indicates their precision. The mean coefficients were very similar across all tested scenarios and varied only narrowly around the real coefficient. However, precision varied according to the scenario, to the point that the sample's power becomes insufficient to detect an association, even when it really exists and the sample size was calculated to detect it. One consequence of random error is a decrease in the sample's power; this loss of power was not restored after correction, since the percentage of significant coefficients was similar in the corrected and uncorrected analyses. In some scenarios precision is even lower. When up to two 24hR are used for each individual, precision decreases markedly when the proportion of individuals with a second 24hR is less than 40%. A previous study had already suggested that a repetition rate between 40% and 60% is sufficient to maintain precision in estimating usual food intake percentiles ^{14}.
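The three summaries discussed here (mean of the coefficients as a check on bias, their dispersion as a measure of precision, and the share of significant coefficients as empirical power) can be computed with a short Python helper. This is a sketch that assumes normal-approximation 95% confidence intervals; the function name and the toy input values are hypothetical and are not results from the study.

```python
import statistics as st

def summarize(coefs, ses, true_beta=1.0, z=1.96):
    """Summarize corrected coefficients across combinations of recall days."""
    n = len(coefs)
    # A coefficient is significant when its CI (b +/- z*se) excludes zero
    significant = sum(abs(b) > z * se for b, se in zip(coefs, ses))
    # Coverage: the CI contains the true coefficient
    covered = sum(abs(b - true_beta) <= z * se for b, se in zip(coefs, ses))
    return {
        "mean": st.fmean(coefs),                  # close to true_beta means little bias
        "sd": st.stdev(coefs),                    # dispersion, i.e. (lack of) precision
        "pct_significant": 100 * significant / n,
        "pct_covering_true": 100 * covered / n,
    }

# Toy values only, to show the call
summary = summarize([0.5, 1.0, 1.5, 2.0], [0.5, 0.5, 0.5, 0.5])
```

A mean near the true coefficient combined with a low percentage of significant coefficients is the pattern described in the text: little bias on average, but not enough precision to detect the association in any single combination.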
An important issue is sample size. This study simulated an outcome whose association with usual intake (mean of 200 intake days) could be detected as statistically significant with 80% power and a sample size of 1,000. Even with n = 1,000, the association was not significant in 53% of the 300 selected combinations of 24hR days. Using the same outcome in smaller samples, especially those with fewer than 300 individuals, the likelihood of finding an association decreases substantially, even when it really exists. The extent to which a reduction in sample size increases the probability of type II error (not rejecting the null hypothesis when it is false) depends on the true coefficient's effect size and precision ^{16}. In a real scenario, in which the effects of diet on health outcomes are usually small ^{17,18}, any loss of precision may decrease the probability of rejecting the null hypothesis, even if it is false. In this simulation, while n = 1,000 would be sufficient to find an association with usual intake, the estimated necessary size when drawing on correction with two or more intake days was 2,400.
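A rough back-of-the-envelope check on the required sample size is possible under a classical additive error model, which is an assumption here and does not strictly hold for an episodically consumed food. A common rule of thumb for small effects is that the sample size needed with a k-day mean is approximately the sample size needed with true intake divided by the reliability of that mean. The Python sketch below applies this rule to the SDs reported in the Results; the function names are hypothetical, and the calculation ignores covariates and the two-part structure of the intake model.

```python
def reliability(var_between, var_within, k):
    """Share of the variance of a k-day mean that is between-person variance."""
    return var_between / (var_between + var_within / k)

def inflated_n(n_true, var_between, var_within, k):
    """Approximate n needed with k recalls to match the power of n_true with true intake."""
    return n_true / reliability(var_between, var_within, k)

var_b = 75 ** 2             # variance of usual intake (SD = 75)
var_w = 136 ** 2 - var_b    # one-day variance (SD = 136) minus between-person variance

for k in (2, 3, 4, 5):
    print(k, round(inflated_n(1000, var_b, var_w, k)))
```

For k = 2 this approximation gives roughly 2,100 individuals, the same order of magnitude as the 2,400 obtained by simulation; the simulation remains the reference, since it accounts for the features that the approximation leaves out.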
Another way of increasing the coefficients' precision is to increase the number of 24hR repetitions for each individual, which was observed in this study when increasing from two to five days of 24hR. Carroll et al. ^{13}, using real and simulated data, found that between four and six days of 24hR for each individual would be sufficient for the majority of the dietary items. Therefore, correction for random error requires a sufficient sample size or number of repetitions to obtain measures of association that represent the real association. Considering the difficulty in obtaining various days of 24hR in large epidemiological studies, one possibility is to collect more repetitions in a subsample. In this simulation, similar results were observed in scenarios in which all the individuals answered four 24hR and in which a subsample answered four, three, and two 24hR.
Importantly, these results refer to a sample of 1,000 individuals with 80% power to detect an association (β = 1) between usual intake and the outcome in a multiple model. Thus, it is not a general recommendation for planning new studies; adequate sample size will depend on each study's objectives, including food variance and the target outcome, the expected effect size, and the covariates in the predictive model ^{19}. In addition, the decision to increase the sample or the number of repetitions should be based on the costs involved in each procedure. The researcher should assess whether the increase in cost and fieldwork time compensates for the gain in precision. A simulation study can assist planning new studies with an estimate of the best combination of sample size and 24hR repetitions, so as to optimize efficiency in data analysis.
Both Carroll et al. ^{13} and Kipnis et al. ^{12} found substantial improvement in the prediction of some items by including intake frequency as a variable in the prediction model; other variables related to food intake such as socioeconomic variables, body mass index, and others, even those not present in the diet/outcome model (equation 3 in the methods section) can be included in the predictive model for usual intake and potentially increase the corrected coefficients' precision.
Importantly, the method only proposes correction for random error; the effects of other types of errors, such as underreporting or differential error, are not reduced. The latter is particularly important in cross-sectional studies and some types of case-control studies in which disease status can interfere with intake reporting and modify the direction of the measure of association (reverse causality) ^{20}, which is not resolved with regression calibration. Finally, the study tested the identity link function because it provides a direct interpretation of the relationship between dietary intake and the outcome. An example of the method's application involves estimating the degree to which blood pressure increases in mmHg for each 1,000 mg of sodium consumed. However, the method can be applied to other link functions such as the log or logit function ^{12}.
In conclusion, correction for random error will produce coefficients close to the true coefficient as long as the sample size or number of repetitions per individual is sufficient to guarantee the estimate's precision. Otherwise, the coefficients may be under- or overestimated, and the likelihood of not finding an association even when it really exists increases. This should be kept in mind when interpreting results in which the coefficient is not statistically significant, since such results probably do not allow concluding that there is no association. Increasing the number of 24hR in at least a portion of the study population has a positive impact on the estimated coefficient's precision.
Acknowledgments
The authors wish to acknowledge the funding from the Carlos Chagas Filho Rio de Janeiro State Research Foundation (Faperj; grant n. E26/201.488/2014) and the Brazilian National Council for Scientific and Technological Development (CNPq; n. 481434/20135).
References