Cadernos de Saúde Pública
ISSN 16784464
37 nº.9
Rio de Janeiro, Setembro 2021
ARTIGO
Prevalência estimada de COVID19 no Brasil com correção probabilística de vieses
Erik Alencar de Figueiredo, Démerson André Polli, Bernardo Borba de Andrade
http://dx.doi.org/10.1590/0102311X00290120
Imunidade Coletiva; Viés de Seleção; Análise Quantitativa
Introduction
The Brazilian National Household Sample Survey  COVID19 (PNADCOVID19) ^{1} is a nationwide, complex, survey aimed at “estimating the number of persons with symptoms associated with the flu syndrome and at following up the impact of the COVID19 pandemic in the Brazilian labor market. The data collection of the Brazilian National Household Sample Survey  PNAD COVID19 began on May 4, 2020, including interviews by telephone in nearly 48,000 households per week, adding up to nearly 193,000 households per month in the entire country. The sample is permanent, i.e., the households interviewed in the first month of data collection will remain in the sample along the next months, up to the end of the survey”. Considering its latest release, in December 2020 (survey end date: December 5th), the survey indicates a total of 22.7% positive results for COVID19, that is, 3.1% of the Brazilian population has tested positive. Official tallies by the Brazilian Ministry of Health are also at 3.1%. Two states are above 7% (confirmed) infection rate: states of Roraima and Amapá both in the North Region of Brazil. Most states have rates between 3% and 5%. According to our results presented by state in the section Results from Probabilistic Bias Analysis, the prevalence of COVID19 in Brazil at the end of 2020 is estimated to be around 15% of the population, slightly more than 30 million individuals.
The number of confirmed cases is well known as being only a fraction of the actual number of infected individuals. Firstly, testing is not always available and most moderate or asymptomatic cases go undetected even when testing is accessible. Secondly, COVID19 testing in Brazil has been carried out with a several testing kits and survey respondents may have used procedures with high rates of false negatives. Finally, and somewhat related to the first issue, there is the problem of selection bias as only those with stronger symptoms will seek medical attention and testing ^{2}^{,}^{3}^{,}^{4}^{,}^{5}. Some earlier studies have indicated that the true number of cases was about 12 times the number of confirmed cases ^{6}^{,}^{7}. The September issue of The Economist^{8} publicized some estimates for the share of the population with COVID19 antibodies obtained by serosurveys in June and the implied ratio of cases to confirmed cases: it was reported that Moscow (Russia) could have 27 times more cases than reported ones, Stockholm (Sweden) 17 times, London (England) 14 times, Madrid (Spain) 10 times, and New York City (United States) 7 times. More recent estimates from Wu et al. ^{9} suggest a ratio of nine times for the United States and that the ratio would vary from three to 20 across the 50 states in the country. In late 2020, studies from the São Paulo University ^{10} and the Federal University of Pelotas ^{11}, in Brazil, indicated a much higher share of antibodies in the population, as high as 60%, close to standard herd immunity thresholds, but recent increases in cases across the country suggest that antibodies may not be as prevalent. In any event, common sense and empirical evidence strongly suggest that the actual number of infected individuals is much higher than the number of confirmed (cumulative) cases.
For estimating the prevalence of COVID19 in Brazil at a given point in time we can resort to at least three major approaches. Firstly, there are simple analyses using data from other years on similar respiratory diseases and some demographical considerations: for instance, Ribeiro et al. ^{7} concluded in May 2020 that 3.8 times more hospitalizations in Brazil due to COVID19 were identified than reported by analyzing hospitalization patterns of acute respiratory distress syndrome between 2012 and 2019 as compared to 2020.
Another approach is based on dynamic mathematical models, such as SEIR models, which have been made widely available for COVID19. Such models yield projections that are higher than what is observed as illustrated by the compilation of four popular models in OurWorldInData.org ^{12}: recent estimates regarding SARSCoV2 from the SEIR model by Youyang Gu (YYG) indicate that Brazil has reached 16.9% infection rate, eight times the confirmed cases. The model from the London School of Hygiene & Tropical Medicine (LSHTM) estimates that between 28% and 42% of symptomatic cases are unreported. The models by the Imperial College London (ICL) and the Institute for Health Metrics and Evaluation (IHME) place estimates at much higher values, especially around June.
A third approach, adopted by Wu et al. ^{9} who developed a bias correction model for the estimation of COVID19 cases, is to perform a quantitative bias analysis ^{13}. This approach is entirely databased and does not aim to model transmission mechanisms or dynamics. Quantitative bias analysis aims to quantify the effects of systematic error (due to selection bias, unmeasured confounders, information bias, etc.) on estimates derived from nonrandomized epidemiologic studies.
The next section briefly describes the approach of bias correction developed by Wu et al. ^{9} and reproduced in this study. Section Results from Probabilistic Bias Analysis provides the results we obtained for Brazil using the national survey mentioned above (PNADCOVID19) and Section Discussion concludes with a summary and final remarks briefly discussing the related issue of herd immunity.
Methods
Probabilistic bias analysis by Wu et al.
Along the lines of probabilistic bias analysis ^{13}, Wu et al. ^{9} have developed a simulationbased bias correction method aimed to adjust count estimates on confirmed cases for selection bias (preferential testing of moderate/severe cases) and imperfect test accuracy. Despite additional computational cost, empirical, simulationbased methods provide the flexibility needed for the kind of multiple correction desired. In this study, the simple estimate based on confirmed cases is biased away from the true value due to preferential testing and imperfect test accuracy. The parameters affecting the distributions used to correct for bias are treated as random variables and, hence, the procedure is known as probabilistic bias analysis. Even though the only modification we have made to the method proposed by Wu et al. ^{9} is the selection of (hyper)parameter values in the prior models (their specifications reflect the U.S. reality), we explain their method for completeness.
More specifically, we want to estimate N* which is the number of cases in the population (for each of the 27 Brazilian states, including the Federal District). The starting point, which is reported in surveys or official reports, is , the number of confirmed cases among tested individuals which we identified to be just a fraction of N* due to selection bias and imperfect test accuracy. These two figures are connected by an epidemiological identity which provides a correction for imperfect test accuracy ^{13},
where: N is the population size (known), is the number of infected individuals not tested, , and S _{e } and S _{p } are test sensibility and specificity, respectively.
The value of is unknown. It may be obtained as the sum of the number of untested individuals who have moderate or severe symptoms and would result positive if tested,
and the number of untested individuals who have mild or no symptoms and would result positive if tested,
The above expressions provide correction for incomplete testing. Following Wu et al. ^{9}, we considered them to be binomially distributed with size 16784464csp3709e00290120eq5.jpg and success probability equal to , j = 0 or 1. In simulations, these two quantities are held fixed at their mean values since their variability is negligible compared to other sources of uncertainty and the population size is large ^{9}. However, the probabilities involved in the above expressions are unknown. They are modelled and simulated: thus the set of parameters for which prior information must be provided contains the probability of having moderate to severe symptoms among tested individuals, and also , and the probability of having mild symptoms among positive cases, . It also contains the sensibility and the specificity of testing procedures used for COVID19 and the ratios α and β which refer, respectively, to and divided by .
A probabilistic identity is used to coherently connect these probabilities,
where the ratios α and β have been defined above. Because priors are specified on both sides of the equation above, a sample for the vector is obtained by a technique known as Bayesian melding ^{14}. Bayesian melding is a procedure that uses the fact that one parameter can be expressed as a deterministic function of other parameters. It can be useful to simplify complex models when a deterministic relation is present. In this sense, it is enough to define the prior for one of them with the other being fully determined by the deterministic relation among them.
Simulations
Considering the above framework, seven quantities are present in the probabilistic bias analysis just described whose uncertainties must be assessed with simulations.
For each state, the empirical estimate (cumulative number of cases divided by the cumulative number of tests) from the latest release of PNADCOVID19 is fixed. The ratio of tests performed to the state population ranged from 9% to 26% across the 27 states and the point estimates of positive rates range from 18% to 50 across states.
With those probabilities fixed the relevant quantities are simulated and a distribution of expected cases is obtained for each state as described next.
Table 1 Prior specifications for truncated beta models used in the probabilistic bias analysis.

Finally, a decomposition of the two sources of biases, incomplete testing and imperfect test accuracy, can be obtained through
and , where p 1 is attributable to the inaccuracies of testing and its complement to incomplete testing.
Initially, 10^{4} values are sampled from the distributions of , , α, β, S_{e } and S_{p } with these six variables independent and identically distributed across states. Then, values of and are simulated based on and simulated values of α and β. Despite the theoretical possibility that some parameters may be correlated between some states, we and Wu et al. ^{9} do not have robust evidence to inform an appropriate model of correlation structure.
In Wu et al. ^{9}, and in this study, Bayesian melding is used to relate the components of . As stated by Poole & Raftery ^{15} the use of pooling weight to be ½ makes the combined prior distribution of and to be the geometric mean of each prior distribution. We do not have any evidence to support any of the prior distributions, therefore using ½ as pooling weight is the natural choice and it is employed in this study. Bayesian melding is performed with 10^{5} iterations of samplingimportanceresampling algorithm (SIR); a simulation size greater than 10^{4}, since this stage involves a more complex generator as opposed to independent univariate sampling. These simulation sizes were also used by Wu et al. ^{9}.
After all samples are simulated, point estimates are obtained as sample medians. The analysis does not involve a probability and only the quantities of interest are sampled. No sampling of likelihood parameters was necessary.
Before we report our results based on the methodology described, we argue that a suitable way to estimate the number of infected individuals is to try to emulate a natural experiment based on the data provided by PNADCOVID19. Many employers and local governments have requested mass testing for certain groups of individuals. We took a subsample from the national survey considering only working individuals, aged from 18 to 60 years old who have declared absence of previous COVID19like symptoms. This should partially eliminate selection bias. By calculating the percentage of those individuals who tested positive we reached 14.3%.
Results from probabilistic bias analysis
We will report our results in terms of percentage of the population infected by COVID19 (prevalence),
and the associated correction factor F,
Figure 1 Confirmed cases and estimated percentage of infected individuals by Brazilian state.

Table 2 Estimated prevalence and corresponding 95% credibility bounds.

Figure 2 Estimated correction factors for Brazilian states.

Finally, our results indicate a strong linear association between correction factors and testing coverage in the loglog scale
Figure 3 Approximate linear relationship (loglog) between correction factors and testing coverage.

Therefore, we can estimate F for a given region, say a large city, which has not been directly targeted by the survey. For instance, if 5% of the population has been tested then we estimate that F ≈ 9.0; for 10% coverage we estimate F ≈ 4.9 and for 20% testing we should obtain F ≈ 2.7.
Discussion
Estimating the number of infected individuals during a pandemic is essential to decisionmakers and researchers. National surveys such as PNADCOVID19 have been a significant source of information to adjust official counts of reported cases for different biases. Our use of the simulation methods developed by Wu et al. ^{9} show that Brazilian official estimation is about five times lower than the expected number of cases. It must be understood that this number is static and represents the magnitude of underreporting in August and that this number is likely to have been higher in the peak of the pandemic. We have presented national and state figures but for practical policy use, local (municipality level) estimates may be necessary.
The city of Manaus is located in Brazil and is the largest city in the state of Amazonas which accounts for slightly more than half of the state's population. The COVID infection in this region presented two peaks (with peaks in late July/2020 and in midMarch/2021) and has thus attracted much attention as an example of a largely unmitigated epidemic. A recent study ^{10} has estimated that, by October of 2020, the cumulative incidence in Manaus was 76% (95%CI: 67% to 98%), whereas our estimate for the state stands at 21% (95%CI: 15% to 26%). The fact that, in March, the city experienced a stronger second peak suggests that the 76% estimate is, most likely, unreliable and too high. Considering such high infection rate, a strong second peak would have to be assigned to the new variant found in Manaus (P.1 lineage) and to antibody waning. Reinfection would have then been very common which, as far as we know, has not been well documented. Despite these possibilities we still find the 76% estimate to be too high. Buss et al. ^{10} have indicated that their results rely on a certain assumption about the dynamics of seroreversion and that their sample is not a random sample but, rather, a sample of blood donors which they argue to be representative of the population of Manaus. Notably, without adjusting for seroreversion, Buss et al. ^{10} report an estimated prevalence, adjusted for specificity and sensitivity, of 26% (95%CI: 21% to 31%) which is close to what is reported in this study. The disparity appears after adjustment for seroconversion. Whether this adjustment is reliable is open for research it certainly suggests that the estimates around 2030% are too conservative. Similar arguments may be applied to the case of the state of São Paulo. Buss et al. ^{10} report an estimated prevalence of 29% (95%CI: 26% to 37%) adjusting for seroconversion and 14% (95%CI: 11% to 17%) without adjustment, for the capital city which accounts for slightly less than 30% of the state's population. Our state estimate for the state of São Paulo is 11% (95%CI: 7.8% to 13%). We conclude that our estimates, or any estimate not adjusted for seroconversion, must be considered conservative.
Since the prevalence (or a lower bound for it) has been estimated, it is natural to ask about herd (or collective) immunity, the level at which contagion becomes under control (herd immunity threshold  HIT). Despite lacking a precise definition, the concept of herd immunity is inevitable in discussions about COVID19 and infectious diseases in general.
The possibility that collective immunity is not applicable to SARSCoV2 could be based on the epidemiology of the coronavirus HCoVNL63 ^{16} for which long term individual immunization is not achieved and reinfection is common. However, our understanding of the epidemiological literature ^{17}and of expert opinions available in the media is that collective immunity is applicable to COVID19. Nonetheless, the actual value to be targeted has been a topic for discussion.
The standard formula for the threshold is , where R_{0 } is the number of persons infected, on average, by a given individual carrying the virus. It assumes uniform susceptibility and it results from an idealized scenario. For COVID19 it has been argued that, approximately, R_{0} = 3, and thus HIT = 67%. Such calculations have been questioned in the epidemiological literature and, regarding SARSCoV2, Aguas et al. ^{18} have argued that random immunization is far from reality and considerations about nonuniform susceptibility considering the individual coefficient of variation (CV) would lead to a more realistic formula: in an epidemic that takes its natural course, by contrast, the virus very specifically infects the people that are most susceptible first. This removes all of the strongest vectors early on, and continues to selectively remove the vectors until the herd immunity threshold is reached. The new parameter, CV, plays an opposite role than that of the reproduction number. The larger the CV the lower the HIT. It is proposed by Aguas et al. ^{18} that . Just like R_{0}, the CV needs to be estimated. Setting aside (relevant) discussions about the validity of proposed values for de R_{0} and CV for SARSCoV2, if we consider R_{0} ≈ 2.5 and recent estimates of the CV for SARSCoV1 for Singapore and Beijing (China), that is CV ≈ 2.6, we obtain HIT = 11%. For comparison, CV values for malaria in the Amazon and tuberculosis in Brazil are 1.8 and 3.3 respectively. First versions of Aguas et al. ^{18} concluded that HIT values could be much lower than 50% in many cases but this seems to be valid without mutations and other changes in the dynamics of the epidemic as can be observed throughout the world with cases still on the rise in places where infection has passed the 10%. Nevertheless, the work of Aguas et al. provides an important discussion in the ongoing research on HIT.
Another source of uncertainty must be addressed before answering the herd immunity question. The question of preexisting immunity ^{19}. An aspect much harder to be measured. We thus expect more research to be conducted along these lines, not only regarding the best expression for HIT but also on best estimates for R_{0} and CV and more insights into preexisting immunity.
If we assume that the HIT for SARSCoV2 is in fact less than 67% (due to the effect of CV and preexisting immunity), say 50%, then, considering the estimated prevalence of 8% to 28%
Acknowledgments
We thank the revisors for their insightful comments and careful revisions. All remaining errors and imprecisions are the responsibility of the authors.
References
This is an openaccess article distributed under the terms of the Creative Commons Attribution License
Cadernos de Saúde Pública  Reports in Public Health
Rua Leopoldo Bulhões 1480  Rio de Janeiro RJ 21041210 Brasil
Secretaria Editorial +55 21 25982511.
cadernos@fiocruz.br