COVID-19/Adjusted Number of Cases

This learning resource explains how to process the number of reported cases so that they provide an estimate of the real number of cases in the total population. The adjusted number of cases $$a_n$$ at day $$n$$ is important for forecasting and mathematical modelling of COVID-19.

History of Learning Resource
Julian Mendez originally posted this idea here and it is archived here. For an updated reformulation, see below.

Adjusted number of cases
We start with a reference day of data collection with the index $$t=0$$ (e.g. January 6, 2020 is $$t=0$$, January 7, 2020 is $$t=1$$, ...). We use the variable $$t$$ because it is a time index.
 * $$t$$ number of days after reference day of data collection.
 * $$b_{t}$$ number of tests performed at day $$t$$, defines the baseline ($$b$$) for positive tested patients.
 * $$c_{t}$$ number of COVID-19 positive tests for day $$t$$, the day on which the samples for the tests were collected. Please keep in mind that the laboratory needs time to process the collected samples, so do not assign positive COVID-19 tests to the day on which they are officially reported. Assign them to the day on which the samples were taken from the patients. The variable $$c$$ is used because it is the first letter of COVID-19.
 * The fraction $$\frac{c_{t}}{b_{t}}$$ indicates the share of positive tests among all tests, e.g. $$b_t=1000$$ tests with $$c_t=50$$ positive COVID-19 tests give the fraction $$\frac{c_{t}}{b_{t}}=0.05$$ (i.e. 5% of the tests are positive).
 * Patients who show only minor symptoms and do not show up in the health system for testing create a bias in the fraction $$\frac{c_{t}}{b_{t}}=0.05$$. The same holds for asymptomatic patients, who show no symptoms and spread the disease without knowing that they are COVID-19 positive. The fraction would have to be adjusted for both groups.
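The counting rule above (assign each test to the day the sample was taken, not the day the result was reported) can be sketched in a few lines of Python. This is only a minimal illustration: the record layout and all test records are hypothetical and not taken from any real data set.

```python
from collections import Counter
from datetime import date

REFERENCE_DAY = date(2020, 1, 6)  # t = 0, as in the example above

# Hypothetical records: (day the sample was taken, day the result was reported, positive?)
records = [
    (date(2020, 1, 6), date(2020, 1, 8), True),
    (date(2020, 1, 6), date(2020, 1, 8), False),
    (date(2020, 1, 7), date(2020, 1, 8), False),
    (date(2020, 1, 7), date(2020, 1, 9), True),
    (date(2020, 1, 7), date(2020, 1, 9), False),
    (date(2020, 1, 7), date(2020, 1, 10), False),
]

tests = Counter()      # b_t: number of tests per sampling day
positives = Counter()  # c_t: number of positive tests per sampling day

for sampled_on, _reported_on, is_positive in records:
    t = (sampled_on - REFERENCE_DAY).days  # index by sampling day, not by reporting day
    tests[t] += 1
    if is_positive:
        positives[t] += 1

# Fraction c_t / b_t for every day on which samples were taken
fractions = {t: positives[t] / tests[t] for t in sorted(tests)}
print(fractions)
```

The reporting date is read but deliberately ignored, so a result that arrives three days late still counts for the day of sampling.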

Susceptible, Infected and Recovered (SIR)
The spread of the virus depends on the immune status of the population. Consider two completely different situations for the public health status of a population with $$n_t$$ citizens at day $$t$$:
 * (Situation 1: vulnerable population) There is only one infected citizen ($$I_t=1$$) at time index $$t$$ among the population and all the other citizens are susceptible to COVID-19 (i.e. $$S_t=n_t - 1$$). Hence there are no citizens who have recovered from the COVID-19 disease (i.e. $$R_t=0$$). Therefore an epidemiologically extremely vulnerable community is exposed to a single infected citizen among the population (patient zero). The spread of the disease will show exponential growth.
 * (Situation 2: protected population) There is again only one infected citizen ($$I_t=1$$) among the population, but this time all the other citizens have recovered from the COVID-19 infection (i.e. $$R_t=n_t - 1$$). Hence there are no susceptible citizens among the population who can be infected by the single COVID-19 patient (i.e. $$S_t=0$$). The single infected patient cannot infect anybody in the population, because the recovered patients are immune against a COVID-19 infection.

Lesson learnt: Having a higher percentage of recovered (immune) patients among the population slows down the spreading of the disease in the population.
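The two situations can be compared with a minimal discrete-time SIR sketch. The transmission rate `beta`, the recovery rate `gamma`, the population size and the simulated time span are hypothetical values chosen only for illustration; they are not estimates for COVID-19.

```python
def simulate_sir(susceptible, infected, recovered, beta=0.3, gamma=0.1, days=30):
    """Discrete-time SIR model with daily steps.

    beta and gamma are assumed illustration values, not fitted to real data.
    Returns a list of (S, I, R) tuples, one entry per day including day 0.
    """
    n = susceptible + infected + recovered
    s, i, r = float(susceptible), float(infected), float(recovered)
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / n  # contacts between susceptible and infected
        new_recoveries = gamma * i         # fixed share of the infected recovers each day
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history


n_t = 10_000  # hypothetical population size

# Situation 1: one infected citizen, everybody else susceptible
vulnerable = simulate_sir(susceptible=n_t - 1, infected=1, recovered=0)
# Situation 2: one infected citizen, everybody else recovered (immune)
protected = simulate_sir(susceptible=0, infected=1, recovered=n_t - 1)

peak_vulnerable = max(i for _, i, _ in vulnerable)
peak_protected = max(i for _, i, _ in protected)
print(f"peak infected, vulnerable population: {peak_vulnerable:.1f}")
print(f"peak infected, protected population:  {peak_protected:.1f}")
```

In the protected population the term `beta * s * i / n` is zero from the first day on, so the number of infected can only decline, while in the vulnerable population it grows from day to day within the simulated period.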

Learning Task

 * Explain why the people in the compartment "recovered" (R) may return to "susceptible" (S) after a period of time. Do you know a disease for which you keep your immune status for a lifetime?

Vaccination and protected Population
Vaccination moves citizens from the vulnerable status "susceptible" (S) into the recovered status (R), because the vaccination "emulates" an infection for the immune system and allows the immune system to produce antibodies against the disease. COVID-19 was caused by a virus that was new in 2019, and therefore vaccination of the population was not possible before the outbreak. The COVID-19 disease could cause a critical status of the patient, so that she/he had to be treated in an Intensive Care Unit (ICU). A protected population would have been ideal, but was not possible because a totally vulnerable society was exposed to the new virus.

Lesson learnt: Because vaccination was not possible for COVID-19, the only option to protect the health system was to make sure that the number of cases increased slowly, so that the health system could still provide health service delivery for patients. Keep in mind that the health system has other patients in the ICU and the capacity might not be sufficient for a huge number of COVID-19 cases. Therefore staying at home and reducing the number of physical contacts among the population were used to slow down the epidemiological spread of COVID-19.

Estimation of aggregated COVID-19 Infection among Population

 * Now we consider once again $$b_{t}$$ as the number of tests performed at day $$t$$. The people who are tested are not randomly selected from the population of size $$N$$ (e.g. $$N=100\ million$$ people), so that $$\frac{c_{t}}{b_{t}}\cdot N=0.05\cdot 100,000,000=5,000,000$$ might be a wrong estimate for the number of infected people among the population. There might be a bias, especially when only patients with symptoms are tested.
 * Alternatively, a random test sample of $$\widehat{b}_t$$ tests at time $$t$$ can be drawn from the population. The tested people are selected without consideration of any symptoms, and the selection should be representative of the total population. This study will also detect a number of $$\widehat{c}_t$$ COVID-19 positive tests. The ratio $$\frac{\widehat{c}_t}{\widehat{b}_t}$$ is an estimate for the fraction of infected people among the population. It will probably differ from the calculated ratio $$\frac{c_{t}}{b_{t}}$$ of the regular tests at day $$t$$, e.g.
 * $$\frac{\widehat{c_t}}{\widehat{b_t}}=0.02 \not= 0.05 = \frac{c_t}{b_t} $$
 * This leads to a better estimate for the total number of infected people among a population of size $$N$$ (e.g. $$N=100\ million$$ people), provided that the number $$\widehat{b_t}$$ of tests for the randomly selected people is high (see Borel's law of large numbers).
 * $$\frac{\widehat{c_t}}{\widehat{b_t}}\cdot N =0.02 \cdot 100,000,000 = 2,000,000 $$
 * Testing capacity is limited, so random selection of samples is costly, and therefore this test design might be applied just for calibrating the model. A COVID-19 test is a limited resource, and tests are mostly applied if the patient shows symptoms of COVID-19 or if the immune status must be clarified because someone (e.g. a member of the medical staff) might be a risk for the environment in which she/he is working or living. So the estimation of the total number of people that show an immune response or exposure to the COVID-19 virus must be based on $$b_{t}$$ and $$c_{t}$$ of the tested patients, i.e. on the fraction $$\frac{c_t}{b_t}$$. If a control test was performed only once, at day $$t_0$$, a constant correction factor for the number of people in the population that show an immune response or exposure to the COVID-19 virus can be calculated by:
 * $$e_{t_0} := \frac{\widehat{c_{t_0}}}{\widehat{b_{t_0}}} \cdot \frac{b_{t_0}}{c_{t_0}} = \frac{0.02}{0.05} = \frac{2}{5} $$


 * With the error correction value $$e_{t_0}$$ (e.g. $$e_{t_0} = \frac{2}{5} $$) the total number of people that show an immune response or exposure to the COVID-19 virus can be estimated by
 * $$\frac{c_{t}}{b_{t}}\cdot e_{t_0} \cdot N=0.05\cdot \frac{2}{5} \cdot 100,000,000=2,000,000$$
 * Please keep in mind that the error correction value $$e_{t_0}$$ might not be updated as often as new cases are reported.
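The calibration steps above can be written as a minimal Python sketch. The function names are chosen for this illustration only. The random control sample is assumed to consist of 1000 tests with 20 positive results, which is a hypothetical choice that reproduces the ratio 0.02 from the example; the other values are the ones used in the text.

```python
def naive_estimate(c_t, b_t, population):
    """Uncorrected estimate c_t / b_t * N based on the regular (non-random) tests."""
    return c_t / b_t * population


def correction_factor(c_hat, b_hat, c, b):
    """Error correction value e_{t_0} = (c_hat / b_hat) * (b / c) at the calibration day."""
    return (c_hat / b_hat) * (b / c)


def corrected_estimate(c_t, b_t, e_t0, population):
    """Corrected estimate c_t / b_t * e_{t_0} * N."""
    return c_t / b_t * e_t0 * population


N = 100_000_000  # population size from the example

# Calibration day t_0: regular tests (50 of 1000 positive) and an assumed
# random control sample (20 of 1000 positive, i.e. a ratio of 0.02).
e_t0 = correction_factor(c_hat=20, b_hat=1000, c=50, b=1000)

naive = naive_estimate(c_t=50, b_t=1000, population=N)
corrected = corrected_estimate(c_t=50, b_t=1000, e_t0=e_t0, population=N)

print(f"e_t0      = {e_t0:.2f}")
print(f"naive     = {naive:,.0f}")
print(f"corrected = {corrected:,.0f}")
```

On later days only `c_t` and `b_t` change, while `e_t0` stays fixed until a new random control sample is available.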


 * The daily growth rate $$d_{t}$$ for the value of new cases is defined as $$d_{t} = \frac{a_{t}}{a_{t-1}} - 1 $$. E.g. if you have $$a_{t} = 3000 $$ at day $$ t $$ and $$a_{t-1} = 2000 $$ at the day before with index $$ t-1 $$, then
 * $$d_{t} = \frac{a_{t}}{a_{t-1}} - 1 = \frac{3000}{2000} - 1 = 0.5 $$.
 * This means that we have an increase of 50% in the number of new cases for the day $$t$$. The daily growth rate $$d_{t}$$ for the value of new cases can also be negative, e.g. if you have $$a_{t} = 3000 $$ at day $$ t $$ and $$a_{t-1} = 4000 $$ at the day before with index $$ t-1 $$, then
 * $$d_{t} = \frac{a_{t}}{a_{t-1}} - 1 = \frac{3000}{4000} - 1 = -\frac{1}{4} = -0.25 $$.
 * This means that the number of new cases for the day $$t$$ decreases by 25%.
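A minimal Python sketch of the daily growth rate, computed as the relative change $$\frac{a_{t}}{a_{t-1}} - 1$$ so that a positive value means an increase and a negative value a decrease, in line with the 50% and -25% of the two examples. The function names and the short series at the end are hypothetical.

```python
def daily_growth_rate(a_today, a_yesterday):
    """Relative change of new cases from the day before to today."""
    if a_yesterday <= 0:
        raise ValueError("a_yesterday must be positive to compute a growth rate")
    return a_today / a_yesterday - 1


def growth_rates(series):
    """Growth rate for every day of the series except the first one."""
    return [daily_growth_rate(cur, prev) for prev, cur in zip(series, series[1:])]


increase = daily_growth_rate(3000, 2000)  # first example from the text
decrease = daily_growth_rate(3000, 4000)  # second example from the text
print(increase, decrease)

# Hypothetical series of daily values a_0, a_1, a_2, a_3
rates = growth_rates([2000, 3000, 3000, 1500])
print(rates)
```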

Logistical Growth and SIR Model
If we assume that logistic growth can be applied to the COVID-19 disease, there is a point in time at which the number of newly detected cases does not increase anymore. This point in time can be estimated by $$d_t \approx 0$$ and corresponds to the point $$S$$ in the following graph. If the SIR model is applied for the epidemiological modelling, the logistic growth curve is, with a delay in time, similar to the green curve of the recovered.
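The link between logistic growth and $$d_t \approx 0$$ can be illustrated with a minimal Python sketch. It assumes that the cumulative number of cases follows a logistic curve and that $$a_t$$ is the number of new cases at day $$t$$. The parameters `capacity`, `rate` and `midpoint` are hypothetical and not fitted to any data.

```python
import math


def logistic(t, capacity=10_000, rate=0.2, midpoint=50):
    """Cumulative number of cases at day t under an assumed logistic growth."""
    return capacity / (1 + math.exp(-rate * (t - midpoint)))


# New cases at day t: difference of the cumulative curve between day t-1 and day t
new_cases = {t: logistic(t) - logistic(t - 1) for t in range(1, 101)}

# Daily growth rate of the new cases, as relative change to the day before
growth = {t: new_cases[t] / new_cases[t - 1] - 1 for t in range(2, 101)}

# Day on which the growth rate is closest to zero
turning_day = min(growth, key=lambda t: abs(growth[t]))
print("new cases stop increasing around day", turning_day)
```

Before `turning_day` the growth rate is positive, afterwards it is negative, so the number of new cases reaches its maximum around the midpoint of the logistic curve.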

Each member of the population typically progresses from susceptible to infectious to recovered. This can be shown as a flow chart in which the boxes represent the different compartments and the arrows the transitions between compartments. An arrow from recovered (R) back to susceptible (S) might be added if the patients lose their immune status after a while. That is similar to patients who must refresh their vaccination after a number of years. For some diseases one infection or one vaccination is sufficient for a lifetime. COVID-19 is a new disease, so it was difficult to estimate in 2020 how well the immune system would be prepared for a new exposure to the coronavirus after the patient had recovered.

Learning Task

 * Identify a disease that needs just one vaccination for a lifetime and identify a disease that needs a new vaccination for the immune system after a number of years.
 * Explain why an arrow from recovered (R) to susceptible (S) might be added to the flow chart of the SIR model if scientific evidence becomes available for this model extension.

Testing
There are different tests for a viral disease:
 * Polymerase Chain Reaction (PCR): PCR is a method in molecular biology for making millions of copies of a specific segment of the genetic material of the virus. If the replication fails, the test provides the result that the sample did not contain the specific segment targeted by the test. Please keep in mind that the test does not address the complete genetic material of the virus. Therefore fragmented genetic material of the virus that is not capable of programming cells for the production of new viruses might lead to a positive test (false positive). The PCR test is used to detect patients who might infect other people. So a PCR test provides information about the red curve of infected people in the population.
 * Antibodies: a test for antibodies against COVID-19 shows whether the immune system was exposed to the COVID-19 virus and responded to the exposure by creating antibodies. This test provides information about the green curve of the recovered. Please keep in mind that the immune system needs time to respond to the exposure to a new virus. Therefore the antibody test might be negative although the patient is already infected and able to infect others.

Julian Mendez Contribution
I would like to share an idea to have a more precise understanding of the number of cases of COVID-19. A more precise current number of cases could be approximated by computing the square of cases today divided by the cases one week ago. My suggestion is to use the growth of the previous days. Let us imagine that a region had the following cases:

We can approximate the future growth by the past, and say that an adjusted number of cases can be approximated with:

$$a_{0} \cdot \frac{a_{0}}{a_{1}} \cdot \frac{a_{1}}{a_{2}} \cdot \ldots \cdot \frac{a_{n-2}}{a_{n-1}} \cdot \frac{a_{n-1}}{a_{n}} $$

which is the same as $$\frac{a_{0}^{2}}{a_{n}}$$
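As a small worked illustration with hypothetical numbers (not taken from any real data set), take $$n=3$$ with $$a_0=400$$ cases today and $$a_1=300$$, $$a_2=200$$, $$a_3=100$$ cases on the three previous days. The intermediate values cancel in the product:

$$a_{0} \cdot \frac{a_{0}}{a_{1}} \cdot \frac{a_{1}}{a_{2}} \cdot \frac{a_{2}}{a_{3}} = 400 \cdot \frac{400}{300} \cdot \frac{300}{200} \cdot \frac{200}{100} = \frac{400^{2}}{100} = 1600 $$

Only today's value $$a_0$$ and the oldest value $$a_n$$ influence the result, so a single unusual value on one of these two days changes the adjusted number directly.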

As an example, these are the values for the first 10 countries in the list on 2020-03-14:

These adjusted values may fluctuate with sudden high values, like in the case of Spain.

I hope someone can find this idea useful.

Adjustments due to the delay in the tests
This is a reformulation of the previous section. I originally posted it here. The updated version is here.

We would like to approximate how many true cases are there. Let us assume that:
 * the time between a patient getting infected and the case being reported is always the same
 * the behaviour of people does not significantly change the growth of infected cases

The variables are:
 * $$k$$ is the number of days between a patient getting infected and the case being reported
 * $$r_{i}$$ is the number of reported cases for day $$i$$
 * $$t_{i}$$ is an approximation of the number of true cases for day $$i$$

We would like to find a formula for $$t_{n}$$ that approximates $$r_{n + k}$$.

One possibility is using the previous $$k$$ growth rates. In this case:

$$t_{n} = r_{n} \cdot \frac{r_{n}}{r_{n - 1}} \cdot \frac{r_{n - 1}}{r_{n - 2}} \cdot \ldots \cdot \frac{r_{n - k + 2}}{r_{n - k + 1}} \cdot \frac{r_{n - k + 1}}{r_{n - k}} $$

Hence, $$t_{n} = \frac{r_{n}^{2}}{r_{n-k}}$$
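The closed form can be tried out with a minimal Python sketch. The function name and the series of reported cases are hypothetical. The series doubles every day, so the assumption of unchanged growth holds exactly and $$t_{n}$$ reproduces $$r_{n+k}$$.

```python
def adjusted_cases(reported, k):
    """Return t_n = r_n**2 / r_{n-k} for every day n.

    Days with n < k, or with zero reported cases k days earlier,
    have no defined value and are returned as None.
    """
    adjusted = []
    for n, r_n in enumerate(reported):
        if n < k or reported[n - k] == 0:
            adjusted.append(None)
        else:
            adjusted.append(r_n ** 2 / reported[n - k])
    return adjusted


reported = [10, 20, 40, 80, 160]  # hypothetical reported cases r_0 ... r_4
k = 2                             # assumed delay in days

adjusted = adjusted_cases(reported, k)
print(adjusted)

# With constant growth, t_2 matches the cases reported k days later (r_4)
print(adjusted[2] == reported[2 + k])
```

With real data the growth rate changes from day to day, so $$t_{n}$$ is only an approximation of $$r_{n+k}$$ and can react strongly to single outliers in $$r_{n}$$ or $$r_{n-k}$$.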