# Dataset

## COVID-19 Confirmed Cases Forecast

# 1 Dataset Description

As an Artificial Intelligence technology company, we feel it’s our duty to provide datasets, AI tools, and predictive analytics that are relevant to the current COVID-19 crisis. Our aims are to catalyze other AI and modeling developments, help researchers analyze the large body of available research, and share visualizations with the general public. We are convinced that if the public is equipped with accurate information about the spread of COVID-19, many will take action to prevent it.

We provide a dataset that has been reduced and condensed from its raw form so that it is amenable to modeling and to streaming for exploratory purposes. Specifically, this dataset contains a reduction of global and US historical COVID-19 data [1] from Johns Hopkins University [2], as well as “reasonable” future (rolling 14-day forward) projections of expected total confirmed cases and deaths from COVID-19. These projections follow standard SIR trends with assumptions about the percentage of the population that is deemed susceptible. Work to improve the existing model is ongoing, and we plan to include a number of new datasets related to lockdown severity and other external interventions (e.g., social distancing and public place closures) that act on what would otherwise be a closed system of Brownian motion particles (see [3]).

# 3 Methodology

Our model is constantly evolving, but its current version borrows principles from the SIR model (susceptible, infected, and removed), the common compartmental model used in epidemiology research. Since the infection rate is a time-dependent signal, we model total infections per region (as available through the Johns Hopkins University COVID-19 confirmed cases data) with a generalized linear model fit in which the susceptible population is a function of the near-term infection rate and population density. All predictions depend entirely on the underlying model parameters. For example, Ferguson et al. predict that 81% of the population is susceptible based on models that suggest 2.1e6 deaths in the US [4]. Furthermore, the infection fatality ratio (IFR) is largely unknown because the total number of actual infections remains unknown. Reported confirmed cases likely underestimate infections by a factor of 15 [5].
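For readers unfamiliar with the compartmental model mentioned above, the following is a minimal sketch of the textbook SIR equations stepped forward one day at a time. It is not our production model: the function name `simulate_sir`, the population size, and the `beta` and `gamma` values are illustrative assumptions, and the time-dependent parameter fitting described in this section is not reproduced here.

```python
def simulate_sir(population, initial_infected, beta, gamma, days):
    """Step the textbook SIR equations forward with a 1-day Euler step.

    beta  : transmission rate per day (assumed value, not fitted)
    gamma : removal rate per day (assumed value, not fitted)
    Returns a list of (susceptible, infected, removed) tuples, one per day.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / population  # S -> I flow
        new_removals = gamma * i                    # I -> R flow
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        history.append((s, i, r))
    return history


# Hypothetical region of 100,000 people with 10 initial infections.
history = simulate_sir(population=100_000, initial_infected=10,
                       beta=0.3, gamma=0.1, days=120)
final_s, final_i, final_r = history[-1]
print(f"ever infected after 120 days: {100_000 - final_s:.0f}")
```

The three compartments always sum to the population, which is why assumptions about the susceptible fraction have such a strong effect on the projected totals.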

We built our model on the assumption that current social distancing and other lockdown measures will have a significant impact on the near-term spread of COVID-19. Our projections use time series models to predict the signal of the underlying SIR parameters (beta, gamma, and tau [6]), which ultimately determine the expected total susceptible population. Specifically, the infection curve is fit to a quasibinomial curve with a logit link function, which essentially serves as a cumulative distribution function of the total infections over time (Figure 3.1). The so-called “flattening the curve” initiative seeks to increase the width of the probability mass function of daily new infections over time, and the running sum of this probability mass function is the cumulative distribution function to which our model is fit.
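The relationship between the cumulative curve and the daily curve can be illustrated with a short sketch. Under a logit link, the cumulative fraction of infections follows a logistic function of the day index, and the daily new infections are its first differences. The intercept, slope, and total case count below are hypothetical values chosen for illustration; no model fitting is performed here.

```python
import math


def cumulative_fraction(day, intercept, slope):
    """Inverse logit: fraction of eventual infections reached by `day`."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * day)))


def daily_new_cases(total_cases, intercept, slope, days):
    """First differences of the cumulative curve, i.e. new cases per day."""
    cumulative = [total_cases * cumulative_fraction(d, intercept, slope)
                  for d in range(days + 1)]
    return [cumulative[d] - cumulative[d - 1] for d in range(1, days + 1)]


# Same eventual total, two hypothetical growth rates.
steep = daily_new_cases(10_000, intercept=-9.0, slope=0.30, days=120)
flat = daily_new_cases(10_000, intercept=-9.0, slope=0.15, days=120)

# The flatter curve peaks later and lower, spreading cases over more days.
print("peak day (steep, flat):", steep.index(max(steep)) + 1, flat.index(max(flat)) + 1)
print("peak height lower when flattened:", max(flat) < max(steep))
```

Summing the daily values recovers the cumulative curve, which is the quantity the fit targets.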

# 4 Data Tables

Data modeling for analytics (e.g., for dashboard renderings) is most efficient when the underlying model conforms to a “star schema”, which uses a relational model between dimension and fact tables. We’ve curated the dataset so that it can be plugged directly into dashboard renderings. There are 4 tables, which have the following relationships:

1. Dimension tables:
   a. synthetaic_covid19_dataset_dimension_table.csv
   b. synthetaic_covid19_date_dimension_table.csv
2. Fact tables:
   a. synthetaic_covid19_historical_JHU_data.csv
   b. synthetaic_covid19_projections.csv

Relationships:

- 1a.region >> relationship (1:many) >> 2a.region
- 1a.region >> relationship (1:many) >> 2b.region
- 1b.dates >> relationship (1:many) >> 2a.date
- 1b.dates >> relationship (1:many) >> 2b.dates

Each table is represented by its enumeration (1a, 1b, 2a, and 2b), and each column name is preceded by the table name (i.e., name[dot]column; e.g., 1a.region refers to the “region” column in table 1a). The relationship (1:many) describes how the two tables relate; in this case, all relationships are one-to-many.
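As a rough illustration of how these relationships can be used outside a dashboard tool, the sketch below joins fact rows to both dimension tables using only the Python standard library. The helper names `read_rows` and `join_to_dimensions` are hypothetical, and the code assumes the CSV headers match the column names shown in the samples in Section 5. The fact table's date column name is passed in because the historical sample shows `date` while the projections sample shows `dates`.

```python
import csv


def read_rows(path):
    """Load a CSV file as a list of dicts keyed by column name."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def join_to_dimensions(fact_rows, region_rows, date_rows, fact_date_column):
    """Inner-join fact rows to the region and date dimension tables."""
    regions = {row["region"]: row for row in region_rows}  # 1a keyed by region
    dates = {row["dates"]: row for row in date_rows}       # 1b keyed by dates
    joined = []
    for fact in fact_rows:
        region = regions.get(fact["region"])
        date = dates.get(fact[fact_date_column])
        if region is None or date is None:
            continue  # drop facts without a matching dimension row
        combined = dict(fact)
        combined["loc1"] = region["loc1"]
        combined["loc2"] = region["loc2"]
        combined.setdefault("ind", date["ind"])  # projections already carry ind
        joined.append(combined)
    return joined
```

For the historical table, a call might look like `join_to_dimensions(read_rows("synthetaic_covid19_historical_JHU_data.csv"), read_rows("synthetaic_covid19_dataset_dimension_table.csv"), read_rows("synthetaic_covid19_date_dimension_table.csv"), fact_date_column="date")`, assuming the files sit in the working directory.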

# 5 Column Descriptions

## 5.1 Dimension Table

A random selection from synthetaic_covid19_dataset_dimension_table.csv yields the following rows:

| | region | loc1 | loc2 | Population | lat | long | dataset |
|---|---|---|---|---|---|---|---|
| 1 | 2_1900 | North Carolina | Ashe | 27109.0 | 36.43296 | -81.49863 | US |
| 2 | 1_14 | Australia | Tasmania | 509965.0 | -41.45450 | 145.97070 | global |
| 3 | 2_620 | Illinois | De Witt | 16226.0 | 40.17515 | -88.90960 | US |
| 4 | 2_443 | Georgia | Effingham | 62190.0 | 32.36616 | -81.34281 | US |
| 5 | 2_3235 | North Carolina | Unassigned | 205585.4 | 0.00000 | 0.00000 | US |
| 6 | 1_175 | North Macedonia | NA | 2075301.0 | 41.60860 | 21.74530 | global |
| 7 | 2_1469 | Mississippi | Sharkey | 4552.0 | 32.88149 | -90.81595 | US |
| 8 | 2_1246 | Michigan | Benzie | 17753.0 | 44.63899 | -86.01608 | US |
| 9 | 2_902 | Kansas | Chase | 2629.0 | 38.30293 | -96.59564 | US |
| 10 | 2_2219 | Oregon | Coos | 64389.0 | 43.17407 | -124.05945 | US |

Here, region is the unique identifier of the location, beginning with 1_ (global) or 2_ (US); loc1 is the higher-level location (state for US and country for global); loc2 is the next level of location detail (county for US and state/region for global); Population is the population of the specified region; lat and long are the latitude and longitude coordinates of the region; and dataset is either US or global, indicating the respective dataset from JHU.

## 5.2 Date Table

The head of synthetaic_covid19_date_dimension_table.csv yields the following:

| | dates | ind |
|---|---|---|
| 1 | 2020-01-22 | 1 |
| 2 | 2020-01-23 | 2 |
| 3 | 2020-01-24 | 3 |
| 4 | 2020-01-25 | 4 |
| 5 | 2020-01-26 | 5 |
| 6 | 2020-01-27 | 6 |

Here, dates are the days for which historical data exists or projections are made, and ind is the day index, where day 1 is 2020-01-22.
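Because ind is a simple day count from 2020-01-22, it can be converted to and from a calendar date without loading the date table. The helper names below are hypothetical; the sketch only assumes that the index is contiguous, with no skipped days.

```python
from datetime import date, timedelta

FIRST_DAY = date(2020, 1, 22)  # corresponds to ind == 1


def ind_to_date(ind):
    """Convert a 1-based day index to its calendar date."""
    return FIRST_DAY + timedelta(days=ind - 1)


def date_to_ind(day):
    """Convert a calendar date to its 1-based day index."""
    return (day - FIRST_DAY).days + 1


print(ind_to_date(78))                 # 2020-04-08
print(date_to_ind(date(2020, 4, 20)))  # 90
```

Both results agree with the ind and dates pairs shown in the projections sample in Section 5.4.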

## 5.3 Historical Data

A random selection from synthetaic_covid19_historical_JHU_data.csv yields the following rows:

| | region | date | confirmed | deaths | log.confirmed | log.deaths |
|---|---|---|---|---|---|---|
| 1 | 2_1590 | 2020-03-30 | 1 | 0 | -5.247353 | -5.940500 |
| 2 | 2_1125 | 2020-03-25 | 1 | 1 | -6.071387 | -6.071387 |
| 3 | 2_1986 | 2020-03-26 | 2 | 0 | -6.833884 | -7.932497 |
| 4 | 2_1385 | 2020-03-25 | 2 | 0 | -7.234102 | -8.332714 |
| 5 | 1_32 | 2020-03-25 | 146 | 4 | -9.007055 | -12.388049 |
| 6 | 2_3192 | 2020-03-27 | 141 | 0 | -7.991585 | -12.947412 |
| 7 | 1_114 | 2020-03-18 | 3 | 0 | -5.030540 | -6.416834 |
| 8 | 1_36 | 2020-03-31 | 690 | 8 | -5.964800 | -10.305716 |
| 9 | 2_2877 | 2020-03-29 | 2 | 0 | -6.222039 | -7.320651 |
| 10 | 2_210 | 2020-01-26 | 1 | 0 | -12.666057 | -13.359204 |

Here, region is the unique identifier corresponding to the dimension table, date is the date to which the row pertains, confirmed is the recorded number of confirmed cases of COVID-19, and deaths is the recorded number of deaths resulting from COVID-19. Both log.confirmed and log.deaths are for internal use at Synthetaic and are of no real use to the outside community.

## 5.4 Projections

A random selection from synthetaic_covid19_projections.csv yields the following rows

Table 5.1: Random selection of rows from the model output table.

| region | ind | fit.confirmed | upr.confirmed | lwr.confirmed | fit.diff.confirmed | dates | fit.deaths | upr.deaths | lwr.deaths | fit.diff.deaths |
|---|---|---|---|---|---|---|---|---|---|---|
| 1_237 | 78 | 0 | 0 | 0 | 0.0000000 | 2020-04-08 | 0 | 0 | 0 | 0.0000000 |
| 2_236 | 90 | 334 | 569 | 195 | 59.3878184 | 2020-04-20 | 22 | 51 | 10 | 2.5488131 |
| 1_207 | 88 | 36,869 | 40,490 | 33,562 | 1,535.7202749 | 2020-04-18 | 2,439 | 2,727 | 2,182 | 220.0696295 |
| 2_310 | 79 | 17 | 22 | 13 | 1.3815989 | 2020-04-09 | 0 | 0 | 0 | 0.0000000 |
| 1_32 | 85 | 587 | 633 | 545 | 30.7181017 | 2020-04-15 | 25 | 29 | 21 | 0.6987557 |
| 2_2460 | 78 | 16 | 20 | 14 | 1.8106904 | 2020-04-08 | 0 | 0 | 0 | 0.0000000 |
| 2_965 | 88 | 24 | 53 | 11 | 3.1042233 | 2020-04-18 | 0 | 0 | 0 | 0.0000000 |
| 1_119 | 88 | 0 | 0 | 0 | 0.0000000 | 2020-04-18 | 0 | 0 | 0 | 0.0000000 |
| 1_54 | 87 | 141 | 143 | 139 | 0.1082622 | 2020-04-17 | 0 | 0 | 0 | 0.0000000 |

Here, region is the unique identifier corresponding to the dimension table, ind is the day index corresponding to the date dimension table, fit.confirmed and fit.deaths are the projections of total confirmed cases and total deaths at the date in the row, upr.* and lwr.* are the upper and lower confidence bounds of the respective prediction, fit.diff.* is the first-order time difference of the respective prediction, and dates is the date to which the prediction pertains. All negative numbers should be treated as 0.0.
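The rule about negative numbers can be applied when the projections are loaded. The sketch below clamps every fit.*, upr.*, and lwr.* value in a row to a minimum of 0.0. The helper name `clean_projection` and the example row are hypothetical (the negative lower bound is made up to exercise the rule), and the code assumes the values parse as plain numbers without thousands separators.

```python
def clean_projection(row):
    """Return a copy of a projections row with negative values set to 0.0."""
    cleaned = dict(row)
    for column, value in row.items():
        # Only the model output columns are numeric projections.
        if column.startswith(("fit.", "upr.", "lwr.")):
            cleaned[column] = max(0.0, float(value))
    return cleaned


# Hypothetical row; the negative lower bound is illustrative only.
example = {
    "region": "example_region",
    "ind": 79,
    "fit.confirmed": 17,
    "upr.confirmed": 22,
    "lwr.confirmed": -2.5,
    "fit.diff.confirmed": 1.3815989,
}
print(clean_projection(example)["lwr.confirmed"])  # 0.0
```

Identifier columns such as region, ind, and dates are left untouched.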

[2] Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis; published online Feb 19. https://doi.org/10.1016/S1473-3099(20)30120-1.

[4] Ferguson, N.M., et al., 2020. Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand.

[5] Roques, L., Klein, E., Papaix, J., Soubeyrand, S., 2020. Mechanistic-statistical SIR modelling for early estimation of the actual number of cases and mortality rate from COVID-19 (preprint). Epidemiology. https://doi.org/10.1101/2020.03.22.20040915