As an Artificial Intelligence technology company, we feel it’s our duty to provide datasets, AI tools, and predictive analytics that are relevant to the current COVID-19 crisis. Our aims are to catalyze other AI and modeling developments, help researchers analyze the large body of available research, and share visualizations with the general public. If the public is equipped with accurate information about the spread of COVID-19, we are convinced that many will take action to prevent its spread.
We provide a dataset that has been reduced and condensed from its raw form so that it is amenable to modeling and streaming for exploratory purposes. Specifically, this dataset contains a reduction of global and US historical COVID-19 data [1] from Johns Hopkins University [2] as well as “reasonable” future (rolling 14-day forward) projections of expected total confirmed cases and deaths from COVID-19. These projections follow standard SIR trends with assumptions about the percent of the population that is deemed susceptible. Work to improve upon the existing model is ongoing, and we plan to add a number of new datasets related to lockdown severity and other external interventions (e.g., social distancing and public place closures) to what would otherwise be a closed system of Brownian motion particles (see [3]).
Our model is constantly evolving, but its current version borrows principles from the common compartmental model used in epidemiology research known as the SIR model (susceptible, infected, and removed). Since infection rate is a time-dependent signal, we model total infection per region (as available through the Johns Hopkins University COVID-19 confirmed cases data) through a general linear model fit where the susceptible population is a function of the near-term infection rate and population density. All predictions depend entirely on the underlying model parameters. For example, Ferguson et al. predict that 81% of the population is susceptible based on models that suggest 2.1e6 deaths in the US [4]. Furthermore, the infection fatality ratio (IFR) is largely unknown since the total number of actual infections remains unknown. Reported confirmed cases likely underestimate infections by a factor of 15 [5].
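For readers unfamiliar with compartmental models, the textbook SIR update can be sketched in a few lines of Python. This is a generic illustration and not the fitted model described here: the function name simulate_sir, the one-day Euler step, and the beta and gamma values are all assumptions chosen for the example.

```python
def simulate_sir(population, initial_infected, beta, gamma, days):
    """Textbook SIR model advanced with a one-day Euler step.

    beta  : transmission rate per day (illustrative value, an assumption)
    gamma : removal rate per day (illustrative value, an assumption)
    Returns a list of (susceptible, infected, removed) tuples, one per day.
    """
    s = float(population - initial_infected)
    i = float(initial_infected)
    r = 0.0
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / population  # S -> I
        new_removals = gamma * i                    # I -> R
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        history.append((s, i, r))
    return history


# Hypothetical region of 100,000 people with 10 initial infections.
run = simulate_sir(100_000, 10, beta=0.3, gamma=0.1, days=120)
peak_day = max(range(len(run)), key=lambda d: run[d][1])
print("infections peak on day", peak_day)
```

Lowering beta in this sketch, which is one way to mimic an intervention such as social distancing, delays and lowers the infection peak.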
We built our model with the assumption that current social distancing and other lockdown measures will have a significant impact on the near-term spread of COVID-19. Our projections use time series models to predict the signal of the underlying SIR parameters (beta, gamma, and tau [6]), which ultimately determine the expected total susceptible population. Specifically, the infection curve is fit to a quasibinomial curve with a logit link function, which essentially serves as a cumulative distribution function of the total infections over time (Figure 3.1). The so-called “flattening the curve” initiative seeks to increase the width of the probability mass function of daily new infections over time, and the cumulative sum of this probability mass function is the cumulative distribution function to which our model is fit.
Figure 3.1: Example model fit for confirmed cases in the United States with confidence intervals.
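The link between the cumulative curve and “flattening the curve” can be illustrated with a plain logistic function, which is the inverse of the logit link. The sketch below does not reproduce the quasibinomial fit itself; the function names logistic_cumulative and daily_new and the values of total, midpoint, and width are hypothetical.

```python
import math


def logistic_cumulative(day, total, midpoint, width):
    """Cumulative infections by `day` under a logistic (inverse-logit) curve."""
    return total / (1.0 + math.exp(-(day - midpoint) / width))


def daily_new(total, midpoint, width, days):
    """Daily new infections: first difference of the cumulative curve."""
    cumulative = [
        logistic_cumulative(d, total, midpoint, width) for d in range(days + 1)
    ]
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


# Same asymptotic total (a hypothetical 10,000 cases), two different widths.
steep = daily_new(total=10_000, midpoint=60, width=5, days=120)
flat = daily_new(total=10_000, midpoint=60, width=12, days=120)

# The wider curve spreads cases over more days, which lowers the daily peak.
print(round(max(steep)), round(max(flat)))
```

Summing the daily values recovers the change in the cumulative curve over the same window, which is the relationship between the two curves described above.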
Common data modeling for analytics (e.g., for dashboard renderings) is most efficient when the underlying model conforms to a “star-schema”, which uses a relational model between dimension and fact tables. We’ve curated the dataset so that it has a plug-and-play ability for dashboard renderings. There are four tables, which have the following relationships:
1a.region >> relationship (1:many) >> 2a.region
1a.region >> relationship (1:many) >> 2b.region
1b.dates >> relationship (1:many) >> 2a.date
1b.dates >> relationship (1:many) >> 2b.dates
Where each table is represented by its enumeration (1a, 1b, 2a, and 2b), the column name is preceded by the table name (i.e., name[dot]column; e.g., 1a.region refers to the “region” column in table 1a), and relationship (1:many) refers to the relationship between the two tables (in this case, all relationships are one-to-many).
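As a rough sketch of how the one-to-many relationships above can be used, the following Python joins fact rows to the two dimension tables with plain dictionaries. The helper names index_by and join_fact_to_dimensions are hypothetical, and the in-memory rows are placeholders for illustration; real rows would come from the four CSV files (for example via csv.DictReader).

```python
def index_by(rows, key):
    """Build a lookup for a dimension table: key value -> row (the 'one' side)."""
    return {row[key]: row for row in rows}


def join_fact_to_dimensions(facts, regions, dates, fact_date_column):
    """Attach region and date attributes to each fact row (the 'many' side).

    fact_date_column names the date column in the fact table; the sample
    rows in this document show "date" for the historical data and "dates"
    for the projections.
    """
    region_lookup = index_by(regions, "region")
    date_lookup = index_by(dates, "dates")
    joined = []
    for fact in facts:
        region = region_lookup.get(fact["region"], {})
        day = date_lookup.get(fact[fact_date_column], {})
        joined.append({**region, **day, **fact})
    return joined


# Placeholder rows for illustration only (not taken from the CSV files).
regions = [{"region": "1_14", "loc1": "Australia", "loc2": "Tasmania"}]
dates = [{"dates": "2020-01-22", "ind": "1"}, {"dates": "2020-01-23", "ind": "2"}]
history = [
    {"region": "1_14", "date": "2020-01-22", "confirmed": "0", "deaths": "0"},
    {"region": "1_14", "date": "2020-01-23", "confirmed": "0", "deaths": "0"},
]

for row in join_fact_to_dimensions(history, regions, dates, "date"):
    print(row["loc1"], row["loc2"], row["ind"], row["confirmed"])
```

Because each dimension key appears once, a dictionary lookup is enough; fact rows whose key is missing from a dimension simply keep their own columns.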
A random selection from synthetaic_covid19_dataset_dimension_table.csv yields the following rows:
    region  loc1             loc2        Population        lat        long  dataset
 1  2_1900  North Carolina   Ashe           27109.0   36.43296   -81.49863  US
 2  1_14    Australia        Tasmania      509965.0  -41.45450   145.97070  global
 3  2_620   Illinois         De Witt        16226.0   40.17515   -88.90960  US
 4  2_443   Georgia          Effingham      62190.0   32.36616   -81.34281  US
 5  2_3235  North Carolina   Unassigned    205585.4    0.00000     0.00000  US
 6  1_175   North Macedonia  <NA>         2075301.0   41.60860    21.74530  global
 7  2_1469  Mississippi      Sharkey         4552.0   32.88149   -90.81595  US
 8  2_1246  Michigan         Benzie         17753.0   44.63899   -86.01608  US
 9  2_902   Kansas           Chase           2629.0   38.30293   -96.59564  US
10  2_2219  Oregon           Coos           64389.0   43.17407  -124.05945  US
Where region is the unique identifier of the location, beginning with 1_ (global) or 2_ (US); loc1 is the higher-level location (state for US and country for global); loc2 is the next level of location detail (county for US and state/region for global); Population is the population of the specified region; lat and long are the latitude and longitude coordinates of the region; and dataset is either US or global, indicating the respective dataset from JHU.
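The prefix convention for region can be checked programmatically. The following is a minimal sketch; the function name dataset_from_region is hypothetical and not part of the dataset.

```python
def dataset_from_region(region_id):
    """Return 'global' or 'US' based on the region identifier prefix."""
    prefix, _, suffix = region_id.partition("_")
    if not suffix:
        raise ValueError(f"not a region identifier: {region_id!r}")
    if prefix == "1":
        return "global"
    if prefix == "2":
        return "US"
    raise ValueError(f"unknown prefix in region identifier: {region_id!r}")


# Identifiers taken from the sample rows above.
print(dataset_from_region("2_1900"))
print(dataset_from_region("1_14"))
```

Comparing the result against the dataset column is a cheap consistency check when loading the dimension table.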
The head of synthetaic_covid19_date_dimension_table.csv yields the following:
   dates       ind
1  2020-01-22    1
2  2020-01-23    2
3  2020-01-24    3
4  2020-01-25    4
5  2020-01-26    5
6  2020-01-27    6
Where dates are the days for which historical data exists and projections are made and ind is the day index where day 1 is 2020-01-22.
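Since ind is a simple day count from 2020-01-22, the two columns can be converted into each other without the lookup table. A small sketch follows; the names FIRST_DAY, index_to_date, and date_to_index are hypothetical helpers.

```python
from datetime import date, timedelta

FIRST_DAY = date(2020, 1, 22)  # day index 1, per the date dimension table


def index_to_date(ind):
    """Convert a 1-based day index to a calendar date."""
    return FIRST_DAY + timedelta(days=ind - 1)


def date_to_index(day):
    """Convert a calendar date to its 1-based day index."""
    return (day - FIRST_DAY).days + 1


print(index_to_date(6))                  # 2020-01-27, matching the table above
print(date_to_index(date(2020, 1, 25)))  # 4
```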
A random selection from synthetaic_covid19_historical_JHU_data.csv yields the following rows:
    region  date        confirmed  deaths  log.confirmed  log.deaths
 1  2_1590  2020-03-30          1       0      -5.247353   -5.940500
 2  2_1125  2020-03-25          1       1      -6.071387   -6.071387
 3  2_1986  2020-03-26          2       0      -6.833884   -7.932497
 4  2_1385  2020-03-25          2       0      -7.234102   -8.332714
 5  1_32    2020-03-25        146       4      -9.007055  -12.388049
 6  2_3192  2020-03-27        141       0      -7.991585  -12.947412
 7  1_114   2020-03-18          3       0      -5.030540   -6.416834
 8  1_36    2020-03-31        690       8      -5.964800  -10.305716
 9  2_2877  2020-03-29          2       0      -6.222039   -7.320651
10  2_210   2020-01-26          1       0     -12.666057  -13.359204
Where region is the unique identifier corresponding to the dimension table, date is the date to which the row pertains, confirmed is the recorded number of confirmed cases of COVID-19, and deaths is the recorded number of deaths resulting from COVID-19. Both log.confirmed and log.deaths are for internal use at Synthetaic and are of no real use to the outside community.
A random selection from synthetaic_covid19_projections.csv yields the following rows:
region  ind  fit.confirmed  upr.confirmed  lwr.confirmed  fit.diff.confirmed  dates       fit.deaths  upr.deaths  lwr.deaths  fit.diff.deaths
1_237    78              0              0              0           0.0000000  2020-04-08           0           0           0        0.0000000
2_236    90            334            569            195          59.3878184  2020-04-20          22          51          10        2.5488131
1_207    88         36,869         40,490         33,562       1,535.7202749  2020-04-18       2,439       2,727       2,182      220.0696295
2_310    79             17             22             13           1.3815989  2020-04-09           0           0           0        0.0000000
1_32     85            587            633            545          30.7181017  2020-04-15          25          29          21        0.6987557
2_2460   78             16             20             14           1.8106904  2020-04-08           0           0           0        0.0000000
2_965    88             24             53             11           3.1042233  2020-04-18           0           0           0        0.0000000
1_119    88              0              0              0           0.0000000  2020-04-18           0           0           0        0.0000000
1_54     87            141            143            139           0.1082622  2020-04-17           0           0           0        0.0000000
Where region is the unique identifier corresponding to the dimension table, ind is the day index corresponding to the date dimension table, fit.confirmed and fit.deaths are the projections of total confirmed cases and total deaths at the respective date in the row, upr.* and lwr.* are the upper and lower confidence bounds of the respective prediction, fit.diff.* is the first-order time difference of the respective prediction, and dates is the date to which the prediction pertains. All negative numbers should be treated as 0.0.
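When reading the projection columns, the rule about negative numbers can be applied at parse time. The sketch below uses a hypothetical helper, parse_projection_value, and assumes that the thousands separators seen in the sample rows (e.g., 36,869) may also appear in the file; if the file stores plain numbers, the replace call is simply a no-op.

```python
def parse_projection_value(text):
    """Parse one projection field such as '36,869' or '1,535.7202749'.

    Thousands separators are removed (an assumption about the file format),
    and negative values are clamped to 0.0 as the dataset notes require.
    """
    value = float(text.replace(",", ""))
    return max(value, 0.0)


print(parse_projection_value("36,869"))
print(parse_projection_value("-3.2"))
```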
[1] https://github.com/CSSEGISandData/COVID-19
[2] Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infect Dis; published online Feb 19. https://doi.org/10.1016/S1473-3099(20)30120-1.
[3] https://en.wikipedia.org/wiki/Brownian_motion
[4] Ferguson, N.M., et al., 2020. Impact of non-pharmaceutical interventions (NPIs) to reduce COVID-19 mortality and healthcare demand 20.
[5] Roques, L., Klein, E., Papaix, J., Soubeyrand, S., 2020. Mechanistic-statistical SIR modelling for early estimation of the actual number of cases and mortality rate from COVID-19 (preprint). Epidemiology. https://doi.org/10.1101/2020.03.22.20040915
[6] https://code-for-philly.gitbook.io/chime/what-is-chime/sir-modeling
With the release of the Open Research Dataset, a mass of 44,000 coronavirus-related research papers and articles, Synthetaic implores members of the AI community to mobilize and apply innovative AI techniques to generate novel insights in the battle against COVID-19.
synthetaic_covid19_dataset_dimension_table.csv
synthetaic_covid19_date_dimension_table.csv
synthetaic_covid19_historical_jhu_data.csv
synthetaic_covid19_projections.csv