COVID-19 Data Analysis & Forecast

This analysis is based on the Coronavirus dataset for South Korea from Kaggle, dated March 5, 2020

This project is divided into multiple sections.

  • Section 1: Patient Analysis

  • Section 2: Contagion Analysis

  • Section 3: Confirmed Cases Forecast

  • Section 4: Accuracy of forecast

  • Section 5: Update on forecast


In [1]:
import numpy as np
import pandas as pd

from IPython.display import Markdown as md
In [2]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objects as go
import plotly.express as px
import cufflinks as cf

init_notebook_mode(connected=True)
cf.go_offline()
In [3]:
import warnings

warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

Section 1: Patient Analysis

In [4]:
patient_df = pd.read_csv('patient.csv')
patient_df
Out[4]:
id sex birth_year country region group infection_reason infection_order infected_by contact_number confirmed_date released_date deceased_date state
0 1 female 1984.0 China filtered at airport NaN visit to Wuhan 1.0 NaN 45.0 2020-01-20 2020-02-06 NaN released
1 2 male 1964.0 Korea filtered at airport NaN visit to Wuhan 1.0 NaN 75.0 2020-01-24 2020-02-05 NaN released
2 3 male 1966.0 Korea capital area NaN visit to Wuhan 1.0 NaN 16.0 2020-01-26 2020-02-12 NaN released
3 4 male 1964.0 Korea capital area NaN visit to Wuhan 1.0 NaN 95.0 2020-01-27 2020-02-09 NaN released
4 5 male 1987.0 Korea capital area NaN visit to Wuhan 1.0 NaN 31.0 2020-01-30 2020-03-02 NaN released
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5761 5762 NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5762 5763 NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5763 5764 NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5764 5765 NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5765 5766 NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated

5766 rows × 14 columns

In [5]:
fig = px.histogram(patient_df, x="infection_reason", title='Infection Reason Distribution')
fig.show()
In [6]:
patient_state_no_null_df = patient_df['state'].dropna()

fig = px.pie(patient_state_no_null_df, names='state', title='Patient State Distribution')
fig.show()

Let's add the age of the patients

In [7]:
age_list = 2020 - patient_df['birth_year']
patient_df.insert(3, "Age", age_list)
patient_df
Out[7]:
id sex birth_year Age country region group infection_reason infection_order infected_by contact_number confirmed_date released_date deceased_date state
0 1 female 1984.0 36.0 China filtered at airport NaN visit to Wuhan 1.0 NaN 45.0 2020-01-20 2020-02-06 NaN released
1 2 male 1964.0 56.0 Korea filtered at airport NaN visit to Wuhan 1.0 NaN 75.0 2020-01-24 2020-02-05 NaN released
2 3 male 1966.0 54.0 Korea capital area NaN visit to Wuhan 1.0 NaN 16.0 2020-01-26 2020-02-12 NaN released
3 4 male 1964.0 56.0 Korea capital area NaN visit to Wuhan 1.0 NaN 95.0 2020-01-27 2020-02-09 NaN released
4 5 male 1987.0 33.0 Korea capital area NaN visit to Wuhan 1.0 NaN 31.0 2020-01-30 2020-03-02 NaN released
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5761 5762 NaN NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5762 5763 NaN NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5763 5764 NaN NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5764 5765 NaN NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5765 5766 NaN NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated

5766 rows × 15 columns

I will break the patient state into dummy variables

In [8]:
age_state_df = patient_df.filter(['Age','state']).dropna()
age_state_df = pd.get_dummies(age_state_df)
age_state_df
Out[8]:
Age state_deceased state_isolated state_released
0 36.0 0 0 1
1 56.0 0 0 1
2 54.0 0 0 1
3 56.0 0 0 1
4 33.0 0 0 1
... ... ... ... ...
5025 47.0 0 1 0
5050 39.0 0 1 0
5142 60.0 1 0 0
5172 46.0 0 1 0
5580 7.0 0 1 0

408 rows × 4 columns

In [9]:
age_list = age_state_df['Age'].unique()
In [10]:
state_deceased_list, state_isolated_list, state_released_list = [],[],[]

for age in age_list:
    state_deceased_list_i = age_state_df[age_state_df['Age']== age]['state_deceased'].sum()
    state_deceased_list.append(state_deceased_list_i)
    
    state_isolated_list_i = age_state_df[age_state_df['Age']== age]['state_isolated'].sum()
    state_isolated_list.append(state_isolated_list_i)
    
    state_released_list_i = age_state_df[age_state_df['Age']== age]['state_released'].sum()
    state_released_list.append(state_released_list_i)
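The three per-age loops above can also be written as a single pandas groupby; a sketch, using a hypothetical miniature version of the `age_state_df` dummy frame:

```python
import pandas as pd

# Hypothetical miniature stand-in for age_state_df built above
age_state_df = pd.DataFrame({
    'Age': [36.0, 36.0, 60.0],
    'state_deceased': [0, 0, 1],
    'state_isolated': [0, 1, 0],
    'state_released': [1, 0, 0],
})

# One groupby-sum replaces the three per-age accumulation loops
state_counts = age_state_df.groupby('Age')[
    ['state_deceased', 'state_isolated', 'state_released']
].sum()
print(state_counts)
```

Each row of `state_counts` then holds the per-state counts for one age, which can be fed to the bar traces directly.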
In [11]:
fig = go.Figure(data=[
    go.Bar(name='Deceased', x=age_list, y=state_deceased_list, marker_color='red'),
    go.Bar(name='Isolated', x=age_list, y=state_isolated_list, marker_color='orange'),
    go.Bar(name='Released', x=age_list, y=state_released_list, marker_color='green'),
])
# Change the bar mode
fig.update_layout(barmode='stack', title='Age-State Distribution', xaxis_title='Patient Age', yaxis_title='Count')
fig.show()

All patient deaths occurred over the age of 36, most likely due to underlying health conditions, as the news suggested
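This observation can be checked directly from the dummy frame by taking the minimum age among deceased patients; a sketch with hypothetical values standing in for `age_state_df`:

```python
import pandas as pd

# Hypothetical stand-in for age_state_df built above
age_state_df = pd.DataFrame({
    'Age': [33.0, 60.0, 75.0],
    'state_deceased': [0, 1, 1],
})

# Minimum age among deceased patients backs the over-36 observation
min_deceased_age = age_state_df.loc[
    age_state_df['state_deceased'] == 1, 'Age'
].min()
print(min_deceased_age)
```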


Section 2: Contagion Analysis

This section will visualize the spread of COVID-19 through S. Korea. I am using the updated dataset from March 11, 2020, as this section has no code dependency on the upcoming sections

In [12]:
route_df = pd.read_csv('route_updated.csv')
route_df
Out[12]:
patient_id date province city visit latitude longitude
0 1 2020-01-19 Incheon Jung-gu airport 37.460459 126.440680
1 1 2020-01-20 Incheon Seo-gu hospital_isolated 37.478832 126.668558
2 2 2020-01-22 Gyeonggi-do Gimpo-si airport 37.562143 126.801884
3 2 2020-01-23 Seoul Jung-gu hospital_isolated 37.567454 127.005627
4 3 2020-01-20 Incheon Jung-gu airport 37.460459 126.440680
... ... ... ... ... ... ... ...
207 55 2020-02-19 Gyeongsangbuk-do Pohang-si hospital_isolated 36.034762 129.355059
208 56 2020-02-17 Gyeongsangbuk-do Pohang-si hospital 37.576420 126.972759
209 56 2020-02-13 Seoul Dongdaemun-gu hospital 37.593919 127.051291
210 56 2020-02-13 Seoul Jongno-gu hospital 37.581837 126.969186
211 56 2020-02-19 Seoul Jungnang-gu hospital_isolated 37.612806 127.098134

212 rows × 7 columns

In [13]:
# Mapbox API token
token = open("token.txt","r").readline()

I am going to leave out 'hospital_isolated' from the 'visit' column, as it likely means the patient is currently isolated in a hospital; I want to show the places where the virus was contracted

In [14]:
route_df_filtered = route_df[route_df['visit'] != 'hospital_isolated']
route_df_filtered = route_df_filtered.sort_values(by=['date'])
route_df_filtered.head(10)
Out[14]:
patient_id date province city visit latitude longitude
0 1 2020-01-19 Incheon Jung-gu airport 37.460459 126.440680
68 16 2020-01-19 Jeollanam-do Muan-gun airport 34.996485 126.387447
61 14 2020-01-19 Gyeonggi-do Gimpo-si airport 37.563581 126.802056
42 12 2020-01-19 Gyeonggi-do Gimpo-si airport 37.563581 126.802056
4 3 2020-01-20 Incheon Jung-gu airport 37.460459 126.440680
65 15 2020-01-20 Incheon Jung-gu airport 37.460459 126.440680
13 4 2020-01-20 Incheon Jung-gu airport 37.460459 126.440680
14 4 2020-01-20 Gyeonggi-do Pyeongtaek-si bus_terminal 37.079940 127.058282
43 12 2020-01-20 Gyeonggi-do Bucheon-si movie_theater 37.486112 126.781023
44 12 2020-01-21 Incheon Jung-gu office 37.463100 126.631371

Let's visualize the public places where people caught the virus

Here's a list of reported places

In [15]:
route_df_filtered['visit'].unique()
Out[15]:
array(['airport', 'bus_terminal', 'movie_theater', 'office', 'clinic',
       'restaurant', 'train_station', 'hotel', 'store', 'etc', 'cafe',
       'church', 'hospital', 'market', 'hair_salon', 'company'],
      dtype=object)

For the sake of this graph, I am going to condense some of the places that fall in the same category; for example, 'hospital' and 'clinic' are effectively the same

In [16]:
route_df_condensed = route_df_filtered.replace('clinic', 'hospital')
route_df_condensed = route_df_condensed.replace('cafe', 'restaurant')
route_df_condensed = route_df_condensed.replace('company', 'office')
route_df_condensed = route_df_condensed.replace('store', 'market')
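The four chained `replace` calls above can be collapsed into a single call with a mapping dictionary scoped to the 'visit' column; an equivalent sketch on a toy frame:

```python
import pandas as pd

# Toy stand-in for route_df_filtered
route_df_filtered = pd.DataFrame(
    {'visit': ['clinic', 'cafe', 'company', 'store', 'airport']})

# One nested-dict mapping replaces the four chained replace() calls,
# and only touches the 'visit' column
visit_map = {'clinic': 'hospital', 'cafe': 'restaurant',
             'company': 'office', 'store': 'market'}
route_df_condensed = route_df_filtered.replace({'visit': visit_map})
print(route_df_condensed['visit'].tolist())
# ['hospital', 'restaurant', 'office', 'market', 'airport']
```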
In [17]:
route_df_condensed['visit'].unique()
Out[17]:
array(['airport', 'bus_terminal', 'movie_theater', 'office', 'hospital',
       'restaurant', 'train_station', 'hotel', 'market', 'etc', 'church',
       'hair_salon'], dtype=object)
In [18]:
fig1 = px.histogram(route_df_condensed, x="visit", title='Public places where the virus was contracted', nbins=60)
fig1.update_layout(
    bargap=0.1)
fig1.show()

Let's visualize the count of COVID-19 cases in provinces

In [19]:
fig2 = px.histogram(route_df_filtered, x="province", title='Count in S. Korean provinces', nbins=60)
fig2.update_layout(
    bargap=0.1)
fig2.show()

Now I want to visualize the spread of COVID-19 over time. I will demonstrate this using a Plotly scatter plot on the map of South Korea

I came across this blog post by Amaral Lab on how to implement a slider bar with Plotly and decided to try it out

First, let's create all the lists I need for the plot

In [20]:
unique_date_list = np.array(route_df_filtered['date'].dropna().unique())
unique_date_list.sort()
In [21]:
lat = np.array(route_df_filtered['latitude'].dropna())
lon = np.array(route_df_filtered['longitude'].dropna())
city = np.array(route_df_filtered['city'].dropna())
date = np.array(route_df_filtered['date'].dropna())
visit = np.array(route_df_filtered['visit'].dropna())
In [22]:
unique_date_list_len = unique_date_list.size
date_len = date.size
In [23]:
def remove_special_chars(text):
    new_text = text.replace('_', ' ')
    return new_text

The data_slider object will be a tuple of scattermapbox dictionaries; each dictionary accumulates all the points recorded up to that step, so successive frames show the cumulative spread over the dates.

This data_slider will be used to create a go.Figure object for plotting

In [24]:
data_slider = ()

lat_arr, lon_arr, text_arr, city_arr, date_arr, visit_arr, hover_data_arr = [], [], [], [], [], [], []

for i in range(0, lat.size):
    lat_arr = np.append(lat_arr, lat[i])
    lon_arr = np.append(lon_arr, lon[i])
    city_arr = np.append(city_arr, city[i])
    date_arr = np.append(date_arr, date[i])
    visit_arr = np.append(visit_arr, remove_special_chars(visit[i]))
    
    hover_data_i = f'Date: {date_arr[i]}<br>Lat: {lat_arr[i]} <br>Lon: {lon_arr[i]}<br>Visit: {visit_arr[i]}'
    hover_data_arr = np.append(hover_data_arr, hover_data_i)
    
    data_one_day = dict(
          lat = lat_arr,
          lon = lon_arr,
          marker = {'size': 6, 'color':'crimson'},
          mode = 'markers',
          hovertext = hover_data_arr,
          hoverinfo="text",
          type = 'scattermapbox',
    )
    
    data_slider = data_slider + (data_one_day,)
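The loop above rebuilds the coordinate arrays with `np.append` on every iteration, which copies the arrays each time. An equivalent and cheaper sketch slices the full arrays once per frame instead (hypothetical toy coordinates stand in for the notebook's `lat`/`lon`):

```python
import numpy as np

# Hypothetical coordinate arrays standing in for the notebook's lat/lon
lat = np.array([37.46, 34.99, 37.56])
lon = np.array([126.44, 126.39, 126.80])

# Frame i shows all points up to and including index i,
# without any repeated np.append copying
data_slider = tuple(
    dict(
        lat=lat[:i + 1],
        lon=lon[:i + 1],
        marker={'size': 6, 'color': 'crimson'},
        mode='markers',
        type='scattermapbox',
    )
    for i in range(lat.size)
)
print(len(data_slider), len(data_slider[-1]['lat']))
```

The hovertext strings can be built the same way, from a precomputed list sliced per frame.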
In [25]:
fig3 = go.Figure(data_slider)

I will group by date to find the cumulative sum of the daily counts of COVID-19 cases

In [26]:
route_df_date_group = route_df_filtered.groupby('date')
route_df_date_group.first()
Out[26]:
patient_id province city visit latitude longitude
date
2020-01-19 1 Incheon Jung-gu airport 37.460459 126.440680
2020-01-20 3 Incheon Jung-gu airport 37.460459 126.440680
2020-01-21 12 Incheon Jung-gu office 37.463100 126.631371
2020-01-22 12 Gangwon-do Gangneung-si restaurant 37.690782 129.032031
2020-01-23 3 Seoul Gangnam-gu store 37.524669 127.015911
2020-01-24 12 Gyeonggi-do Suwon-si train_station 37.266602 126.999805
2020-01-25 17 Daegu Dong-gu train_station 35.878754 128.625494
2020-01-26 5 Seoul Seongbuk-gu movie_theater 37.592858 127.017016
2020-01-27 8 Jeollabuk-do Gunsan-si clinic 35.968603 126.716109
2020-01-28 5 Seoul Jungnang-gu restaurant 37.588913 127.091112
2020-01-29 17 Gyeonggi-do Guri-si market 37.586809 127.138323
2020-01-30 14 Gyeonggi-do Bucheon-si market 37.484044 126.782436
2020-01-31 19 Gyeonggi-do Seongnam-si company 37.378511 127.114316
2020-02-01 19 Incheon Yeonsu-gu market 37.381624 126.657218
2020-02-02 23 Seoul Mapo-gu market 37.542533 126.953310
2020-02-03 21 Seoul Seongbuk-gu hospital 37.602813 127.039582
2020-02-04 17 Gyeonggi-do Guri-si hospital 37.601095 127.132179
2020-02-05 29 Seoul Jongno-gu hospital 37.575739 127.015399
2020-02-06 31 Daegu Dong-gu company 35.875120 128.627600
2020-02-07 29 Gyeonggi-do Dongducheon-si etc 37.948023 127.061052
2020-02-08 30 Seoul Jongno-gu hospital 37.579541 126.999305
2020-02-09 31 Daegu Nam-gu church 35.839820 128.566600
2020-02-10 30 Seoul Jongno-gu etc 37.574541 127.015927
2020-02-11 29 Seoul Jongno-gu clinic 37.572596 127.015270
2020-02-12 29 Seoul Jongno-gu etc 37.579471 127.015224
2020-02-13 56 Seoul Dongdaemun-gu hospital 37.593919 127.051291
2020-02-14 30 Seoul Jongno-gu clinic 37.572596 127.015270
2020-02-15 40 Seoul Seongdong-gu etc 37.588230 127.063600
2020-02-16 37 Gyeongsangbuk-do Yeongcheon-si clinic 35.932450 128.872000
2020-02-17 31 Daegu Suseong-gu hospital 35.844730 128.612300
2020-02-18 37 Gyeongsangbuk-do Yeongcheon-si clinic 35.965200 128.938700
2020-02-19 40 Seoul Seongdong-gu hospital 37.559700 127.044000
In [27]:
date_case_cumsum = np.array(route_df_date_group['date'].count().cumsum())
date_case_cumsum
Out[27]:
array([  4,   9,  11,  20,  28,  39,  44,  50,  55,  61,  67,  72,  76,
        81,  84,  89,  94, 106, 108, 114, 119, 120, 131, 132, 136, 140,
       143, 149, 151, 154, 157, 158], dtype=int64)

Creating a slider step for each unique date; the resulting slider bar will be used in the plot

In [28]:
steps = []

step_0 = dict(method='restyle',
                args=['visible', [False] * date_len],
                label='Start',
               )

steps.append(step_0)

for i in range(unique_date_list_len):
    step = dict(method='restyle',
                args=np.array(['visible', np.full((date_len), False)]),
                label='{}'.format(unique_date_list[i])) # label to be displayed for each date

    step['args'][1][:date_case_cumsum[i]] = True
    steps.append(step)



##  Creating the 'sliders' object from the 'steps' 
sliders = [dict(active=0, pad={"t": 1}, steps=steps)]  

Creating the plot

In [29]:
fig3.update_layout(
    autosize=True,
    mapbox_style="dark",
    showlegend=False,
    height=600,
    mapbox=dict(
        accesstoken=token,
        bearing=0,
        center=dict(
            lat=36.735362,
            lon=127.828125
        ),
        pitch=0,
        zoom=5.5
    ),
    sliders=sliders,
)

Scrolling through the dates will show the locations where the virus was contracted and how it spread through S. Korea. Please be patient with the slider; there's a small delay before the map updates

From the plot, it can be seen that the first cases of Coronavirus (Jan 19, 2020) were contracted at the airports of some major cities, likely brought by people travelling from Mainland China or other affected countries. The virus then spread through Seoul and other major cities via public places like hospitals/clinics, restaurants, and train stations.


Section 3: Confirmed Cases Forecast

In this section, I will make a forecast for the Coronavirus patient count for N days after the last reported date (March 4, 2020) using the same dataframe from Section 1

In [30]:
patient_df_date_group = patient_df.groupby('confirmed_date')
In [31]:
patient_df.head(10)
Out[31]:
id sex birth_year Age country region group infection_reason infection_order infected_by contact_number confirmed_date released_date deceased_date state
0 1 female 1984.0 36.0 China filtered at airport NaN visit to Wuhan 1.0 NaN 45.0 2020-01-20 2020-02-06 NaN released
1 2 male 1964.0 56.0 Korea filtered at airport NaN visit to Wuhan 1.0 NaN 75.0 2020-01-24 2020-02-05 NaN released
2 3 male 1966.0 54.0 Korea capital area NaN visit to Wuhan 1.0 NaN 16.0 2020-01-26 2020-02-12 NaN released
3 4 male 1964.0 56.0 Korea capital area NaN visit to Wuhan 1.0 NaN 95.0 2020-01-27 2020-02-09 NaN released
4 5 male 1987.0 33.0 Korea capital area NaN visit to Wuhan 1.0 NaN 31.0 2020-01-30 2020-03-02 NaN released
5 6 male 1964.0 56.0 Korea capital area NaN contact with patient 2.0 3.0 17.0 2020-01-30 2020-02-19 NaN released
6 7 male 1991.0 29.0 Korea capital area NaN visit to Wuhan 1.0 NaN 9.0 2020-01-30 2020-02-15 NaN released
7 8 female 1957.0 63.0 Korea Jeollabuk-do NaN visit to Wuhan 1.0 NaN 113.0 2020-01-31 2020-02-12 NaN released
8 9 female 1992.0 28.0 Korea capital area NaN contact with patient 2.0 5.0 2.0 2020-01-31 2020-02-24 NaN released
9 10 female 1966.0 54.0 Korea capital area NaN contact with patient 3.0 6.0 43.0 2020-01-31 2020-02-19 NaN released
In [32]:
patient_df_date_group['confirmed_date'].count()
Out[32]:
confirmed_date
2020-01-20       1
2020-01-24       1
2020-01-26       1
2020-01-27       1
2020-01-30       3
2020-01-31       4
2020-02-01       1
2020-02-02       3
2020-02-04       1
2020-02-05       5
2020-02-06       3
2020-02-09       3
2020-02-10       1
2020-02-16       2
2020-02-18       9
2020-02-19      26
2020-02-20      38
2020-02-21     100
2020-02-22     229
2020-02-23     169
2020-02-24     231
2020-02-25     143
2020-02-26     285
2020-02-27     505
2020-02-28     571
2020-02-29     813
2020-03-01    1062
2020-03-02     600
2020-03-03     516
2020-03-04     438
Name: confirmed_date, dtype: int64
In [33]:
confirmed_case_cumsum = list(patient_df_date_group['confirmed_date'].count().cumsum())
In [34]:
date_list = list(patient_df['confirmed_date'].dropna().unique())
In [35]:
fig = go.Figure(data=go.Scatter(x=date_list, y=confirmed_case_cumsum, mode='markers'))
fig.update_layout(xaxis_title='Date', yaxis_title='COVID-19 Patients', title="Cumulative Sum of Confirmed Cases in South Korea")
fig.show()

It can be seen from the plot that the total confirmed cases are increasing exponentially. Also, some dates are missing from the series entirely (no entries for those days)

For this case, Exponential Smoothing should be the best candidate for prediction, but first I will need to fill in the missing dates, as this method only works on time series without missing values

In [36]:
from datetime import datetime
from datetime import timedelta

converted_date_list = []


def get_str_from_date(str_date, add_day=False):
    '''
    Converts a date string (or a list of date strings) to datetime,
    optionally adds one day, and returns the date(s) in string format
    
    Inputs:
    - str_date: a date string, or a list of date strings
    - add_day: if True, add one day to the date
    
    Returns:
    - date(s) as string(s) in '%Y-%m-%d' format
    
    '''
    if add_day:
        datetime_obj = datetime.strptime(str_date, '%Y-%m-%d') + timedelta(days=1)
        
    elif len(str_date) > 1:
        for d in str_date:
            datetime_obj_i = datetime.strptime(d, '%Y-%m-%d')
            datetime_obj_i = datetime_obj_i.strftime('%Y-%m-%d')
            converted_date_list.append(datetime_obj_i)
        return converted_date_list
    
    else:
        datetime_obj = datetime.strptime(str_date, '%Y-%m-%d')
        
    return datetime_obj.strftime('%Y-%m-%d')
In [37]:
from scipy.interpolate import splrep, splev

def spline_interp(x, y, x_new):
    '''
    Spline interpolation for missing dates
    
    Inputs:
    x: dates
    y: cumulative sum of confirmed cases
    x_new: dates for which the cumulative sum needs to be interpolated
    
    Returns:
    Interpolated cumulative sum
    '''
    tck = splrep(x, y)
    return splev(x_new, tck)
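For comparison, pandas can also fill the missing dates directly on a date-indexed Series; a linear (time-weighted) interpolation sketch on hypothetical toy counts (the notebook's spline follows the curvature of the actual data more closely):

```python
import pandas as pd

# Toy cumulative counts with missing days on 2020-01-21 and 2020-01-22
s = pd.Series([1, 4, 9],
              index=pd.to_datetime(['2020-01-20', '2020-01-23', '2020-01-24']))

# Reindex to daily frequency (inserting NaN), then interpolate by time
daily = s.asfreq('D').interpolate(method='time')
print(daily)
```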
In [38]:
# Creating a continuous date array
df_interp = pd.DataFrame()
df_interp['dates'] = np.arange(date_list[0], get_str_from_date(date_list[-1], add_day=True), dtype='datetime64[D]')                     
In [39]:
# Finding spline interpolated cumulative sum for missing dates
import matplotlib.dates as mdates

datetime_date_list = [datetime.strptime(d, '%Y-%m-%d') for d in date_list]
df_interp['cum_sum'] = spline_interp(mdates.date2num(datetime_date_list), confirmed_case_cumsum, mdates.date2num(df_interp['dates']))
In [40]:
fig2 = go.Figure()

fig2.add_trace(go.Scatter(x=df_interp['dates'], y=df_interp['cum_sum'], mode='lines', name='spline interpolation'))
fig2.add_trace(go.Scatter(x=date_list, y=confirmed_case_cumsum, mode='markers', name='actual'))
fig2.update_layout(xaxis_title='Date', yaxis_title='COVID-19 Patients', title="Spline Interpolation for Missing Dates")
fig2.show()

The filled-in data fits the actual data perfectly

In [41]:
forecast_days = 6
md("#### Now let's find the forecast for the next {} days and plot it against actual data".format(forecast_days))
Out[41]:

Now let's find the forecast for the next 6 days and plot it against actual data

I will add a damping slope because the gradient of the curve is slowly starting to decrease. A damping slope of 0.91 results in the most realistic trend for the curve

In [56]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def get_exp_smoothing_forecast(end_date, forecast_days):
    '''
    Creates and trains ExponentialSmoothing model and returns the forecast and model
    
    Inputs: 
    - end date: desired end date of forecast (datetime)
    - forecast_days: number of forecast days (int)
    
    Returns:
    - forecasted_df: dataframe containing the forecast dates and predicted case count
    - model: ExponentialSmoothing model

    '''
    
    series_interp = pd.Series(df_interp['cum_sum'].values, 
                          pd.date_range(start=date_list[0], end=date_list[-1], freq='D'))
    model = ExponentialSmoothing(series_interp, trend='add', damped=True).fit(damping_slope=0.91, optimized=True)
    forecasted_df = pd.concat([series_interp, model.forecast(forecast_days)])
    return forecasted_df, model
In [57]:
forecasted_df, fit1 = get_exp_smoothing_forecast(end_date=date_list[-1], forecast_days=forecast_days)
c:\users\shafi\desktop\data science project\venv\lib\site-packages\statsmodels\tsa\holtwinters.py:744: ConvergenceWarning:

Optimization failed to converge. Check mle_retvals.

In [58]:
fig3 = go.Figure()
fig3.add_trace(go.Scatter(x=date_list, y=confirmed_case_cumsum, mode='markers', name='actual'))
fig3.add_trace(go.Scatter(x=forecasted_df.index.tolist(), y=forecasted_df.values.tolist(), mode='lines', name='forecast'))
fig3.update_layout(xaxis_title='Date', yaxis_title='COVID-19 Patients', title="Forecasted number of patients vs actual patients in South Korea")
fig3.show()
In [60]:
exp_smoothing_forecast = fit1.forecast(forecast_days)
exp_smoothing_forecast
Out[60]:
2020-03-05    6163.586151
2020-03-06    6526.296678
2020-03-07    6856.363258
2020-03-08    7156.723845
2020-03-09    7430.051980
2020-03-10    7678.780583
Freq: D, dtype: float64

Section 4: Accuracy of forecast

For this section, I'm using the updated (as of March 11, 2020) dataset from Kaggle to check the accuracy of my prediction

In [61]:
new_patient_df = pd.read_csv('patient_updated_mar_10_2020.csv')
In [62]:
new_patient_df_date_group = new_patient_df.groupby('confirmed_date')
In [63]:
cumsum_series = new_patient_df_date_group['confirmed_date'].count().cumsum()
In [64]:
updated_confirmed_case_cumsum = list(cumsum_series)
In [65]:
new_date_list = list(new_patient_df['confirmed_date'].dropna().unique())
In [66]:
fig4 = go.Figure()
fig4.add_trace(go.Scatter(x=new_date_list, y=updated_confirmed_case_cumsum, mode='markers', name='updated actual'))
fig4.add_trace(go.Scatter(x=forecasted_df.index.tolist(), y=forecasted_df.values.tolist(), mode='lines', name='forecast'))
fig4.update_layout(xaxis_title='Date', yaxis_title='COVID-19 Patients', title="Forecasted number of patients vs updated actual patients in South Korea")
fig4.show()
In [67]:
display(exp_smoothing_forecast) 

# Actual updated count
display(new_patient_df_date_group['confirmed_date'].count().cumsum()[-forecast_days:])
2020-03-05    6163.586151
2020-03-06    6526.296678
2020-03-07    6856.363258
2020-03-08    7156.723845
2020-03-09    7430.051980
2020-03-10    7678.780583
Freq: D, dtype: float64
confirmed_date
2020-03-05    6284
2020-03-06    6769
2020-03-07    7133
2020-03-08    7381
2020-03-09    7512
2020-03-10    7754
Name: confirmed_date, dtype: int64
In [68]:
from sklearn.metrics import mean_squared_error

y_true = np.array(updated_confirmed_case_cumsum[-forecast_days:])
y_pred = np.array(forecasted_df.values[-forecast_days:])

forecast_rmse = mean_squared_error(y_true, y_pred, squared=False)
md("#### My forecasted results have an RMSE of approx. {} counts".format(round(forecast_rmse)))
Out[68]:

My forecasted results have an RMSE of approx. 188.0 counts

I will calculate the Mean Absolute Percentage Error (MAPE) of my forecast to better assess its accuracy

In [69]:
def mean_absolute_percentage_error(y_true, y_pred):
    if len(y_true) > 1 and len(y_pred) > 1 and len(y_true) == len(y_pred):
        return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    else:
        print("check if y_true and y_pred are same length arrays ")
In [70]:
forecast_mape = mean_absolute_percentage_error(y_true, y_pred)
md("#### My forecasted results have a MAPE of {} %".format(round(forecast_mape, 2)))
Out[70]:

My forecasted results have a MAPE of 2.41 %


Section 5: Update on forecast

Let's check how my forecast model is doing as of April 27, 2020

In [71]:
new_forecast_days = 54   # 54 days since last dataset 
In [72]:
updated_forecasted_df, model = get_exp_smoothing_forecast(end_date='2020-04-27', forecast_days=new_forecast_days)
c:\users\shafi\desktop\data science project\venv\lib\site-packages\statsmodels\tsa\holtwinters.py:744: ConvergenceWarning:

Optimization failed to converge. Check mle_retvals.

In [73]:
updated_forecasted_df
Out[73]:
2020-01-20        1.000000
2020-01-21        1.672011
2020-01-22        1.916812
2020-01-23        1.953207
2020-01-24        2.000000
                  ...     
2020-04-23    10154.043745
2020-04-24    10157.613089
2020-04-25    10160.861192
2020-04-26    10163.816965
2020-04-27    10166.506719
Freq: D, Length: 99, dtype: float64

Reported number of coronavirus (COVID-19) confirmed cases in South Korea as of April 27, 2020:

10,738

My forecast for April 27th:

10166

In [74]:
percentage_diff = (10738 - 10166) / 10738 * 100 
md("### My forecasted result for April 27th is off by {} %".format(round(percentage_diff, 2)))
Out[74]:

My forecasted result for April 27th is off by 5.33 %
