COVID-19 Data Analysis & Forecast

This analysis is based on the Coronavirus dataset for South Korea from Kaggle, dated March 5, 2020

This project is divided into multiple sections.

  • Section 1: Patient Analysis

  • Section 2: Contagion Analysis

  • Section 3: Confirmed Cases Forecast

  • Section 4: Accuracy of forecast

  • Section 5: Update on forecast


In [1]:
import numpy as np
import pandas as pd

from IPython.display import Markdown as md
In [2]:
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import plotly.graph_objects as go
import plotly.express as px
import cufflinks as cf

init_notebook_mode(connected=True)
cf.go_offline()
In [3]:
import warnings

warnings.filterwarnings('ignore')
warnings.simplefilter('ignore')

Section 1: Patient Analysis

In [4]:
patient_df = pd.read_csv('patient.csv')
patient_df
Out[4]:
id sex birth_year country region group infection_reason infection_order infected_by contact_number confirmed_date released_date deceased_date state
0 1 female 1984.0 China filtered at airport NaN visit to Wuhan 1.0 NaN 45.0 2020-01-20 2020-02-06 NaN released
1 2 male 1964.0 Korea filtered at airport NaN visit to Wuhan 1.0 NaN 75.0 2020-01-24 2020-02-05 NaN released
2 3 male 1966.0 Korea capital area NaN visit to Wuhan 1.0 NaN 16.0 2020-01-26 2020-02-12 NaN released
3 4 male 1964.0 Korea capital area NaN visit to Wuhan 1.0 NaN 95.0 2020-01-27 2020-02-09 NaN released
4 5 male 1987.0 Korea capital area NaN visit to Wuhan 1.0 NaN 31.0 2020-01-30 2020-03-02 NaN released
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5761 5762 NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5762 5763 NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5763 5764 NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5764 5765 NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5765 5766 NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated

5766 rows × 14 columns

In [5]:
fig = px.histogram(patient_df, x="infection_reason", title='Infection Reason Distribution')
fig.show()
In [6]:
patient_state_no_null_df = patient_df['state'].dropna()

fig = px.pie(patient_state_no_null_df, names='state', title='Patient State Distribution')
fig.show()

Let's add the age of the patients

In [7]:
age_list = 2020 - patient_df['birth_year']
patient_df.insert(3, "Age", age_list)
patient_df
Out[7]:
id sex birth_year Age country region group infection_reason infection_order infected_by contact_number confirmed_date released_date deceased_date state
0 1 female 1984.0 36.0 China filtered at airport NaN visit to Wuhan 1.0 NaN 45.0 2020-01-20 2020-02-06 NaN released
1 2 male 1964.0 56.0 Korea filtered at airport NaN visit to Wuhan 1.0 NaN 75.0 2020-01-24 2020-02-05 NaN released
2 3 male 1966.0 54.0 Korea capital area NaN visit to Wuhan 1.0 NaN 16.0 2020-01-26 2020-02-12 NaN released
3 4 male 1964.0 56.0 Korea capital area NaN visit to Wuhan 1.0 NaN 95.0 2020-01-27 2020-02-09 NaN released
4 5 male 1987.0 33.0 Korea capital area NaN visit to Wuhan 1.0 NaN 31.0 2020-01-30 2020-03-02 NaN released
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
5761 5762 NaN NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5762 5763 NaN NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5763 5764 NaN NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5764 5765 NaN NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated
5765 5766 NaN NaN NaN Korea NaN NaN NaN NaN NaN NaN 2020-03-04 NaN NaN isolated

5766 rows × 15 columns

I will break the patient state into dummy variables

In [8]:
age_state_df = patient_df.filter(['Age','state']).dropna()
age_state_df = pd.get_dummies(age_state_df)
age_state_df
Out[8]:
Age state_deceased state_isolated state_released
0 36.0 0 0 1
1 56.0 0 0 1
2 54.0 0 0 1
3 56.0 0 0 1
4 33.0 0 0 1
... ... ... ... ...
5025 47.0 0 1 0
5050 39.0 0 1 0
5142 60.0 1 0 0
5172 46.0 0 1 0
5580 7.0 0 1 0

408 rows × 4 columns

In [9]:
age_list = age_state_df['Age'].unique()
In [10]:
state_deceased_list, state_isolated_list, state_released_list = [],[],[]

for age in age_list:
    state_deceased_list_i = age_state_df[age_state_df['Age']== age]['state_deceased'].sum()
    state_deceased_list.append(state_deceased_list_i)
    
    state_isolated_list_i = age_state_df[age_state_df['Age']== age]['state_isolated'].sum()
    state_isolated_list.append(state_isolated_list_i)
    
    state_released_list_i = age_state_df[age_state_df['Age']== age]['state_released'].sum()
    state_released_list.append(state_released_list_i)
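The three per-age loops above can also be written as a single pandas groupby; a sketch, using a hypothetical miniature version of the `age_state_df` dummy frame:

```python
import pandas as pd

# Hypothetical miniature stand-in for age_state_df built above
age_state_df = pd.DataFrame({
    'Age': [36.0, 36.0, 60.0],
    'state_deceased': [0, 0, 1],
    'state_isolated': [0, 1, 0],
    'state_released': [1, 0, 0],
})

# One groupby-sum replaces the three per-age accumulation loops
state_counts = age_state_df.groupby('Age')[
    ['state_deceased', 'state_isolated', 'state_released']
].sum()
print(state_counts)
```

Each row of `state_counts` then holds the per-state counts for one age, which can be fed to the bar traces directly.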
In [11]:
fig = go.Figure(data=[
    go.Bar(name='Deceased', x=age_list, y=state_deceased_list, marker_color='red'),
    go.Bar(name='Isolated', x=age_list, y=state_isolated_list, marker_color='orange'),
    go.Bar(name='Released', x=age_list, y=state_released_list, marker_color='green'),
])
# Change the bar mode
fig.update_layout(barmode='stack', title='Age-State Distribution', xaxis_title='Patient Age', yaxis_title='Count')
fig.show()

All patient deaths occurred over the age of 36, most likely due to underlying health conditions, as the news suggested
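This observation can be checked directly from the dummy frame by taking the minimum age among deceased patients; a sketch with hypothetical values standing in for `age_state_df`:

```python
import pandas as pd

# Hypothetical stand-in for age_state_df built above
age_state_df = pd.DataFrame({
    'Age': [33.0, 60.0, 75.0],
    'state_deceased': [0, 1, 1],
})

# Minimum age among deceased patients backs the over-36 observation
min_deceased_age = age_state_df.loc[
    age_state_df['state_deceased'] == 1, 'Age'
].min()
print(min_deceased_age)
```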


Section 2: Contagion Analysis

This section will visualize the spread of COVID-19 through S. Korea. I am using the updated dataset from March 11, 2020, as this section has no code dependency on the upcoming sections

In [12]:
route_df = pd.read_csv('route_updated.csv')
route_df
Out[12]:
patient_id date province city visit latitude longitude
0 1 2020-01-19 Incheon Jung-gu airport 37.460459 126.440680
1 1 2020-01-20 Incheon Seo-gu hospital_isolated 37.478832 126.668558
2 2 2020-01-22 Gyeonggi-do Gimpo-si airport 37.562143 126.801884
3 2 2020-01-23 Seoul Jung-gu hospital_isolated 37.567454 127.005627
4 3 2020-01-20 Incheon Jung-gu airport 37.460459 126.440680
... ... ... ... ... ... ... ...
207 55 2020-02-19 Gyeongsangbuk-do Pohang-si hospital_isolated 36.034762 129.355059
208 56 2020-02-17 Gyeongsangbuk-do Pohang-si hospital 37.576420 126.972759
209 56 2020-02-13 Seoul Dongdaemun-gu hospital 37.593919 127.051291
210 56 2020-02-13 Seoul Jongno-gu hospital 37.581837 126.969186
211 56 2020-02-19 Seoul Jungnang-gu hospital_isolated 37.612806 127.098134

212 rows × 7 columns

In [13]:
# Mapbox API token
token = open("token.txt","r").readline()

I am going to leave out 'hospital_isolated' from the 'visit' column, as it likely means the patient is currently isolated in a hospital; I want to show the places where the virus was contracted

In [14]:
route_df_filtered = route_df[route_df['visit'] != 'hospital_isolated']
route_df_filtered = route_df_filtered.sort_values(by=['date'])
route_df_filtered.head(10)
Out[14]:
patient_id date province city visit latitude longitude
0 1 2020-01-19 Incheon Jung-gu airport 37.460459 126.440680
68 16 2020-01-19 Jeollanam-do Muan-gun airport 34.996485 126.387447
61 14 2020-01-19 Gyeonggi-do Gimpo-si airport 37.563581 126.802056
42 12 2020-01-19 Gyeonggi-do Gimpo-si airport 37.563581 126.802056
4 3 2020-01-20 Incheon Jung-gu airport 37.460459 126.440680
65 15 2020-01-20 Incheon Jung-gu airport 37.460459 126.440680
13 4 2020-01-20 Incheon Jung-gu airport 37.460459 126.440680
14 4 2020-01-20 Gyeonggi-do Pyeongtaek-si bus_terminal 37.079940 127.058282
43 12 2020-01-20 Gyeonggi-do Bucheon-si movie_theater 37.486112 126.781023
44 12 2020-01-21 Incheon Jung-gu office 37.463100 126.631371

Let's visualize the public places where people caught the virus

Here's a list of reported places

In [15]:
route_df_filtered['visit'].unique()
Out[15]:
array(['airport', 'bus_terminal', 'movie_theater', 'office', 'clinic',
       'restaurant', 'train_station', 'hotel', 'store', 'etc', 'cafe',
       'church', 'hospital', 'market', 'hair_salon', 'company'],
      dtype=object)

For the sake of this graph, I am going to condense some of the places that fall in the same category; for example, 'hospital' and 'clinic' are effectively the same

In [16]:
route_df_condensed = route_df_filtered.replace('clinic', 'hospital')
route_df_condensed = route_df_condensed.replace('cafe', 'restaurant')
route_df_condensed = route_df_condensed.replace('company', 'office')
route_df_condensed = route_df_condensed.replace('store', 'market')
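The four chained `replace` calls above can be collapsed into a single call with a mapping dictionary scoped to the 'visit' column; an equivalent sketch on a toy frame:

```python
import pandas as pd

# Toy stand-in for route_df_filtered
route_df_filtered = pd.DataFrame(
    {'visit': ['clinic', 'cafe', 'company', 'store', 'airport']})

# One nested-dict mapping replaces the four chained replace() calls,
# and only touches the 'visit' column
visit_map = {'clinic': 'hospital', 'cafe': 'restaurant',
             'company': 'office', 'store': 'market'}
route_df_condensed = route_df_filtered.replace({'visit': visit_map})
print(route_df_condensed['visit'].tolist())
# ['hospital', 'restaurant', 'office', 'market', 'airport']
```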
In [17]:
route_df_condensed['visit'].unique()
Out[17]:
array(['airport', 'bus_terminal', 'movie_theater', 'office', 'hospital',
       'restaurant', 'train_station', 'hotel', 'market', 'etc', 'church',
       'hair_salon'], dtype=object)
In [18]:
fig1 = px.histogram(route_df_condensed, x="visit", title='Public places where the virus was contracted', nbins=60)
fig1.update_layout(
    bargap=0.1)
fig1.show()

Let's visualize the count of COVID-19 cases in provinces

In [19]:
fig2 = px.histogram(route_df_filtered, x="province", title='Count in S. Korean provinces', nbins=60)
fig2.update_layout(
    bargap=0.1)
fig2.show()

Now I want to visualize the spread of COVID-19 over time. I will demonstrate this using a Plotly scatter plot on the map of South Korea

I came across this blog post by Amaral Lab on how to implement a slider bar with Plotly and decided to try it out

First, let's create all the lists I need for the plot

In [20]:
unique_date_list = np.array(route_df_filtered['date'].dropna().unique())
unique_date_list.sort()
In [21]:
lat = np.array(route_df_filtered['latitude'].dropna())
lon = np.array(route_df_filtered['longitude'].dropna())
city = np.array(route_df_filtered['city'].dropna())
date = np.array(route_df_filtered['date'].dropna())
visit = np.array(route_df_filtered['visit'].dropna())
In [22]:
unique_date_list_len = unique_date_list.size
date_len = date.size
In [23]:
def remove_special_chars(text):
    new_text = text.replace('_', ' ')
    return new_text

The data_slider object will be a tuple of scattermapbox dictionaries; each dictionary accumulates all the points recorded up to that step, so successive frames show the cumulative spread over the dates.

This data_slider will be used to create a go.Figure object for plotting

In [24]:
data_slider = ()

lat_arr, lon_arr, text_arr, city_arr, date_arr, visit_arr, hover_data_arr = [], [], [], [], [], [], []

for i in range(0, lat.size):
    lat_arr = np.append(lat_arr, lat[i])
    lon_arr = np.append(lon_arr, lon[i])
    city_arr = np.append(city_arr, city[i])
    date_arr = np.append(date_arr, date[i])
    visit_arr = np.append(visit_arr, remove_special_chars(visit[i]))
    
    hover_data_i = f'Date: {date_arr[i]}<br>Lat: {lat_arr[i]} <br>Lon: {lon_arr[i]}<br>Visit: {visit_arr[i]}'
    hover_data_arr = np.append(hover_data_arr, hover_data_i)
    
    data_one_day = dict(
          lat = lat_arr,
          lon = lon_arr,
          marker = {'size': 6, 'color':'crimson'},
          mode = 'markers',
          hovertext = hover_data_arr,
          hoverinfo="text",
          type = 'scattermapbox',
    )
    
    data_slider = data_slider + (data_one_day,)
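The loop above rebuilds the coordinate arrays with `np.append` on every iteration, which copies the arrays each time. An equivalent and cheaper sketch slices the full arrays once per frame instead (hypothetical toy coordinates stand in for the notebook's `lat`/`lon`):

```python
import numpy as np

# Hypothetical coordinate arrays standing in for the notebook's lat/lon
lat = np.array([37.46, 34.99, 37.56])
lon = np.array([126.44, 126.39, 126.80])

# Frame i shows all points up to and including index i,
# without any repeated np.append copying
data_slider = tuple(
    dict(
        lat=lat[:i + 1],
        lon=lon[:i + 1],
        marker={'size': 6, 'color': 'crimson'},
        mode='markers',
        type='scattermapbox',
    )
    for i in range(lat.size)
)
print(len(data_slider), len(data_slider[-1]['lat']))
```

The hovertext strings can be built the same way, from a precomputed list sliced per frame.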
In [25]:
fig3 = go.Figure(data_slider)

I will group by date to find the cumulative sum of the daily counts of COVID-19 cases

In [26]:
route_df_date_group = route_df_filtered.groupby('date')
route_df_date_group.first()
Out[26]:
patient_id province city visit latitude longitude
date
2020-01-19 1 Incheon Jung-gu airport 37.460459 126.440680
2020-01-20 3 Incheon Jung-gu airport 37.460459 126.440680
2020-01-21 12 Incheon Jung-gu office 37.463100 126.631371
2020-01-22 12 Gangwon-do Gangneung-si restaurant 37.690782 129.032031
2020-01-23 3 Seoul Gangnam-gu store 37.524669 127.015911
2020-01-24 12 Gyeonggi-do Suwon-si train_station 37.266602 126.999805
2020-01-25 17 Daegu Dong-gu train_station 35.878754 128.625494
2020-01-26 5 Seoul Seongbuk-gu movie_theater 37.592858 127.017016
2020-01-27 8 Jeollabuk-do Gunsan-si clinic 35.968603 126.716109
2020-01-28 5 Seoul Jungnang-gu restaurant 37.588913 127.091112
2020-01-29 17 Gyeonggi-do Guri-si market 37.586809 127.138323
2020-01-30 14 Gyeonggi-do Bucheon-si market 37.484044 126.782436
2020-01-31 19 Gyeonggi-do Seongnam-si company 37.378511 127.114316
2020-02-01 19 Incheon Yeonsu-gu market 37.381624 126.657218
2020-02-02 23 Seoul Mapo-gu market 37.542533 126.953310
2020-02-03 21 Seoul Seongbuk-gu hospital 37.602813 127.039582
2020-02-04 17 Gyeonggi-do Guri-si hospital 37.601095 127.132179
2020-02-05 29 Seoul Jongno-gu hospital 37.575739 127.015399
2020-02-06 31 Daegu Dong-gu company 35.875120 128.627600
2020-02-07 29 Gyeonggi-do Dongducheon-si etc 37.948023 127.061052
2020-02-08 30 Seoul Jongno-gu hospital 37.579541 126.999305
2020-02-09 31 Daegu Nam-gu church 35.839820 128.566600
2020-02-10 30 Seoul Jongno-gu etc 37.574541 127.015927
2020-02-11 29 Seoul Jongno-gu clinic 37.572596 127.015270
2020-02-12 29 Seoul Jongno-gu etc 37.579471 127.015224
2020-02-13 56 Seoul Dongdaemun-gu hospital 37.593919 127.051291
2020-02-14 30 Seoul Jongno-gu clinic 37.572596 127.015270
2020-02-15 40 Seoul Seongdong-gu etc 37.588230 127.063600
2020-02-16 37 Gyeongsangbuk-do Yeongcheon-si clinic 35.932450 128.872000
2020-02-17 31 Daegu Suseong-gu hospital 35.844730 128.612300
2020-02-18 37 Gyeongsangbuk-do Yeongcheon-si clinic 35.965200 128.938700
2020-02-19 40 Seoul Seongdong-gu hospital 37.559700 127.044000
In [27]:
date_case_cumsum = np.array(route_df_date_group['date'].count().cumsum())
date_case_cumsum
Out[27]:
array([  4,   9,  11,  20,  28,  39,  44,  50,  55,  61,  67,  72,  76,
        81,  84,  89,  94, 106, 108, 114, 119, 120, 131, 132, 136, 140,
       143, 149, 151, 154, 157, 158], dtype=int64)

Creating a slider step for each unique date; the resulting slider bar will be used in the plot

In [28]:
steps = []

step_0 = dict(method='restyle',
                args=['visible', [False] * date_len],
                label='Start',
               )

steps.append(step_0)

for i in range(unique_date_list_len):
    step = dict(method='restyle',
                args=np.array(['visible', np.full((date_len), False)]),
                label='{}'.format(unique_date_list[i])) # label to be displayed for each date

    step['args'][1][:date_case_cumsum[i]] = True
    steps.append(step)



##  Creating the 'sliders' object from the 'steps' 
sliders = [dict(active=0, pad={"t": 1}, steps=steps)]  

Creating the plot

In [29]:
fig3.update_layout(
    autosize=True,
    mapbox_style="dark",
    showlegend=False,
    height=600,
    mapbox=dict(
        accesstoken=token,
        bearing=0,
        center=dict(
            lat=36.735362,
            lon=127.828125
        ),
        pitch=0,
        zoom=5.5
    ),
    sliders=sliders,
)

Scrolling through the dates will show the locations where the virus was contracted and how it spread through S. Korea. Please be patient with the slider; there's a small delay before the map updates

From the plot, it can be seen that the first cases of Coronavirus (Jan 19, 2020) were contracted at the airports of some major cities, likely brought by people travelling from Mainland China or other affected countries. The virus then spread through Seoul and other major cities via public places like hospitals/clinics, restaurants, and train stations.


Section 3: Confirmed Cases Forecast

In this section, I will make a forecast for the Coronavirus patient count for N days after the last reported date (March 4, 2020) using the same dataframe from Section 1

In [30]:
patient_df_date_group = patient_df.groupby('confirmed_date')
In [31]:
patient_df.head(10)
Out[31]:
id sex birth_year Age country region group infection_reason infection_order infected_by contact_number confirmed_date released_date deceased_date state
0 1 female 1984.0 36.0 China filtered at airport NaN visit to Wuhan 1.0 NaN 45.0 2020-01-20 2020-02-06 NaN released
1 2 male 1964.0 56.0 Korea filtered at airport NaN visit to Wuhan 1.0 NaN 75.0 2020-01-24 2020-02-05 NaN released
2 3 male 1966.0 54.0 Korea capital area NaN visit to Wuhan 1.0 NaN 16.0 2020-01-26 2020-02-12 NaN released
3 4 male 1964.0 56.0 Korea capital area NaN visit to Wuhan 1.0 NaN 95.0 2020-01-27 2020-02-09 NaN released
4 5 male 1987.0 33.0 Korea capital area NaN visit to Wuhan 1.0 NaN 31.0 2020-01-30 2020-03-02 NaN released
5 6 male 1964.0 56.0 Korea capital area NaN contact with patient 2.0 3.0 17.0 2020-01-30 2020-02-19 NaN released
6 7 male 1991.0 29.0 Korea capital area NaN visit to Wuhan 1.0 NaN 9.0 2020-01-30 2020-02-15 NaN released
7 8 female 1957.0 63.0 Korea Jeollabuk-do NaN visit to Wuhan 1.0 NaN 113.0 2020-01-31 2020-02-12 NaN released
8 9 female 1992.0 28.0 Korea capital area NaN contact with patient 2.0 5.0 2.0 2020-01-31 2020-02-24 NaN released
9 10 female 1966.0 54.0 Korea capital area NaN contact with patient 3.0 6.0 43.0 2020-01-31 2020-02-19 NaN released
In [32]:
patient_df_date_group['confirmed_date'].count()
Out[32]:
confirmed_date
2020-01-20       1
2020-01-24       1
2020-01-26       1
2020-01-27       1
2020-01-30       3
2020-01-31       4
2020-02-01       1
2020-02-02       3
2020-02-04       1
2020-02-05       5
2020-02-06       3
2020-02-09       3
2020-02-10       1
2020-02-16       2
2020-02-18       9
2020-02-19      26
2020-02-20      38
2020-02-21     100
2020-02-22     229
2020-02-23     169
2020-02-24     231
2020-02-25     143
2020-02-26     285
2020-02-27     505
2020-02-28     571
2020-02-29     813
2020-03-01    1062
2020-03-02     600
2020-03-03     516
2020-03-04     438
Name: confirmed_date, dtype: int64
In [33]:
confirmed_case_cumsum = list(patient_df_date_group['confirmed_date'].count().cumsum())
In [34]:
date_list = list(patient_df['confirmed_date'].dropna().unique())
In [35]:
fig = go.Figure(data=go.Scatter(x=date_list, y=confirmed_case_cumsum, mode='markers'))
fig.update_layout(xaxis_title='Date', yaxis_title='COVID-19 Patients', title="Cumulative Sum of Confirmed Cases in South Korea")
fig.show()

It can be seen from the plot that the total confirmed cases are increasing exponentially. Also, some dates are missing from the series entirely (no entries for those days)

For this case, Exponential Smoothing should be the best candidate for prediction, but first I will need to fill in the missing dates, as this method only works on time series without missing values

In [36]:
from datetime import datetime
from datetime import timedelta

converted_date_list = []


def get_str_from_date(str_date, add_day=False):
    '''
    Converts a date string (or a list of date strings) to datetime,
    optionally adds one day, and returns the date(s) in string format
    
    Inputs:
    - str_date: a date string, or a list of date strings
    - add_day: if True, add one day to the date
    
    Returns:
    - date(s) as string(s) in '%Y-%m-%d' format
    
    '''
    if add_day:
        datetime_obj = datetime.strptime(str_date, '%Y-%m-%d') + timedelta(days=1)
        
    elif len(str_date) > 1:
        for d in str_date:
            datetime_obj_i = datetime.strptime(d, '%Y-%m-%d')
            datetime_obj_i = datetime_obj_i.strftime('%Y-%m-%d')
            converted_date_list.append(datetime_obj_i)
        return converted_date_list
    
    else:
        datetime_obj = datetime.strptime(str_date, '%Y-%m-%d')
        
    return datetime_obj.strftime('%Y-%m-%d')
In [37]:
from scipy.interpolate import splrep, splev

def spline_interp(x, y, x_new):
    '''
    Spline interpolation for missing dates
    
    Inputs:
    x: dates
    y: cumulative sum of confirmed cases
    x_new: dates for which the cumulative sum needs to be interpolated
    
    Returns:
    Interpolated cumulative sum
    '''
    tck = splrep(x, y)
    return splev(x_new, tck)
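For comparison, pandas can also fill the missing dates directly on a date-indexed Series; a linear (time-weighted) interpolation sketch on hypothetical toy counts (the notebook's spline follows the curvature of the actual data more closely):

```python
import pandas as pd

# Toy cumulative counts with missing days on 2020-01-21 and 2020-01-22
s = pd.Series([1, 4, 9],
              index=pd.to_datetime(['2020-01-20', '2020-01-23', '2020-01-24']))

# Reindex to daily frequency (inserting NaN), then interpolate by time
daily = s.asfreq('D').interpolate(method='time')
print(daily)
```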
In [38]:
# Creating a continuous date array
df_interp = pd.DataFrame()
df_interp['dates'] = np.arange(date_list[0], get_str_from_date(date_list[-1], add_day=True), dtype='datetime64[D]')                     
In [39]:
# Finding spline interpolated cumulative sum for missing dates
import matplotlib.dates as mdates

datetime_date_list = [datetime.strptime(d, '%Y-%m-%d') for d in date_list]
df_interp['cum_sum'] = spline_interp(mdates.date2num(datetime_date_list), confirmed_case_cumsum, mdates.date2num(df_interp['dates']))
In [40]:
fig2 = go.Figure()

fig2.add_trace(go.Scatter(x=df_interp['dates'], y=df_interp['cum_sum'], mode='lines', name='spline interpolation'))
fig2.add_trace(go.Scatter(x=date_list, y=confirmed_case_cumsum, mode='markers', name='actual'))
fig2.update_layout(xaxis_title='Date', yaxis_title='COVID-19 Patients', title="Spline Interpolation for Missing Dates")
fig2.show()

The filled-in data fits the actual data perfectly

In [41]:
forecast_days = 6
md("#### Now let's find the forecast for the next {} days and plot it against actual data".format(forecast_days))
Out[41]:

Now let's find the forecast for the next 6 days and plot it against actual data

I will add a damping slope because the gradient of the curve is slowly starting to decrease. A damping slope of 0.91 results in the most realistic trend for the curve

In [56]:
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def get_exp_smoothing_forecast(end_date, forecast_days):
    '''
    Creates and trains ExponentialSmoothing model and returns the forecast and model
    
    Inputs: 
    - end date: desired end date of forecast (datetime)
    - forecast_days: number of forecast days (int)
    
    Returns:
    - forecasted_df: dataframe containing the forecast dates and predicted case count
    - model: ExponentialSmoothing model

    '''
    
    series_interp = pd.Series(df_interp['cum_sum'].values, 
                          pd.date_range(start=date_list[0], end=date_list[-1], freq='D'))
    model = ExponentialSmoothing(series_interp, trend='add', damped=True).fit(damping_slope=0.91, optimized=True)
    forecasted_df = pd.concat([series_interp, model.forecast(forecast_days)])
    return forecasted_df, model
In [57]:
forecasted_df, fit1 = get_exp_smoothing_forecast(end_date=date_list[-1], forecast_days=forecast_days)
c:\users\shafi\desktop\data science project\venv\lib\site-packages\statsmodels\tsa\holtwinters.py:744: ConvergenceWarning:

Optimization failed to converge. Check mle_retvals.

In [58]:
fig3 = go.Figure()
fig3.add_trace(go.Scatter(x=date_list, y=confirmed_case_cumsum, mode='markers', name='actual'))
fig3.add_trace(go.Scatter(x=forecasted_df.index.tolist(), y=forecasted_df.values.tolist(), mode='lines', name='forecast'))
fig3.update_layout(xaxis_title='Date', yaxis_title='COVID-19 Patients', title="Forecasted number of patients vs actual patients in South Korea")
fig3.show()
In [60]:
exp_smoothing_forecast = fit1.forecast(forecast_days)
exp_smoothing_forecast
Out[60]:
2020-03-05    6163.586151
2020-03-06    6526.296678
2020-03-07    6856.363258
2020-03-08    7156.723845
2020-03-09    7430.051980
2020-03-10    7678.780583
Freq: D, dtype: float64

Section 4: Accuracy of forecast

For this section, I'm using the updated (as of March 11, 2020) dataset from Kaggle to check the accuracy of my prediction

In [61]:
new_patient_df = pd.read_csv('patient_updated_mar_10_2020.csv')
In [62]:
new_patient_df_date_group = new_patient_df.groupby('confirmed_date')
In [63]:
cumsum_series = new_patient_df_date_group['confirmed_date'].count().cumsum()
In [64]:
updated_confirmed_case_cumsum = list(cumsum_series)
In [65]:
new_date_list = list(new_patient_df['confirmed_date'].dropna().unique())
In [66]:
fig4 = go.Figure()
fig4.add_trace(go.Scatter(x=new_date_list, y=updated_confirmed_case_cumsum, mode='markers', name='updated actual'))
fig4.add_trace(go.Scatter(x=forecasted_df.index.tolist(), y=forecasted_df.values.tolist(), mode='lines', name='forecast'))
fig4.update_layout(xaxis_title='Date', yaxis_title='COVID-19 Patients', title="Forecasted number of patients vs updated actual patients in South Korea")
fig4.show()
In [67]:
display(exp_smoothing_forecast) 

# Actual updated count
display(new_patient_df_date_group['confirmed_date'].count().cumsum()[-forecast_days:])
2020-03-05    6163.586151
2020-03-06    6526.296678
2020-03-07    6856.363258
2020-03-08    7156.723845
2020-03-09    7430.051980
2020-03-10    7678.780583
Freq: D, dtype: float64
confirmed_date
2020-03-05    6284
2020-03-06    6769
2020-03-07    7133
2020-03-08    7381
2020-03-09    7512
2020-03-10    7754
Name: confirmed_date, dtype: int64
In [68]:
from sklearn.metrics import mean_squared_error

y_true = np.array(updated_confirmed_case_cumsum[-forecast_days:])
y_pred = np.array(forecasted_df.values[-forecast_days:])

forecast_rmse = mean_squared_error(y_true, y_pred, squared=False)
md("#### My forecasted results have an RMSE of approx. {} counts".format(round(forecast_rmse)))
Out[68]:

My forecasted results have an RMSE of approx. 188.0 counts

I will calculate the Mean Absolute Percentage Error (MAPE) of my forecast to better assess its accuracy

In [69]:
def mean_absolute_percentage_error(y_true, y_pred):
    if len(y_true) > 1 and len(y_pred) > 1 and len(y_true) == len(y_pred):
        return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    else:
        print("check if y_true and y_pred are same length arrays ")
In [70]:
forecast_mape = mean_absolute_percentage_error(y_true, y_pred)
md("#### My forecasted results have a MAPE of {} %".format(round(forecast_mape, 2)))
Out[70]:

My forecasted results have a MAPE of 2.41 %


Section 5: Update on forecast

Let's check how my forecast model is doing as of April 27, 2020

In [71]:
new_forecast_days = 54   # 54 days since last dataset 
In [72]:
updated_forecasted_df, model = get_exp_smoothing_forecast(end_date='2020-04-27', forecast_days=new_forecast_days)
c:\users\shafi\desktop\data science project\venv\lib\site-packages\statsmodels\tsa\holtwinters.py:744: ConvergenceWarning:

Optimization failed to converge. Check mle_retvals.

In [73]:
updated_forecasted_df
Out[73]:
2020-01-20        1.000000
2020-01-21        1.672011
2020-01-22        1.916812
2020-01-23        1.953207
2020-01-24        2.000000
                  ...     
2020-04-23    10154.043745
2020-04-24    10157.613089
2020-04-25    10160.861192
2020-04-26    10163.816965
2020-04-27    10166.506719
Freq: D, Length: 99, dtype: float64

Reported number of coronavirus (COVID-19) confirmed cases in South Korea as of April 27, 2020:

10,738

My forecast for April 27th:

10166

In [74]:
percentage_diff = (10738 - 10166) / 10738 * 100 
md("### My forecasted result for April 27th is off by {} %".format(round(percentage_diff, 2)))
Out[74]:

My forecasted result for April 27th is off by 5.33 %
