The novel coronavirus disease 2019 (2019-nCoV, or COVID-19) epidemic first broke out in Wuhan, China.Reference Zhu, Wei and Niu1,Reference Parry2 The virus was identified in the second half of December 2019.3 The epidemiological features of the disease are still unknown, and the number of total cases and deaths varies daily. In the wake of its rapid spread and reports revealing the crucial consequences of this spread, countries adopted strict measures to tackle the disease. However, confirmed positive cases were recorded after the second half of January 2020. Mathematical models used to identify the quantitative description of the outbreak of COVID-19 in this study may provide significant insight into the cessation of the spread of the novel coronavirus.Reference Siegenfeld, Taleb and Bar-Yam4 Various indicators are used in the models to describe the course of the outbreak. Among these indicators, the total number of confirmed cases and the total number of deaths are the most commonly used ones.
The objective of this study is to compare the various nonlinear and time series models in describing the course of the COVID-19 outbreak in China. To present this, we focused on 2 indicators: the number of total cases diagnosed with the disease and the number of deaths.
METHODS
Data Management
We obtained daily updates of the cumulative number of reported confirmed cases and deaths for the 2019-nCoV pandemic of China between January 22 and June 18, 2020, from Worldometer and WHO websites.5,6 In this study, we focus on China. Because it is not only the country where the novel coronavirus emerged and has spread throughout the world but is also the country that has been fighting against the coronavirus for the longest time. Also, due to unpreparedness for the outbreak, the studied period can be observed as a sample of natural course, especially in the first month.
Models for Describing the Course of the Outbreak
The models that we apply for the abovementioned indicators can be categorized into 2 categories: (1) nonlinear growth curves including the Weibull, negative exponential, Von Bertalanffy, Janoscheck, Lundqvist-Korf, and Sloboda models (Table 1)Reference Panik7-Reference Weibull13; and (2) time series models including Box-Jenkins and exponential smoothing methods (Tables 2 and 3).
Y t is the observed dependent variable named the number of total cases and the number of total deaths, and t is the independent variable named as time (Table 1). In our models, t is the day. The A term is the asymptotic limit of the number of total cases and the number of total deaths as time goes to infinity; B is the proportion of the number of total cases to the number of total deaths. The k term is the proportion of the maximum increase rate to the highest number of cases or deaths. γ, c, and d are the changing points that occur when the change in the estimated increase rate goes from increase to decrease.
AR (p) is the p th degree of autoregressive series.Reference Wei14 MA (q) refers to the moving average model of order q. In this series, ε_t~WN(0,σ 2 ) is the white noise series.Reference Montgomery, Johnson and Gardiner15 The ARMA(p, q) model is expressed by both AR (p) and MA (q) processes.Reference Cryer16
In the Holt method, L t is the new smoothed value, α is the smoothing coefficient, (0 < α < 1), Y t is the actual value at t th period, β is the smoothing coefficient for trend estimation, (0 < β < 1), T t is the trend predicted value, p is the number of forecasting periods and $${\overline y_{t + p}}{\rm{\;}}$$ is the forecasting value after p period.Reference Holt17 In the damped trend method, if 0 < φ < 1, the trend is damped, if φ = 1, the equations become identical to Holt’s linear trend method. Tashman and Kruk (1996) suggested that there may be value in allocating φ > 1, if applied in a series with a strong tendency, with exponential trend.Reference Tashman and Kruk18 The Brown’s single parameter linear exponential smoothing model is more suitable if there is an increasing or decreasing trend in the time series. In this model, the initial equations $$y_t^1$$ and $$y_t^2$$ are obtained by single exponential smoothing and double exponential smoothing, respectively.Reference Armutlu19 For the estimation of post m process, the equation is given below.Reference Orhunbilge20
In exponential smoothing methods, the estimations are constantly updated, taking into account recent changes in data.Reference Kadılar21 In these methods, the weighted average of past period values is calculated and taken as the estimated value of future periods.
Estimation accuracy of the applied methods were evaluated with BIC, R2 and MSE. BIC was developed by Gideon E. Schwarz (1978), who gave a Bayesian argument for adopting it.Reference Schwarz22
Where $$\hat {\it \sigma} _e^2$$ is the error variance
RESULTS
Results of Nonlinear Growth Models for the Number of Total Cases
The parameters estimated and goodness of fit measures of the nonlinear models between January 22 and June 18, 2020, in China were presented in Table 4. R2 and MSE statistics were used to compare models. The R2 and MSE values of the Weibull and Janoscheck models were equal. The MSE of the Sloboda model was slightly smaller than the Weibull and Janoscheck models, but R2 was equal. The Sloboda model can be considered the most suitable model, as it has a smaller MSE value and a larger pseudo R2 value. The Weibull and Janoscheck models could also be chosen as alternative models.
The curve for prediction of nonlinear growth models are given in Figure 1.
Results of Time Series Models for the Number of Total Cases
Box-Jenkins and exponential smoothing methods were chosen from the various time series models available for the total number of cases. Autocorrelation (ACF) and partial autocorrelation (PACF) graphs of the series were examined. When the ACF and PACF graphs in Figure 2 were examined, the first degree difference was taken because the series was not stationary at that level. But the stationary assumption had not been provided yet. The difference from the second degree was taken and the series became stationary. According to the ACF and PACF charts, the series quickly approached zero after the first delay in the ACF graph. In this case, because p = 0, d = 2, and q = 1, it was modeled by the integrated first degree moving averages method. In other words, the most suitable time series method was the ARIMA(0,2,1) model. In addition, exponential smoothing methods were used and the model performances were given in Table 5.
The performance of the ARIMA(0,2,1) model is given in Table 6, and it is observed that this model’s fits are successful as nonlinear models.
The parameters estimated of the ARIMA(0,2,1) model are given in Table 7.
The ARIMA(0,2,1) model was found to be the most appropriate among different time series models. This model can be written as follows:
Forecasting data for future 30 d are given in Table 8.
The number of total cases continues increasingly, albeit at a low speed. The number of total cases is predicted to be 85,589 on July 18, 2020 (Table 8). Observed and predicted values of the total cases are given in Figure 3.
Results of Nonlinear Growth Models for the Number of Total Deaths
The parameters estimated and goodness of fit measures of the nonlinear models for the number of total deaths are presented in Table 9. The most suitable models for predicting the number of total deaths are the Lundqvist-Korf and Sloboda models, respectively (Table 9). The R2 values of these models were found to be the highest at 0.963 and also the MSE values of these were lower than the others. The Lundqvist-Korf model can be considered the most suitable one, because mean square error (MSE) is smaller than other models. The curve for prediction of nonlinear models are given in Figure 4.
Results of Time Series Models for the Number of Total Deaths
The most suitable time series model was found to be the Brown linear trend exponential smoothing model among time series models for the number of deaths. The goodness of fit of the various models are given in Table 10, and it was observed that the predictions are as successful as nonlinear models.
The parameters estimated of the Brown linear trend exponential smoothing model are presented in Table 11. The observed and predicted values are given in Figure 5.
The forecasts of the number of total deaths using Holt’s linear trend exponential smoothing model for 30 d are given in Table 12. The rate of increase in the number of deaths in China was decreasing, and it was predicted that the number will be between 3343 and 3355 in the period between June 19 and July 18, with a slight increase (Table 12). The Holt linear trend exponential smoothing curve for the exponential smoothing model is given in Figure 5.
DISCUSSION
In this study, we found that the Sloboda model for the number of total cases and the Lundqvist-Korf model for the number of total deaths were the best explanatory models among the nonlinear models used in the study. Also, the ARIMA(0,2,1) model for the number of cases and the Brown linear trend exponential smoothing model for the number of deaths were the most suitable models among the time series models used in the study.
In a different study, the ARIMA model was used on the daily prevalence data of COVID-2019 from January 20, 2020, to February 10, 2020, and the ARIMA(1,2,0) and ARIMA(1,0,4) models were obtained.Reference Benvenuto, Giovanetti and Vassallo23 The logistics, Bertalanffy, and Gompertz models were previously used to estimate the number of cases and deaths from COVID-19 in different regions in China by Jia et al. (2020).Reference Jia, Li and Jiang24 In their study, the Logistics model was reported to be better than the others. They conducted an extensive research with quasi-experimental analysis methods in various provinces in China and investigated the relationship between population and the number of outbreak cases. Accordingly, they found that the correlation coefficients of the relationship between the population and the number of cases differed by regions. They observed that the number of cases was higher in regions with high populations and that there was a high correlation between them. They concluded factors such as immigration, tourism, and mobility play an important role in this situation. The authors also determined the number of cases using the epidemic growth model.Reference Jia, Li and Jiang24,Reference Fan, Liu and Guo25
On the other hand, Roosa et al. (2020) analyzed the number of cases in some regions of China using the generalized logistic growth model (GLM), the Richards Model and the sub-epidemic model for a short time period (10 d). They found that the number of confirmed cases will continue to increase. They estimated that the predicted case increase (GLM) in the Guangdong and Zhejiang regions would be lower by using the Richards models and that it would be higher using the sub-epidemic model.Reference Roosa, Lee and Luo26
In a study on the risk of infection when COVID-19 was detected in a cruise ship in China in February 2020, it was noted that the risk of infection among people who have close contact was higher than those who maintained a social distance from others. The estimated number of cases was obtained by the back-calculation method.Reference Nishiura27 Al-qaness et al. (2020) used the Adaptive Neuro-Fuzzy Inference System (ANFIS), the Flower Pollination Algorithm (FPA), the Salp Swarm Algorithm (SSA), and the FPASSA-ANFIS method to estimate the number of cases of COVID-19 in China and the United States. They calculated model performance using root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), root mean squared relative error (RMSRE), and R2. They found that the best method for modeling and estimating the number of total cases was the FPASSA-ANFIS method.Reference Al-qaness, Ewees and Fan28
Kuniya (2020) estimated the outbreak peak of coronavirus disease in Japan using the susceptible-exposed-infected-removed (SEIR) compartmental model.Reference Kuniya29 In another study, the reproduction number of the Wuhan novel coronavirus 2019-nCoV was estimated using the SEIR compartment model.Reference Zhou, Liu and Yang30 There are many studies on coronavirus disease by various researchers using different statistical methods. Among these studies, the following are highlighted: Yuan et al. (2020) used the median (interquartile range, IQR) and Mann Whitney U test or Wilcoxon test. Twu et al. (2020), Prem et al. (2020), and Neher et al. (2020) used the SEIR model.Reference Yuan, Yin and Tao31-Reference Neher, Dyrdak and Druelle34
In our study, we compared the time series analysis using the Weibull, negative exponential, Von Bertalanffy, Janoscheck, Lundqvist-Korf, and Sloboda models, which are different from the methods used in previous studies. Based on our extensive literature review, this study has been the first and most comprehensive study based on the nonlinear models as we discussed in detail.
CONCLUSIONS
While some models are simple and give general results, some are complex and provide detailed information, but their results cannot be generalized.Reference Siegenfeld, Taleb and Bar-Yam4 Models that were used in the initial phase of the outbreak can be misleading because of a lack of sufficient data. Therefore, short-term predictions should be made for the early stages of the epidemic, and the effects of any measures taken in this process must be taken into consideration by virtue of their results. As for the further stages of pandemics, different models can be used to understand biological systems and to develop models which can be used for the simulation for future similar situations. However, although model assumptions are mostly incompatible with real-world problems, they can capture general behavior and predict the rate of the spread of the outbreak. If large-scale behaviors of a system are correctly identified, certain details can be understood in terms of their impact on these behaviors. Statistical or data-based models that fit curves of the past temporal prevalence of a disease, do not make any assumptions about the internal mechanisms that a mathematical model provides and, hence, have become more popular in infectious diseases. Because the major use of these models is to fit past data and estimate the future, it can also be used for different patterns of the epidemic.
As a result of the literature review, it was observed that the Sloboda model and Lundqvist-Korf model, which gave the best results among the nonlinear models used in this study, have never been used for modeling COVID-19 outbreak indicators before. Our most recent forecasts remained relatively stable. This reflects the impact of the measures implemented by the China government, which likely helped to stabilize the pandemic. The forecasts presented here are based on the assumption that current mitigation efforts will continue. In addition, comparing with other modeling studies on COVID-19, results were obtained for longer periods. Therefore, the results in this study are more favorable in terms of comprehending the biological structure of the outbreak and producing preliminary information for possible similar conditions in the future.
Author Contributions
S.C. and H.A conducted forecasts and data analysis; all authors contributed to writing and revising subsequent versions of the manuscript. All authors read and approved the final manuscript.
Conflict of Interest
The authors declare no conflict of interest.