Comparison of machine learning models for coronavirus prediction


Introduction
Coronavirus disease 2019 (Covid-19) is a severe acute respiratory illness caused by the SARS-CoV-2 virus. This virus, which can infect humans and animals, was first identified in the city of Wuhan, in the Chinese province of Hubei, during the pneumonia epidemic of January 2020 [1,2]. It is the seventh coronavirus known to infect humans. To everyone's surprise, the virus spread worldwide, causing 318,599 deaths and infecting 4,806,299 people [3].
Since the start of the Covid-19 epidemic, there has been much speculation about the origin of the virus [8]. Some claimed that it was the result of work done in a laboratory. However, studies of genetic data dismissed this hypothesis [9]. Analysis and comparison with the genomes of previously known coronaviruses clearly show that SARS-CoV-2 is different from other coronaviruses [8,11]. SARS-CoV-2, the virus responsible for Covid-19, is similar to the SARS-like coronaviruses of bats [2]. Thus, it is believed to have originated from a bat coronavirus that became infectious to humans while acquiring genes specific to pangolin coronaviruses. The actual origin of Covid-19 is still unclear.
The symptoms of Covid-19 are similar to those of seasonal flu. The disease is more severe in the elderly and in people with certain chronic diseases. Patients with Covid-19 can have symptoms ranging from mild to severe. The most common symptoms are fever (83 %), cough (82 %) and breathlessness (31 %) [12]. In patients with pneumonia, the X-ray of the lungs shows multiple mottling and ground-glass opacity [12,13].
We also see a decrease in lymphocytes and eosinophils, lower haemoglobin levels, and an increase in white blood cells and neutrophils [15][16][17][18].
Like other respiratory viruses, SARS-CoV-2 is transmitted mainly by the respiratory route. Droplet transmission is the most widespread route [28,29]. Other transmission routes exist, namely the faecal route and transmission via saliva. Indeed, SARS-CoV-2 RNA was found in the stool of a patient with Covid-19 [31]. SARS-CoV-2 RNA can also be detected on inanimate surfaces such as door handles, and people who have been in contact with these surfaces could be contaminated [29].
This paper proposes a prediction model that identifies positive and negative cases in the dataset studied, along with the features associated with COVID-19. By tracking results on the epidemic situation, the model is intended to help assess the economic losses, the community spread, and the extent of social distancing, so that precise decisions can be made accordingly. In future work, this method could help government authorities put preventive measures in place by predicting the onset of the disease.

Data Resources and Methods
The dataset used is hosted on Kaggle and is openly available at kaggle.com/einsteindata4u/covid19. It contains data from patients seen at the Israelita Albert Einstein Hospital in São Paulo, Brazil, anonymized in accordance with best international practices and recommendations. This section describes the proposed approach and gives a detailed overview of the tasks. These tasks can help to understand and extract knowledge from COVID-19 data, which in turn can help countries contain the spread of the virus, raise awareness, launch initiatives, determine whether mitigation has a positive effect, and identify other factors affecting the virus. This will allow countries to prepare for what may happen in the near future, which could help save lives and reduce suffering. Epidemiological information includes various characteristics of each case studied, including case identification, age, sex, target value, lymphocytes, leukocytes, monocytes, HCO3, etc.

Data Pre-processing
Pre-processing is one of the most important steps in data analysis. The specific pre-processing methods applied to this dataset are not described in detail.

Data Transformation
The data is transformed and stored in .xls format for further processing. All features were normalized to have a mean of zero and a unit standard deviation. Of the 111 characteristics in the dataset, 78 were eliminated because of missing values, and the 33 important characteristics were retained. This exploratory analysis of the data also allowed us to identify two categories of characteristics, namely virus-related characteristics and blood-related characteristics. The target value is divided into two categories: negative cases, coded as 0, and positive cases, coded as 1.
The dataset from the Israelita Albert Einstein Hospital in São Paulo is divided into training and test data. 70 % of the data is used for predictive model training, and the remaining 30 % is used for testing. The objective of model training is to fit the model using data from the training set. After the models are trained, they are tested to evaluate their performance on the test dataset.
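As an illustration of the normalization step described above, the following minimal sketch rescales one feature to zero mean and unit standard deviation. The function name `standardize`, the sample values, and the use of the population standard deviation are assumptions made for the example; the paper does not state which tool or variant was used.

```python
import math

def standardize(column):
    """Rescale a list of numbers to mean 0 and standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    # Population standard deviation (divide by n); this choice is an assumption.
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mean) / std for v in column]

# Hypothetical values for one blood-related feature (not taken from the dataset).
leukocytes = [4.0, 6.0, 8.0, 10.0]
scaled = standardize(leukocytes)
```

After this transformation, features measured on different scales contribute comparably to distance-based models such as KNN and SVM.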
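The 70/30 split can be sketched as follows. This is a minimal illustration that assumes a simple random shuffle with a fixed seed; the paper does not say whether the split was shuffled or stratified, and `train_test_split` here is a hypothetical helper written for this example, not a library call.

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Row indices standing in for the 5,644 tested patients.
rows = list(range(5644))
train, test = train_test_split(rows)
```

With classes as imbalanced as in this dataset, a stratified split that preserves the positive/negative ratio in both parts would be a reasonable alternative.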

The Proposed Models
This section describes the different machine learning models used in this paper. These models are: Random Forest (RF), K-Nearest Neighbors (KNN), Linear Support Vector Machine (SVM), Logistic Regression (LR), Decision Tree (DT), and AdaBoost (AB).

Random Forest (RF)
Random forests (RF), or random decision forests, were first proposed in 1995. This is a general classification method that tends to work better than a single traditional decision tree (Gangaie et al., 2019). Decision trees are the base classifiers of an RF: each tree votes on the prediction, and the final prediction is the majority vote across the trees (Breiman, 2001). The accuracy of each tree and the independence of the trees from each other determine the reliability of the classification. We used 100 trees to predict the two target classes, negative and positive Covid-19 cases.
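The majority-voting step can be sketched in a few lines. The individual tree predictions below are hypothetical, and the sketch covers only the voting step, not the training of the trees.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for one patient (0 = negative, 1 = positive).
votes = [1, 0, 1, 1, 0]
forest_prediction = majority_vote(votes)
```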

K-Nearest Neighbor (KNN)
The K-Nearest Neighbor (KNN) classifier is one of the most commonly used classification algorithms and can be used in many applications. It stores all available instances and classifies new instances according to a similarity measure. KNN is a statistical pattern recognition method for assigning an instance to one of several classes. A tree data structure is used to determine the distance between the point of interest and the points in the training dataset. An instance is classified by a vote of its neighbors. The value of k is a positive integer that specifies how many nearest neighbors are considered. The nearest neighbors are selected from a set of objects whose classes or property values are known.
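A minimal sketch of the KNN decision rule is given below. For brevity it computes Euclidean distances by brute force rather than with a tree data structure, and the points, labels, and the choice k = 3 are assumptions for illustration only.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Brute-force Euclidean distances; a tree structure would speed this up.
    distances = sorted(
        (math.dist(p, query), label)
        for p, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical standardized two-feature samples (0 = negative, 1 = positive).
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = [0, 0, 0, 1, 1]
prediction = knn_predict(points, labels, query=(0.95, 1.0), k=3)
```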

Support Vector Machine (SVM)
The support vector machine (SVM) is a supervised learning method used for classification and regression [29]. This algorithm is a relatively new approach and has performed well in recent years. The SVM classifier is based on linear classifiers: when the data can be separated by a line (or, in higher dimensions, a hyperplane), the SVM separates the objects into the specified classes. It can also classify instances that were not present in the training data. One extension of this algorithm performs regression analysis to obtain a linear function, and another extension learns to rank elements to obtain a classification of individual elements.
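Once trained, a linear SVM classifies a sample by checking on which side of the separating hyperplane it falls. The sketch below shows only this decision step; the weights and bias are hypothetical rather than fitted values, and the 0/1 output coding follows the target coding used in this paper.

```python
def linear_svm_predict(weights, bias, x):
    """Return 1 if x lies on the positive side of the hyperplane, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else 0

# Hypothetical hyperplane parameters for two standardized features.
weights, bias = [1.0, -1.0], 0.0
positive_side = linear_svm_predict(weights, bias, [2.0, 1.0])
negative_side = linear_svm_predict(weights, bias, [1.0, 2.0])
```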

Logistic Regression Model (LR)
Logistic regression is the appropriate regression analysis to perform when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe the data and explain the relationship between a binary dependent variable and one or more nominal, ordinal, interval, or ratio-level independent variables [30,31]. This approach assumes that the binary result follows a binomial distribution.
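To make the link between the predictors and the binary outcome concrete, the sketch below computes the probability of the positive class with the logistic (sigmoid) function. The coefficients and the sample are hypothetical, not fitted values from this study.

```python
import math

def predict_proba(weights, bias, x):
    """Probability of class 1 under a logistic regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two standardized features.
p_positive = predict_proba([2.0, -1.0], -0.5, [1.0, 0.5])
predicted_class = 1 if p_positive >= 0.5 else 0
```

The 0.5 threshold is the usual default; with imbalanced classes a lower threshold may be chosen to favor recall.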

Decision Tree (DT) Model
The Decision Tree is a supervised learning method that can be used to solve classification and regression problems, although it is more often used for classification. It is a powerful classification method for disease prediction. It is a tree model in which the internal nodes represent the characteristics of a dataset, the branches represent the decision rules, and each leaf node represents a result. The decision tree consists of two types of nodes, decision nodes and leaf nodes. Decision nodes have multiple branches and are used to make a decision, while leaf nodes are the result of those decisions.
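The structure of decision nodes and leaf nodes can be sketched with nested dictionaries. The tree below is entirely hypothetical: the feature names appear in the dataset description, but the thresholds and labels are invented for illustration and have no clinical meaning.

```python
# Decision nodes hold a feature and a threshold; leaf nodes hold a label.
tree = {
    "feature": "leukocytes",
    "threshold": 0.0,
    "left": {"label": 1},
    "right": {
        "feature": "monocytes",
        "threshold": 0.5,
        "left": {"label": 0},
        "right": {"label": 1},
    },
}

def predict(node, sample):
    """Follow decision rules from the root until a leaf node is reached."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

result = predict(tree, {"leukocytes": 1.0, "monocytes": 0.2})
```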

AdaBoost (AB) Model
AdaBoost, short for "Adaptive Boosting", is the first boosting algorithm, proposed by Freund and Schapire in 1996. Its goal is to combine weak predictors into a strong predictor to solve classification problems. For classification, the final model can be written as:
$$F(x) = \operatorname{sign}\left(\sum_{m=1}^{M} \theta_m f_m(x)\right)$$
where $f_m$ denotes the $m$-th weak classifier and $\theta_m$ denotes the corresponding weight. AdaBoost can be used for face detection, as it is a standard algorithm for detecting faces in images. AdaBoost is fast, requires little parameter tuning, and is simple and easy to program. It also has the flexibility to be combined with any machine learning algorithm.
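The weighted vote of weak classifiers can be sketched as below. This shows only the final prediction step, the sign of the weighted sum of weak-classifier outputs, and not the boosting procedure that learns the weights. The three decision stumps and their weights are hypothetical, and classes are coded as -1 and +1 here, as is conventional for AdaBoost, rather than 0 and 1.

```python
def adaboost_predict(weak_classifiers, weights, x):
    """Sign of the weighted sum of weak-classifier votes (+1 or -1)."""
    total = sum(theta * f(x) for f, theta in zip(weak_classifiers, weights))
    return 1 if total >= 0 else -1

# Hypothetical decision stumps, each returning +1 or -1.
stumps = [
    lambda x: 1 if x[0] > 0 else -1,
    lambda x: 1 if x[1] > 0 else -1,
    lambda x: 1 if x[0] + x[1] > 1 else -1,
]
thetas = [0.5, 0.3, 0.9]  # hypothetical weights of the weak classifiers

vote_a = adaboost_predict(stumps, thetas, (1.0, -1.0))
vote_b = adaboost_predict(stumps, thetas, (1.0, 1.0))
```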

Evaluation of Performance Measures
To compare the different classification algorithms used in this paper, several metrics were evaluated. These are accuracy, recall, and F1-score. These metrics are calculated based on true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The normalized confusion matrix illustrates the relationship between the actual classes and the predicted classes. The level of classification performance is calculated from the number of samples correctly and incorrectly classified in each class. Accuracy is calculated from the total number of correct predictions, defined as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Бру К. А. и др. Сравнение моделей машинного обучения для прогнозирования коронавируса
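Recall, or sensitivity, is the proportion of actual positive cases that have been correctly identified, defined as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
The F1 score is the harmonic mean of precision and recall, where precision is $TP / (TP + FP)$, and it is calculated by:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$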
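The metrics can be computed directly from the four confusion-matrix counts, as in the minimal sketch below. It assumes the standard definitions, with F1 as the harmonic mean of precision and recall; the counts passed in are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for illustration only.
metrics = classification_metrics(tp=30, tn=50, fp=10, fn=10)
```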

Result
The objective of this paper is to compare different machine learning models for the detection of coronavirus. Our task was to find out which machine learning model has the best recall and F1-score for Class 1 (positive cases). The machine learning models used are: random forest, k-nearest neighbor, logistic regression, support vector machine, AdaBoost, and decision tree. Out of a total of 5,644 people tested for COVID-19, 5,086 people tested negative and 558 people tested positive. The results of our study are presented in Figure 1 and Figure 3. These results show that the support vector machine gave the best results, with a recall of 75 % and an F1 score of 60 %. The learning curves were also plotted in order to detect over-fitting and under-fitting (Figure 2). The learning curve, a tool well known to data scientists, shows the efficiency and quality of learning of a machine learning model. Learning curves are widely used as a diagnostic tool in machine learning for algorithms that learn incrementally from a training dataset. This means that we increase the size of the training dataset by a certain step and then observe the performance of the model. The model can be evaluated on the training dataset and on a hold-out validation dataset after each update during training, and the measured performance is plotted as a curve.
Vol. 22, No. 1. P. 67−75. ISSN 2687−
Fig. 3. Results of predictions from various machine learning techniques
Figure 3 shows the performance of the different machine learning algorithms according to the performance measures used in this paper. We see that for recall and F1-score, LSVM outperforms the other machine learning models used, namely LR, KNN, RF, AB, and DT. For accuracy, LR is much better than the others. More broadly, LR and AB achieved higher accuracy than the other models. In this paper, we chose recall and F1 score to measure the performance of the model.
Recall measures how many of the actual positive cases were correctly identified as Covid-19 positive. As for the F1 score, we used it because the classes, i.e., positive and negative cases, are imbalanced.
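The effect of this class imbalance can be checked with a short calculation using the case counts reported in this section. The sketch below assumes a trivial baseline that labels every patient as negative; this baseline is a hypothetical comparison point, not one of the models evaluated in the paper.

```python
negatives, positives = 5086, 558  # case counts reported in this section
total = negatives + positives

# A baseline that predicts "negative" for every patient.
tp, fn = 0, positives
tn, fp = negatives, 0

accuracy = (tp + tn) / total  # about 0.901
recall = tp / (tp + fn)       # 0.0: no positive case is detected
```

Such a baseline reaches roughly 90 % accuracy while detecting no positive case at all, which is why recall and F1 score are more informative than accuracy for this dataset.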

Discussion and Conclusion
The data used in this paper was collected at the Israelita Albert Einstein Hospital in São Paulo, Brazil. After an exploratory analysis, two categories of characteristics were identified: characteristics related to the virus and characteristics related to the blood. Out of a total of 5,644 people tested for COVID-19, 5,086 people tested negative and 558 people tested positive. The results of this study show that, in relation to our goal, the support vector machine gave the best results in coronavirus detection, with a recall of 75 % and an F1 score of 60 %. This comparison was made against the other machine learning models, namely random forest, k-nearest neighbor, logistic regression, AdaBoost, and decision tree. As such, this model can be useful for the diagnosis of COVID-19. However, its parameters could be optimized further in order to improve its performance.
After analyzing the learning curves in Figure 2, we find that, apart from the support vector machine, other machine learning models can be studied for the detection of COVID-19. These include AdaBoost and k-nearest neighbor. Indeed, with somewhat more advanced optimization of their parameters, these models could be candidates for the diagnosis of COVID-19, because narrowing the gap between the training score curve and the validation score curve would improve their ability to generalize.