Prediction and Diagnosis of Diabetes by Using Data Mining Techniques

Background: Diabetes mellitus (DM) is one of the most common diseases in the world. Complications of this disease include nephropathy, cardiac arrest, blindness, and even limb amputation. Accurate diagnosis of this condition is therefore very important. Objectives: This study aimed to identify and provide a model for the diagnosis of DM using data mining. Methods: The data used in this study were obtained from 768 women aged 21-81 years. Nine variables were selected for investigation. Neural network, Bayesian network, C5.0, and support vector machine models were compared in terms of their accuracy in predicting diabetes. Clementine 12 software was used to analyze the data. Results: The proposed classification method, based on the C5.0 algorithm, achieved an accuracy of 80.2% and a specificity of 87.5%. Compared with similar studies, it diagnosed people with diabetes more accurately; glucose, body mass index, and age were the important variables in this study. Conclusion: The C5.0 algorithm showed the highest accuracy, specificity, and sensitivity compared with the other methods studied. Therefore, the C5.0 algorithm probably performs the best classification among the algorithms compared and is recommended as the best method for diabetes prediction using the available data.

the accuracy of the diagnosis one year after development of type 2 diabetes so that they made an accurate diagnosis in 94% of patients (8).
In 2011, in a study in Turkey on the type and amount of different drugs used to treat patients with type 2 diabetes, 6 different drug combinations were tested on patients, and a model was then designed based on data mining, including a fuzzy neural network and dependency rules. Using this model, the correct drug combination and the correct dosage were achieved for 80% of the patients (9). In another study, researchers achieved a prediction accuracy of 73.32% with a neural network algorithm (10).
Considering the above, this study aimed to identify and provide an appropriate model for the diagnosis of diabetes.

Methods
This study followed the CRISP methodology, which includes 5 phases: problem identification, data collection and description, data preparation, data modeling, and data evaluation (12).
Understanding the problem: This phase addresses the goal of the study. Because of inappropriate diet and physical inactivity, the number of people with diabetes is increasing day by day. The goal is to provide a model that predicts the likelihood of a diagnosis of diabetes, so that the disease can be diagnosed rapidly and inexpensively.
Data collection and description: Most machine learning studies on diabetes are based on the Pima Indian data set from the UCI repository (13). This data set contains 9 variables, 8 of which were considered input variables and one of which served as the response variable. The data set contains 768 records in two classes: the first class includes 500 healthy women and the second class includes 268 women with DM. The women were aged 21-81 years (Table 1).
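As a point of reference for the accuracy figures reported later, the class counts given above imply a simple baseline: a model that always predicts the larger class. The short sketch below works out that baseline from the stated counts; the function names are illustrative and are not part of the study's software.

```python
from typing import Tuple

# Class counts as stated in the data description above.
N_HEALTHY = 500
N_DIABETIC = 268


def class_priors(n_negative: int, n_positive: int) -> Tuple[float, float]:
    """Return the fraction of negative and positive records."""
    total = n_negative + n_positive
    return n_negative / total, n_positive / total


def majority_baseline_accuracy(n_negative: int, n_positive: int) -> float:
    """Accuracy of a trivial model that always predicts the larger class."""
    return max(n_negative, n_positive) / (n_negative + n_positive)


p_healthy, p_diabetic = class_priors(N_HEALTHY, N_DIABETIC)
baseline = majority_baseline_accuracy(N_HEALTHY, N_DIABETIC)
print(f"healthy: {p_healthy:.1%}, diabetic: {p_diabetic:.1%}, baseline: {baseline:.1%}")
# healthy: 65.1%, diabetic: 34.9%, baseline: 65.1%
```

Because the classes are unbalanced, accuracy alone can look favorable even for a weak model, which is one reason sensitivity and specificity are also worth reporting.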

Data Preparation
When data are collected from the environment, they may contain errors such as incompleteness, noise, etc. Therefore, the data must be preprocessed before the data mining methods are applied. This phase includes data cleaning, data integration, data transformation, and data reduction.
One of the well-known methods used for data exploration and data cleaning is the scatter diagram (Figure 1), which is available in most data mining and statistical analysis software.
As observed, each point in the diagram corresponds to one record (row) of the data. Points that lie close to the others are consistent with the rest of the data, whereas points that lie far away do not agree well with the characteristics of the other data. If a sample has more such distant values than other samples, its data should be cleaned. If the data come from several data sets, they need to be integrated into one data set. Finally, using feature selection in the software, we eliminate variables that have a less significant impact on the prediction of the disease.

Data Modeling
Classification was used to predict the probability of diabetes. Classification is one of the most commonly used machine learning methods for prediction on medical data (14). For the classification algorithms, the data were divided into training and test parts in a ratio of 75 to 25, and the models were created using 10-fold cross-validation. The 10-fold method randomly divides the whole data set into 10 sections; each time, one of the 10 sections is used as test data and the other 9 are used as training data. Increasing the number of sections may give more favorable results but is more time-consuming. Finally, we build the model on the training data set and use the test data set to investigate the accuracy of the model. The classification algorithms used include the decision tree, SVM, Bayesian network, and neural networks.
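The 75 to 25 division and the 10-fold scheme described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the procedure implemented in Clementine; the function names, the fixed random seed, and the round-robin fold assignment are assumptions made for the example.

```python
import random
from typing import List, Tuple


def holdout_split(n_records: int, train_fraction: float = 0.75,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle record indices and split them into training and test lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    cut = int(n_records * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(n_records: int, k: int = 10,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs; each record is tested exactly once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


train_idx, test_idx = holdout_split(768)
print(len(train_idx), len(test_idx))  # 576 192

splits = k_fold_indices(768, k=10)
print([len(test) for _, test in splits])  # eight folds of 77, two of 76
```

In each of the 10 rounds a model would be fitted on the `train` indices and scored on the `test` indices, and the 10 scores would then be averaged.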

Decision Tree
The decision tree is one of the most powerful tools for classification and prediction and is capable of generating understandable representations of the relationships existing in a data set. The decision structure can also be expressed through mathematical and computational techniques that help describe, categorize, and generalize a data set. The decision tree provides a systematic way of facilitating future decisions. The most important feature of the decision tree is its ability to break a complex decision-making process into a simpler set of decisions that can easily be interpreted (15). The decision tree gives an explicit description of the classification by decomposing it into successive tests. The tree structure is similar to a flowchart: the highest node is the root of the tree, the branches represent the outcomes of the tests, and the leaves represent the categories or the distribution of the categories (16). The rules created by the decision tree are expressed in "if-then" form. C5.0 is a well-known algorithm for building decision trees.
The ID3 algorithm can express the classification either as a decision tree or as a set of rules (17). In many applications, the set of rules is preferred because rules are easier to understand than a decision tree.
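To illustrate how a single "if-then" rule can be derived, the sketch below picks a threshold on one variable by maximizing information gain, the entropy-based criterion used by the ID3 family of decision tree algorithms. It is a simplified illustration rather than the C5.0 implementation, whose internals are not described here; the glucose values and labels are hypothetical toy data, and the function names are chosen for the example.

```python
import math
from typing import List, Sequence, Tuple


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = list(labels).count(cls) / n
        result -= p * math.log2(p)
    return result


def information_gain(values: Sequence[float], labels: Sequence[int],
                     threshold: float) -> float:
    """Entropy reduction obtained by splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def best_threshold(values: Sequence[float],
                   labels: Sequence[int]) -> Tuple[float, float]:
    """Try midpoints between sorted distinct values; return (threshold, gain)."""
    distinct = sorted(set(values))
    candidates: List[float] = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    scored = [(t, information_gain(values, labels, t)) for t in candidates]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy data: plasma glucose and class (1 = diabetic, 0 = healthy).
glucose = [85, 90, 100, 110, 140, 150, 160, 180]
diabetic = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, gain = best_threshold(glucose, diabetic)
print(f"if glucose <= {threshold} then healthy else diabetic (gain = {gain:.2f} bits)")
# if glucose <= 125.0 then healthy else diabetic (gain = 1.00 bits)
```

A full tree repeats this search over all variables at every node and stops when the leaves are sufficiently pure; each root-to-leaf path then reads as one rule.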

Support Vector Machine
This algorithm is based on the linear classification of data: among the lines that separate the data, it chooses the one with the largest margin. In a learning process involving two classes, the objective is to find a classification function so that members of the two classes can be distinguished in the data set (18).
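The idea of a separating line with the largest margin can be made concrete with a small example. The sketch below does not train an SVM; it takes a hypothetical weight vector `w` and bias `b` for two toy features, classifies points by the sign of w.x + b, and computes the margin width 2/||w||. All values and names are assumptions made for illustration.

```python
import math
from typing import Sequence


def decision_value(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def svm_predict(w: Sequence[float], x: Sequence[float], b: float) -> int:
    """Return +1 or -1 depending on the side of the separating line."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def margin_width(w: Sequence[float]) -> float:
    """Distance between the lines w.x + b = +1 and w.x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical, already standardized 2-D points with labels +1 / -1.
points = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]
w, b = (0.5, 0.5), -1.0  # assumed separating line, not fitted here

# Every point lies on or outside its margin line: y * (w.x + b) >= 1.
print(all(y * decision_value(w, x, b) >= 1 for x, y in points))  # True
print(f"margin width: {margin_width(w):.3f}")  # margin width: 2.828
```

Training an SVM amounts to searching for the `w` and `b` that keep every point outside the margin while making `margin_width(w)` as large as possible.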

Bayesian Network
This algorithm classifies objects according to Bayes' rule and assumes that the input variables are independent of each other. It has a very simple structure and, despite its simplicity, has high predictive accuracy; Wu et al reported this algorithm as the most effective for predicting the recurrence of breast cancer (19).
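The independence assumption described above is what makes the classifier simple: the class score is the class prior multiplied by one likelihood term per variable. The sketch below shows this for continuous variables under an added assumption that each variable is normally distributed within a class (a Gaussian naive Bayes model); the text does not specify the model details used in the study, and the toy rows and function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Model = Dict[int, Tuple[float, List[float], List[float]]]


def fit_gaussian_nb(rows: Sequence[Sequence[float]], labels: Sequence[int]) -> Model:
    """Estimate the prior, per-variable means and variances for each class."""
    model: Model = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[cls] = (n / len(rows), means, variances)
    return model


def log_posterior(model: Model, row: Sequence[float], cls: int) -> float:
    """Unnormalized log posterior: log prior plus a sum of log likelihoods."""
    prior, means, variances = model[cls]
    total = math.log(prior)
    for v, m, var in zip(row, means, variances):
        total += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
    return total


def nb_predict(model: Model, row: Sequence[float]) -> int:
    """Return the class with the highest log posterior."""
    return max(model, key=lambda cls: log_posterior(model, row, cls))


# Hypothetical toy rows: (plasma glucose, body mass index); 1 = diabetic.
rows = [(90, 22), (100, 25), (95, 24), (150, 33), (160, 35), (170, 31)]
labels = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(rows, labels)
print(nb_predict(model, (98, 23)), nb_predict(model, (165, 34)))  # 0 1
```

Summing log likelihoods over the variables is valid only because of the independence assumption; correlated inputs violate it, although the classifier often still ranks the classes correctly.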

Neural Networks
An artificial neural network is a learning method inspired by the human brain that is used in statistics, artificial intelligence, and classification. A neural network consists of several layers called the input layer, the hidden layer(s), and the output layer. Neural networks are divided into 2 types according to how the nodes are connected:
1. Feedforward neural networks: The nodes of each layer are connected only to the nodes of the next layer (Figure 2).
2. Recurrent (feedback) neural networks: The nodes of each layer are connected to the nodes of the next layer or to themselves.
In this study, we used feedforward neural networks, that is, the multilayer perceptron (20).
The artificial neural network algorithm produces its output in the form of a black box. These networks can be used as appropriate methods for generating analytical and estimation models from different data (20).
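The layered, forward-only structure described above can be sketched as a forward pass in which each layer's outputs feed only the next layer. The example below is a minimal illustration with one hidden layer and sigmoid activations; the weights are arbitrary hypothetical numbers rather than values learned in the study, and the activation function is an assumption.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic activation that maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(inputs: Sequence[float], weights: Sequence[Sequence[float]],
                biases: Sequence[float]) -> List[float]:
    """Compute one layer; weights[j] holds the incoming weights of node j."""
    return [sigmoid(sum(w * x for w, x in zip(node_w, inputs)) + b)
            for node_w, b in zip(weights, biases)]


def mlp_forward(inputs: Sequence[float],
                hidden_w: Sequence[Sequence[float]], hidden_b: Sequence[float],
                output_w: Sequence[Sequence[float]], output_b: Sequence[float]) -> List[float]:
    """Input layer -> hidden layer -> output layer, with no backward links."""
    hidden = dense_layer(inputs, hidden_w, hidden_b)
    return dense_layer(hidden, output_w, output_b)


# Hypothetical network: 2 scaled inputs, 3 hidden nodes, 1 output node.
hidden_w = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_b = [0.0, 0.1, -0.1]
output_w = [[1.0, -1.0, 0.5]]
output_b = [0.0]

probability = mlp_forward([0.6, 0.3], hidden_w, hidden_b, output_w, output_b)[0]
print(f"predicted probability of the positive class: {probability:.3f}")
```

The "black box" character is visible here: the prediction is a nested composition of weighted sums, so no single weight maps to a readable rule in the way a decision tree branch does.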
Data evaluation: In this study, the confusion matrix was used to evaluate and analyze the accuracy, sensitivity, and specificity criteria, which are briefly described below. Specificity: If a person is truly negative, the model will also give a negative result in a high percentage of cases. In other words, if the test is very specific and its result is positive, we can be relatively sure that the person will develop diabetes. It is calculated by equation 1.

$$\text{Specificity} = \frac{TN}{FP + TN} \qquad (1)$$
Sensitivity: If a person is truly positive, the model will also give a positive result in a high percentage of cases. In other words, if the test is very sensitive and its result is negative, we can be almost sure that the person will not develop diabetes. It is calculated by equation 2.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (2)$$
Accuracy: This criterion is defined as the percentage of correctly classified records and is calculated by equation 3.

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN} \qquad (3)$$
Where:
TN: The real category is negative and the algorithm correctly detects it as negative.
TP: The real category is positive and the algorithm correctly detects it as positive.
FP: The real category is negative and the algorithm mistakenly detects it as positive.
FN: The real category is positive and the algorithm mistakenly detects it as negative.
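The three criteria can be computed directly from the four confusion matrix counts defined above. The sketch below counts TP, TN, FP, and FN from paired actual and predicted labels and then applies equations 1 to 3; the label lists are hypothetical toy data, not results from the study, and the function names are chosen for the example.

```python
from typing import Dict, Sequence


def confusion_counts(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, int]:
    """Count TP, TN, FP, and FN for binary labels (1 = positive, 0 = negative)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1
        elif a == 0 and p == 0:
            counts["TN"] += 1
        elif a == 0 and p == 1:
            counts["FP"] += 1
        else:
            counts["FN"] += 1
    return counts


def specificity(c: Dict[str, int]) -> float:
    """Equation 1: TN / (FP + TN)."""
    return c["TN"] / (c["FP"] + c["TN"])


def sensitivity(c: Dict[str, int]) -> float:
    """Equation 2: TP / (TP + FN)."""
    return c["TP"] / (c["TP"] + c["FN"])


def accuracy(c: Dict[str, int]) -> float:
    """Equation 3: (TP + TN) / (TP + FN + FP + TN)."""
    return (c["TP"] + c["TN"]) / (c["TP"] + c["FN"] + c["FP"] + c["TN"])


# Hypothetical labels for 8 records (1 = diabetic, 0 = healthy).
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 0, 1]

c = confusion_counts(actual, predicted)
print(c)  # {'TP': 2, 'TN': 4, 'FP': 1, 'FN': 1}
print(f"{specificity(c):.2f} {sensitivity(c):.2f} {accuracy(c):.2f}")  # 0.80 0.67 0.75
```

In this toy case the model is more specific than sensitive, which is the same pattern as a model that misses some diabetic cases while rarely mislabeling healthy ones.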

Results
The artificial neural network, Bayesian network, SVM, and C5.0 algorithms were applied to the data set. The accuracy obtained for the training and testing data is shown in Table 2. The highest accuracy was obtained with the C5.0 decision tree algorithm, so this algorithm was used to predict diabetes.
Table 3 shows the values of the indexes calculated for each of the studied algorithms. For the C5.0 algorithm, the accuracy, sensitivity, and specificity calculated from the confusion matrix were 80.2%, 59.5%, and 87.5%, respectively. These values indicate that the tree can produce comprehensive rules to predict the diagnosis of diabetes.

Ranking the Importance of Variables
In the C5.0 algorithm model, the order of importance of the variables used to predict the response variable is shown in Figure 3.
According to Figure 3, the variables plasma glucose concentration, age, parity, diabetes pedigree function, and body mass index are most important for predicting diabetes.

Discussion
In this study, we sought to build a model to predict the risk of diabetes using data mining algorithms, namely the C5.0 decision tree, neural networks, SVM, and Bayesian networks. Among the models produced, the C5.0 model had the highest accuracy in predicting the development of diabetes.
Gao et al (17) created a data processing system for type 2 diabetes by combining C4.5 algorithms. Huang et al (21) conducted a study to identify the major factors influencing diabetes control by using feature selection in a patient management system. Han et al (22) applied the ID3 decision tree algorithm in the RapidMiner software to a database of diabetic patients. Anbananthen et al (23) used an artificial neural network and a decision tree built with the C4.5 algorithm to detect individuals with diabetes based on age and blood pressure characteristics. Fang (24) clustered the data of patients with diabetes using different techniques. The features that were important in these models are age, family history, and weight. The reported accuracy of the clustering-based model was 80%.
The results show that blood glucose concentration and increased age are two major contributors to diabetes. Comparison with previous research findings on data mining and diabetes prediction shows that the model presented in this study has a high accuracy.
Increasing the accuracy of identifying people with diabetes depends on the size of the database, so future studies that use larger databases with more records may increase the accuracy of the algorithm. Feature selection algorithms can also be used to reduce the number of variables and thereby the processing time. This method can be used to develop medical decision support systems that help diagnose diabetes in healthcare centers.

Conclusion
In this study, a systematic effort was made to identify and review machine learning and data mining approaches for diabetes. Diabetes is rapidly emerging as one of the greatest global health issues of the 21st century.
Data mining is a valuable asset to deal with the abundant clinical data collected from patients and generated from the research and management of diabetes, so that researchers and clinicians can be assisted in providing better health care for the patients affected by this disease of the modern society.
The results showed that the C5.0 algorithm had the highest accuracy, specificity, and sensitivity compared with the other methods studied. Therefore, the C5.0 algorithm performed best among the algorithms compared and is proposed as the most effective method for predicting diabetes using the available data.
By comparison with other methods, it seems that:
• Speed - C5.0 is significantly faster than the other methods (neural networks, SVM, and Bayesian networks);
• Memory usage - C5.0 is more memory efficient;
• Smaller decision trees - C5.0 yields results similar to C4.5 with considerably smaller decision trees;
• Support for boosting - Boosting improves the trees and makes them more accurate; and
• Weighting - C5.0 allows different cases and misclassification types to be weighted.
From a biological perspective, blood glucose concentration could be an effective biomarker for diagnosis, and recent research has aimed to present genomic elements as well.

Figure 2. View of Neural Network.

Table 1. Variables Used in the Study

Table 2. Accuracy Values Calculated for the Models

Table 3. Index Values for the Studied Algorithms