Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes – ELSA-Brasil: accuracy study

Authors

  • André Rodrigues Olivera Universidade Federal do Rio Grande do Sul
  • Cirano Iochpe Universidade Federal do Rio Grande do Sul
  • Maria Inês Schmidt Universidade Federal do Rio Grande do Sul
  • Álvaro Vigo Universidade Federal do Rio Grande do Sul
  • Sandhi Maria Barreto Universidade Federal do Rio Grande do Sul
  • Bruce Bartholow Duncan Universidade Federal do Rio Grande do Sul

Keywords:

Supervised machine learning; Decision support techniques; Data mining; Models, statistical; Diabetes mellitus, type 2

Abstract

CONTEXT AND OBJECTIVE: Type 2 diabetes is a chronic disease associated with a wide range of serious health complications that have a major impact on overall health. The aims here were to develop and validate predictive models for detecting undiagnosed diabetes using data from the Longitudinal Study of Adult Health (ELSA-Brasil) and to compare the performance of different machine-learning algorithms in this task.

DESIGN AND SETTING: Comparison of machine-learning algorithms to develop predictive models using data from ELSA-Brasil.

METHODS: After selecting a subset of 27 candidate variables from the literature, models were built and validated in four sequential steps: (i) parameter tuning with tenfold cross-validation, repeated three times; (ii) automatic variable selection using forward selection, a wrapper strategy with four different machine-learning algorithms and tenfold cross-validation (repeated three times), to evaluate each subset of variables; (iii) error estimation of model parameters with tenfold cross-validation, repeated ten times; and (iv) generalization testing on an independent dataset. The models were created with the following machine-learning algorithms: logistic regression, artificial neural network, naïve Bayes, K-nearest neighbor and random forest.

RESULTS: The best models were created using artificial neural networks and logistic regression. These achieved mean areas under the curve of, respectively, 75.24% and 74.98% in the error estimation step and 74.17% and 74.41% in the generalization testing step.

CONCLUSION: Most of the predictive models produced similar results and demonstrated the feasibility of identifying individuals with the highest probability of having undiagnosed diabetes, using easily obtained clinical data.
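The wrapper strategy described in the methods (forward selection scored by repeated tenfold cross-validation, with the area under the curve as the metric) can be illustrated with a minimal sketch. This is not the authors' implementation: all function names here are hypothetical, the folds are stratified by class as an assumption, the stopping threshold `min_gain` is an assumption, and the learner is a toy class-centroid scorer standing in for the algorithms compared in the study.

```python
import random
from statistics import mean


def auc(labels, scores):
    """Area under the ROC curve: share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_stratified_kfold(y, k=10, repeats=3, seed=0):
    """Yield (train, test) index lists for k-fold CV repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        pos = [i for i, label in enumerate(y) if label == 1]
        neg = [i for i, label in enumerate(y) if label == 0]
        rng.shuffle(pos)
        rng.shuffle(neg)
        order = pos + neg
        # Dealing indices round-robin spreads both classes across the folds.
        folds = [order[i::k] for i in range(k)]
        for held_out in range(k):
            train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
            yield train, folds[held_out]


def fit_centroid_scorer(X, y):
    """Toy stand-in learner (assumption): weight each variable by the class-mean gap."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    w = [mean(x[j] for x in pos) - mean(x[j] for x in neg) for j in range(len(X[0]))]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def cv_auc(X, y, columns, k=10, repeats=3, seed=0):
    """Mean test-fold AUC for a model restricted to the given variable columns."""
    Xs = [[row[c] for c in columns] for row in X]
    aucs = []
    for train, test in repeated_stratified_kfold(y, k, repeats, seed):
        score = fit_centroid_scorer([Xs[i] for i in train], [y[i] for i in train])
        aucs.append(auc([y[i] for i in test], [score(Xs[i]) for i in test]))
    return mean(aucs)


def forward_selection(X, y, min_gain=0.001):
    """Greedily add the variable that most improves cross-validated AUC."""
    selected, best = [], 0.5  # 0.5 is the AUC of a chance-level model
    remaining = list(range(len(X[0])))
    while remaining:
        candidate_auc, col = max((cv_auc(X, y, selected + [c]), c) for c in remaining)
        if candidate_auc - best < min_gain:
            break  # no remaining variable improves the score enough
        selected.append(col)
        remaining.remove(col)
        best = candidate_auc
    return selected, best
```

Calling `forward_selection(X, y)` on a list of numeric rows `X` and binary labels `y` returns the chosen column indices and their cross-validated AUC. A final check on an independent dataset, as in step (iv), would score the selected model once on data never used during selection.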

Author Biographies

André Rodrigues Olivera, Universidade Federal do Rio Grande do Sul

MSc. IT Analyst, Postgraduate Computing Program, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre (RS), Brazil.

Cirano Iochpe, Universidade Federal do Rio Grande do Sul

PhD. Professor, Postgraduate Computing Program, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre (RS), Brazil.

Maria Inês Schmidt, Universidade Federal do Rio Grande do Sul

PhD. Professor, Postgraduate Epidemiology Program and Hospital de Clínicas, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre (RS), Brazil.

Álvaro Vigo, Universidade Federal do Rio Grande do Sul

PhD. Professor, Postgraduate Epidemiology Program, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre (RS), Brazil.

Sandhi Maria Barreto, Universidade Federal do Rio Grande do Sul

PhD. Professor, Department of Social and Preventive Medicine & Postgraduate Program in Public Health, Universidade Federal de Minas Gerais (UFMG), Belo Horizonte (MG), Brazil.

Bruce Bartholow Duncan, Universidade Federal do Rio Grande do Sul

PhD. Professor, Postgraduate Epidemiology Program and Hospital de Clínicas, Universidade Federal do Rio Grande do Sul (UFRGS), Porto Alegre (RS), Brazil.

Published

2017-06-01

How to Cite

Olivera AR, Iochpe C, Schmidt MI, Vigo Á, Barreto SM, Duncan BB. Comparison of machine-learning algorithms to build a predictive model for detecting undiagnosed diabetes – ELSA-Brasil: accuracy study. Sao Paulo Med J [Internet]. 2017 Jun. 1 [cited 2025 Mar. 9];135(3):234-46. Available from: https://periodicosapm.emnuvens.com.br/spmj/article/view/785

Section

Original Article