
Predictive Modelling of Diabetes

Diabetes has become a leading cause for concern amongst Australia's population, with over 1.3 million confirmed cases and an estimated 500,000 further undiagnosed type 2 cases [1].

The scale of this issue is compounded by two facts: each of these people typically has a family member or carer whose life is directly affected in a support capacity, and diabetes is the 7th leading cause of death among Australians [2].

It is from this perspective that we aim to develop a predictive model for diabetes risk, both to help medical professionals make timely detections and interventions and to give users agency in preventing this disease.

Data Analytics

To analyse this data, as well as to train machine learning classification models, a number of Python libraries have been incorporated into the project:

Dataset 1

https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset
The original poster is coy about the source of the data [3], but it is still one of the most popular datasets on Kaggle.

This dataset consists of 100001 records which, after cleaning and selection for those aged 55 and over, is reduced to 30995.

The columns are as follows:

| Name | Description | Type |
| --- | --- | --- |
| gender | Gender of the patient | Nominal: [Male, Female, Other] |
| age | Age of the patient | Discrete: [Years, 55 ➡ 80] |
| hypertension | History of hypertension | Nominal: [0: false, 1: true] |
| heart_disease | Patient history of heart disease | Nominal: [0: false, 1: true] |
| smoking_history | Patients Smoking History | Nominal: ['never', 'current', 'No Info', 'former', 'not current', 'ever'] |
| bmi | Body-Mass Index: calculated by body-mass / height^2 | Continuous |
| HbA1c_level | Average Blood Glucose over the last 3 months | Continuous: [A1c%] |
| blood_glucose_level | Blood glucose at time of recording | Discrete: [mg/dL] |
| diabetes | Whether the patient has been diagnosed with diabetes | Nominal: [0: false, 1: true] |
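
The reduction from the full dataset to the 55-and-over cohort can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual pipeline: the cleaning steps are not described above, so dropping incomplete and duplicate rows is an assumption, and `SAMPLE_CSV`, `MIN_AGE`, and `load_cohort` are hypothetical names introduced for the example.

```python
import csv
import io

# Hypothetical sample rows that reuse the column names from the table above.
SAMPLE_CSV = """gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
Female,80.0,0,1,never,25.19,6.6,140,0
Male,54.0,0,0,No Info,27.32,6.6,80,0
Male,67.0,0,1,not current,27.32,6.5,200,1
Female,67.0,0,1,not current,,6.5,200,1
Male,67.0,0,1,not current,27.32,6.5,200,1
"""

MIN_AGE = 55  # cohort cut-off stated in the text


def load_cohort(handle, min_age=MIN_AGE):
    """Return complete, de-duplicated records for patients aged min_age or over."""
    seen = set()
    cohort = []
    for row in csv.DictReader(handle):
        # Assumed cleaning step: skip rows with any missing or empty field.
        if any(value is None or value.strip() == "" for value in row.values()):
            continue
        # Assumed cleaning step: skip exact duplicates of an earlier row.
        key = tuple(row.values())
        if key in seen:
            continue
        seen.add(key)
        # Cohort selection: keep only patients aged min_age or over.
        if float(row["age"]) >= min_age:
            cohort.append(row)
    return cohort


cohort = load_cohort(io.StringIO(SAMPLE_CSV))
print(len(cohort), "records kept")
```

On the real file, the same function could be called with an open file handle in place of the `io.StringIO` wrapper.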

Dataset 2

https://archive.ics.uci.edu/dataset/529/early+stage+diabetes+risk+prediction+dataset

This dataset is provided in the UC Irvine Machine Learning Repository and was collected using questionnaires from the patients of the Sylhet Diabetes Hospital in Bangladesh [4].

The dataset contains 17 columns and 520 records:

| Name | Description | Type |
| --- | --- | --- |
| Age | Age of the patient | Discrete: [Years] |
| Gender | Gender of the patient | Nominal: [Male, Female] |
| Polyuria | Excessive Urine | Nominal: [No, Yes] |
| Polydipsia | Excessive Thirst | Nominal: [No, Yes] |
| Sudden Weight Loss | | Nominal: [No, Yes] |
| Weakness | | Nominal: [No, Yes] |
| Polyphagia | Extreme Hunger | Nominal: [No, Yes] |
| Genital Thrush | | Nominal: [No, Yes] |
| Visual Blurring | | Nominal: [No, Yes] |
| Itching | | Nominal: [No, Yes] |
| Irritability | | Nominal: [No, Yes] |
| Delayed Healing | | Nominal: [No, Yes] |
| Partial Paresis | Muscular weakness / impairment | Nominal: [No, Yes] |
| Muscle Stiffness | | Nominal: [No, Yes] |
| Alopecia | Hair Loss | Nominal: [No, Yes] |
| Obesity | | Nominal: [No, Yes] |
| Class | Presence of Diabetes | Nominal: [Negative, Positive] |
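
Because almost every column here is a two-valued nominal field, a classification model would typically need the text labels converted to numbers first. The sketch below shows one possible encoding. The 0/1 assignments and the names `BINARY_MAP`, `GENDER_MAP`, `CLASS_MAP`, and `encode_record` are assumptions made for illustration, not the project's confirmed preprocessing.

```python
# Assumed label-to-integer mappings; column names follow the table above.
BINARY_MAP = {"No": 0, "Yes": 1}
GENDER_MAP = {"Female": 0, "Male": 1}
CLASS_MAP = {"Negative": 0, "Positive": 1}


def encode_record(record):
    """Convert one questionnaire record (a dict of strings) to numeric values."""
    encoded = {}
    for name, value in record.items():
        if name == "Age":
            encoded[name] = int(value)  # the only non-categorical column
        elif name == "Gender":
            encoded[name] = GENDER_MAP[value]
        elif name == "Class":
            encoded[name] = CLASS_MAP[value]  # prediction target
        else:
            encoded[name] = BINARY_MAP[value]  # symptom columns: No/Yes
    return encoded


# Hypothetical record, shortened to a few of the 17 columns for brevity.
example = {"Age": "58", "Gender": "Male", "Polyuria": "Yes", "Obesity": "No", "Class": "Positive"}
print(encode_record(example))
# → {'Age': 58, 'Gender': 1, 'Polyuria': 1, 'Obesity': 0, 'Class': 1}
```

An unexpected label raises a `KeyError` rather than being silently mapped, which helps surface data-entry problems early.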

Approach

Please discuss your analytics workflow / methodology here

Footnotes

  1. https://www.diabetesaustralia.com.au/about-diabetes/diabetes-in-australia/

  2. https://www.abs.gov.au/statistics/health/causes-death/provisional-mortality-statistics/jan-may-2024

  3. https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset/discussion/406676#2282358

  4. Islam, M. M. Faniqul et al. “Likelihood Prediction of Diabetes at Early Stage Using Data Mining Techniques.” Computer Vision and Machine Intelligence in Medical Image Analysis (2019): n. pag.