Association rule mining
Last time, we talked about healthcare analytics using predictive or descriptive methods. This post covers association rule mining, which aims to find diseases that co-occur in the same patient, using data from a health care repository.
Research has shown that having one disease can raise the likelihood of one or more others. For example, a patient with diabetes is more likely to have other conditions, such as hypertension, and is at increased risk of heart disease. Association rule mining is a data mining technique that surfaces such correlations among the diseases patients carry.
In Association Rules Mining Based Clinical Observations, the authors propose an architecture that combines Clinical State Correlation Prediction (CSCP) with an Online Transaction Processing (OLTP) system. When a patient visits a hospital, they fill out forms with their information, and this data is stored in the OLTP system. Later, the CSCP imports those records and generates associations among diseases using data mining techniques.
Apriori Algorithm
1. Support
Support is the proportion of records in which a disease occurs. For example, among 100 people, 80 are diabetes patients. The support of diabetes is calculated as
Support(diabetes) = diabetes patients / total patients = 80/100 = 0.8
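As a minimal sketch of the calculation above, the snippet below counts how many patient records contain a disease and divides by the number of records. The `patients` list and the `support` function are made up for illustration; they are not taken from the post's dataset.

```python
# Toy data (an assumption for illustration): 100 patients, 80 with diabetes.
patients = [{"diabetes"}] * 80 + [set()] * 20


def support(patients, disease):
    """Fraction of patient records that contain the given disease."""
    return sum(disease in record for record in patients) / len(patients)


print(support(patients, "diabetes"))  # 0.8
```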
2. Support ratio
Support ratio compares how common two diseases are relative to each other. It is the ratio of their supports, with the smaller support always in the numerator, so the result never exceeds 1. For example, if the support of diabetes is 0.8 and the support of heart disease is 0.7:
Support Ratio(diabetes and heart disease) = 0.7/0.8 = 0.875
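The rule "smaller support on top" can be sketched in a few lines. The function name `support_ratio` is hypothetical, and the sketch assumes both supports are greater than zero.

```python
def support_ratio(support_a, support_b):
    """Smaller support divided by the larger one (assumes both are > 0)."""
    low, high = sorted((support_a, support_b))
    return low / high


# Argument order does not matter: the smaller value is always the numerator.
print(round(support_ratio(0.8, 0.7), 3))  # 0.875
```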
3. Confidence
Confidence measures how often disease B occurs among patients who have disease A. The numerator is the support of A and B together, and the denominator is the support of the antecedent disease (A). For example, among 100 people, 80 have diabetes, 70 have heart disease, and 50 have both. The confidence that diabetes leads to heart disease, Confidence(A -> B), is:
Confidence(diabetes -> heart disease)
= Support(diabetes and heart disease) / Support(diabetes) = 0.5/0.8 = 0.625
Note: calculating confidence requires both an antecedent and a consequent.
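Here is a small sketch of the confidence calculation, working from raw counts. It assumes 50 of the 100 patients have both conditions, which is the count implied by the joint support of 0.5 used in the formula. The `confidence` function and its parameter names are illustrative only.

```python
def confidence(count_both, count_antecedent, total):
    """Confidence(A -> B) = Support(A and B) / Support(A)."""
    support_both = count_both / total
    support_antecedent = count_antecedent / total
    return support_both / support_antecedent


# diabetes -> heart disease: 50 have both, 80 have diabetes
print(round(confidence(50, 80, 100), 3))  # 0.625

# heart disease -> diabetes: same 50, but 70 have heart disease
print(round(confidence(50, 70, 100), 3))  # 0.714
```

The two results differ because swapping the antecedent and the consequent changes the denominator, which is why a rule always has to name both.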
4. Lift
Lift measures how strong an association rule is. It is calculated as the confidence divided by the support of the consequent. For example, to measure the strength of the rule diabetes -> heart disease:
Support(heart disease) = 70/100 = 0.7
Lift = 0.625/0.7 = 0.89
The lift (0.89) is smaller than 1 because the confidence (0.625) is smaller than the support of heart disease (0.7). With these example numbers, patients with diabetes are therefore less likely to have heart disease than patients overall.
Lift < 1: negative correlation
Lift = 1: no correlation (the two diseases occur independently)
Lift > 1: positive correlation
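The lift calculation and its interpretation can be sketched as follows, using the supports from the diabetes and heart disease example. The function names `lift` and `interpret` are hypothetical.

```python
def lift(support_both, support_antecedent, support_consequent):
    """Lift(A -> B) = Confidence(A -> B) / Support(B)."""
    confidence = support_both / support_antecedent
    return confidence / support_consequent


def interpret(value):
    """Map a lift value to the reading given in the post."""
    if value < 1:
        return "negative correlation"
    if value > 1:
        return "positive correlation"
    return "no correlation"


value = lift(0.5, 0.8, 0.7)
print(round(value, 2))   # 0.89
print(interpret(value))  # negative correlation
```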
Example: association rules between Alzheimer's disease and cancer
Here is an example of association rule mining between Alzheimer's disease and different types of cancer. For this example, I would like to find the lite association, which basically means that one disease leads to the other. I will use the following dataset. The two columns used are PID_Arry (patient ID) and CAT2 (disease code), where 331 represents Alzheimer's disease and the other codes represent different types of cancer.
After cleaning the data, I calculated support, support ratio (SR), confidence, and lift, then sorted by confidence. As shown in the picture below, the highest confidence is 1.62%, between Alzheimer's disease and cancer 198. The lift score indicates that patients with Alzheimer's disease are less likely to have cancer 198.
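The workflow described above might look roughly like the sketch below. Only the column names PID_Arry and CAT2 and the code 331 come from the post. The rows are invented toy data, and the helper names are hypothetical, so the numbers it produces say nothing about the real dataset.

```python
from collections import defaultdict

# Hypothetical (PID_Arry, CAT2) rows; NOT the dataset used in the post.
rows = [
    (1, "331"), (1, "198"),
    (2, "331"),
    (3, "198"),
    (4, "331"), (4, "162"),
    (5, "162"),
    (6, "198"),
    (7, "331"),
    (8, "185"),
]

ANTECEDENT = "331"  # Alzheimer, per the post

# Group codes by patient so each patient counts once per disease code.
codes_by_patient = defaultdict(set)
for pid, code in rows:
    codes_by_patient[pid].add(code)

total = len(codes_by_patient)


def support(*codes):
    """Fraction of patients whose record contains every given code."""
    hits = sum(all(c in rec for c in codes) for rec in codes_by_patient.values())
    return hits / total


rules = []
sup_a = support(ANTECEDENT)
for code in sorted({c for _, c in rows} - {ANTECEDENT}):
    sup_b = support(code)
    conf = support(ANTECEDENT, code) / sup_a
    rules.append({
        "rule": f"{ANTECEDENT} -> {code}",
        "support_ratio": min(sup_a, sup_b) / max(sup_a, sup_b),
        "confidence": conf,
        "lift": conf / sup_b,
    })

# Sort by confidence, highest first, as in the post.
rules.sort(key=lambda r: r["confidence"], reverse=True)
for r in rules:
    print(r)
```

Grouping codes into one set per patient means a patient with repeated rows for the same code is counted only once, which is one plausible reading of the data cleaning step.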
Conclusion
The patterns recognized through association rule mining can help health providers and researchers find disease associations. By implementing the CSCP system proposed in Association Rules Mining Based Clinical Observations, health professionals could identify diseases faster and make better decisions.
Reference:
Rashid, M., Hoque, M., & Sattar, A. (2014, January 11). Association Rules Mining Based Clinical Observations. Retrieved November 20, 2020, from https://arxiv.org/abs/1401.2571