The goal of this project is to use machine learning to predict fraudulent health providers by analyzing patterns across inpatient and outpatient claims data.
There are four datasets available; you can download them from Kaggle:
Inpatient Dataset - inpatient claims
Outpatient Dataset - outpatient claims
The inpatient and outpatient claims datasets share common claim-level fields such as physician IDs, diagnosis codes, and procedure codes. Additionally, the inpatient dataset contains admission and discharge dates.
Beneficiary Dataset - This data contains beneficiary KYC details like health conditions, the region they belong to, etc.
Provider labels - Provider ID and labels (fraudulent and non-fraudulent)
The challenge here is to analyze individual claims data and find patterns that might then help us predict fraudulent providers. So we are making an assumption that all claims filed by a fraudulent provider are fraudulent, and all claims filed by a non-fraudulent provider are legitimate.
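Under that assumption, every claim inherits its provider's label. A minimal pandas sketch of this label propagation (the analysis itself is in R; the tiny tables and column names here are illustrative, not the actual Kaggle schema):

```python
import pandas as pd

# Hypothetical mini claims table and provider-label table.
claims = pd.DataFrame({
    "ClaimID": [1, 2, 3, 4],
    "Provider": ["PRV001", "PRV001", "PRV002", "PRV003"],
})
labels = pd.DataFrame({
    "Provider": ["PRV001", "PRV002", "PRV003"],
    "PotentialFraud": ["Yes", "No", "Yes"],
})

# Each claim takes the label of the provider who filed it.
claims_labeled = claims.merge(labels, on="Provider", how="left")
print(claims_labeled["PotentialFraud"].tolist())
```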
Initial data cleaning:
Lots of missing values in both the inpatient and outpatient datasets.
Dropping columns where 100% of the data is missing.
In both the inpatient and outpatient datasets, the remaining missing values are concentrated in the physician columns, diagnosis codes, and procedure codes.
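The "drop fully empty columns" step can be done in one call; a pandas sketch with a toy frame (column names are illustrative):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the inpatient table.
df = pd.DataFrame({
    "ClmDiagnosisCode_1": ["4019", "25000", np.nan],
    "ClmProcedureCode_6": [np.nan, np.nan, np.nan],  # 100% missing
    "AttendingPhysician": ["PHY1", np.nan, "PHY3"],
})

# Drop only columns where every value is missing; partially-missing
# columns (physicians, diagnosis codes) are kept for later handling.
df_clean = df.dropna(axis=1, how="all")
print(list(df_clean.columns))
```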
Join the beneficiary dataset with the inpatient and outpatient datasets for analysis.
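The join can be sketched as a stack-then-merge, assuming the claim tables share a beneficiary key (a `BeneID`-style column is assumed here; the real field name may differ):

```python
import pandas as pd

# Illustrative stand-ins for the three tables.
inpatient = pd.DataFrame({"ClaimID": [1], "BeneID": ["B1"]})
outpatient = pd.DataFrame({"ClaimID": [2], "BeneID": ["B2"]})
beneficiary = pd.DataFrame({"BeneID": ["B1", "B2"], "State": [5, 10]})

# Stack inpatient and outpatient claims, then attach beneficiary
# details (health conditions, region, etc.) to each claim row.
claims = pd.concat([inpatient, outpatient], ignore_index=True)
claims = claims.merge(beneficiary, on="BeneID", how="left")
print(claims.shape)
```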
OK, let's explore the data now. Each row now represents a unique claim, with various variables along with the labels column (PotentialFraud).
Age seems to have no impact on fraud.
There are more females than males in the dataset, but gender shows no influence on fraud.
State | Total claims |
---|---|
California | 30335 |
Florida | 17512 |
New York | 17492 |
Pennsylvania | 11448 |
Texas | 10135 |
The number of claims from flagged providers per state correlates with total claims per state, so there is no clear state bias.
Flagged providers appear to file claims in more states than non-flagged providers.
Do fraudulent providers use some diagnosis codes more frequently?
Insights: some codes look strongly associated with fraudulent claims.
Flagged providers file more inpatient claims, which explains the frequency bias of the procedure codes.
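One way to control for that frequency bias is to compare each code's *share within its group* rather than raw counts. A pandas sketch with toy data (real column names will differ):

```python
import pandas as pd

# Toy claim-level data; one diagnosis code per row for simplicity.
claims = pd.DataFrame({
    "PotentialFraud": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "DiagnosisCode": ["4280", "4280", "4019", "4019", "2720", "4019"],
})

# Share of each code within flagged vs non-flagged claims; normalizing
# within group removes the "flagged providers simply file more claims" effect.
shares = (claims.groupby("PotentialFraud")["DiagnosisCode"]
                .value_counts(normalize=True)
                .unstack(fill_value=0))
print(shares.round(2))
```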
Is there a difference in distributions of hospital stay between fraudulent and non-fraudulent providers?
Do fraudulent providers file more claims per patient?
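A claims-per-patient ratio per provider can answer this; the sketch below mirrors the `Total_claims` and `Total_patients` features that appear in the model summary later (toy data, assumed column names):

```python
import pandas as pd

claims = pd.DataFrame({
    "Provider": ["P1", "P1", "P1", "P2"],
    "BeneID":   ["B1", "B1", "B2", "B3"],
})

# Claims filed per provider divided by distinct patients seen; a high
# ratio may indicate repeated billing for the same beneficiaries.
per_provider = claims.groupby("Provider").agg(
    Total_claims=("BeneID", "size"),
    Total_patients=("BeneID", "nunique"),
)
per_provider["claims_per_patient"] = (
    per_provider["Total_claims"] / per_provider["Total_patients"]
)
print(per_provider["claims_per_patient"].to_dict())
```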
We will employ the following algorithms to generate models:
- Logistic Regression
- Random Forest
- Extreme Gradient Boosting (XGBoost)
For each model:

- Train-test split ratio: 80:20
- Repeated k-fold cross-validation (5 repeats of 10-fold cross-validation)
- Since this is an imbalanced dataset, we will use recall and specificity instead of overall accuracy to validate our predictions
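The validation scheme can be sketched as follows (here in Python with scikit-learn as a stand-in for the R workflow; the synthetic data and plain logistic model are placeholders for the real provider-level features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

# Synthetic stand-in for the provider-level table, imbalanced ~9:1.
X, y = make_classification(n_samples=500, weights=[0.9], random_state=42)

# 5 repeats of 10-fold CV, scoring recall (sensitivity) and specificity
# (recall of the negative class) instead of overall accuracy.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)
scoring = {
    "recall": make_scorer(recall_score, pos_label=1),
    "specificity": make_scorer(recall_score, pos_label=0),
}
scores = cross_validate(LogisticRegression(max_iter=1000), X, y,
                        cv=cv, scoring=scoring)
print(scores["test_recall"].mean(), scores["test_specificity"].mean())
```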
##
## Call:
## glm(formula = PotentialFraud ~ ., family = binomial(link = "logit"),
## data = train_model)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1527 -0.3056 -0.1542 -0.1321 3.1093
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.174758 0.715118 -7.236 4.61e-13 ***
## Total_patients -0.005360 0.001825 -2.937 0.003315 **
## Total_claims -0.117602 0.060713 -1.937 0.052744 .
## Code_2449 0.005336 0.030094 0.177 0.859254
## Code_25000 0.019368 0.021217 0.913 0.361304
## Code_2720 -0.001736 0.031201 -0.056 0.955636
## Code_2724 0.055277 0.021120 2.617 0.008863 **
## Code_4011 -0.025891 0.029459 -0.879 0.379468
## Code_4019 0.001694 0.015050 0.113 0.910401
## Code_42731 0.093493 0.027039 3.458 0.000545 ***
## Code_4280 0.156565 0.029497 5.308 1.11e-07 ***
## Code_V5861 -0.037449 0.029607 -1.265 0.205918
## Code_5869 -0.056165 0.027712 -2.027 0.042686 *
## Age_20_40 0.138741 0.063889 2.172 0.029886 *
## Age_40_60 0.112166 0.060203 1.863 0.062443 .
## Age_60_80 0.124113 0.061225 2.027 0.042646 *
## Age_80_100 0.113972 0.060542 1.883 0.059765 .
## Total_states 0.044576 0.026009 1.714 0.086551 .
## Totalstay_days 0.203812 0.018816 10.832 < 2e-16 ***
## Total_diagnosiscodes 0.142743 0.046735 3.054 0.002256 **
## Total_chronicconds 0.087163 0.111803 0.780 0.435617
## Total_physicians -0.198766 0.354965 -0.560 0.575507
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2689.9 on 4328 degrees of freedom
## Residual deviance: 1503.4 on 4307 degrees of freedom
## AIC: 1547.4
##
## Number of Fisher Scoring iterations: 7
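For interpretation, exponentiating a logit coefficient gives an odds ratio: the multiplicative change in the odds of fraud per one-unit increase in that predictor. A quick check on two coefficients copied from the summary above:

```python
import math

beta_stay = 0.203812     # Totalstay_days coefficient from the glm summary
beta_claims = -0.117602  # Total_claims coefficient from the glm summary

# exp(beta) > 1 raises the odds of fraud; exp(beta) < 1 lowers them.
print(round(math.exp(beta_stay), 3))    # odds multiplier per extra stay-day
print(round(math.exp(beta_claims), 3))  # odds multiplier per extra claim
```

So each additional total stay-day multiplies the odds of a provider being flagged by roughly 1.23, holding the other predictors fixed.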
  | Actual Positives | Actual Negatives |
---|---|---|
Predicted Positives | 38 (TP) | 6 (FP) |
Predicted Negatives | 63 (FN) | 974 (TN) |
TP = true positives (providers correctly identified as fraud)
TN = true negatives (providers correctly identified as not fraud)
FP = false positives (providers incorrectly identified as fraud)
FN = false negatives (providers incorrectly identified as not fraud)
Accuracy = (TP + TN)/Total = (38 + 974)/1081 ≈ 93%
Recall = TP/(TP + FN) = 38/101 ≈ 37%
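The same arithmetic, spelled out (specificity is included since it is one of our chosen validation metrics):

```python
# Confusion-matrix counts from the logistic regression above.
TP, FP, FN, TN = 38, 6, 63, 974

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)          # sensitivity: share of fraud caught
specificity = TN / (TN + FP)     # share of non-fraud correctly cleared

print(round(accuracy * 100, 1),
      round(recall * 100, 1),
      round(specificity * 100, 1))
```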
The overall accuracy is 93%. However, the goal of this project is to correctly identify fraudulent providers, and the poor recall indicates that this model will perform badly at exactly that task.
Let's see how Random Forest and XGBoost perform.
  | Logistic Regression | Random Forest | XGBoost |
---|---|---|---|
True Positive | 38 | 78 | 62 |
False Negative | 63 | 23 | 39 |
Recall(%) | 37 | 77 | 61 |
Accuracy(%) | 93 | 97 | 95 |