Goal

The goal of this project is to use machine learning to predict fraudulent healthcare providers by analyzing patterns across inpatient and outpatient claims data.

Background

Impact of health care fraud

  • The U.S. spends ~$3.6 trillion on health insurance claims every year
  • ~$300 billion (3-10% of claims) of that is fraudulent
  • Examples of fraudulent claims:
    • Billing for services that were not provided
    • Misrepresenting the services provided (e.g., charging for a more complex procedure)
    • Submitting duplicate claims for the same or different patients
  • Fraudulent claims cause:
    • Increased cost of care
    • Slow processing of valid claims
    • Higher premiums
  • Manual review of billions of claims is time-consuming and expensive

Opportunity for machine learning

  • Machine learning is ideally suited to detecting fraudulent claims
  • Models can be built on existing fraud patterns to automate the assessment of claims
  • Benefits:
    • Faster processing of all claims
      • Identify genuine claims and streamline the approval and payment process
      • Flag fraudulent claims for further review before payment
    • Clear reasons can be provided for flagging a claim
  • Models can be improved to find new patterns and therefore identify new fraud types

Data set

There are four datasets available; you can download them from Kaggle.

  • Inpatient Dataset - Inpatient claims
  • Outpatient Dataset - Outpatient claims

The inpatient and outpatient claims datasets consist of:

  • Provider ID
  • Beneficiary ID
  • Claim ID
  • Claim Start date
  • Claim End date
  • Physician information (3 columns)
  • Diagnosis codes (10 columns)
  • Procedure codes (6 columns)

Additionally, the inpatient dataset includes admission and discharge dates.

Beneficiary Dataset - This data contains beneficiary KYC details such as health conditions and the region they belong to.

  • Beneficiary ID
  • Date of Birth
  • Gender
  • Race
  • State
  • Chronic condition information ( 1 column per condition - 11 columns)

Provider labels - Provider ID and labels (fraudulent and non-fraudulent)

The inpatient and outpatient claims data are provided per claim/patient, while the labels are provided per provider.

Challenge

The challenge here is to analyze individual claims data and find patterns that can then help us predict fraudulent providers. So, we are making the assumption that all claims filed by a fraudulent provider are fraudulent, and vice versa.

Data cleaning

Initial data cleaning:

  1. Dates were read in as factors. Let's convert them to Date objects.
  2. There are some missing values. Let's take a look at them (see the sketch below).
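
A minimal sketch of these two steps, assuming the claims were read into data frames named inpatient and outpatient (those names are assumptions) and that the dates follow the YYYY-MM-DD format; the same conversion applies to both tables:

# Convert date columns that were read in as factors to Date objects
date_cols <- c("ClaimStartDt", "ClaimEndDt", "AdmissionDt", "DischargeDt")
for (col in intersect(date_cols, names(inpatient))) {
  inpatient[[col]] <- as.Date(as.character(inpatient[[col]]), format = "%Y-%m-%d")
}

# Count missing values per column to see where cleaning is needed
sort(colSums(is.na(inpatient)), decreasing = TRUE)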

Missing values

There are lots of missing values in both the inpatient and outpatient datasets.

Dealing with missing values

Step 1:

Dropping the following columns because 100% of the data is missing (sketched below):

  1. ClmProcedureCode 4, 5, and 6 from the inpatient data
  2. ClmProcedureCode 1 through 6 from the outpatient data. After a little research, I found that these columns frequently exist in claims data but are left empty when they are not the basis of payment.
  3. DOD - Date of Death
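
A sketch of these drops, assuming the Kaggle columns are named ClmProcedureCode_1 through ClmProcedureCode_6 and the data frames are inpatient, outpatient, and beneficiary:

# Drop fully missing procedure-code columns and the date-of-death column
inpatient  <- inpatient[, !names(inpatient) %in%
                          c("ClmProcedureCode_4", "ClmProcedureCode_5", "ClmProcedureCode_6")]
outpatient <- outpatient[, !grepl("^ClmProcedureCode", names(outpatient))]
beneficiary$DOD <- NULL
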
Step 2:

Inpatient and Outpatient data sets- Missing values in Physician columns, Diagnosis codes, and procedure codes.

  1. For physician columns - replace NAs with "None"
  2. For diagnosis code and procedure code columns - replace NAs with 0 (see the sketch below)
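
A sketch for the inpatient table (the outpatient table is handled the same way), assuming the physician columns end in "Physician" and the code columns start with ClmDiagnosisCode or ClmProcedureCode:

# Replace missing physician names with "None"
phys_cols <- grep("Physician$", names(inpatient), value = TRUE)
for (col in phys_cols) {
  inpatient[[col]] <- ifelse(is.na(inpatient[[col]]), "None", as.character(inpatient[[col]]))
}

# Replace missing diagnosis and procedure codes with 0
# (assumes the code columns were read as character, not factor)
code_cols <- grep("^(ClmDiagnosisCode|ClmProcedureCode)", names(inpatient), value = TRUE)
inpatient[code_cols][is.na(inpatient[code_cols])] <- 0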

Feature Engineering

  1. Chronic conditions are listed as 1 = Yes, 2 = No. We will convert 2 to 0 just for ease of computation.
  2. Add a new column Age, which will be more useful than date of birth.
  3. Add a new column of state names.
  4. Totalstay_days (total days spent in the hospital) = DischargeDt - AdmissionDt
  5. Claimlength = ClaimEndDt - ClaimStartDt
  6. A column Type with values "Inpatient" and "Outpatient" (these steps are sketched below)
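
A sketch of these steps; the ChronicCond column prefix and the reference date used for age are assumptions, and DOB is assumed to already be a Date:

library(dplyr)

# Lengths of stay and claims; tag each claim with its source dataset
inpatient <- inpatient %>%
  mutate(Totalstay_days = as.numeric(DischargeDt - AdmissionDt),
         Claimlength    = as.numeric(ClaimEndDt - ClaimStartDt),
         Type           = "Inpatient")
outpatient <- outpatient %>%
  mutate(Claimlength = as.numeric(ClaimEndDt - ClaimStartDt),
         Type        = "Outpatient")

# Recode chronic-condition flags from 1/2 to 1/0
chronic_cols <- grep("^ChronicCond", names(beneficiary), value = TRUE)
beneficiary[chronic_cols] <- lapply(beneficiary[chronic_cols],
                                    function(x) ifelse(x == 2, 0, x))

# Age from date of birth (the reference date is an assumption)
beneficiary$Age <- floor(as.numeric(as.Date("2009-12-01") - beneficiary$DOB) / 365.25)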

Join the beneficiary, inpatient, and outpatient datasets for analysis
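
A sketch of the join, assuming the beneficiary key is named BeneID, the provider key is named Provider, and the labels table is named provider_labels:

library(dplyr)

# Stack the two claim types, then attach beneficiary details and provider labels
claims <- bind_rows(inpatient, outpatient) %>%
  left_join(beneficiary, by = "BeneID") %>%
  left_join(provider_labels, by = "Provider")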

OK, let's explore the data now. Each row now holds a unique claim and its variables, along with the label column (PotentialFraud).

Exploratory data analysis

Age and Fraud

Insight:

Age seems to have no impact on fraud.

Gender and Fraud

Insights:

There are more females than males in the dataset; gender appears to have no influence on fraud.

Race and Fraud

Insights:
  1. Race is mostly populated with a single value (race 1), which can cause model bias. Therefore, remove this variable before modeling.

State and Fraud

  1. First, let's take a look at fraudulent claims per state. California, Florida, New York, Pennsylvania, and Texas have more fraudulent claims than other states:
State          Total claims
California            30335
Florida               17512
New York              17492
Pennsylvania          11448
Texas                 10135

  2. Let's take a look at total claims per state to investigate whether these states also have high total claims overall.

The number of claims from flagged providers per state correlates with total claims per state, so there is NO CLEAR STATE BIAS.

  3. Let's investigate whether fraudulent providers file claims across multiple states.

It looks like flagged providers file claims in more states than non-flagged providers. A sketch of these state-level checks follows.
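
A sketch of the two state-level checks above, assuming the joined claims table from earlier and a PotentialFraud label coded "Yes"/"No":

library(dplyr)

# Fraudulent claims per state (top five)
claims %>%
  filter(PotentialFraud == "Yes") %>%
  count(State, sort = TRUE) %>%
  head(5)

# Distinct states each provider files claims in, averaged by label
claims %>%
  group_by(Provider, PotentialFraud) %>%
  summarise(n_states = n_distinct(State), .groups = "drop") %>%
  group_by(PotentialFraud) %>%
  summarise(avg_states = mean(n_states))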

Chronic conditions and Fraud

Diagnosis codes and Fraud

Do fraudulent providers use some diagnosis codes more frequently?

Insights:

It looks like some codes are strongly associated with fraudulent claims.

Procedure codes vs Fraud

Flagged providers file more inpatient claims, which explains the frequency bias of procedure codes.

Hospital stay and fraud

Is there a difference in distributions of hospital stay between fraudulent and non-fraudulent providers?

Number of claims and Fraud

Do fraudulent providers file more claims per patient? A sketch of this check follows.
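
A sketch of this check on the joined claims table:

library(dplyr)

# Average number of claims filed per patient, by provider label
claims %>%
  group_by(Provider, PotentialFraud) %>%
  summarise(claims_per_patient = n() / n_distinct(BeneID), .groups = "drop") %>%
  group_by(PotentialFraud) %>%
  summarise(avg_claims_per_patient = mean(claims_per_patient))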

Getting the dataset ready

  • In our exploratory data analysis, we had 550k rows and 60 features.
  • In order to predict fraudulent providers, we need to map the features onto the provider data, which consists of 5,410 rows.
  • For each provider, the features shown in the flow chart below were calculated (a roll-up sketch follows the chart).
  • The final dataset consists of 5,410 rows and 21 features, enumerated on a per-provider basis.

Flow chart depicting how data was enumerated on a per-provider basis
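
A sketch of the per-provider roll-up; the aggregate names match the model summary further down, but the exact feature definitions here are assumptions:

library(dplyr)

# One row per provider: roll the claim-level features up to provider level
provider_features <- claims %>%
  group_by(Provider) %>%
  summarise(Total_patients = n_distinct(BeneID),
            Total_claims   = n(),
            Total_states   = n_distinct(State),
            Totalstay_days = mean(Totalstay_days, na.rm = TRUE),
            PotentialFraud = first(PotentialFraud))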

Modeling

  • We will employ the following algorithms to generate models:
    - Logistic Regression
    - Random Forest
    - Extreme Gradient Boosting (XGBoost)

  • For each model:
    - Train-test split ratio - 80:20

  • For Random Forest and XGBoost:
    - Repeated k-fold cross-validation (5 repeats of 10-fold cross-validation)

  • Since this is an imbalanced dataset, we will use recall and specificity instead of overall accuracy to validate our predictions. A sketch of this setup follows.
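
A minimal sketch of this setup using the caret package; the seed, object names, and the choice of caret itself are assumptions (the write-up only fixes the split ratio and the cross-validation scheme):

library(caret)

# Drop the provider ID and make the label a factor before modeling
model_data <- provider_features[, names(provider_features) != "Provider"]
model_data$PotentialFraud <- factor(model_data$PotentialFraud)

set.seed(42)  # assumed seed
split_idx   <- createDataPartition(model_data$PotentialFraud, p = 0.8, list = FALSE)
train_model <- model_data[split_idx, ]
test_model  <- model_data[-split_idx, ]

# 5 repeats of 10-fold cross-validation for the tree-based models
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

rf_fit  <- train(PotentialFraud ~ ., data = train_model, method = "rf",      trControl = ctrl)
xgb_fit <- train(PotentialFraud ~ ., data = train_model, method = "xgbTree", trControl = ctrl)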

Logistic Regression

## 
## Call:
## glm(formula = PotentialFraud ~ ., family = binomial(link = "logit"), 
##     data = train_model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1527  -0.3056  -0.1542  -0.1321   3.1093  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -5.174758   0.715118  -7.236 4.61e-13 ***
## Total_patients       -0.005360   0.001825  -2.937 0.003315 ** 
## Total_claims         -0.117602   0.060713  -1.937 0.052744 .  
## Code_2449             0.005336   0.030094   0.177 0.859254    
## Code_25000            0.019368   0.021217   0.913 0.361304    
## Code_2720            -0.001736   0.031201  -0.056 0.955636    
## Code_2724             0.055277   0.021120   2.617 0.008863 ** 
## Code_4011            -0.025891   0.029459  -0.879 0.379468    
## Code_4019             0.001694   0.015050   0.113 0.910401    
## Code_42731            0.093493   0.027039   3.458 0.000545 ***
## Code_4280             0.156565   0.029497   5.308 1.11e-07 ***
## Code_V5861           -0.037449   0.029607  -1.265 0.205918    
## Code_5869            -0.056165   0.027712  -2.027 0.042686 *  
## Age_20_40             0.138741   0.063889   2.172 0.029886 *  
## Age_40_60             0.112166   0.060203   1.863 0.062443 .  
## Age_60_80             0.124113   0.061225   2.027 0.042646 *  
## Age_80_100            0.113972   0.060542   1.883 0.059765 .  
## Total_states          0.044576   0.026009   1.714 0.086551 .  
## Totalstay_days        0.203812   0.018816  10.832  < 2e-16 ***
## Total_diagnosiscodes  0.142743   0.046735   3.054 0.002256 ** 
## Total_chronicconds    0.087163   0.111803   0.780 0.435617    
## Total_physicians     -0.198766   0.354965  -0.560 0.575507    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2689.9  on 4328  degrees of freedom
## Residual deviance: 1503.4  on 4307  degrees of freedom
## AIC: 1547.4
## 
## Number of Fisher Scoring iterations: 7

                      Actual Positives   Actual Negatives
Predicted Positives   38 (TP)            6 (FP)
Predicted Negatives   63 (FN)            974 (TN)

TP = True positives (providers correctly identified as fraud)
TN = True negatives (providers correctly identified as not fraud)
FP = False positives (providers incorrectly identified as fraud)
FN = False negatives (providers incorrectly identified as not fraud)

Accuracy = (TP + TN) / Total = (38 + 974) / 1081 = 93%

Recall = TP / (TP + FN) = 38 / 101 = 37%
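
These numbers come from evaluating the model on the held-out test set; a sketch, assuming the factor levels of PotentialFraud are "No"/"Yes" and using the glm call shown in the summary above:

library(caret)

# Fit the logistic model and tabulate test-set predictions against true labels
logit_fit <- glm(PotentialFraud ~ ., family = binomial(link = "logit"), data = train_model)
probs <- predict(logit_fit, newdata = test_model, type = "response")
preds <- factor(ifelse(probs > 0.5, "Yes", "No"), levels = levels(test_model$PotentialFraud))
confusionMatrix(preds, test_model$PotentialFraud, positive = "Yes")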

The overall accuracy is 93%. However, the goal of this project is to identify fraudulent providers accurately, and the poor recall indicates that this model will perform poorly when it comes to predicting fraudulent providers.

Let's see how Random Forest and XGBoost perform.

Random Forest and XGBM

                  Logistic Regression   Random Forest   XGBoost
True Positives                     38              78        62
False Negatives                    63              23        39
Recall (%)                         37              77        61
Accuracy (%)                       93              97        95

Conclusions

  • Random Forest performs best - 77% recall and 97% accuracy
  • With 77% recall, our model can correctly identify ~161,000 fraudulent claims. At an average cost of $1,500 per claim, the insurance company can potentially save ~$240 million (161,000 x $1,500)
  • Significant savings also come from avoiding investigations of all claims, and valid claims can be processed faster