Goal

The goal of this project is to use machine learning to predict fraudulent healthcare providers by analyzing patterns across inpatient and outpatient claims data.

Background

Impact of health care fraud

  • The U.S. spends ~$3.6 trillion on health insurance claims every year
  • ~$300 billion (3-10% of claims) of that is fraudulent
  • Examples of fraudulent claims:
    • Billing for services that were not provided
    • Misrepresenting the services provided (e.g., charging for a more complex procedure)
    • Submitting duplicate claims for the same or different patients
  • Fraudulent claims cause:
    • Increased cost of care
    • Slow processing of valid claims
    • Higher premiums
  • Manual review of billions of claims is time-consuming and expensive

Opportunity for machine learning

  • Machine learning is ideally suited to detecting fraudulent claims
  • Models can be built on existing fraud patterns to automate the assessment of claims
  • Benefits:
    • Faster processing of all claims
      • Identify genuine claims and streamline the approval and payment process
      • Flag fraudulent claims for further review before payment
    • Clear reasons can be provided for flagging a claim
  • Models can be improved to find new patterns and therefore identify new fraud types

Data set

There are four datasets available; you can download them from Kaggle.

  • Inpatient Dataset - Inpatient claims
  • Outpatient Dataset - Outpatient claims

The inpatient and outpatient claims datasets consist of:

  • Provider ID
  • Beneficiary ID
  • Claim ID
  • Claim Start date
  • Claim End date
  • Physician information (3 columns)
  • Diagnosis codes (10 columns)
  • Procedure codes (6 columns)

Additionally, the inpatient dataset includes admission and discharge dates.

Beneficiary Dataset - This data contains beneficiary KYC details such as health conditions and the region they belong to.

  • Beneficiary ID
  • Date of Birth
  • Gender
  • Race
  • State
  • Chronic condition information ( 1 column per condition - 11 columns)

Provider labels - Provider ID and labels (fraudulent and non-fraudulent)

The inpatient and outpatient claims data are provided per claim/patient, while the labels are provided per provider.

Challenge

The challenge here is to analyze individual claims data and find patterns that can then help us predict fraudulent providers. So, we are making the assumption that all claims filed by a fraudulent provider are fraudulent, and vice versa.

Data cleaning

Initial data cleaning:

  1. Dates were read in as factors. Let's convert them to Date objects.
  2. There are some missing values. Let's take a look at them (see the sketch below).
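
A minimal sketch of these two steps, assuming the claims were read into data frames named inpatient and outpatient (those names are assumptions) and that the dates follow the YYYY-MM-DD format; the same conversion applies to both tables:

# Convert date columns that were read in as factors to Date objects
date_cols <- c("ClaimStartDt", "ClaimEndDt", "AdmissionDt", "DischargeDt")
for (col in intersect(date_cols, names(inpatient))) {
  inpatient[[col]] <- as.Date(as.character(inpatient[[col]]), format = "%Y-%m-%d")
}

# Count missing values per column to see where cleaning is needed
sort(colSums(is.na(inpatient)), decreasing = TRUE)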

Missing values

There are lots of missing values in both the inpatient and outpatient datasets.

Dealing with missing values

Step 1:

Dropping the following columns because 100% of the data is missing (sketched below):

  1. ClmProcedureCode 4, 5, and 6 from the inpatient data
  2. ClmProcedureCode 1 through 6 from the outpatient data. After a little research, I found that these columns frequently exist in claims data but are left empty when they are not the basis of payment.
  3. DOD - Date of Death
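
A sketch of these drops, assuming the Kaggle columns are named ClmProcedureCode_1 through ClmProcedureCode_6 and the data frames are inpatient, outpatient, and beneficiary:

# Drop fully missing procedure-code columns and the date-of-death column
inpatient  <- inpatient[, !names(inpatient) %in%
                          c("ClmProcedureCode_4", "ClmProcedureCode_5", "ClmProcedureCode_6")]
outpatient <- outpatient[, !grepl("^ClmProcedureCode", names(outpatient))]
beneficiary$DOD <- NULL
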
Step 2:

Inpatient and Outpatient data sets- Missing values in Physician columns, Diagnosis codes, and procedure codes.

  1. For physician columns - replace NAs with "None"
  2. For diagnosis code and procedure code columns - replace NAs with 0 (see the sketch below)
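
A sketch for the inpatient table (the outpatient table is handled the same way), assuming the physician columns end in "Physician" and the code columns start with ClmDiagnosisCode or ClmProcedureCode:

# Replace missing physician names with "None"
phys_cols <- grep("Physician$", names(inpatient), value = TRUE)
for (col in phys_cols) {
  inpatient[[col]] <- ifelse(is.na(inpatient[[col]]), "None", as.character(inpatient[[col]]))
}

# Replace missing diagnosis and procedure codes with 0
# (assumes the code columns were read as character, not factor)
code_cols <- grep("^(ClmDiagnosisCode|ClmProcedureCode)", names(inpatient), value = TRUE)
inpatient[code_cols][is.na(inpatient[code_cols])] <- 0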

Feature Engineering

  1. Chronic conditions are listed as 1 = Yes, 2 = No. We will convert 2 to 0 just for ease of computation.
  2. Add a new column Age, which will be more useful than date of birth.
  3. Add a new column of state names.
  4. Totalstay_days (total days spent in the hospital) = DischargeDt - AdmissionDt
  5. Claimlength = ClaimEndDt - ClaimStartDt
  6. A column Type with values "Inpatient" and "Outpatient" (these steps are sketched below)
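
A sketch of these steps; the ChronicCond column prefix and the reference date used for age are assumptions, and DOB is assumed to already be a Date:

library(dplyr)

# Lengths of stay and claims; tag each claim with its source dataset
inpatient <- inpatient %>%
  mutate(Totalstay_days = as.numeric(DischargeDt - AdmissionDt),
         Claimlength    = as.numeric(ClaimEndDt - ClaimStartDt),
         Type           = "Inpatient")
outpatient <- outpatient %>%
  mutate(Claimlength = as.numeric(ClaimEndDt - ClaimStartDt),
         Type        = "Outpatient")

# Recode chronic-condition flags from 1/2 to 1/0
chronic_cols <- grep("^ChronicCond", names(beneficiary), value = TRUE)
beneficiary[chronic_cols] <- lapply(beneficiary[chronic_cols],
                                    function(x) ifelse(x == 2, 0, x))

# Age from date of birth (the reference date is an assumption)
beneficiary$Age <- floor(as.numeric(as.Date("2009-12-01") - beneficiary$DOB) / 365.25)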

Join the beneficiary, inpatient, and outpatient datasets for analysis
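
A sketch of the join, assuming the beneficiary key is named BeneID, the provider key is named Provider, and the labels table is named provider_labels:

library(dplyr)

# Stack the two claim types, then attach beneficiary details and provider labels
claims <- bind_rows(inpatient, outpatient) %>%
  left_join(beneficiary, by = "BeneID") %>%
  left_join(provider_labels, by = "Provider")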

OK, let's explore the data now. Each row now holds a unique claim and its variables, along with the label column (PotentialFraud).

Exploratory data analysis

Age and Fraud

Insight:

Age seems to have no impact on fraud.

Gender and Fraud

Insights:

There are more females than males in the dataset; gender appears to have no influence on fraud.

Race and Fraud

Insights:
  1. Race is mostly populated with a single value (race 1), which can cause model bias. Therefore, remove this variable before modeling.

State and Fraud

  1. First, let's take a look at fraudulent claims per state. California, Florida, New York, Pennsylvania, and Texas have more fraudulent claims than other states:
State          Total claims
California            30335
Florida               17512
New York              17492
Pennsylvania          11448
Texas                 10135

  2. Let's take a look at total claims per state to investigate whether these states also have high total claims overall.

The number of claims from flagged providers per state correlates with total claims per state, so there is NO CLEAR STATE BIAS.

  3. Let's investigate whether fraudulent providers file claims across multiple states.

It looks like flagged providers file claims in more states than non-flagged providers. A sketch of these state-level checks follows.
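
A sketch of the two state-level checks above, assuming the joined claims table from earlier and a PotentialFraud label coded "Yes"/"No":

library(dplyr)

# Fraudulent claims per state (top five)
claims %>%
  filter(PotentialFraud == "Yes") %>%
  count(State, sort = TRUE) %>%
  head(5)

# Distinct states each provider files claims in, averaged by label
claims %>%
  group_by(Provider, PotentialFraud) %>%
  summarise(n_states = n_distinct(State), .groups = "drop") %>%
  group_by(PotentialFraud) %>%
  summarise(avg_states = mean(n_states))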

Chronic conditions and Fraud

Diagnosis codes and Fraud

Do fraudulent providers use some diagnosis codes more frequently?

Insights:

It looks like some codes are strongly associated with fraudulent claims.

Procedure codes vs Fraud

Flagged providers file more inpatient claims, which explains the frequency bias of procedure codes.

Hospital stay and fraud

Is there a difference in distributions of hospital stay between fraudulent and non-fraudulent providers?

Number of claims and Fraud

Do fraudulent providers file more claims per patient? A sketch of this check follows.
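
A sketch of this check on the joined claims table:

library(dplyr)

# Average number of claims filed per patient, by provider label
claims %>%
  group_by(Provider, PotentialFraud) %>%
  summarise(claims_per_patient = n() / n_distinct(BeneID), .groups = "drop") %>%
  group_by(PotentialFraud) %>%
  summarise(avg_claims_per_patient = mean(claims_per_patient))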

Getting the dataset ready

  • In our exploratory data analysis, we had 550k rows and 60 features.
  • In order to predict fraudulent providers, we need to map the features onto the provider data, which consists of 5,410 rows.
  • For each provider, the features shown in the flow chart below were calculated (a roll-up sketch follows the chart).
  • The final dataset consists of 5,410 rows and 21 features, enumerated on a per-provider basis.

Flow chart depicting how data was enumerated on a per-provider basis
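
A sketch of the per-provider roll-up; the aggregate names match the model summary further down, but the exact feature definitions here are assumptions:

library(dplyr)

# One row per provider: roll the claim-level features up to provider level
provider_features <- claims %>%
  group_by(Provider) %>%
  summarise(Total_patients = n_distinct(BeneID),
            Total_claims   = n(),
            Total_states   = n_distinct(State),
            Totalstay_days = mean(Totalstay_days, na.rm = TRUE),
            PotentialFraud = first(PotentialFraud))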

Modeling

  • We will employ the following algorithms to generate models:
    - Logistic Regression
    - Random Forest
    - Extreme Gradient Boosting (XGBoost)

  • For each model:
    - Train-test split ratio - 80:20

  • For Random Forest and XGBoost:
    - Repeated k-fold cross-validation (5 repeats of 10-fold cross-validation)

  • Since this is an imbalanced dataset, we will use recall and specificity instead of overall accuracy to validate our predictions. A sketch of this setup follows.
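
A minimal sketch of this setup using the caret package; the seed, object names, and the choice of caret itself are assumptions (the write-up only fixes the split ratio and the cross-validation scheme):

library(caret)

# Drop the provider ID and make the label a factor before modeling
model_data <- provider_features[, names(provider_features) != "Provider"]
model_data$PotentialFraud <- factor(model_data$PotentialFraud)

set.seed(42)  # assumed seed
split_idx   <- createDataPartition(model_data$PotentialFraud, p = 0.8, list = FALSE)
train_model <- model_data[split_idx, ]
test_model  <- model_data[-split_idx, ]

# 5 repeats of 10-fold cross-validation for the tree-based models
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

rf_fit  <- train(PotentialFraud ~ ., data = train_model, method = "rf",      trControl = ctrl)
xgb_fit <- train(PotentialFraud ~ ., data = train_model, method = "xgbTree", trControl = ctrl)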

Logistic Regression

## 
## Call:
## glm(formula = PotentialFraud ~ ., family = binomial(link = "logit"), 
##     data = train_model)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.1527  -0.3056  -0.1542  -0.1321   3.1093  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -5.174758   0.715118  -7.236 4.61e-13 ***
## Total_patients       -0.005360   0.001825  -2.937 0.003315 ** 
## Total_claims         -0.117602   0.060713  -1.937 0.052744 .  
## Code_2449             0.005336   0.030094   0.177 0.859254    
## Code_25000            0.019368   0.021217   0.913 0.361304    
## Code_2720            -0.001736   0.031201  -0.056 0.955636    
## Code_2724             0.055277   0.021120   2.617 0.008863 ** 
## Code_4011            -0.025891   0.029459  -0.879 0.379468    
## Code_4019             0.001694   0.015050   0.113 0.910401    
## Code_42731            0.093493   0.027039   3.458 0.000545 ***
## Code_4280             0.156565   0.029497   5.308 1.11e-07 ***
## Code_V5861           -0.037449   0.029607  -1.265 0.205918    
## Code_5869            -0.056165   0.027712  -2.027 0.042686 *  
## Age_20_40             0.138741   0.063889   2.172 0.029886 *  
## Age_40_60             0.112166   0.060203   1.863 0.062443 .  
## Age_60_80             0.124113   0.061225   2.027 0.042646 *  
## Age_80_100            0.113972   0.060542   1.883 0.059765 .  
## Total_states          0.044576   0.026009   1.714 0.086551 .  
## Totalstay_days        0.203812   0.018816  10.832  < 2e-16 ***
## Total_diagnosiscodes  0.142743   0.046735   3.054 0.002256 ** 
## Total_chronicconds    0.087163   0.111803   0.780 0.435617    
## Total_physicians     -0.198766   0.354965  -0.560 0.575507    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2689.9  on 4328  degrees of freedom
## Residual deviance: 1503.4  on 4307  degrees of freedom
## AIC: 1547.4
## 
## Number of Fisher Scoring iterations: 7

                      Actual Positives   Actual Negatives
Predicted Positives   38 (TP)            6 (FP)
Predicted Negatives   63 (FN)            974 (TN)

TP = True positives (providers correctly identified as fraud)
TN = True negatives (providers correctly identified as not fraud)
FP = False positives (providers incorrectly identified as fraud)
FN = False negatives (providers incorrectly identified as not fraud)

Accuracy = (TP + TN) / Total = (38 + 974) / 1081 = 93%

Recall = TP / (TP + FN) = 38 / 101 = 37%
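
These numbers come from evaluating the model on the held-out test set; a sketch, assuming the factor levels of PotentialFraud are "No"/"Yes" and using the glm call shown in the summary above:

library(caret)

# Fit the logistic model and tabulate test-set predictions against true labels
logit_fit <- glm(PotentialFraud ~ ., family = binomial(link = "logit"), data = train_model)
probs <- predict(logit_fit, newdata = test_model, type = "response")
preds <- factor(ifelse(probs > 0.5, "Yes", "No"), levels = levels(test_model$PotentialFraud))
confusionMatrix(preds, test_model$PotentialFraud, positive = "Yes")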

The overall accuracy is 93%. However, the goal of this project is to identify fraudulent providers accurately, and the poor recall indicates that this model will perform poorly when it comes to predicting fraudulent providers.

Let's see how Random Forest and XGBoost perform.

Random Forest and XGBM

                  Logistic Regression   Random Forest   XGBoost
True Positives                     38              78        62
False Negatives                    63              23        39
Recall (%)                         37              77        61
Accuracy (%)                       93              97        95

Conclusions

  • Random Forest performs best - 77% recall and 97% accuracy
  • With 77% recall, our model can correctly identify ~161,000 fraudulent claims. At an average cost of $1,500 per claim, the insurance company can potentially save ~$240 million (161,000 x $1,500)
  • Significant savings also come from avoiding investigations of all claims, and valid claims can be processed faster