SpringboardCapstone/monisha_bnp_proposed_solution.Rmd at master · monisha-g/SpringboardCapstone · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
---
title: "Solution Proposal for BNP Paribas Cardif Claims Management"
author: "Monisha Gopal"
output: md_document
---


## Problem

BNP Paribas Cardif is a global personal insurance provider. They want to automate the process of determining whether a claim can automatically be accepted or needs to be checked further. The problem is then to take the features available to BNP Paribas Cardif and output the probability that a claim can be accepted.


## Data

This problem comes with a training data set and a test data set (https://www.kaggle.com/c/bnp-paribas-cardif-claims-management/data). The training data set contains 133 variables, 21 of which are categorical variables and 112 of which are numerical variables. 2 of the categorical variables include the ID and the target (whether or not the claim is acceptable).

``` {r train}
load(file="train.RData")
str(train, list.len=10)
```

## Solution Outline
1. Data preprocessing

    a. Go through each feature from v1 to v131,

        i. If more than 25% of the values in the feature are missing, remove the feature

    b. Treat outliers and missing values

        i. Check whether non-missing values are too constant, if so, remove the feature
        ii. For numerical features, replace outliers and missing values with the average of the remaining values
        iii. For categorical features, there are two options,

            A. If between 10-25% of the values are missing, create an entirely new categorical feature called "Missing"
            B. Or replace the missing values with the most frequently occurring features

    c. Check correlation between features

        i. If two features are highly correlated to each other, then remove one of them

2. Feature enrichment (possible methods)

    a. Weight of evidence
    b. Informational value

3. Data split

4. Model

    a. Logistic Regression

5. Cross-validation

    a. 50/50 training data split

## Deliverables
1. Slide deck explaining process
2. Code