Customer Segmentation for Arvato Financial Services

Evangelou Alexandros
13 min read · Mar 28, 2021

Report for Udacity Data Science Nanodegree capstone project: Customer Segmentation Report for Arvato Financial Services

Introduction

It is really important to know who your customers are. First of all, it helps you understand their needs and their special characteristics. Furthermore, it makes it easier to attract new customers with characteristics similar to the ones you already have.

In this project, we will analyze demographics data for customers of a mail-order sales company in Germany, provided by Arvato, comparing it against demographics information for the general population.

Problem Statement

“Given the demographic data of a person, how can a mail-order company acquire new customers?”

We approach this project in two phases:

  1. In the first part of the project, we will use unsupervised learning techniques to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. To accomplish this, we are going to use a dimensionality reduction technique (PCA) to reduce the size of the data and then an unsupervised clustering algorithm to find the different types of customers.
  2. Then, we’ll use a model to predict, based on the demographic information, which individuals are most likely to convert into customers of the company. For the prediction part, we are going to test different machine learning models such as Random Forest and XGBoost. After a brief comparison, we will focus on the models with the best score and optimize their hyperparameters using Bayesian optimization.

Evaluation metric

The prediction will be evaluated in a Kaggle competition. Because the labels are highly imbalanced, the evaluation metric used is the AUC of the ROC curve. The AUC considers both the true positive rate and the false positive rate. A ROC, or receiver operating characteristic, is a graphic that plots the true positive rate (TPR, the proportion of actual customers that are labeled as such) against the false positive rate (FPR, the proportion of non-customers labeled as customers). The plot below shows how imbalanced the two classes are.

Class imbalance

Moreover, the Kaggle competition itself uses the ROC AUC as its evaluation metric.
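As a small illustration of the metric itself, it can be computed with scikit-learn from predicted probabilities; the labels and scores below are made up for the sketch:

from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = customer) and predicted probabilities for the positive class
y_true = [0, 0, 0, 1, 0, 1, 0, 0]
y_score = [0.05, 0.10, 0.30, 0.80, 0.20, 0.65, 0.15, 0.40]

print(roc_auc_score(y_true, y_score))  # 1.0 = perfect ranking, 0.5 = random guessing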

Exploratory Data analysis

This is the most important step in the project, so it is worth spending most of the time here.

First, let’s take a look at the datasets:

AZDIAS: Demographics data for the general population of Germany.

Columns: 366 entries, LNR to ALTERSKATEGORIE_GROB, dtypes: float64(267), int64(93), object(6)

Customers: Demographics data for customers of a mail-order company.

Columns: 369 entries, LNR to ALTERSKATEGORIE_GROB, dtypes: float64(267), int64(94), object(8)

MAILOUT Train/Test: Demographics data for individuals who were targets of a marketing campaign.

Columns: 367 entries, LNR to ALTERSKATEGORIE_GROB, dtypes: float64(267), int64(94), object(6)

We can see at first glance that our datasets contain a lot of NaN values and that most of the features are ordinal or categorical (360 out of 366 in the AZDIAS dataset).

Additionally, two metadata files have been provided to give some information about the attributes:

  • Dias Information Levels — Attribute 2017.xlsx: top-level list of attributes and descriptions organized by information category
  • Dias Attribute-Values 2017.xlsx: detailed description of data values for some of the features

Addressing mixed type columns

While reading the AZDIAS file I got a warning: “DtypeWarning: Columns (18,19) have mixed types”. Those columns are CAMEO_INTL_2015 and CAMEO_DEUG_2015, and they contain the values “XX” and “X” respectively, so I decided to replace those values with NaN since I do not know what they represent.
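A minimal sketch of that replacement with pandas (the column names come from the warning; the toy values and the commented-out read path are illustrative):

import numpy as np
import pandas as pd

# azdias = pd.read_csv("AZDIAS.csv", sep=";")  # illustrative path
azdias = pd.DataFrame({"CAMEO_INTL_2015": ["22", "XX", "41"],
                       "CAMEO_DEUG_2015": ["1", "X", "5"]})  # toy data for the sketch

# Replace the undocumented placeholder values with NaN
azdias["CAMEO_INTL_2015"] = azdias["CAMEO_INTL_2015"].replace("XX", np.nan)
azdias["CAMEO_DEUG_2015"] = azdias["CAMEO_DEUG_2015"].replace("X", np.nan)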

Addressing unknown values

For most of the features, we have a description of the meaning of their values in the `Dias Attribute-Values 2017.xlsx` file. For many features, I noticed that there is a value whose description is “unknown”, while there are also NaN values. So I decided to replace all values whose meaning is “unknown” with NaN.
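A hedged sketch of this step, assuming the “unknown” codes have already been collected from the spreadsheet into a dictionary (the attribute names and codes below are only examples):

import numpy as np
import pandas as pd

# Toy frame standing in for AZDIAS
azdias = pd.DataFrame({"ALTERSKATEGORIE_GROB": [1, -1, 3, 0],
                       "HEALTH_TYP": [2, 0, -1, 3]})

# Hypothetical mapping built from Dias Attribute-Values 2017.xlsx: attribute -> "unknown" codes
unknown_codes = {"ALTERSKATEGORIE_GROB": [-1, 0], "HEALTH_TYP": [-1, 0]}

for column, codes in unknown_codes.items():
    azdias[column] = azdias[column].replace(codes, np.nan)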

Missing values per Feature

AZDIAS dataset: missing values per feature

From the plots above, we can clearly see that most of the features have less than 20% missing values, while a few have more than 80%, so I will remove the features that have more than 35% missing values.

Missing values per User

We can clearly see from the plot below that most of the users in the AZDIAS and CUSTOMERS datasets are not missing more than 50% of their feature values.
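Both cut-offs can be sketched with pandas; the 35% column threshold and the 50% row threshold come from the text, while the DataFrame here is random stand-in data:

import numpy as np
import pandas as pd

# Stand-in frame with roughly 20% missing values
azdias = pd.DataFrame(np.random.choice([1.0, 2.0, np.nan], size=(100, 10), p=[0.4, 0.4, 0.2]))

# Drop features (columns) with more than 35% missing values
azdias = azdias.loc[:, azdias.isnull().mean() <= 0.35]

# Drop users (rows) missing more than 50% of their remaining features
azdias = azdias.loc[azdias.isnull().mean(axis=1) <= 0.50]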

Feature Engineering

The unsupervised learning algorithm that will be used to build the customer segmentation requires numerical values. So we have to convert the object-type features to numerical ones or drop them if we do not need them. The first step was to check all the possible values of every feature, and its description where available, to decide whether some of them are irrelevant.

Some features and their descriptions

After checking all the features, I decided to handle the non-numeric features as follows (a short sketch follows the list):

  • OST_WEST_KZ: a binary feature that will be re-encoded to 0 and 1
  • LNR: a redundant feature that has a different value for every user; it is probably the user id, so it is dropped
  • D19_LETZTER_KAUF_BRANCHE: converted with pd.get_dummies, which turns a categorical variable into dummy/indicator variables
  • EINGEFUEGT_AM: a date-format feature from which I keep only the year, for simplicity
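A hedged sketch of these four steps on a toy frame (the column names come from the list above; the example values and everything else are illustrative):

import pandas as pd

# Toy frame standing in for the real data, with the four non-numeric columns discussed above
df = pd.DataFrame({
    "OST_WEST_KZ": ["O", "W", "W"],
    "LNR": [910215, 910220, 910225],
    "D19_LETZTER_KAUF_BRANCHE": ["D19_BANKEN_DIREKT", "D19_TELKO_MOBILE", "D19_BANKEN_DIREKT"],
    "EINGEFUEGT_AM": ["1992-02-10 00:00:00", "1997-05-14 00:00:00", "1995-05-24 00:00:00"],
})

df["OST_WEST_KZ"] = df["OST_WEST_KZ"].map({"O": 0, "W": 1})        # binary re-encoding
df = df.drop(columns=["LNR"])                                      # drop the user id
df["EINGEFUEGT_AM"] = pd.to_datetime(df["EINGEFUEGT_AM"]).dt.year  # keep only the year
df = pd.get_dummies(df, columns=["D19_LETZTER_KAUF_BRANCHE"])      # one-hot encoding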

Missing values

After the removal of rows and columns with a high number of missing values based on a threshold, there are still a lot of missing values. These have been replaced with -1 in order to make clear that we have no information about those values. Alternatively, replacing them with each feature's most frequent value would also be a reasonable solution.
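A minimal sketch of both options (the -1 fill is the one used; the most-frequent fill via SimpleImputer is the alternative mentioned above):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 2.0, 2.0]})  # stand-in data

filled_constant = df.fillna(-1)  # option used: flag missing values explicitly with -1

# Alternative: replace each column's missing values with its most frequent value
imputer = SimpleImputer(strategy="most_frequent")
filled_mode = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)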

Customer Segmentation

The goal of this section is to find future customers. This can be done by dividing the general population and the customers into different segments. Since the data about the existing customers is available, we can compare them with the general population.

Dimensionality reduction

Principal Component Analysis (PCA)

Due to the large size of the dataset, it is useful to apply a dimensionality reduction algorithm, and PCA is one of the most widely used ones.

Feature scaling

Before applying principal component analysis, we have to perform feature scaling so that the different scales of the features do not bias the result. StandardScaler has been used, which scales each feature to zero mean and a standard deviation of 1.

In the next figure, we can see the cumulative explained variance and conclude that ~90% of the variance is explained by the first 200 components of the PCA. So we can reduce the number of features from 389 to 250.
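A sketch of the scaling and the PCA fit, with the cumulative explained variance used to pick the number of components (random data stands in for the cleaned AZDIAS matrix; 250 is the value chosen above):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 389)  # stand-in for the cleaned feature matrix

# Scale every feature to zero mean and unit variance before PCA
X_scaled = StandardScaler().fit_transform(X)

# Fit a full PCA first to inspect the cumulative explained variance
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(np.argmax(cumulative >= 0.90) + 1)  # components needed for ~90% of the variance

# Refit with the chosen number of components and transform the data
X_reduced = PCA(n_components=250).fit_transform(X_scaled)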

Components analysis

Component 4 corresponds to people that have more expensive or luxury cars like BMW or Mercedes. Those people with more expensive cars are also more likely to own a sports car and less likely to own a car with five seats.

Clustering

The next step after the dimensionality reduction is to divide the customers and the general population into different groups. Due to its simplicity and its appropriateness (it works directly with a measure of distance between two observations), K-means has been chosen for this task.

Kmeans

The number of clusters is an unknown parameter, but a fundamental step for any clustering algorithm is to determine the optimal number of clusters into which the data may be grouped. The Elbow Method is one of the most popular ways to determine this optimal value of k, and it relies on two related quantities:

  1. Distortion: It is calculated as the average of the squared distances from the cluster centers of the respective clusters. Typically, the Euclidean distance metric is used.
  2. Inertia: It is the sum of squared distances of samples to their closest cluster center.

We iterate over values of k from 1 to 20 and calculate the distortion and inertia for each value of k in this range.

To determine the optimal number of clusters, we select the value of k at the “elbow”, i.e. the point after which the distortion/inertia starts decreasing in a linear fashion. For the given data, we conclude that the optimal number of clusters is 9. [3]
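A sketch of the elbow computation with scikit-learn's KMeans over k = 1..20 (random data replaces the PCA-reduced matrix):

import numpy as np
from sklearn.cluster import KMeans

X_reduced = np.random.rand(500, 250)  # stand-in for the PCA-transformed data

inertias = []
for k in range(1, 21):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_reduced)
    inertias.append(km.inertia_)  # inertia: sum of squared distances to the closest centre
    # distortion would be km.inertia_ / len(X_reduced), i.e. the average of those distances

# Plotting inertia against k and picking the "elbow" (k = 9 in this project) gives the cluster count
for k, inertia in zip(range(1, 21), inertias):
    print(k, round(inertia, 1))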

Compare Customer Data to Demographics Data

The figure below shows the proportion of each dataset assigned to each cluster. We can observe that the clusters have a similar distribution across the two datasets, except for clusters 0 and 8.

The mismatch of clusters between AZDIAS and CUSTOMERS dataset indicates that there are only particular segments of the population that are interested in the company’s products.

A first observation is that the customer population is over-represented in some clusters. The three clusters with the largest over-representation are clusters 1, 6, and 8, which suggests that people in those clusters are more likely to be customers. By the same logic, people in clusters 9, 10, and 2 are less likely to be potential customers.

If we explore the “HH_EINKOMMEN_SCORE” feature in the customers dataset for clusters 8 and 10, we can clearly see a difference. In the figure below, “target” refers to the cluster where customers are over-represented (cluster 8 in this case) and “non-target” to the opposite one (cluster 10).

Here is the description of the values of this feature from the provided Excel file:

  • -1, 0: unknown
  • 1: highest income
  • 2: very high income
  • 3: high income
  • 4: average income
  • 5: lower income
  • 6: very low income

It is obvious that the company should target customers with higher income, and the PCA/K-means segmentation captures this correctly.

The LP_STATUS_FEIN feature describes social status: the higher the value, the better.

Customer Acquisition

The last part of the project is to apply supervised learning to the MAILOUT_TRAIN and MAILOUT_TEST datasets in order to predict whether or not a person became a customer of the company following the campaign.

There are many machine learning models to choose from. I started by selecting Random Forest, AdaBoost, SVC, and XGBoost. The performance of those models, based on 5-fold cross-validation (where 5 refers to the number of groups the data sample is split into), is shown in the figure below. The parameters for each model were the defaults of the scikit-learn library.
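A hedged sketch of this comparison (synthetic imbalanced data replaces MAILOUT_TRAIN; the models keep their default parameters, as in the text):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Synthetic, highly imbalanced stand-in for the MAILOUT_TRAIN data
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.98, 0.02], random_state=42)

models = {
    "RandomForest": RandomForestClassifier(random_state=42),
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "SVC": SVC(),
    "XGBoost": XGBClassifier(random_state=42),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")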

Based on the above scores, I decided to focus on boosting algorithms and to optimize their hyperparameters. After some testing I focused on three different models: “LGBMClassifier”, “XGBClassifier”, and “CatBoostClassifier”.

Bayesian optimization

Bayesian optimization trains a machine learning model to predict the best hyperparameters. For each set of hyperparameters, you get a different model performance and thus a different result under your performance metric. Grid search and random search solve the problem by blindly searching the whole parameter space (either systematically on a grid or just randomly), but for models with a large parameter space, like XGBoost, they are slow and painfully inefficient. Bayesian optimization, on the other hand, builds a model of the objective function and explores the parameter space systematically, which is a smarter and much faster way to find good parameters.

Model Evaluation and Validation

5-fold cross-validation was used for all three algorithms that were tested. Cross-validation was used to estimate the skill of the model on unseen data, that is, to use a limited sample to estimate how the model is expected to perform in general.

The best score among the three algorithms was achieved with CatBoost. Unlike XGBoost or other machine learning models, CatBoost handles categorical variables in their native form, which is probably why it gives better results. CatBoost distinguishes itself from LightGBM and XGBoost by focusing on optimizing decision trees for categorical variables, i.e. variables whose different values may have no relation to each other (e.g. apples and oranges). CatBoost handles these categories automatically with no need for preprocessing. The big advantage of doing this inside the algorithm rather than in a preprocessing phase is that the encoding can be adjusted when bootstrap sampling the rows.

Maybe a better encoding of the categorical features and a more exhaustive hyperparameter search would result in a better score with XGBoost.

Finally, I submitted the result to the Kaggle competition and achieved a ROC AUC score of 0.79813.

The BayesianOptimization object works out of the box without much tuning. The package I used is bayesian-optimization, and the main method you should be aware of is maximize, which does exactly what you think it does (for more details check here).

There are many parameters you can pass to maximize, nonetheless, the most important ones are:

  • n_iter: How many steps of Bayesian optimization you want to perform. The more steps, the more likely you are to find a good maximum.
  • init_points: How many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.
optimizer.maximize(
    init_points=2,
    n_iter=3,
)

The hyperparameters that I tried to optimize for CatBoost are listed below (a sketch of the full setup follows):

  • Depth of the tree: 'depth': (1, 4)
  • Coefficient of the L2 regularization term of the cost function: 'l2_leaf_reg': (2, 30)
  • Maximum number of trees that can be built: 'num_boost_round': (100, 1000)
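Putting the pieces together, a hedged sketch of how the optimizer can be wired up with the bayesian-optimization package, using a cross-validated CatBoost AUC as the objective (the data is synthetic; the bounds are the ones listed above):

from bayes_opt import BayesianOptimization
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced stand-in for the training data
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.95, 0.05], random_state=42)

def catboost_cv(depth, l2_leaf_reg, num_boost_round):
    # Objective: mean 5-fold ROC AUC of a CatBoost model with the given hyperparameters
    model = CatBoostClassifier(
        depth=int(depth),                      # BayesianOptimization passes floats
        l2_leaf_reg=l2_leaf_reg,
        num_boost_round=int(num_boost_round),
        loss_function="Logloss",
        eval_metric="AUC",
        learning_rate=0.01,
        random_state=42,
        logging_level="Silent",
    )
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

optimizer = BayesianOptimization(
    f=catboost_cv,
    pbounds={"depth": (1, 4), "l2_leaf_reg": (2, 30), "num_boost_round": (100, 1000)},
    random_state=42,
)
optimizer.maximize(init_points=2, n_iter=3)
print(optimizer.max)  # best AUC found and the hyperparameters that produced it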

So the best parameters are depth=3, l2_leaf_reg=27.7 and num_boost_round=948. The other parameters, which were kept fixed, are:

params = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "learning_rate": 0.01,
    "random_state": 42,
    "logging_level": "Silent",
    "thread_count": 24,
}

For sure there are more parameters that could be tuned, like the learning_rate or the number of iterations, but due to computational and time constraints I wasn't able to try more.
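As a usage note, a sketch of fitting a final model with the best values found (the feature matrix is a stand-in; the hyperparameters are the ones reported above):

from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in for the full training data
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.95, 0.05], random_state=42)

best_model = CatBoostClassifier(
    depth=3, l2_leaf_reg=27.7, num_boost_round=948,   # best values from the Bayesian search
    loss_function="Logloss", eval_metric="AUC", learning_rate=0.01,
    random_state=42, logging_level="Silent", thread_count=24,
)
best_model.fit(X, y)
probabilities = best_model.predict_proba(X)[:, 1]  # probability of becoming a customer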

For the XGBoost and LightGBM hyperparameter tuning you can check the following notebooks: LightGBM.ipynb and XGBoost.ipynb.

The feature importances according to the CatBoost model:

We can see that the most important variable is “D19_SOZIALES”.
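A minimal sketch of how such importances can be read from a fitted CatBoost model (model, data, and feature names are stand-ins):

import pandas as pd
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
feature_names = [f"feature_{i}" for i in range(10)]  # placeholders for the real column names

model = CatBoostClassifier(iterations=100, logging_level="Silent", random_state=42).fit(X, y)

importances = pd.Series(model.get_feature_importance(), index=feature_names)
print(importances.sort_values(ascending=False).head(10))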

Complications

The most time-consuming problem was at the preprocessing step, where I assumed that the dataset for the prediction model had a similar distribution of NaN values per feature and, by mistake, removed some useful features.

Conclusion

In this post, we dived into a real problem from Arvato Financial Solutions and tried to solve it with machine learning. It was really important to have at least some business understanding, especially for the preprocessing part, which is the most important one. Here are the main results:

Exploratory analysis of the demographics data of the general population of Germany and of the customers of a mail-order company, in order to understand their similarities and differences.

Use of column/feature properties in order to preprocess the data.

Part 1: In order to find the groups of individuals that best describe the customers of the company, I used unsupervised learning algorithms, more specifically PCA and KMeans, to segment the general population and then find which clusters the customers fit into.

Part 2: Use of supervised learning to predict potential new customers. I compared different boosting techniques and, with the help of Bayesian optimization, managed to tune some of their hyperparameters to achieve a better score. I also tried to understand how the model works by finding the most important features it uses to make its predictions.

One of the most interesting parts of the project was the highly imbalanced classes, which play a crucial role at every step. They also make it really difficult to achieve a good prediction score, so a good idea would be to try an over/under-sampling technique in order to improve the model.

Next steps

There are a lot of different approaches that can be applied, from data cleaning to model training. The first thing I would try is more feature engineering, getting into more detail about each feature, excluding highly correlated features, and combining the output of the unsupervised algorithm with the prediction model.

For the segmentation, I would try other dimensionality reduction techniques like UMAP, t-SNE, and Independent Component Analysis. For the prediction part of the project, a combination of different models might also result in a better score.

The source code for the project can be found in this Github repository.

Last but not least, I would like to thank Udacity and Arvato Bertelsmann for the dataset and the opportunity to work on this educative project. Thank you for your time, and if you want more information about the Udacity Data Scientist Nanodegree, refer here.

References
