Customer Segmentation for Arvato Financial Services

Report for Udacity Data Science Nanodegree capstone project: Customer Segmentation Report for Arvato Financial Services


  1. Then, we’ll use a model to predict, from the demographic information, which individuals are most likely to become customers of the company. For the prediction part, we are going to test different machine learning models such as Random Forest and XGBoost. After briefly testing the different models, we will focus on the ones with the best score and optimize their hyperparameters with Bayesian optimization.
Class imbalance

Exploratory Data analysis

This is the most important step in the project, so it is worth spending the most time here.

Columns: 366 entries, LNR to ALTERSKATEGORIE_GROB, dtypes: float64(267), int64(93), object(6)
Columns: 369 entries, LNR to ALTERSKATEGORIE_GROB, dtypes: float64(267), int64(94), object(8)
Columns: 367 entries, LNR to ALTERSKATEGORIE_GROB, dtypes: float64(267), int64(94), object(6)
  • Dias Attribute-Values 2017.xlsx: detailed description of data values for some of the features

Missing values per Feature

AZDIAS dataset

Missing values per User

We can clearly see from the plot below that most of the users in the AZDIAS and Customers datasets are missing no more than 50% of their features.
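As a rough illustration of how these missing-value shares can be computed, here is a minimal sketch with pandas. The toy frame, its column names, and the 50% cut-off used as a filter are assumptions made for the example, not values taken from the project.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for AZDIAS / Customers (all values are made up).
toy = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [np.nan, np.nan, 1.0, 2.0],
    "C": [5.0, np.nan, 6.0, 7.0],
    "D": [1.0, np.nan, 2.0, 3.0],
})

# Share of missing values per feature (column) and per user (row).
missing_per_feature = toy.isna().mean(axis=0)
missing_per_user = toy.isna().mean(axis=1)

# Keep users that miss at most 50% of their features (assumed threshold).
ROW_THRESHOLD = 0.5
kept = toy[missing_per_user <= ROW_THRESHOLD]

print(missing_per_feature)
print(missing_per_user)
print(len(kept), "users kept")
```

The same two series can be fed directly to a histogram to reproduce plots like the ones described in this section.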

Feature Engineering

The unsupervised learning algorithm that will be used to build the customer segmentation requires numerical values, so we have to convert the object-type features to numerical ones or drop them if we do not need them. The first step was to check all the possible values of every feature, along with its description where available, to decide whether some of them are irrelevant.

Some features and their descriptions:
  • LNR: a redundant feature with a different value for every user; it is probably the user's ID
  • D19_LETZTER_KAUF_BRANCHE: encoded with pd.get_dummies, which converts a categorical variable into dummy/indicator variables
  • EINGEFUEGT_AM: a date feature; I keep only the year for simplicity
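The three transformations listed above could look roughly like this. It is only a sketch: the rows, the category labels (`cat_a`, `cat_b`) and the date strings are invented for illustration, and the real columns may need extra cleaning first.

```python
import pandas as pd

# Toy rows; IDs, categories and dates are placeholders, not real data.
toy = pd.DataFrame({
    "LNR": [101, 102, 103],
    "D19_LETZTER_KAUF_BRANCHE": ["cat_a", "cat_b", "cat_a"],
    "EINGEFUEGT_AM": [
        "1992-02-10 00:00:00",
        "1997-05-14 00:00:00",
        "1992-02-12 00:00:00",
    ],
})

# LNR looks like a per-user identifier, so it is dropped.
features = toy.drop(columns=["LNR"])

# Keep only the year of the date feature.
features["EINGEFUEGT_AM"] = pd.to_datetime(features["EINGEFUEGT_AM"]).dt.year

# One-hot encode the categorical feature into dummy/indicator columns.
features = pd.get_dummies(features, columns=["D19_LETZTER_KAUF_BRANCHE"])

print(features)
```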

Customer Segmentation

The goal of this section is to find future customers. This is done by dividing the general population and the customers into different segments. Data about the existing customers is available, so we can compare them with the general population.

Dimensionality reduction

Principal Component Analysis (PCA)


The next step after the dimensionality reduction is to divide the customers and the general population into different groups. KMeans has been chosen for this task because it is simple and appropriate here: it groups observations based on a measure of distance between two observations.

  1. Inertia: It is the sum of squared distances of samples to their closest cluster center.
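To make the PCA, KMeans and inertia steps concrete, here is a small sketch with scikit-learn on synthetic data. The random matrix, the 90% explained-variance target and the range of cluster counts are assumptions chosen for the example; they are not the settings used in the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the cleaned, fully numeric population data.
X = rng.normal(size=(300, 20))

# Standardize so that no single feature dominates the components.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough components to explain 90% of the variance (assumed target).
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

# Fit KMeans for several cluster counts and record the inertia of each fit.
inertias = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_reduced)
    inertias[k] = km.inertia_

# Inertia by hand for the last fit: squared distance of every sample
# to its closest cluster center, summed over all samples.
sq_dists = ((X_reduced[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
manual_inertia = sq_dists.min(axis=1).sum()

print(inertias)
print(manual_inertia, km.inertia_)
```

Plotting `inertias` against the number of clusters gives the usual elbow curve, which is one common way to pick a cluster count.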

Compare Customer Data to Demographics Data

The figure below shows the share of each dataset assigned to each cluster. We can observe that the clusters have a similar distribution across the two datasets, except for clusters 0 and 8.
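A comparison like this can be computed from the cluster labels of the two datasets. The sketch below uses made-up labels and only three clusters; `cluster_shares` is a hypothetical helper name introduced for the example.

```python
import pandas as pd

# Hypothetical cluster labels, e.g. the output of a fitted KMeans model.
population_labels = pd.Series([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])
customer_labels = pd.Series([0, 0, 0, 0, 1, 1, 2, 2])


def cluster_shares(labels, n_clusters):
    """Return the fraction of rows assigned to each cluster."""
    shares = labels.value_counts(normalize=True)
    # Make sure empty clusters show up with a share of 0.
    return shares.reindex(range(n_clusters), fill_value=0.0).sort_index()


comparison = pd.DataFrame({
    "population": cluster_shares(population_labels, 3),
    "customers": cluster_shares(customer_labels, 3),
})
# Positive values mean the cluster is over-represented among customers.
comparison["difference"] = comparison["customers"] - comparison["population"]

print(comparison)
```

Clusters with a large positive difference are the ones where the general population most resembles existing customers.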

Customer Acquisition

The last part of the project is to apply supervised learning to the MAILOUT_TRAIN and MAILOUT_TEST datasets to predict whether or not a person became a customer of the company following the campaign.
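Since this is a binary classification problem with imbalanced classes, a quick baseline can be scored with stratified cross-validation and ROC AUC. The following is a sketch on synthetic data with scikit-learn; the 95/5 class split, the Random Forest settings and the use of `class_weight="balanced"` are assumptions for illustration and not necessarily what was used on MAILOUT_TRAIN.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for MAILOUT_TRAIN (the class split is made up).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# class_weight="balanced" re-weights the rare positive class.
model = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=42
)

# Stratified folds keep the share of positives similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(scores.mean())
```

AUC is used here instead of accuracy because, with so few positives, a model that always predicts "no customer" would still look accurate.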

  • init_points: How many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.
Depth of the tree:
'depth': (1, 4)
Coefficient at the L2 regularization term of the cost function.
'l2_leaf_reg': (2, 30)
The maximum number of trees that can be built when solving machine learning problems.
'num_boost_round': (100, 1000)
params = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "learning_rate": 0.01,
    "random_state": 42,
    "logging_level": "Silent",
    "thread_count": 24,
}
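One practical detail when searching over (low, high) intervals like the ones above: an optimizer that samples continuous values will suggest floats, while the tree depth and the number of boosting rounds have to be integers. Below is a minimal sketch of a helper that could sit inside the objective function. The names `PBOUNDS`, `INTEGER_PARAMS`, `FIXED_PARAMS` and `to_model_params` are hypothetical, and how the resulting dictionary is passed to the model is left open.

```python
# Search intervals and fixed settings copied from the text above.
PBOUNDS = {
    "depth": (1, 4),
    "l2_leaf_reg": (2, 30),
    "num_boost_round": (100, 1000),
}
FIXED_PARAMS = {
    "loss_function": "Logloss",
    "eval_metric": "AUC",
    "learning_rate": 0.01,
    "random_state": 42,
    "logging_level": "Silent",
    "thread_count": 24,
}
# Parameters that must be integers (assumption made for this sketch).
INTEGER_PARAMS = {"depth", "num_boost_round"}


def to_model_params(**suggested):
    """Clip suggestions to their bounds and round integer-valued ones."""
    cleaned = {}
    for name, value in suggested.items():
        low, high = PBOUNDS[name]
        value = min(max(value, low), high)
        if name in INTEGER_PARAMS:
            cleaned[name] = int(round(value))
        else:
            cleaned[name] = float(value)
    # Fixed settings first, tuned values on top.
    return {**FIXED_PARAMS, **cleaned}


example = to_model_params(depth=2.7, l2_leaf_reg=11.3, num_boost_round=512.4)
print(example)
```

The objective would then train and score a model with these parameters and return the validation AUC for the optimizer to maximize.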


In this post, we dived into a real problem from Arvato Financial Solutions and tried to solve it with machine learning. It was really important to have at least some business understanding, especially for the preprocessing, which is the most important part. Here are the main results: