The Kaggle competition on Mercedes-Benz Greener Manufacturing has concluded, and the private leaderboard saw a huge shake-up: people who sat in the top 10 for most of the competition failed to stay there on the private leaderboard. Why is that? Let's explore the dataset and find out.
First let’s start with loading our data
import numpy as np
import pandas as pd

train_df = pd.read_csv("../../DataSets/mercedes/train.csv")
test_df = pd.read_csv("../../DataSets/mercedes/test.csv")
Next, let's take a look at the data
In [7]: train_df.shape
Out[7]: (4209, 378)
In [8]: train_df.head(1)
Out[8]:
   ID       y X0 X1  X2 X3 X4 X5 X6 X8 ... X375 X376 X377 X378 X379 \
0   0  130.81  k  v  at  a  d  u  j  o ...    0    0    1    0    0

   X380 X382 X383 X384 X385
0     0    0    0    0    0

[1 rows x 378 columns]
In [9]: train_df.dtypes
Out[9]:
ID        int64
y       float64
X0       object
X1       object
X2       object
X3       object
X4       object
X5       object
X6       object
X8       object
X10       int64
X11       int64
X12       int64
X13       int64
X14       int64
X15       int64
X16       int64
X17       int64
X18       int64
X19       int64
X20       int64
X21       int64
X22       int64
X23       int64
X24       int64
X26       int64
X27       int64
X28       int64
X29       int64
X30       int64
          ...
X355      int64
X356      int64
X357      int64
X358      int64
X359      int64
X360      int64
X361      int64
X362      int64
X363      int64
X364      int64
X365      int64
X366      int64
X367      int64
X368      int64
X369      int64
X370      int64
X371      int64
X372      int64
X373      int64
X374      int64
X375      int64
X376      int64
X377      int64
X378      int64
X379      int64
X380      int64
X382      int64
X383      int64
X384      int64
X385      int64
dtype: object
So we have a few categorical variables (8), the remaining features are integers, and our target variable is a float.
Taking a look at the unique values of the remaining integer columns, we find that they are all binary variables
In [38]: np.unique(train_df[train_df.columns[10:]])
Out[38]: array([0, 1])
Now let’s take a look at the target variable
train_df['y'].hist()
It looks like there might be some outliers, given the tiny number of values above 150, so let's investigate those further
import seaborn as sns

sns.violinplot(train_df['y'].values)
We can see from the violin plot that there are indeed outliers; values above roughly 135 can be considered outliers. We can also see from the distribution that there are two peaks, around 98 and 108
In [56]: len(train_df[train_df['y'] > 140])
Out[56]: 35
We have 35 training instances with a target value above 140, which is approximately 0.8% of our data, so we can handle them with deletion or truncation.
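For instance, a minimal sketch of the truncation option, clipping the target at the 140 threshold used above rather than deleting the rows:

# Truncation: cap extreme targets at the threshold instead of dropping rows
train_df['y'] = train_df['y'].clip(upper=140)

# Deletion would instead be:
# train_df = train_df[train_df['y'] <= 140]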
Another very tricky and problematic issue with data like this (binary and categorical variables) is duplicates, because rows with the same features but different y values confuse models. So let's check whether we have any such rows
In [72]: len(train_df[train_df.drop(["ID", "y"], axis=1).duplicated()])
Out[72]: 298
Wow, almost 300 duplicate rows, that is almost 7% of our dataset!
Handling duplicates
In my opinion the best way to handle these duplicates is to drop them, but since the dataset is small, let's find a moderate solution that conserves the data.
We can aggregate the targets of the duplicated rows, for example by taking their mean or their median; I will go with the mean
X = train_df.drop(["y"], axis=1)
Y = train_df["y"]

def average_dups(x):
    # Replace the target of every row in a duplicate group with the group's mean
    train_df.loc[x.index, "y"] = train_df.loc[x.index, "y"].mean()

# Duplicates are defined on the features only (ID excluded), as in the check above
features = X.drop(["ID"], axis=1)
dups = features[features.duplicated(keep=False)]
dups.groupby(dups.columns.tolist()).apply(average_dups)

# Drop all but the first row of each duplicate group; the kept row now carries the averaged target
train_df.drop(features[features.duplicated()].index.values, axis=0, inplace=True)

X = train_df.drop(["y"], axis=1)
Y = train_df["y"]
X.reset_index(inplace=True, drop=True)
Y.reset_index(inplace=True, drop=True)
Building a model
Now that we have a sense of the data, let's go ahead and build a model and see how it performs
First of all we need to deal with those categorical variables. We will use one-hot encoding, since we can't see any reason to use an integer encoding that would imply ordinal relations in the data. We can do this directly with pandas
# Concatenate train and test so the one-hot columns match, then split back
data = pd.concat((train_df, test_df))
data = pd.get_dummies(data)
train, test = data[0:len(train_df)], data[len(train_df):]
We had to concatenate both sets first to ensure the one-hot mapping is consistent between train and test
Next we drop the ID of each example, because in theory an ID is unique to each example, which implies that it should add no information to our model
train = train.drop(["ID"], axis=1)
test = test.drop(["ID"], axis=1)
We have both of our train/test datasets ready, so let's go ahead and build an XGBoost model
Why use XGBoost? Because it's fast to train, very accurate, and can provide us with some intuition about its decisions, in addition to being heavily favored in Kaggle competitions overall
xgb_params = {
    'n_trees': 400,
    'eta': 0.008,
    'max_depth': 2,
    'subsample': 0.93,
    'objective': 'reg:linear',
    'base_score': np.mean(Y),
    'min_child_weight': 4,
}
These parameters were found earlier using grid search and Bayesian optimization; however, you can start with any parameters and move on to parameter tuning later.
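If you want to reproduce that tuning step, here is a rough grid-search sketch over a couple of these parameters; the grid values are only illustrative, not the grid actually used:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Illustrative grid; n_estimators/learning_rate/subsample fixed to the values above
param_grid = {
    'max_depth': [2, 3, 4],
    'min_child_weight': [1, 4, 8],
}
grid = GridSearchCV(
    XGBRegressor(n_estimators=400, learning_rate=0.008, subsample=0.93),
    param_grid, scoring='r2', cv=5)
grid.fit(train.drop(["y"], axis=1), Y)
print(grid.best_params_, grid.best_score_)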
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def xgb_r2_score(preds, dtrain):
    # Courtesy of Tilii
    labels = dtrain.get_label()
    return 'r2', r2_score(labels, preds)

# Use the one-hot encoded matrices built above (X_Test is the encoded test set)
X = train.drop(["y"], axis=1)
X_Test = test.drop(["y"], axis=1)

X, XVal, Y, YVal = train_test_split(X, Y)

dtrain = xgb.DMatrix(X, Y)
dval = xgb.DMatrix(XVal, YVal)
dtest = xgb.DMatrix(X_Test)

scores = xgb.cv(xgb_params, dtrain, num_boost_round=1500,
                early_stopping_rounds=50, verbose_eval=True,
                feval=xgb_r2_score, maximize=True, nfold=10)
print("Best CV Score is: {}".format(scores['test-r2-mean'].iloc[-1]))

model = xgb.train(xgb_params, dtrain, num_boost_round=1500,
                  feval=xgb_r2_score, maximize=True,
                  early_stopping_rounds=50, verbose_eval=True,
                  evals=[(dval, 'val')])
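To get the leaderboard score mentioned next, a minimal submission sketch; this assumes the competition's usual ID/y submission format and that test_df still holds the original IDs:

# Predict on the test set and write the submission file
preds = model.predict(dtest)
submission = pd.DataFrame({'ID': test_df['ID'].values, 'y': preds})
submission.to_csv('submission.csv', index=False)

# XGBoost can also show which features drive the model, the "intuition" mentioned earlier
xgb.plot_importance(model, max_num_features=20)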
Submitting this model's predictions would score 0.54193 on the private leaderboard, which is around position #2687
This is very low, so let's look at some tricks that did work to improve it
#1 Adding more dense features
Apparently the data is high-dimensional and very sparse, which is not fun for a tree-based model, so we resort to improving the data for our model. We do this by compressing information with dimensionality reduction algorithms; out of these algorithms we'll be using PCA, truncated SVD, ICA, Gaussian random projection, and sparse random projection.
Determining which algorithms to use is a matter of trial and error: first start with PCA, which will improve your scores, then add more projections until the score settles.
from sklearn.decomposition import PCA, FastICA, TruncatedSVD
from sklearn.random_projection import GaussianRandomProjection
from sklearn.random_projection import SparseRandomProjection

pca = PCA(n_components=12)
ica = FastICA(n_components=12, max_iter=1000)
tsvd = TruncatedSVD(n_components=12)
gp = GaussianRandomProjection(n_components=12)
sp = SparseRandomProjection(n_components=12, dense_output=True)

# Fit each decomposition on the training features
x_pca = pd.DataFrame(pca.fit_transform(X))
x_ica = pd.DataFrame(ica.fit_transform(X))
x_tsvd = pd.DataFrame(tsvd.fit_transform(X))
x_gp = pd.DataFrame(gp.fit_transform(X))
x_sp = pd.DataFrame(sp.fit_transform(X))

x_pca.columns = ["pca_{}".format(i) for i in x_pca.columns]
x_ica.columns = ["ica_{}".format(i) for i in x_ica.columns]
x_tsvd.columns = ["tsvd_{}".format(i) for i in x_tsvd.columns]
x_gp.columns = ["gp_{}".format(i) for i in x_gp.columns]
x_sp.columns = ["sp_{}".format(i) for i in x_sp.columns]

X = pd.concat((X, x_pca), axis=1)
X = pd.concat((X, x_ica), axis=1)
X = pd.concat((X, x_tsvd), axis=1)
X = pd.concat((X, x_gp), axis=1)
X = pd.concat((X, x_sp), axis=1)

# Apply the same (already fitted) transforms to the test features
x_test_pca = pd.DataFrame(pca.transform(X_Test))
x_test_ica = pd.DataFrame(ica.transform(X_Test))
x_test_tsvd = pd.DataFrame(tsvd.transform(X_Test))
x_test_gp = pd.DataFrame(gp.transform(X_Test))
x_test_sp = pd.DataFrame(sp.transform(X_Test))

x_test_pca.columns = ["pca_{}".format(i) for i in x_test_pca.columns]
x_test_ica.columns = ["ica_{}".format(i) for i in x_test_ica.columns]
x_test_tsvd.columns = ["tsvd_{}".format(i) for i in x_test_tsvd.columns]
x_test_gp.columns = ["gp_{}".format(i) for i in x_test_gp.columns]
x_test_sp.columns = ["sp_{}".format(i) for i in x_test_sp.columns]

X_Test = pd.concat((X_Test, x_test_pca), axis=1)
X_Test = pd.concat((X_Test, x_test_ica), axis=1)
X_Test = pd.concat((X_Test, x_test_tsvd), axis=1)
X_Test = pd.concat((X_Test, x_test_gp), axis=1)
X_Test = pd.concat((X_Test, x_test_sp), axis=1)
I did not get an improvement in score when I submitted this trial; the number of components probably needs to be tuned, but according to many kernels on Kaggle it improved their scores.
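If you want to try tuning it, here is a rough sketch of sweeping the PCA component count with xgb.cv; X_base is a hypothetical name for the feature matrix before any projections are added, with Y its matching target:

# Only PCA is varied here for brevity; the other projections would be swept the same way
for n_comp in [4, 8, 12, 16]:
    pca_feats = pd.DataFrame(PCA(n_components=n_comp).fit_transform(X_base)).add_prefix("pca_")
    feats = pd.concat((X_base.reset_index(drop=True), pca_feats), axis=1)
    cv = xgb.cv(xgb_params, xgb.DMatrix(feats, Y), num_boost_round=1500,
                early_stopping_rounds=50, nfold=5,
                feval=xgb_r2_score, maximize=True)
    print(n_comp, cv['test-r2-mean'].iloc[-1])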
#2 Adding the ID
Yeah, I know I told you before that the ID is useless, but apparently it's quite the opposite: it turns out that using the ID as a feature improves the private leaderboard score, which indicates that there might be some information in the ID, or a hidden temporal feature
Adding the ID to the features scored 0.54779, which would be position #1674
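Concretely, this just means keeping the ID column, or re-adding it before training; a minimal sketch:

# Re-add the ID as a plain numeric feature (it was dropped earlier)
train['ID'] = train_df['ID'].values
test['ID'] = test_df['ID'].values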
#3 Using a label encoder instead of one-hot encoding
By using a label encoder instead of one-hot encoding, plus the ID, we get a rise in score to 0.54854, which would rank at position #1480, an even better improvement
from sklearn.preprocessing import LabelEncoder

for c in train_df.columns:
    if train_df[c].dtype == 'object':
        lbl = LabelEncoder()
        # Fit on the union of train and test values so every category gets a label
        lbl.fit(list(train_df[c].values) + list(test_df[c].values))
        train_df[c] = lbl.transform(list(train_df[c].values))
        test_df[c] = lbl.transform(list(test_df[c].values))
And combining the three approaches gives an even better score.
These were the most common tricks for the competition. In conclusion, adding more features and the "magic" features seemed to improve scores overall; however, this made the models more prone to overfitting, which is why the huge shake-up happened when the private leaderboard was released.
Moral of the competition: Trust your local CV
Stacking was present in the top 5 spots with this kernel, which is a variation of a kernel that caused a lot of overfitting on the public leaderboard.
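As a generic illustration of the stacking idea (not the linked kernel's exact recipe), a minimal stacked ensemble in a recent scikit-learn could look like this:

from sklearn.ensemble import StackingRegressor, GradientBoostingRegressor
from sklearn.linear_model import ElasticNet, Ridge

stack = StackingRegressor(
    estimators=[
        ('gbm', GradientBoostingRegressor()),
        ('enet', ElasticNet()),
    ],
    final_estimator=Ridge(),  # meta-model fit on out-of-fold base predictions
    cv=5)
stack.fit(train.drop(["y"], axis=1), Y)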
You can take a look at my kernels for hyperparameter tuning using both grid search and Bayesian optimization: