The Kaggle competition on Mercedes-Benz Greener Manufacturing has concluded, and the private leaderboard saw a huge shake-up: people who sat in the top 10 for most of the competition failed to stay there on the private leaderboard. Why is that? Let’s explore the dataset and find out.
First, let’s start by loading our data (along with the libraries we’ll use throughout)
import numpy as np
import pandas as pd
import seaborn as sns

train_df = pd.read_csv("../../DataSets/mercedes/train.csv")
test_df = pd.read_csv("../../DataSets/mercedes/test.csv")
Next, let’s take a look at the data
In [7]: train_df.shape
Out[7]: (4209, 378)

In [8]: train_df.head(1)
Out[8]:
   ID       y X0 X1  X2 X3 X4 X5 X6 X8 ...  X375  X376  X377  X378  X379  X380  X382  X383  X384  X385
0   0  130.81  k  v  at  a  d  u  j  o ...     0     0     1     0     0     0     0     0     0     0

[1 rows x 378 columns]
In [9]: train_df.dtypes
Out[9]:
ID int64
y float64
X0 object
X1 object
X2 object
X3 object
X4 object
X5 object
X6 object
X8 object
X10 int64
X11 int64
X12 int64
X13 int64
X14 int64
X15 int64
X16 int64
X17 int64
X18 int64
X19 int64
X20 int64
X21 int64
X22 int64
X23 int64
X24 int64
X26 int64
X27 int64
X28 int64
X29 int64
X30 int64
...
X355 int64
X356 int64
X357 int64
X358 int64
X359 int64
X360 int64
X361 int64
X362 int64
X363 int64
X364 int64
X365 int64
X366 int64
X367 int64
X368 int64
X369 int64
X370 int64
X371 int64
X372 int64
X373 int64
X374 int64
X375 int64
X376 int64
X377 int64
X378 int64
X379 int64
X380 int64
X382 int64
X383 int64
X384 int64
X385 int64
dtype: object
So we have a handful of categorical variables (8 of them), the rest are integer variables, and our target variable y is a float.
Taking a look at the unique values of the remaining integer columns, we find out that they are all binary variables
In [38]: np.unique(train_df[train_df.columns[10:]])
Out[38]: array([0, 1])
Now let’s take a look at the target variable
train_df['y'].hist()

It looks like there might be some outliers, given the tiny number of values above 150, so let’s investigate those further
sns.violinplot(train_df['y'].values)

We can see from the violin plot that there are indeed outliers; values above roughly 135 can be considered outliers. We can also see from the distribution that there are two peaks, at around 98 and 108.
In [56]: len(train_df[train_df['y'] > 140])
Out[56]: 35
We have 35 training instances with a target value above 140, which is approximately 0.8% of our data, so we can handle those with deletion or truncation.
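For example, here is a minimal sketch of both options; the 140 cutoff just mirrors the check above, and clipping with .clip(upper=...) is one possible way to truncate:
# Option 1: delete the outlier rows entirely
train_df_no_outliers = train_df[train_df["y"] <= 140]

# Option 2: truncate (clip) the target at the cutoff instead of dropping rows
train_df_clipped = train_df.copy()
train_df_clipped["y"] = train_df_clipped["y"].clip(upper=140)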
Another tricky and problematic issue with data like this (binary and categorical variables) is duplicates, because they confuse models, so let’s check whether we have rows that share the same features but have different y values
In [72]: len(train_df[train_df.drop(["ID", "y"], axis=1).duplicated()])
Out[72]: 298
Wow, almost 300 duplicate rows, that is almost 7% of our dataset!
Handling duplicates
In my opinion the best way to handle these duplicates is to drop them, but since the dataset is small, let’s find a more moderate solution that conserves data.
We can aggregate the targets of the duplicate rows, for example by taking their mean or their median; I will go with the mean
def average_dups(x):
    # Replace the target of every row in a duplicate group with the group mean
    train_df.loc[x.index, "y"] = train_df.loc[x.index, "y"].mean()

# Same duplicate check as above: ignore ID and y when grouping
features = train_df.drop(["ID", "y"], axis=1)
dups = features[features.duplicated(keep=False)]
dups.groupby(dups.columns.tolist()).apply(average_dups)

# Keep only the first row of each duplicate group
train_df.drop(features[features.duplicated()].index.values, axis=0, inplace=True)

X = train_df.drop(["y"], axis=1)
Y = train_df["y"]
X.reset_index(inplace=True, drop=True)
Y.reset_index(inplace=True, drop=True)
Building a model
Now that we have a sense of the data, let’s go ahead and start building a model and see how it will perform.
First of all we need to get rid of those categorical variables. To do this we will use one-hot encoding, since we can’t see any reason to use an integer encoding that would imply ordinal relations in the data. We can do this with pandas directly
data = train_df.append(test_df)
data = pd.get_dummies(data)
train, test = data[0:len(train_df)], data[len(train_df):]
We had to concatenate both sets to ensure the mapping is consistent
Next we need to drop the ID of each example: in theory an ID is unique to each example, which implies it should add no information to our models
train = train.drop(["ID"], axis=1)
test = test.drop(["ID"], axis=1)
We have both of our train/test datasets ready, so let’s go ahead and start building an XGBoost model.
Why use XGBoost? Because it’s fast to train, very accurate, and can give us some intuition about its decisions, in addition to being heavily favored in Kaggle competitions overall.
xgb_params = {
    'n_trees': 400,
    'eta': 0.008,
    'max_depth': 2,
    'subsample': 0.93,
    'objective': 'reg:linear',
    'base_score': np.mean(Y),
    'min_child_weight': 4,
}
These parameters were computed earlier using GridSearch and Bayesian Optimization; however, you can start with any parameters and move on to parameter tuning later.
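If you want to reproduce that kind of search, here is a minimal GridSearchCV sketch using the sklearn wrapper for XGBoost; the grid values below are illustrative placeholders rather than the ones found above, and it assumes X and Y are the fully numeric training features and target:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Illustrative grid, not the one actually used for the final parameters
param_grid = {
    'max_depth': [2, 3, 4],
    'learning_rate': [0.005, 0.008, 0.01],
    'subsample': [0.9, 0.93, 0.95],
    'min_child_weight': [2, 4, 6],
}
search = GridSearchCV(XGBRegressor(n_estimators=400), param_grid, scoring='r2', cv=5)
search.fit(X, Y)
print(search.best_params_, search.best_score_)
With a reasonable set of parameters in hand, we can train and cross-validate the model: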
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
def xgb_r2_score(preds, dtrain):
    # Courtesy of Tilii
    labels = dtrain.get_label()
    return 'r2', r2_score(labels, preds)

# X / Y are the encoded training features and target, X_Test the encoded test features
X = train.drop(["y"], axis=1)
Y = train["y"]
X_Test = test.drop(["y"], axis=1)

# Hold out a validation set and wrap everything in DMatrix objects
X, XVal, Y, YVal = train_test_split(X, Y)
dtrain = xgb.DMatrix(X, Y)
dval = xgb.DMatrix(XVal, YVal)
dtest = xgb.DMatrix(X_Test)
scores = xgb.cv(xgb_params, dtrain, num_boost_round=1500, early_stopping_rounds=50, verbose_eval=True, feval=xgb_r2_score, maximize=True, nfold=10)
print("Best CV Score is: {}".format(scores['test-r2-mean'].iloc[-1]))
model = xgb.train(xgb_params, dtrain, num_boost_round=1500, feval=xgb_r2_score, maximize=True, early_stopping_rounds=50, verbose_eval=True, evals=[(dval, 'val')])
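To turn this model into a submission we can predict on the test DMatrix and write out the competition’s ID/y format (the file name here is my own choice):
# Predict on the test set and write the submission file
preds = model.predict(dtest)
submission = pd.DataFrame({"ID": test_df["ID"], "y": preds})
submission.to_csv("submission.csv", index=False)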
Submitting this model’s prediction would score 0.54193 on the private leaderboard, which is around position #2687.
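As a side note, the intuition about its decisions mentioned earlier is easy to get out of the trained booster, for example with a feature importance plot (a quick sketch, assuming matplotlib is available):
import matplotlib.pyplot as plt

# Show the 20 features the booster splits on most often
xgb.plot_importance(model, max_num_features=20)
plt.show()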
A score of 0.54193 is very low, so now let’s look at some tricks that did work to improve it.
#1 Adding more dense features
Apparently the data is high-dimensional and very sparse, which is not fun for a tree-based model, so we resort to improving the data for our model. We do this by compressing information with dimensionality reduction algorithms; out of these, we’ll be using PCA, truncated SVD, ICA, Gaussian random projection and sparse random projection.
Determining which algorithms to use is a matter of trial and error: first you’ll start with PCA, which will improve your scores, then you’ll want to add more projections until the score settles.
from sklearn.decomposition import PCA, FastICA, TruncatedSVD
from sklearn.random_projection import GaussianRandomProjection
from sklearn.random_projection import SparseRandomProjection
pca = PCA(n_components=12)
ica = FastICA(n_components=12, max_iter=1000)
tsvd = TruncatedSVD(n_components=12)
gp = GaussianRandomProjection(n_components=12)
sp = SparseRandomProjection(n_components=12, dense_output=True)
x_pca = pd.DataFrame(pca.fit_transform(X))
x_ica = pd.DataFrame(ica.fit_transform(X))
x_tsvd = pd.DataFrame(tsvd.fit_transform(X))
x_gp = pd.DataFrame(gp.fit_transform(X))
x_sp = pd.DataFrame(sp.fit_transform(X))
x_pca.columns = ["pca_{}".format(i) for i in x_pca.columns]
x_ica.columns = ["ica_{}".format(i) for i in x_ica.columns]
x_tsvd.columns = ["tsvd_{}".format(i) for i in x_tsvd.columns]
x_gp.columns = ["gp_{}".format(i) for i in x_gp.columns]
x_sp.columns = ["sp_{}".format(i) for i in x_sp.columns]
X = pd.concat((X, x_pca), axis=1)
X = pd.concat((X, x_ica), axis=1)
X = pd.concat((X, x_tsvd), axis=1)
X = pd.concat((X, x_gp), axis=1)
X = pd.concat((X, x_sp), axis=1)
x_test_pca = pd.DataFrame(pca.transform(X_Test))
x_test_ica = pd.DataFrame(ica.transform(X_Test))
x_test_tsvd = pd.DataFrame(tsvd.transform(X_Test))
x_test_gp = pd.DataFrame(gp.transform(X_Test))
x_test_sp = pd.DataFrame(sp.transform(X_Test))
x_test_pca.columns = ["pca_{}".format(i) for i in x_test_pca.columns]
x_test_ica.columns = ["ica_{}".format(i) for i in x_test_ica.columns]
x_test_tsvd.columns = ["tsvd_{}".format(i) for i in x_test_tsvd.columns]
x_test_gp.columns = ["gp_{}".format(i) for i in x_test_gp.columns]
x_test_sp.columns = ["sp_{}".format(i) for i in x_test_sp.columns]
X_Test = pd.concat((X_Test, x_test_pca), axis=1)
X_Test = pd.concat((X_Test, x_test_ica), axis=1)
X_Test = pd.concat((X_Test, x_test_tsvd), axis=1)
X_Test = pd.concat((X_Test, x_test_gp), axis=1)
X_Test = pd.concat((X_Test, x_test_sp), axis=1)
I didn’t get an improvement in score when I submitted this trial; probably the number of components needs to be tuned, but according to many kernels on Kaggle, it improved their scores.
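A rough way to tune it, sketched below, assuming X and Y are the training features and target from before the decomposition features were appended: rebuild the PCA features for a few component counts and keep whichever gives the best CV score.
# Hypothetical sketch: search over the number of PCA components with xgb.cv
best_n, best_score = None, -np.inf
for n in (4, 8, 12, 16, 24):
    feats = pd.DataFrame(PCA(n_components=n).fit_transform(X),
                         columns=["pca_{}".format(i) for i in range(n)])
    d = xgb.DMatrix(pd.concat((X.reset_index(drop=True), feats), axis=1), Y)
    cv = xgb.cv(xgb_params, d, num_boost_round=1500, early_stopping_rounds=50,
                feval=xgb_r2_score, maximize=True, nfold=5, verbose_eval=False)
    score = cv['test-r2-mean'].iloc[-1]
    if score > best_score:
        best_n, best_score = n, score
print("Best n_components: {} (CV r2: {:.4f})".format(best_n, best_score))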
#2 Adding the ID
Yeah, I know I told you before that the ID is useless, but apparently it’s quite the opposite: it turns out that using the ID as a feature improves the private leaderboard score, which indicates that there might be some information in the ID or a hidden temporal feature.
Adding the ID to the features scored 0.54779, which would be position #1674.
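For reference, here is a minimal sketch of that trial, reusing the one-hot encoded data frame from before and simply keeping the ID column (the _id suffixed names are my own):
# Keep ID as a plain numeric feature instead of dropping it
train_id, test_id = data[0:len(train_df)], data[len(train_df):]
X_id = train_id.drop(["y"], axis=1)      # ID column retained
Y_id = train_id["y"]
X_Test_id = test_id.drop(["y"], axis=1)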
#3 Using a label encoder instead of one-hot encoding
By using a LabelEncoder instead of OHE, plus the ID, we get a rise in the score to 0.54854, which would rank at position #1480: an even better improvement.
from sklearn.preprocessing import LabelEncoder

# Fit each encoder on the combined train + test values so the mapping is consistent
for c in train_df.columns:
    if train_df[c].dtype == 'object':
        lbl = LabelEncoder()
        lbl.fit(list(train_df[c].values) + list(test_df[c].values))
        train_df[c] = lbl.transform(list(train_df[c].values))
        test_df[c] = lbl.transform(list(test_df[c].values))
And combining the three approaches gives an even better score.
These were the most common tricks for the competition. In conclusion, adding more features and the “magic” features seemed to improve scores overall; however, this made the models more prone to overfitting, which is why the huge shake-up happened when the private leaderboard was released.
Moral of the competition: Trust your local CV
Stacking was present in the top #5 spots with this kernel, which is a variation of a kernel that caused a lot of overfitting on the public leaderboard.
You can take a look at my kernels for hyperparameter tuning using both GridSearch and Bayesian Optimization:
