The Kaggle competition on Mercedes-Benz Greener Manufacturing has concluded, and the private leaderboard saw a huge shake-up: people who sat in the top 10 for most of the competition failed to stay there on the private leaderboard. Why is that? Let's explore the dataset and find out.
First let’s start with loading our data
import numpy as np
import pandas as pd

train_df = pd.read_csv("../../DataSets/mercedes/train.csv")
test_df = pd.read_csv("../../DataSets/mercedes/test.csv")
Next, let's take a look at the data
In [7]: train_df.shape
Out[7]: (4209, 378)
In [8]: train_df.head(1)
Out[8]:
   ID       y X0 X1  X2 X3 X4 X5 X6 X8 ... X375 X376 X377 X378 X379 \
0   0  130.81  k  v  at  a  d  u  j  o ...    0    0    1    0    0

   X380 X382 X383 X384 X385
0     0    0    0    0    0

[1 rows x 378 columns]
In [9]: train_df.dtypes
Out[9]:
ID        int64
y       float64
X0       object
X1       object
X2       object
X3       object
X4       object
X5       object
X6       object
X8       object
X10       int64
X11       int64
X12       int64
X13       int64
X14       int64
X15       int64
X16       int64
X17       int64
X18       int64
X19       int64
X20       int64
X21       int64
X22       int64
X23       int64
X24       int64
X26       int64
X27       int64
X28       int64
X29       int64
X30       int64
          ...
X355      int64
X356      int64
X357      int64
X358      int64
X359      int64
X360      int64
X361      int64
X362      int64
X363      int64
X364      int64
X365      int64
X366      int64
X367      int64
X368      int64
X369      int64
X370      int64
X371      int64
X372      int64
X373      int64
X374      int64
X375      int64
X376      int64
X377      int64
X378      int64
X379      int64
X380      int64
X382      int64
X383      int64
X384      int64
X385      int64
dtype: object
So we have a few categorical variables (8), the remaining features are integers, and our target variable is a float.
Taking a look at the unique values of the remaining integer columns, we find that they are all binary variables
In [38]: np.unique(train_df[train_df.columns[10:]])
Out[38]: array([0, 1])
Now let’s take a look at the target variable
train_df['y'].hist()
It looks like there might be some outliers, given the tiny number of values above 150, so let's investigate those further
import seaborn as sns

sns.violinplot(train_df['y'].values)
We can see from the violin plot that there are indeed outliers; values above roughly 135 can be considered outliers. We can also see from the distribution that there are two peaks, around 98 and 108
In [56]: len(train_df[train_df['y'] > 140])
Out[56]: 35
We have 35 training instances with a target value above 140, which is approximately 0.8% of our data, so we can handle them with deletion or truncation.
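For instance, a minimal sketch of the truncation option, clipping the target at the 140 threshold used above rather than deleting the rows:

# Truncation: cap extreme targets at the threshold instead of dropping rows
train_df['y'] = train_df['y'].clip(upper=140)

# Deletion would instead be:
# train_df = train_df[train_df['y'] <= 140]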
Another very tricky and problematic issue with data like this (binary and categorical variables) is duplicates, because rows with the same features but different y values confuse models. So let's check whether we have any such rows
In [72]: len(train_df[train_df.drop(["ID", "y"], axis=1).duplicated()])
Out[72]: 298
Wow, almost 300 duplicate rows, that is almost 7% of our dataset!
Handling duplicates
In my opinion the best way to handle these duplicates is to drop them, but since the dataset is small, let's find a moderate solution that conserves the data.
We can aggregate the targets of the duplicated rows, for example by taking their mean or their median; I will go with the mean
X = train_df.drop(["y"], axis=1)
Y = train_df["y"]

def average_dups(x):
    # Replace the target of every row in a duplicate group with the group's mean
    train_df.loc[x.index, "y"] = train_df.loc[x.index, "y"].mean()

# Duplicates are defined on the features only (ID excluded), as in the check above
features = X.drop(["ID"], axis=1)
dups = features[features.duplicated(keep=False)]
dups.groupby(dups.columns.tolist()).apply(average_dups)

# Drop all but the first row of each duplicate group; the kept row now carries the averaged target
train_df.drop(features[features.duplicated()].index.values, axis=0, inplace=True)

X = train_df.drop(["y"], axis=1)
Y = train_df["y"]
X.reset_index(inplace=True, drop=True)
Y.reset_index(inplace=True, drop=True)
Building a model
Now that we have a sense of the data, let's go ahead and build a model and see how it performs
First of all we need to deal with those categorical variables. We will use one-hot encoding, since we can't see any reason to use an integer encoding that would imply ordinal relations in the data. We can do this directly with pandas
# Concatenate train and test so the one-hot columns match, then split back
data = pd.concat((train_df, test_df))
data = pd.get_dummies(data)
train, test = data[0:len(train_df)], data[len(train_df):]
We had to concatenate both sets first to ensure the one-hot mapping is consistent between train and test
Next we drop the ID of each example, because in theory an ID is unique to each example, which implies that it should add no information to our model
train = train.drop(["ID"], axis=1)
test = test.drop(["ID"], axis=1)
We have both of our train/test datasets ready, so let's go ahead and build an XGBoost model
Why use XGBoost? Because it's fast to train, very accurate, and can provide us with some intuition about its decisions, in addition to being heavily favored in Kaggle competitions overall
xgb_params = {
    'n_trees': 400,
    'eta': 0.008,
    'max_depth': 2,
    'subsample': 0.93,
    'objective': 'reg:linear',
    'base_score': np.mean(Y),
    'min_child_weight': 4,
}
These parameters were found earlier using grid search and Bayesian optimization; however, you can start with any parameters and move on to parameter tuning later.
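If you want to reproduce that tuning step, here is a rough grid-search sketch over a couple of these parameters; the grid values are only illustrative, not the grid actually used:

from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Illustrative grid; n_estimators/learning_rate/subsample fixed to the values above
param_grid = {
    'max_depth': [2, 3, 4],
    'min_child_weight': [1, 4, 8],
}
grid = GridSearchCV(
    XGBRegressor(n_estimators=400, learning_rate=0.008, subsample=0.93),
    param_grid, scoring='r2', cv=5)
grid.fit(train.drop(["y"], axis=1), Y)
print(grid.best_params_, grid.best_score_)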
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def xgb_r2_score(preds, dtrain):
    # Courtesy of Tilii
    labels = dtrain.get_label()
    return 'r2', r2_score(labels, preds)

# Use the one-hot encoded matrices built above (X_Test is the encoded test set)
X = train.drop(["y"], axis=1)
X_Test = test.drop(["y"], axis=1)

X, XVal, Y, YVal = train_test_split(X, Y)

dtrain = xgb.DMatrix(X, Y)
dval = xgb.DMatrix(XVal, YVal)
dtest = xgb.DMatrix(X_Test)

scores = xgb.cv(xgb_params, dtrain, num_boost_round=1500,
                early_stopping_rounds=50, verbose_eval=True,
                feval=xgb_r2_score, maximize=True, nfold=10)
print("Best CV Score is: {}".format(scores['test-r2-mean'].iloc[-1]))

model = xgb.train(xgb_params, dtrain, num_boost_round=1500,
                  feval=xgb_r2_score, maximize=True,
                  early_stopping_rounds=50, verbose_eval=True,
                  evals=[(dval, 'val')])
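To get the leaderboard score mentioned next, a minimal submission sketch; this assumes the competition's usual ID/y submission format and that test_df still holds the original IDs:

# Predict on the test set and write the submission file
preds = model.predict(dtest)
submission = pd.DataFrame({'ID': test_df['ID'].values, 'y': preds})
submission.to_csv('submission.csv', index=False)

# XGBoost can also show which features drive the model, the "intuition" mentioned earlier
xgb.plot_importance(model, max_num_features=20)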
Submitting this model's predictions would score 0.54193 on the private leaderboard, which is around position #2687
This is very low, so let's look at some tricks that did work to improve it
#1 Adding more dense features
Apparently the data is high-dimensional and very sparse, which is not fun for a tree-based model, so we resort to improving the data for our model. We do this by compressing information with dimensionality reduction algorithms; out of these algorithms we'll be using PCA, truncated SVD, ICA, Gaussian random projection, and sparse random projection.
Determining which algorithms to use is a matter of trial and error: first start with PCA, which will improve your scores, then add more projections until the score settles.
from sklearn.decomposition import PCA, FastICA, TruncatedSVD
from sklearn.random_projection import GaussianRandomProjection
from sklearn.random_projection import SparseRandomProjection

pca = PCA(n_components=12)
ica = FastICA(n_components=12, max_iter=1000)
tsvd = TruncatedSVD(n_components=12)
gp = GaussianRandomProjection(n_components=12)
sp = SparseRandomProjection(n_components=12, dense_output=True)

# Fit each decomposition on the training features
x_pca = pd.DataFrame(pca.fit_transform(X))
x_ica = pd.DataFrame(ica.fit_transform(X))
x_tsvd = pd.DataFrame(tsvd.fit_transform(X))
x_gp = pd.DataFrame(gp.fit_transform(X))
x_sp = pd.DataFrame(sp.fit_transform(X))

x_pca.columns = ["pca_{}".format(i) for i in x_pca.columns]
x_ica.columns = ["ica_{}".format(i) for i in x_ica.columns]
x_tsvd.columns = ["tsvd_{}".format(i) for i in x_tsvd.columns]
x_gp.columns = ["gp_{}".format(i) for i in x_gp.columns]
x_sp.columns = ["sp_{}".format(i) for i in x_sp.columns]

X = pd.concat((X, x_pca), axis=1)
X = pd.concat((X, x_ica), axis=1)
X = pd.concat((X, x_tsvd), axis=1)
X = pd.concat((X, x_gp), axis=1)
X = pd.concat((X, x_sp), axis=1)

# Apply the same (already fitted) transforms to the test features
x_test_pca = pd.DataFrame(pca.transform(X_Test))
x_test_ica = pd.DataFrame(ica.transform(X_Test))
x_test_tsvd = pd.DataFrame(tsvd.transform(X_Test))
x_test_gp = pd.DataFrame(gp.transform(X_Test))
x_test_sp = pd.DataFrame(sp.transform(X_Test))

x_test_pca.columns = ["pca_{}".format(i) for i in x_test_pca.columns]
x_test_ica.columns = ["ica_{}".format(i) for i in x_test_ica.columns]
x_test_tsvd.columns = ["tsvd_{}".format(i) for i in x_test_tsvd.columns]
x_test_gp.columns = ["gp_{}".format(i) for i in x_test_gp.columns]
x_test_sp.columns = ["sp_{}".format(i) for i in x_test_sp.columns]

X_Test = pd.concat((X_Test, x_test_pca), axis=1)
X_Test = pd.concat((X_Test, x_test_ica), axis=1)
X_Test = pd.concat((X_Test, x_test_tsvd), axis=1)
X_Test = pd.concat((X_Test, x_test_gp), axis=1)
X_Test = pd.concat((X_Test, x_test_sp), axis=1)
I did not get an improvement in score when I submitted this trial; the number of components probably needs to be tuned, but according to many kernels on Kaggle it improved their scores.
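If you want to try tuning it, here is a rough sketch of sweeping the PCA component count with xgb.cv; X_base is a hypothetical name for the feature matrix before any projections are added, with Y its matching target:

# Only PCA is varied here for brevity; the other projections would be swept the same way
for n_comp in [4, 8, 12, 16]:
    pca_feats = pd.DataFrame(PCA(n_components=n_comp).fit_transform(X_base)).add_prefix("pca_")
    feats = pd.concat((X_base.reset_index(drop=True), pca_feats), axis=1)
    cv = xgb.cv(xgb_params, xgb.DMatrix(feats, Y), num_boost_round=1500,
                early_stopping_rounds=50, nfold=5,
                feval=xgb_r2_score, maximize=True)
    print(n_comp, cv['test-r2-mean'].iloc[-1])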
#2 Adding the ID
Yeah, I know I told you before that the ID is useless, but apparently it's quite the opposite: it turns out that using the ID as a feature improves the private leaderboard score, which indicates that there might be some information in the ID, or a hidden temporal feature
Adding the ID to the features scored 0.54779, which would be position #1674
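Concretely, this just means keeping the ID column, or re-adding it before training; a minimal sketch:

# Re-add the ID as a plain numeric feature (it was dropped earlier)
train['ID'] = train_df['ID'].values
test['ID'] = test_df['ID'].values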
#3 Using a label encoder instead of one-hot encoding
By using a label encoder instead of one-hot encoding, plus the ID, we get a rise in score to 0.54854, which would rank at position #1480, an even better improvement
from sklearn.preprocessing import LabelEncoder

for c in train_df.columns:
    if train_df[c].dtype == 'object':
        lbl = LabelEncoder()
        # Fit on the union of train and test values so every category gets a label
        lbl.fit(list(train_df[c].values) + list(test_df[c].values))
        train_df[c] = lbl.transform(list(train_df[c].values))
        test_df[c] = lbl.transform(list(test_df[c].values))
And combining the three approaches gives an even better score.
These were the most common tricks for the competition. In conclusion, adding more features and the "magic" features seemed to improve scores overall; however, this made the models more prone to overfitting, which is why the huge shake-up happened when the private leaderboard was released.
Moral of the competition: Trust your local CV
Stacking was present in the top 5 spots with this kernel, which is a variation of a kernel that caused a lot of overfitting on the public leaderboard.
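As a generic illustration of the stacking idea (not the linked kernel's exact recipe), a minimal stacked ensemble in a recent scikit-learn could look like this:

from sklearn.ensemble import StackingRegressor, GradientBoostingRegressor
from sklearn.linear_model import ElasticNet, Ridge

stack = StackingRegressor(
    estimators=[
        ('gbm', GradientBoostingRegressor()),
        ('enet', ElasticNet()),
    ],
    final_estimator=Ridge(),  # meta-model fit on out-of-fold base predictions
    cv=5)
stack.fit(train.drop(["y"], axis=1), Y)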
You can take a look at my kernels for hyperparameter tuning using both grid search and Bayesian optimization: