Optimizing sklearn Models with Model Stacking
Published on Aug. 22, 2023, 12:10 p.m.
Model stacking is a way to improve model predictions by combining the outputs of several models and running them through another machine learning model, called a meta-learner. It is a popular strategy for winning Kaggle competitions, yet despite its usefulness it is rarely covered in data science articles, something I hope to change.
In essence, a stacked model works by feeding the outputs of multiple models into the meta-learner (usually a linear regressor/classifier, but it can also be another model such as a decision tree). The meta-learner tries to minimize each individual model's weaknesses and maximize its strengths. The result is usually a very robust model that generalizes well to unseen data.
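The notebook below averages its two models' predictions directly rather than training a meta-learner, but for reference, here is a minimal stacking sketch using sklearn's StackingRegressor; the estimators and the synthetic data are illustrative assumptions, not taken from the notebook.
# Minimal stacking sketch: two base regressors feed a linear meta-learner (illustrative only)
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)

stack = StackingRegressor(
    estimators=[('gb', GradientBoostingRegressor(random_state=0)),
                ('enet', ElasticNet(alpha=0.1))],
    final_estimator=LinearRegression(),   # the meta-learner
    cv=5)                                  # out-of-fold predictions feed the meta-learner
print(cross_val_score(stack, X, y, cv=5).mean())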
How to get to TOP 25% with Simple Model using sklearn only
by Sergei Neviadomski
Importing libraries and data
This is the simple ensemble model that helped me get into the top 40%. I'll briefly walk you through all the steps I took during my analysis and model building.
# Adding needed libraries and reading data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import ensemble, tree, linear_model
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.utils import shuffle
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
train.head()
 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | … | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | … | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | … | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | … | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | … | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | … | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
Checking for NAs
#Checking for missing data
NAs = pd.concat([train.isnull().sum(), test.isnull().sum()], axis=1, keys=['Train', 'Test'])
NAs[NAs.sum(axis=1) > 0]
 | Train | Test
---|---|---
Alley | 1369 | 1352.0 |
BsmtCond | 37 | 45.0 |
BsmtExposure | 38 | 44.0 |
BsmtFinSF1 | 0 | 1.0 |
BsmtFinSF2 | 0 | 1.0 |
BsmtFinType1 | 37 | 42.0 |
BsmtFinType2 | 38 | 42.0 |
BsmtFullBath | 0 | 2.0 |
BsmtHalfBath | 0 | 2.0 |
BsmtQual | 37 | 44.0 |
BsmtUnfSF | 0 | 1.0 |
Electrical | 1 | 0.0 |
Exterior1st | 0 | 1.0 |
Exterior2nd | 0 | 1.0 |
Fence | 1179 | 1169.0 |
FireplaceQu | 690 | 730.0 |
Functional | 0 | 2.0 |
GarageArea | 0 | 1.0 |
GarageCars | 0 | 1.0 |
GarageCond | 81 | 78.0 |
GarageFinish | 81 | 78.0 |
GarageQual | 81 | 78.0 |
GarageType | 81 | 76.0 |
GarageYrBlt | 81 | 78.0 |
KitchenQual | 0 | 1.0 |
LotFrontage | 259 | 227.0 |
MSZoning | 0 | 4.0 |
MasVnrArea | 8 | 15.0 |
MasVnrType | 8 | 16.0 |
MiscFeature | 1406 | 1408.0 |
PoolQC | 1453 | 1456.0 |
SaleType | 0 | 1.0 |
TotalBsmtSF | 0 | 1.0 |
Utilities | 0 | 2.0 |
Defining my helper functions
# Prints R2 and RMSE scores
def get_score(prediction, labels):
    print('R2: {}'.format(r2_score(prediction, labels)))
    print('RMSE: {}'.format(np.sqrt(mean_squared_error(prediction, labels))))

# Shows scores for train and validation sets
def train_test(estimator, x_trn, x_tst, y_trn, y_tst):
    prediction_train = estimator.predict(x_trn)
    # Printing the estimator
    print(estimator)
    # Printing train scores
    get_score(prediction_train, y_trn)
    prediction_test = estimator.predict(x_tst)
    # Printing validation scores
    print("Test")
    get_score(prediction_test, y_tst)
Splitting into features and labels and dropping variables I don't need
# Splitting into features and labels and dropping variables I don't need
train_labels = train.pop('SalePrice')
features = pd.concat([train, test], keys=['train', 'test'])
# I decided to drop features that have more than half of their values missing or that do not correlate with SalePrice
features.drop(['Utilities', 'RoofMatl', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'Heating', 'LowQualFinSF',
'BsmtFullBath', 'BsmtHalfBath', 'Functional', 'GarageYrBlt', 'GarageArea', 'GarageCond', 'WoodDeckSF',
'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal'],
axis=1, inplace=True)
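The correlation screening mentioned in the comment above isn't shown in the notebook; a hedged sketch of how the numeric side of it might be checked (re-attaching SalePrice, which was popped off earlier) could look like this:
# Hypothetical check of numeric-feature correlations with SalePrice (illustrative, not the author's code)
corr_check = pd.concat([train, train_labels], axis=1)
print(corr_check.select_dtypes(include=[np.number]).corr()['SalePrice'].sort_values(ascending=False))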
Filling NAs and converting features
# MSSubClass as str
features['MSSubClass'] = features['MSSubClass'].astype(str)
# MSZoning NA in pred. filling with most popular values
features['MSZoning'] = features['MSZoning'].fillna(features['MSZoning'].mode()[0])
# LotFrontage NA in both sets. Filling with the mean value
features['LotFrontage'] = features['LotFrontage'].fillna(features['LotFrontage'].mean())
# Alley NA in all. NA means no access
features['Alley'] = features['Alley'].fillna('NOACCESS')
# Converting OverallCond to str
features.OverallCond = features.OverallCond.astype(str)
# MasVnrType NA in all. filling with most popular values
features['MasVnrType'] = features['MasVnrType'].fillna(features['MasVnrType'].mode()[0])
# BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2
# NA in all. NA means No basement
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    features[col] = features[col].fillna('NoBSMT')
# TotalBsmtSF NA in pred. I suppose NA means 0
features['TotalBsmtSF'] = features['TotalBsmtSF'].fillna(0)
# Electrical NA in pred. filling with most popular values
features['Electrical'] = features['Electrical'].fillna(features['Electrical'].mode()[0])
# KitchenAbvGr to categorical
features['KitchenAbvGr'] = features['KitchenAbvGr'].astype(str)
# KitchenQual NA in pred. filling with most popular values
features['KitchenQual'] = features['KitchenQual'].fillna(features['KitchenQual'].mode()[0])
# FireplaceQu NA in all. NA means No Fireplace
features['FireplaceQu'] = features['FireplaceQu'].fillna('NoFP')
# GarageType, GarageFinish, GarageQual NA in all. NA means No Garage
for col in ('GarageType', 'GarageFinish', 'GarageQual'):
    features[col] = features[col].fillna('NoGRG')
# GarageCars NA in pred. I suppose NA means 0
features['GarageCars'] = features['GarageCars'].fillna(0.0)
# SaleType NA in pred. filling with most popular values
features['SaleType'] = features['SaleType'].fillna(features['SaleType'].mode()[0])
# Year and Month to categorical
features['YrSold'] = features['YrSold'].astype(str)
features['MoSold'] = features['MoSold'].astype(str)
# Adding total sqfootage feature and removing Basement, 1st and 2nd floor features
features['TotalSF'] = features['TotalBsmtSF'] + features['1stFlrSF'] + features['2ndFlrSF']
features.drop(['TotalBsmtSF', '1stFlrSF', '2ndFlrSF'], axis=1, inplace=True)
Log transformation
# Our SalePrice is skewed right (check the plot below), so I'm log-transforming it
ax = sns.distplot(train_labels)
## Log transformation of labels
train_labels = np.log(train_labels)
## Now it looks much better
ax = sns.distplot(train_labels)
Standardizing numeric data
## Standardizing numeric features
numeric_features = features.loc[:,['LotFrontage', 'LotArea', 'GrLivArea', 'TotalSF']]
numeric_features_standardized = (numeric_features - numeric_features.mean())/numeric_features.std()
ax = sns.pairplot(numeric_features_standardized)
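The same z-score scaling could also be done with sklearn's StandardScaler; a small equivalent sketch, not part of the original notebook (note that StandardScaler divides by the population standard deviation, so the values differ marginally from pandas' sample std):
from sklearn.preprocessing import StandardScaler

# Equivalent scaling of the same four numeric columns with sklearn
scaler = StandardScaler()
numeric_features_scaled = pd.DataFrame(scaler.fit_transform(numeric_features),
                                       index=numeric_features.index,
                                       columns=numeric_features.columns)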
Converting categorical data to dummies
# Getting Dummies from Condition1 and Condition2
conditions = set([x for x in features['Condition1']] + [x for x in features['Condition2']])
dummies = pd.DataFrame(data=np.zeros((len(features.index), len(conditions))),
                       index=features.index, columns=list(conditions))
for i, cond in enumerate(zip(features['Condition1'], features['Condition2'])):
    # Mark this row's condition columns (replaces the removed pandas .ix indexer)
    dummies.iloc[i, dummies.columns.get_indexer(list(set(cond)))] = 1
features = pd.concat([features, dummies.add_prefix('Condition_')], axis=1)
features.drop(['Condition1', 'Condition2'], axis=1, inplace=True)
# Getting Dummies from Exterior1st and Exterior2nd
exteriors = set([x for x in features['Exterior1st']] + [x for x in features['Exterior2nd']])
dummies = pd.DataFrame(data=np.zeros((len(features.index), len(exteriors))),
                       index=features.index, columns=list(exteriors))
for i, ext in enumerate(zip(features['Exterior1st'], features['Exterior2nd'])):
    # Mark this row's exterior columns, skipping NaN (replaces the removed pandas .ix indexer)
    dummies.iloc[i, dummies.columns.get_indexer([x for x in set(ext) if pd.notnull(x)])] = 1
features = pd.concat([features, dummies.add_prefix('Exterior_')], axis=1)
features.drop(['Exterior1st', 'Exterior2nd', 'Exterior_nan'], axis=1, inplace=True)
# Getting Dummies from all other categorical vars
for col in features.dtypes[features.dtypes == 'object'].index:
    for_dummy = features.pop(col)
    features = pd.concat([features, pd.get_dummies(for_dummy, prefix=col)], axis=1)
Obtaining standardized dataset
### Copying features
features_standardized = features.copy()
### Replacing numeric features by standardized values
features_standardized.update(numeric_features_standardized)
Splitting train and test features
### Splitting features
train_features = features.loc['train'].drop('Id', axis=1).select_dtypes(include=[np.number]).values
test_features = features.loc['test'].drop('Id', axis=1).select_dtypes(include=[np.number]).values
### Splitting standardized features
train_features_st = features_standardized.loc['train'].drop('Id', axis=1).select_dtypes(include=[np.number]).values
test_features_st = features_standardized.loc['test'].drop('Id', axis=1).select_dtypes(include=[np.number]).values
Splitting to train and validation sets
### Shuffling train sets
train_features_st, train_features, train_labels = shuffle(train_features_st, train_features, train_labels, random_state = 5)
### Splitting
x_train, x_test, y_train, y_test = train_test_split(train_features, train_labels, test_size=0.1, random_state=200)
x_train_st, x_test_st, y_train_st, y_test_st = train_test_split(train_features_st, train_labels, test_size=0.1, random_state=200)
First level models
My analysis revealed that Gradient Boosting and Elastic Net (using standardized features) give the best results.
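The comparison itself isn't included in the notebook; a hedged sketch of how a few first-level candidates might be screened with cross_val_score (the candidate list and settings here are assumptions) could look like this:
# Hypothetical screening of first-level models (illustrative settings, not the author's exact code)
candidates = {
    'ElasticNet': linear_model.ElasticNet(alpha=0.001, l1_ratio=0.5),
    'GradientBoosting': ensemble.GradientBoostingRegressor(),
    'RandomForest': ensemble.RandomForestRegressor(n_estimators=200),
}
for name, model in candidates.items():
    scores = cross_val_score(model, train_features_st, train_labels, cv=5)
    print('%s: %0.3f (+/- %0.3f)' % (name, scores.mean(), scores.std() * 2))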
Elastic Net
I'm using the ElasticNetCV estimator to choose the best alpha and l1_ratio for my Elastic Net model.
ENSTest = linear_model.ElasticNetCV(alphas=[0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 10], l1_ratio=[.01, .1, .5, .9, .99], max_iter=5000).fit(x_train_st, y_train_st)
train_test(ENSTest, x_train_st, x_test_st, y_train_st, y_test_st)
ElasticNetCV(alphas=[0.0001, 0.0005, 0.001, 0.01, 0.1, 1, 10], copy_X=True,
cv=None, eps=0.001, fit_intercept=True,
l1_ratio=[0.01, 0.1, 0.5, 0.9, 0.99], max_iter=5000, n_alphas=100,
n_jobs=1, normalize=False, positive=False, precompute='auto',
random_state=None, selection='cyclic', tol=0.0001, verbose=0)
R2: 0.9009283351517985
RMSE: 0.11921418263870308
Test
R2: 0.8967298861915098
RMSE: 0.11097045657712076
# Average R2 score and standard deviation of 5-fold cross-validation
scores = cross_val_score(ENSTest, train_features_st, train_labels, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.88 (+/- 0.10)
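The hyper-parameters that ElasticNetCV settled on can be read off the fitted estimator; a quick check (not in the original notebook):
# Inspecting the alpha and l1_ratio selected by the internal cross-validation
print('Chosen alpha:', ENSTest.alpha_)
print('Chosen l1_ratio:', ENSTest.l1_ratio_)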
Gradient Boosting
We use a lot of features and have many outliers, so I'm using max_features='sqrt' to reduce overfitting of my model. I also use loss='huber' because it is more tolerant to outliers. All other hyper-parameters were chosen using GridSearchCV.
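The grid itself isn't shown; a hypothetical GridSearchCV setup along these lines (the parameter ranges are assumptions) might look like this:
from sklearn.model_selection import GridSearchCV

# Hypothetical grid around the values used below; the author's actual grid is not shown
param_grid = {'max_depth': [3, 4, 5],
              'min_samples_leaf': [10, 15, 20],
              'min_samples_split': [5, 10, 15]}
gb_search = GridSearchCV(
    ensemble.GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                       max_features='sqrt', loss='huber'),
    param_grid, cv=5)
gb_search.fit(x_train, y_train)
print(gb_search.best_params_)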
GBest = ensemble.GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05, max_depth=3, max_features='sqrt',
min_samples_leaf=15, min_samples_split=10, loss='huber').fit(x_train, y_train)
train_test(GBest, x_train, x_test, y_train, y_test)
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.05, loss='huber', max_depth=3,
max_features='sqrt', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=15, min_samples_split=10,
min_weight_fraction_leaf=0.0, n_estimators=3000,
n_iter_no_change=None, presort='auto', random_state=None,
subsample=1.0, tol=0.0001, validation_fraction=0.1, verbose=0,
warm_start=False)
R2: 0.9605747125044706
RMSE: 0.0772628427470985
Test
R2: 0.9109468670183566
RMSE: 0.10411858605847635
# Average R2 score and standard deviation of 5-fold cross-validation
scores = cross_val_score(GBest, train_features_st, train_labels, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.89 (+/- 0.04)
Ensembling final model
My final ensemble model is an average of the Gradient Boosting and Elastic Net predictions. Before averaging, I retrain both models on all of the training data.
# Retraining models
GB_model = GBest.fit(train_features, train_labels)
ENST_model = ENSTest.fit(train_features_st, train_labels)
## Getting our SalePrice estimation
Final_labels = (np.exp(GB_model.predict(test_features)) + np.exp(ENST_model.predict(test_features_st))) / 2
## Saving to CSV
pd.DataFrame({'Id': test.Id, 'SalePrice': Final_labels}).to_csv('2017-02-28.csv', index=False)
I’ll be glad to hear suggestions on improving my models.
Please upvote if you like my notebook :)
Original source
https://www.kaggle.com/zusmani/how-to-get-to-top-25-with-simple-model-sklearn/notebook