Between Engineering and Medicine

I want to be an engineer who can pass for a doctor. Taking things slow here.

【Kaggle】 Learn Feature Engineering④

Part 4 (and the final part) of my series on studying feature engineering with Kaggle Learn.

www.kaggle.com

 

Overview) With too many features, the model overfits and processing also gets heavy, so we need to select features.

 

1) Which data to use for feature selection?

Including validation and test data within the feature selection is a source of leakage. You'll want to perform feature selection on the train set only, then use the results there to remove features from the validation and test sets.
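To make the later snippets concrete, here is a minimal, self-contained sketch of that order of operations. The data is synthetic, but it reuses the same names (train, valid, test, feature_cols, 'is_attributed') that the course-style snippets below rely on:

    import numpy as np
    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_classif

    # Synthetic stand-in for the dataset, reusing the variable names used below
    rng = np.random.default_rng(0)
    feature_cols = [f'f{i}' for i in range(50)]
    data = pd.DataFrame(rng.normal(size=(1000, 50)), columns=feature_cols)
    data['is_attributed'] = (data['f0'] + rng.normal(size=1000) > 0).astype(int)

    # Split BEFORE selecting features, so valid/test never influence the selection
    train, valid, test = data.iloc[:600], data.iloc[600:800], data.iloc[800:]

    # Fit the selector on train only, then apply the same transform to valid/test
    selector = SelectKBest(f_classif, k=5).fit(train[feature_cols], train['is_attributed'])
    valid_selected = selector.transform(valid[feature_cols])
    test_selected = selector.transform(test[feature_cols])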

 

2) Univariate Feature Selection

Create the selector with SelectKBest using f_classif as the scoring function. The .fit_transform method will return an array with the best features retained. However, it doesn't retain the column names so you'll need to get them back. The easiest way is to use selector.inverse_transform(X_new) to get back an array with the same shape as the original features but with the dropped columns zeroed out. From this you can build a DataFrame with the same index and columns as the original features. From here, you can find the names of the dropped columns by finding all the columns with a variance of zero.

 

    from sklearn.feature_selection import SelectKBest, f_classif
    import pandas as pd

    # Do feature extraction on the training data only!
    selector = SelectKBest(f_classif, k=40)
    X_new = selector.fit_transform(train[feature_cols], train['is_attributed'])

    # Get back the features we've kept, zero out all other features
    selected_features = pd.DataFrame(selector.inverse_transform(X_new),
                                     index=train.index,
                                     columns=feature_cols)

    # Dropped columns have values of all 0s, so var is 0, drop them
    dropped_columns = selected_features.columns[selected_features.var() == 0]

 

SelectKBest runs a statistical test on each feature. With the default f_classif, it keeps the top K features ranked by ANOVA F-value; passing chi2 instead runs a chi-squared test. Reference: sklearn.feature_selection.SelectKBest — scikit-learn 0.24.1 documentation
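For example, switching to the chi-squared test is just a change of scoring function (a sketch; note that chi2 only accepts non-negative feature values, such as counts):

    from sklearn.feature_selection import SelectKBest, chi2

    # Same usage as above, but scores features with a chi-squared test.
    # Note: chi2 requires non-negative feature values (e.g. counts).
    selector = SelectKBest(chi2, k=40)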

 

3) The best value of K

To find the best value of K, you can fit multiple models with increasing values of K, then choose the smallest K whose validation score is above some threshold (or that meets some other criterion). A good way to do this is to loop over values of K and record the validation score for each iteration.

We used K=40 above, so now let's think about how to choose the value of K.

Loop over values of K, watch how the validation score changes, and settle on the smallest K that still scores well.
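A minimal sketch of that loop. Here train_and_score is a hypothetical helper that trains a model on the given columns and returns its validation score; everything else reuses the names from the snippets above:

    import pandas as pd
    from sklearn.feature_selection import SelectKBest, f_classif

    def select_k_best_columns(k):
        """Fit SelectKBest on train only and return the names of the kept columns."""
        selector = SelectKBest(f_classif, k=k)
        X_new = selector.fit_transform(train[feature_cols], train['is_attributed'])
        selected = pd.DataFrame(selector.inverse_transform(X_new),
                                index=train.index, columns=feature_cols)
        return selected.columns[selected.var() != 0]

    # Record the validation score for each K, then pick the smallest K whose
    # score is still acceptable. train_and_score is a hypothetical helper.
    valid_scores = {}
    for k in [10, 20, 30, 40, 50]:
        cols = select_k_best_columns(k)
        valid_scores[k] = train_and_score(train[cols], train['is_attributed'],
                                          valid[cols], valid['is_attributed'])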

 

4) Use L1 regularization for feature selection

L1 regularization is also effective for feature selection.

We use SelectFromModel from sklearn.feature_selection.

X_new is used as the new feature matrix.

The dropped columns come back as all zeros, so we keep only the columns where selected_features.var() != 0.

 

First fit the logistic regression model, then pass it to SelectFromModel. That should give you a model with the selected features; you can get the selected features with X_new = model.transform(X). However, this leaves off the column labels so you'll need to get them back. The easiest way to do this is to use model.inverse_transform(X_new) to get back the original X array with the dropped columns as all zeros. Then you can create a new DataFrame with the index and columns of X. From there, keep the columns that aren't all zeros.

    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression
    import pandas as pd

    def select_features_l1(X, y):
        """Return the columns kept by an L1-penalized logistic regression."""
        logistic = LogisticRegression(C=0.1, penalty="l1", random_state=7,
                                      solver='liblinear').fit(X, y)
        model = SelectFromModel(logistic, prefit=True)

        X_new = model.transform(X)

        # Get back the kept features as a DataFrame with dropped columns as all 0s
        selected_features = pd.DataFrame(model.inverse_transform(X_new),
                                         index=X.index,
                                         columns=X.columns)

        # Dropped columns have values of all 0s, keep other columns
        cols_to_keep = selected_features.columns[selected_features.var() != 0]

        return cols_to_keep

 

5) Feature Selection with Trees

With tree models, you can do feature selection using feature importances.

You could use something like RandomForestClassifier or ExtraTreesClassifier to find feature importances. SelectFromModel can use the feature importances to find the best features.
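A minimal sketch of that idea with RandomForestClassifier (the threshold='median' setting, which keeps the features above median importance, is just an example):

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    # Fit a forest, then let SelectFromModel pick features by importance
    forest = RandomForestClassifier(n_estimators=100, random_state=7)
    forest.fit(train[feature_cols], train['is_attributed'])

    model = SelectFromModel(forest, prefit=True, threshold='median')
    X_new = model.transform(train[feature_cols])

    # Recover the kept column names the same way as with SelectKBest above
    selected_features = pd.DataFrame(model.inverse_transform(X_new),
                                     index=train.index,
                                     columns=feature_cols)
    kept_columns = selected_features.columns[selected_features.var() != 0]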

 

6) Top K features with L1 regularization

The value of C for L1 is also chosen by looping, just like K in 3).

To select a certain number of features with L1 regularization, you need to find the regularization parameter that leaves the desired number of features. To do this you can iterate over models with different regularization parameters from low to high and choose the one that leaves K features. Note that for the scikit-learn models C is the inverse of the regularization strength.

Keep in mind that in scikit-learn, C is the inverse of the regularization strength.
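A minimal sketch of that search. The grid of C values and the target K = 40 are just examples; the helper mirrors select_features_l1 above but takes C as an argument:

    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression

    def n_features_kept(X, y, C):
        """How many features an L1-penalized logistic regression keeps at this C."""
        logistic = LogisticRegression(C=C, penalty="l1", random_state=7,
                                      solver='liblinear').fit(X, y)
        return SelectFromModel(logistic, prefit=True).transform(X).shape[1]

    # C is the inverse of the regularization strength: small C -> strong penalty
    # -> fewer features kept. Sweep C from low to high and stop at the first
    # value that keeps at least K features.
    K = 40
    for C in [0.0001, 0.001, 0.01, 0.1, 1.0]:
        if n_features_kept(train[feature_cols], train['is_attributed'], C) >= K:
            print(f"C = {C} keeps at least {K} features")
            break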

 

Summary)

In this post we went through feature selection.

When you want to select features based on a statistical test, SelectKBest is convenient.

You can also do feature selection with L1-regularized regression; in that case, use SelectFromModel.

With tree models, feature importances can be used for feature selection.