【Kaggle】 Learn Feature Engineering③ - 工学と医学のあいだ

This is the third post in my series on studying feature engineering with Kaggle Learn.

www.kaggle.com

Overview) This time we study how to create new features.

1) Add interaction features

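An interaction feature combines two categorical features, for example category and country, into a single new feature. In the exercise, we build one for every pair of categorical columns in the clicks data.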

The easiest way to loop through the pairs is with itertools.combinations. Once you have that working, convert each pair of columns to strings, then join them with the + operator. It's usually good to join with a symbol like _ in between to ensure unique values. You should now have a column of new categorical values, which you can label encode and add to the DataFrame.


    import itertools

    import pandas as pd
    from sklearn import preprocessing

    # `clicks` is the click-log DataFrame loaded earlier in the exercise
    cat_features = ['ip', 'app', 'device', 'os', 'channel']
    interactions = pd.DataFrame(index=clicks.index)
    for col1, col2 in itertools.combinations(cat_features, 2):
        new_col_name = '_'.join([col1, col2])

        # Convert to strings and combine
        new_values = clicks[col1].map(str) + "_" + clicks[col2].map(str)

        # Label-encode the combined strings as integers
        encoder = preprocessing.LabelEncoder()
        interactions[new_col_name] = encoder.fit_transform(new_values)

To enumerate the feature pairs, we use a for loop with itertools.combinations().

The new column names are built with .join, using "_" as the separator. This yields names such as 'os_channel'.

The feature values are concatenated with "+". Each column is first converted to strings with .map(str).

0 13_120
1 13_10
2 13_157
3 13_120
4 13_120

The combined values look like the rows above, before they are label encoded.
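The steps above can be tried on a small table. This is a minimal sketch under stated assumptions: the `toy` DataFrame and its values are made up for illustration, and pandas category codes stand in for LabelEncoder so that the sketch needs only pandas. Both approaches assign integer codes to the sorted unique strings.

```python
import itertools

import pandas as pd

# Hypothetical toy data standing in for the exercise's `clicks` DataFrame.
toy = pd.DataFrame({
    "app": [13, 13, 13, 7],
    "os": [120, 10, 120, 10],
    "channel": [1, 2, 1, 2],
})

interactions = pd.DataFrame(index=toy.index)
for col1, col2 in itertools.combinations(["app", "os", "channel"], 2):
    # Join the two columns as strings, e.g. 13 and 120 become "13_120".
    combined = toy[col1].map(str) + "_" + toy[col2].map(str)
    # Assumption: category codes replace LabelEncoder in this sketch.
    interactions["_".join([col1, col2])] = combined.astype("category").cat.codes

print(list(interactions.columns))  # ['app_os', 'app_channel', 'os_channel']
print(interactions["app_os"].tolist())  # [1, 0, 1, 2]
```

Three columns give three pairs. Rows 0 and 2 share the string "13_120", so they receive the same code.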

Generating numerical features

Besides categorical features, let's also create numerical features.

We work with pandas Series here.

2) Number of events in the past six hours

For time-series data, we use the .rolling method. For example,

launched.rolling('7d')

creates a rolling window over the previous 7 days. Calling .count() on that window counts the entries that fall within each 7-day span.
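As a rough illustration of the rolling count, the sketch below uses a made-up `launched` Series whose index holds hypothetical launch dates. The window is written as "7D", and the values themselves are placeholders because only the timestamps matter for counting.

```python
import pandas as pd

# Hypothetical launch dates; they form the index, the values are placeholders.
launched = pd.Series(
    1,
    index=pd.to_datetime([
        "2020-01-01", "2020-01-03", "2020-01-07", "2020-01-08", "2020-01-20",
    ]),
)

# For each row, count the rows whose timestamp falls in the 7 days
# up to and including that row's timestamp.
counts = launched.rolling("7D").count()
print(counts.tolist())  # [1.0, 2.0, 3.0, 3.0, 1.0]
```

The window is open on the left by default, so the row for 2020-01-08 no longer counts 2020-01-01, which lies exactly 7 days earlier.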

You can get a rolling time window using .rolling(), but first you need to convert the index to a time series. The current row is included in the window, but we want to count all the events before the current row, so be sure to adjust the count.

    import pandas as pd

    def count_past_events(series):
        # Swap roles: timestamps become the index so a time-based window works
        series = pd.Series(series.index, index=series)
        # Subtract 1 so the current event isn't counted
        past_events = series.rolling('6h').count() - 1
        return past_events

Subtracting 1 removes the current event from the count.
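To see the effect of the subtraction, here is a small sketch that applies the same function to hypothetical click times. The timestamps are invented and assumed to be sorted in ascending order, which a time-based rolling window requires.

```python
import pandas as pd

def count_past_events(series):
    # Timestamps become the index so that a time-based window is allowed.
    by_time = pd.Series(series.index, index=series)
    # The window includes the current row, so subtract 1.
    return by_time.rolling("6h").count() - 1

# Hypothetical click times, already sorted in ascending order.
click_times = pd.Series(pd.to_datetime([
    "2017-11-06 00:00", "2017-11-06 01:00", "2017-11-06 05:30",
    "2017-11-06 06:30", "2017-11-06 14:00",
]))

past = count_past_events(click_times)
print(past.tolist())  # [0.0, 1.0, 2.0, 2.0, 0.0]
```

Note that the result is indexed by timestamp rather than by the original row index, so it has to be realigned before being added to a DataFrame.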

3) Features from future information

Do not use data from the future. Such information is not available at prediction time, so features built from it leak.

4) Time since last event

We use .diff() to compute the time difference between consecutive events.

    def time_diff(series):
        return series.diff().dt.total_seconds()
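One possible way to use this function is per visitor, so that each difference is measured against the same visitor's previous click. The sketch below assumes that grouping by `ip` is what is wanted; the `clicks` table and its values are hypothetical.

```python
import pandas as pd

def time_diff(series):
    # Seconds since the previous row; the first row has no predecessor.
    return series.diff().dt.total_seconds()

# Hypothetical clicks, sorted by time.
clicks = pd.DataFrame({
    "ip": [1, 1, 2, 1, 2],
    "click_time": pd.to_datetime([
        "2017-11-06 00:00:00", "2017-11-06 00:00:30", "2017-11-06 00:01:00",
        "2017-11-06 00:02:00", "2017-11-06 00:02:45",
    ]),
})

# Assumption: differences are taken within each ip, not across the whole table.
since_last = clicks.groupby("ip")["click_time"].transform(time_diff)
print(since_last.tolist())  # [nan, 30.0, nan, 90.0, 105.0]
```

The first click of each ip is NaN because there is no earlier event to compare against.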

5) Number of previous app downloads

It's likely that if a visitor downloaded an app previously, it'll affect the likelihood they'll download one again. Implement a function previous_attributions that returns a Series with the number of times an app has been downloaded ('is_attributed' == 1) before the current event.

    def previous_attributions(series):
        # Subtracting raw values so I don't count the current event
        sums = series.expanding(min_periods=2).sum() - series
        return sums

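We use series.expanding to compute the running (cumulative) sum up to each event.

Subtracting the current value from that cumulative sum leaves the total over the earlier events only.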
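The arithmetic can be checked on a short, made-up `is_attributed` sequence. This is only a sketch; the values are hypothetical.

```python
import pandas as pd

def previous_attributions(series):
    # Running total including the current row, minus the current row itself.
    return series.expanding(min_periods=2).sum() - series

# Hypothetical download flags for one visitor, in time order.
is_attributed = pd.Series([0, 1, 0, 1, 1])

prev = previous_attributions(is_attributed)
print(prev.tolist())  # [nan, 0.0, 1.0, 1.0, 2.0]
```

Because of min_periods=2, the first row is NaN rather than 0. From the second row on, each value is the number of downloads strictly before that event.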

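Summary)

We created interaction features.

We created both categorical and numerical features.

For the categorical features, we used itertools.combinations.

For the numerical features, we used pandas rolling and diff.

To avoid leakage, be careful not to include data from the future.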