2016-04-19 11 views
6

Sklearn preprocessing - PolynomialFeatures - How to keep column names/headers of the output array/dataframe

TLDR: How do I get headers for the output numpy array from the sklearn.preprocessing.PolynomialFeatures() function?


Let's say I have the following code:

import pandas as pd 
import numpy as np 
from sklearn import preprocessing as pp 

a = np.ones(3) 
b = np.ones(3) * 2 
c = np.ones(3) * 3 

input_df = pd.DataFrame([a,b,c]) 
input_df = input_df.T 
input_df.columns=['a', 'b', 'c'] 

input_df 

   a  b  c
0  1  2  3
1  1  2  3
2  1  2  3

poly = pp.PolynomialFeatures(2) 
output_nparray = poly.fit_transform(input_df) 
print(output_nparray)

[[ 1. 1. 2. 3. 1. 2. 3. 4. 6. 9.] 
[ 1. 1. 2. 3. 1. 2. 3. 4. 6. 9.] 
[ 1. 1. 2. 3. 1. 2. 3. 4. 6. 9.]] 

How can I get the 3x10 matrix (output_nparray) to carry over the a, b, c labels so that each output column shows how it relates to the input data above?

Answers

10

A working example, all in one line (I assume "readability" is not the goal here):

target_feature_names = ['x'.join(['{}^{}'.format(name, exp) for name, exp in pairs if exp != 0]) for pairs in [zip(input_df.columns, p) for p in poly.powers_]]
output_df = pd.DataFrame(output_nparray, columns = target_feature_names) 
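
To make the one-liner easier to follow, here is a minimal sketch of the same idea using only the standard library. It assumes the default PolynomialFeatures settings (bias column included, all interaction terms) and rebuilds an exponent matrix in the same order as the output shown in the question. The helpers polynomial_powers and label_from_powers are hypothetical names introduced for illustration and are not part of scikit-learn. Unlike the one-liner, it separates factors with " x " and labels the constant term "1" instead of an empty string.

```python
from itertools import combinations_with_replacement


def polynomial_powers(n_features, degree):
    """Hypothetical helper: one row of exponents per output column.

    Rows are ordered by total degree (constant term first), which matches
    the column order of the 3x10 output shown in the question.
    """
    rows = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(range(n_features), d):
            # combo lists the feature indices multiplied together in this term
            rows.append([combo.count(i) for i in range(n_features)])
    return rows


def label_from_powers(names, powers_row):
    """Hypothetical helper: join 'name^exp' for every nonzero exponent."""
    parts = ["{}^{}".format(n, p) for n, p in zip(names, powers_row) if p != 0]
    return " x ".join(parts) if parts else "1"


columns = ["a", "b", "c"]
powers = polynomial_powers(len(columns), 2)
labels = [label_from_powers(columns, row) for row in powers]
print(labels)
```

With a fitted transformer you would pass the rows of poly.powers_ to the labelling step instead of rebuilding the exponents by hand.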
2

This works:

def PolynomialFeatures_labeled(input_df, power):
    '''Wrapper around the sklearn preprocessing function that keeps column labels.

    The problem with that function is that if you give it a labeled dataframe, it
    outputs an unlabeled numpy array with potentially a whole bunch of unlabeled columns.

    Inputs:
    input_df = your labeled pandas dataframe (list of x's not raised to any power)
    power = the polynomial order you want variables raised up to (the same value
            you would pass directly to pp.PolynomialFeatures(power))

    Output:
    A labeled pandas dataframe. The labels are built from the powers_ matrix,
    an attribute of the fitted PolynomialFeatures object.
    '''
    poly = pp.PolynomialFeatures(power)
    output_nparray = poly.fit_transform(input_df)
    powers_nparray = poly.powers_

    input_feature_names = list(input_df.columns)
    target_feature_names = ["Constant Term"]
    # Row 0 of powers_ is the constant term; every other row holds one exponent per input column.
    for feature_distillation in powers_nparray[1:]:
        final_label = ""
        for i in range(len(input_feature_names)):
            if feature_distillation[i] == 0:
                continue
            variable = input_feature_names[i]
            exponent = feature_distillation[i]  # exponent of this column in the current term
            intermediary_label = "%s^%d" % (variable, exponent)
            if final_label == "":  # first factor of this term
                final_label = intermediary_label
            else:
                final_label = final_label + " x " + intermediary_label
        target_feature_names.append(final_label)
    output_df = pd.DataFrame(output_nparray, columns=target_feature_names)
    return output_df

output_df = PolynomialFeatures_labeled(input_df,2) 
output_df 

   Constant Term  a^1  b^1  c^1  a^2  a^1 x b^1  a^1 x c^1  b^2  b^1 x c^1  c^2
0              1    1    2    3    1          2          3    4          6    9
1              1    1    2    3    1          2          3    4          6    9
2              1    1    2    3    1          2          3    4          6    9
4

scikit-learn 0.18 added a nifty get_feature_names() method!

>> input_df.columns 
Index(['a', 'b', 'c'], dtype='object') 

>> poly.fit_transform(input_df) 
array([[ 1., 1., 2., 3., 1., 2., 3., 4., 6., 9.], 
     [ 1., 1., 2., 3., 1., 2., 3., 4., 6., 9.], 
     [ 1., 1., 2., 3., 1., 2., 3., 4., 6., 9.]]) 

>> poly.get_feature_names(input_df.columns) 
['1', 'a', 'b', 'c', 'a^2', 'a b', 'a c', 'b^2', 'b c', 'c^2'] 

Note that you have to pass in the column names yourself, because sklearn does not read them from the DataFrame on its own.
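
For readers on a newer release: scikit-learn 1.0 introduced get_feature_names_out() and deprecated get_feature_names(), which was removed in 1.2. The sketch below tries the newer method first and falls back to the older one, then wraps the result in a labeled DataFrame. It assumes pandas, numpy and scikit-learn are installed, and it rebuilds input_df from the question so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

# Same data as in the question: three constant columns a, b, c.
input_df = pd.DataFrame({"a": np.ones(3), "b": np.ones(3) * 2, "c": np.ones(3) * 3})

poly = PolynomialFeatures(2)
output_nparray = poly.fit_transform(input_df)

column_names = list(input_df.columns)
if hasattr(poly, "get_feature_names_out"):
    # scikit-learn 1.0 and later
    names = list(poly.get_feature_names_out(column_names))
else:
    # scikit-learn 0.18 up to the releases before 1.0
    names = list(poly.get_feature_names(column_names))

# Reuse the input index so the rows still line up with input_df.
output_df = pd.DataFrame(output_nparray, columns=names, index=input_df.index)
print(output_df.columns.tolist())
```

Passing the column names explicitly keeps the call valid on both old and new releases.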
