
Vectorizing a Pandas dataframe for Scikit-Learn

Say I have a dataframe in Pandas like the following:

> my_dataframe

col1   col2
A      foo
B      bar
C      something
A      foo
A      bar
B      foo

where rows represent instances and columns represent input features (the target label is not shown, but this would be for a classification task), i.e. I am trying to build X out of my_dataframe.

How can I vectorize this efficiently using e.g. DictVectorizer?

Do I need to convert each and every entry in my DataFrame to a dictionary first? (that's the way it is done in the example in the link above). Is there a more efficient way to do this?


  1. Take a look at sklearn-pandas, which provides exactly what you're looking for. The corresponding GitHub repo is here.

  2. Answer 2
  3. You can definitely use DictVectorizer. Because DictVectorizer expects an iterable of dict-like objects, you could do the following:

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction import DictVectorizer

    class RowIterator(BaseEstimator, TransformerMixin):
        """Prepare a DataFrame for DictVectorizer."""
        def fit(self, X, y=None):
            # Nothing to learn; present only to satisfy the transformer API.
            return self
        def transform(self, X):
            # Yield each row as a plain dict mapping column name to value.
            return (row.to_dict() for _, row in X.iterrows())

    vectorizer = make_pipeline(RowIterator(), DictVectorizer())
    # now you can use vectorizer as you might expect, e.g.
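
One possible way to use such a pipeline is sketched here, assuming the data from the question. The class is repeated, with BaseEstimator as an extra base class and rows converted to plain dicts (both are my assumptions, not requirements stated in the answer), so that the snippet runs on its own:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline


class RowIterator(BaseEstimator, TransformerMixin):
    """Yield each DataFrame row as a plain dict for DictVectorizer."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (row.to_dict() for _, row in X.iterrows())


# The example data from the question.
my_dataframe = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "something", "foo", "bar", "foo"],
})

vectorizer = make_pipeline(RowIterator(), DictVectorizer())
X = vectorizer.fit_transform(my_dataframe)  # SciPy sparse matrix by default

# The fitted DictVectorizer is the last step of the pipeline.
feature_names = vectorizer.steps[-1][1].feature_names_
print(feature_names)
print(X.toarray())
```

Each string value becomes its own one-hot column named "column=value", so the six rows here end up with six columns.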
  4. Answer 3
  5. You want to build a design matrix from a pandas DataFrame containing categoricals (or simply strings). The easiest way to do that is with patsy, a library that replicates and extends R's formula functionality.

    Using your example, the conversion would be:

    import pandas as pd
    import patsy
    my_df = pd.DataFrame({'col1':['A', 'B', 'C', 'A', 'A', 'B'], 
                          'col2':['foo', 'bar', 'something', 'foo', 'bar', 'foo']})
    patsy.dmatrix('col1 + col2', data=my_df) # With added intercept
    patsy.dmatrix('0 + col1 + col2', data=my_df) # Without added intercept

    The resulting design matrices are just NumPy arrays with some extra information and can be directly used in scikit-learn.

    Example result with intercept added:

    DesignMatrix with shape (6, 5)
      Intercept  col1[T.B]  col1[T.C]  col2[T.foo]  col2[T.something]
              1          0          0            1                  0
              1          1          0            0                  0
              1          0          1            0                  1
              1          0          0            1                  0
              1          0          0            0                  0
              1          1          0            1                  0
        'Intercept' (column 0)
        'col1' (columns 1:3)
        'col2' (columns 3:5)

    Note that patsy avoids multicollinearity by absorbing the effects of the reference levels, A and bar, into the intercept. That way, for example, the col1[T.B] predictor should be interpreted as the additional effect of B relative to observations that are classified as A.
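
To make the reference-level idea concrete without patsy, here is a small sketch that swaps in pandas.get_dummies with drop_first=True. This is a different tool from the one used in this answer and is not what patsy does internally; it only reproduces the same kind of coding, with the first level of each column (A and bar) dropped and an intercept column added by hand:

```python
import pandas as pd

my_df = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "something", "foo", "bar", "foo"],
})

# drop_first=True drops the first (alphabetical) level of each column,
# here 'A' and 'bar', so they act as the reference levels.
dummies = pd.get_dummies(my_df, drop_first=True, dtype=int)

# Add the intercept column explicitly; get_dummies does not create one.
dummies.insert(0, "Intercept", 1)
print(dummies)
```

A row of all zeros after the intercept (the fifth row, A and bar) is an observation at both reference levels.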

  6. Answer 4
  7. First, I can't tell which parts of your sample array are features and which are observations.

    Second, DictVectorizer holds no data itself; it is only a transformation utility that stores metadata. After fitting, it stores the feature names and the feature-to-column mapping. By default it returns a SciPy sparse matrix (or a dense NumPy array if you pass sparse=False), which is used for further computation. The feature matrix has shape (number of observations, number of features), and each entry is the value of a feature for an observation. So if you know your observations and features, you can build this array any other way you like.
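
A minimal sketch of what DictVectorizer keeps after fitting, using three made-up records in the spirit of the question. feature_names_ and vocabulary_ are the fitted attributes holding the names and the name-to-column mapping:

```python
from sklearn.feature_extraction import DictVectorizer

# Three illustrative observations, one dict per row.
records = [
    {"col1": "A", "col2": "foo"},
    {"col1": "B", "col2": "bar"},
    {"col1": "C", "col2": "foo"},
]

dv = DictVectorizer(sparse=False)  # sparse=False returns a dense ndarray
X = dv.fit_transform(records)

print(X.shape)            # (n_samples, n_features)
print(dv.feature_names_)  # column names, one per "key=value" pair seen
print(dv.vocabulary_)     # mapping from feature name to column index
```

The vectorizer itself keeps only the names and the mapping; the matrix X is returned to the caller and not stored.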

    If you want sklearn to do it for you, you don't have to build the dicts manually; you can apply to_dict to the transposed dataframe:

    >>> df
      col1 col2
    0    A  foo
    1    B  bar
    2    C  foo
    3    A  bar
    4    A  foo
    5    B  bar
    >>> list(df.T.to_dict().values())
    [{'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}, {'col2': 'foo', 'col1': 'C'}, {'col2': 'bar', 'col1': 'A'}, {'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}]

    Since pandas 0.13.0 (Jan 3, 2014) the to_dict() method accepts a new 'records' option, so now you can simply use this method without additional manipulations:

    >>> df = pandas.DataFrame({'col1': ['A', 'B', 'C', 'A', 'A', 'B'], 'col2': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar']})
    >>> df
      col1 col2
    0    A  foo
    1    B  bar
    2    C  foo
    3    A  bar
    4    A  foo
    5    B  bar
    >>> df.to_dict('records')
    [{'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}, {'col2': 'foo', 'col1': 'C'}, {'col2': 'bar', 'col1': 'A'}, {'col2': 'foo', 'col1': 'A'}, {'col2': 'bar', 'col1': 'B'}]
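
Putting the two pieces together, the list of records can be passed straight to DictVectorizer. This is a sketch using the same df as above; sparse=False is my choice, made only so the result prints as a dense array:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({
    "col1": ["A", "B", "C", "A", "A", "B"],
    "col2": ["foo", "bar", "foo", "bar", "foo", "bar"],
})

# One dict per row, keyed by column name.
records = df.to_dict(orient="records")

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)

print(dv.feature_names_)
print(X)
```

For large frames you would probably keep the default sparse output, since one-hot encoded categoricals are mostly zeros.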