python - Pandas DataFrame column to list


Translate

This question already has an answer here:


All Answers
  • Translate

    Use .values to get a numpy.array and then .tolist() to get a list.

    For example:

    import pandas as pd
    df = pd.DataFrame({'a':[1,3,5,7,4,5,6,4,7,8,9],
                       'b':[3,5,6,2,4,6,7,8,7,8,9]})
    

    Result:

    >>> df['a'].values.tolist()
    [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9]
    

    or you can just use

    >>> df['a'].tolist()
    [1, 3, 5, 7, 4, 5, 6, 4, 7, 8, 9]
    

    To drop duplicates you can do one of the following:

    >>> df['a'].drop_duplicates().values.tolist()
    [1, 3, 5, 7, 4, 6, 8, 9]
    >>> list(set(df['a'])) # as pointed out by EdChum
    [1, 3, 4, 5, 6, 7, 8, 9]
    

  • Translate

    I'd like to clarify a few things:

    1. As other answers have pointed out, the simplest thing to do is use pandas.Series.tolist(). I'm not sure why the top voted answer leads off with using pandas.Series.values.tolist() since as far as I can tell, it adds syntax/confusion with no added benefit.
    2. tst[lookupValue][['SomeCol']] is a dataframe (as stated in the question), not a series (as stated in a comment to the question). This is because tst[lookupValue] is a dataframe, and slicing it with [['SomeCol']] asks for a list of columns (that list that happens to have a length of 1), resulting in a dataframe being returned. If you remove the extra set of brackets, as in tst[lookupValue]['SomeCol'], then you are asking for just that one column rather than a list of columns, and thus you get a series back.
    3. You need a series to use pandas.Series.tolist(), so you should definitely skip the second set of brackets in this case. FYI, if you ever end up with a one-column dataframe that isn't easily avoidable like this, you can use pandas.DataFrame.squeeze() to convert it to a series.
    4. tst[lookupValue]['SomeCol'] is getting a subset of a particular column via chained slicing. It slices once to get a dataframe with only certain rows left, and then it slices again to get a certain column. You can get away with it here since you are just reading, not writing, but the proper way to do it is tst.loc[lookupValue, 'SomeCol'] (which returns a series).
    5. Using the syntax from #4, you could reasonably do everything in one line: ID = tst.loc[tst['SomeCol'] == 'SomeValue', 'SomeCol'].tolist()

    Demo Code:

    import pandas as pd
    df = pd.DataFrame({'colA':[1,2,1],
                       'colB':[4,5,6]})
    filter_value = 1
    
    print "df"
    print df
    print type(df)
    
    rows_to_keep = df['colA'] == filter_value
    print "\ndf['colA'] == filter_value"
    print rows_to_keep
    print type(rows_to_keep)
    
    result = df[rows_to_keep]['colB']
    print "\ndf[rows_to_keep]['colB']"
    print result
    print type(result)
    
    result = df[rows_to_keep][['colB']]
    print "\ndf[rows_to_keep][['colB']]"
    print result
    print type(result)
    
    result = df[rows_to_keep][['colB']].squeeze()
    print "\ndf[rows_to_keep][['colB']].squeeze()"
    print result
    print type(result)
    
    result = df.loc[rows_to_keep, 'colB']
    print "\ndf.loc[rows_to_keep, 'colB']"
    print result
    print type(result)
    
    result = df.loc[df['colA'] == filter_value, 'colB']
    print "\ndf.loc[df['colA'] == filter_value, 'colB']"
    print result
    print type(result)
    
    ID = df.loc[rows_to_keep, 'colB'].tolist()
    print "\ndf.loc[rows_to_keep, 'colB'].tolist()"
    print ID
    print type(ID)
    
    ID = df.loc[df['colA'] == filter_value, 'colB'].tolist()
    print "\ndf.loc[df['colA'] == filter_value, 'colB'].tolist()"
    print ID
    print type(ID)
    

    Result:

    df
       colA  colB
    0     1     4
    1     2     5
    2     1     6
    <class 'pandas.core.frame.DataFrame'>
    
    df['colA'] == filter_value
    0     True
    1    False
    2     True
    Name: colA, dtype: bool
    <class 'pandas.core.series.Series'>
    
    df[rows_to_keep]['colB']
    0    4
    2    6
    Name: colB, dtype: int64
    <class 'pandas.core.series.Series'>
    
    df[rows_to_keep][['colB']]
       colB
    0     4
    2     6
    <class 'pandas.core.frame.DataFrame'>
    
    df[rows_to_keep][['colB']].squeeze()
    0    4
    2    6
    Name: colB, dtype: int64
    <class 'pandas.core.series.Series'>
    
    df.loc[rows_to_keep, 'colB']
    0    4
    2    6
    Name: colB, dtype: int64
    <class 'pandas.core.series.Series'>
    
    df.loc[df['colA'] == filter_value, 'colB']
    0    4
    2    6
    Name: colB, dtype: int64
    <class 'pandas.core.series.Series'>
    
    df.loc[rows_to_keep, 'colB'].tolist()
    [4, 6]
    <type 'list'>
    
    df.loc[df['colA'] == filter_value, 'colB'].tolist()
    [4, 6]
    <type 'list'>
    

  • Translate

    You can use pandas.Series.tolist

    e.g.:

    import pandas as pd
    df = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
    

    Run:

    >>> df['a'].tolist()
    

    You will get

    >>> [1, 2, 3]
    

  • Translate

    The above solution is good if all the data is of same dtype. Numpy arrays are homogeneous containers. When you do df.values the output is an numpy array. So if the data has int and float in it then output will either have int or float and the columns will loose their original dtype. Consider df

    a  b 
    0  1  4
    1  2  5 
    2  3  6 
    
    a    float64
    b    int64 
    

    So if you want to keep original dtype, you can do something like

    row_list = df.to_csv(None, header=False, index=False).split('\n')
    

    this will return each row as a string.

    ['1.0,4', '2.0,5', '3.0,6', '']
    

    Then split each row to get list of list. Each element after splitting is a unicode. We need to convert it required datatype.

    def f(row_str): 
      row_list = row_str.split(',')
      return [float(row_list[0]), int(row_list[1])]
    
    df_list_of_list = map(f, row_list[:-1])
    
    [[1.0, 4], [2.0, 5], [3.0, 6]]