import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                   'Col3': np.random.random(5)})
What is the best way to return the unique values of 'Col1' and 'Col2'?
The desired output is
'Bob', 'Joe', 'Bill', 'Mary', 'Steve'
pd.unique returns the unique values from an input array, DataFrame column, or index. The input to this function needs to be one-dimensional, so multiple columns will need to be combined. The simplest way is to select the columns you want and then flatten their values into a one-dimensional NumPy array with ravel().
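A minimal sketch of that operation, assuming the DataFrame from the question; the variable names `flat` and `result` are only illustrative. The order of the returned values depends on how the underlying array is laid out in memory, so no particular order is claimed here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve'],
                   'Col3': np.random.random(5)})

# Select the two columns, take the underlying 2-D NumPy array,
# flatten it in memory order ('K'), then de-duplicate with pd.unique.
flat = df[['Col1', 'Col2']].values.ravel('K')
result = pd.unique(flat)
print(result)
```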
Note that ravel() is an array method that returns a view (if possible) of a multidimensional array. The argument 'K' tells the method to flatten the array in the order the elements are stored in memory (pandas typically stores underlying arrays in Fortran-contiguous order; columns before rows). This can be significantly faster than using the method's default 'C' order.

An alternative way is to select the columns and pass them to np.unique. There is no need to use ravel() here as the function handles multidimensional arrays. Even so, this is likely to be slower than pd.unique as it uses a sort-based algorithm rather than a hashtable to identify unique values. The difference in speed is significant for larger DataFrames (especially if there are only a handful of unique values).
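The np.unique alternative and a rough way to compare the two can be sketched as follows. The frame `big`, its size, and the repeat count are assumptions chosen for illustration; timings vary by machine and library version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame: 100,000 rows drawn from only a handful of names.
rng = np.random.default_rng(0)
names = ['Bob', 'Joe', 'Bill', 'Mary', 'Steve']
big = pd.DataFrame({'Col1': rng.choice(names, size=100_000),
                    'Col2': rng.choice(names, size=100_000)})

values = big[['Col1', 'Col2']].values

# np.unique flattens a multidimensional input itself and returns sorted values.
via_numpy = np.unique(values)
# pd.unique needs a one-dimensional input, hence the ravel.
via_pandas = pd.unique(values.ravel('K'))

# Time a few repetitions of each approach on the same array.
t_numpy = timeit.timeit(lambda: np.unique(values), number=5)
t_pandas = timeit.timeit(lambda: pd.unique(values.ravel('K')), number=5)
print(f"np.unique: {t_numpy:.4f}s  pd.unique: {t_pandas:.4f}s")
```

Besides speed, the two differ in ordering: np.unique returns its result sorted, while pd.unique keeps values in the order they are first encountered.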
I have set up a DataFrame with a few simple strings in its columns. You can concatenate the columns you are interested in and call the unique function on the result.
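A minimal sketch of the concatenation approach, assuming pd.concat is used to join the columns and reusing the column names from the question rather than this answer's own example frame:

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve']})

# Stack the two columns end to end into one Series, then de-duplicate.
combined = pd.concat([df['Col1'], df['Col2']])
result = combined.unique()
print(list(result))  # ['Bob', 'Joe', 'Bill', 'Mary', 'Steve']
```

Series.unique keeps values in order of first appearance, which is why this matches the order requested in the question.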
Non-pandas solution: using set().
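One way the set() idea might look, as a sketch under the assumption that a set union of the two columns is what is wanted; the name `unique_names` is illustrative. A set has no defined order, so the printed order may differ between runs.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve']})

# Build a set from each column and take their union.
unique_names = set(df['Col1']) | set(df['Col2'])
print(unique_names)
```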
An updated note for NumPy v1.13+: np.unique accepts an axis argument. If no axis is specified, a multi-column array is implicitly flattened, which yields the unique values across all selected columns; specifying axis=0 instead returns the unique rows.
This change was introduced Nov 2016: https://github.com/numpy/numpy/commit/1f764dbff7c496d6636dc0430f083ada9ff4e4be
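The effect of the axis argument can be illustrated with a small sketch. It uses a plain NumPy string array rather than a DataFrame, on the assumption that a fixed-width string dtype is the safer input here (as far as I know, object-dtype arrays are not accepted when axis is given).

```python
import numpy as np

# Same names as Col1/Col2 in the question, as a fixed-width string array.
arr = np.array([['Bob', 'Joe'],
                ['Joe', 'Steve'],
                ['Bill', 'Bob'],
                ['Mary', 'Bob'],
                ['Joe', 'Steve']])

# No axis: the array is flattened, giving unique values across both columns.
unique_values = np.unique(arr)

# axis=0: each row is treated as one item, giving the unique rows.
unique_rows = np.unique(arr, axis=0)

print(unique_values)
print(unique_rows)
```

For the question as asked (unique names across both columns), the call without axis is the relevant one; axis=0 answers a different question, namely which (Col1, Col2) pairs occur.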
For those of us who love all things pandas, apply, and of course lambda functions:
The output will be ['Mary', 'Joe', 'Steve', 'Bob', 'Bill']
Here's another way:
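One further option, offered as an assumption rather than as this answer's own snippet, is to reshape the two columns into a single long column with melt and call unique on it:

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['Bob', 'Joe', 'Bill', 'Mary', 'Joe'],
                   'Col2': ['Joe', 'Steve', 'Bob', 'Bob', 'Steve']})

# melt stacks Col1 on top of Col2 in a single 'value' column.
long_form = df[['Col1', 'Col2']].melt()
result = long_form['value'].unique()
print(list(result))  # ['Bob', 'Joe', 'Bill', 'Mary', 'Steve']
```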