python - Count unique values with pandas per groups


This question already has an answer here: Pandas count(distinct) equivalent (5 answers)
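
The question's sample data is not reproduced on this page; the DataFrame below is an assumption reconstructed from the outputs quoted in the answers, and reproduces those results:

    import pandas as pd

    # Assumed reconstruction of the question's data (not shown on this page),
    # chosen so that the outputs quoted in the answers below can be reproduced.
    df = pd.DataFrame({
        'ID': [123, 123, 123, 456, 456, 456, 456, 789, 789],
        'domain': ["'vk.com'", "'vk.com'", "'twitter.com'", "'vk.com'",
                   "'facebook.com'", "'vk.com'", "'google.com'",
                   "'twitter.com'", "'vk.com'"],
    })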


All answers
  •

    You need nunique:

    df = df.groupby('domain')['ID'].nunique()
    
    print (df)
    domain
    'facebook.com'    1
    'google.com'      1
    'twitter.com'     2
    'vk.com'          3
    Name: ID, dtype: int64
    

    If you need to strip ' characters:

    df = df.ID.groupby([df.domain.str.strip("'")]).nunique()
    print (df)
    domain
    facebook.com    1
    google.com      1
    twitter.com     2
    vk.com          3
    Name: ID, dtype: int64
    

    Or as Jon Clements commented:

    df.groupby(df.domain.str.strip("'"))['ID'].nunique()
    

    You can retain the column name like this:

    df = df.groupby(by='domain', as_index=False).agg({'ID': pd.Series.nunique})
    print(df)
        domain  ID
    0       fb   1
    1      ggl   1
    2  twitter   2
    3       vk   3
    

    The difference is that nunique() here returns a Series indexed by domain, while groupby with as_index=False followed by agg() returns a DataFrame.
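
    As a small aside not in the original answer: the Series produced by nunique can also be turned back into a DataFrame with reset_index, which likewise keeps the ID column name (a minimal sketch, assuming the reconstructed df from the top of the page):

    # Convert the groupby/nunique Series into a two-column DataFrame.
    result = df.groupby('domain')['ID'].nunique().reset_index()
    print(result)
    #            domain  ID
    # 0  'facebook.com'   1
    # 1    'google.com'   1
    # 2   'twitter.com'   2
    # 3        'vk.com'   3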


  •

    Generally, to count distinct values in a single column, you can use Series.value_counts:

    df.domain.value_counts()
    
    #'vk.com'          5
    #'twitter.com'     2
    #'facebook.com'    1
    #'google.com'      1
    #Name: domain, dtype: int64
    

    To see how many unique values there are in a column, use Series.nunique:

    df.domain.nunique()
    # 4
    

    To get all these distinct values, you can use unique or drop_duplicates; the slight difference between the two functions is that unique returns a numpy.ndarray while drop_duplicates returns a pandas.Series:

    df.domain.unique()
    # array(["'vk.com'", "'twitter.com'", "'facebook.com'", "'google.com'"], dtype=object)
    
    df.domain.drop_duplicates()
    #0          'vk.com'
    #2     'twitter.com'
    #4    'facebook.com'
    #6      'google.com'
    #Name: domain, dtype: object
    

    As for this specific problem, since you'd like to count distinct values with respect to another variable, besides the groupby method provided by the other answers here, you can also simply drop duplicates first and then call value_counts():

    import pandas as pd
    df.drop_duplicates().domain.value_counts()
    
    # 'vk.com'          3
    # 'twitter.com'     2
    # 'facebook.com'    1
    # 'google.com'      1
    # Name: domain, dtype: int64
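
    A hedged note beyond the original answer: the drop_duplicates() call above assumes df contains only the ID and domain columns. If the real frame carries extra columns, restrict the duplicate check explicitly:

    # Only treat rows as duplicates when both ID and domain match
    # (an assumption; the original answer drops duplicates over all columns).
    df.drop_duplicates(subset=['ID', 'domain']).domain.value_counts()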
    

  •

    >>> df.domain.value_counts()
    vk.com          5
    twitter.com     2
    google.com      1
    facebook.com    1
    Name: domain, dtype: int64
    

  •

    IIUC you want the number of different IDs for every domain; you can try this:

    output = df.drop_duplicates()
    output.groupby('domain').size()
    

    output:

    domain
    facebook.com    1
    google.com      1
    twitter.com     2
    vk.com          3
    dtype: int64
    

    You could also use value_counts, which is slightly less efficient. But the best is Jezrael's answer using nunique:

    %timeit df.drop_duplicates().groupby('domain').size()
    1000 loops, best of 3: 939 µs per loop
    %timeit df.drop_duplicates().domain.value_counts()
    1000 loops, best of 3: 1.1 ms per loop
    %timeit df.groupby('domain')['ID'].nunique()
    1000 loops, best of 3: 440 µs per loop
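
    The %timeit lines above are IPython magics. A plain-Python sketch of the same comparison with the standard timeit module (timings will differ from the numbers quoted above) could look like this:

    import timeit

    # Time each of the three approaches outside IPython; df is assumed to be
    # the reconstructed DataFrame from the top of the page.
    for stmt in [
        "df.drop_duplicates().groupby('domain').size()",
        "df.drop_duplicates().domain.value_counts()",
        "df.groupby('domain')['ID'].nunique()",
    ]:
        elapsed = timeit.timeit(stmt, globals=globals(), number=1000)
        print(f"{stmt}: {elapsed / 1000 * 1e6:.0f} µs per loop")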