python - Flatten nested dictionaries, compressing keys



Suppose you have a dictionary like:

{'a': 1,
 'c': {'a': 2,
       'b': {'x': 5,
             'y' : 10}},
 'd': [1, 2, 3]}

How would you go about flattening that into something like:

{'a': 1,
 'c_a': 2,
 'c_b_x': 5,
 'c_b_y': 10,
 'd': [1, 2, 3]}


All Answers
  •

    Basically the same way you would flatten a nested list: you just have to do the extra work of iterating the dict by key/value, creating new keys for your new dictionary, and building the dictionary in a final step.

    from collections.abc import MutableMapping  # collections.MutableMapping was removed in Python 3.10
    
    def flatten(d, parent_key='', sep='_'):
        items = []
        for k, v in d.items():
            new_key = parent_key + sep + k if parent_key else k
            if isinstance(v, MutableMapping):
                items.extend(flatten(v, new_key, sep=sep).items())
            else:
                items.append((new_key, v))
        return dict(items)
    
    >>> flatten({'a': 1, 'c': {'a': 2, 'b': {'x': 5, 'y' : 10}}, 'd': [1, 2, 3]})
    {'a': 1, 'c_a': 2, 'c_b_x': 5, 'd': [1, 2, 3], 'c_b_y': 10}
    

  •

    There are two big issues that the original poster needs to consider:

    1. Are there keyspace clobbering issues? For example, {'a_b':{'c':1}, 'a':{'b_c':2}} would result in {'a_b_c':???}. The solution below evades the problem by returning an iterable of pairs (see the sketch right after this list).
    2. If performance is an issue, does the key-reducer function (which I hereby refer to as 'join') require access to the entire key-path, or can it just do O(1) work at every node in the tree? If you want to be able to say joinedKey = '_'.join(keys), that will cost you O(N^2) running time. However, if you're willing to say nextKey = previousKey+'_'+thisKey, that gets you O(N) time. The solution below lets you do both (since you could merely concatenate all the keys, then postprocess them).
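
    A minimal sketch of the collision in point 1, comparing a plain '_'-joining flatten with one that builds tuple keys (both helper functions here are only illustrative; the real solution follows below):

    colliding = {'a_b': {'c': 1}, 'a': {'b_c': 2}}
    
    def join_flatten(d, parent=''):
        # string-joining flatten: both nested paths collapse to the key 'a_b_c'
        out = {}
        for k, v in d.items():
            key = parent + '_' + k if parent else k
            out.update(join_flatten(v, key) if isinstance(v, dict) else {key: v})
        return out
    
    def tuple_flatten(d, parent=()):
        # tuple keys keep the two paths distinct
        out = {}
        for k, v in d.items():
            key = parent + (k,)
            out.update(tuple_flatten(v, key) if isinstance(v, dict) else {key: v})
        return out
    
    print(join_flatten(colliding))   # {'a_b_c': 2}  -- one of the two values is silently lost
    print(tuple_flatten(colliding))  # {('a_b', 'c'): 1, ('a', 'b_c'): 2}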

    (Performance is not likely an issue, but I'll elaborate on the second point in case anyone else cares: In implementing this, there are numerous dangerous choices. If you do this recursively and yield and re-yield, or anything equivalent which touches nodes more than once (which is quite easy to accidentally do), you are doing potentially O(N^2) work rather than O(N). This is because maybe you are calculating a key a then a_1 then a_1_i..., and then calculating a then a_1 then a_1_ii..., but really you shouldn't have to calculate a_1 again. Even if you aren't recalculating it, re-yielding it (a 'level-by-level' approach) is just as bad. A good example is to think about the performance on {1:{1:{1:{1:...(N times)...{1:SOME_LARGE_DICTIONARY_OF_SIZE_N}...}}}})

    Below is a function I wrote, flattenDict(d, join=..., lift=...), which can be adapted to many purposes and can do what you want. Sadly, it is fairly hard to make a lazy version of this function without incurring the above performance penalties (many Python builtins like chain.from_iterable aren't actually efficient, which I only realized after extensive testing of three different versions of this code before settling on this one).

    from collections.abc import Mapping  # collections.Mapping was removed in Python 3.10
    from itertools import chain
    from operator import add
    
    _FLAG_FIRST = object()
    
    def flattenDict(d, join=add, lift=lambda x:x):
        results = []
        def visit(subdict, results, partialKey):
            for k,v in subdict.items():
                newKey = lift(k) if partialKey==_FLAG_FIRST else join(partialKey,lift(k))
                if isinstance(v,Mapping):
                    visit(v, results, newKey)
                else:
                    results.append((newKey,v))
        visit(d, results, _FLAG_FIRST)
        return results
    

    To better understand what's going on, below is a diagram for those unfamiliar with reduce(left), otherwise known as "fold left". Sometimes it is drawn with an initial value in place of k0 (not part of the list, passed into the function). Here, J is our join function. We preprocess each kn with lift(k).

                   [k0,k1,...,kN].foldleft(J)
                               /    \
                             ...    kN
                             /
           J(k0,J(k1,J(k2,k3)))
                           /  \
                          /    \
               J(J(k0,k1),k2)   k3
                        /   \
                       /     \
                 J(k0,k1)    k2
                     /  \
                    /    \
                   k0     k1
    

    This is in fact the same as functools.reduce, but where our function does this to all key-paths of the tree.

    >>> reduce(lambda a,b:(a,b), range(5))
    ((((0, 1), 2), 3), 4)
    

    Demonstration (which I'd otherwise put in docstring):

    >>> testData = {
            'a':1,
            'b':2,
            'c':{
                'aa':11,
                'bb':22,
                'cc':{
                    'aaa':111
                }
            }
        }
    from pprint import pprint as pp
    
    >>> pp(dict( flattenDict(testData, lift=lambda x:(x,)) ))
    {('a',): 1,
     ('b',): 2,
     ('c', 'aa'): 11,
     ('c', 'bb'): 22,
     ('c', 'cc', 'aaa'): 111}
    
    >>> pp(dict( flattenDict(testData, join=lambda a,b:a+'_'+b) ))
    {'a': 1, 'b': 2, 'c_aa': 11, 'c_bb': 22, 'c_cc_aaa': 111}    
    
    >>> pp(dict( (v,k) for k,v in flattenDict(testData, lift=hash, join=lambda a,b:hash((a,b))) ))
    {1: 12416037344,
     2: 12544037731,
     11: 5470935132935744593,
     22: 4885734186131977315,
     111: 3461911260025554326}
    

    Performance:

    from functools import reduce
    def makeEvilDict(n):
        return reduce(lambda acc, x: {x: acc}, [{i: 0 for i in range(n)}] + list(range(n)))
    
    import timeit
    def time(runnable):
        t0 = timeit.default_timer()
        _ = runnable()
        t1 = timeit.default_timer()
        print('took {:.2f} seconds'.format(t1-t0))
    
    >>> pp(makeEvilDict(8))
    {7: {6: {5: {4: {3: {2: {1: {0: {0: 0,
                                     1: 0,
                                     2: 0,
                                     3: 0,
                                     4: 0,
                                     5: 0,
                                     6: 0,
                                     7: 0}}}}}}}}}
    
    import sys
    sys.setrecursionlimit(1000000)
    
    forget = lambda a,b:''
    
    >>> time(lambda: dict(flattenDict(makeEvilDict(10000), join=forget)) )
    took 0.10 seconds
    >>> time(lambda: dict(flattenDict(makeEvilDict(100000), join=forget)) )
    [1]    12569 segmentation fault  python
    

    ... sigh, don't think that one is my fault...


    [unimportant historical note due to moderation issues]

    Regarding the alleged duplicate of Flatten a dictionary of dictionaries (2 levels deep) of lists in Python:

    That question's solution can be implemented in terms of this one by doing sorted( sum(flatten(...),[]) ). The reverse is not possible: while it is true that the values of flatten(...) can be recovered from the alleged duplicate by mapping a higher-order accumulator, one cannot recover the keys. (edit: Also it turns out that the alleged duplicate owner's question is completely different, in that it only deals with dictionaries exactly 2-level deep, though one of the answers on that page gives a general solution.)


  •

    Or, if you are already using pandas, you can do it with json_normalize() like so:

    import pandas as pd
    
    d = {'a': 1,
         'c': {'a': 2, 'b': {'x': 5, 'y' : 10}},
         'd': [1, 2, 3]}
    
    df = pd.json_normalize(d, sep='_')  # pd.io.json.json_normalize in older pandas versions
    
    print(df.to_dict(orient='records')[0])
    

    Output:

    {'a': 1, 'c_a': 2, 'c_b_x': 5, 'c_b_y': 10, 'd': [1, 2, 3]}
    

  •

    Here is a kind of "functional", "one-liner" implementation. It is recursive, and based on a conditional expression and a dict comprehension.

    def flatten_dict(dd, separator='_', prefix=''):
        return { prefix + separator + k if prefix else k : v
                 for kk, vv in dd.items()
                 for k, v in flatten_dict(vv, separator, kk).items()
                 } if isinstance(dd, dict) else { prefix : dd }
    

    Test:

    In [2]: flatten_dict({'abc':123, 'hgf':{'gh':432, 'yu':433}, 'gfd':902, 'xzxzxz':{"432":{'0b0b0b':231}, "43234":1321}}, '.')
    Out[2]: 
    {'abc': 123,
     'gfd': 902,
     'hgf.gh': 432,
     'hgf.yu': 433,
     'xzxzxz.432.0b0b0b': 231,
     'xzxzxz.43234': 1321}
    

  •

    If you're using pandas, there is a function hidden in pandas.io.json.normalize called nested_to_record which does exactly this. (It is an internal helper used by json_normalize, so the import path may differ between pandas versions.)

    from pandas.io.json.normalize import nested_to_record    
    
    flat = nested_to_record(my_dict, sep='_')
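
    A quick check against the dictionary from the question (this assumes the import above works for your pandas version, since nested_to_record is an internal helper rather than public API):

    d = {'a': 1, 'c': {'a': 2, 'b': {'x': 5, 'y': 10}}, 'd': [1, 2, 3]}
    
    print(nested_to_record(d, sep='_'))
    # {'a': 1, 'c_a': 2, 'c_b_x': 5, 'c_b_y': 10, 'd': [1, 2, 3]}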
    

  •

    Code:

    test = {'a': 1, 'c': {'a': 2, 'b': {'x': 5, 'y' : 10}}, 'd': [1, 2, 3]}
    
    def parse_dict(init, lkey=''):
        ret = {}
        for rkey,val in init.items():
            key = lkey+rkey
            if isinstance(val, dict):
                ret.update(parse_dict(val, key+'_'))
            else:
                ret[key] = val
        return ret
    
    print(parse_dict(test,''))
    

    Results:

    $ python test.py
    {'a': 1, 'c_a': 2, 'c_b_x': 5, 'd': [1, 2, 3], 'c_b_y': 10}
    

    I am using Python 3.2; adjust for your version of Python.


  •

    This is not restricted to dictionaries, but works with every mapping type that implements .items(). Furthermore, it is faster, as it avoids an if condition. Nevertheless, credit goes to Imran:

    def flatten(d, parent_key=''):
        items = []
        for k, v in d.items():
            try:
                items.extend(flatten(v, '%s%s_' % (parent_key, k)).items())
            except AttributeError:
                items.append(('%s%s' % (parent_key, k), v))
        return dict(items)
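
    For example, a quick check that it also handles mapping types other than plain dicts (here an OrderedDict nested inside the question's example):

    from collections import OrderedDict
    
    nested = {'a': 1, 'c': OrderedDict([('a', 2), ('b', {'x': 5, 'y': 10})]), 'd': [1, 2, 3]}
    print(flatten(nested))
    # {'a': 1, 'c_a': 2, 'c_b_x': 5, 'c_b_y': 10, 'd': [1, 2, 3]}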
    

  •

    How about a functional and performant solution for Python 3.5+?

    from functools import reduce
    
    
    def _reducer(items, key, val, pref):
        if isinstance(val, dict):
            return {**items, **flatten(val, pref + key)}
        else:
            return {**items, pref + key: val}
    
    def flatten(d, pref=''):
        return(reduce(
            lambda new_d, kv: _reducer(new_d, *kv, pref), 
            d.items(), 
            {}
        ))
    

    This is even more performant:

    def flatten(d, pref=''):
        return(reduce(
            lambda new_d, kv: \
                isinstance(kv[1], dict) and \
                {**new_d, **flatten(kv[1], pref + kv[0])} or \
                {**new_d, pref + kv[0]: kv[1]}, 
            d.items(), 
            {}
        ))
    

    In use:

    my_obj = {'a': 1, 'c': {'a': 2, 'b': {'x': 5, 'y': 10}}, 'd': [1, 2, 3]}
    
    print(flatten(my_obj)) 
    # {'d': [1, 2, 3], 'cby': 10, 'cbx': 5, 'ca': 2, 'a': 1}
    

  •

    My Python 3.3 Solution using generators:

    def flattenit(pyobj, keystring=''):
        if type(pyobj) is dict:
            # add a separator once we are below the top level
            keystring = keystring + "_" if keystring else keystring
            for k in pyobj:
                yield from flattenit(pyobj[k], keystring + k)
        else:
            # lists and other non-dict values are yielded as-is
            yield keystring, pyobj
    
    my_obj = {'a': 1, 'c': {'a': 2, 'b': {'x': 5, 'y': 10}}, 'd': [1, 2, 3]}
    
    #your flattened dictionary object
    flattened={k:v for k,v in flattenit(my_obj)}
    print(flattened)
    
    # result: {'c_b_y': 10, 'd': [1, 2, 3], 'c_a': 2, 'a': 1, 'c_b_x': 5}
    

  •

    Simple function to flatten nested dictionaries. For Python 3, replace .iteritems() with .items()

    def flatten_dict(init_dict):
        res_dict = {}
        if type(init_dict) is not dict:
            return res_dict
    
        for k, v in init_dict.iteritems():
            if type(v) == dict:
                res_dict.update(flatten_dict(v))
            else:
                res_dict[k] = v
    
        return res_dict
    

    The idea/requirement was: get flat dictionaries without keeping parent keys.

    Example of usage:

    dd = {'a': 3, 
          'b': {'c': 4, 'd': 5}, 
          'e': {'f': 
                     {'g': 1, 'h': 2}
               }, 
          'i': 9,
         }
    
    flatten_dict(dd)
    
    >> {'a': 3, 'c': 4, 'd': 5, 'g': 1, 'h': 2, 'i': 9}
    

    Keeping parent keys is simple as well.
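
    For example, a minimal sketch of that variation (the name flatten_dict_keep_parents is just illustrative, and Python 3's .items() is used), joining nested keys with '_':

    def flatten_dict_keep_parents(init_dict, parent_key=''):
        res_dict = {}
        if type(init_dict) is not dict:
            return res_dict
    
        for k, v in init_dict.items():
            new_key = parent_key + '_' + k if parent_key else k
            if type(v) == dict:
                res_dict.update(flatten_dict_keep_parents(v, new_key))
            else:
                res_dict[new_key] = v
    
        return res_dict
    
    flatten_dict_keep_parents(dd)
    
    >> {'a': 3, 'b_c': 4, 'b_d': 5, 'e_f_g': 1, 'e_f_h': 2, 'i': 9}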


  •

    This is similar to both imran's and ralu's answers. It does not use a generator, but instead employs recursion with a closure:

    def flatten_dict(d, separator='_'):
      final = {}
      def _flatten_dict(obj, parent_keys=[]):
        for k, v in obj.iteritems():
          if isinstance(v, dict):
            _flatten_dict(v, parent_keys + [k])
          else:
            key = separator.join(parent_keys + [k])
            final[key] = v
      _flatten_dict(d)
      return final
    
    >>> print flatten_dict({'a': 1, 'c': {'a': 2, 'b': {'x': 5, 'y' : 10}}, 'd': [1, 2, 3]})
    {'a': 1, 'c_a': 2, 'c_b_x': 5, 'd': [1, 2, 3], 'c_b_y': 10}
    

  •

    Davoud's solution is very nice, but it doesn't give satisfactory results when the nested dict also contains lists of dicts; his code can, however, be adapted for that case:

    def flatten_dict(d):
        items = []
        for k, v in d.items():
            try:
                if (type(v)==type([])): 
                    for l in v: items.extend(flatten_dict(l).items())
                else: 
                    items.extend(flatten_dict(v).items())
            except AttributeError:
                items.append((k, v))
        return dict(items)
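
    A quick demonstration on a made-up dict that contains a list of dicts (note that, as written, the keys are not prefixed with their parent keys):

    nested = {'a': 1, 'c': {'b': 2}, 'd': [{'x': 5}, {'y': 10}]}
    
    print(flatten_dict(nested))
    # {'a': 1, 'b': 2, 'x': 5, 'y': 10}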
    

  •

    The answers above work really well. Just thought I'd add the unflatten function that I wrote:

    def unflatten(d):
        ud = {}
        for k, v in d.items():
            context = ud
            for sub_key in k.split('_')[:-1]:
                if sub_key not in context:
                    context[sub_key] = {}
                context = context[sub_key]
            context[k.split('_')[-1]] = v
        return ud
    

    Note: This doesn't account for '_' already present in keys, much like the flatten counterparts.
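
    For example, round-tripping the flattened dict from the question (assuming, as noted above, that '_' does not occur inside the original keys):

    flat = {'a': 1, 'c_a': 2, 'c_b_x': 5, 'c_b_y': 10, 'd': [1, 2, 3]}
    
    print(unflatten(flat))
    # {'a': 1, 'c': {'a': 2, 'b': {'x': 5, 'y': 10}}, 'd': [1, 2, 3]}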


  •

    Here's an algorithm for elegant, in-place replacement. Tested with Python 2.7 and Python 3.5. Using the dot character as a separator.

    def flatten_json(json):
        if type(json) == dict:
            for k, v in list(json.items()):
                if type(v) == dict:
                    flatten_json(v)
                    json.pop(k)
                    for k2, v2 in v.items():
                        json[k+"."+k2] = v2
    

    Example:

    d = {'a': {'b': 'c'}}                   
    flatten_json(d)
    print(d)
    unflatten_json(d)
    print(d)
    

    Output:

    {'a.b': 'c'}
    {'a': {'b': 'c'}}
    

    I published this code here along with the matching unflatten_json function.


  •

    If you want to flatten a nested dictionary and get a list of all unique keys, then here is a solution:

    def flat_dict_return_unique_key(data, unique_keys=None):
        # use None instead of a mutable default so the set is not shared across calls
        if unique_keys is None:
            unique_keys = set()
        if isinstance(data, dict):
            unique_keys.update(data.keys())
            for each_v in data.values():
                if isinstance(each_v, dict):
                    flat_dict_return_unique_key(each_v, unique_keys)
        return list(unique_keys)
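
    Example usage on the dict from the question:

    nested = {'a': 1, 'c': {'a': 2, 'b': {'x': 5, 'y': 10}}, 'd': [1, 2, 3]}
    
    print(sorted(flat_dict_return_unique_key(nested)))
    # ['a', 'b', 'c', 'd', 'x', 'y']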
    

  •
    def flatten(unflattened_dict, separator='_'):
        flattened_dict = {}
    
        for k, v in unflattened_dict.items():
            if isinstance(v, dict):
                sub_flattened_dict = flatten(v, separator)
                for k2, v2 in sub_flattened_dict.items():
                    flattened_dict[k + separator + k2] = v2
            else:
                flattened_dict[k] = v
    
        return flattened_dict
    

  •

    Using generators:

    def flat_dic_helper(prefix, d):
        # append the separator once we are below the top level
        if len(prefix) > 0:
            prefix = prefix + "_"
        for k in d:
            v = d[k]
            if type(v) is dict:
                yield from flat_dic_helper(prefix + k, v)
            else:
                yield (prefix + k, v)
    
    def flat_dic(d): return dict(flat_dic_helper("", d))
    
    d={'a': 1, 'c': {'a': 2, 'b': {'x': 5, 'y' : 10}}, 'd': [1, 2, 3]}
    print(flat_dic(d))
    
    
    >> {'a': 1, 'c_a': 2, 'c_b_x': 5, 'd': [1, 2, 3], 'c_b_y': 10}
    

  •

    Using dict.popitem() in straightforward nested-list-like recursion:

    def flatten(d):
        if d == {}:
            return d
        else:
            k,v = d.popitem()
            if (dict != type(v)):
                return {k:v, **flatten(d)}
            else:
                flat_kv = flatten(v)
                for k1 in list(flat_kv.keys()):
                    flat_kv[k + '_' + k1] = flat_kv[k1]
                    del flat_kv[k1]
                return {**flat_kv, **flatten(d)}
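
    A quick usage check; note that popitem() consumes the input, so flatten a copy (e.g. copy.deepcopy(d)) if you still need the original:

    d = {'a': 1, 'c': {'a': 2, 'b': {'x': 5, 'y': 10}}, 'd': [1, 2, 3]}
    
    print(flatten(d))
    # {'d': [1, 2, 3], 'c_b_y': 10, 'c_b_x': 5, 'c_a': 2, 'a': 1}
    
    print(d)
    # {} -- the input has been emptied by popitem()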
    

  •

    I always prefer accessing dict objects via .items(), so for flattening dicts I use the following recursive generator flat_items(d). If you would like to have a dict again, simply wrap it like this: flat = dict(flat_items(d))

    def flat_items(d, key_separator='.'):
        """
        Flattens the dictionary containing other dictionaries like here: https://stackoverflow.com/questions/6027558/flatten-nested-python-dictionaries-compressing-keys
    
        >>> example = {'a': 1, 'c': {'a': 2, 'b': {'x': 5, 'y' : 10}}, 'd': [1, 2, 3]}
        >>> flat = dict(flat_items(example, key_separator='_'))
        >>> assert flat['c_b_y'] == 10
        """
        for k, v in d.items():
            if type(v) is dict:
                for k1, v1 in flat_items(v, key_separator=key_separator):
                    yield key_separator.join((k, k1)), v1
            else:
                yield k, v
    

  •

    I actually wrote a package called cherrypicker recently to deal with this exact sort of thing since I had to do it so often!

    I think the following code would give you exactly what you're after:

    from cherrypicker import CherryPicker
    
    dct = {
        'a': 1,
        'c': {
            'a': 2,
            'b': {
                'x': 5,
                'y' : 10
            }
        },
        'd': [1, 2, 3]
    }
    
    picker = CherryPicker(dct)
    picker.flatten().get()
    

    You can install the package with:

    pip install cherrypicker
    

    ...and there's more docs and guidance at https://cherrypicker.readthedocs.io.

    Other methods may be faster, but the priority of this package is to make such tasks quick and easy. If you do have a large list of objects to flatten though, you can also tell CherryPicker to use parallel processing to speed things up.


  •

    Just use python-benedict: it is a dict subclass that offers many features, including a flatten method. You can install it using pip: pip install python-benedict

    https://github.com/fabiocaccamo/python-benedict#flatten

    from benedict import benedict 
    
    d = benedict(data)
    f = d.flatten(separator='_')