Python Dictionary Implementation
Table of Contents
Overview #
- CPython allocation memory to save dictionary, the initial table size is 8, entries are saved as
<hash,key,value>
in each slot(The slot content changed after Python 3.6). - When a new key is added, python use
i = hash(key) & mask
wheremask=table_size-1
to calculate which slot it should be placed. If the slot is occupied, CPython using a probing algorithm to find the empty slot to store new item. - When 2/3 of the table is full, the table will be resized.
- When getting item from dictionary, both
hash
andkey
must be equal.
Resizing #
When elements size is below 50000, the table size will increase by a factor of 4 based on used slots. Otherwise, it will increase by a factor of 2. The dictionary size is always \(2^{n}\).
dict size | resize when elements in dict | new table size |
---|---|---|
8 | 6 | 32 |
32 | 22 | 128 |
128 | 86 | 512 |
Removing item from dictionary doesn’t lead to shrink table. The value of the item will marks as null but not empty. When looking up element in dictionary, it will keep probing once find this special mark. So deleting element from Python will not decrease the memory using. If you really want to do so, you can the items in the old dictionary to create a new one.
Probing #
CPython used a modified random probing algorithm to choose the empty slot. This algorithm can traval all of the slots in a pseudo random order.
The travel order can be calculated by this formula: j = ((5*j) + 1) mod 2**i
, where j
is slot index.
For example, if table size is 8, and the calculate slot index is 2, then the traversal order should be:
2 -> (5*2+1) mod 8 = 3 -> (5*3+1) mod 8 = 0 -> (5*0+1) mod 8 = 1 -> 6 -> 7 -> 4 -> 5 -> 2
CPython changed this formula by adding perturb
and PERTURB_SHIFT
variables, where perturb
is hash value and PERTURB_SHIFT
is 5. By adding PERTURB_SHIFT
, the probe sequence depends on every bit in the hash code, and the collision probability is decreased. And perturb
will eventually becomes to 0, this ensures that all of the slots will be checked.
j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
j = j % 2**i
Dictionary improvement after 3.6 #
CPython 3.6 use a compact representation to save entries, and “The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5”.
Compact Hash Table #
As mentioned before, entries saved in the form of <hash,key,value>
. This will takes 3B on 64 bit machine. And no matter how much item is added into the dictionary, the memory usage is the same(3B*table_size).
After 3.6, CPython use two structure to save data. One is index, another is the real data.
For example, if the table size is 8, and there is an item in slot 1, the index looks like this:
[null, 0, null, null, null, null, null, null]
And the real data is:
| hash | key | value |
| xxx1 | yyy1 | zzz1 |
0 represents the items index on real data. If another item is added in slot 3, the new index become this:
[null, 0, null, 1, null, null, null, null]
The real data become this:
| hash | key | value |
| xxx1 | yyy1 | zzz1 |
| xxx2 | yyy2 | zzz2 |
This saves memory, especially when table load factor is low.