# Python Dictionary Implementation

## Table of Contents

## Overview #

- CPython allocation memory to save dictionary, the initial table size is 8, entries are saved as
`<hash,key,value>`

in each slot(The slot content changed after Python 3.6). - When a new key is added, python use
`i = hash(key) & mask`

where`mask=table_size-1`

to calculate which slot it should be placed. If the slot is occupied, CPython using a probing algorithm to find the empty slot to store new item. - When 2/3 of the table is full, the table will be resized.
- When getting item from dictionary, both
`hash`

and`key`

must be equal.

## Resizing #

When elements size is below 50000, the table size will increase by a factor of 4 based on used slots. Otherwise, it will increase by a factor of 2. The dictionary size is always \(2^{n}\).

dict size | resize when elements in dict | new table size |
---|---|---|

8 | 6 | 32 |

32 | 22 | 128 |

128 | 86 | 512 |

Removing item from dictionary doesn’t lead to shrink table. The value of the item will marks as null but not empty. When looking up element in dictionary, it will keep probing once find this special mark. So deleting element from Python will not decrease the memory using. If you really want to do so, you can the items in the old dictionary to create a new one.

## Probing #

CPython used a modified **random probing** algorithm to choose the empty slot. This algorithm can traval all of the slots in a pseudo random order.

The travel order can be calculated by this formula: `j = ((5*j) + 1) mod 2**i`

, where `j`

is slot index.

For example, if table size is 8, and the calculate slot index is 2, then the traversal order should be:

`2 -> (5*2+1) mod 8 = 3 -> (5*3+1) mod 8 = 0 -> (5*0+1) mod 8 = 1 -> 6 -> 7 -> 4 -> 5 -> 2`

CPython changed this formula by adding `perturb`

and `PERTURB_SHIFT`

variables, where `perturb`

is hash value and `PERTURB_SHIFT`

is 5. By adding `PERTURB_SHIFT`

, the probe sequence depends on every bit in the hash code, and the collision probability is decreased. And `perturb`

will eventually becomes to 0, this ensures that all of the slots will be checked.

```
j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
j = j % 2**i
```

## Dictionary improvement after 3.6 #

CPython 3.6 use a compact representation to save entries, and “The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5”.

### Compact Hash Table #

As mentioned before, entries saved in the form of `<hash,key,value>`

. This will takes 3B on 64 bit machine. And no matter how much item is added into the dictionary, the memory usage is the same(3B*table_size).

After 3.6, CPython use two structure to save data. One is **index**, another is the **real data**.

For example, if the table size is 8, and there is an item in slot 1, the **index** looks like this:

`[null, 0, null, null, null, null, null, null]`

And the **real data** is:

```
| hash | key | value |
| xxx1 | yyy1 | zzz1 |
```

0 represents the items index on **real data**. If another item is added in slot 3, the new **index** become this:

`[null, 0, null, 1, null, null, null, null]`

The **real data** become this:

```
| hash | key | value |
| xxx1 | yyy1 | zzz1 |
| xxx2 | yyy2 | zzz2 |
```

This saves memory, especially when table load factor is low.