# Python Dictionary Implementation

## Overview

- CPython allocation memory to save dictionary, the initial table size is 8, entries are saved as
`<hash,key,value>`

in each slot(The slot content changed after Python 3.6). - When a new key is added, python use
`i = hash(key) & mask`

where`mask=table_size-1`

to calculate which slot it should be placed. If the slot is occupied, CPython using a probing algorithm to find the empty slot to store new item. - When 2/3 of the table is full, the table will be resized.
- When getting item from dictionary, both
`hash`

and`key`

must be equal.

## Resizing

When elements size is below 50000, the table size will increase by a factor of 4 based on used slots. Otherwise, it will increase by a factor of 2.

So the table size increase like this: 8->8*2/3*4=24->24*2/3*4=64->…

Removing item from dictionary doesn’t lead to shrink table. The value of the item will marks as null but not empty. To prevent early stopping when looking for other element.

## Probing

CPython used a modified **random probing** algorithm to choose the empty slot. This algorithm can traval all of the slots in a pseudo random order.

The travel order can be calculated by this formula: `j = ((5*j) + 1) mod 2**i`

, where `j`

is slot index.

For example, if table size is 8, and the calculate slot index is 2, then the traversal order should be:

`2 -> (5*2+1) mod 8 = 3 -> (5*3+1) mod 8 = 0 -> (5*0+1) mod 8 = 1 -> 6 -> 7 -> 4 -> 5 -> 2`

CPython changed this formula by adding `perturb`

and `PERTURB_SHIFT`

variables, where `perturb`

is hash value and `PERTURB_SHIFT`

is 5. By adding `PERTURB_SHIFT`

, the probe sequence depends on every bit in the hash code, and the collision probability is decreased. And `perturb`

will eventually becomes to 0, this ensures that all of the slots will be checked.

```
j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
j = j % 2**i
```

## Dictionary improvement after 3.6

CPython 3.6 use a compact representation to save entries, and “The memory usage of the new dict() is between 20% and 25% smaller compared to Python 3.5”.

### Compact Hash Table

As mentioned before, entries saved in the form of `<hash,key,value>`

. This will takes 3B on 64 bit machine. And no matter how much item is added into the dictionary, the memory usage is the same(3B*table_size).

After 3.6, CPython use two structure to save data. One is **index**, another is the **real data**.

For example, if the table size is 8, and there is an item in slot 1, the **index** looks like this:

`[null, 0, null, null, null, null, null, null]`

And the **real data** is:

```
| hash | key | value |
| xxx1 | yyy1 | zzz1 |
```

0 represents the items index on **real data**. If another item is added in slot 3, the new **index** become this:

`[null, 0, null, 1, null, null, null, null]`

The **real data** become this:

```
| hash | key | value |
| xxx1 | yyy1 | zzz1 |
| xxx2 | yyy2 | zzz2 |
```

This saves memory, especially when table is not full.