Home

This is a series of posts where I jot down my learnings from computer science papers I’ve read. Previously, A Pretty But Not Greedy Printer

Paper (pdf)

Background

In a dynamically typed language, a variable can reference any type of objects. One way to implement this is to box objects, which is to allocate objects on the heap with a common header. Now, every object has a simple and common representation, simplifying VM implementation. However, the trade-off is that this is inefficient.

We can compare a dynamic language, like PyPy, to a static language, like C.

In C, an integer typically occupies one word in memory. In PyPy, it takes up three words: - one for garbage collector, - one to specify the type of object, - one for the actual integer value. That’s two words for overhead, compared to C, for each integer.

Another example is a list of n integers. In C, it is usually a contiguous area of memory which takes up n words. In PyPy, the list object will have a pointer to a storage area, and in the storage area will be n pointers to the three-word integer:

+------------------+
||GC |List |Storage|
+------------------+
              |
            +-v------------------+
            |GC |Length |  |  |  |
            +--------------------+
                          |  |  |
         +----------------+  |  +---------+
         |               +---+            |
         |               |                |
      +--v--------+   +--v--------+   +---v-------+
      |GC |Int |1 |   |GC |Int |2 |   |GC |Int |3 |
      +-----------+   +-----------+   +-----------+

This will probably have poor cache locality when traversing the list since each integer can potentially be spread around the heap.

We can save some of this space using a technique called pointer tagging. Pointers to objects in heaps are typically aligned to multiples of 4 or 8, i.e. you cannot address objects in heap by each byte. So for each pointer, 2 or 4 (least significant) bits are unused. You can store information in these unused bits, such as the type of the object. For example, if the least significant bit is set, the object is an integer. Then when dereferencing the pointer you can and away the lower bits. Thus, objects that can fit into a word no longer needs to be allocated on the heap.

There are trade-offs to this technique: - every access of a tagged pointer requires a check of tag type, has implications on branch prediction - large code changes are required adding pointer tagging, because pointer access happen frequently in many parts of the VM code base - finite amount of unused bits, thus there is a small limit to the possible number of tagged types

Storage strategies

The focus of this paper is to optimize performance and memory usage of collections of a single primitive type. The optimizations are based on the following hypotheses about bottlenecks in real-world programs:

  1. A collection with items of one primitive type (homogeneously-typed collection) will usually stay that way (will not dehomogenise)
  2. If a homogenously-typed collection dehomogenises, this transition happens after only a small number of elements have been added

The design is as follows: Each collection references a storage strategy and a storage area. The strategy controls how data is laid out in the area. Operations on the collection are handed over strategy to perform. A collection can only have one strategy at any point, but the strategy can change over time.

Strategies can also be implemented as singleton or static classes to allow sharing.

An example of a strategy is the EmptyStrategy, which is a placeholder for when the collection is empty. The ObjectStrategy is the usual boxed layout that a VM without strategy will employ. An IntegerStrategy is an optimized strategy that will unbox the integers in a list and keep them in the storage area.

The overhead of this technique is a single extra pointer to a strategy object. Operations on collections need one extra method call on the strategy, but this cost is offset by the lesser number of boxing required.

Storage strategies in PyPy

The authors implemented storage strategies in PyPy for lists, sets, and dictionaries, for integers, floats, and strings. For lists and sets, the implementation is as describe before. For dictionaries, the strategies only affect the keys, the values are left boxed. For each collection X, the strategies EmptyXStrategy, ObjectXStrategy, IntegerXStrategy, FloatXStrategy, and StringXStrategy are implemented.

Compared to the pre-storage strategies of the VM, storing a list of integers saves three words per element because the integers can be unboxed and kept inline.

       +-----------------------------+       +-----------------------------+
       |GC |List|Storage  |Strategy  |       |GC |List|Storage  |Strategy  |
       +---------+-----------+-------+       +----------+---------+--------+
                 |           |                          |         |
 +---------------+       +---+                          |         |
 |                       | +--------------------------------------+
 |                       | |                            |
+v-------------------+  +v-v--------------------+    +--v-----------------+
|GC |length |1 |2 |3 |  |GC |IntegerListStrategy|    |GC |length |4 |2 |3 |
+--------------------+  +-----------------------+    +--------------------+

The user of a list of integers can, at any time, add a non-integer to the list. When this happens, the integers in the list are boxed, and the strategy transitions to an ObjectStrategy, and the storage section needs to be rewritten appropriately, essentially looking like what the original boxed implementation looks like.

Besides the describe Strategy, there are some other useful strategies. One interesting storage strategy is used by the range function. range(i, j, s) generates a list of integers from i to j in steps of s lazily. It does not build a full list, that would be wasteful, because most uses of range do not make use of the full list. A special RangeListStrategy is used by range – it only stores three words of information: start, stop, and step, and produces the desired integer by calculating.

A special fast path for choosing and initializing a strategy is implemented for functions which the return type is fixed and known in advance. For example, split(d) on a string will return a list of strings, so a StringListStrategy is immediately selected.

In cases where there are optimized type-based operations, a strategy can have and optimized implementation of a method. For example, the contains(o) method can be optimized to perform a fast machine word comparison when used in an IntegerListStrategy.

Evaluation of performance

The evaluation of storage strategies focus on execution speed and memory usage using a variety of benchmarks. These benchmarks are run on various variants of PyPy with different groups of strategies enabled, ranging from pypy-all, which has all strategies enabled, and pypy-none which has no strategies enabled.

Each benchmark is executed 35 times, and the first 5 is discarded, to try and get close to steady-state of the VM.

Overall, pypy-all improves performance by 18%. It also leads to useful savings in peak memory usage, because, in essence, the strategies cause objects to be shorter-lived on average.

The results measure steady-state, which holds true when using JIT VM in a server-like environment. For short-lived programs, these numbers might not hold up, and measuring that is hard because the impact of the JIT itself can be significant.