A Pretty But Not Greedy Printer by Jean-Philippe Bernardy
2017-09-09 21:40
This is a series of posts where I jot down my learnings from computer science papers I’ve read. Previously, The Next 700 Programming Languages
A pretty printer outputs a data structure as human-readable text.
A pretty printer should exhibit these behaviors, in order of importance:
- Visibility - layout output within the width
- Legibility - make appropriate use of layout so that it is easy for a human to read
- Frugality - use the least number of lines
In other words, a pretty printer is an optimization problem that maximizes all three parameters.
Pretty printers of the past (Hughes and Wadler) adopt a functional design, composing algebraic structures. To achieve all three, the resulting pretty printer will be unusably slow, O(2^n). Most pretty printers sacrifice one of these three goals. For example, Hughes library chooses a greedy algorithm, so most of the time it gives up 3, and Wadler’s library is not expressive enough.
Bernardy aims to, with this pretty printer, achieve all three goals, while doing it in a reasonable time.
The basic API of the pretty printer is:
text
, turns string to a pretty printable document (Doc
)<>
, concat twoDoc
s horizontally$$
, concat twoDoc
s vertically<|>
, choose to concat twoDoc
s either horizontally or vertically
More combinators can be defined in terms of those basic functions:
empty
is the emptyDoc
<+>
, horizontal concatenation of twoDoc
with a spacehsep
, horizontal concatenation of manyDoc
s with separatorvcat
, vertical concatenation of manyDoc
s
We can define a layout as a non-empty list of strings to print, then rendering the layout is as simple as joining the list of strings with newlines.
$$
is then adding a new string to the layout.
<>
has to deal with the last line of the first string, and the first line of the second string. For example
xxxxx <> yyyyy
xx yyyyyyy
= xxxxx
xxyyyyy
yyyyyyy
The core algorithm is to
- given a set of layouts,
- filter those that are visible,
- select the most frugal to render.
However this algorithm is O(2^n)
in terms of a n
input with n
choices (from the <|>
operator). Every possible combination of the choice is first constructed, and then the shortest output is picked.
The key techniques to achieving an efficient implementation are to:
- Recognize that a full layout does not need to be constructed to calculate its size
- Invalid results can be filtered out early
- Some results are dominated and thus can be trimmed away early.
For 1, the important attributes are the maximum width of the layout, width of the last line, and the height. It can also be shown that this measurement (which ignores the content of the text) is a homomorphism to the layout structure described for actually laying out text (sans the rendering).
The second relies on the fact that the validity of a document implies the validity of its part.
The third relies on the fact that layout operators are monotonic with respect to the defined domination relation. If a
dominates b
, op a
still dominates op b
. Define a partial order as the dominance relation, by looking at all three attributes used for measuring in 1. The subset of non-dominated layouts is called the Pareto frontier.
Can think of this as a graph where
a -> b
ifa
dominatesb
, and the frontier is the collection of nodes without in degree 0
Benchmarks done on laying out full binary trees and random trees represented as S-Expressions.
On a benchmark that uses randomly generated JSON, mimicking real life JSON files, this library performs ten times as slow as that of Wadler-Leijen, and about 4 tims as slow as the Hughes-PJ one.