03_compiler.scala // Jump To …

This part discusses compiler internals. How do embedded compilers compile their programs?

Outline:

The purely string based representation of staged programs from Part 2 does not allow analysis or transformation of embedded programs. Since LMS is not inherently tied to a particular program representation it is very easy to pick one that is better suited for optimization. As a first cut, we switch to an intermediate representation (IR) based on expression trees, adding a level of indirection between construction of object programs and code generation (see below). On this tree IR we can define traversal and transformation passes and build a straightforward embedded compiler. We can add new IR node types and new transformation passes that implement domain specific optimizations. In particular we can use multiple passes of staging: While traversing (effectively, interpreting) one IR we can execute staging commands to build another staged program, possibly in a different, lower-level object language.

However the extremely high degree of extensibility poses serious challenges. In particular, the interplay of optimizations implemented as many separate phases does not yield good results due to the phase ordering problem: It is unclear in which order and how often to execute these phases, and since each optimization pass has to make pessimistic assumptions about the outcome of all other passes the global result is often suboptimal compared to a dedicated, combined optimization phase (*). There are also implementation challenges as each optimization needs to be designed to treat unknown IR nodes in a sensible way.

Other challenges are due to the fact that embedded compilers are supposed to be used like libraries. Extending an embedded compiler should be easy, and as much of the work as possible should be delegated to a library of compiler components. Newly defined high-level IR nodes should profit from generic optimizations automatically.

To remedy this situation, we switch to a graph-based “sea of nodes” representation (see below). This representation links definitions and uses, and it also reflects the program block structure via nesting edges. We consider purely functional programs first. A number of nontrivial optimizations become considerably simpler. Common subexpression elimination (CSE) and dead code elimination (DCE) are particularly easy. Both are completely generic and support an open domain of IR node types. Optimizations that can be expressed as context-free rewrites are also easy to add in a modular fashion. A scheduling and code motion algorithm transforms graphs back into trees, moving computations to places where they are less often executed, e.g. out of loops or functions. Both graph-based and tree- based transformations are useful: graph-based transformations are usually simpler and more efficient whereas tree-based transformations, implemented as multiple staging passes, can be more powerful and employ arbitrary context- sensitive information.

To support effectful programs, we make effects explicit in the dependency graph (similar to SSA form). We can support simple effect domains (pure vs effectful) and more fine grained ones, such as tracking modifications per allocation site. The latter one relies on alias and points-to analysis.

We then turn to advanced optimizations (see below). For combining analyses and optimizations, it is crucial to maintain optimistic assumptions for all analyses. The key challenge is that one analysis has to anticipate the effects of the other transformations. The solution is speculative rewriting (*): transform a program fragment in the presence of partial and possibly unsound analysis results and re-run the analyses on the transformed code until a fixpoint is reached. This way, different analyses can communicate through the transformed code and need not anticipate their results in other ways. Using speculative rewriting, we compose many optimizations into more powerful combined passes. Often, a single forward simplification pass that can be used to clean up after non- optimizing transformations is sufficient.

However not all rewrites can fruitfully be combined into a single phase. For example, high-level representations of linear algebra operations may give rise to rewrite rules like $I M \rightarrow M$ where $I$ is the identity matrix. At the same time, there may be rules that define how a matrix multiplication can be implemented in terms of arrays and while loops, or a call to an external library (BLAS). To be effective, all the high-level simplifications need to be applied exhaustively before any of the lowering transformations are applied. But lowering transformations may create new opportunities for high- level rules, too. Our solution here is delayed rewriting: programmers can specify that a certain rewrite should not be applied right now, but have it registered to be executed at the next iteration of a particular phase. Delayed rewriting thus provides a way of grouping and prioritizing modularly defined transformations.

On top of this infrastructure, we build a number of advanced optimizations. A general pattern is split and merge: We split operations and data structures in order to expose their components to rewrites and dead-code elimination and then merge the remaining parts back together. This struct transformation also allows for more general data structure conversions, including array-of-struct to struct-of-array representation conversion. Furthermore we present a novel loop fusion algorithm, a powerful transformation that removes intermediate data structures.

With the aim of generating code, we could represent staged expressions directly as strings, as done in Part 2. But for optimization purposes we would rather have a structured intermediate representation that we can analyze in various ways. Fortunately, LMS makes it very easy to use a different internal program representation.

Our starting point is an object language *interface* derived from
Part 2:

```
trait Base {
type Rep[T]
}
trait Arith extends Base {
def infix_+(x: Rep[Double], y: Rep[Double]): Rep[Double]
def infix_*(x: Rep[Double], y: Rep[Double]): Rep[Double]
...
}
trait IfThenElse extends Base {
def __ifThenElse[T](c: Rep[Boolean], a: =>Rep[T], b: =>Rep[T]): Rep[T]
}
```

The goal will be to build a corresponding *implementation* hierarchy that
supports optimizing compilation.

Splitting interface and implementation has many advantages, most importantly a clear separation between the user program world and the compiler implementation world. For the sake of completeness, let us briefly recast the string representation in this model:

```
trait BaseStr extends Base {
type Rep[T] = String
}
trait ArithStr extends BaseStr with Arith {
def infix_+(x: Rep[Double], y: Rep[Double]) = perform(x + " + " + y)
def infix_*(x: Rep[Double], y: Rep[Double]) = perform(x + " * " + y)
...
}
trait IfThenElseStr extends BaseStr with IfThenElse {
def __ifThenElse[T](c: Rep[Boolean], a: =>Rep[T], b: =>Rep[T]) =
perform("if (" + c + ") " + accumulate(a) + " else " + accumulate(b))
}
```

In this chapter, we will use an IR that is based on expression trees, closely resembling the abstract syntax tree (AST) of a staged program. This representation enables separate analysis, optimization and code generation passes. We will use the following types:

```
type Exp[T] // atomic: Sym, Const
type Def[T] // composite: Exp + Exp, Exp * Exp, ...
type Stm[T] // statement: val x = Def
type Block[T] // blocks: { Stm; ...; Stm; Exp }
```

They are defined as follows in a separate trait:

```
trait Expressions {
// expressions (atomic)
abstract class Exp[T]
case class Const[T](x: T) extends Exp[T]
case class Sym[T](n: Int) extends Exp[T]
def fresh[T]: Sym[T]
// definitions (composite, subclasses provided by other traits)
abstract class Def[T]
// statements
case class Stm[T](sym: Sym[T], rhs: Def[T])
// blocks
case class Block[T](stms: Stm[_], res: Exp[T])
// perform and accumulate
def reflectStm[T](d: Stm[T]): Exp[T]
def reifyBlock[T](b: =>Exp[T]): Block[T]
// bind definitions to symbols automatically
// by creating a statement
implicit def toAtom[T](d: Def[T]): Exp[T] =
reflectStm(Stm(fresh[T], d))
}
```

This trait `Expressions`

will be mixed in at the root of the object language
implementation hierarchy. The guiding principle is that each definition has
an associated symbol and refers to other definitions only via their symbols.
This means that every composite value will be named, similar to
administrative normal form (ANF). Methods `reflectStm`

and `reifyBlock`

take
over the responsibility of `perform`

and `accumulate`

.

We observe that there are no concrete definition classes provided by trait
`Expressions`

. Providing meaningful data types is the responsibility of other
traits that implement the interfaces defined previously (`Base`

and its
descendents).

Trait `BaseExp`

forms the root of the implementation hierarchy and installs
atomic expressions as the representation of staged values by defining
`Rep[T] = Exp[T]`

:

```
trait BaseExp extends Base with Expressions {
type Rep[T] = Exp[T]
}
```

For each interface trait, there is one corresponding core implementation
trait. Shown below, we have traits `ArithExp`

and `IfThenElseExp`

as the
running example. Both traits define one definition class for each operation
defined by `Arith`

and `IfThenElse`

, respectively, and implement the
corresponding interface methods to create instances of those classes.

```
trait ArithExp extends BaseExp with Arith {
case class Plus(x: Exp[Double], y: Exp[Double]) extends Def[Double]
case class Times(x: Exp[Double], y: Exp[Double]) extends Def[Double]
def infix_+(x: Rep[Double], y: Rep[Double]) = Plus(x, y)
def infix_*(x: Rep[Double], y: Rep[Double]) = Times(x, y)
...
}
trait IfThenElseExp extends BaseExp with IfThenElse {
case class IfThenElse(c: Exp[Boolean], a: Block[T], b: Block[T]) extends Def[T]
def __ifThenElse[T](c: Rep[Boolean], a: =>Rep[T], b: =>Rep[T]): Rep[T] =
IfThenElse(c, reifyBlock(a), reifyBlock(b))
}
```

The framework ensures that code that contains staging operations will always
be executed within the dynamic scope of at least one invocation of
`reifyBlock`

, which returns a block object and takes as call-by-name argument
the present-stage expression that will compute the staged block result. Block
objects can be part of definitions, e.g. for loops or conditionals.

Since all operations in interface traits such as `Arith`

return `Rep`

types,
equating `Rep[T]`

and `Exp[T]`

in trait `BaseExp`

means that conversion to
symbols will take place already within those methods. This fact is important
because it establishes our correspondence between the evaluation order of the
program generator and the evaluation order of the generated program: at the
point where the generator calls `toAtom`

, the composite definition is turned
into an atomic value via `reflectStm`

, i.e. its evaluation will be recorded
now and played back later in the same relative order with respect to others
within the closest `reifyBlock`

invocation.

Given our IR representation it is easy to add traversals and transformations.

All that is needed to define a generic in-order traversal is a way to access all blocks immediately contained in a definition:

```
def blocks(x: Any): List[Block[Any]]
```

For example, applying `blocks`

to an `IfThenElse`

node will return the then
and else blocks. Since definitions are case classes, this method is easy to
implement by using the `Product`

interface that all case classes implement.

The basic structural in-order traversal is then defined like this:

```
trait ForwardTraversal {
val IR: Expressions
import IR._
def traverseBlock[T](b: Block[T]): Unit = b.stms.foreach(traverseStm)
def traverseStm[T](s: Stm[T]): Unit = blocks(s).foreach(traverseBlock)
}
```

Custom traversals can be implemented in a modular way by extending the
`ForwardTraversal`

trait:

```
trait MyTraversalBase extends ForwardTraversal {
val IR: BaseExp
import IR._
override def traverseStm[T](s: Stm[T]) = s match {
// custom base case or delegate to super
case _ => super.traverseStm(s)
}
}
trait MyTraversalArith extends MyTraversalBase {
val IR: ArithExp
import IR._
override def traverseStm[T](s: Stm[T]) = s match {
case Plus(x,y) => ... // handle specific nodes
case _ => super.traverseStm(s)
}
}
```

For each unit of functionality such as `Arith`

or `IfThenElse`

the traversal
actions can be defined separately as `MyTraversalArith`

and
`MyTraversalIfThenElse`

.

Finally, we can use our traversal as follows:

```
trait Prog extends Arith {
def main = ... // program code here
}
val impl = new Prog with ArithExp
val res = impl.reifyBlock(impl.main)
val inspect = MyTraversalArith { val IR: impl.type = impl }
inspect.traverseBlock(res)
```

In essence, traversals confront us with the classic “expression problem” of
independently extending a data model with new data variants and new
operations (*). There are many solutions to this problem
but most of them are rather heavyweight. More lightweight implementations are
possible in languages that support multi-methods, i.e. dispatch method calls
dynamically based on the actual types of all the arguments. We can achieve
essentially the same using pattern matching and mixin composition, making use
of the fact that composing traits is subject to linearization
(*). We package each set of specific traversal
rules into its own trait, e.g. `MyTraversalArith`

that inherits from
`MyTraversalBase`

and overrides `traverseStm`

. When the arguments do not match
the rewriting pattern, the overridden method will invoke the “parent”
implementation using `super`

. When several such traits are combined, the super
calls will traverse the overridden method implementations according to the
linearization order of their containing traits. The use of pattern matching
and super calls is similar to earlier work on extensible algebraic data types
with defaults (*), which supported linear
extensions but not composition of independent extensions.

Implementing multi-methods in a statically typed setting usually poses three problems: separate type checking/compilation, ensuring non-ambiguity and ensuring exhaustiveness. The described encoding supports separate type-checking and compilation in as far as traits do. Ambiguity is ruled out by always following the linearization order and the first-match semantics of pattern matching. Exhaustiveness is ensured at the type level by requiring a default implementation, although no guarantees can be made that the default will not choose to throw an exception at runtime. In the particular case of traversals, the default is always safe and will just continue the structural traversal.

Code generation is just a traversal pass that prints code. Compiling and executing code can use the same mechanism as described above.

Transformations work very similar to traversals. One option is to traverse and transform an existing program more or less in place, not actually modifying data but attaching new Defs to existing Syms:

```
trait SimpleTransformer {
val IR: Expressions
import IR._
def transformBlock[T](b: Block[T]): Block[T] =
Block(b.stms.flatMap(transformStm), transformExp(b.res))
def transformStm[T](s: Stm[T]): List[Stm] =
List(Stm(s.lhs, transformDef(s.rhs))) // preserve existing symbol s
def transformDef[T](d: Def[T]): Def[T] // default: use reflection
// to map over case classes
}
```

An implementation is straightforward:

```
trait MySimpleTransformer extends SimpleTransformer {
val IR: IfThenElseExp
import IR._
// override transformDef for each Def subclass
def transformDef[T](d: Def[T]): Def[T] = d match {
case IfThenElse(c,a,b) =>
IfThenElse(transformExp(c), transformBlock(a), transformBlock(b))
case _ => super.transformDef(d)
}
}
```

Another option that is more principled and in line with the idea of making compiler transforms programmable through the use of staging is to traverse the old program and create a new program. Effectively we are implementing an IR interpreter that executes staging commands, which greatly simplifies the implementation of the transform and removes the need for low-level IR manipulation.

In the implementation, we will create new symbols instead of reusing existing ones so we need to maintain a substitution that maps old to new Syms. The core implementation is given below:

```
trait ForwardTransformer extends ForwardTraversal {
val IR: Expressions
import IR._
var subst: Map[Exp[_],Exp[_]]
def transformExp[T](s: Exp[T]): Exp[T] = ... // lookup s in subst
def transformDef[T](d: Def[T]): Exp[T] // default
def transformStm[T](s: Stm[T]): Exp[T] = {
val e = transformDef(s.rhs); subst += (s.sym -> e); e
}
override def traverseStm[T](s: Stm[T]): Unit = {
transformStm(s)
}
def reflectBlock[T](b: Block[T]): Exp[T] = withSubstScope {
traverseBlock(b); transformExp(b.res)
}
def transformBlock[T](b: Block[T]): Block[T] = {
reifyBlock(reflectBlock(b))
}
}
```

Here is a simple identity transformer implementation for conditionals and array construction:

```
trait MyTransformer extends ForwardTransformer {
val IR: IfThenElseExp with ArraysExp
import IR._
def transformDef[T](d: Def[T]): Exp[T] = d match {
case IfThenElse(c,a,b) =>
__ifThenElse(transformExp(c), reflectBlock(a), reflectBlock(b))
case ArrayFill(n,i,y) =>
arrayFill(transformExp(n), { j => withSubstScope(i -> j) { reflectBlock(y) }})
case _ => ...
}
}
```

The staged transformer facility can be extended slightly to translate not only within a single language but also between two languages:

```
trait FlexTransformer {
val SRC: Expressions
val DST: Base
trait TypeTransform[A,B]
var subst: Map[SRC.Exp[_],DST.Rep[_]]
def transformExp[A,B](s: SRC.Exp[A])(implicit t: TypeTransform[A,B]): DST.Rep[B]
}
```

It is also possible to add more abstraction on top of the base transforms to build combinators for rewriting strategies in the style of Stratego (*) or Kiama (*).

This all works but is not completely satisfactory. With fine grained separate transformations we immediately run into phase ordering problems (*). We could execute optimization passes in a loop until we reach a fixpoint but even then we may miss opportunities if the program contains loops. For best results, optimizations need to be tightly integrated. Optimizations need a different mechanisms than lowering transformations that have a clearly defined before and after model. In the next chapter, we will thus consider a slightly different IR representation.

To remedy phase ordering problems and overall allow for more flexibility in rearranging program pieces, we switch to a program representation based on structured graphs. This representation is not to be confused with control-flow graphs (CFGs): Since one of our main goals is parallelization, a sequential CFG would not be a good fit.

Let us first consider a purely functional language subset. There are much more possibilities for aggressive optimizations. We can rely on referential transparency: The value of an expression is always the same, no matter when and where it is computed. Thus, optimizations do not need to check availability or lifetimes of expressions. Global common subexpression elimination (CSE), pattern rewrites, dead code elimination (DCE) and code motion are considerably simpler than the usual implementations for imperative programs.

We switch to a “sea of nodes”-like (*) representation that is a directed (and for the moment, acyclic) graph:

```
trait Expressions {
// expressions (atomic)
abstract class Exp[T]
case class Const[T](x: T) extends Exp[T]
case class Sym[T](n: Int) extends Exp[T]
def fresh[T]: Sym[T]
// definitions (composite, subclasses provided by other traits)
abstract class Def[T]
// blocks -- no direct links to statements
case class Block[T](res: Exp[T])
// bind definitions to symbols automatically
// by creating a statement
implicit def toAtom[T](d: Def[T]): Exp[T] =
reflectPure(d)
def reifyBlock[T](b: =>Exp[T]): Block[T]
def reflectPure[T](d: Def[T]): Sym[T] =
findOrCreateDefinition(d)
def findDefinition[T](s: Sym[T]): Option[Def[T]]
def findDefinition[T](d: Def[T]): Option[Sym[T]]
def findOrCreateDefinition[T](d: Def[T]): Sym[T]
}
```

It is instructive to compare the definition of trait `Expressions`

with the
one from the previous chapter. Again there are three
categories of objects involved: expressions, which are atomic (subclasses of
`Exp`

: constants and symbols; with a “gensym” operator `fresh`

to create
fresh symbols), definitions, which represent composite operations (subclasses
of `Def`

, to be provided by other components), and blocks, which model nested
scopes.

Trait `Expressions`

now provides methods to find a definition given a symbol
or vice versa. Direct links between blocks and statements are removed. The
actual graph nodes are `(Sym[T], Def[T])`

pairs. They need not be accessible
to clients at this level. Thus method `reflectStm`

from the previous chapter
is replaced by `reflectPure`

.

Graphs also carry nesting information (boundSyms, see below). This enables code motion for different kinds of nested expressions such as lambdas, not only for loops or conditionals. The structured graph representation is also more appropriate for parallel execution than the traditional sequential control-flow graph. Pure computation can float freely in the graph and can be scheduled for execution anywhere.

The object language implementation code is the same compared to the tree representation:

```
trait BaseExp extends Base with Expressions {
type Rep[T] = Exp[T]
}
```

Again, we have separate traits, one for each unit of functionality:

```
trait ArithExp extends BaseExp with Arith {
case class Plus(x: Exp[Double], y: Exp[Double]) extends Def[Double]
case class Times(x: Exp[Double], y: Exp[Double]) extends Def[Double]
def infix_+(x: Rep[Double], y: Rep[Double]) = Plus(x, y)
def infix_*(x: Rep[Double], y: Rep[Double]) = Times(x, y)
...
}
trait IfThenElseExp extends BaseExp with IfThenElse {
case class IfThenElse(c: Exp[Boolean], a: Block[T], b: Block[T]) extends Def[T]
def __ifThenElse[T](c: Rep[Boolean], a: =>Rep[T], b: =>Rep[T]): Rep[T] =
IfThenElse(c, reifyBlock(a), reifyBlock(b))
}
```

Several optimizations are very simple to implement on this purely functional graph IR. The implementation draws inspiration from previous work on compiling embedded DSLs (*) as well as staged FFT kernels (*).

Common subexpressions are eliminated during IR construction using hash consing:

```
def findOrCreateDefinition[T](d: Def[T]): Sym[T]
```

Invoked by `reflectPure`

through the implicit conversion method `toAtom`

,
this method converts a definition to an atomic expression and links it to the
scope being built up by the innermost enclosing `reifyBlock`

call. When the
definition is known to be side-effect free, it will search the already
encountered definitions for a structurally equivalent one. If a matching
previous definition is found, its symbol will be returned, possibly moving the
definition to a parent scope to make it accessible. If the definition may
have side effects or it is seen for the first time, it will be associated
with a fresh symbol and saved for future reference. This simple scheme
provides a powerful global value numbering optimization
(*) that effectively prevents generating duplicate
code.

Using `findDefinition`

, we can implement an extractor object
(*) that enables pattern matching on a symbol to
lookup the underlying definition associated to the symbol:

```
object Def {
def unapply[T](s: Exp[T]): Option[Def[T]] = s match {
case s: Sym[T] => findDefinition(s)
case _ => None
}
}
```

This extractor object can be used to implement smart constructors for IR nodes that deeply inspect their arguments:

```
def infix_*(x: Exp[Double], y: Exp[Double]) = (x,y) match {
case (Const(x), Const(y)) => Const(x * y)
case (Const(k), Def(Times(Const(l), y))) => Const(k * l) * y
case _ => Times(x,y)
}
```

Smart constructors are a simple yet powerful rewriting facility. If the smart
constructor is the only way to construct `Times`

nodes we obtain a strong
guarantee: No `Times`

node is ever created without applying all possible
rewrites first.

Some profitable optimizations, such as the global value numbering described
above, are very generic. Other optimizations apply only to specific aspects of
functionality, for example particular implementations of constant folding (or
more generally symbolic rewritings) such as replacing computations like
`x * 1.0`

with `x`

. Yet other optimizations are specific to the actual program
being staged. Kiselyov et al. (*) describe a
number of rewritings that are particularly effective for the patterns of code
generated by a staged FFT algorithm but not as much for other programs. The
FFT example is discussed in more detail here.

What we want to achieve again is modularity, so that optimizations can be
combined in a way that is most useful for a given task. To implement a
particular rewriting rule (whether specific or generic), say, `x * 1.0`

$\rightarrow$ `x`

, we can provide a specialized implementation of `infix_*`

(overriding the one in trait `ArithExp`

) that will test its arguments for a
particular pattern. How this can be done in a modular way is shown by the
traits `ArithExpOpt`

and `ArithExpOptFFT`

, which implement some generic and
program specific optimizations. Note that the use of `x*y`

within the body of
`infix_*`

will apply the optimization recursively.

The appropriate pattern is to override the smart constructor in a separate trait and call the super implementation if no rewrite matches. This decouples optimizations from node type definitions.

```
trait ArithExpOpt extends ArithExp {
override def infix_*(x:Exp[Double],y:Exp[Double]) = (x,y) match {
case (Const(x), Const(y)) => Const(x * y)
case (x, Const(1)) => x
case (Const(1), y) => x
case _ => super.infix_*(x, y)
}
}
trait ArithExpOptFFT extends ArithExp {
override def infix_*(x:Exp[Double],y:Exp[Double]) = (x,y) match {
case (Const(k), Def(Times(Const(l), y))) => Const(k * l) * y
case (x, Def(Times(Const(k), y))) => Const(k) * (x * y))
case (Def(Times(Const(k), x)), y) => Const(k) * (x * y))
...
case (x, Const(y)) => Const(y) * x
case _ => super.infix_*(x, y)
}
}
```

Note that the trait linearization order defines the rewriting strategy. We
still maintain our guarantee that no `Times`

node could be rewritten further.

\begin{figure*}\centering
\includegraphics[width=\textwidth]{papers/cacm2012/figoverview.pdf}
\caption{\label{fig:overview}Component architecture. Arrows denote extends relationships,
dashed boxes represent units of functionality.}
\vspace{1cm}
\end{figure*}

Figure~\ref{fig:overview} shows the component architecture formed by base traits and corresponding optimizations.

Context and flow sensitive transformation become very important once we introduce effects. But even pure functional programs can profit from context information:

```
if (c) { if (c) a else b } else ...
```

The inner test on the same condition is redundant and will always succeed. How
do we detect this situation? In other cases we can use the Def extractor to
lookup the definition of a symbol. This will not work here, because Def works
on Exp input and produces a Def object as output. We however need to work on
the level of Exps, turning a Sym into `Const(true)`

based on context
information.

We need to adapt the way we construct IR nodes. When we enter the then
branch, we add `c$\rightarrow$Const(true)`

to a substitution. This
substitution needs to be applied to arguments of IR nodes constructed within
the then branch.

One possible solution would be add yet another type constructor, `Ref`

, with
an implicit conversion from Exp to Ref that applies the substitution. A
signature like `IfThenElse(c: Exp, ...)`

would become ```
IfThenElse(c: Ref,
<br />...)
```

. A simpler solution is to implement `toAtom`

in such a way that it
checks the resulting Def if any of its inputs need substitution and if so
invoke `mirror`

(see below) on the result Def, which will apply the
substitution, call the appropriate smart constructor and finally call `toAtom`

again with the transformed result.

In addition to optimizations performed during graph constructions, we can also
implement transformation that work directly on the graph structure. This is
useful if we need to do analysis on a larger portion of the program first and
only then apply the transformation. An example would be to find all `FooBar`

statements in a graph, and replace them uniformly with `BarBaz`

. All dependent
statements should re-execute their pattern rewrites, which might trigger on
the new `BarBaz`

input.

We introduce the concept of *mirroring*: Given an IR node, we want to apply a
substitution (or generally, a `Transformer`

) to the arguments and call the
appropriate smart constructor again. For every IR node type we require a
default `mirror`

implementation that calls back its smart constructor:

```
override def mirror[A](e: Def[A], f: Transformer): Exp[A] = e match {
case Plus(a,b) => f(a) + f(b) // calls infix_+
case Times(a,b) => f(a) * f(b)
case _ => super.mirror(e,f)
}
```

There are some restrictions if we are working directly on the graph level: In
general we have no (or only limited) context information because a single IR
node may occur multiple times in the final program. Thus, attempting to
simplify effectful or otherwise context-dependent expressions will produce
wrong results without an appropriate context. For pure expressions, a smart
constructor called from `mirror`

should not create new symbols apart from the
result and it should not call reifyBlock. Otherwise, if we were creating new
symbols when nothing changes, the returned symbol could not be used to check
convergence of an iterative transformation easily.

The `Transfomer`

argument to `mirror`

can be
queried to find out whether `mirror`

is allowed to call context
dependent methods:

```
override def mirror[A](e: Def[A], f: Transformer): Exp[A] = e match {
case IfThenElse(c,a,b) =>
if (f.hasContext)
__ifThenElse(f(c),f.reflectBlock(a),f.reflectBlock(b))
else
ifThenElse(f(c),f(a),f(b)) // context-free version
case _ => super.mirror(e,f)
}
```

If the context is guaranteed to be set up correctly, we call the regular smart
constructor and use `f.reflectBlock`

to call mirror recursively on the
contents of blocks `a`

and `b`

. Otherwise, we call a more restricted context
free method.

Dead code elimination can be performed purely on the graph level, simply by finding all statements reachable from the final result and discarding everything else.

We define a method to find all symbols a given object references directly:

```
def syms(x: Any): List[Sym[Any]]
```

If `x`

is a Sym itself, `syms(x)`

will return `List(x)`

. For a case class
instance that implements the `Product`

interface such as `Times(a,b)`

, it
will return `List(a,b)`

if both `a`

and `b`

are Syms. Since the argument type
is `Any`

, we can apply `syms`

not only to Def objects directly but also to
lists of Defs, for example.

Then, assuming `R`

is the final program result, the set of remaining symbols
in the graph `G`

is the least fixpoint of:

```
G = R $\cup$ syms(G map findDefinition)
```

Dead code elimination will discard all other nodes.

To turn program graphs back into trees for code generation we have to decide which graph nodes should go where in the resulting program. This is the task of code motion.

Other optimizations can apply transformations optimistically and need not worry about maintaining a correct schedule: Code motion will fix it up. The algorithm will try to push statements inside conditional branches and hoist statements out of loops. Code motion depends on dependency and frequency information but not directly on data-flow information. Thus it can treat functions or other user defined compound statements in the same way as loops. This makes our algorithm different from code motion algorithms based on data flow analysis such as Lazy Code Motion (LCM, (*)) or Partial Redundancy Elimination (PRE, (*)).

The graph IR reflects “must before” (ordering) and “must inside” (containment) relations, as well as anti-dependence and frequency. These relations are implemented by the following methods, which can be overridden for new definition classes:

```
def syms(e: Any): List[Sym[Any]] // value dependence (must before)
def softSyms(e: Any): List[Sym[Any]] // anti dependence (must not after)
def boundSyms(e: Any): List[Sym[Any]] // nesting dependence (must not outside)
def symsFreq(e: Any): List[(Sym[Any], // frequency information (classify
Double)] // sym as 'hot', 'normal', 'cold')
```

To give an example, `boundSyms`

applied to a loop node
`RangeForeach(range,idx,body)`

with index variable `idx`

would return
`List(idx)`

to denote that `idx`

is fixed “inside” the loop expression.

Given a subgraph and a list of result nodes, the goal is to identify the graph nodes that should form the “current” level, as opposed to those that should remain in some “inner” scope, to be scheduled later. We will reason about the paths on which statements can be reached from the result. The first idea is to retain all nodes on the current level that are reachable on a path that does not cross any conditionals, i.e. that has no “cold” refs. Nodes only used from conditionals will be pushed down. However, this approach does not yet reflect the precedence of loops. If a loop is top-level, then conditionals inside the loop (even if deeply nested) should not prevent hoisting of statements. So we refine the characterization to retain all nodes that are reachable on a path that does not cross top-level conditionals.

\begin{figure} \begin{center} \includegraphics[width=0.7\textwidth]{fig_graph_nesting.pdf} \end{center} \caption{\label{fig:codemotion}Graph IR with regular and nesting edges (boundSyms, dotted line) as used for code motion.} \end{figure}

This leads to a simple iterative algorithm: Starting with the known top level statements, nodes reachable via normal links are added and for each hot ref, we follow nodes that are forced inside until we reach one that can become top- level again.

Code Motion Algorithm: Compute the set $L$ of top level statements for the current block, from a set of available statements $E$, a set of forced-inside statements $G \subseteq E$ and a block result $R$.

Start with $L$ containing the known top level statements, initially the (available) block result $R \cap E$.

Add to $L$ all nodes reachable from $L$ via normal links (neither hot nor cold) through $E-G$ (not forced inside).

For each hot ref from $L$ to a statement in $E-L$, follow any links through $G$, i.e. the nodes that are forced inside, if there are any. The first non-forced-inside nodes (the “hot fringe”) become top level as well (add to $L$).

Continue with 2 until a fixpoint is reached.

To implement this algorithm, we need to determine the set `G`

of nodes that
are forced inside and may not be part of the top level. We start with the
block result `R`

and a graph `E`

that has all unnecessary nodes removed (DCE
already performed):

```
E = R $\cup$ syms(E map findDefinition)
```

We then need a way to find all uses of a given symbol `s`

, up to but not
including the node where the symbol is bound:

```
U(s) = {s} $\cup$ { g $\in$ E | syms(findDefinition(g)) $\cap$ U(s) $\ne\emptyset$
&& s $\notin$ boundSyms(findDefinition(g))) }
```

We collect all bound symbols and their dependencies. These cannot live on the current level, they are forced inside:

```
B = boundSyms (E map findDefinition)
G = union (B map U) // must inside
```

Computing `U(s)`

for many symbols `s`

individually is costly but
implementations can exploit considerable sharing to optimize the computation
of `G`

.

The iteration above uses `G`

to follow forced-inside nodes after a hot ref
until a node is found that can be moved to the top level.

Let us consider a few examples to build some intuition about the code motion
behavior. In the code below, the starred conditional is on the fringe (first
statement that can be outside) and on a hot path (through the loop). Thus it
will be hoisted. Statement `foo`

will be moved inside:

```
loop { i => z = if* (x) foo
if (i > 0) loop { i =>
if* (x) if (i > 0)
foo z
} }
```

The situation changes if the inner conditional is forced inside by a value
dependency. Now statement `foo`

is on the hot fringe and becomes top level.

```
loop { i => z = foo*
if (x) loop { i =>
if (i > 0) if (x)
foo* if (i > 0)
} z
}
```

For loops inside conditionals, the containing statements will be moved inside (relative to the current level).

```
if (x) if (x)
loop { i => z = foo
foo loop { i =>
} z
}
```

The described algorithm works well and is reasonably efficient in practice. Being a heuristic, it cannot be optimal in all cases. Future versions could employ more elaborate cost models instead of the simple hot/cold distinction. One case worth mentioning is when a statement is used only in conditionals but in different conditionals:

```
z = foo if (x)
if (x) foo
z if (y)
if (y) foo
z
```

In this situation `foo`

will be duplicated. Often this duplication is
beneficial because `foo`

can be optimized together with other statements
inside the branches. In general of course there is a danger of slowing down
the program if both conditions are likely to be true at the same time. In that
case it would be a good idea anyways to restructure the program to factor out
the common criteria into a separate test.

Once we have determined which statements should occur on which level, we have to come up with an ordering for the statements. Before starting the code motion algorithm, we sort the input graph in topological order and we will use the same order for the final result. For the purpose of sorting, we include anti-dependencies in the topological sort although they are disregarded during dead code elimination. A bit of care must be taken though: If we introduce loops or recursive functions the graph can be cyclic, in which case no topological order exists. However, cycles are caused only by inner nodes pointing back to outer nodes and for sorting purposes we can remove these back-edges to obtain an acyclic graph.

To generate code or to perform transformation by iterated staging (see above) we need to turn our graph back into a tree. The interface to code motion allows us to build a generic tree-like traversal over our graph structure:

```
trait Traversal {
val IR: Expressions; import IR._
// perform code motion, maintaining current scope
def focusExactScope(r: Exp[Any])(body: List[Stm[Any]] => A): A
// client interface
def traverseBlock[T](b: Block[T]): Unit =
focusExactScope(b.res) { levelScope =>
levelScope.foreach(traverseStm)
}
def traverseStm[T](s: Stm[T]): Unit = blocks(s).foreach(traverseBlock)
}
```

This is useful for other analyses as well, but in particular for building
transformers that traverse one graph in a tree like fashion and create another
graph analogous to the appraoach above above. The
implementation of trait `ForwardTransformer`

carries over almost unmodified.

To ensure that operations can be safely moved around (and for other optimizations as well), a compiler needs to reason about their possible side effects. The graph representation presented so far is pure and does not mention effects at all. However all the necessary ingredients are already there: We can keep track of side effects simply by making effect dependencies explicit in the graph. In essence, we turn all programs into functional programs by adding an invisible state parameter (similar in spirit but not identical to SSA conversion).

We first consider global effects like console output via `println`

.
Distinguishing only between “has no effect” and “may have effect” means
that all operations on mutable data structures, including reads, have to be
serialized along with all other side effects.

By default, we assume operations to be pure (i.e. side-effect free).
Programmers can designate effectful operations by using `reflectEffect`

instead of the implicit conversion `toAtom`

which internally delegates to
`reflectPure`

. Console output, for example, is implemented like this:

```
def print(x: Exp[String]): Exp[Unit] = reflectEffect(Print(x))
```

The call to `reflectEffect`

adds the passed IR node to a list of effects for
the current block. Effectful expressions will attract dependency edges
between them to ensure serialization. A compound expression such as a loop
or a conditional will internally use `reifyBlock`

, which attaches nesting
edges to the effectful nodes contained in the block.

Internally, `reflectEffect`

creates `Reflect`

nodes that keep track of the
context dependencies:

```
var context: List[Exp[Any]]
case class Reflect[T](d: Def[T], es: List[Sym[Any]]) extends Def[T]
def reflectEffect[T](d: Def[T]): Exp[T] = createDefinition(Reflect(d, context)).sym
```

The context denotes the “current state”. Since state can be seen as an abstraction of effect history, we just define context as a list of the previous effects.

In this simple model, all effect dependencies are uniformly encoded in the IR graph. Rewriting, CSE, DCE, and Code Motion are disabled for effectful statements (very pessimistic). Naturally we would like something more fine grained for mutable data.

We can add other, more fine grained, variants of `reflectEffect`

which allow
tracking mutations per allocation site or other, more general abstractions of
the heap that provide a partitioning into regions. Aliasing and sharing of
heap objects such as arrays can be tracked via optional annotations on IR
nodes. Reads and writes of mutable objects are automatically serialized and
appropriate dependencies inserted to guarantee a legal execution schedule.

Effectful statements are tagged with an effect summary that further describes
the effect. The summary can be extracted via `summarizeEffects`

, and there
are some operations on summaries (like `orElse`

, `andThen`

) to combine
effects. As an example consider the definition of conditionals, which computes
the compound effect from the effects of the two branches:

```
def __ifThenElse[T](cond: Exp[Boolean], thenp: => Rep[T], elsep: => Rep[T]) {
val a = reifyBlock(thenp)
val b = reifyBlock(elsep)
val ae = summarizeEffects(a) // get summaries of the branches
val be = summarizeEffects(b)
val summary = ae orElse be // compute summary for whole expression
reflectEffect(IfThenElse(cond, a, b), summary) // reflect compound expression
// (effect might be none, i.e. pure)
}
```

To specify effects more precisely for different kinds of IR nodes, we add
further `reflect`

methods:

```
reflectSimple // a 'simple' effect: serialized with other simple effects
reflectMutable // an allocation of a mutable object; result guaranteed unique
reflectWrite(v) // a write to v: must refer to a mutable allocation
// (reflectMutable IR node)
reflectRead(v) // a read of allocation v (not used by programmer,
// inserted implicitly)
reflectEffect(s) // provide explicit summary s, specify may/must info for
// multiple reads/writes
```

The framework will serialize reads and writes so to respect data and anti- dependency with respect to the referenced allocations. To make this work we also need to keep track of sharing and aliasing. Programmers can provide for their IR nodes a list of input expressions which the result of the IR node may alias, contain, extract from or copy from.

```
def aliasSyms(e: Any): List[Sym[Any]]
def containSyms(e: Any): List[Sym[Any]]
def extractSyms(e: Any): List[Sym[Any]]
def copySyms(e: Any): List[Sym[Any]]
```

These four pieces of information correspond to the possible pointer
operations `x = y`

, `*x = y`

, `x = *y`

and `*x = *y`

. Assuming an operation
`y = Foo(x)`

, `x`

should be returned in the following cases:

```
x $\in$ aliasSyms(y) if y = x // if then else
x $\in$ containSyms(y) if *y = x // array update
x $\in$ extractSyms(y) if y = *x // array apply
x $\in$ copySyms(y) if *y = *x // array clone
```

Here, `y = x`

is understood as “y may be equal to x”, `*y = x`

as
“dereferencing y (at some index) may return x” etc.

It is often useful to restrict the allowed effects somewhat to make analysis
more tractable and provide better optimizations. One model, which works
reasonably well for many applications, is to prohibit sharing and aliasing
between mutable objects. Furthermore, read and write operations must
unambiguously identify the allocation site of the object being accessed. The
framework uses the aliasing and points-to information to enforce these rules
and to keep track of immutable objects that point to mutable data. This is to
make sure the right serialization dependencies and `reflectRead`

calls are
inserted for operations that may reference mutable state in an indirect way.

We have seen above how many classic compiler optimizations can be applied to the IR generated from embedded programs in a straightforward way. Due to the structure of the IR, these optimizations all operate in an essentially global way, at the level of domain operations. In this chapter we discuss some other advanced optimizations that can be implemented on the graph IR. We present more elaborate examples for how these optimizations benefit larger use cases later.

Many optimizations that are traditionally implemented using an iterative dataflow analysis followed by a transformation pass can also be expressed using various flavors of rewriting. Whenever possible we tend to prefer the rewriting version because rewrite rules are easy to specify separately and do not require programmers to define abstract interpretation lattices.

Smart constructors in our graph IR can be context sensitive. For example, reads of local variables examine the current effect context to find the last assignment, implementing a form of copy propagation (middle):

```
var x = 7 var x = 7 println(5)
x = 5 x = 5
println(x) println(5)
```

This renders the stores dead, and they will be removed by dead code elimination later (right).

Many optimizations are mutually beneficial. In the presence of loops, optimizations need to make optimistic assumptions for the supporting analysis to obtain best results. If multiple analyses are run separately, each of them effectively makes pessimistic assumptions about the outcome of all others. Combined analyses avoid the phase ordering problem by solving everything at the same time. Lerner, Grove, and Chambers showed a method of composing optimizations by interleaving analyses and transformations (*). We use a modified version of their algorithm that works on structured loops instead of CFGs and using dependency information and rewriting instead of explicit data flow lattices. Usually, rewriting is semantics preserving, i.e. pessimistic. The idea is to drop that assumption. As a corollary, we need to rewrite speculatively and be able to rollback to a previous state to get optimistic optimization. The algorithm proceeds as follows: for each encountered loop, apply all possible transforms to the loop body, given empty initial assumptions. Analyze the result of the transformation: if any new information is discovered throw away the transformed loop body and retransform the original with updated assumptions. Repeat until the analysis result has reached a fixpoint and keep the last transformation as result.

Here is an example of speculative rewriting, showing the initial optimistic iteration (middle), with the fixpoint (right) reached after the second iteration:

```
var x = 7 var x = 7 var x = 7 //dead
var c = 0 var c = 0 var c = 0
while (c < 10) { while (true) { while (c < 10) {
if (x < 10) print("!") print("!") print("!")
else x = c print(7) print(7)
print(x) print(0) print(c)
print(c) c = 1 c += 1
c += 1 } }
}
```

This algorithm allows us to do all forward data flow analyses and transforms
in one uniform, combined pass driven by rewriting. In the example above,
during the initial iteration (middle), separately specified rewrites for
variables and conditionals work together to determine that `x=c`

is never
executed. At the end of the loop body we discover the write to `c`

, which
invalidates our initial optimistic assumption `c=0`

. We rewrite the original
body again with the augmented information (right). This time there is no
additional knowledge discovered so the last speculative rewrite becomes the
final one.

For some transformations, e.g. data structure representation lowering, we do not execute rewrites now, but later, to give further immediate rewrites a change to match on the current expression before it is rewritten. This is a simple form of prioritizing different rewrites, in this case optimizations over lowerings. It also happens to be a central idea behind telescoping languages (*).

We perform simplifications eagerly, after each transform phase. Thus we guarantee that CSE, DCE etc. have been applied on high-level operations before they are translated into lower-level equivalents, on which optimizations would be much harder to apply.

We call the mechanism to express this form of rewrites *delayed* rewriting.
Here is an example that delayedly transforms a plus operation on Vectors into
an operation on arrays:

```
def infix_plus(a: Rep[Vector[Double]], b: Rep[Vector[Double]]) = {
VectorPlus(a,b) atPhase(lowering) {
val data = Array.fill(a.length) { i => a(i) + b(i) }
vector_with_backing_array(data)
}
}
```

The transformation is only carried out at phase `lowering`

. Before that, the
IR node remains a `VectorPlus`

node, which allows other smart constructor
rewrites to kick in that match their arguments against `VectorPlus`

.

Technically, delayed rewriting is implemented using a worklist transformer that keeps track of the rewrites to be performed during the next iteration. The added convenience over using a transformer directly is that programmers can define simple lowerings inline without needing to subclass and install a transformer trait.

Since our graph IR contains structured expressions, optimizations need to work with compound statements. Reasoning about compound statements is not easy: For example, our simple dead code elimination algorithm will not be able to remove only pieces of a compound expression. Our solution is simple yet effective: We eagerly split many kinds of compound statements, assuming optimistically that only parts will be needed. We find out which parts through the regular DCE algorithm. Afterwards we reassemble the remaining pieces.

A good example of statement splitting are effectful conditionals:

```
var a, b, c = ... var a, b, c = ... var a, c = ...
if (cond) { if (cond) if (cond)
a = 9 a = 9 a = 9
b = 1 if (cond) else
} else b = 1 c = 3
c = 3 if (!cond) println(a+c)
println(a+c) c = 3
println(a+c)
```

From the conditional in the initial program (left), splitting creates three separate expressions, one for each referenced variable (middle). Pattern rewrites are executed when building the split nodes but do not have any effect here. Dead code elimination removes the middle one because variable b is not used, and the remaining conditionals are merged back together (right). Of course successful merging requires to keep track of how expressions have been split.

Splitting is also very effective for data structures, as often only parts of a data structure are used or modified. We can define a generic framework for data structures:

```
trait StructExp extends BaseExp {
abstract class StructTag
case class Struct[T](tag: StructTag, elems: Map[String,Rep[Any]]) extends Def[T]
case class Field[T](struct: Rep[Any], key: String) extends Def[T]
def struct[T](tag: StructTag, elems: Map[String,Rep[Any]]) = Struct(tag, elems)
def field[T](struct: Rep[Any], key: String): Rep[T] = struct match {
case Def(Struct(tag, elems)) => elems(index).asInstanceOf[Rep[T]]
case _ => Field[T](struct, index)
}
}
```

There are two IR node types, one for structure creation and one for field
access. The structure creation node contains a hash map that holds (static)
field identifiers and (dynamic) field values. It also contains a `tag`

that
can be used to hold further information about the nature of the data
structure. The interface for field accesses is method `field`

, which pattern
matches on its argument and, if that is a `Struct`

creation, looks up the
desired value from the embedded hash map.

We continue by adding a rule that makes the result of a conditional a
`Struct`

if the branches return `Struct`

:

```
override def ifThenElse[T](cond: Rep[Boolean], a: Rep[T], b: Rep[T]) =
(a,b) match {
case (Def(Struct(tagA,elemsA)), Def(Struct(tagB, elemsB))) =>
assert(tagA == tagB)
assert(elemsA.keySet == elemsB.keySet)
Struct(tagA, elemsA.keySet map (k => ifThenElse(cond, elemsA(k), elemsB(k))))
case _ => super.ifThenElse(cond,a,b)
}
```

Similar rules are added for many of the other core IR node types. DCE can remove individual elements of the data structure that are never used. During code generation and tree traversals, the remaining parts of the split conditional are merged back together.

We will study examples of this struct abstraction here and an extension to unions and inheritance in here.

A natural extension of this mechanism is a generic array-of-struct to struct-
of-array transform. The definition is analogous to that of conditionals. We
override the array constructor `arrayFill`

that represents expressions of the
form \c|Array.fill(n) { i => body }| to create a struct with an array for each
component of the body if the body itself is a Struct:

```
override def arrayFill[T](size: Exp[Int], v: Sym[Int], body: Def[T]) = body match {
case Block(Def(Struct(tag, elems))) =>
struct[T](ArraySoaTag(tag,size),
elems.map(p => (p._1, arrayFill(size, v, Block(p._2)))))
case _ => super.arrayFill(size, v, body)
}
```

Note that we tag the result struct with an `ArraySoaTag`

to keep track of the
transformation. This class is defined as follows:

```
case class ArraySoaTag(base: StructTag, len: Exp[Int]) extends StructTag
```

We also override the methods that are used to access array elements and return the length of an array to do the right thing for transformed arrays:

```
override def infix_apply[T](a: Rep[Array[T]], i: Rep[Int]) = a match {
case Def(Struct(ArraySoaTag(tag,len),elems)) =>
struct[T](tag, elems.map(p => (p._1, infix_apply(p._2, i))))
case _ => super.infix_at(a,i)
}
override def infix_length[T](a: Rep[Array[T]]): Rep[Int] = a match {
case Def(Struct(ArraySoaTag(tag, len), elems)) => len
case _ => super.infix_length(a)
}
```

Examples for this struct of array transformation are shown in here and here.

The use of independent and freely composable traversal operations such as
`v.map(..).sum`

is preferable to explicitly coded loops. However, naive
implementations of these operations would be expensive and entail lots of
intermediate data structures. We provide a novel loop fusion algorithm for
data parallel loops and traversals (see here for
examples of use). The core loop abstraction is

```
loop(s) $\overline{\mathtt{{x=}}\G}$ { i => $\overline{E[ \mathtt{x} \yield \mathtt{f(i)} ]}$ }
```

where `s`

is the size of the loop and `i`

the loop variable ranging over
$[0,\mathtt{s})$. A loop can compute multiple results $\overline{\mathtt{x}}$,
each of which is associated with a generator $\G$, one of `Collect`

, which
creates a flat array-like data structure, `Reduce($\oplus$)`

, which reduces
values with the associative operation $\oplus$, or `Bucket($\G$)`

, which
creates a nested data structure, grouping generated values by key and applying
$\G$ to those with matching key. Loop bodies consist of yield statements ```
x
<br />$\yield$ f(i)
```

that define values passed to generators (of this loop or an
outer loop), embedded in some outer context $E[.]$ that might consist of other
loops or conditionals. For `Bucket`

generators yield takes (key,value) pairs.

The fusion rules are summarized below:

```
Generator kinds: $\mathcal{G} ::= $ `Collect` $|$ `Reduce($\oplus$)` $|$ `Bucket($\mathcal{G`$)} \\
Yield statement: xs $\yield$ x \\
Contexts: $E[.] ::= $ loops and conditionals
Horizontal case (for all types of generators):
loop(s) x1=$\G_1$ { i1 => $E_1[$ x1 $\yield$ f1(i1) $]$ }
loop(s) y1=$\G_2$ { i2 => $E_2[$ x2 $\yield$ f2(i2) $]$ }
---------------------------------------------------------------------
loop(s) x1=$\G_1$, x2=$\G_2$ { i =>
$E_1[$ x1 $\yield$ f1(i) $]$; $E_2[$ x2 $\yield$ f2(i) $]$ }
Vertical case (consume collect):
loop(s) x1=Collect { i1 => $E_1[$ x1 $\yield$ f1(i1) $]$ }
loop(x1.size) x2=$\G$ { i2 => $E_2[$ x2 $\yield$ f2(x1(i2)) $]$ }
---------------------------------------------------------------------
```

loop(s) x1=Collect, x2=$\G$ { i =>

```
$E_1[$ x1 $\yield$ f1(i); $E_2[$ x2 $\yield$ f2(f1(i)) $]]$ }
Vertical case (consume bucket collect):
loop(s) x1=Bucket(Collect) { i1 =>
$E_1[$ x1 $\yield$ (k1(i1), f1(i1)) $]$ }
loop(x1.size) x2=Collect { i2 =>
loop(x1(i2).size) y=$\G$ { j =>
$E_2[$ y $\yield$ f2(x1(i2)(j)) $]$ }; x2 $\yield$ y }
---------------------------------------------------------------------
loop(s) x1=Bucket(Collect), x2=Bucket($\G$) { i =>
$E_1[$ x1 $\yield$ (k1(i), f1(i));
$E_2[$ x2 $\yield$ (k1(i), f2(f1(i))) $]]$ }
```

This model is expressive enough to model many common collection operations:

```
x=v.map(f) loop(v.size) x=Collect { i => x $\yield$ f(v(i)) }
x=v.sum loop(v.size) x=Reduce(+) { i => x $\yield$ v(i) }
x=v.filter(p) loop(v.size) x=Collect { i => if (p(v(i)))
x $\yield$ v(i) }
x=v.flatMap(f) loop(v.size) x=Collect { i => val w = f(v(i))
loop(w.size) { j => x $\yield$ w(j) }}
x=v.distinct loop(v.size) x=Bucket(Reduce(rhs)) { i =>
x $\yield$ (v(i), v(i)) }
```

Other operations are accommodated by generalizing slightly. Instead of
implementing a `groupBy`

operation that returns a sequence of (Key,
Seq[Value]) pairs we can return the keys and values in separate data
structures. The equivalent of `(ks,vs)=v.groupBy(k).unzip`

is:

```
loop(v.size) ks=Bucket(Reduce(rhs)),vs=Bucket(Collect) { i =>
ks $\yield$ (v(i), v(i)); vs $\yield$ (v(i), v(i)) }
```

In the rules above, multiple instances of `f1(i)`

are subject to CSE and not
evaluated twice. Substituting `x1(i2)`

with `f1(i)`

will remove a reference to
`x1`

. If `x1`

is not used anywhere else, it will also be subject to DCE.
Within fused loop bodies, unifying index variable `i`

and substituting
references will trigger the uniform forward transformation pass. Thus, fusion
not only removes intermediate data structures but also provides additional
optimization opportunities inside fused loop bodies (including fusion of
nested loops).

Fixed size array construction `Array(a,b,c)`

can be expressed as

```
loop(3) x=Collect { case 0 => x $\yield$ a
case 1 => x $\yield$ b case 2 => x $\yield$ c }
```

and concatenation `xs ++ ys`

as `Array(xs,ys).flatMap(i=>i)`

:

```
loop(2) x=Collect { case 0 => loop(xs.size) { i => x $\yield$ xs(i) }
case 1 => loop(ys.size) { i => x $\yield$ ys(i) }}
```

Fusing these patterns with a consumer will duplicate the consumer code into each match case. Implementations should have some kind of cutoff value to prevent code explosion. Code generation does not need to emit actual loops for fixed array constructions but can just produce the right sequencing of yield operations.

Examples for the fusion algorithm are shown here.

Comments? Suggestions for improvement? View this file on GitHub.