04_atwork.scala // Jump To …

(Chapter 0) Intro: Abstraction Without Regret

\label{chap:400}

Outline:

LMS is a dynamic multi-stage programming approach: We have the full Scala
language at our disposal to compose fragments of object code. In fact, DSL
programs are program generators that produce an object program IR when run.
DSL or library authors and application programmers can exploit this multi-
level nature to perform computations explicitly at staging time, so that the
generated program does not pay a runtime cost. Multi-stage programming shares
some similarities with partial evaluation (*), but instead
of an automatic binding-time analysis, the programmer makes binding times
explicit in the program. We have seen how LMS uses Rep types for this
purpose:

val s: Int = ...      // a static value: computed at staging time
val d: Rep[Int] = ... // a dynamic value: computed when generated program is run

Unlike with automatic partial evaluation, the programmer obtains a guarantee
about which expressions will be evaluated at staging time.

While moving computations from run time to staging time is an interesting
possibility, many computations actually depend on dynamic input and cannot be
done before the input is available. Nonetheless, explicit staging can be used
to combine dynamic computations more efficiently. Modern programming
languages provide indispensable constructs for abstracting and combining
program functionality. Without higher-order features such as first-class
functions or object and module systems, software development at scale would
not be possible. However, these abstraction mechanisms have a cost and make
it much harder for the compiler to generate efficient code.

Using explicit staging, we can use abstraction in the generator stage to
remove abstraction in the generated program. This holds both for control
(e.g. functions, continuations) and data abstractions (e.g. objects,
boxing). Some of the material in this chapter is taken from
(*).

Common Compiler Optimizations

We have seen in Part 3 how many classic compiler
optimizations can be applied to the IR generated from embedded programs in a
straightforward way. Among those generic optimizations are common
subexpression elimination, dead code elimination, constant folding and code
motion. Due to the structure of the IR, these optimizations all operate in an
essentially global way, at the level of domain operations. An important
difference to regular general-purpose compilers is that IR nodes carry
information about effects they incur (see here). This permits to
use quite precise dependency tracking that provides the code generator with a
lot of freedom to group and rearrange operations. Consequently, optimizations
like common subexpression elimination and dead code elimination will easily
remove complex DSL operations that contain internal control-flow and may span
many lines of source code.

Common subexpression elimination (CSE) / global value numbering (GVN) for pure
nodes is handled by toAtom: whenever the Def in question has been
encountered before, its existing symbol is returned instead of a new one (see
here). Since the operation is pure, we do not need to check via
data flow analysis whether its result is available on the current path.
Instead we just insert a dependency and let the later code motion pass (see
here) schedule the operation in a correct order. Thus,
we achieve a similar effect as partial redundancy elimination
(PRE~(*)) but in a simpler way.

Based on frequency information for block expression, code motion will hoist
computation out of loops and push computation into conditional branches. Dead
code elimination is trivially included. Both optimizations are coarse
grained and work on the level of domain operations. For example, whole data
parallel loops will happily be hoisted out of other loops.

Consider the following user-written code:

v1 map { x =>
  val s = sum(v2.length) { i => v2(i) }
  x/s
}

This snippet scales elements in a vector v1 relative to the sum of v2's
elements. Without any extra work, the generic code motion transform places the
calculation of s (which is itself a loop) outside the loop over v1 because
it does not depend on the loop variable x.

val s = sum(v2.length) { i => v2(i) }
v1 map { x =>
  x/s
}

Delite: An End-to-End System for Embedded Parallel DSLs

\label{sec:delite}

This section gives an overview of our approach to developing and executing
embedded DSLs in parallel and on heterogeneous devices. A more thorough
description of Delite can be found here.

Delite seeks to alleviate the burden of building a high performance DSL by
providing reusable infrastructure. Delite DSLs are embedded in Scala using
LMS. On top of this layer, Delite is structured into a framework and a
runtime component. The framework provides primitives for parallel
operations such as map or reduce that DSL authors can use to define
higher-level operations. Once a DSL author uses Delite operations, Delite
handles code generating to multiple platforms (e.g. Scala and CUDA), and
handles difficult but common issues such as device communication and
synchronization. These capabilities are enabled by exploiting the domain-
specific knowledge and restricted semantics of the DSL compiler.

Building a Simple DSL

\label{subsec:lms}

On the surface, DSLs implemented on top of Delite appear very similar to
purely-embedded (i.e. library only) Scala-based DSLs. However, a key aspect
of LMS and hence Delite is that DSLs are split in two parts, interface and
implementation. Both parts can be assembled from components in the form of
Scala traits. DSL programs are written in terms of the DSL interface, agnostic
of the implementation. Part of each DSL interface is an abstract type
constructor Rep[_] that is used to wrap types in DSL programs. For example,
DSL programs use Rep[Int] wherever a regular program would use Int. The
DSL operations defined in the DSL interface (most of them are abstract
methods) are all expressed in terms of Rep types.

The DSL implementation provides a concrete instantiation of Rep as
expression trees (or graphs). The DSL operations left abstract in the
interface are implemented to create an expression representation of the
operation. Thus, as a result of executing the DSL program, we obtain an
analyzable representation of the very DSL program which we will refer to as IR
(intermediate representation).

To substantiate the description, let us consider an example step by step. A
simple (and rather pointless) program that calculates the average of 100
random numbers, written in a prototypical DSL MyDSL that includes numeric
vectors and basic console IO could look like this:

object HelloWorldRunner extends MyDSLApplicationRunner with HelloWorld
trait HelloWorld extends MyDSLApplication {
  def main() = {
    val v = Vector.rand(100)
    println("today's lucky number is: ")
    println(v.avg)
  }
}

Programs in our sample DSL live within traits that inherit from
MyDSLApplication, with method main as the entry point.

MyDSLApplication is a trait provided by the DSL that defines the DSL
interface. In addition to the actual DSL program, there is a singleton object
that inherits from MyDSLApplicationRunner and mixes in the trait that
contains the program. As the name implies, this object will be responsible for
directing the staged execution of the DSL application.

Here is the definition of MyDSL's components encountered so far:

trait MyDSLApplication extends DeliteApplication with MyDSL
trait MyDSLApplicationRunner extends DeliteApplicationRunner with MyDSLExp

trait MyDSL extends ScalaOpsPkg with VectorOps
trait MyDSLExp extends ScalaOpsPkgExp with VectorOpsExp with MyDSL

MyDSLApplicationRunner inherits the mechanics for invoking code generation
from DeliteApplication. We discuss how Delite provides these facilities
here. We observe a structural split in the inheritance
hierarchy that is rather fundamental: MyDSL defines the DSL interface,
MyDSLExp the implementation. A DSL program is written with respect to the
interface but it knows nothing about the implementation. The main reason for
this separation is safety. If a DSL program could observe its own structure,
optimizing rewrites that maintain semantic but not structural equality of DSL
expressions could no longer be applied safely.\footnote{In fact, this is the
main reason why MSP languages do not allow inspection of staged code at all
(*).} Our sample DSL includes a set of common Scala
operations that are provided by the core LMS library as trait ScalaOpsPkg.
These operations include conditionals, loops, variables and also println. On
top of this set of generic things that are inherited from Scala, the DSL
contains vectors and associated operations. The corresponding interface is
defined as follows:

trait VectorOps extends Base {

abstract class Vector[T]                    // placeholder ("phantom") type
object Vector {
  def rand(n: Rep[Int]) = vector_rand(n)    // invoked as: Vector.rand(n)
}
def vector_rand(n: Rep[Int]): Rep[Vector[Double]]
def infix_length[T](v: Rep[Vector[T]]): Rep[Int]      // invoked as: v.length
def infix_sum[T:Numeric](v: Rep[Vector[T]]): Rep[T]   // invoked as: v.sum
def infix_avg[T:Numeric](v: Rep[Vector[T]]): Rep[T]   // invoked as: v.avg
...

}

There is an abstract class Vector[T] for vectors with element type T. The
notation T:Numeric means that T may only range over numeric types such as
Int or Double. Operations on vectors are not declared as instance methods
of Vector[T] but as external functions over values of type Rep[Vector[T]].

Returning to our sample DSL, this is the definition of VectorOpsExp, the
implementation counterpart to the interface defined above in VectorOps:

trait VectorOpsExp extends DeliteOpsExp with VectorOps {
  case class VectorRand[T](n: Exp[Int]) extends Def[Vector[Double]]
  case class VectorLength[T](v: Exp[Vector[T]]) extends Def[Int]

  case class VectorSum[T:Numeric](v: Exp[Vector[T]]) extends DeliteOpLoop[Exp[T]] {
    val range = v.length
    val body = DeliteReduceElem[T](v)(_ + _) // scalar addition (impl not shown)
  }

  def vector_rand(n: Rep[Int]) = VectorRand(n)
  def infix_length[T](v: Rep[Vector[T]]) = VectorLength(v)
  def infix_sum[T:Numeric](v: Rep[Vector[T]]) = VectorSum(v)
  def infix_avg[T:Numeric](v: Rep[Vector[T]]) = v.sum / v.length
  ...
}

The constructor rand and the function length are implemented as new plain
IR nodes (extending Def). Operation avg is implemented directly in terms
of sum and length whereas sum is implemented as a DeliteOpLoop with a
DeliteReduceElem body. These special classes of structured IR nodes are
provided by the Delite framework and are inherited via DeliteOpsExp.

Code Generation

\label{subsec:codegen}

The LMS framework provides a code generation infrastructure that includes a
program scheduler and a set of base code generators. The program scheduler
uses the data and control dependencies encoded by IR nodes to determine the
sequence of nodes that should be generated to produce the result of a block.
After the scheduler has determined a schedule, it invokes the code generator
on each node in turn. There is one code generator object per target platform
(e.g. Scala, CUDA, C++) that mixes together traits that describe how to
generate platform-specific code for each IR node. This organization makes it
easy for DSL authors to modularly extend the base code generators; they only
have to define additional traits to be mixed in with the base generator.

Therefore, DSL designers only have to add code generators for their own
domain-specific types. They inherit the common functionality of scheduling and
callbacks to the generation methods, and can also build on top of code
generator traits that have already been defined. In many cases, though, DSL
authors do not have to write code generators at all; the next section
describes how Delite takes over this responsibility for most operations.

The Delite Compiler Framework and Runtime

\label{subsec:delite}

On top of the LMS framework that provides the basic means to construct IR
nodes for DSL operations, the Delite Compiler Framework provides high-level
representations of execution patterns through DeliteOp IR, which includes a
set of common parallel execution patterns (e.g. map, zipWith, reduce).

DeliteOp extends Def, and DSL operations may extend one of the DeliteOps
that best describes the operation. For example, since VectorSum has the
semantics of iterating over the elements of the input Vector and adding them
to reduce to a single value, it can be implemented by extending DeliteOpLoop
with a reduction operation as its body. This significantly reduces the amount
of work in implementing a DSL operation since the DSL developers only need to
specify the necessary fields of the DeliteOp (range and body in the case
of DeliteOpLoop) instead of fully implementing the operation.

DeliteOpLoops are intended as parallel for-loops. Given an integer index
range, the runtime guarantees to execute the loop body exactly once for each
index but does not guarantee any execution order. Mutating global state from
within a loop is only safe at disjoint indexes. There are specialized
constructs to define loop bodies for map and reduce patterns
(DeliteCollectElem, DeliteReduceElem) that transform a collection of
elements point-wise or perform aggregation. An optional predicate can be added
to perform filter-style operations, i.e. select or aggregate only those
elements for which the predicate is true. All loop constructs can be fused
into DeliteOpLoops that do several operations at once.

Given the relaxed ordering guarantees, the framework can automatically
generate efficient parallel code for DeliteOps, targeting heterogeneous
parallel hardware. Therefore, DSL developers can easily implement parallel
DSL operations by extending one of the parallel DeliteOps, and only focus on
the language design without knowing the low-level details of the target
hardware.

The Delite Compiler Framework currently supports Scala, C++, and CUDA targets.
The framework provides code generators for each target in addition to a main
generator (Delite generator) that controls them. The Delite generator
iterates over the list of available target generators to emit the target-
specific kernels. By generating multiple target implementations of the kernels
and deferring the decision of which one to use, the framework provides the
runtime with enough flexibility in scheduling the kernels based on dynamic
information such as resource availability and input size. In addition to the
kernels, the Delite generator also generates the Delite Execution Graph
(DEG) of the application. The DEG is a high-level representation of the
program that encodes all necessary information for its execution, including
the list of inputs, outputs, and interdependencies of all kernels. After all
the kernels are generated, the Delite Runtime starts analyzing the DEG and
emits execution plans for each target hardware.

(Chapter 1) Control Abstraction

\label{chap:450}

Among the most useful control abstractions are higher order functions. We can
implement support for higher order functions in DSLs while keeping the
generated IR strictly first order. This vastly simplifies the compiler
implementation and makes optimizations much more effective since the compiler
does not have to reason about higher order control flow. We can implement a
higher order function foreach over Vectors as follows:

def infix_foreach[A](v: Rep[Vector[A]])(f: Rep[A] => Rep[Unit]) = {
  var i = 0; while (i < v.length) { f(v(i)); i += 1 }
}
// example:
Vector.rand(100) foreach { i => println(i) }

The generated code from the example will be strictly first order, consisting
of the unfolded definition of foreach with the application of f
substituted with the println statement:

while (i < v.length) { println(v(i)); i += 1 }

The unfolding is guaranteed by the Scala type system since f has type
Rep[A]=>Rep[Unit], meaning it will be executed statically but it operates on
staged values. In addition to simplifying the compiler, the generated code
does not pay any extra overhead. There are no closure allocations and no
inlining problems (*).

Other higher order functions like map or filter could be expressed on top
of foreach. Actual Delite DSLs implement these operations as data parallel
loops. The rest of this chapter shows how other control structures such as
continuations can be supported in the same way.

Leveraging Higher-Order Functions in the Generator

Higher-order functions are extremely useful to structure programs but also
pose a significant obstacle for compilers, recent advances on higher-order
control-flow analysis notwithstanding
(*). While
we would like to retain the structuring aspect for DSL programs, we would like
to avoid higher-order control flow in generated code. Fortunately, we can use
higher-order functions in the generator stage to compose first-order DSL
programs.

Consider the following program that prints the number of elements greater than
7 in some vector:

val xs: Rep[Vector[Int]] = ...
println(xs.count(x => x > 7))

The program makes essential use of a higher-order function count to count
the number of elements in a vector that fulfill a predicate given as a
function object. Ignoring for the time being that we would likely want to use
a DeliteOp for parallelism, a good and natural way to implement count
would be to first define a higher-order function foreach to iterate over
vectors, as shown at the beginning of the chapter:

def infix_foreach[A](v: Rep[Vector[A]])(f: Rep[A] => Rep[Unit]) = {
  var i: Rep[Int] = 0
  while (i < v.length) {
    f(v(i))
    i += 1
  }
}

The actual counting can then be implemented in terms of the traversal:

def infix_count[A](v: Rep[Vector[A]])(f: Rep[A] => Rep[Boolean]) = {
  var c: Rep[Int] = 0
  v foreach { x => if (f(x)) c += 1 }
  c
}

It is important to note that infix_foreach and infix_count are static
methods, i.e. calls will happen at staging time and result in inserting the
computed DSL code in the place of the call. Likewise, while the argument
vector v is a dynamic value, the function argument f is again static.
However, f operates on dynamic values, as made explicit by its type
Rep[A] => Rep[Boolean]. By contrast, a dynamic function value would have
type Rep[A => B].

This means that the code generated for the example program will look roughly
like this, assuming that vectors are represented as arrays in the generated
code:

val v: Array[Int] = ...
var c = 0
var i = 0
while (i < v.length) {
  val x = v(i)
  if (x > 7)
    c += 1
  i += 1
}
println(c)

All traces of higher-order control flow have been removed and the program is
strictly first-order. Moreover, the programmer can be sure that the DSL
program is composed in the desired way.

This issue of higher-order functions is a real problem for regular Scala
programs executed on the JVM. The Scala collection library uses essentially
the same foreach and count abstractions as above and the JVM, which
applies optimizations based on per-call-site profiling, will identify the
call site within foreach as a hot spot. However, since the number of
distinct functions called from foreach is usually large, inlining or other
optimizations cannot be applied and every iteration step pays the overhead of
a virtual method call (*).

Using Continuations in the Generator to Implement Backtracking

\label{sec:450bam}

Apart from pure performance improvements, we can use functionality of the
generator stage to enrich the functionality of DSLs without any work on the
DSL-compiler side. As an example we consider adding backtracking
nondeterministic computation to a DSL using a simple variant of McCarthy's
amb operator (*). Here is a nondeterministic program
that uses amb to find pythagorean triples from three given vectors:

val u,v,w: Rep[Vector[Int]] = ...
nondet {
  val a = amb(u)
  val b = amb(v)
  val c = amb(w)
  require(a*a + b*b == c*c)
  println("found:")
  println(a,b,c)
}

We can use Scala's support for delimited continuations
(*) and the associated control operators shift
and reset (*) to
implement the necessary primitives. The scope delimiter nondet is just the
regular reset. The other operators are defined as follows:

def amb[T](xs: Rep[Vector[T]]): Rep[T] @cps[Rep[Unit]] = shift { k =>
  xs foreach k
}
def require(x: Rep[Boolean]): Rep[Unit] @cps[Rep[Unit]] = shift { k =>
  if (x) k() else ()
}

Since the implementation of amb just calls the previously defined method
foreach, the generated code will be first-order and consist of three nested
while loops:

val u,v,w:Rep[Vector[Int]]=...
var i = 0
while (i < u.length) {
  val a = u(i)
  val a2 = a*a
  var j = 0
  while (j < v.length) {
    val b = v(j)
    val b2 = b*b
    val a2b2 = a2+b2
    var k = 0
    while (k < w.length) {
      val c = w(k)
      val c2 = c*c
      if (a2b2 == c2) {
        println("found:")
        println(a,b,c)
      }
      k += 1
    }
    j += 1
  }
  i += 1
}

Besides the advantage of not having to implement amb as part of the DSL
compiler, all common optimizations that apply to plain while loops are
automatically applied to the unfolded backtracking implementation. For
example, the loop invariant hoisting performed by code motion has moved the
computation of a*a and b*b out of the innermost loop.

The given implementation of amb is not the only possibility, though. For
situations where we know the number of choices (but not necessarily the actual
values) for a particular invocation of amb at staging time, we can
implement an alternative operator that takes a (static) list of dynamic
values and unfolds into specialized code paths for each option at compile
time:

def bam[T](xs: List[Rep[T]]): Rep[T] @cps[Rep[Unit]] = shift { k =>
  xs foreach k
}

Here, foreach is not a DSL operation but a plain traversal of the static
argument list xs. The bam operator must be employed with some care because
it incurs the risk of code explosion. However, static specialization of
nondeterministic code paths can be beneficial if it allows aborting many paths
early based on static criteria or merging computation between paths.

val u: Rep[Vector[Int]] = ...
nondet {
  val a = amb(u)
  val b = bam(List(x1), List(x2))
  val c = amb(v)
  require(a + c = f(b))  // assume f(b) is expensive to compute
  println("found:")
  println(a,b,c)
}

If this example was implemented as three nested loops, f(x1) and f(x2)
would need to be computed repeatedly in each iteration of the second loop as
they depend on the loop-bound variable b. However, the use of bam will
remove the loop over x1,x2 and expose the expensive computations as
redundant so that code motion can extract them from the loop:

val fx1 = f(x1)
val fx2 = f(x2)
while (...) { // iterate over u
  while (...) { // iterate over v
    if (a + c == fx1) // found
  }
  while (...) { // iterate over v
    if (a + c == fx2) // found
  }
}

In principle, the two adjacent inner loops could be subjected to the loop
fusion optimization discussed in here. This would remove
the duplicate traversal of v. In this particular case fusion is currently
not applied since it would change the order of the side-effecting println
statements.

Using Continuations in the Generator to Generate Async Code Patterns

\label{sec:cpsAsync}

The previous section used continuations that were completely translated away
during generation. In this section, we will use a CPS-transformed program
generator to generate code that is in CPS. While the previous section
generated only loops, we will generate actual functions in this section, using
the mechanisms described here. The example is taken
from~(*) and concerned with generating
JavaScript but the techniques apply to any target.

A callback-driven programming style is pervasive in JavaScript programs.
Because of lack of thread support, callbacks are used for I/O, scheduling and
event-handling. For example, in an Ajax call (Asynchronous JavaScript and
XML), one has to provide a callback that will be called once the requested
data arrives from the server. This style of programming is known to be
unwieldy in more complicated scenarios. To give a specific example, let us
consider a scenario where we have an array of Twitter account names and we
want to ask Twitter for the latest tweets of each account. Once we obtain the
tweets of all users, we would like to log that fact in a console.

We implement this program both directly in JavaScript and in the embedded
JavaScript DSL (*). Let us start by
implementing logic that fetches tweets for a single user by using the jQuery
library for Ajax calls:

function fetchTweets(user, callback) {
  jQuery.ajax({
    url: "http://api.twitter.com/1/"
     + "statuses/user_timeline.json/",
    type: "GET",
    dataType: "jsonp",
    data: {
      screen_name: user,
      include_rts: true,
      count: 5,
      include_entities: true
    },
    success: callback
  })
}

def fetchTweets(user: Rep[String]) =
  (ajax.get { new JSLiteral {
    val url = "http://api.twitter.com/1/"
          + "statuses/user_timeline.json"
    val `type` = "GET"
    val dataType = "jsonp"
    val data = new JSLiteral {
      val screen_name = user
      val include_rts = true
      val count = 5
      val include_entities = true
    }
  }
}).as[TwitterResponse]
type TwitterResponse =
  Array[JSLiteral {val text: String}]

Note that the JavaScript version takes a callback as second argument that will
be used to process the fetched tweets. We provide the rest of the logic that
iterates over array of users and makes Ajax requests:

var processed = 0
var users = ["gkossakowski",
  "odersky", "adriaanm"]
users.forEach(function (user) {
  console.log("fetching " + user)
  fetchTweets(user, function(data) {
    console.log("finished " + user)
    data.forEach(function (t) {
      console.log("fetched " + t.text)
    })
    processed += 1
    if (processed == users.length) {
      console.log("done")
    }
  })
})

val users = array("gkossakowski",
  "odersky", "adriaanm")
for (user <- users.parSuspendable) {
  console.log("fetching " + user)
  val tweets = fetchTweets(user)
  console.log("finished " + user)
  for (t <- tweets)
    console.log("fetched " + t.text)
}
console.log("done")

Because of the inverted control flow of callbacks, synchronization between
callbacks has to be handled manually. Also, the inverted control flow leads to
a code structure that is distant from the programmer's intent. Notice that the
in JavaScript version, the call to console that prints done" is put inside <br />of the foreach loop. If it was put it after the loop, we would getdone''
printed before any Ajax call has been made leading to counterintuitive
behavior.

As an alternative to the callback-driven programming style, one can use an
explicit monadic style, possibly sugared by a Haskell-like ``do''-notation.
However, this requires rewriting possibly large parts of a program into
monadic style when a single async operation is added. Another possibility is
to automatically transform the program into continuation passing style (CPS),
enabling the programmer to express the algorithm in a straightforward,
sequential way and creating all the necessary callbacks and book-keeping code
automatically. Links~(*) uses this approach. However, a whole-program
CPS transformation can cause performance degradation, code size blow-up, and
stack overflows. In addition, it is not clear how to interact with existing
non-CPS libraries as the whole program needs to adhere to the CPS style. Here
we use a selective CPS transformation, which precisely identifies
expressions that need to be CPS transformed.

In fact, the Scala compiler already does selective, @suspendable
type-annotation-driven CPS transformations of Scala programs
(*).
We show how this mechanism can be used for transforming our DSL code before
staging and stripping out most CPS abstractions at staging time. The generated
JavaScript code does not contain any CPS-specific code but is written in CPS-style
by use of JavaScript anonymous functions.

CPS and Staging

As an example, we will consider a seemingly blocking sleep method
implemented in a non-blocking, asynchronous style. In JavaScript, there are no
threads and there is no notion of blocking. However the technique is useful in
other circumstances as well, for example when using thread pools, as no thread
is being blocked during the sleep period. Let us see how our CPS transformed
sleep method can be used:

def foo() = {
  sleep(1000)
  println("Called foo")
}
reset {
  println("look, Ma ...")
  foo()
  sleep(1000)
  println(" no callbacks!")
}

We define sleep to use JavaScript's asynchronous setTimeout function,
which takes an explicit callback:

def sleep(delay: Rep[Int]) = shift { k: (Rep[Unit]=>Rep[Unit]) =>
  window.setTimeout(lambda(k), delay)
}

The setTimeout method expects an argument of type Rep[Unit=>Unit] i.e. a
staged function of type Unit=>Unit. The shift method offers us a function
of type Rep[Unit]=>Rep[Unit], so we need to reify it to obtain the desired
representation. The reification is achieved by the fun function provided by
LMS and performed at staging time.

It is important to note that reification preserves function composition.
Specifically, let f: Rep[A] => Rep[B] and g: Rep[B] => Rep[C] then
{\tt\small lambda(g compose f) == (lambda(g) compose lambda(f))} where we
consider two reified functions to be equal if they yield the same observable
effects at runtime. That property of function reification is at the core of
reusing the continuation monad in our DSL. Thanks to the fact that the
continuation monad composes functions, we can just insert reification at some
places (like in a sleep) and make sure that we reify effects of the
continuation monad without the need to reify the monad itself.

CPS for Interruptible Traversals

We need to be able to interrupt our execution while traversing an array in
order to implement the functionality shown above. Let us consider a
simplified example where we want to sleep during each iteration. Both
programs, when executed, will print the following output:

//pause for 1s
1
//pause for 2s
2
//pause for 3s
3
done

We present both JavaScript and DSL code that achieves that:

var xs = [1, 2, 3]
var i = 0
var msg = null
function f1() {
  if (i < xs.length) {
    window.setTimeout(f2, xs[i]*1000)
    msg = xs[i]
    i++
  }
}
function f2() {
  console.log(msg)
  f1()
}
f1()

val xs = array(1, 2, 3)
// shorthand for xs.suspendable.foreach
for (x <- xs.suspendable) {
  sleep(x * 1000)
  console.log(String.valueOf(x))
}
log("done")

In the DSL code, we use a suspendable variant of arrays, which is achieved
through an implicit conversion from regular arrays:

implicit def richArray(xs: Rep[Array[A]]) = new {
  def suspendable: SArrayOps[A] = new SArrayOps[A](xs)
}

The idea behind suspendable arrays is that iteration over them can be
interrupted. We will have a closer look at how to achieve that with the help
of CPS. The suspendable method returns a new instance of the SArrayOps
class defined here:

class SArrayOps(xs: Rep[Array[A]]) {
  def foreach(yld: Rep[A] => Rep[Unit] @suspendable):
    Rep[Unit] @suspendable = {
      var i = 0
      suspendableWhile(i < xs.length) { yld(xs(i)); i += 1 }
    }
}

Note that one cannot use while loops in CPS but we can simulate them with
recursive functions. Let us see how regular while loop can be simulated with
a recursive function:

def recursiveWhile(cond: => Boolean)(body: => Unit): Unit = {
  def rec = () => if (cond) { body; rec() } else {}
  rec()
}

By adding CPS-related declarations and control abstractions, we implement
suspendableWhile:

def suspendableWhile(cond: => Rep[Boolean])(
  body: => Rep[Unit] @suspendable): Rep[Unit] @suspendable =
  shift { k =>
    def rec = fun { () =>
      if (cond) reset { body; rec() } else { k() }
    }
    rec()
  }

Defining the Ajax API

With the abstractions for interruptible loops and traversals at hand, what
remains to complete the Twitter example from the beginning of the section is
the actual Ajax request/response cycle.

The Ajax interface component provides a type Request that captures the
request structure expected by the underlying JavaScript/jQuery implementation
and the necessary object and method definitions to enable the use of
ajax.get in user code:

trait Ajax extends JS with CPS {
  type Request = JSLiteral {
    val url: String
    val `type`: String
    val dataType: String
    val data: JSLiteral
  }
  type Response = Any
  object ajax {
    def get(request: Rep[Request]) = ajax_get(request)
  }
  def ajax_get(request: Rep[Request]): Rep[Response] @suspendable
}

Notice that the Request type is flexible enough to support an arbitrary
object literal type for the data field through subtyping. The Response
type alias points at Any which means that it is the user's responsibility
to either use dynamic or perform an explicit cast to the expected data
type.

The corresponding implementation component implements ajax_get to capture a
continuation, reify it as a staged function using fun and store it in an
AjaxGet IR node.

trait AjaxExp extends JSExp with Ajax {
  case class AjaxGet(request: Rep[Request],
    success: Rep[Response => Unit]) extends Def[Unit]
  def ajax_get(request: Rep[Request]): Rep[Response] @suspendable =
    shift { k =>
      reflectEffect(AjaxGet(request, lambda(k)))
    }
}

During code generation, we emit code to attach the captured continuation as a
callback function in the success field of the request object:

trait GenAjax extends JSGenBase {
  val IR: AjaxExp
  import IR._
  override def emitNode(sym: Sym[Any], rhs: Def[Any])(
    implicit stream: PrintWriter) = rhs match {
      case AjaxGet(req, succ) =>
        stream.println(quote(req) + ".success = " + quote(succ))
        emitValDef(sym, "jQuery.ajax(" + quote(req) + ")")
    case _ => super.emitNode(sym, rhs)
  }
}

It is interesting to note that, since the request already has the right
structure for the jQuery.ajax function, we can simply pass it to the
framework-provided quote method, which knows how to generate JavaScript
representations of any JSLiteral.

The Ajax component completes the functionality required to run the Twitter
example with one caveat: The outer loop in listing \ref{code:twitter_example}
uses parSuspendable to traverse arrays instead of the suspendable
traversal variant we have defined in listing \ref{code:suspendable_foreach}.

In fact, if we change the code to use suspendable instead of
parSuspendable and run the generated JavaScript program, we will get
following output printed to the JavaScript console:

fetching gkossakowski
finished fetching gkossakowski
fetched [...]
fetched [...]
fetching odersky
finished fetching odersky
fetched [...]
fetched [...]
fetching adriaanm
finished fetching adriaanm
fetched [...]
fetched [...]
done

Notice that all Ajax requests were done sequentially. Specifically, there was
just one Ajax request active at a given time; when the callback to process one
request is called, it would resume the continuation to start another request,
and so on. In many cases this is exactly the desired behavior, however, we
will most likely want to perform our Ajax request in parallel.

CPS for Parallelism

The goal of this section is to implement a parallel variant of the foreach
method from listing~\ref{code:suspendable_foreach}. We will start with
defining a few primitives like futures and dataflow cells. We start with
cells, which we decide to define in JavaScript, as another example of
integrating external libraries with our DSL:

function Cell() {
  this.value = undefined
  this.isDefined = false
  this.queue = []
  this.get = function (k) {
    if (this.isDefined) {
      k(this.value)
    } else {
      this.queue.push(k)
    }
  }
  this.set = function (v) {
    if (this.isDefined) {
      throw "can't set value twice"
    } else {
      this.value = v
      this.isDefined = true
      this.queue.forEach(function (f) {
        f(v) //non-trivial spawn could be used here
      })
    }
  }
}

A cell object allows us to track dependencies between values. Whenever the
get method is called and the value is not in the cell yet, the continuation
will be added to a queue so it can be suspended until the value arrives. The
set method takes care of resuming queued continuations. We expose Cell as
external library using our typed API mechanism and we use it for implementing
an abstraction for futures.

def createCell(): Rep[Cell[A]]
trait Cell[A]
trait CellOps[A] {
  def get(k: Rep[A => Unit]): Rep[Unit]
  def set(v: Rep[A]): Rep[Unit]
}
implicit def repToCellOps(x: Rep[Cell[A]]): CellOps[A] =
  repProxy[Cell[A],CellOps[A]](x)

def spawn(body: => Rep[Unit] @suspendable): Rep[Unit] = {
  reset(body) //non-trivial implementation uses
              //trampolining to prevent stack overflows
}
def future(body: => Rep[A] @suspendable) = {
  val cell = createCell[A]()
  spawn { cell.set(body) }
  cell
}

The last bit of general functionality we need is RichCellOps that ties
Cells and continuations together inside of our DSL.

class RichCellOps(cell: Rep[Cell[A]]) {
  def apply() = shift { k: (Rep[A] => Rep[Unit]) =>
    cell.get(lambda(k))
  }
}
implicit def richCellOps(x: Rep[Cell[A]]): RichCell[A] =
  new RichCellOps(x)

It is worth noting that RichCellOps is not reified so it will be dropped at
staging time and its method will get inlined whenever used. Also, it contains
CPS-specific code that allows us to capture the continuation. The fun
function reifies the captured continuation.

We are ready to present the parallel version of foreach defined in listing
\ref{code:suspendable_foreach}.

def foreach(yld: Rep[A] => Rep[Unit] @suspendable):
  Rep[Unit] @suspendable = {
    val futures = xs.map(x => future(yld(x)))
    futures.suspendable.foreach(_.apply())
  }

We instantiate each future separately so they can be executed in parallel. As
a second step we make sure that all futures are evaluated before we leave the
foreach method by forcing evaluation of each future and ``waiting'' for its
completion. Thanks to CPS transformations, all of that will be implemented in
a non-blocking style.

The only difference between the parallel and serial versions of the Twitter
example~\ref{code:twitter_example} is the use of parSuspendable instead of
suspendable so the parallel implementation of the foreach method is used.
The rest of the code stays the same. It is easy to switch between both
versions, and users are free to make their choice according to their needs
and performance requirements.

Guarantees by Construction

Making staged functions explicit through the use of lambda (as described
here) enables tight control over how functions are
structured and composed. For example, functions with multiple parameters can
be specialized for a subset of the parameters. Consider the following
implementation of Ackermann's function:

def ack(m: Int): Rep[Int=>Int] = lambda { n =>
  if (m == 0) n+1 else
  if (n == 0) ack(m-1)(1) else
  ack(m-1)(ack(m)(n-1))
}

Calling ack(m)(n) will produce a set of mutually recursive
functions, each specialized to a particular value of m
(example m=2):

def ack_2(n: Int) = if (n == 0) ack_1(1) else ack_1(ack_2(n-1))
def ack_1(n: Int) = if (n == 0) ack_0(1) else ack_0(ack_1(n-1))
def ack_0(n: Int) = n+1
acc_2(n)

In essence, this pattern implements what is known as ''polyvariant
specialization'' in the partial evaluation community. But unlike automatic
partial evaluation, which might or might not be able to discover the right
specialization, the use of staging provides a strong guarantee about the
structure of the generated code.

Other strong guarantees can be achieved by restricting the interface of
function definitions. Being of type Rep[A=>B], the result of lambda is a
first-class value in the generated code that can be stored or passed around in
arbitrary ways. However we might want to avoid higher-order control flow in
generated code for efficiency reasons, or to simplify subsequent analysis
passes. In this case, we can define a new function constructor fundef as
follows:

def fundef[A,B](f: Rep[A] => Rep[B]): Rep[A] => Rep[B] =
  (x: Rep[A]) => lambda(f).apply(x)

Using fundef instead of lambda produces a restricted function that can
only be applied but not passed around in the generated code (type
Rep[A]=>Rep[B]). At the same time, a result of fundef is still a first
class value in the code generator. If we do not expose lambda and apply
at all to client code, we obtain a guarantee that each function call site
unambiguously identifies the function definition being called and no closure
objects will need to be created at runtime.

(Chapter 2) Data Abstraction

\label{chap:455data-abstraction}

High level data structures are a cornerstone of modern programming and at the
same time stand in the way of compiler optimizations.

As a running example we consider implementing a complex number datatype in a
DSL. The usual approach of languages executed on the JVM is to represent every
non-primitive value as a heap-allocated reference object. The space overhead,
reference indirection as well as the allocation and garbage collection cost
are a burden for performance critical code. Thus, we want to be sure that our
complex numbers can be manipulated as efficiently as two individual doubles.
In the following, we explore different ways to achieve that.

Static Data Structures

\label{subsubsec:complexA}

The simplest approach is to implement complex numbers as a fully static data
type, that only exists at staging time. Only the actual Doubles that
constitute the real and imaginary components of a complex number are dynamic
values:

case class Complex(re: Rep[Double], im: Rep[Double])
def infix_+(a: Complex, b: Complex) =
  Complex(a.re + b.re, a.im + b.im)
def infix_*(a: Complex, b: Complex) =
  Complex(a.re*b.re - a.im*b.im, a.re*b.im + a.im*b.re)

Given two complex numbers c1,c2, an expression like

c1 + 5 * c2  // assume implicit conversion from Int to Complex

will generate code that is free of Complex objects and only contains
arithmetic on Doubles.

However the ways we can use Complex objects are rather limited. Since they
only exists at staging time we cannot, for example, express dependencies on
dynamic conditions:

val test: Rep[Boolean] = ...
val c3 = if (test) c1 else c2 // type error: c1/c2 not a Rep type

It is worthwhile to point out that nonetheless, purely static data structures
have important use cases. To give an example, the fast fourier transform (FFT)
(*) is branch-free for a fixed input size. The
definition of complex numbers given above can be used to implement a staged
FFT that computes the well-known butterfly shaped computation circuits from
the textbook Cooley-Tukey recurrences (see here).

To make complex numbers work across conditionals, we have have to split the
control flow explicitly (another option would be using mutable variables).
There are multiple ways to achieve this splitting. We can either duplicate
the test and create a single result object:

val test: Rep[Boolean] = ...
val c3 = Complex(if (test) c1.re else c2.re, if (test) c1.im else c2.im)

Alternatively we can use a single test and duplicate the rest of the program:

val test: Rep[Boolean] = ...
if (test) {
  val c3 = c1
  // rest of program
} else {
  val c3 = c2
  // rest of program
}

While it is awkward to apply this transformation manually, we can use
continuations (much like for the bam operator here) to
generate two specialized computation paths:

def split[A](c: Rep[Boolean]) = shift { k: (Boolean => A) =>
  if (c) k(true) else k(false) // "The Trick"
}
val test: Rep[Boolean] = ...
val c3 = if (split(test)) c1 else c2

The generated code will be identical to the manually duplicated, specialized
version above.

Dynamic Data Structures with Partial Evaluation

\label{sec:455struct}

We observe that we can increase the amount of statically possible computation
(in a sense, applying binding-time improvements) for dynamic values with
domain-specific rewritings:

val s: Int = ...            // static
val d: Rep[Int] = ...       // dynamic

val x1 = s + s + d          // left assoc: s + s evaluated statically,
                            // one dynamic addition
val x2 = s + (d + s)        // naively: two dynamic additions,
                            // using pattern rewrite: only one

In computing x1, there is only one dynamic addition because the left
associativity of the plus operator implies that the two static values will be
added together at staging time. Computing x2 will incur two dynamic
additions, because both additions have at least one dynamic summand. However
we can add rewriting rules that first replace d+c (c denoting a dynamic
value that is know to be a static constant, i.e. an IR node of type Const)
with c+d and then c+(c+d) with (c+c)+d. The computation c+c can again
be performed statically.

We have seen here how we can define a generic framework for
data structures that follows a similar spirit. The interface for field
accesses field pattern matches on its argument and, if that is a Struct
creation, looks up the desired value from the embedded hash map.

An implementation of complex numbers in terms of Struct could look like
this:

trait ComplexOps extends ComplexBase with ArithOps {
  def infix_+(x: Rep[Complex], y: Rep[Complex]): Rep[Complex] =
    Complex(x.re + y.re, x.im + y.im)
  def infix_*(x: Rep[Complex], y: Rep[Complex]): Rep[Complex] =
    Complex(a.re*b.re - ...)
}
trait ComplexBase extends Base {
  class Complex
  def Complex(re: Rep[Double], im: Rep[Double]): Rep[Complex]
  def infix_re(c: Rep[Complex]): Rep[Double]
  def infix_im(c: Rep[Complex]): Rep[Double]
}
trait ComplexStructExp extends ComplexBase with StructExp {
  def Complex(re: Rep[Double], im: Rep[Double]) =
    struct[Complex](classTag("Complex"), Map("re"->re, "im"->im))
  def infix_re(c: Rep[Complex]): Rep[Double] = field[Double](c, "re")
  def infix_im(c: Rep[Complex]): Rep[Double] = field[Double](c, "im")
}

Note how complex arithmetic is defined completely within the interface trait
ComplexOps, which inherits double arithmetic from ArithOps. Access to the
components via re and im is implemented using struct.

Using virtualized record types (see here) that map to
struct internally, we can express the type definition more conveniently as

class Complex extends Struct { val re: Double, val im: Double }

and remove the need for methods infix_re and infix_im. The Scala-
Virtualized compiler will automatically provide staged field accesses like
c.re and c.im. It is still useful to add a simplified constructor method

def Complex(r: Rep[Double], i: Rep[Double]) =
  new Complex { val re = re; val im = im }

to enable using Complex(re,im) instead of the new Complex syntax.

In contrast to the completely static implementation of complex numbers
presented here, complex numbers are a fully dynamic
DSL type now. The previous restrictions are gone and we can write the
following code without compiler error:

val c3 = if (test) c1 else c2
println(c3.re)

The conditional ifThenElse is overridden to split itself for each field of a
struct. Internally the above will be represented as:

val c3re = if (test) c1re else c2re
val c3im = if (test) c1im else c2im   // removed by dce
val c3 = Complex(c3re, c3im)          // removed by dce
println(c3re)

The computation of the imaginary component as well as the struct creation for
the result of the conditional are never used and thus they will be removed
by dead code elimination.

Generic Programming with Type Classes

The type class pattern (*), which decouples data
objects from generic dispatch, fits naturally with a staged programming model
as type class instances can be implemented as static objects.

Extending the Vector example, we might want to be able to add vectors that
contain numeric values. We can use a lifted variant of the Numeric type
class from the Scala library

class Numeric[T] {
  def num_plus(a: Rep[T], b: Rep[T]): Rep[T]
}

and provide a type class instance for complex numbers:

implicit def complexIsNumeric = new Numeric[Complex] {
  def num_plus(a: Rep[Complex], b: Rep[Complex]) = a + b
}

Generic addition on Vectors is straightforward, assuming we have a method
zipWith already defined:

def infix_+[T:Numeric](a: Rep[Vector[T]], b: Rep[Vector[T]]) = {
  val m = implicitly[Numeric[T]] // access type class instance
  a.zipWith(b)((u,v) => m.num_plus(u,v))
}

With that definition at hand we can add a type class instance for numeric
vectors:

implicit def vecIsNumeric[T:Numeric] = new Numeric[Vector[T]] {
  def num_plus(a: Rep[Vector[T]], b: Rep[Vector[T]]) = a + b

which allows us to pass, say, a Rep[Vector[Complex]] to any function that
works over generic types T:Numeric including vector addition itself. The
same holds for nested vectors of type Rep[Vector[Vector[Complex]]]. Usually,
type classes are implemented by passing an implicit dictionary, the type class
instance, to generic functions. Here, type classes are a purely stage-time
concept. All generic code is specialized to the concrete types and no type
class instances exist (and hence no virtual dispatch occurs) when the DSL
program is run.

An interesting extension of the type class model is the notion of polytypic
staging, studied on top of LMS (*).

Unions and Inheritance

\label{sec:455inherit}

The struct abstraction from here can be extended to sum
types and inheritance using a tagged union approach
(*). We add a clzz field to
each struct that refers to an expression that defines the object's class.
Being a regular struct field, it is subject to all common optimizations. We
extend the complex number example with two subclasses:

abstract class Complex
class Cartesian extends Complex with Struct { val re: Double, val im: Double }
class Polar extends Complex with Struct { val r: Double, val phi: Double }

Splitting transforms work as before: e.g. conditional expressions are
forwarded to the fields of the struct. But now the result struct will contain
the union of the fields found in the two branches, inserting null values as
appropriate. A conditional is created for the clzz field only if the exact
class is not known at staging time. As an example, the expression

val a = Cartesian(1.0, 2.0); val b = Polar(3.0, 4.0)
if (x > 0) a else b

produces this generated code:

val (re, im, r, phi, clzz) =
  if (x > 0) (1.0, 2.0, null, null, classOf[Cartesian])
  else (null, null, 3.0, 4.0, classOf[Polar])
struct("re"->re, "im"->im, "r"->r, "phi"->phi, "clzz"->clzz)

The clzz fields allows virtual dispatch via type tests and type casting,
e.g. to convert any complex number to its cartesian representation:

def infix_toCartesian(c: Rep[Complex]): Rep[Cartesian] =
  if (c.isInstanceOf[Cartesian]) c.asInstanceOf[Cartesian]
  else { val p = c.asInstanceOf[Polar]
    Cartesian(p.r * cos(p.phi), p.r * sin(p.phi)) }

Appropriate rewrites ensure that if the argument is known to be a Cartesian,
the conversion is a no-op. The type test that inspects the clzz field is only
generated if the type cannot be determined statically. If the clzz field is
never used it will be removed by DCE.

Struct of Array and Other Data Format Conversions

\label{sec:455structUse}

There is another particularly interesting use case for the splitting of data
structures: Let us assume we want to create a vector of complex numbers. Just
as with the if-then-else example above, we can override the vector
constructors such that a Vector[Cartesian] is represented as a struct that
contains two separate arrays, one for the real and one for the imaginary
components. A more general Vector[Complex] that contains both polar and
cartesian values will be represented as five arrays, one for each possible
data field plus the clzz tag for each value. In fact, we have expressed our
conceptual array of structs as a struct of arrays (AoS to SoA transform, see
here). This data layout is beneficial in many cases. Consider
for example calculating complex conjugates (i.e. swapping the sign of the
imaginary components) over a vector of complex numbers.

def conj(c: Rep[Complex]) = if (c.isCartesian) {
  val c2 = c.toCartesian; Cartesian(c2.re, -c2.im)
} else {
  val c2 = c.toPolar; Polar(c2.r, -c2.phi)
}

To make the test case more interesting we perform the calculation only in one
branch of a conditional.

val vector1 = ... // only Cartesian values
if (test) {
  vector1.map(conj)
} else {
  vector1
}

All the real parts remain unchanged so the array holding them need not be
touched at all. Only the imaginary parts have to be transformed, cutting the
total required memory bandwidth in half. Uniform array operations like this
are also a much better fit for SIMD execution. The generated intermediate code
is:

val vector1re = ...
val vector1im = ...
val vector1clzz = ... // array holding classOf[Cartesian] values
val vector2im = if (test) {
  Array.fill(vector1size) { i => -vector1im(i) }
} else {
  vector1im
}
struct(ArraySoaTag(Complex,vector1size),
  Map("re"->vector1re, "im"->vector2im, "clzz"->vector1clzz))

Note how the conditionals for the re and clzz fields have been eliminated
since the fields do not change (the initial array contained cartesian numbers
only). If the struct expression will not be referenced in the final code, dead
code elimination removes the clzz array.

In the presence of conditionals that produce array elements of different
types, it can be beneficial to use a sparse representation for arrays that
make up the result struct-of-array, similar to the approach in Data Parallel
Haskell~(*). Of course no choice of layout is
optimal in all cases, so the usual sparse versus dense tradeoffs regarding
memory use and access time apply here as well.

We conclude this section by taking note that we can actually guarantee that no
dynamic Complex or Struct object is ever created just by not implementing
code generation logic for Struct and Field IR nodes and signaling an error
instead. This is a good example of a performance-oriented DSL compiler
rejecting a program as ill-formed because it cannot be executed in the
desired, efficient way.

Loop Fusion and Deforestation

\label{subsec:fusion}

Building complex bulk operations out of simple ones often leads to inefficient
generated code. For example consider the simple vector code

val a: Rep[Double] = ...
val x: Rep[Vector[Double]] = ...
val y: Rep[Vector[Double]] = ...

a*x+y

Assuming we have provided the straightforward loop-based implementations of
scalar-times-vector and vector-plus-vector, the resulting code for this
program will perform two loops and allocate a temporary vector to store a*x.
A more efficient implementation will only use a single loop (and no temporary
vector allocations) to compute a*x(i)+y(i).

In addition to operations that are directly dependent as illustrated above,
side-by-side operations also appear frequently. As an example, consider a DSL
that provides mean and variance methods.

def mean(x: Rep[Vector[Double]]) =
    sum(x.length) { i => x(i) } / x.length
def variance(x: Rep[Vector[Double]]) =
    sum(x.length) { i => square(x(i)) } / x.length - square(mean(x))

val data = ...

val m = mean(data)
val v = variance(data)

The DSL developer wishes to provide these two functions separately, but many
applications will compute both the mean and variance of a dataset together.
In this case we again want to perform all the work with a single pass over
data. In both of the above example situations, fusing the operations into a
single loop greatly improves cache behavior and reduces the total number of
loads and stores required. It also creates coarser-grained functions out of
fine-grained ones, which will likely improve parallel scalability.

Our framework handles all situations like these two examples uniformly and for
all DSLs. Any non-effectful loop IR node is eligible for fusing with other
loops. In order to handle all the interesting loop fusion cases, the fusing
algorithm uses simple and general criteria: It fuses all pairs of loops where
either both loops have the exact same size or one loop iterates over a data
structure the other loop creates, as long as fusing will not create any cyclic
dependencies. The exact rules are presented here. When
it finds two eligible loops the algorithm creates a new loop with a body
composed of both of the original bodies. Merging loop bodies includes array
contraction, i.e. the fusing transform modifies dependencies so that all
results produced within a loop iteration are consumed directly rather than by
reading an output data structure. %In the case of groupBy operations,
contraction also works across hash tables. Whenever this renders an output
data structure unnecessary (it does not escape the fused loop) it is removed
automatically by dead code elimination. All DeliteOpLoops are parallel
loops, which allows the fused loops to be parallelized in the same manner as
the original loops.

The general heuristic is to apply fusion greedily wherever possible. For
dominantly imperative code more refined heuristics might be needed
(*). However, our loop abstractions are
dominantly functional and many loops create new data structures. Removing
intermediate data buffers, which are potentially large and many of which are
used only once is clearly a win, so fusing seems to be beneficial in almost
all cases.

Our fusion mechanism is similar but not identical to deforestation
(*) and related approaches
(*). Many of these approaches only consider
expressions that are directly dependent (vertical fusion), whereas we are able
to handle both dependent and side-by-side expressions (horizontal fusion) with
one general mechanism. This is critical for situations such as the mean and
variance example, where the only other efficient alternative would be to
explicitly create a composite function that returns both results
simultaneously. This solution additionally requires the application writer to
always remember to use the composite version when appropriate. It is
generally difficult to predict all likely operation compositions as well as
onerous to provide efficient, specialized implementations of them. Therefore
fusion is key for efficient compositionality in both applications and DSL
libraries.

Extending the Framework

A framework for building DSLs must be easily extensible in order for the DSL
developer to exploit domain knowledge starting from a general-purpose IR
design. Consider a simple DSL for linear algebra with a Vector type. Now we
want to add norm and dist functions to the DSL. The first possible
implementation is to simply implement the functions as library methods.

def norm[T:Numeric](v: Rep[Vector[T]]) = {
  sqrt(v.map(j => j*j).sum)
}
def dist[T:Numeric](v1: Rep[Vector[T]], v2: Rep[Vector[T]]) = {
  norm(v1 - v2)
}

Whenever the dist method is called the implementation will be added to the
application IR in terms of vector subtraction, vector map, vector sum, etc.
(assuming each of these methods is built-in to the language rather than also
being provided as a library method). This version is very straightforward to
write but the knowledge that the application wishes to find the distance
between two vectors is lost.

By defining norm explicitly in the IR implementation trait (where Rep[T] =
Exp[T]) we gain ability to perform pattern matching on the IR nodes that
compose the arguments.

override def norm[T:Numeric](v: Exp[Vector[T]]) = v match {
  case Def(ScalarTimesVector(s,u)) => s * norm(u)
  case Def(ZeroVector(n)) => 0
  case _ => super.norm(v)
}

In this example there are now three possible implementations of norm. The
first case factors scalar-vector multiplications out of norm operations,
the second short circuits the norm of a ZeroVector to be simply the constant
0, and the third falls back on the default implementation defined above.
With this method we can have a different implementation of norm for each
occurrence in the application.

An even more powerful alternative is to implement norm and dist as custom
IR nodes. This enables the DSL to include these nodes when optimizing the
application via pattern matching and IR rewrites as illustrated above. For
example, we can add a rewrite rule for calculating the norm of a unit vector:
if $v_1 = \frac{v}{\|v\|}$ then $\left\|v_1\right\|=1$. In order to implement
this optimization we need to add cases both for the new norm operation as
well as to the existing scalar-times-vector operation to detect the first half
of the pattern.

case class VectorNorm[T](v: Exp[Vector[T]]) extends Def[T]
case class UnitVector[T](v: Exp[Vector[T]]) extends Def[Vector[T]]

override def scalar_times_vector[T:Numeric](s: Exp[T], v: Exp[Vector[T]]) =
(s,v) match {
  case (Def(Divide(Const(1), Def(VectorNorm(v1)))), v2)
    if v1 == v2 => UnitVector(v)
  case _ => super.scalar_times_vector(s,v)
}
override def norm[T:Numeric](v: Exp[Vector[T]]) = v match {
  case Def(UnitVector(v1)) => 1
  case _ => super.norm(v)
}

In this example the scalar-times-vector optimization requires vector-norm to
exist as an IR node to detect\footnote{The == operator tests structural
equality of IR nodes. The test is cheap because we only need to look at
symbols, one level deep. Value numbering/CSE ensures that intensionally equal
IR nodes get assigned the same symbol.} and short-circuit the operation to
simply create and mark unit vectors. The vector-norm optimization then
detects unit vectors and short circuits the norm operation to simply add the
constant 1 to the IR. In every other case it falls back on the default
implementation, which is to create a new VectorNorm IR node.

The default constructor for VectorNorm uses delayed rewriting (see
here) to specify the desired lowering of the IR node:

def norm[T:Numeric](v: Rep[Vector[T]]) = VectorNorm(v) atPhase(lowering) {
  sqrt(v.map(j => j*j).sum)
}

The right hand side of this translation is exactly the initial norm
implementation we started with.

Lowering Transforms

In our running example, we would like to treat linear algebra operations
symbolically first, with individual IR nodes like VectorZeros and
VectorPlus. In Figure~\ref{fig:vectorImpl}, the smart constructor vec_plus
implements a rewrite that simplifies v+zero to v. CSE, DCE, etc. will all
be performed on these high level nodes.

After all those optimizations are applied, we want to transform our
operations to the low-level array implementation from
Figure~\ref{fig:stagedArrays} in a separate lowering pass. Trait
LowerVectors in Figure~\ref{fig:vectorImpl} implements this transformation
by delegating back to user-space code, namely method vec_plus_ll in trait
VectorsLowLevel.

// Vector interface
trait Vectors extends Base {
  // elided implicit enrichment boilerplate:
  //   Vector.zeros(n) = vec_zeros(n), v1 + v2 = vec_plus(a,b)
  def vec_zeros[T:Numeric](n: Rep[Int]): Rep[Vector[T]]
  def vec_plus[T:Numeric](a: Rep[Vector[T]], b: Rep[Vector[T]]): Rep[Vector[T]]
}
// low level translation target
trait VectorsLowLevel extends Vectors {
  def vec_zeros_ll[T:Numeric](n: Rep[Int]): Rep[Vector[T]] =
    Vector.fromArray(Array.fill(n) { i => zero[T] })
  def vec_plus_ll[T:Numeric](a: Rep[Vector[T]], b: Rep[Vector[T]]) =
    Vector.fromArray(a.data.zipWith(b.data)(_ + _))
}
// IR level implementation
trait VectorsExp extends BaseExp with Vectors {
  // IR node definitions and constructors
  case class VectorZeros(n: Exp[Int]) extends Def[Vector[T]]
  case class VectorPlus(a: Exp[Vector[T]],b: Exp[Vector[T]]) extends Def[Vector[T]]
  def vec_zeros[T:Numeric](n: Rep[Int]): Rep[Vector[T]] = VectorZeros(n)
  def vec_plus[T:Numeric](a: Rep[Vector[T]], b: Rep[Vector[T]]) = VectorPlus(a,b)
  // mirror: transformation default case
  def mirror[T](d: Def[T])(t: Transformer) = d match {
    case VectorZeros(n) => Vector.zeros(t.transformExp(n))
    case VectorPlus(a,b) => t.transformExp(a) + t.transformExp(b)
    case _ => super.mirror(d)
  }
}
// optimizing rewrites (can be specified separately)
trait VectorsExpOpt extends VectorsExp {
  override def vec_plus[T:Numeric](a:Rep[Vector[T]],b:Rep[Vector[T]])=(a,b)match{
    case (a, Def(VectorZeros(n))) => a
    case (Def(VectorZeros(n)), b) => b
    case _ => super.vec_plus(a,b)
  }
}
// transformer: IR -> low level impl
trait LowerVectors extends ForwardTransformer {
  val IR: VectorsExp with VectorsLowLevel; import IR._
  def transformDef[T](d: Def[T]): Exp[T] = d match {
    case VectorZeros(n) => vec_zeros_ll(transformExp(n))
    case VectorPlus(a,b) => vec_plus_ll(transformExp(a), transformExp(b))
    case _ => super.transformDef(d)
  }
}

The result of the transformation is a staged program fragment just like in
Figure~\ref{fig:stagedArrays}.

This setup greatly simplifies the definition of the lowering transform, which
would otherwise need to assemble the fill or zipWith code using low level
IR manipulations. Instead we benefit directly from the staged zipWith
definition from Figure~\ref{fig:stagedArrays}. Also, further rewrites will
take place automatically. Essentially all simplifications are performed
eagerly, after each transform phase. Thus we guarantee that CSE, DCE, etc.
have been applied on high-level operations before they are translated into
lower-level equivalents, on which optimizations would be much harder to apply.
To give a quick example, the initial program

val v1 = ...
val v2 = Vector.zeros(n)
val v3 = v1 + v2
v1 + v3

will become

val v1 = ...
Vector.fromArray(v1.data.zipWith(v1.data)(_ + _))

after lowering (modulo unfolding of staged zipWith).

(Chapter 3) Case Studies

\label{chap:460fusionUse}

This chapter presents case studies for Delite apps (using the OptiML and
OptiQL DSLs) as well as classical staging use cases (FFT specialization and
regular expression matching). The Delite apps are real-world examples for the
loop fusion algorithm from here and the struct
conversion from here.

OptiML Stream Example

OptiML is an embedded DSL for machine learning (ML) developed on top of LMS
and Delite. It provides a MATLAB-like programming model with ML- specific
abstractions. OptiML is a prototypical example of how the techniques described
in this thesis can be used to construct productive, high performance DSLs
targeted at heterogeneous parallel machines.

\label{sec:optiml}

Downsampling in Bioinformatics

In this example, we will demonstrate how the optimization and code generation
techniques discussed in previous sections come together to produce efficient
code in real applications. SPADE is a bioinformatics application that builds
tree representations of large, high-dimensional flow cytometry datasets.
Consider the following small but compute-intensive snippet from SPADE (C++):

std::fill(densities, densities+obs, 0);
#pragma omp parallel for shared(densities)
for (size_t i=0; i<obs; i++) {
  if (densities[i] > 0)
    continue;
  std::vector<size_t> apprxs;  // Keep track on observations we can approximate
  Data_t *point = &data[i*dim];
  Count_t c = 0;

  for (size_t j=0; j<obs; j++) {
    Dist_t d = distance(point, &data[j*dim], dim);
    if (d < apprx_width) {
      apprxs.push_back(j);
      c++;
    } else if (d < kernel_width) c++;
  }
  // Potential race condition on other density entries, use atomic
  // update to be safe
  for (size_t j=0; j<apprxs.size(); j++)
    __sync_bool_compare_and_swap(densities+apprxs[j],0,c);
  densities[i] = c;
}

This snippet represents a downsampling step that computes a set of values,
densities, that represents the number of samples within a bounded distance
(kernel_width) from the current sample. Furthermore, any distances within
apprx_width of the current sample are considered to be equivalent, and the
density for the approximate group is updated as a whole. Finally, the loop is
run in parallel using OpenMP. This snippet represents hand-optimized, high
performance, low-level code. It took a systems and C++ expert to port the
original MATLAB code (written by a bioinformatics researcher) to this
particular implementation. In contrast, consider the equivalent snippet of
code, but written in OptiML:

val distances = Stream[Double](data.numRows, data.numRows) {
  (i,j) => dist(data(i),data(j))
}
val densities = Vector[Int](data.numRows, true)

for (row <- distances.rows) {
  if(densities(row.index) == 0) {
    val neighbors = row find { _ < apprxWidth }
    densities(neighbors) = row count { _ < kernelWidth }
  }
}
densities

This snippet is expressive and easy to write. It is not obviously high
performance. However, because we have abstracted away implementation detail,
and built-in high-level semantic knowledge into the OptiML compiler, we can
generate code that is essentially the same as the hand-tuned C++ snippet.
Let's consider the OptiML code step by step.

Line 1 instantiates a Stream, which is an OptiML data structure that is
buffered; it holds only a chunk of the backing data in memory at a time, and
evaluates operations one chunk at a time. Stream only supports iterator-style
access and bulk operations. These semantics are necessary to be able to
express the original problem in a more natural way without adding overwhelming
performance overhead. The foreach implementation for stream.rows is:

def stream_foreachrow[A:Typ](x: Exp[Stream[A]],
              block: Exp[StreamRow[A]] => Exp[Unit]) = {
  var i = 0
  while (i < numChunks) {
    val rowsToProcess = stream_rowsin(x, i)
    val in = (0::rowsToProcess)
    val v = fresh[Int]

    // fuse parallel initialization and foreach function
    reflectEffect(StreamInitAndForeachRow(in, v, x, i, block))   // parallel
    i += 1
  }
}

This method constructs the IR nodes for iterating over all of the chunks in
the Stream, initalizing each row, and evaluating the user-supplied foreach
anonymous function. We first obtain the number of rows in the current chunk by
calling a method on the Stream instance (stream_rowsin). We then call the
StreamInitAndForeachRow op, which is a DeliteOpForeach, over all of the rows
in the chunk. OptiML unfolds the foreach function and the stream
initialization function while building the IR, inside StreamInitAndForeachRow.
The stream initialization function ((i,j) => dist(data(i),data(j))
constructs a StreamRow, which is the input to the foreach function. The
representation of the foreach function consists of an IfThenElse operation,
where the then branch contains the VectorFind, VectorCount, and
VectorBulkUpdate operations from lines 6-7 of the OptiML SPADE snippet.
VectorFind and VectorCount both extend DeliteOpLoop. Since they are both
DeliteOpLoops over the same range with no cyclic dependencies, they are fused
into a single DeliteOpLoop. This eliminates an entire pass (and the
corresponding additional memory accesses) over the row, which is a non-trivial
235,000 elements in one typical dataset.

Fusion helps to transform the generated code into the iterative structure of
the C++ code. One important difference remains: we only want to compute the
distance if it hasn't already been computed for a neighbor. In the streaming
version, this corresponds to only evaluating a row of the Stream if the user-
supplied if-condition is true. In other words, we need to optimize the
initialization function together with the anonymous function supplied to the
foreach. LMS does this naturally since the foreach implementation and the user
code written in the DSL are all uniformly represented with the same IR. When
the foreach block is scheduled, the stream initialization function is pushed
inside the user conditional because the StreamRow result is not required
anywhere else. Furthermore, once the initialization function is pushed inside
the conditional, it is then fused with the existing DeliteOpLoop, eliminating
another pass. We can go even further and remove all dependencies on the
StreamRow instance by bypassing field accesses on the row, using the pattern
matching mechanism described earlier:

trait StreamOpsExpOpt extends StreamOpsExp {
  this: OptiMLExp with StreamImplOps =>

  override def stream_numrows[A:Typ](x: Exp[Stream[A]]) = x match {
    case Def(Reflect(StreamObjectNew(numRows, numCols,
                      chunkSize, func, isPure),_,_)) => numRows
    case _ => super.stream_numrows(x)
  }
  // similar overrides for other stream fields
}
trait VectorOpsExpOpt extends VectorOpsExp {
  this: OptiMLExp with VectorImplOps =>
  // accessing an element of a StreamRow directly accesses the underlying Stream
  override def vector_apply[A:Typ](x: Exp[Vector[A]], n: Exp[Int]) = x match {
    case Def(StreamChunkRow(x, i, offset)) => stream_chunk_elem(x,i,n)
    case _ => super.vector_apply(x,n)
  }
}

Now as the row is computed, the results of VectorFind and VectorCount are also
computed in a pipelined fashion. All accesses to the StreamRow are short-
circuited to their underlying data structure (the Stream), and no StreamRow
object is ever allocated in the generated code. The following listing shows
the final code generated by OptiML for the ``then'' branch (comments and
indentation added for clarity):

// ... initialization code omitted ...
// -- FOR EACH ELEMENT IN ROW --
while (x155 < x61) {
  val x168 = x155 * x64
  var x185: Double = 0
  var x180 = 0

  // -- INIT STREAM VALUE (dist(i,j))
  while (x180 < x64) {
    val x248 = x164 + x180
    val x249 = x55(x248)
    val x251 = x168 + x180
    val x252 = x55(x251)
    val x254 = x249 - x252
    val x255 = java.lang.Math.abs(x254)
    val x184 = x185 + x255
    x185 = x184
    x180 += 1
  }
  val x186 = x185
  val x245 = x186 < 6.689027961000001
  val x246 = x186 < 22.296759870000002

  // -- VECTOR FIND --
  if (x245) x201.insert(x201.length, x155)

  // -- VECTOR COUNT --
  if (x246) {
    val x207 = x208 + 1
    x208 = x207
  }
  x155 += 1
}

// -- VECTOR BULK UPDATE --
var forIdx = 0
while (forIdx < x201.size) {
  val x210 = x201(forIdx)
  val x211 = x133(x210) = x208
  x211
  forIdx += 1
}

This code, though somewhat obscured by the compiler generated names, closely
resembles the hand-written C++ snippet shown earlier. It was generated from a
simple, 9 line description of the algorithm written in OptiML, making heavy
use of the building blocks we described in previous sections to produce the
final result.

OptiQL Struct Of Arrays Example

\label{sec:460optiqlSoa}

OptiQL is a DSL for data querying of in-memory collections, inspired by
LINQ~(*). We consider querying a data set with roughly 10
columns, similar to the table lineItems from the TPCH benchmark. The example
is slightly trimmed down from TPCH Query 1:

val res = lineItems Where(_.l_shipdate <= Date("1998-12-01"))
GroupBy(l => l.l_returnflag) Select(g => new Result {
  val returnFlag = g.key
  val sumQty = g.Sum(_.l_quantity)
})

A straightforward implementation is rather slow. There are multiple traversals
that compute intermediate data structures. There is also a nested Sum
operation inside the Select that follows the groupBy.

We can translate this code to a single while loop that does not construct any
intermediate data structures and furthermore ignores all columns that are not
part of the result. First, the complete computation is split into separate
loops, one for each column. Unnecessary ones are removed. Then the remaining
component loops are reassembled via loop fusion. For the full TPCH Query 1,
these transformations provide a speed up of 5.5x single threaded and 8.7x with
8 threads over the baseline array-of-struct version (see
here).

We use two hash tables in slightly different ways: one to accumulate the keys
(so it is really a hash set) and the other one to accumulate partial sums.
Internally there is only one hash table that maps keys to positions. The
partial sums are just kept in an array that shares the same indices with the
key array.

Below is the annotated generated code:

val x11 = x10.column("l_returnflag")
val x20 = x10.column("l_shipdate")
val x52 = generated.scala.util.Date("1998-12-01")
val x16 = x10.columns("l_quantity")
val x283 = x264 + x265

// hash table constituents, grouped for both x304,x306
var x304x306_hash_to_pos: Array[Int] = alloc_htable // actual hash table
var x304x306_hash_keys: Array[Char] = alloc_buffer  // holds keys
var x304_hash_data: Array[Char] = alloc_buffer      // first column data
var x306_hash_data: Array[Double] = alloc_buffer    // second column data
val x306_zero = 0.0
var x33 = 0
while (x33 < x28) {  // begin fat loop x304,x306
  val x35 = x11(x33)
  val x44 = x20(x33)
  val x53 = x44 <= x52
  val x40 = x16(x33)

  // group conditionals
  if (x53) {
    val x35_hash_val = x35.hashCode
    val x304x306_hash_index_x35 = {
      // code to lookup x35_hash_val
      // in hash table x304x306_hash_to_pos
      // with key table x304x306_hash_keys
      // (growing hash table if necessary)
    }

    if (x304x306_hash_index_x35 >= x304x306_hash_keys.length) { // not found
      // grow x304x306_hash_keys and add key
      // grow x304_hash_data
      // grow x306_hash_data and set to x306_zero
    }
    x304_hash_data (x304x306_hash_index_x35) = x35

    val x264 = x306_hash_data (x304x306_hash_index_x35)
    val x265 = x40
    val x283 = x264 + x265
    x304_hash_data (x304x306_hash_index_x35) = x283
  }
} // end fat loop x304,x306
val x304 = x304_hash_data
val x305 = x304x306_hash_to_pos.size
val x306 = x306_hash_data

val x307 = Map("returnFlag"->x304,"sumQty"->x306) //Array Result
val x308 = Map("data"->x307,"size"->x305) //DataTable

Comments? Suggestions for improvement? View this file on GitHub.