Automata-Based Regex Matcher

Outline:

Specializing string matchers and parsers is a popular benchmark in the partial evaluation and supercompilation literature.

We consider “multi-threaded” regular expression matchers, that spawn a new conceptual thread to process alternatives in parallel. Of course these matchers do not actually spawn OS-level threads, but rather need to be advanced manually by client code. Thus, they are similar to coroutines.

package scala.lms.tutorial

Regexp Matchers as Nondeterministic Finite Automata (NFA)

class AutomataTest extends TutorialFunSuite {
  val under = "dfa_"

  test("findAAB") {
    val p = new AutomataDriver with NFAtoDFA {
      def snippet(x: Rep[Unit]) = {

Here is a simple example for the fixed regular expression .*AAB:

        def findAAB(): NIO = {
          guard(C('A')) {
            guard(C('A')) {
              guard(C('B'), true) {
                stop()
          }}} ++
          guard(W) { findAAB() } // in parallel ...
        }

        // NFA to DFA conversion via staging explained below.
        convertNFAtoDFA((findAAB(), false))
      }
    }

    // Some tests.
    assertResult(true){p.matches("AAB")}
    assertResult(false){p.matches("AAC")}
    assertResult(true){p.matches("AACAAB")}
    assertResult(true){p.matches("AACAABAAC")}

    // The generated code for the DFA is shown at the end.
    check("aab", p.code)
  }
}

We can easily add combinators on top of the core abstractions that take care of producing matchers from textual regular expressions. However, the point here is to demonstrate how the implementation works.

trait NFAOps extends scala.lms.util.ClosureCompare {

The given matcher uses an API that models nondeterministic finite automata (NFA):

An NFA state consists of a list of possible transitions. Each transition may be guarded by a set of characters and it may have a flag to be signaled if the transition is taken. It also knows how to compute the following state. We use Chars for simplicity, but of course we could use generic types as well. Note that the API does not mention where input is obtained from (files, streams, etc).

  type NIO = List[NTrans] // state: many possible transitions
  def guard(cond: CharSet, found: => Boolean = false)(e: => NIO): NIO = {
    List(NTrans(cond, () => found, () => e))
  }
  def stop(): NIO = Nil
  def trans(c: CharSet)(s: () => NIO): NIO = List(NTrans(c, () => false, s))
  def guards(conds: List[CharSet], found: Boolean = false)(e: => NIO): NIO = {
    conds.flatMap(guard(_, found)(e))
  }
  case class NTrans(c: CharSet, e: () => Boolean, s: () => NIO) extends Ordered[NTrans] {
    override def compare(o: NTrans) = {
      val i = this.c.compare(o.c)
      if (i != 0) i else {
        val i2 = this.e().compare(o.e())
        if (i2 != 0) i2 else {
          val tf = canonicalize(this.s())
          val of = canonicalize(o.s())
          if (tf == of) 0 else tf.compare(of)
	      }
      }
    }
  }

  sealed abstract class CharSet extends Ordered[CharSet] {
    override def compare(o: CharSet) = (this,o) match {
      case (W,W) => 0
      case (W,_) => 1
      case (_,W) => -1
      case (C(c1),C(c2)) => c1.compare(c2)
    }
  }
  case class C(c: Char) extends CharSet
  case object W extends CharSet
}

From NFA to DFA using Staging

We will translate NFAs to DFAs using staging. This is the staged DFA API, which is just a thin wrapper over an unstaged API with no Reps:

case class Automaton[@specialized(Char) I, @specialized(Boolean) O](
  out: O, next: I => Automaton[I,O])

trait DFAOps extends Dsl {
  implicit def dfaTyp: Typ[DfaState]
  type DfaState = Automaton[Char,Boolean]
  type DIO = Rep[DfaState]
  def dfa_trans(f: Rep[Char] => DIO): DIO = dfa_trans(false)(f)
  def dfa_trans(e: Boolean)(f: Rep[Char] => DIO): DIO
}

trait NFAtoDFA extends NFAOps with DFAOps {

Translating an NFA to a DFA is accomplished by creating a DFA state for each encountered NFA configuration (removing duplicate states via canonicalize):

  def convertNFAtoDFA(in: (NIO, Boolean)): DIO = {
    def iterate(flag: Boolean, state: NIO): DIO = {
      dfa_trans(flag){ c: Rep[Char] => exploreNFA(canonicalize(state), c) { iterate }
    }}
    iterate(in._2, in._1)
  }

  def canonicalize(state: NIO): NIO = {
    if (state.isEmpty) state else {
      val state_sorted = state.sorted
      state_sorted.head :: (for ((s,sn) <- (state_sorted zip state_sorted.tail)
        if s.compare(sn) != 0) yield sn)
    }
  }

The LMS framework memoizes functions which ensures termination if the NFA is indeed finite.

We use a separate function to explore the NFA space, advancing the automaton by a symbolic character cin to invoke its continuations k with a new automaton, i.e. the possible set of states after consuming cin. The given implementation assumes character sets contain either zero or one characters, the empty set denoting a wildcard match. More elaborate cases such as character ranges are easy to add. The algorithm tries to remove as many redundant checks and impossible branches as possible. This only works because the character guards are staging time values.

  def exploreNFA[A:Typ](xs: NIO, cin: Rep[Char])(
    k: (Boolean, NIO) => Rep[A]): Rep[A] = xs match {
    case Nil => k(false, Nil)
    case NTrans(W, e, s)::rest =>
      val (xs1, xs2) = xs.partition(_.c != W)
      exploreNFA(xs1,cin)((flag,acc) =>
        k(flag || xs2.exists(_.e()), acc ++ xs2.flatMap(_.s())))
    case NTrans(cset, e, s)::rest =>
      if (cset contains cin) {
        val xs1 = for (
          NTrans(rcset, re, rs) <- rest;
          kcset <- rcset knowing cset
        ) yield NTrans(kcset,re,rs)
        exploreNFA(xs1,cin)((flag,acc) => k(flag || e(), acc ++ s()))
      } else {
        val xs1 = for (
          NTrans(rcset, re, rs) <- rest;
          kcset <- rcset knowing_not cset
        ) yield NTrans(kcset,re,rs)
        exploreNFA(xs1, cin)(k)
      }
  }

  def infix_contains(s: CharSet, c: Rep[Char]): Rep[Boolean] = s match {
    case C(c1) => c == c1
    case W => unit(true)
  }
  def infix_knowing(s1: CharSet, s2: CharSet): Option[CharSet] = (s1,s2) match {
    case (W,_) => Some(W)
    case (C(c1),C(c2)) if c1 == c2 => Some(W)
    case _ => None
  }
  def infix_knowing_not(s1: CharSet, s2: CharSet): Option[CharSet] = (s1,s2) match {
    case (C(c1), C(c2)) if c1 == c2 => None
    case _ => Some(s1)
  }
}

The generated code is shown further down below. Each function corresponds to one DFA state. Note how negative information has been used to prune the transition space: Given input such as ...AAB the automaton jumps back to the initial state, i.e. it recognizes that the last character B cannot also be A and starts looking for two As after the B.

Generated State Machine Code

trait DFAOpsExp extends DslExp with DFAOps {
  implicit def dfaTyp: Typ[DfaState] = manifestTyp
  case class DFAState(e: Boolean, f: Rep[Char => DfaState]) extends Def[DfaState]
  def dfa_trans(e: Boolean)(f: Rep[Char] => DIO): DIO = DFAState(e, doLambda(f))
}

trait ScalaGenDFAOps extends DslGen {
  val IR: DFAOpsExp
  import IR._
  override def emitNode(sym: Sym[Any], rhs: Def[Any]) = rhs match {
    case dfa@DFAState(b,f) =>
      emitValDef(sym, "scala.lms.tutorial.Automaton(" + quote(b) + ", " + quote(f) + ")")
    case _ => super.emitNode(sym, rhs)
  }
}

To use the generated code, we use the matches small driver loop:

abstract class AutomataDriver extends DslDriver[Unit,Automaton[Char,Boolean]] with DFAOpsExp { q =>
  override val codegen = new ScalaGenDFAOps {
    val IR: q.type = q
  }
  def matches(s: String): Boolean = {
    var a: Automaton[Char,Boolean] = f(())
    var i: Int = 0
    while (!a.out && i < s.length) {
      a = a.next(s(i))
      i += 1
    }
    a.out
  }
}

If the matcher and input iteration logic is generated together, further translations can be applied to transform the mutually recursive lambdas into tight imperative state machines.

Here is the generated code for .*AAB, shown initially:

/*****************************************
Emitting Generated Code
*******************************************/
class Snippet extends ((Unit)=>(scala.lms.tutorial.Automaton[Char, Boolean])) {
  def apply(x19:Unit): scala.lms.tutorial.Automaton[Char, Boolean] = {
    var x13 = null.asInstanceOf[scala.lms.tutorial.Automaton[Char, Boolean]]
    var x1 = null.asInstanceOf[scala.Function1[Char, scala.lms.tutorial.Automaton[Char, Boolean]]]
    var x17 = null.asInstanceOf[scala.lms.tutorial.Automaton[Char, Boolean]]
    var x4 = null.asInstanceOf[scala.Function1[Char, scala.lms.tutorial.Automaton[Char, Boolean]]]
    var x10 = null.asInstanceOf[scala.lms.tutorial.Automaton[Char, Boolean]]
    var x7 = null.asInstanceOf[scala.Function1[Char, scala.lms.tutorial.Automaton[Char, Boolean]]]
    var x12 = null.asInstanceOf[scala.lms.tutorial.Automaton[Char, Boolean]]
    x1 = {x2: (Char) =>
      val x3 = x2 == 'A'
      val x18 = if (x3) {
        x17
      } else {
        x13
      }
      x18: scala.lms.tutorial.Automaton[Char, Boolean]
    }
    x12 = scala.lms.tutorial.Automaton(true, x1)
    x7 = {x8: (Char) =>
      val x9 = x8 == 'A'
      val x15 = if (x9) {
        x10
      } else {
        val x11 = x8 == 'B'
        val x14 = if (x11) {
          x12
        } else {
          x13
        }
        x14
      }
      x15: scala.lms.tutorial.Automaton[Char, Boolean]
    }
    x10 = scala.lms.tutorial.Automaton(false, x7)
    x4 = {x5: (Char) =>
      val x6 = x5 == 'A'
      val x16 = if (x6) {
        x10
      } else {
        x13
      }
      x16: scala.lms.tutorial.Automaton[Char, Boolean]
    }
    x17 = scala.lms.tutorial.Automaton(false, x4)
    x13 = scala.lms.tutorial.Automaton(false, x1)
    x13
  }
}
/*****************************************
End of Generated Code
*******************************************/

What's next?

Go back to the tutorial index or continue with the SQL Query Compiler.

Comments? Suggestions for improvement? View this file on GitHub.