/ Cats/Scala

What is a Semigroup?

We often need to aggregate data. Hadoop/Spark map-reduce is a classic example of this (the reduce part). There are other more trivial examples - e.g. summing the line items in a shopping basket to calculate the total basket price

TL;DR Semigroups allow us to parallelise operations on large and small data sets, then combine the results together. In essence a Semigroup encapsulates the reduce part of map reduce.

Semigroups are useful in themselves but they also allow us to customise other typeclasses and data types in the Cats' ecosystem (e.g. Validated) so it's worth learning about them, even if you don't see an immediate need

In functional programming terms a Semigroup is a concept which encapsulates this combining/aggregating process. The Semigroup typeclass defines a simple method:

def combine(x: A, y: A): A

and here is how we use it:

// The type class definition
import cats.kernel.Semigroup
// Bring an instance for Int into scope
import cats.instances.int._

val onePlusTwo = Semigroup[Int].combine(1, 2)

Given two values of type A, combine them to produce a single A. There is one additional constraint that's not represented in the signature though:

A Semigroup has an associative binary operation

What exactly does this mean? The binary part is easy to understand (combine takes two parameters). Associative simply means

combine(combine(a, b), c) == combine(a, combine(b, c))

We know Integer addition is associative i.e. (a + b) + c == a + (b + c) but integer subtraction is not: (a - b) - c != a - (b - c)

Why is the associative requirement so important?

At this stage you're probably wondering what all the fuss is about? The Scala collections library already includes reduce() which looks very similar but is more flexible: reduce(op: (A, A) => A)

Associativity allows us to partition the data any way we want and potentially parallelise the operations e.g. given a list of numbers: 1, 2, 3, 4, 5 we can do something like this:

// Could be run in different threads
val group1 = Semigroup[Int].combine(1, 2)
val group2 = Semigroup[Int].combine(3, 4)
val group3 = Semigroup[Int].combine(group1, group2)
val total  = Semigroup[Int].combine(group3, 5)

or

val group1 = Semigroup[Int].combine(2, 3)
val group2 = Semigroup[Int].combine(4, 5)
val group3 = Semigroup[Int].combine(group1, group2)
val total  = Semigroup[Int].combine(1, group3)

Writing a custom Semigroup

Cats includes Semigroup implementations for most of the common types in the Scala ecosystem (String, Int, List etc). Most of the time the behaviour is predictable and obvious. The combine() implementation for Int will add the two parameters, the String implemention will concatenate them. What if we want to multiply two numbers? We can easily write our own implementation:

import cats.kernel.Semigroup

implicit val multiplicationSemigroup = new Semigroup[Int] {
  override def combine(x: Int, y: Int): Int = x * y
} 

// uses our implicit Semigroup instance above
val four = Semigroup[Int].combine(2, 2)

Cats also includes a factory function to make the code a bit less verbose:

implicit val multiplicationSemigroup = Semigroup.instance[Int](_ * _)

Scala 2.12 also allows us to use Java 8's Single Abstract Method pattern so if you're using Scala 2.12 and Java 8 you could also do something like:

implicit val multiplicationSemigroup: Semigroup[Int] = (x: Int, y: Int) => x * y

// or more succintly

implicit val multiplicationSemigroup: Semigroup[Int] = (x, y) => x * y

// or even

implicit val multiplicationSemigroup: Semigroup[Int] = _ * _

Of course the same principle applies if we want to combine our own types however we have to remember that its (A, A) => A not (A, A) => B so we can't do something like

implicit val lineItemSemigroup = Semigroup.instance[LineItem] { (item1, item2) => 
  item1.price + item2.price // returning a Float not a LineItem
}

however we could do something like

implicit val orderSummarySemigroup = Semigroup.instance[OrderSummary] { (summary1, summary2) => 
  Summary(
    items = summary1.items ++ summary2.items, 
    total = summary1.total + summary2.total
  )
}

I mentioned that the Cats implementation for String concatenates the two parameters. Be aware that it doesn't include a space between them which isn't very useful:

@ import cats.instances.string._
import cats.instances.string._

@ Semigroup[String].combine("john", "doe") 
res7: String = "johndoe"

but we can bring in our own instance easily:

@ implicit val mySemigroup = Semigroup.instance[String](_ + " " + _) 
mySemigroup: Semigroup[String] = [email protected]

@ Semigroup[String].combine("john", "doe") 
res3: String = "john doe"

Working with collections

Given the associative constraint we can build more useful constructs from the simple combine(x, y) method. We can use recursion directly or make use of scala's fold() to operate on a collection of values:

import cats.kernel.Semigroup
import cats.instances.string._ // bring a Semigroup into scope for String

def combineStrings(collection: Seq[String], acc: String = ""): String = {
  collection.headOption match {
    case Some(head) =>
      val updatedAcc = Semigroup[String].combine(acc, head)
      combineColectionOfStrings(collection.tail, updatedAcc)
    case None => acc
  }
}

or 

def combineStrings(collection: Seq[String]): String = {
  collection.foldLeft("")(Semigroup[String].combine)
}

Notice how in both cases I need to pass a default or fallback value, in this case an empty String "". This limitation means that I can't easily write a generic method combineStuff[A](collection: Seq[A]) because the fallback value will depend on the type of A ("" for Strings, zero for Ints etc).

There is a solution to this problem though, it's called the Monoid ...