Do we need nested datatypes?
Tags: haskellDec 4 2020
By Stephanie Weirich
This blog post was generated from a literate Haskell file.
{-# LANGUAGE AllowAmbiguousTypes #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE DeriveTraversable #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE InstanceSigs #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE TypeApplications #-}
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE UndecidableInstances #-}
{-# OPTIONS_GHC -Wno-name-shadowing -Wno-unticked-promoted-constructors #-}
module Nested where
import Prelude hiding (head, tail, Double)
import Data.Kind (Type)
import Control.Monad (ap)
import Data.Type.Equality
import Data.Some
Sometimes we want non-regular datatypes
Typed functional programming languages excel at representing tree structured data. Most of the time, this means that we are working with regular recursive datatypes.
For example, we can represent a normal binary tree, with values only stored at the leaves, using the following definition.
data Tree (a :: Type)
= Leaf a
| Node (Two (Tree a))
deriving (Eq, Ord, Read, Show, Functor, Foldable, Traversable)
This type definition represents a binary tree where each node has exactly two children. To simplify later comparison, we record that this is a binary tree using the following simple datatype—a pair of two values of the same type1.
data Two a = Two a a
deriving (Eq, Ord, Read, Show, Functor, Foldable, Traversable)
The tree type above can be used to express all sorts of trees, of many different shapes. But what if we want our type to only include perfect trees — i.e. those where every path from the root to a leaf is the same length.
For example, consider the following trees that we can represent with this type.
-- a perfect tree
t1 :: Tree Int
= Leaf 1 t1
-- a perfect tree
t2 :: Tree Int
= Node (Two (Leaf 1) (Leaf 2)) t2
-- not a perfect tree
t3 :: Tree Int
= Node (Two (Node (Two (Leaf 1) (Leaf 2))) (Leaf 3)) t3
-- a perfect tree, but not the same as t3
t4 :: Tree Int
= Node (Two (Node (Two (Leaf 1) (Leaf 2))) (Node (Two (Leaf 3) (Leaf 4)))) t4
Perfect trees are symmetric and complete. They have exactly 2^n
leaf
values, where n
is the height of the tree.
Can we rule out imperfect trees using the type system?
Of course we can! Perfect trees are a classic example of a constraint that can be captured using nested datatypes.
To express this constraint using a nested datatype, we modify the tree
definition above to say that the Node
constructor carries a tree of
two values instead of carrying two values that are trees.
data NTree (a :: Type) =
NLeaf a
| NNode (NTree (Two a))
deriving (Eq, Ord, Read, Show, Functor, Foldable, Traversable)
With this change, this tree definition can only represent perfect trees.
-- a perfect tree
n1 :: NTree Int
= NLeaf 1
n1
-- a perfect tree
n2 :: NTree Int
= NNode (NLeaf (Two 1 2))
n2
-- not a perfect tree, doesn't type check
-- n3 :: NTree Int
-- n3 = NNode (NNode (NLeaf (Two (Two 1 2) 3)))
-- a perfect tree
n4 :: NTree Int
= NNode (NNode (NLeaf (Two (Two 1 2) (Two 3 4)))) n4
What is the general form of values of type NTree Int
? It is some
number of NNode
data constructors, followed by an NLeaf
data
constructor containing a value of type
Two (Two ... (Two Int))) (
In other words, this structure is constrained to store exactly 2^n
Int
values, in a perfectly symmetric tree shape.
In fact, we can decode “prefix” of these values i.e. the sequence of
NNode
and NLeaf
constructors as n
, the height of the perfect tree,
in unary notation. In other words, we can decode NLeaf
as 0,
NNode . NLeaf
as 1, and NNode . NNode . NLeaf
as 2, etc.
The key feature that defines a nested datatype is the use of
non-regular recursion. If you go back and look at the NTree
definition, the NNode
constructor has an argument of type
NTree (Two a)
. This is a recursive use of the NTree
type in its
definition, but the argument to this recursive call is not just the
parameter to the recursive type itself. Regular recursion requires this
argument to always be the parameter a
and non-regular recursion
happens when some recursive call uses something else (like Two a
).2
I often find nested datatypes confusing. How can a simple modification to the type make such a big difference in the structure? How should I express similar constraints? Once I have a nested data type definition, what can I do with it? And how does this approach compare to other mechanisms for constraining datatypes, such as GADTs?
The rest of this post is a comparison between approaches using the example of perfect trees. I’ve chosen perfect trees for this treatment due to their relative simplicity. It isn’t that difficult to say all there is to say about this sort of data structure.
More generally, nested datatypes can be used for many purposes beyond
perfect trees. For example, they feature prominently in practical
Haskell libraries such as Data.Sequence
(based on the FingerTree data
structure) and the bound
library (based on well-scoped lambda terms).
I conjecture that these and other applications can also be expressed
using GADTs instead of nested datatypes. I’ll return to that point at
the end of the post.
Working with nested datatypes: polymorphic recursion
At first, nested datatypes don’t seem all that different from regular
datatypes. For example, even though the NTree
type uses nested
recursion, the usual type classes (e.g. Eq
, Functor
, etc) are
derivable, as we saw above. In fact, if you were to implement these
instances by hand, the code you write would be identical to the
non-nested version. In other words, the derived instances for the
Functor
class look something like this for the two types.
instance Functor Tree where
fmap :: forall a b. (a -> b) -> (Tree a -> Tree b)
Leaf x) = Leaf (f x)
fmap f (Node z) = Node (fmap (fmap f) z) fmap f (
instance Functor NTree where
fmap :: forall a b. (a -> b) -> (NTree a -> NTree b)
NLeaf x) = NLeaf (f x)
fmap f (NNode z) = NNode (fmap (fmap f) z) fmap f (
But don’t be fooled! These two definitions are not the same. There is one important difference in these derived definitions: functions over nested data types require polymorphic recursion.
In a regular datatype, recursive calls to polymorphic functions use
exactly the same type parameter. To make this clear, I’ve redefined
fmap
for regular trees to mark the location of the recursive call and
used scoped type variables and type applications to annotate its
instantiation.
tmap :: forall a b. (a -> b) -> (Tree a -> Tree b)
Leaf x) = Leaf (f x)
tmap f (Node z) = Node (fmap (tmap @a @b f) z) tmap f (
Even without the type signature (and type applications), this code would still type check. Recursive functions over regular datatypes are well within the expressive power of HM type inference.
However, compare the above to the implementation of the mapping function
for the nested datatype. Note that in this case, the recursive call to
ntmap
uses not a
and b
, but Two a
and Two b
. This is
polymorphic recursion in action, directly corresponding to the fact that
the datatype definition uses Two a
instead of a
in its definition.
ntmap :: forall a b. (a -> b) -> (NTree a -> NTree b)
NLeaf x) = NLeaf (f x)
ntmap f (NNode z) = NNode (ntmap @(Two a) @(Two b) (fmap f) z) ntmap f (
In the absence of type annotations, like the definition of ntmap
above, Hindley-Milner type inference with polymorphic recursion is
undecidable34. As a consequence, if we remove the type
annotation, then we get an error message from GHC:
nested.lhs:(118,3)-(119,44): error: …
• Occurs check: cannot construct the infinite type: t ~ f t
Expected type: (f t -> f b) -> NTree (Two t) -> NTree (Two b)
Actual type: (t -> b) -> NTree t -> NTree b
• Relevant bindings include
tmap :: (f t -> f b) -> NTree (Two t) -> NTree (Two b)
(bound at /Users/sweirich/github/dth/nested-datatypes/nested.lhs:118:3)
|
Compilation failed.
So, when working with nested datatypes, we must always remember to annotate the type of the function—GHC cannot figure it out for us. In the presence of this type annotation, polymorphic recursion is not difficult and has been a part of Haskell for years. (The Haskell 98 report specifically states that type signatures can be used to support polymorphic recursion5.)
Representing perfect trees with GADTs
One thing that puzzles me about nested datatypes, like NTree
is that
the type parameter does double-duty. It both constrains the shape of the
tree and parameterizes the type of data stored in the tree.
A definition of perfect trees using GADTs separates these aspects of the definition. Let’s compare.
First, we’ll define some natural numbers so that we can count.
data Nat = S Nat | Z
Now, let’s index a tree by its height and require that both subtrees in a node have the same height. Here, we’re using datatype promotion with in a GADT so that we can refer to natural numbers in types.
data HTree (h :: Nat) (a :: Type) where
DLeaf :: a -> HTree Z a
DNode :: Two (HTree h a) -> HTree (S h) a
This data type definition is a GADT because the result types of the leaf and node data constructors differ in the height index6. So that we can express this difference in the result type, we use GADT syntax for the definition.
But, we haven’t yet implemented a type equivalent to NTree a
because
the height index h
“leaks” into the HTree
type. Therefore, to define
the equivalent type, we need to also use an existential type to hide
this index.
data DTree a = forall h. DTree (HTree h a)
Here are some example trees. In construction, they look a lot more like the regular tree type than the nested tree, but the type system still rules out non-perfect trees.
d1 :: DTree Int
= DTree $ DLeaf 1 d1
d2 :: DTree Int
= DTree $ DNode (Two (DLeaf 1) (DLeaf 2)) d2
-- not a perfect tree, doesn't type check
-- d3 = DTree $ DNode (Two (DNode (Two (DLeaf 1) (DLeaf 2))) (DLeaf 3))
d4 :: DTree Int
= DTree $ DNode (Two (DNode (Two (DLeaf 1) (DLeaf 2)))
d4 DNode (Two (DLeaf 3) (DLeaf 4)))) (
However, because HTree
is a GADT, we must use standalone deriving for
the usual instances above.
deriving instance Show a => Show (HTree h a)
deriving instance Eq a => Eq (HTree h a)
deriving instance Ord a => Ord (HTree h a)
deriving instance Functor (HTree h)
deriving instance Foldable (HTree h)
deriving instance Traversable (HTree h)
These derived instances look the same as the derived instances for
Tree
. And, like the nested datatype, they require polymorphic
recursion. For example, if we add scoped type variables to the
definition of fmap
for this type, we can see that the recursive call
uses the height k
instead of h
. (The type variable k
must be bound
in the patter for DNode
. When it is introduced we also know that it is
equal to S h
.) What this means is that we need to add annotations on
recursive functions that use the type HTree
in order to get them to
type check. (Again, we don’t need to bind the scoped type variables and
provide the explicit type applications, the top-level type annotation
provides enough information for GHC to figure it out.)
dtmap :: forall h a b. (a -> b) -> (HTree h a -> HTree h b)
DLeaf x) = DLeaf (f x)
dtmap f (DNode (z :: Two (HTree k a)))
dtmap f (= DNode (fmap (dtmap @k @a @b f) z)
But, there is a cost to the GADT-based approach. The derived
implementations of Eq
and Ord
don’t type check for DTree
!
deriving instance Show a => Show (DTree a)
-- deriving instance Eq a => Eq (DTree a) -- doesn't type check
-- deriving instance Ord a => Ord (DTree a) -- doesn't type check
deriving instance Functor DTree
deriving instance Foldable DTree
deriving instance Traversable DTree
We can see why by looking at the error message for this attempt:
-- treeEq :: Eq a => DTree a -> DTree a -> Bool
-- treeEq (DTree (t1 :: HTree h1 a)) (DTree (t2 :: HTree h2 a)) = t1 == t2
If we try to define an equality function this way, the two HTree
s have
potentially different height indices, so we cannot use the derived
equality function for HTree
s. The expression t1 == t2
doesn’t type
check because t1
and t2
have different heights.
There are two different ways that we can implement the equality
operation for DTree
. The first is to define a type class for
heterogeneous equality. This type class allows us to compare arguments
with different types. GHC doesn’t know how to automatically derive
instances of this class, so we must provide them directly. However, the
code that we write is the same as for the Eq
class.
-- Heterogeneous equality type class
class Heq a b where
heq :: a -> b -> Bool
instance Heq a b => Heq (Two a) (Two b) where
Two x y) (Two z w) = heq x z && heq y w
heq (instance Eq a => Heq (HTree n a) (HTree m a) where
DLeaf x) (DLeaf y) = x == y
heq (DNode p1) (DNode p2) = heq p1 p2
heq (= False heq _ _
instance Eq a => Eq (DTree a) where
DTree t1 == DTree t2 = t1 `heq` t2
The second solution is to compare the heights of t1
and t2
and then
only call the equality function if the two trees have the same height.
To implement this version we need three components:
- A run time version of the (so far) compile-time only height. This
GADT, called a singleton type, exactly reflects the structure of
the
Nat
type above.7
data SNat n where
SZ :: SNat Z
SS :: SNat n -> SNat (S n)
The important property of this data structure is that if a value x
has
type SNat h
then it is isomorphic to h
(i.e. it is the unary
representation of the same number.) It is common, when programming with
indexed types in Haskell, to need such a singleton type. Compared to
full-spectrum dependently-typed languages, like Agda or Idris, Haskell
does not allow the same parameter to be computationally relevant
(i.e. around at run time) and an index to a data structure. Thus, we
need to set up this somewhat awkward mirroring. (The singletons
library can help automate this.)
- A way to calculate the run time representation of the height of a height-indexed tree.
The type of this function asserts its correctness — the value that it returns must be the height of the tree.
heightHTree :: HTree h a -> SNat h
DLeaf _) = SZ
heightHTree (DNode (Two t1 _)) = SS (heightHTree t1) heightHTree (
- A way to compare two singleton values for equality.
For this step, we’ll create an instance of the class TestEquality
from
the module Data.Type.Equality
. This class captures the idea that
sometimes, when comparing indexed data structures, we can produce a
proof that indices are equal.
class TestEquality f where
testEquality :: f a -> f b -> Maybe (a :~: b)
Singleton types make ideal instances of this class. For the SNat
type,
if we have two copies of SZ
then we can prove that the two
(compile-time) nats are equal, using the trivial proof Refl
.
Otherwise, if they both start with SS
, we can compare their
predecessors. If those two happen to be equal, we can still use Refl
as our proof.
instance TestEquality SNat where
testEquality :: SNat n1 -> SNat n2 -> Maybe (n1 :~: n2)
SZ SZ = Just Refl
testEquality SS n) (SS m)
testEquality (| Just Refl <- testEquality n m
= Just Refl
= Nothing testEquality _ _
With these three components, we can define an equality function for
DTree
s that is based on our derived, homogeneously typed equality
function for HTree
s.
dtreeEq :: Eq a => DTree a -> DTree a -> Bool
DTree t1) (DTree t2) = case
dtreeEq (of
testEquality (heightHTree t1) (heightHTree t2) Just Refl -> t1 == t2 -- here we know the types are the same because the
-- two trees must have the same height
Nothing -> False
Programming with perfect trees
We’ve looked at the difference in terms of derivable type classes. But,
how do the types NTree
and DTree
compare with operations that we
write by hand? How difficult is it to actually program with these data
structures?
Here are several examples of functions that work with perfect trees. All of these functions are straightforward to define for regular trees. Sometimes they are a bit easier to express with the GADT version and sometimes they are a bit easier to express with the nested datatype version.
Tree inversion
OK, let’s mirror our trees. I don’t know why you would want to do this, but it seems important in technical coding interviews and is straightforward enough.
Here’s the basic building block of tree mirroring: swap the order of the two components.
swap :: Two a -> Two a
Two x y) = Two y x swap (
For regular trees, tree mirroring is straightforward. We recur over the
tree and apply the swap
function above.
invertTree :: Tree a -> Tree a
Leaf x) = Leaf x
invertTree (Node p) = Node (swap (fmap invertTree p)) invertTree (
For GADT-based trees, we rely on a local helper function that tells us that inverting the tree preserves its height. But the code for the local function is the same as the definition above.
invertDTree :: DTree a -> DTree a
DTree t) = DTree (invert t) where
invertDTree ( invert :: HTree n a -> HTree n a
DLeaf x) = DLeaf x
invert (DNode p) = DNode (swap (fmap invert p)) invert (
However, inverting nested trees is slightly trickier and not similar to
the regular tree version. The local function needs to create the
appropriate inversion function f
With every recursive call. Then at
the base case this function inverts the entire tree in one go.
invertNTree :: NTree a -> NTree a
= go id where
invertNTree go :: (a -> a) -> NTree a -> NTree a
NLeaf x) = NLeaf (f x)
go f (NNode p) = NNode (go (swap . fmap f) (invertNTree p)) go f (
Tree replication
Given some height n
, and some value x
, let’s generate a perfect tree
containing 2^n
copies of x
.
This function is straightforward enough to write with the regular tree
datatype, though you really want to be careful to maintain sharing in
the recursive calls (i.e. make sure that you use a local definition of
y
.)
replicateTree :: a -> Int -> Tree a
= go
replicateTree x where
0 = Leaf x
go | m < 0 = error "invalid argument to replicateTree"
go m = Node (Two y y) where
go m = go (m - 1) y
For Nested trees, as above, we need to add an “accumulator” to the local helper function and update this accumulator on on each recursive call. However, with this version we can’t help but create a tree with a lot of sharing.
replicateNTree :: a -> Int -> NTree a
= go
replicateNTree where
go :: forall a. a -> Int -> NTree a
0 = NLeaf a
go a | m < 0 = error "invalid argument to replicateNTree"
go a m = NNode (go (Two a a) (m - 1)) go a m
For GADT-based trees, we need to first interpret the height argument as
an SNat
, using the toSomeNat
function, and then use that run time
natural number to control the size of tree that we generate.
replicateDTree :: forall a. a -> Int -> DTree a
= case toSomeNat i of
replicateDTree x i Just (Some n) -> DTree (go n)
where
go :: SNat n -> HTree n a
SZ = DLeaf x
go SS m) = DNode (Two y y) where
go (= go m
y Nothing -> error "invalid argument to replicate DTree"
This translation function would be part of a library for working with
SNat
values. Because we don’t know statically what number it will
produce, we must hide that height parameter in the Some
datatype from
Data.Some
.
toSomeNat :: Integral n => n -> Maybe (Some SNat)
0 =
toSomeNat Some SZ)
return (= do
toSomeNat n Some sn <- toSomeNat (n-1)
Some (SS sn)) return (
Validating perfect trees
Can we write functions that convert a Tree
to be an NTree
or DTree
if it happens to be perfect? Or inject an NTree
or DTree
into a
regular tree?
Both validation functions can be expressed with the same general
structure: recur over the tree datatype, and use a “smart constructor”,
called node
, to construct a binary node out of two subtrees if the the
two subtrees have the same height.
For nested trees, this “smart constructor” recursively merges the two trees if they have the same prefix (which ensures they have the same height).
toNTree :: Tree a -> Maybe (NTree a)
Leaf x) = return (NLeaf x)
toNTree (Node p) = traverse toNTree p >>= node where
toNTree ( node :: Two (NTree a) -> Maybe (NTree a)
Two n1 n2) = NNode <$> merge n1 n2 where
node ( merge :: NTree a -> NTree a -> Maybe (NTree (Two a))
NLeaf x) (NLeaf y) = pure (NLeaf (Two x y))
merge (NNode x) (NNode y) = NNode <$> merge x y
merge (= Nothing merge _ _
For GADT-based trees, the “smart constructor” more directly calculates the heights to the two trees, tests them for equality, and constructs a new tree if the test succeeds. A more efficient version might return the height as part of the recursion (instead of recalculating it at every step).
toDTree :: Tree a -> Maybe (DTree a)
Leaf x) = return (DTree (DLeaf x))
toDTree (Node p) = traverse toDTree p >>= node where
toDTree ( node :: Two (DTree a) -> Maybe (DTree a)
Two (DTree u1) (DTree u2)) = do
node (Refl <- testEquality (heightHTree u1) (heightHTree u2)
DTree (DNode (Two u1 u2))) return (
For the operation of “forgetting” that a tree is perfect, the GADT-based
version looks like the identity function it should be, modulo a little
syntactic noise from the DNode
constructor.
fromDTree :: DTree a -> Tree a
DTree (DLeaf x)) = Leaf x
fromDTree (DTree (DNode p)) = Node (fromDTree . DTree <$> p) fromDTree (
The nested datatype version, on the other hand, must defer to a helper
function to split the NTree
at each step.
fromNTree :: NTree a -> Tree a
NLeaf x) = Leaf x
fromNTree (NNode p) = Node (fromNTree <$> split p) where
fromNTree ( split :: NTree (Two a) -> Two (NTree a)
NLeaf p) = NLeaf <$> p
split (NNode p) = NNode <$> split p split (
Due to the need for dsplit
and dmerge
, both of the operations take
longer than we might like for nested trees. The ideal would be O(n)
,
but instead we get O (n log n)
.
For the GADT approach, the injection function has a linear running time,
but validation is still O (n log n)
due to the height computation and
equality comparison on unary nats. If we were to be a bit more clever
(return the height as we traverse the tree, use an representation of run
time nats), we could get a linear time conversion. However, it’s
probably not worth it in practice to do so.
Other operations
Of course, these are not the only operations that one may define using
perfect trees. However, these operations are good exemplars of what the
set of operations might look like. Furthermore, the Foldable
and
Traversable
instances mean that many operations can be written
generically, for all trees.
On the other hand, there is one significant difference between nested
datatypes and GADTs: the former is supported directly by GHC.Generics
but the latter is not. This means that the Generic
class, from
GHC.Generics
is automatically derivable for nested datatypes, but not
for GADTs. There may be a way to create an instance of the Generic
type class by hand, but GHC cannot do so automatically.
Conclusions
This is about as far as we can go with the comparison just using perfect
trees. There isn’t all that much you can do with them. From my point of
view, I find the indexed version of the data structure a bit easier to
understand because of the contortions required for operations such as
invertNTree
. However, maybe that is because I am already familiar with
the patterns of Dependent Haskell. If you are the opposite, perhaps this
explanation will serve as a Rosetta stone.
Other ways to define perfect trees
There are still more ways to represent perfect trees in Haskell, using type families, a finally-tagless encoding, LiquidHaskell, etc. A few of these alternatives are collected here.
Other nested datatypes
Perfect trees are an example of a fairly constrained, symmetric and artificial data structure. So was it just a fluke that we could define a GADT analogue to the nested datatype definition?
I don’t think so.
I know of at least two other examples of constrained data structures that can be implemented by either nested datatypes and GADT-based alternatives.
-
A famous use of nested datatypes is to ensure that lambda calculus expressions are well-scoped. This idea underlies the design of Kmett’s bound library.
However, instead of using a nested datatype, it is also possible to use a type-level natural number to bound the scopes of bound variables, as shown in this implementation.
-
Haskell’s implementation of the sequence data structure is built on FingerTrees. In the module DFinger.lhs I’ve sketched out a nat-indexed replacement to the nested datatype.
Many examples of nested datatypes, especially for perfect trees, use the type
(a,a)
instead ofTwo
. However, it is convenient in modern Haskell to have the appropriate definitions offmap
etc. available for this auxiliary type.↩︎Regular datatypes produce regular trees, i.e. those that are either finite or self similar. Since the collection of perfect trees is a subset of normal binary trees (which are regular) it is a bit confusing that we use a non-regular datatype to represent them. But, note that this nested datatype definition includes a height “prefix” that prevents it from being self-similar (i.e. we cannot embed smaller values of this type directly into larger values).↩︎
Fritz Henglein, Type Inference with Polymorphic Recursion. ACM Transactions on Programming Languages and Systems. Vol 15, Issue 2. April 1993.↩︎
Assaf J Kfoury, Jerzy Tiuryn, Paweł Urzyczyn. Type reconstruction in the presence of polymorphic recursion. ACM Transactions on Programming Languages and Systems. Vol 15, Issue 2. April 1993.↩︎
https://www.haskell.org/onlinereport/decls.html#type-signatures↩︎
I follow the terminology of Coq and call
n
a type index (because it varies in the result type) anda
a type parameter (because it does not).↩︎We could use https://hackage.haskell.org/package/singletons for these types but it is simpler to just write them here.↩︎