Since I'm dealing with a different data model, I'm going to introduce a new method to my TypeChecker class - tcExpr - which is responsible for attempting to determine the type of a single expression given everything it knows about the current state of the world. I'm not sure that this is going to work out in the long run - it could be too big a deviation from the overall plan - but I'm sure we can refactor back to where we need to be.
In doing this, it becomes clear that we need a return type for tcExpr. Since we're putting any errors we find in the ErrorResult object, the only real option here is the "valid" type of the expression - or null if typechecking fails. It's easy enough to create "TypeExpr", but what goes into it?
PH discusses his representation of type expressions in PH9.2 and it seems easy enough to copy his definition and map it over into Java. Basically, what he suggests is keeping track of types in much the way they are usually represented: a type "name" and a list of "type expression" arguments - for those types that need arguments. There are then a couple of special types (which seem more common because of the presentation) for handling functions and tuples. We will represent these by the types -> and () to avoid clashing with any "normal" names; a type such as "Cons" would need one argument - the type of its head element.
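A minimal sketch of what such a TypeExpr might look like in Java (the field and constructor shapes here are my guesses at the idea, not necessarily what the real code ends up with):

```java
import java.util.Arrays;
import java.util.List;

// A type expression: a constructor name plus zero or more arguments.
// Function and tuple types reuse the same shape under the reserved
// names "->" and "()" so they cannot clash with user-defined names.
class TypeExpr {
    final String type;
    final List<Object> args; // each entry is another TypeExpr (or, later, a TypeVar)

    TypeExpr(String type, Object... args) {
        this.type = type;
        this.args = Arrays.asList(args);
    }

    @Override
    public String toString() {
        return args.isEmpty() ? type : type + args;
    }
}
```

So Number would be new TypeExpr("Number"), and Number->Number would be new TypeExpr("->", new TypeExpr("Number"), new TypeExpr("Number")).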
There is also the option to typecheck something to just be a "type variable". Since this is a completely different beast, we will use two different classes and have the tcExpr method return Object. We can then wire up our typechecker to just directly say that any expression which has a numeric constant returns Number and voila! We're done. (See testWeCanTypecheckANumber).
This requires us to understand type schemes, which were introduced in PH9.5 (q.v. and my interpretation). I'm now going to go ahead and create the appropriate classes for TypeScheme and TypeVar in line with my understanding from that section.
We can create a type scheme, but in order to do that, we need the VariableFactory alluded to in UPH. This is just a simple bean counter that can issue unique new TypeVar objects.
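Something like this sketch (the class names come from the text; the internals are my own guesses):

```java
// A type variable, identified only by the id the factory gave it.
class TypeVar {
    final int id;
    TypeVar(int id) { this.id = id; }
    @Override public boolean equals(Object o) { return o instanceof TypeVar && ((TypeVar) o).id == id; }
    @Override public int hashCode() { return id; }
    @Override public String toString() { return "tv_" + id; }
}

// The "bean counter": every call hands back a brand new, unique TypeVar.
class VariableFactory {
    private int counter = 0;
    TypeVar next() { return new TypeVar(++counter); }
}
```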
OK, whoa there! In defining this next test, I realize I want to use and typecheck HSIE formed expressions, not the parse form. So, what I previously wrote as a test case won't work. OK, I'll have to pull that back and rewrite it. In the meantime, let's press on.
So, I created a new simple function in the HSIETestData set. It takes one parameter - needed to obtain a LAMBDA expression - and returns the number 1 as a result. It should have the type A->Number where "A" here represents a type variable. The factory is actually issuing things that print tv_1, but that looks ugly, so I'm going with something better here.
Another wrinkle that promptly emerges is that PH folds all his lambdas to look like (λx.(λy.(λz.E(x,y,z)))). Since we have direct functions of multiple arguments, I am going to simply do "all of these" at once.
I had been thinking that I would be storing the equivalent of the "type environment" gamma (PH9.7) in the typechecker's state. While it's true that I probably do want to store a significant amount of that state there, the "current portion" needs to be in a local variable so that we can push new entries onto it - for example, when binding a lambda expression.
So, backtracking to the first test case, we want to rewrite this as a test of typechecking an HSIEBlock, specifically one that returns the number 1. So let's do that, by extracting the expression from our newly defined test data. We then need to look and see if the command is a RETURN of an integer, and, if so, return type Number. OK, good.
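In outline, the RETURN case looks something like this (the command classes here are stand-ins of my own, not the real HSIE representation):

```java
// Stand-in for an HSIE RETURN command carrying its expression.
class ReturnCmd {
    final Object expr;
    ReturnCmd(Object expr) { this.expr = expr; }
}

class TypeChecker {
    // If the command returns an integer literal, its type is Number;
    // null means typechecking failed (the errors go to the ErrorResult).
    Object tcExpr(Object cmd) {
        if (cmd instanceof ReturnCmd && ((ReturnCmd) cmd).expr instanceof Integer)
            return "Number"; // stand-in for a TypeExpr naming Number
        return null;
    }
}
```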
I think the simple answer is that we just throw it away. I think we need to know something about the BIND instructions so as to introduce the new variables, along with their types (although I'm still not sure how we do that). But the important thing is to say that if a function f is to be consistently typed it must have a type v0->v1 for all the cases; i.e. that whatever the input types are (which we know for a specific case), the output type (i.e. the value of the function) must be the same.
Which brings me on to the other thing that's been bothering me. I'm very concerned about the union types (e.g. List = Nil|Cons). It's not clear how we can distinguish such a beast from a type error (this appears to be Nil here and Cons there - can we unify that?). Having studied TypeExpr in a bit more detail, it seems reasonable that we could represent a union type by the constructor || across a set of types. But that still leaves the question of how to unify them. I'm not even really sure where in the algorithm this pops up. One possibility would be to "always" allow it, but I think that would allow too many invalid programs; another would be to allow it whenever there was a suitable pre-defined union type (such as List), but I think that would too often devolve to Any; a third would be to allow it only if there was a type which exactly matched the union when it was fully resolved at a "top level", but contrariwise I think that might have too many false negatives.
So, in light of that exploration, what we're basically saying is that the result of typechecking an expression is two items: a PhiSolution that contains a mapping from type variables to type expressions and the exact type of that sub-expression. Given what I think I understand about the roles of these two, I propose to return the type of the expression and to use an accumulator for the PhiSolution. I think the important thing is to make sure that at all times we are passing around expressions that have had the "existing" solution applied to them and all the variables mapped by the solution removed.
That appears to be sufficient to typecheck both of these functions:
simple x = 1
id x = x
After further review and investigation, it seems like this assumption is not as clear cut as it would seem. What it basically means is that the first step is to determine an "expression" for the types of both Tf and Tx. But, as we look at this recursively, we see we get to a point where we want to typecheck "f", which is a constant. When this happens, we can return the type for it from our existing pool of such things.
So our next test case should be one that if we put something in the "externals" pool and then we have a function which returns that, can we see that we get the right type back?
That is, it seems like SWITCH is very much like a function s :: T0 -> T1 where T0 is the type of the variable being switched and T1 is the desired type (e.g. Cons). And BIND is much like a lambda expression, in that it binds a variable in the nested statement.
All the switch cases are then grouped together in some way so that all their (result) types get unified together.
On the subject of which, I now propose a solution to the other problem. The type unification seems to happen at the last line of Hancock's algorithm (in which he raises an error because you wouldn't be able to unify Nil with Cons A). We can return a union type here || (Nil, Cons A). When we reach the top level of any given function, we can then see if one of two cases holds: either there is a pre-defined type which exactly matches this union, in which case we use that, or there is not, in which case we report the unification error after all.
void addPredefined(String symbol, type_expression type);
void typecheck(Set<...> functions);

I'm thinking that the second one cannot usefully return anything and the process of typechecking will update the internal state so that it is possible to ask for the type of any given symbol after the typechecking is complete.
I still think it is necessary for the typechecker to be able to handle multiple function definitions at once in order to handle mutual recursion. I think this grouping is just like grouping the cases for a function and much like a LETREC: it involves us unifying a set of things that we can build up individually.
From our investigation, it seems that actually the first thing we want to do is to be able to treat that as a constant. So I'm briefly going to insert another test case (testExternalPlus1HasExpectedType) to check that. Yeah, and, as we might expect, we haven't considered the case that we might return an identifier. Let's do that.
OK, that wasn't too hard. Let's get back to the real matter at hand.
The first step in checking function application is to try and determine the types of the items at hand. Now, we have a closure which has a set of entries (for now - simple function application - we're just going to consider two), the first of which is the function and the rest of which are the arguments. So let's determine their types.
OK, fair enough. The next step is to create the artificial type Tx->T? where Tx is the type of the argument and T? is a newly minted type variable. And now we need to unify them. Ho, ho!
As described, there are three cases. The case for unifying two type expressions is fairly straightforward and requires the nested method unifyl, so let's do that first. OK.
Now, the case for unifying two variables. This requires the method extend, so let's implement that. The three cases are simple enough when written in Java, so let's call that easy as well (it does require adding containsVar to the TypeExpr class). Calling it isn't too hard either; we need to call subst on the second var and then pass the var and the substituted expression to extend. There's no need to return anything because this is just unify.
Finally, the case for a variable and a type expression. We need to apply subst on the type expression, meaning on the variable, and then unify the two results.
OK, I think that's unification done.
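Pulling the three cases together, the shape of the thing is roughly this (a simplified sketch with my own names; the real code threads phi around as an accumulator rather than holding it in a field):

```java
import java.util.*;

class TypeVar {
    final int id;
    TypeVar(int id) { this.id = id; }
    @Override public boolean equals(Object o) { return o instanceof TypeVar && ((TypeVar) o).id == id; }
    @Override public int hashCode() { return id; }
}

class TypeExpr {
    final String type; final List<Object> args;
    TypeExpr(String type, Object... args) { this.type = type; this.args = Arrays.asList(args); }
}

class Unifier {
    // phi: the accumulated mapping from type variables to types
    final Map<TypeVar, Object> phi = new HashMap<>();

    boolean unify(Object a, Object b) {
        a = subst(a); b = subst(b);                              // apply phi first
        if (a instanceof TypeVar) return extend((TypeVar) a, b); // var vs anything
        if (b instanceof TypeVar) return extend((TypeVar) b, a); // flipped
        TypeExpr ta = (TypeExpr) a, tb = (TypeExpr) b;           // two type expressions
        if (!ta.type.equals(tb.type) || ta.args.size() != tb.args.size())
            return false;                    // cannot unify, e.g. Nil vs Cons
        for (int i = 0; i < ta.args.size(); i++)  // unifyl: pairwise on the arguments
            if (!unify(ta.args.get(i), tb.args.get(i)))
                return false;
        return true;
    }

    // extend: bind var to expr, unless that would be trivial or circular
    boolean extend(TypeVar var, Object expr) {
        if (expr.equals(var)) return true;
        if (containsVar(expr, var)) return false; // occurs check
        phi.put(var, expr);
        return true;
    }

    // subst: chase a variable through phi to whatever it currently stands for
    Object subst(Object t) {
        while (t instanceof TypeVar && phi.containsKey(t))
            t = phi.get(t);
        return t;
    }

    boolean containsVar(Object t, TypeVar var) {
        if (t.equals(var)) return true;
        if (t instanceof TypeExpr)
            for (Object a : ((TypeExpr) t).args)
                if (containsVar(a, var)) return true;
        return false;
    }
}
```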
Uh, yes, that's it.
Moving on to multi-variable function application. We want to proceed from the left and apply the "current" function to the next argument until we're done. Let's create a new test case for plus applied to two arguments. As expected, that fails, and with a reasonable complaint: just applying + to the first argument returns a curried function (2+_ in this case). Let's fix that.
First I'm going to refactor 90% of what we just did into checkSingleApplication. Then we can easily call this with the first pair of arguments and then repeatedly call it with the result (which we "assume" to be a curried function) and the next argument. Either this will fail at some point (returning null) or we will succeed in obtaining the "real type". Simples.
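The left-to-right loop itself is trivial; here's a sketch of the shape, simplified to concrete types checked by equality rather than full unification (all the names here are mine, not the real codebase's):

```java
import java.util.Arrays;
import java.util.List;

class TypeExpr {
    final String type; final List<Object> args;
    TypeExpr(String type, Object... args) { this.type = type; this.args = Arrays.asList(args); }
    @Override public boolean equals(Object o) {
        return o instanceof TypeExpr && ((TypeExpr) o).type.equals(type) && ((TypeExpr) o).args.equals(args);
    }
    @Override public int hashCode() { return type.hashCode(); }
}

class ApplicationChecker {
    // One application step: the function type must be Tx -> T?; for brevity
    // the argument must equal Tx exactly here instead of unifying with it.
    Object checkSingleApplication(Object fn, Object arg) {
        if (!(fn instanceof TypeExpr)) return null;
        TypeExpr fe = (TypeExpr) fn;
        if (!fe.type.equals("->") || fe.args.size() != 2 || !fe.args.get(0).equals(arg))
            return null; // not a function, or wrong argument type
        return fe.args.get(1); // the (possibly still curried) result type
    }

    // Fold from the left: apply the "current" function type to each
    // argument in turn until we run out of arguments, or fail.
    Object checkApplication(Object fnType, List<Object> argTypes) {
        Object cur = fnType;
        for (Object a : argTypes) {
            cur = checkSingleApplication(cur, a);
            if (cur == null) return null;
        }
        return cur;
    }
}
```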
This is similar but easier, since we do not currently handle character constants. The typechecking challenge is the same.

f = id (decode (id 32))
Interestingly enough, with the test in place (testWeCanUseIDTwiceWithDifferentInstantiationsOfItsSchematicVar) the typechecking seems to proceed OK but then returns a schematic variable at the end, suggesting that some level of typechecking or unification has not been done. First back to Mr. Hancock to see if there's a step I've missed, then look into the code.
So this seems to depend on a method sub_te which in turn depends on sub_scheme (p173) which in turn depends on exclude (p173). So let's implement those. exclude goes on PhiSolution and (I think) returns a new, temporary version of phi. sub_scheme goes on TypeScheme and pretty definitely returns a new object. The "obvious" definition of sub_te (p175) presumably wants to go on TypeEnvironment. For good measure, I implemented it in "the obvious way". Having done that, I'm not sure I'm going to use any of it (at least not right now).
What I'm going to try and do is just to introduce the name with a new set of variables replacing the old ones and see if that does the trick. Basically, this is a question of implementing a method freshVarsIn in the type checker which consistently replaces the current schematic variables with new ones.
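A sketch of what freshVarsIn has to do (my own rendering): walk the type expression and replace each schematic variable, keeping a map so that the same old variable always maps to the same new one:

```java
import java.util.*;

class TypeVar {
    final int id;
    TypeVar(int id) { this.id = id; }
    @Override public boolean equals(Object o) { return o instanceof TypeVar && ((TypeVar) o).id == id; }
    @Override public int hashCode() { return id; }
}

class TypeExpr {
    final String type; final List<Object> args;
    TypeExpr(String type, Object... args) { this.type = type; this.args = Arrays.asList(args); }
}

class FreshVars {
    private int counter = 0;

    // Replace every schematic variable in t with a freshly issued one,
    // consistently: the same old variable always becomes the same new one.
    Object freshVarsIn(Object t, Map<TypeVar, TypeVar> renames) {
        if (t instanceof TypeVar)
            return renames.computeIfAbsent((TypeVar) t, v -> new TypeVar(++counter));
        TypeExpr te = (TypeExpr) t;
        Object[] args = new Object[te.args.size()];
        for (int i = 0; i < args.length; i++)
            args[i] = freshVarsIn(te.args.get(i), renames);
        return new TypeExpr(te.type, args);
    }
}
```

Each use of a name then gets its own instantiation: two fresh copies of id's A->A share nothing, but within each copy the two As stay the same variable.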
Although this is obviously a theoretical improvement, it still ends up with the wrong answer, but I think the fundamental problem is that we have unified tv_1 with tv_5 and tv_1 with "Char" but have simply not returned the right thing, possibly just as simple as not applying phi to the result.
Actually, not quite that simple. The problem was that, in mapping PH's three cases of unification to my three, I'd failed to realize that I need to apply phi to the var in case 1 before doing extend. As it happened, with tv_1 already having a value, I needed to select case 1b, not case 1a.
Assuming that I didn't so dramatically cheat that it worked for the wrong reasons (and looking at the tracing, that doesn't appear to be the case), that was not too bad. Basically, that just leaves us with LETREC and handling pattern matching (SWITCH and BIND).
The other thing we need to do is to keep track of the "more local" definitions than the global "knowledge" so that we don't try looking up these names in the global scope but rather find the locally renamed definitions.
So let's get started.
It takes a bit of setup to get all this working, which involves creating two new function definitions (for which we also create parsing test cases to make sure that we have written the right HSIE) and then a new test case (testWeCanCheckTwoFunctionsAtOnceBecauseTheyAreMutuallyRecursive). As it happens, this fails on the existing code base with a couple of errors, because we don't check for null everywhere we should and we don't (yet) have the right information about where to tag errors, but it ultimately comes down to "there is no type for f" (or g). Well, duh.
So what we need to do is to allocate local knowledge for both of these when we start doing the typechecking. An interesting issue is how we're going to pass that around, since it's not something which is integral to the state of the typechecker, but rather depends on the methods being typechecked. So I guess it's another parameter we pass around along with phi and gamma.
These need to be "named apart" in terms of variables, because we don't want to have the variables in different functions being unified together. So, let's use a pool of fresh variables for each equation, rewrite them and then process the resulting equations.
Doing that basically seems to work, at least in terms of a first approximation. There are details to be sure, but basically creating new parallel equations, each with their own set of variables, and then attempting to typecheck each one in turn (but with the others' types as local definitions with pre-bound type variables for arguments), and then, at the end using unification to tie them together, seems to "just work".
That still leaves us with the wrinkle of what to do when we have scoped definitions. In this case we need to make sure that any inner bound variables are not rewritten (or at least, are rewritten to the same new variable as the outer equation) so that they are unified together. Consider, for example:

f x = g 2
g y = x * y

If we were to simply typecheck g first, then that would have an "unknown" var (x) but would resolve to Number->Number. If we were to then typecheck f, we would end up with no information about x and thus A->Number. Thus we need to make sure that we handle both definitions together, thus providing g with a definition for x, and doing the unification to make f typecheck to Number->Number.
Well, possibly that lulled me into a false sense of security, but for the simple cases of matching numbers, that was all very easy. In fact, I think inasmuch as this was a task to handle HSIE, I've handled SWITCH and IF successfully.
The bigger issue hiding in here is, of course, my extension to the type system. We'll come back to that, but first, let's make sure we nail BIND.
Yeah, that seemed easy.
This, as previously noted, comes from the work I did in my Ph.D. thesis with regard to "base types". Although the original idea was in support of proof constraints (having more granular types makes it easier to state precisely which cases are, or are not, covered by a function definition or proof obligation), the idea sticks with me as being generally useful. It provides at least three benefits:
The basic idea is that Hancock's unification algorithm has a failure case: it is not possible to literally unify concrete types X and Y (say String and Number). I propose to replace that failure case with a unification which returns UnionType[X,Y], where the argument here is a list of types which participate in the union.
This requires at least one extra case in the unification algorithm, which is if one of the two sides is a UnionType. If so, then the two types must be "merged" by adding the other type to the list, except in the case where it is itself a UnionType, in which case the result must be that both are replaced with a UnionType containing the contents of both lists.
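The merge rule, sketched in Java (again with names of my own; the de-duping described below is left out here):

```java
import java.util.*;

class TypeExpr {
    final String type; final List<Object> args;
    TypeExpr(String type, Object... args) { this.type = type; this.args = Arrays.asList(args); }
}

// A union of types, represented as the list of participating types.
class TypeUnion {
    final List<Object> cases = new ArrayList<>();
    TypeUnion(Object... cs) { cases.addAll(Arrays.asList(cs)); }

    // Replace the unification failure case: instead of an error, combine
    // the two sides into one union, flattening any union already present.
    static TypeUnion merge(Object a, Object b) {
        TypeUnion ret = new TypeUnion();
        add(ret, a);
        add(ret, b);
        return ret;
    }

    private static void add(TypeUnion u, Object t) {
        if (t instanceof TypeUnion)
            u.cases.addAll(((TypeUnion) t).cases); // both lists' contents, never nested unions
        else
            u.cases.add(t);
    }
}
```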
The algorithm should de-dup any added TypeExpr objects that are variable-free, but, at least initially, not de-dup any type variables or TypeExpr objects containing type variables. At the end of processing, it would be good to further de-dup (or at least simplify) any types which are "equivalent", in the sense that Nil, Cons(tv_3) and Cons(tv_6) form equivalent members of the union. In all these cases, it seems that the "most general" should win. After all, what is the union of List<A> and List
So, let's get started. Because of the way I have structured my types (in anticipation of this moment), it's fairly easy to find a case to consider: in fact, I've had to tiptoe around them up until now (most recently, I lied that the signature of Cons was A→List<A>→List<A> whereas it is really (obviously) A→List<A>→Cons<A>). So let's take the real case as our example and set up a test case, testWeCanDoASimpleUnionOfNilAndCons.
It should come as no surprise to anyone (except the very optimistic or very cynical) that we get:

3 errors encountered
Cannot unify Cons and Nil
Cannot unify Cons and Nil
Cannot unify List and Cons

Turning around and trying to implement this, everything goes pretty much by the book. We can introduce and implement a TypeUnion easily enough, and we can populate it with TypeExpr objects when needed. It's not too hard to update the unification code to create one of these if it can't unify, and then it's just a question of following through on all the places where errors occur because we can handle TypeVars and TypeExprs, but we can't handle TypeUnions. That includes being able to merge two TypeUnion objects together to make a new object.
The one thing that comes out of this is that previously the unification algorithm did not need to return a type, since it could be deduced from the phi transform as the mapping from a type variable. Now, a (potentially new) unified type may need to be returned which does not really reflect a unified variable since it may reflect the unification of two different return values.
And we're done with the actual unification.
At the same time, I would like to clean up the presentation so that function types can have multiple parameters, instead of being strictly pairs. This basically consists of unfolding the TypeExpr definitions.
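Unfolding the nested pairs is just a loop; a sketch (my names, not the real code's):

```java
import java.util.*;

class TypeExpr {
    final String type; final List<Object> args;
    TypeExpr(String type, Object... args) { this.type = type; this.args = Arrays.asList(args); }
}

class Unfold {
    // Flatten a right-nested chain of binary "->" types into one list:
    // all the parameter types in order, then the final return type last.
    static List<Object> unfold(Object t) {
        List<Object> parts = new ArrayList<>();
        while (t instanceof TypeExpr
                && ((TypeExpr) t).type.equals("->")
                && ((TypeExpr) t).args.size() == 2) {
            parts.add(((TypeExpr) t).args.get(0));
            t = ((TypeExpr) t).args.get(1);
        }
        parts.add(t);
        return parts;
    }
}
```

Number->(Number->Number) thus unfolds to the three parts [Number, Number, Number], which can be presented as a single two-parameter function type.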
I thus propose that in the process of adding types to the store of existing knowledge, we map them from being TypeExprs to being Type objects, which don't allow for "type variables" as such but for polymorphic type parameters. Also, TypeUnion objects are not permitted here: they must be resolved to a suitable type.
For this, I am going to follow the strategy I outlined above: if there is a suitable type which is exactly what is required, then that is what will be used; if not, then the combined type will be rejected. If and when I allow for functions to be specifically typed, the second part of the rule will be implemented.
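As a sketch, the "exact match" rule might look like this (the representation is my own simplification: each union is reduced to the set of its constructor names before matching):

```java
import java.util.*;

class UnionResolver {
    // Known union definitions: type name -> the exact set of its constructors.
    private final Map<String, Set<String>> unions = new HashMap<>();

    void define(String name, String... ctors) {
        unions.put(name, new HashSet<>(Arrays.asList(ctors)));
    }

    // A union of constructors resolves only to a defined type whose
    // constructor set is exactly the same; anything else is rejected.
    String resolve(Set<String> ctors) {
        for (Map.Entry<String, Set<String>> e : unions.entrySet())
            if (e.getValue().equals(ctors))
                return e.getKey();
        return null;
    }
}
```

So with List defined as Nil|Cons, the union of Nil and Cons resolves to List, but a union of Nil alone, or of Nil and something unrelated, is rejected.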
This isn't going to involve the creation of any new test cases, but a refactoring of all the existing ones. It also requires us to add the knowledge about struct and type definitions to the typechecker.
The hard part of all of this is none of the things I expected, but the thorny issue of matching up the polymorphic types in a Union. The problem appears to be simply that because we have a union of different things, we want to resolve to T[A], but we obviously can't say that Cons[Number] and Cons[String] merge to List[A]. I think it's even more tricky in cases where there are multiple polymorphic structs.
The only reasonable approach (at least until we have more data) is to say that for each case, we want the defined struct-polymorphic variable to match the same rhs every time. Hopefully unification will mean that this is true.
And so on and so on. Almost all details in bookkeeping and making sure that everything ends up in the right places and populating the right structures in the typechecker and lots of mapping backwards and forwards between Types, TypeExprs and TypeReferences. I'm not sure that TypeReference shouldn't go away and be replaced with the more general Type.