Since I'm dealing with a different data model, I'm going to introduce a new method to my TypeChecker class - tcExpr - which is responsible for attempting to determine the type of a single expression given everything it knows about the current state of the world. I'm not sure that this is going to work out in the long run - it could be too big a deviation from the overall plan - but I'm sure we can refactor back to where we need to be.
In doing this, it becomes clear that we need a return type for tcExpr. Since we're putting any errors we find in the ErrorResult object, the only real option here is the "valid" type of the expression - or null if typechecking fails. It's easy enough to create "TypeExpr", but what goes into it?
PH discusses his representation of type expressions in PH9.2 and it seems easy enough to copy his definition and map it over into Java. Basically, what he suggests is keeping track of types in much the way they are usually represented: a type "name" and a list of "type expression" arguments - for those types that need arguments. There are then a couple of special types (which seem more common because of the presentation) for handling functions and tuples. We will represent these by the types -> and () to avoid clashing with any "normal" names; a type such as "Cons" would need one argument - the type of its head element.
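A minimal sketch of what such a TypeExpr might look like in Java (the field and constructor shapes here are my guesses at the idea, not necessarily what the real code ends up with):

```java
import java.util.Arrays;
import java.util.List;

// A type expression: a constructor name plus zero or more arguments.
// Function and tuple types reuse the same shape under the reserved
// names "->" and "()" so they cannot clash with user-defined names.
class TypeExpr {
    final String type;
    final List<Object> args; // each entry is another TypeExpr (or, later, a TypeVar)

    TypeExpr(String type, Object... args) {
        this.type = type;
        this.args = Arrays.asList(args);
    }

    @Override
    public String toString() {
        return args.isEmpty() ? type : type + args;
    }
}
```

So Number would be new TypeExpr("Number"), and Number->Number would be new TypeExpr("->", new TypeExpr("Number"), new TypeExpr("Number")).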
There is also the option to typecheck something to just be a "type variable". Since this is a completely different beast, we will use two different classes and have the tcExpr method return Object. We can then wire up our typechecker to just directly say that any expression which has a numeric constant returns Number and voila! We're done. (See testWeCanTypecheckANumber).
This requires us to understand type schemes, which were introduced in PH9.5 (q.v. and my interpretation). I'm now going to go ahead and create the appropriate classes for TypeScheme and TypeVar in line with my understanding from that section.
We can create a type scheme, but in order to do that, we need the VariableFactory alluded to in UPH. This is just a simple bean counter that can issue unique new TypeVar objects.
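Something like this sketch (the class names come from the text; the internals are my own guesses):

```java
// A type variable, identified only by the id the factory gave it.
class TypeVar {
    final int id;
    TypeVar(int id) { this.id = id; }
    @Override public boolean equals(Object o) { return o instanceof TypeVar && ((TypeVar) o).id == id; }
    @Override public int hashCode() { return id; }
    @Override public String toString() { return "tv_" + id; }
}

// The "bean counter": every call hands back a brand new, unique TypeVar.
class VariableFactory {
    private int counter = 0;
    TypeVar next() { return new TypeVar(++counter); }
}
```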
OK, whoa there! In defining this next test, I realize I want to use and typecheck HSIE formed expressions, not the parse form. So, what I previously wrote as a test case won't work. OK, I'll have to pull that back and rewrite it. In the meantime, let's press on.
So, I created a new simple function in the HSIETestData set. It takes one parameter - needed to obtain a LAMBDA expression - and returns the number 1 as a result. It should have the type A->Number where "A" here represents a type variable. The factory is actually issuing things that print tv_1, but that looks ugly, so I'm going with something better here.
Another wrinkle that promptly emerges is that PH folds all his lambdas to look like (λx.(λy.(λz.E(x,y,z)))). Since we have direct functions of multiple arguments, I am going to simply do "all of these" at once.
I had been thinking that I would be storing the equivalent of the "type environment" gamma (PH9.7) in the typechecker's state. While it's true that I probably do want to store a significant amount of that state there, the "current portion" needs to be in a local variable so that we can push new entries onto it - for example, when binding a lambda expression.
So, backtracking to the first test case, we want to rewrite this as a test of typechecking an HSIEBlock, specifically one that returns the number 1. So let's do that, by extracting the expression from our newly defined test data. We then need to look and see if the command is a RETURN of an integer, and, if so, return type Number. OK, good.
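In outline, the RETURN case looks something like this (the command classes here are stand-ins of my own, not the real HSIE representation):

```java
// Stand-in for an HSIE RETURN command carrying its expression.
class ReturnCmd {
    final Object expr;
    ReturnCmd(Object expr) { this.expr = expr; }
}

class TypeChecker {
    // If the command returns an integer literal, its type is Number;
    // null means typechecking failed (the errors go to the ErrorResult).
    Object tcExpr(Object cmd) {
        if (cmd instanceof ReturnCmd && ((ReturnCmd) cmd).expr instanceof Integer)
            return "Number"; // stand-in for a TypeExpr naming Number
        return null;
    }
}
```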
I think the simple answer is that we just throw it away. I think we need to know something about the BIND instructions so as to introduce the new variables, along with their types (although I'm still not sure how we do that). But the important thing is to say that if a function f is to be consistently typed it must have a type v0->v1 for all the cases; i.e. that whatever the input types are (which we know for a specific case), the output type (i.e. the value of the function) must be the same.
Which brings me on to the other thing that's been bothering me. I'm very concerned about the union types (e.g. List = Nil|Cons). It's not clear how we can distinguish such a beast from a type error (this appears to be Nil here and Cons there - can we unify that?). Having studied TypeExpr in a bit more detail, it seems reasonable that we could represent a union type by the constructor || across a set of types. But that still leaves the question of how to unify them. I'm not even really sure where in the algorithm this pops up. One possibility would be to "always" allow it, but I think that would allow too many invalid programs; another would be to allow it whenever there was a suitable pre-defined union type (such as List), but I think that would too often devolve to Any; a third would be to allow it only if there was a type which exactly matched the union when it was fully resolved at a "top level", but contrariwise I think that might have too many false negatives.
So, in light of that exploration, what we're basically saying is that the result of typechecking an expression is two items: a PhiSolution that contains a mapping from type variables to type expressions and the exact type of that sub-expression. Given what I think I understand about the roles of these two, I propose to return the type of the expression and to use an accumulator for the PhiSolution. I think the important thing is to make sure that at all times we are passing around expressions that have had the "existing" solution applied to them and all the variables mapped by the solution removed.
That appears to be sufficient to typecheck both of these functions:
simple x = 1
id x = x
After further review and investigation, it seems like this assumption is not as clear cut as it would seem. What it basically means is that the first step is to determine an "expression" for the types of both Tf and Tx. But, as we look at this recursively, we see we get to a point where we want to typecheck "f", which is a constant. When this happens, we can return the type for it from our existing pool of such things.
So our next test case should be one that if we put something in the "externals" pool and then we have a function which returns that, can we see that we get the right type back?
That is, it seems like SWITCH is very much like a function s :: T0 -> T1 where T0 is the type of the variable being switched and T1 is the desired type (e.g. Cons). And BIND is much like a lambda expression, in that it binds a variable in the nested statement.
All the switch cases are then grouped together in some way so that all their (result) types get unified together.
On the subject of which, I now propose a solution to the other problem. The type unification seems to happen at the last line of Hancock's algorithm (in which he raises an error because you wouldn't be able to unify Nil with Cons A). We can return a union type here || (Nil, Cons A). When we reach the top level of any given function, we can then see if one of two cases holds: either there is a pre-defined type which exactly matches this union, in which case we use that, or there is not, in which case we report the unification error after all.
void addPredefined(String symbol, type_expression type);
void typecheck(Set<...> functions);

I'm thinking that the second one cannot usefully return anything and the process of typechecking will update the internal state so that it is possible to ask for the type of any given symbol after the typechecking is complete.
I still think it is necessary for the typechecker to be able to handle multiple function definitions at once in order to handle mutual recursion. I think this grouping is just like grouping the cases for a function and much like a LETREC: it involves us unifying a set of things that we can build up individually.
From our investigation, it seems that actually the first thing we want to do is to be able to treat that as a constant. So I'm briefly going to insert another test case (testExternalPlus1HasExpectedType) to check that. Yeah, and, as we might expect, we haven't considered the case that we might return an identifier. Let's do that.
OK, that wasn't too hard. Let's get back to the real matter at hand.
The first step in checking function application is to try and determine the types of the items at hand. Now, we have a closure which has a set of entries (for now - simple function application - we're just going to consider two), the first of which is the function and the rest of which are the arguments. So let's determine their types.
OK, fair enough. The next step is to create the artificial type Tx->T? where Tx is the type of the argument and T? is a newly minted type variable. And now we need to unify them. Ho, ho!
As described, there are three cases. The case for unifying two type expressions is fairly straightforward and requires the nested method unifyl, so let's do that first. OK.
Now, the case for unifying two variables. This requires the method extend, so let's implement that. The three cases are simple enough when written in Java, so let's call that easy as well (it does require adding containsVar to the TypeExpr class). Calling it isn't too hard either; we need to call subst on the second var and then pass the var and the substituted expression to extend. There's no need to return anything because this is just unify.
Finally, the case for a variable and a type expression. We need to apply subst on the type expression, meaning on the variable, and then unify the two results.
OK, I think that's unification done.
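Pulling the three cases together, the shape of the thing is roughly this (a simplified sketch with my own names; the real code threads phi around as an accumulator rather than holding it in a field):

```java
import java.util.*;

class TypeVar {
    final int id;
    TypeVar(int id) { this.id = id; }
    @Override public boolean equals(Object o) { return o instanceof TypeVar && ((TypeVar) o).id == id; }
    @Override public int hashCode() { return id; }
}

class TypeExpr {
    final String type; final List<Object> args;
    TypeExpr(String type, Object... args) { this.type = type; this.args = Arrays.asList(args); }
}

class Unifier {
    // phi: the accumulated mapping from type variables to types
    final Map<TypeVar, Object> phi = new HashMap<>();

    boolean unify(Object a, Object b) {
        a = subst(a); b = subst(b);                              // apply phi first
        if (a instanceof TypeVar) return extend((TypeVar) a, b); // var vs anything
        if (b instanceof TypeVar) return extend((TypeVar) b, a); // flipped
        TypeExpr ta = (TypeExpr) a, tb = (TypeExpr) b;           // two type expressions
        if (!ta.type.equals(tb.type) || ta.args.size() != tb.args.size())
            return false;                    // cannot unify, e.g. Nil vs Cons
        for (int i = 0; i < ta.args.size(); i++)  // unifyl: pairwise on the arguments
            if (!unify(ta.args.get(i), tb.args.get(i)))
                return false;
        return true;
    }

    // extend: bind var to expr, unless that would be trivial or circular
    boolean extend(TypeVar var, Object expr) {
        if (expr.equals(var)) return true;
        if (containsVar(expr, var)) return false; // occurs check
        phi.put(var, expr);
        return true;
    }

    // subst: chase a variable through phi to whatever it currently stands for
    Object subst(Object t) {
        while (t instanceof TypeVar && phi.containsKey(t))
            t = phi.get(t);
        return t;
    }

    boolean containsVar(Object t, TypeVar var) {
        if (t.equals(var)) return true;
        if (t instanceof TypeExpr)
            for (Object a : ((TypeExpr) t).args)
                if (containsVar(a, var)) return true;
        return false;
    }
}
```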
Uh, yes, that's it.
Moving on to multi-variable function application. We want to proceed from the left and apply the "current" function to the next argument until we're done. Let's create a new test case for plus applied to two arguments. As expected, that fails, and with a reasonable complaint: just applying + to the first argument returns a curried function (2+_ in this case). Let's fix that.
First I'm going to refactor 90% of what we just did into checkSingleApplication. Then we can easily call this with the first pair of arguments and then repeatedly call it with the result (which we "assume" to be a curried function) and the next argument. Either this will fail at some point (returning null) or we will succeed in obtaining the "real type". Simples.
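The left-to-right loop itself is trivial; here's a sketch of the shape, simplified to concrete types checked by equality rather than full unification (all the names here are mine, not the real codebase's):

```java
import java.util.Arrays;
import java.util.List;

class TypeExpr {
    final String type; final List<Object> args;
    TypeExpr(String type, Object... args) { this.type = type; this.args = Arrays.asList(args); }
    @Override public boolean equals(Object o) {
        return o instanceof TypeExpr && ((TypeExpr) o).type.equals(type) && ((TypeExpr) o).args.equals(args);
    }
    @Override public int hashCode() { return type.hashCode(); }
}

class ApplicationChecker {
    // One application step: the function type must be Tx -> T?; for brevity
    // the argument must equal Tx exactly here instead of unifying with it.
    Object checkSingleApplication(Object fn, Object arg) {
        if (!(fn instanceof TypeExpr)) return null;
        TypeExpr fe = (TypeExpr) fn;
        if (!fe.type.equals("->") || fe.args.size() != 2 || !fe.args.get(0).equals(arg))
            return null; // not a function, or wrong argument type
        return fe.args.get(1); // the (possibly still curried) result type
    }

    // Fold from the left: apply the "current" function type to each
    // argument in turn until we run out of arguments, or fail.
    Object checkApplication(Object fnType, List<Object> argTypes) {
        Object cur = fnType;
        for (Object a : argTypes) {
            cur = checkSingleApplication(cur, a);
            if (cur == null) return null;
        }
        return cur;
    }
}
```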
This is similar but easier, since we do not currently handle character constants. The typechecking challenge is the same.

f = id (decode (id 32))
Interestingly enough, with the test in place (testWeCanUseIDTwiceWithDifferentInstantiationsOfItsSchematicVar) the typechecking seems to proceed OK but then returns a schematic variable at the end, suggesting that some level of typechecking or unification has not been done. First back to Mr. Hancock to see if there's a step I've missed, then look into the code.
So this seems to depend on a method sub_te which in turn depends on sub_scheme (p173) which in turn depends on exclude (p173). So let's implement those. exclude goes on PhiSolution and (I think) returns a new, temporary version of phi. sub_scheme goes on TypeScheme and pretty definitely returns a new object. The "obvious" definition of sub_te (p175) presumably wants to go on TypeEnvironment. For good measure, I implemented it in "the obvious way". Having done that, I'm not sure I'm going to use any of it (at least not right now).
What I'm going to try and do is just to introduce the name with a new set of variables replacing the old ones and see if that does the trick. Basically, this is a question of implementing a method freshVarsIn in the type checker which consistently replaces the current schematic variables with new ones.
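A sketch of what freshVarsIn has to do (my own rendering): walk the type expression and replace each schematic variable, keeping a map so that the same old variable always maps to the same new one:

```java
import java.util.*;

class TypeVar {
    final int id;
    TypeVar(int id) { this.id = id; }
    @Override public boolean equals(Object o) { return o instanceof TypeVar && ((TypeVar) o).id == id; }
    @Override public int hashCode() { return id; }
}

class TypeExpr {
    final String type; final List<Object> args;
    TypeExpr(String type, Object... args) { this.type = type; this.args = Arrays.asList(args); }
}

class FreshVars {
    private int counter = 0;

    // Replace every schematic variable in t with a freshly issued one,
    // consistently: the same old variable always becomes the same new one.
    Object freshVarsIn(Object t, Map<TypeVar, TypeVar> renames) {
        if (t instanceof TypeVar)
            return renames.computeIfAbsent((TypeVar) t, v -> new TypeVar(++counter));
        TypeExpr te = (TypeExpr) t;
        Object[] args = new Object[te.args.size()];
        for (int i = 0; i < args.length; i++)
            args[i] = freshVarsIn(te.args.get(i), renames);
        return new TypeExpr(te.type, args);
    }
}
```

Each use of a name then gets its own instantiation: two fresh copies of id's A->A share nothing, but within each copy the two As stay the same variable.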
Although this is obviously a theoretical improvement, it still ends up with the wrong answer, but I think the fundamental problem is that we have unified tv_1 with tv_5 and tv_1 with "Char" but have simply not returned the right thing, possibly just as simple as not applying phi to the result.
Actually, not quite that simple. The problem was that, in mapping PH's three cases of unification to my three, I'd failed to realize that I need to apply phi to the var in case 1 before doing extend. As it happened, with tv_1 already having a value, I needed to select case 1b, not case 1a.
Assuming that I didn't so dramatically cheat that it worked for the wrong reasons (and looking at the tracing, that doesn't appear to be the case), that was not too bad. Basically, that just leaves us with LETREC and handling pattern matching (SWITCH and BIND).
The other thing we need to do is to keep track of the "more local" definitions than the global "knowledge" so that we don't try looking up these names in the global scope but rather find the locally renamed definitions.
So let's get started.
It takes a bit of setup to get all this working, which involves creating two new function definitions (for which we also create parsing test cases to make sure that we have written the right HSIE) and then a new test case (testWeCanCheckTwoFunctionsAtOnceBecauseTheyAreMutuallyRecursive). As it happens, this fails on the existing code base with a couple of errors, because we don't check for null everywhere we should and we don't (yet) have the right information about where to tag errors, but it ultimately comes down to "there is no type for f" (or g). Well, duh.
So what we need to do is to allocate local knowledge for both of these when we start doing the typechecking. An interesting issue is how we're going to pass that around, since it's not something which is integral to the state of the typechecker, but rather depends on the methods being typechecked. So I guess it's another parameter we pass around along with phi and gamma.
These need to be "named apart" in terms of variables, because we don't want to have the variables in different functions being unified together. So, let's use a pool of fresh variables for each equation, rewrite them and then process the resulting equations.
Doing that basically seems to work, at least in terms of a first approximation. There are details to be sure, but basically creating new parallel equations, each with their own set of variables, and then attempting to typecheck each one in turn (but with the others' types as local definitions with pre-bound type variables for arguments), and then, at the end using unification to tie them together, seems to "just work".
That still leaves us with the wrinkle of what to do when we have scoped definitions. In this case we need to make sure that any inner bound variables are not rewritten (or at least, are rewritten to the same new variable as the outer equation) so that they are unified together. Consider, for example:

f x = g 2
g y = x * y

If we were to simply typecheck g first, then that would have an "unknown" var (x) but would resolve to Number->Number. If we were to then typecheck f, we would end up with no information about x and thus A->Number. Thus we need to make sure that we handle both definitions together, thus providing g with a definition for x, and doing the unification to make f typecheck to Number->Number.
Well, possibly that lulled me into a false sense of security, but for the simple cases of matching numbers, that was all very easy. In fact, I think inasmuch as this was a task to handle HSIE, I've handled SWITCH and IF successfully.
The bigger issue hiding in here is, of course, my extension to the type system. We'll come back to that, but first, let's make sure we nail BIND.
Yeah, that seemed easy.
This, as previously noted, comes from the work I did in my Ph.D. thesis with regard to "base types". Although the original idea was in support of proof constraints (having more granular types makes it easier to state precisely which cases are, or are not, covered by a function definition or proof obligation), the idea sticks with me as being generally useful. It provides at least three benefits:
The basic idea is that Hancock's unification algorithm has a failure case: it is not possible to literally unify concrete types X and Y (say String and Number). I propose to replace that failure case with a unification which returns UnionType[X,Y], where the argument here is a list of types which participate in the union.
This requires at least one extra case in the unification algorithm, which is if one of the two sides is a UnionType. If so, then the two types must be "merged" by adding the other type to the list, except in the case where it is itself a UnionType, in which case the result must be that both are replaced with a UnionType containing the contents of both lists.
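The merge rule, sketched in Java (again with names of my own; the de-duping described below is left out here):

```java
import java.util.*;

class TypeExpr {
    final String type; final List<Object> args;
    TypeExpr(String type, Object... args) { this.type = type; this.args = Arrays.asList(args); }
}

// A union of types, represented as the list of participating types.
class TypeUnion {
    final List<Object> cases = new ArrayList<>();
    TypeUnion(Object... cs) { cases.addAll(Arrays.asList(cs)); }

    // Replace the unification failure case: instead of an error, combine
    // the two sides into one union, flattening any union already present.
    static TypeUnion merge(Object a, Object b) {
        TypeUnion ret = new TypeUnion();
        add(ret, a);
        add(ret, b);
        return ret;
    }

    private static void add(TypeUnion u, Object t) {
        if (t instanceof TypeUnion)
            u.cases.addAll(((TypeUnion) t).cases); // both lists' contents, never nested unions
        else
            u.cases.add(t);
    }
}
```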
The algorithm should de-dup any added TypeExpr objects that are variable-free, but, at least initially, not de-dup any type variables or TypeExpr objects containing type variables. At the end of processing, it would be good to further de-dup (or at least simplify) any types which are "equivalent", in the sense that Nil, Cons(tv_3) and Cons(tv_6) form equivalent members of the union. In all these cases, it seems that the "most general" should win. After all, what is the union of List<A> and List
So, let's get started. Because of the way I have structured my types (in anticipation of this moment), it's fairly easy to find a case to consider: in fact, I've had to tiptoe around them up until now (most recently, I lied that the signature of Cons was A→List<A>→List<A> whereas it is really (obviously) A→List<A>→Cons<A>). So let's take the real case as our example and set up a test case, testWeCanDoASimpleUnionOfNilAndCons.
It should come as no surprise to anyone (except the very optimistic or very cynical) that we get:

3 errors encountered
Cannot unify Cons and Nil
Cannot unify Cons and Nil
Cannot unify List and Cons

Turning around and trying to implement this, everything goes pretty much by the book. We can introduce and implement a TypeUnion easily enough, and we can populate it with TypeExpr objects when needed. It's not too hard to update the unification code to create one of these if it can't unify, and then it's just a question of following through on all the places where errors occur because we can handle TypeVars and TypeExprs, but we can't handle TypeUnions. That includes being able to merge two TypeUnion objects together to make a new object.
The one thing that comes out of this is that previously the unification algorithm did not need to return a type, since it could be deduced from the phi transform as the mapping from a type variable. Now, a (potentially new) unified type may need to be returned which does not really reflect a unified variable since it may reflect the unification of two different return values.
And we're done with the actual unification.
At the same time, I would like to clean up the presentation so that function types can have multiple parameters, instead of being strictly pairs. This basically consists of unfolding the TypeExpr definitions.
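Unfolding the nested pairs is just a loop; a sketch (my names, not the real code's):

```java
import java.util.*;

class TypeExpr {
    final String type; final List<Object> args;
    TypeExpr(String type, Object... args) { this.type = type; this.args = Arrays.asList(args); }
}

class Unfold {
    // Flatten a right-nested chain of binary "->" types into one list:
    // all the parameter types in order, then the final return type last.
    static List<Object> unfold(Object t) {
        List<Object> parts = new ArrayList<>();
        while (t instanceof TypeExpr
                && ((TypeExpr) t).type.equals("->")
                && ((TypeExpr) t).args.size() == 2) {
            parts.add(((TypeExpr) t).args.get(0));
            t = ((TypeExpr) t).args.get(1);
        }
        parts.add(t);
        return parts;
    }
}
```

Number->(Number->Number) thus unfolds to the three parts [Number, Number, Number], which can be presented as a single two-parameter function type.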
I thus propose that in the process of adding types to the store of existing knowledge, we map them from being TypeExprs to being Type objects, which don't allow for "type variables" as such but for polymorphic type parameters. Also, TypeUnion objects are not permitted here: they must be resolved to a suitable type.
For this, I am going to follow the strategy I outlined above: if there is a suitable type which is exactly what is required, then that is what will be used; if not, then the combined type will be rejected. If and when I allow for functions to be specifically typed, the second part of the rule will be implemented.
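As a sketch, the "exact match" rule might look like this (the representation is my own simplification: each union is reduced to the set of its constructor names before matching):

```java
import java.util.*;

class UnionResolver {
    // Known union definitions: type name -> the exact set of its constructors.
    private final Map<String, Set<String>> unions = new HashMap<>();

    void define(String name, String... ctors) {
        unions.put(name, new HashSet<>(Arrays.asList(ctors)));
    }

    // A union of constructors resolves only to a defined type whose
    // constructor set is exactly the same; anything else is rejected.
    String resolve(Set<String> ctors) {
        for (Map.Entry<String, Set<String>> e : unions.entrySet())
            if (e.getValue().equals(ctors))
                return e.getKey();
        return null;
    }
}
```

So with List defined as Nil|Cons, the union of Nil and Cons resolves to List, but a union of Nil alone, or of Nil and something unrelated, is rejected.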
This isn't going to involve the creation of any new test cases, but a refactoring of all the existing ones. It also requires us to add the knowledge about struct and type definitions to the typechecker.
The hard part of all of this is none of the things I expected, but the thorny issue of matching up the polymorphic types in a Union. The problem appears to be simply that because we have a union of different things, we want to resolve to T[A], but we obviously can't say that Cons[Number] and Cons[String] merge to List[A]. I think it's even more tricky in cases where there are multiple polymorphic structs.
The only reasonable approach (at least until we have more data) is to say that for each case, we want the defined struct-polymorphic variable to match the same rhs every time. Hopefully unification will mean that this is true.
And so on and so on. Almost all details in bookkeeping and making sure that everything ends up in the right places and populating the right structures in the typechecker and lots of mapping backwards and forwards between Types, TypeExprs and TypeReferences. I'm not sure that TypeReference shouldn't go away and be replaced with the more general Type.