While constructing the type checker in Miranda demonstrates a desire to "eat his own dog food" (and moreover, in those days, was considerably easier than doing so in, say, C), it adds a certain amount of complexity to the overall approach because of the absence of state. In the "modern era" it would be possible to use monads to compensate for this "deficiency", but obviously when implementing in an OO language such as Java or C# this is not a challenge.
Furthermore, while it is instructive to build up from the bottom, it is (to me) more intuitive to always break a problem down from the top. The actual section on implementing the type checker is merely the final six pages of a 43-page exposition. For me at least, the first few times I read this chapter, I had lost the plot before I reached section 9.7.
But given that it is an explanation of the algorithm and that it provides working code, it seems an excellent place to start from, once the complexities are unravelled. Furthermore, in part simply because of the detail, it is probably not possible to make sense of the explanation that I will now present except in the context of his work, which is available for download from SLPJ's site. That being the case, the one matter in which I intend to be meticulous is referencing his work by section number at every step I take.
There are a few extra wrinkles introduced into this approach by my personal tinkering. I'm not sure yet what effects they will have, and since I am attempting to explain this to myself as much as to anybody else at the moment, we'll have to see how they play out.
The main one is that - based on my own PhD thesis - I have deliberately designed the Flasck type system not to be strictly algebraic, but to accept the notion that there is a single universe of values, each of which has a "Constructor" and a set of (tagged) values. Numbers, booleans, characters and strings are somewhat "forced" into this system at the "code" level, although they are implemented natively in JavaScript. So, for example, Nil and Cons are considered separate entities, and the type List is "later" introduced to unify the set of values which is in either of them. The mathematical concept of non-evaluation (⊥) is (has to be) included in every type, but the notion of error values (which is often associated with ⊥) is hopefully excluded, although it will frequently end up being present in the function definitions.
The introduction of cards, state, methods and actors cannot, I feel, go unnoticed by the type system, although by dint of translating almost all of this away - or else externalizing it - I cannot actually see where the impact will be.
The object-oriented convention of field access - and tagged names - that I have used for "simplicity" should not have any effect on the overall validity of a program (it could always be translated away), but it may have an impact on some of the information we need to store and check.
Given that the context of this work is the Flasck compiler, for which the full source code is available in this repository, the "input" to the type checker is a set of function definitions grouped (basically) into scopes. At the point at which the type checker is invoked, the following steps have all been completed:
Thus, in comparison to PH9.1, I would define a program as a pair of a map of function names to type expressions for the externals, and a scope of new definitions. Note that this scope encompasses both the concept of a list of definitions and the concept of nested LET and LETREC definitions from nested scopes.
The first step in the algorithm (see PH8.4) is to rework this form into a set of either standalone or grouped definitions which can be individually typechecked, by doing dependency analysis across the definitions. The idea is to identify the dependencies between definitions so that "more free" definitions can be processed and added to our "store of knowledge" before subsequent definitions are attempted.
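To make that concrete, here is a hypothetical sketch of the grouping step. It is not the Flasck compiler's actual code: I am assuming a FunctionDefinition that can report the names it references, and using Tarjan's strongly-connected-components algorithm, which conveniently emits the groups dependencies-first:

import java.util.*;

// Assumed shape of a definition for this sketch: its name, plus the
// names of the other definitions it references.
interface FunctionDefinition {
    String name();
    Collection<String> references();
}

class DependencyGrouper {
    private final Map<String, FunctionDefinition> defns;
    private final Map<String, Integer> index = new HashMap<>();
    private final Map<String, Integer> lowlink = new HashMap<>();
    private final Deque<String> stack = new ArrayDeque<>();
    private final Set<String> onStack = new HashSet<>();
    private final List<Set<FunctionDefinition>> groups = new ArrayList<>();
    private int counter;

    DependencyGrouper(Map<String, FunctionDefinition> defns) { this.defns = defns; }

    // Tarjan's algorithm: each strongly connected component is a group of
    // mutually recursive definitions, and components come out
    // dependencies-first - exactly the "more free before less free" order.
    List<Set<FunctionDefinition>> group() {
        for (String name : defns.keySet())
            if (!index.containsKey(name))
                strongConnect(name);
        return groups;
    }

    private void strongConnect(String v) {
        index.put(v, counter); lowlink.put(v, counter); counter++;
        stack.push(v); onStack.add(v);
        for (String w : defns.get(v).references()) {
            if (!defns.containsKey(w)) continue;      // an external: already typed
            if (!index.containsKey(w)) {
                strongConnect(w);
                lowlink.put(v, Math.min(lowlink.get(v), lowlink.get(w)));
            } else if (onStack.contains(w))
                lowlink.put(v, Math.min(lowlink.get(v), index.get(w)));
        }
        if (lowlink.get(v).equals(index.get(v))) {    // v roots a component: pop it off
            Set<FunctionDefinition> scc = new HashSet<>();
            String w;
            do { w = stack.pop(); onStack.remove(w); scc.add(defns.get(w)); }
            while (!w.equals(v));
            groups.add(scc);
        }
    }
}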
It will be noted from the description of a program in PH9.1 that such a program is basically a single expression with all of its dependencies resolved through the use of LET and LETREC expressions. I propose to take the same approach in a piecemeal fashion: for each function definition, we treat that definition as "the program", with all of its dependencies introduced in these forms. However, we will, as much as possible, have tried to resolve these already. I'm somewhat concerned about the extent to which this is possible, because of the comment in PH8.4 about doing type checking before lambda lifting, lifting being a process that would remove some of these mutual dependencies. I (currently) believe that the issue here is that lifting would cause some bound-in-the-same-context lambda variables to become unbound, which would change the set of programs which would be allowed to pass (see the long discussion in PH8.5.1, particularly starting near the bottom of p152).
class TypeChecker {
    ErrorResult errors;                      // all problems are collected here; nothing is thrown
    Map<String, TypeScheme> environment;     // the "type environment": the equivalent of gamma
    VariableFactory tvFactory;               // a supply of fresh type variables
    List<Set<FunctionDefinition>> workToDo;  // dependency-analyzed groups of definitions
    void typecheck();
}
Obviously, since my implementation is intended for Java, I'm not going to stray too far from that, but since none of the code presented here is actually being run through a compiler, (a) I'm not going to stick too closely to the syntax; and (b) errors and typos are probably going to creep in.
This presentation is to be compared to the function tc in PH9.7. The TypeChecker will be instantiated with a set of known types and expressions that will go into the "type environment" (I think it may actually need more work than this to get the appropriate struct and type definitions where we need them, e.g. for field access). As the typechecker proceeds, it will add to that store of knowledge. This is equivalent to the parameter gamma.
In lieu of the (somewhat convoluted) system of a "supply of unique type variable names" (ns) necessary in a functional language, I have used a type variable factory to generate a stream of unique names as they are demanded by the type checker.
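A minimal sketch of what I have in mind (assuming a TypeVar class that just wraps a unique name; a minimal one is sketched a little later):

class VariableFactory {
    private int nextId = 0;
    // hand out a fresh, unique type variable on demand; this replaces
    // threading the name supply "ns" through every function
    TypeVar next() { return new TypeVar("T" + nextId++); }
}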
The workToDo variable is instantiated with the set of dependency-analyzed groups of function definitions, where a function is, at the very least, resolved to HSIE form before processing. Each of these will be presented to the internal typechecking function in the same way as the expression e is used in PH9.7.
The function typecheck() is called exactly once to do all the work. Any errors will be collected and stored in errors; no exceptions should be thrown. The typechecker may stop after the first block of function definitions which has any errors, although all the errors detected for that block will be reported. The reason for this is simply that subsequent blocks may depend on this block, and if this block fails to typecheck, those blocks cannot be typechecked either. Where that does not hold (i.e. later blocks are in fact independent of the failed one), they should still be typechecked and their errors reported.
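In outline, then - assuming a hypothetical per-group checker, and assuming ErrorResult can report how many errors it currently holds - typecheck() might look like this:

void typecheck() {
    for (Set<FunctionDefinition> group : workToDo) {
        int before = errors.count();   // assumed: ErrorResult can say how many it holds
        checkGroup(group);             // hypothetical per-group checker, sketched at the end
        if (errors.count() > before)
            return;  // later groups may depend on this one; a smarter version would
                     // carry on with any groups that are genuinely independent
    }
}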
On pages 172 and 173 he defines two functions (unknowns_scheme and sub_scheme) which are responsible for obtaining a list of all the unknown type variables in a scheme and for applying a substitution to the scheme in such a way that only the unknowns are affected. It seems reasonable that we would want both of these methods to be defined on the type scheme itself. I believe the remainder of PH9.5 (PH9.5.3, about association lists) can simply be replaced with the word Map. The approach is thus to have a structure of the following form:

class TypeScheme {
    List<TypeVar> schematicVars;  // the scheme's generic variables
    Object typeExpr;              // the underlying type expression
    // some method to return all the unknowns in typeExpr
    // some method to apply a substitution to just the unknowns in typeExpr
}
The use of a function phi to do all the hard work appears to be just a convenience for describing what amounts to a mapping from type variables to types. This appears to be approximately equivalent to constructing a Map from type variables to type expressions to do the same thing. In short, phi is the solution to the problem of unifying a system of equations.
The unification algorithm itself seems to progress one step at a time, unifying two equations at a time. The only places in the whole typechecking algorithm where it appears to actually be invoked are buried in function application, to try and unify the types of (A->B) and C in the expression (f x), and during the processing of letrecs, where it appears to be responsible for trying to bring together a set of equations. The latter is actually done by using the function unifyl (p170), which takes a list of equations and simply does a foldr over them, starting with the initial value being a pre-determined "solution", or phi.
It seems reasonable to me to define such a solution to a set of equations as a new class Solution, but to keep the link to the original text, for now at least (it may disappear in a refactoring), I think I'm going to call it PhiSolution. Each element of the solution is a "substitution", which is basically a Map.Entry from a type variable to a type expression. I think it probably also makes sense to put methods such as extend, unify and unifyl on the PhiSolution class.
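To pin that down, here is a sketch of what I imagine PhiSolution might look like, together with minimal TypeVar and TypeCons classes standing in for whatever the real type expression classes turn out to be; unify and substitute are left abstract here and sketched further down:

import java.util.*;

class TypeVar {                       // a type variable, compared by name
    final String name;
    TypeVar(String name) { this.name = name; }
    public boolean equals(Object o) { return o instanceof TypeVar && ((TypeVar) o).name.equals(name); }
    public int hashCode() { return name.hashCode(); }
}

class TypeCons {                      // a constructor applied to arguments, e.g. "->" or "List"
    final String name;
    final List<Object> args;
    TypeCons(String name, List<Object> args) { this.name = name; this.args = args; }
}

abstract class PhiSolution {
    protected final Map<TypeVar, Object> subst = new HashMap<>();

    // each entry in this map is one "substitution"
    void extend(TypeVar tv, Object texpr) { subst.put(tv, texpr); }
    Object lookup(TypeVar tv) { return subst.get(tv); }

    // unify two type expressions under this solution (the cases are sketched later)
    abstract boolean unify(Object t1, Object t2);

    // apply this solution throughout a type expression (also sketched later)
    abstract Object substitute(Object t);

    // unifyl (p170): fold unify across a list of equations, threading this
    // solution through as the accumulator
    boolean unifyl(List<Map.Entry<Object, Object>> eqns) {
        for (Map.Entry<Object, Object> e : eqns)
            if (!unify(e.getKey(), e.getValue()))
                return false;
        return true;
    }
}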
Given that we don't actually know what the types of f and x are, the first thing to do is to apply the type checking algorithm recursively to each of them.
Assuming all of that goes well, we next apply the unification algorithm based on the existing solution phi and the types Tf and (Tx -> T?) where T? is a new type variable.
Finally, if that all goes well, we return the result of applying the substitution phi to the newly introduced type variable T?, since this is the return type, i.e. the type of the application, which is what we want.
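Putting those three steps together, a hypothetical checkApply might read as follows (checkExpr, the error path, and the encoding of the arrow type as a two-argument TypeCons are all my assumptions):

Object checkApply(PhiSolution phi, Object f, Object x) {
    Object tf = checkExpr(phi, f);     // step 1: recursively check f ...
    Object tx = checkExpr(phi, x);     // ... and x
    TypeVar tr = tvFactory.next();     // T?, a brand new type variable
    // step 2: unify Tf with (Tx -> T?)
    if (!phi.unify(tf, new TypeCons("->", Arrays.asList(tx, tr))))
        return error(f, x);            // hypothetical: record and return an error type
    // step 3: the type of (f x) is phi applied to T?
    return phi.substitute(tr);
}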
Can we directly expand this to a multi-application scenario easily, or is it better to just break it down per application?
I think it's probably easiest to do it one step at a time, but to do it WITHIN a single block.
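In those terms, f a b is just ((f a) b), so a left-to-right fold over the arguments does one application per step, all within the block (again, checkExpr and error are my assumptions):

// hypothetical: check (f a1 a2 ... an) one application at a time
Object checkApplyChain(PhiSolution phi, Object f, List<Object> args) {
    Object t = checkExpr(phi, f);                // start from Tf
    for (Object arg : args) {
        Object ta = checkExpr(phi, arg);
        TypeVar tr = tvFactory.next();           // a fresh T? for this step
        if (!phi.unify(t, new TypeCons("->", Arrays.asList(ta, tr))))
            return error(f, arg);                // hypothetical error type
        t = phi.substitute(tr);                  // becomes the "function" for the next step
    }
    return t;
}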
Finally, apply nodes are stored in Closure objects. A closure consists of a set of values, each of which is a variable or a constant. One possible value for a variable is another closure, in which case that needs to be evaluated. I'm not quite sure where that leaves us. But hopefully it will all come out in the wash.
Jumping forward to p170, the unification algorithm is presented as three cases, but the second is basically the same as the first (indeed, it calls the first with the arguments in the reversed order), while the first case seems to me to really be two separate cases.
So, this is how I read the unification algorithm: look at the two type expressions that come in and then handle them case by case, as sketched below.
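In concrete terms, filling in the unify and substitute left abstract on the PhiSolution sketch above (and still assuming type expressions are only ever TypeVar or TypeCons nodes), the cases might look like this:

boolean unify(Object t1, Object t2) {
    if (t1 instanceof TypeVar) {                      // case 1: a variable on the left
        TypeVar tv = (TypeVar) t1;
        Object bound = lookup(tv);
        if (bound == null)                            // 1a: tv is unconstrained, so
            return extendChecked(tv, substitute(t2)); //     just bind it to t2
        return unify(bound, substitute(t2));          // 1b: tv is bound; unify what it points at
    }
    if (t2 instanceof TypeVar)
        return unify(t2, t1);                         // case 2: the mirror image of case 1
    TypeCons c1 = (TypeCons) t1, c2 = (TypeCons) t2;  // case 3: two constructors
    if (!c1.name.equals(c2.name) || c1.args.size() != c2.args.size())
        return false;                                 // e.g. List vs Number: failure
    for (int i = 0; i < c1.args.size(); i++)          // same constructor: unify
        if (!unify(c1.args.get(i), c2.args.get(i)))   // the arguments pairwise
            return false;
    return true;
}

// extend, with the occurs check folded in (the book keeps it inside extend)
private boolean extendChecked(TypeVar tv, Object t) {
    if (t.equals(tv)) return true;                    // binding tv to itself is a no-op
    if (occursIn(tv, t)) return false;                // reject infinite types
    extend(tv, t);
    return true;
}

private boolean occursIn(TypeVar tv, Object t) {
    if (t instanceof TypeVar) return t.equals(tv);
    for (Object arg : ((TypeCons) t).args)
        if (occursIn(tv, substitute(arg))) return true;
    return false;
}

// apply the current solution throughout a type expression
Object substitute(Object t) {
    if (t instanceof TypeVar) {
        Object bound = lookup((TypeVar) t);
        return bound == null ? t : substitute(bound);
    }
    TypeCons c = (TypeCons) t;
    List<Object> args = new ArrayList<>();
    for (Object a : c.args) args.add(substitute(a));
    return new TypeCons(c.name, args);
}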
Now, we have already seen, in laying our hands on a function definition, that we need to be able to resolve this; the tricky thing is making sure that we can type check things correctly even in the face of a schematic definition popping up multiple times in different contexts, and thus with different instantiated types. This is discussed in the "informal" presentation in PH8.5.3 and more clearly exemplified on p144 (PH8.2.1) with id (code (id 'a')), in which id clearly has both the type int->int and the type char->char, so that we need to make sure that we do an appropriate substitution of A in the "formal" type A->A.
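The standard move here - the book's newinstance - is to copy the scheme's type expression with every schematic variable replaced by a fresh one, leaving the unknowns alone. A hypothetical rendering against the TypeScheme above:

// Instantiate a TypeScheme: each schematic variable gets a fresh variable,
// so separate uses of id are free to unify to different types.
Object freshInstance(TypeScheme scheme, VariableFactory factory) {
    Map<TypeVar, Object> renaming = new HashMap<>();
    for (TypeVar sv : scheme.schematicVars)
        renaming.put(sv, factory.next());       // e.g. A |-> T12 at one use site, T17 at another
    return rewrite(scheme.typeExpr, renaming);
}

Object rewrite(Object t, Map<TypeVar, Object> renaming) {
    if (t instanceof TypeVar)
        return renaming.getOrDefault(t, t);     // unknowns (non-schematic vars) are left alone
    TypeCons c = (TypeCons) t;
    List<Object> args = new ArrayList<>();
    for (Object a : c.args) args.add(rewrite(a, renaming));
    return new TypeCons(c.name, args);
}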
The "first step" is to typecheck all the sub-expressions es, rendered tcl gamma ns1 es. Because of the different way we have ordered this, we only have one at a time; because we already have them "in stock" type checking them is just a question of looking the type up.
The next step is to apply tclet1 to the "actual" expression, which of course we don't really have. This repeated mismatch is starting to make me think that maybe I should take advantage of the fact that I do know up front which definitions the externals are, and rewrite in favor of using LET. But let's keep going for now. Basically, we seem to want to get a new, substituted type environment under the current phi and then use that to add a bunch of declarations for the variables.
Actually, that seems to work.
I propose an alternative strategy: follow the basic rule from Chapter 8 - this is just like a LET, but you don't substitute fresh variables in for schematic variables - and see where that takes us.
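As a sketch of what that rule might look like in the machinery above (monomorphicScheme, body() and the error reporting are all my inventions): each name in the group is bound to a bare, non-schematic type variable, the definitions are checked against those assumptions, and everything is unified.

// Hypothetical per-group checker following that rule: each name gets a plain
// type variable (a scheme with no schematic vars), so uses of the names
// inside the group must all unify with one monomorphic type.
void checkGroup(PhiSolution phi, Set<FunctionDefinition> group) {
    Map<String, TypeVar> assumed = new HashMap<>();
    for (FunctionDefinition fn : group) {
        TypeVar tv = tvFactory.next();
        assumed.put(fn.name(), tv);
        environment.put(fn.name(), monomorphicScheme(tv));   // hypothetical helper
    }
    for (FunctionDefinition fn : group) {
        Object actual = checkExpr(phi, fn.body());           // hypothetical body accessor
        if (!phi.unify(assumed.get(fn.name()), actual))
            errors.message(fn.name() + " does not typecheck"); // hypothetical reporting
    }
    // only once the whole group agrees can the resulting types be
    // generalized into schemes for the benefit of later groups
}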