pure is the Pure interpreter. The interpreter has an LLVM backend which JIT-compiles Pure programs to machine code, hence programs run blazingly fast and interfacing to C modules is easy, while the interpreter still provides a convenient, fully interactive environment for running Pure scripts and evaluating expressions.
If any source scripts are specified on the command line, they are loaded and executed, after which the interpreter exits. Otherwise the interpreter enters the interactive read-eval-print loop. You can also use the -i option to enter the interactive loop (continue reading from stdin) even after processing some source scripts. To exit the interpreter, just type the quit command or the end-of-file character (^D on Unix) at the beginning of the command line.
When the interpreter is in interactive mode and reads from a tty, commands are read using readline(3) (providing completion for all commands listed in section INTERACTIVE USAGE below, as well as for global function and variable symbols) and, when exiting the interpreter, the command history is stored in ~/.pure_history, from where it is restored the next time you run the interpreter.
Options and source files are processed in the order in which they are given on the command line. Processing of options and source files ends when either the -- or the -x option is encountered. The -x option must be followed by the name of a script to be executed, which becomes the ``main script'' of the application. In either case, any remaining parameters are passed to the executing script by means of the global argc and argv variables, denoting the number of arguments and the list of the actual parameter strings, respectively. In the case of -x this also includes the script name as argv!0. The -x option is useful, in particular, to turn Pure scripts into executable programs by including a ``shebang'' like
#!/usr/local/bin/pure -x
as the first line in your main script. (This trick only works with Unix shells, though.)
On startup, the interpreter also defines the version variable, which is set to the version string of the Pure interpreter, and the sysinfo variable, which provides a string identifying the host system. These are useful if parts of your script depend on the particular version of the interpreter and the system it runs on.
If available, the prelude script prelude.pure is loaded by the interpreter prior to any other other definitions, unless the -n option is specified. The prelude as well as other source scripts specified with a relative pathname are first searched for in the current directory and then in the directory specified with the PURELIB environment variable. If the PURELIB variable is not set, a system-specific default is used.
The -v option is most useful for debugging the interpreter, or if you are interested in the code your program gets compiled to. The level argument is optional; it defaults to 1. Six different levels are implemented at this time (two more bits are reserved for future extensions). For most purposes, only the first two levels will be useful for the average Pure programmer; the remaining levels are most likely to be used by the Pure interpreter developers.
These values can be or'ed together, and, for convenience, can be specified in either decimal or hexadecimal. Thus 0xff always gives you full debugging output (which isn't most likely be used by anyone but the Pure developers).
Note that the -v option is only applied after the prelude has been loaded. If you want to debug the prelude, use the -n option and specify the prelude.pure file explicitly on the command line. Alternatively, you can also use the interactive list command (see the INTERACTIVE USAGE section below) to list definitions along with additional debugging information.
Pure is a fairly simple language. Programs are simply collections of equational rules defining functions, let commands binding global variables, and expressions to be evaluated. Here's a simple example, entered interactively in the interpreter:
> // my first Pure example > fact 1 = 1; > fact n::int = n*fact (n-1) if n>1; > let x = fact 10; x; 3628800
The language is free-format (blanks are insignificant). As indicated, definitions and expressions at the toplevel have to be terminated with a semicolon. Comments have the same syntax as in C++ (using // for line-oriented and /* ... */ for multiline comments; the latter may not be nested). Lines beginning with #! are treated as comments, too; as already discussed above, on Unix-like systems this allows you to add a ``shebang'' to your main script in order to turn it into an executable program.
On the surface, Pure is quite similar to other modern functional languages like Haskell and ML. But under the hood it is a much more dynamic and reflective language, more akin to Lisp. In particular, Pure is dynamically typed, so functions can be fully polymorphic and you can add to the definition of an existing function at any time:
> fact 1.0 = 1.0; > fact n::double = n*fact (n-1) if n>1; > fact 10.0; 3628800.0 > fact 10; 3628800
Also, due to its term rewriting semantics, Pure can do symbolic evaluations:
> square x = x*x; > square (a+b); (a+b)*(a+b)
The Pure language provides built-in support for machine integers (32 bit), bigints (implemented using GMP), floating point values (double precision IEEE), character strings (UTF-8 encoded) and generic C pointers (these don't have a syntactic representation in Pure, though, so they need to be created with external C functions). Truth values are encoded as machine integers (as you might expect, zero denotes ``false'' and any non-zero value ``true'').
Expressions are generally evaluated from left to right, innermost expressions first, i.e., using call by value semantics. Pure also has a few built-in special forms (most notably, conditional expressions and the short-circuit logical connectives && and ||) which take some of their arguments using call by name semantics.
Expressions consist of the following elements:
Expressions are parsed according to the following precedence rules: Lambda binds most weakly, followed by when, with and case, followed by conditional expressions (if-then-else), followed by the ``simple'' expressions (i.e., all other kinds of expressions involving operators, function applications, constants, symbols and other primary expressions). Precedence and associativity of operator symbols are given by their declarations (in the prelude or the user's program), and function application binds stronger than all operators. Parentheses can be used to override default precedences and associativities as usual.
At the toplevel, a Pure program basically consists of rules a.k.a. equations defining functions, variable definitions a.k.a. global ``pattern bindings'', and expressions to be evaluated.
Pure also provides some abbreviations for factoring out common left-hand or right-hand sides in collections of rules; see below for details.
Basically, the same rule syntax is used to define functions at the toplevel and in with expressions, as well as inside case and when expressions for the purpose of performing pattern bindings (however, for obvious reasons guards are not permitted in when expressions). When matching against a function call or the subject term in a case expression, the rules are always considered in the order in which they are written, and the first matching rule (whose guard evaluates to a nonzero value, if applicable) is picked. (Again, the when construct is treated differently, because each rule is actually a separate pattern binding.)
In any case, the left-hand side pattern must not contain repeated variables (i.e., rules must be ``left-linear''), except for the ``anonymous'' variable `_' which matches an arbitrary value without binding a variable symbol. Moreover, a left-hand side variable may be followed by one of the special type tags ::int, ::bigint, ::double, ::string, to indicate that it can only match a constant value of the corresponding built-in type. (This is useful if you want to write rules matching any object of one of these types; note that there is no way to write out all ``constructors'' for the built-in types, as there are infinitely many.)
The left-hand side of a rule can be omitted if it is the same as for the previous rule. This provides a convenient means to write out a collection of equations for the same left-hand side which discriminates over different conditions:
lhs = rhs if guard;
= rhs if guard;
...
= rhs otherwise;
Pure also allows a collection of rules with different left-hand sides but the same right-hand side(s) to be abbreviated as follows:
lhs |
...
lhs = rhs;
This is useful if you need different specializations of the same rule which use different type tags on the left-hand side variables. For instance:
fact n::int |
fact n::double |
fact n = n*fact(n-1) if n>0;
= 1 otherwise;
In fact, the left-hand sides don't have to be related at all, so that you can also write something like:
foo x | bar y = x*y;
The same works in case expressions, which is convenient if different cases should be mapped to the same value, e.g.:
case ans of "y" | "Y" = 1; _ = 0; end;
Here are some more definitions showing most of the elements discussed above in action:
fact n = n*fact (n-1) if n>0;
= 1 otherwise;
fib n = a when a, b = fibs n end
with fibs n = 0, 1 if n<=0;
= case fibs (n-1) of
a, b = b, a+b;
end;
end;
let facts = map fact (1..10); let fibs = map fib (1..100);
facts; fibs;
This is a little list comprehension example: Erathosthenes' classical prime sieve.
primes n = sieve (2..n) with sieve [] = []; sieve (p:qs) = p : sieve [q; q = qs; q mod p]; end;
For instance:
> primes 100; [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97]
If you dare, you can actually have a look at the catmap-lambda-if-then-else expression the comprehension expanded to:
> list primes primes n = sieve (2..n) with sieve [] = []; sieve (p:qs) = p:sieve (catmap (\q -> if q mod p then [q] else []) qs) end;
We mention in passing that list comprehensions are also a useful device to organize backtracking searches. For instance, here's an algorithm for the n queens problem, which returns the list of all placements of n queens on an n x n board (encoded as lists of n pairs (i,j) with i = 1..n), so that no two queens hold each other in check.
queens n = search n 1 [] with
search n i p = [reverse p] if i>n;
= cat [search n (i+1) ((i,j):p); j = 1..n; safe (i,j) p];
safe (i,j) p = not any (check (i,j)) p;
check (i1,j1) (i2,j2)
= i1==i2 || j1==j2 || i1+j1==i2+j2 || i1-j1==i2-j2;
end;
> catch error (throw hello_world); error hello_world
Exceptions are also generated by the runtime system if the program runs out of stack space, when a guard does not evaluate to a truth value, and when the subject term fails to match the pattern in a pattern-matching lambda abstraction, or a let, case or when construct. These types of exceptions are reported using the symbols stack_fault, failed_cond and failed_match, respectively, which are declared as constant symbols in the standard prelude. You can use catch to handle these kinds of exceptions just like any other. For instance:
> fact n = if n>0 then n*fact(n-1) else 1; > catch error (fact foo); error failed_cond > catch error (fact 100000); error stack_fault
(You'll only get the latter kind of exception if the interpreter does stack checks, see the discussion of the PURE_STACK environment variable in the CAVEATS AND NOTES section.)
Note that unhandled exceptions are reported by the interpreter with a corresponding error message:
> fact foo; <stdin>:2.0-7: unhandled exception 'failed_cond' while evaluating 'fact foo'
Exceptions can also be used to implement non-local value returns. For instance, here's a variation of our n queens algorithm which only returns the first solution. Note the use of throw in the recursive search routine to bail out with a solution as soon as we found one. The value thrown there is caught in the main routine. If no value gets thrown, the function regularly returns with () to indicate that there is no solution.
queens1 n = catch reverse (search n 1 []) with
search n i p = throw p if i>n;
= void [search n (i+1) ((i,j):p); j = 1..n; safe (i,j) p];
safe (i,j) p = not any (check (i,j)) p;
check (i1,j1) (i2,j2)
= i1==i2 || j1==j2 || i1+j1==i2+j2 || i1-j1==i2-j2;
end;
E.g., let's compute a solution for a standard 8x8 board:
> queens 8; (1,1):(2,5):(3,8):(4,6):(5,3):(6,7):(7,2):(8,4):[]
infixl 6 + - ; infixl 7 * / div mod ;
Moreover, constant symbols are introduced using a declaration of the form:
nullary symbol ...;
Examples for all of these can be found in the prelude which declares a bunch of standard (arithmetic, relational, logical) operator symbols as well as the list and pair constructors `:' and `,' and the constant symbols `[]' and `()' denoting the empty list and tuple, respectively.
> extern double sin(double); > sin 0.3; 0.29552020666134
For clarity, the parameter types can also be annotated with parameter names, e.g.:
extern double sin(double x);
Parameter names in prototypes only serve informational purposes and are for the human reader; they are effectively treated as comments by the compiler.
The interpreter makes sure that the parameters in a call match; if not, the call is treated as a normal form expression. The range of supported C types is a bit limited right now (void, bool, char, short, int, long, double, as well as arbitrary pointer types, i.e.: void*, char*, etc.), but in practice these should cover most kinds of calls that need to be done when interfacing to C libraries.
Since Pure only has 32 bit machine integers and GMP bigints, a variety of C integer types are provided which are converted from/to the Pure types in a straightfoward way. The short type indicates 16 bit integers which are converted from/to Pure machine ints using truncation and sign extension, respectively. The long type always denotes 64 bit integers, even if the corresponding C type is actually 32 bit (as it usually is on most contemporary systems). This type is to be used if a C function takes or returns 64 bit integer values. For a long parameter you can either pass a Pure machine int (which is sign-extended to 64 bit) or a Pure bigint (which is truncated to 64 bit if necessary). 64 bit return values are always converted to (signed) Pure bigints.
Concerning the pointer types, char* is for string arguments and return values which need translation between Pure's internal utf-8 representation and the system encoding, while void* is for any generic kind of pointer (including strings, which are not translated when passed/returned as void*). Any other kind of pointer (except expr*, see below) is effectively treated as void* right now, although in a future version the interpreter may keep track of the type names for the purpose of checking parameter types.
The expr* pointer type is special; it indicates a Pure expression parameter or return value which is just passed through unchanged. All other types of values have to be ``unboxed'' when they are passed as arguments (i.e., from Pure to C) and ``boxed'' again when they are returned as function results (from C to Pure). All of this is handled by the runtime system in a transparent way, of course.
It is even possible to augment an external C function with ordinary Pure equations, but in this case you have to make sure that the extern declaration of the function comes first. For instance, we might want to extend our imported sin function with a rule to handle integers:
> sin 0; sin 0 > sin x::int = sin (double x); > sin 0; 0.0
Sometimes it is preferable to replace a C function with a wrapper function written in Pure. In such a case you can specify an alias under which the original C function is known to the Pure program, so that you can still call the C function from the wrapper. An alias is introduced by terminating the extern declaration with a clause of the form ``= alias''. For instance:
> extern double sin(double) = c_sin; > sin x::double = c_sin x; > sin x::int = c_sin (double x); > sin 0.3; sin 0; 0.29552020666134 0.0
External C functions are resolved by the LLVM runtime, which first looks for the symbol in the C library and Pure's runtime library (or the interpreter executable, if the interpreter was linked statically). Thus all C library and Pure runtime functions are readily available in Pure programs. Other functions can be provided by including them in the runtime, or by linking the interpreter against the corresponding modules. Or, better yet, you can just ``dlopen'' shared libraries at runtime with a special form of the using clause:
using "lib:libname[.ext]";
For instance, if you want to call the GMP functions directly from Pure:
using "lib:libgmp";
After this declaration the GMP functions will be ready to be imported into your Pure program by means of corresponding extern declarations.
Shared libraries opened with using clauses are searched for on the usual system linker path (LD_LIBRARY_PATH on Linux). The necessary filename suffix (e.g., .so on Linux or .dll on Windows) will also be supplied automatically. You can also specify a full pathname for the library if you prefer that. If a library file cannot be found, or if an extern declaration names a function symbol which cannot be resolved, an appropriate error message is printed.
> fact 1 = 1; > fact n = n*fact (n-1) if n>1; > let x = fact 10; x; 3628800 > map fact (1..10); [1,2,6,24,120,720,5040,40320,362880,3628800]
When running interactively, the interpreter also accepts a number of special commands useful for interactive purposes. Here is a quick rundown of the currently supported operations:
Note that these special commands are only recognized at the beginning of the interactive command line. (Thus you can escape a symbol looking like a command by prefixing it with a space.)
Some commands which are especially important for effective operation of the interpreter are discussed in more detail in the following sections.
Note that some of the options (in particular, -c and -d) may produce excessive amounts of information. By setting the PURE_MORE environment variable accordingly, you can specify a shell command to be used for paging, usually more(1) or less(1) .
For instance, to list all definitions in all loaded scripts (including the prelude), simply say:
> list
This may produce quite a lot of output, depending on which scripts are loaded. The following command will only show summary information about the variable symbols along with their current values (using the ``long format''):
> list -lv argc var argc = 0; argv var argv = []; sysinfo var sysinfo = "i686-pc-linux-gnu"; version var version = "0.4"; 4 variables
If you're like me then you'll frequently have to look up how some operations are defined. No sweat, with the Pure interpreter there's no need to dive into the sources, the list command can easily do it for you. For instance, here's how you can list the definitions of all list ``zipping'' operations from the prelude in one go:
> list -g zip* zip (x:xs) (y:ys) = (x,y):zip xs ys; zip _ _ = []; zip3 (x:xs) (y:ys) (z:zs) = (x,y,z):zip3 xs ys zs; zip3 _ _ _ = []; zipwith f (x:xs) (y:ys) = f x y:zipwith f xs ys; zipwith f _ _ = []; zipwith3 f (x:xs) (y:ys) (z:zs) = f x y z:zipwith3 f xs ys zs; zipwith3 f _ _ _ = [];
Each save command introduces a new temporary level, and each subsequent clear command ``pops'' the symbols and definitions on the current level (including any definitions read using the run command) and returns you to the previous one. This gives you a ``stack'' of up to 255 temporary environments which enables you to ``plug and play'' in a safe fashion, without affecting the rest of your program. Example:
> save save: now at temporary definitions level #2 > foo (x:xs) = x+foo xs; > foo [] = 0; > list foo foo (x:xs) = x+foo xs; foo [] = 0; > foo (1..10); 55 > clear This will clear all temporary definitions at level #2. Continue (y/n)? y clear: now at temporary definitions level #1 > list foo > foo (1..10); foo [1,2,3,4,5,6,7,8,9,10]
We've seen already that normally, if you enter a sequence of equations, they will be recorded in the order in which they were written. However, it is also possible to override definitions in lower levels with the override command:
> foo (x:xs) = x+foo xs; > foo [] = 0; > list foo foo (x:xs) = x+foo xs; foo [] = 0; > foo (1..10); 55 > save save: now at temporary definitions level #2 > override > foo (x:xs) = x*foo xs; > list foo foo (x:xs) = x*foo xs; foo (x:xs) = x+foo xs; foo [] = 0; > foo (1..10); 0
Note that the equation `foo (x:xs) = x*foo xs;' was inserted before the previous `foo (x:xs) = x+foo xs;' rule, which is at level #1.
Even in override mode, new definitions will be added after other definitions at the current level. This allows us to just continue adding more high-priority definitions overriding lower-priority ones:
> foo [] = 1; > list foo foo (x:xs) = x*foo xs; foo [] = 1; foo (x:xs) = x+foo xs; foo [] = 0; > foo (1..10); 3628800
Again, the new equation was inserted above the existing lower-priority rules, but below our previous `foo (x:xs) = x*foo xs;' equation entered at the same level. As you can see, we have now effectively replaced our original definition of `foo' with a version that calculates list products instead of sums, but of course we can easily go back one level to restore the previous definition:
> clear This will clear all temporary definitions at level #2. Continue (y/n)? y clear: now at temporary definitions level #1 clear: override mode is on > list foo foo (x:xs) = x+foo xs; foo [] = 0; > foo (1..10); 55
Note that clear reminded us that override mode is still enabled (save will do the same if override mode is on while pushing a new definitions level). To turn it off again, use the underride command. This will revert to the normal behaviour of adding new equations below existing ones:
> underride
Tuples and parentheses. Please note that parentheses are really only used to group expressions and are not part of the tuple syntax; tuples are in fact not really part of the Pure language at all, but are implemented in the prelude. As you can see there, the pairing operator `,' used to construct tuples is (right-)associative. We call these the ``poor man's tuples'' since they are always flat and thus there are no nested tuples (if you need this then you should use lists instead). This also implies that an expression like [(1,2),(3,4)] is in fact exactly the same as [1,2,3,4]. If you want to denote a list of tuples, you must use the syntax (1,2):(3,4):[] instead; this is also the notation used when the interpreter prints such objects.
Special forms. Special forms are recognized at compile time only. Thus the catch function as well as the short-circuit logical connectives && and || are only treated as special forms in direct (saturated) calls. They can still be used if you pass them around as function values or partial applications, but in this case they lose all their special call-by-name argument processing.
Manipulating function applications. The ``head = function'' rule means that the head symbol f of an application f x1 ... xn occurring on (or inside) the left-hand side of an equation, pattern binding, or pattern-matching lambda expression, is always interpreted as a literal function symbol (not a variable). This implies that you cannot match the ``function'' component of an application against a variable, and thus you cannot directly define a generic function which operates on arbitrary function applications. As a remedy, the prelude provides three operations to handle such objects: applp, a predicate which checks whether a given expression is a function application, and fun and arg, which determine the function and argument parts of such an expression, respectively. (This may seem a little awkward, but as a matter of fact the ``head = function'' rule is quite convenient since it covers the common cases without forcing the programmer to declare ``constructor'' symbols (except nullary symbols). Also note that in standard term rewriting you do not have rules parameterizing over the head symbol of a function application either.)
Numeric types. If possible, you should always decorate numeric variables on the left-hand sides of function definitions with the appropriate type tags, like ::int or ::double. This often helps the compiler to generate better code and makes your programs run faster.
Talking about the built-in types, please note that int (the machine integers) and bigint (the GMP ``big'' integers) are really different kinds of objects, and thus if you want to define a function operating on both kinds of integers, you'll also have to provide equations for both. This also applies to equations matching against constant values of these types; in particular, a small integer constant like `0' only matches machine integers, not bigints; for the latter you'll have to use the ``big L'' notation `0L'.
External C functions. The interpreter always takes your extern declarations of C routines at face value. It will not go and read any C header files to determine whether you actually declared the function correctly! So you have to be careful to give the proper declarations, otherwise your program will probably segfault calling the function.
You also have to be careful when passing generic pointer values to external C routines, since currently there is no type checking for these; any pointer type other than char* and expr* is effectively treated as void*. This considerably simplifies lowlevel programming and interfacing to C libraries, but also makes it very easy to have your program segfault all over the place! Therefore it is highly recommended that you wrap your lowlevel code in Pure routines and data structures which do all the checks necessary to ensure that only the right kind of data is passed to C routines.
Stack size and tail recursion. Pure programs may need a considerable amount of stack space to handle recursive function calls, and the interpreter itself also takes its toll. So you may have to configure your system accordingly (8 MB of stack space is recommended for 32 bit systems, systems with 64 bit pointers probably need more). If the PURE_STACK environment variable is defined, the interpreter performs advisory stack checks and raises a Pure exception if the current stack size exceeds the given limit. The value of PURE_STACK should be the maximum stack size in kilobytes. Please note that this is only an advisory limit which does not change the program's physical stack size. Your operating system should supply you with a command such as ulimit(1) to set the real process stack size. Also note that this feature isn't 100% foolproof yet, since for performance reasons the stack will be checked only on certain occasions, such as entry into a global function.
Fortunately, Pure normally does proper tail calls (if LLVM provides that feature on the platform at hand), so most tail-recursive definitions should work fine in limited stack space. For instance, the following little program will loop forever if your platform supports the required optimizations:
loop = loop;
In the current implementation, a tail call will be eliminated only if the call is done directly, i.e., through an explicit call, not through a (global or local) function variable. Otherwise the call will be handled by the runtime system which is written in C and can't do proper tail calls because C can't (at least not in a portable way). This also affects mutually recursive global function calls, since there the calls are handled in an indirect way, too, through an anonymous global variable. (This is done so that a global function definition can be changed at any time during an interactive session, without having to recompile the entire program.) However, mutual tail recursion does work with local functions, so it's easy to work around this limitation.
Scheme programmers should note that conditional expressions (if-then-else) are tail-recursive in both branches, just like in Scheme, while the logical operators && and || are not tail-recursive. This is because the logical operators always return a proper truth value (0 or 1) which wouldn't be possible with tail call semantics.