Popcorn Reference Manual

DRAFT

April 1999

Greg Morrisett

Overview:

Popcorn is a research prototype programming language that we (and others) are using to explore various issues in type-safe language design, type-directed compilation, proof carrying code, safe runtime code generation, and other issues that arise in modern language design and implementation.  Syntactically, Popcorn resembles a mix of C, C++, and Java and most of the control constructs and scoping mechanisms are lifted directly from these languages.  With respect to type and type structure, Popcorn is more akin to the ML family of languages, in that it provides support for parametric polymorphism, some type inference, algebraic datatypes with pattern matching, etc.  In practice, it is relatively straightforward to port C code to Popcorn.   Porting ML code is also straightforward though the syntax and emphasis on imperative aspects makes this a bit more involved.  In the near future, we are planning support for various kinds of objects, classes and subtyping, perhaps along the lines of LOOM [?], Cecil [?], or Moby [?] with the goal of being able to port code from C++, Java, or these OO-based languages relatively easily. 

Smoothly integrating all of these facilities is extremely difficult and we view Popcorn as a vehicle for exploring tradeoffs.  Consequently, we expect the definition of the language to evolve quite rapidly over time.  To limit the scope of exploration, we have a determined a set of (loose) goals that we wish to follow as the language evolves.  In particular, Popcorn should be:

As it stands, Popcorn fails to achieve all of these goals in a nice way (thankfully -- that's why there's work to do!) but goes pretty far in comparison to other languages.   As such, we think it can be used today to build interesting systems and would encourage anyone to try it and give us feedback about the design and/or implementation.   Some good questions for us to answer are:

This document is meant to give you a brief overview of the syntax, type-system, dynamic semantics, and tools that currently make up the Popcorn language.  We have tried to detail the strengths and weaknesses of the language here and to point out areas where we expect the language to improve (or at least change in some substantial way.)  We would appreciate any criticism, comments, suggestions, etc. from users. 

Popcorn and Typed Assembly Language:

Originally, Popcorn was developed as a vehicle for showing that type-safe, high level languages could be compiled to Typed Assembly Language (TAL).   Thus, to understand some of the design issues of Popcorn, it is useful to know a little bit about TAL. 

Like the Java Virtual Machine Language (JVML), TAL is a statically typed low-level language intended to serve as the output of a compiler for a type-safe high-level language.  Such languages are extremely useful in security contexts where we wish to ensure that a number of "bad" things cannot happen when code is executed.  In particular, type-safety (in the TAL sense) at least implies address space isolation, so it is not necessary to use hardware-enforced address translation to ensure that a TAL program does not read or write locations in memory that are not in its conceptual address space.  As with JVML, the type system of TAL ensures many stronger safety properties, including the fact that "private" fields of data structures cannot be read or written, all functions are passed the right number (and type) of arguments, stack frames from a caller are protected from a callee reading or writing local variables, etc. 

But unlike the JVML, TAL instructions correspond to real Intel 80x86 instructions instead of virtual machine instructions. Consequently, there is no overhead for interpretation, and we can use a fully optimizing compiler to produce our TAL code, but still be able to type-check.  Indeed, for critical loops, it is possible to code directly in TAL to achieve the best possible performance (subject to the limitations of the type system.)   Thus, TAL seems like a more attractive target for systems such as the PLAN Active Networks project at the University of Pennsylvania [?] that need safe but highly efficient code to achieve their goals. 

Furthermore, the TAL type system provides a set of general type constructors which may be used to encode typing abstractions for a number of type-safe, high-level languages such as SML, Scheme, or Modula-3.  In contrast, the JVML type system is closely related to the Java source language and is not very well suited to compiling other languages.  For instance, it is not possible to eliminate tail-calls to another method using JVML, but it is trivial to do so in TAL.  This is because the JVML design bakes in a fixed notion of method, calling conventions, etc., whereas TAL provides a set of general-purpose type constructors that can be used to build custom procedural abstractions, custom calling conventions, etc.  As another example, it is difficult to compile high-level languages such as SML that provide parametric polymorphism or higher-order type constructors to JVML because the type system of JVML does not have such facilities.  Rather, a compiler must insert additional run-time checks (downcasts) to convince the JVML static verifier to pass the code, and these run-time checks can hurt performance.  Again, because TAL provides support for features like parametric polymoprhism, it is not necessary to insert run-time checks to get such code past the verifier. 

Popcorn was designed as a relatively simple, type-safe language that could be easily compiled to TAL in order to demonstrate all of these claims about the advantages of TAL.  For more information on TAL, please refer to the following publications:

 

or visit the TAL project homepage at http://www.cs.cornell.edu/talc.


Some Popcorn Examples:


The Popcorn Language Reference:


Lexical Conventions:

To do:  similar to C and Java.  Key thing to note, comments can be either C-style (/* ... */) or C++/Java style (// ...). Upper/lower case is significant, etc.

Reserved Words:

abstract bool break byte case char chr const continue default do else 
exn exception extern for fprintf handle if int new new_array new_string 
null open ord prefix printf public raise return short signed size static 
string struct switch try union unsigned void while 

Names, NameSpaces, and Qualifiers:

To do:  similar to C++.  Use Foo::bar where Foo is the "name space" (think structure or module).  You can write prefix Foo { <top-level-decls> } to cause the Foo:: to be implicitly appended to the front of identifiers defined within the braces.  You can write open Foo { ... } in a context where you don't want to have to write Foo::bar all the time and just want to write bar.   This is currently broken.

Also, need to talk about public, static, abstract, extern, and const qualifiers.


Type Expressions:

Popcorn provides a number of primitive types and type constructors similar to that of ML.  In particular, the language defines a void type (similar to ML's unit type), various integer types (signed and unsigned), string and array types, (polymorphic) function types, (polymorphic) struct and union types, and exception types.  In this section, we detail the syntax and informal semantics of type expressions and declarations.  

  <type-exp> ::= <prim-type> <type-modifiers>
  <prim-type> ::= <type-var> | void | <integer-type> | bool | char | 
                  string | *(<type-exp>,...,<type-exp>) | exn | 
                  [<<type-exp>,...,<type-exp>>]<id>
  <type-var> ::= '[a-zA-Z_][a-zA-Z0-9_]*
  <integer-type> ::= [unsigned] int | [unsigned] short | [unsigned] byte
  <type-modifiers> ::= <empty> | [] <type-modifiers> | 
                       <fntype-modifier> <type-modifiers>
  <fntype-modifier> ::= [<qid>][<<type-var>,...,<type-var>>](<type-exp>,...,<type-exp>)

Type Variables:

  'a, 'b, 'foo

Type variables are written as identifiers with a single quote ("'") at the beginning.  Type variables are used within polymorphic type expressions (e.g., the types of polymorphic functions) or within declarations of polymorphic type constructors (e.g., structs and unions).  Currently, not all types may be used to instantiate a given type variable.  Intuitively, only types whose size is a word (4 bytes) may be used to instantiate a type variable, and void may not be used to instantiate a variable. 

The Void Type:

  void

The void type is used to indicate that a function returns no values or that a union data constructor carries no values.  It is illegal to instantiate a type variable with void.  It is expected that this restriction will be lifted at some point in the future. 

Integers Types:

 [unsigned] int
  [unsigned] short 
  [unsigned] byte

There are ? types corresponding to integers:  int, unsigned int, short, unsigned short, byte, and unsigned byte.  The int type represents a signed, 32-bit integer value.  The short type represents a signed, 16-bit value, and the byte type represents a signed, 8-bit value.  Clearly, the unsigned qualifier represents the unsigned 32, 16, or 8-bit value space.  We expect to add support for long and unsigned long (64-bit integer values) as well as floats and doubles in the near future.   Arithmetic on these types should work the same as under C, and operators such as +,-,*, etc. are overloaded as appropriate.  Note that the C semantics for performing arithmetic on mixed-size objects requires that the values be cast up to the largest appropriate size.  See the C reference manual for details. 

Decimal-based integer constants are written as in C.  Popcorn also supports  hexadecimal (e.g., 0xfeedface).  We expect to add support for octal, binary, and other bases in the near future. 

The default initial value for variables of the integer types is 0 (zero).

The Boolean Type:

  bool

The type bool represents the type of boolean values.  The constants true and false are used to create boolean values.  Most comparison operations (e.g., "==", "<=", "!=", etc.) return a boolean value and both the conditional expression and conditional statement expect boolean expressions as their first argument. 

The default initial value for variables of type bool is false.

The Character Type:

  char

The type char is distinct from the integer types but one may convert chars to and from integers via ord () and chr().  Currently, chars are 8-bit, ASCII representations of characters.  At some point, we may have a separate type for unicode characters or change char to unicode.  Character literals are as in C (e.g., 'a', 'b', '\n') including escaped characters.  Switch expressions can perform a match on character expressions and some simple arithmetic (may be?) is supported on characters.  

The default initial value for variables of type char is '\000'.

The String Type:

  string

Strings are like arrays of chars.  Unlike Java, they are mutable and unlike C or C++, strings are not \000 terminated and come with their own size.   String subscript and update are written in a style analogous to array subscript and update (e.g., s[i] = s[i+1]).  Operations on strings, as with arrays, check that the index is in bounds before proceeding and raise the ??? exception if the index is out of bounds.  String literals are written as in C (e.g., "foobar", "baz\n") including escaped characters.   New (\000-filled) strings may be created by using the new_string function defined in the core library.  The core library also provides other string manipulation functions similar to the C standard library (e.g., strcmp, strcopy, etc.)

The default initial value for variables of type string is the empty string ("").

We expect to add more built-in operations on strings including some forms of pattern matching. 

The Array Type:

  int[]   // an array of integers
  char[]  // an array of characters
  int[][] // an array of an array of integers

As in C, array types are declared by putting a pair of brackets ([]) after a type expression.  Unlike C, you cannot specify the size of the array in the type, nor can you specify multi-dimensional arrays (directly).  In this respect, the arrays of Popcorn are more like Java arrays.  Association of [] is to the left.  Arrays literals, as in C, are created using {exp1,...,expn}.  Empty arrays (arrays of size 0) have a special syntax:  {:<type-exp>} so that the type-checker can easily identify the type of the expression.  Array subscript and update are written as in C (e.g., x[i] = x[i+1]) and the indices are always checked to see if they are in bounds.   If not, the ??? exception is raised.  New arrays may be created by calling the polymorphic new_array function which takes an initial element of the appropriate type and an integer size (greater than or equal to 0). 

The default initial value for variables of type array is a 0-sized array.

Tuple Types:

  *(int,int)         // the type of a pair of integers
  *(int,string,char) // a triple of an int, string, and character
  *(int[],int f())   // a pair of an integer array, function

Tuples are ordered n-tuples of their respective component types.   Tuples are particularly useful for returning multiple values or for packaging up multiple values within a union, especially since Popcorn requires that you name your struct types (i.e., does not support anonymous structs.)   If e is an expression of type type, then the components of e may be accessed by writing "e.1", "e.2", "e.3", etc.  New tuples may be created by writing "^*(<exp>,...,<exp>)" or "new *(<exp>,...,<exp>)". 

Currently, tuples are mutable but we might change this or might allow additional qualifiers to declare whether a field is mutable or not.  We expect to add tuple pattern support in the near future. 

The default initial value for variables of tuple type is a tuple of the default initial values of the respective component types.

Function Type Expressions:

  int f(short,short) // a function that takes two shorts and returns an int
  int (short,short)  // also a function that takes two shorts and returns an int
  int (short x,short y) // ditto -- you can also name arguments
  int[] (char) // takes a char and returns an integer array
  int length<'a>(<'a>list) // for all types 'a, takes an <'a>list and 
                           // returns an int
  int <'a>(<'a>list x)     // same as previous
  int (<'a>list)           // same as previous
  'b  ('b f('a,'b), <'a>list, 'b accum) 
    // takes three arguments, the first is a function from 'a and 'b to
    // 'b, the second is an <'a>list, and the third is a 'b.  The entire
    // function returns a 'b value.
  int (int)[]              // an array of functions from int to int

Functions type syntax is by far the most confusing because C fundamentally gets this wrong.  However, porting C code with a different notation for types is quite difficult so we have attempted to keep the C syntax as much as possible. 

For monomorphic, first-order functions, the syntax is essentially the same as in C:  the result type comes first, followed by an [optional] identifier, then a left parenthesis, followed by a comma-separated list of argument types, and then a right parentheses.   In addition, the arguments can be given an optional identifier for a name.  Optional identifiers for the function name and the argument names are ignored by the type system, so "int f(short x,short y)" is equivalent to "int (short,short)". 

For polymorphic functions, we can also make the type parameters explicit.   As in C++, these are enclosed by "<" and ">" and are comma separated type variables (identifiers starting with a single quote "'").   So, for instance, "'a id<'a>('a)" is the type of the polymorphic identity function and is equivalent to "'a id('a)" or simply "'a ('a)".  If explicit type parameters are omitted, then they are automatically inferred.  The order of the bindings of the type parameters is the order in which they appear (lexicographically) within the argument types, followed by the result type.  

Declaring a function and its type is detailed in Section [?].  There is no default value for variables of function type -- that is, programmers must explicitly supply such a value. 

The Exception Type:

  exn

Exception values are created with a new expression by applying an exception constructor to a value of the appropriate type.  The resulting value has type exn and may be used by a raise expression.  When a try/handle expression is executed, if the body of the try block raises an exception, then the handle is given an exn value.  The handle expression typically performs a switch on the exn value to determine which exception was raised and to extract any value(s) placed in the exn data structure.  In this respect, exn values are much like ML's exn type.  New exception constructors are declared using the exception declaration (see Section [?]).  

There is no default value for variables of exn type.

Type Constructor Expressions:

  bar                // the bar type takes no arguments
 <int>list          // the list type takes one argument -- int in this case
 <int,string>dict   // the dict type takes two arguments: int & string here
 <int[]>list        // a list of integer arrays

Type constructors are, as in ML, applied to type arguments to produce a type expression.  Also as in ML, the application occurs on the left.  As in C++, the arguments to the constructor are placed in "<" and ">" brackets and are comma-separated.

Type constructors are created either through a struct declaration (see Section [?]) or a union declaration (see Section [?]).  In the examples above, let us assume that we have declared the following struct declarations:

  struct bar {int x; int y; int z; }
  ?struct <'a>list { 'a hd; <'a>list tl; }
  struct <'a,'b>dict { int comp('a,'a); <*('a,'b)>list contents; }

Then bar is declared as a type constructor with no arguments, list is a type constructor with one argument, and dict is a type constructor with two arguments.   Using the type expression "<int>list" is similar to writing the following monomorphic struct declaration and using it instead:

  ?struct int_list { int hd; int_list tl; }

However, the type system treats "<int>list" and "int_list" as different types.

For most variables declared as having a type-constructor type, there is no default initial value.  The exception to this is constructors declared as ?struct types (optional pointers to structs) which have "null" as the default initial value (similar to Java objects.)  


Declarations:

Top-Level Variable Declarations:

  <top-var-decl> ::= [<scope>]<type-exp> <qid><type-modifiers>[= <const-exp>];
  static int x;  // a top-level, private variable x of type int
  int x = 1; // a top-level, public variable x of type int, initial value 1
  int x,y=3,z; // x, y, and z all public of type int, y has initial value 3
  static int[] a; // a is a private array of integers
  static int a[]; // same as previous

  extern int x; // x is an externally-defined integer variable
  extern string z; // z is an externally-defined string variable

Top-level declarations for variables are prefixed with an optional scope modifier (either "extern" or "static"), followed by a type and then the variable named, followed by an optional initial value.  Multiple variable definitions may be collapsed if they have the same scope and type.  Initial values for top-level expressions must be "constant" expressions and may not include function calls.  Constant expressions are detailed in Section [?], but roughly speaking consist of integer constants, string and array constants, new expressions (where the arguments are constants), null, and a few other expressions.  Top-level variables are not allowed to be polymorphic, though they may contain values that refer to polymorphic functions.

To support C-style declarations of variables of array type or functional type, it is possible to move part of the type expression onto the variable being declared.   For instance, we may write "int x[]" to declare x to be an array of integers.

The static qualifier defines a variable whose scope is the current compilation unit (usually a file).  That is, the variable is only accessible to those functions defined in the same compilation unit. 

The extern qualifier claims that such a variable is defined by some other compilation unit with the associated type.  Obviously, only one of static or extern may apply to a given definition. 

If no initializer is provided for the variable, then the compiler implicitly inserts an initializer.  Not all types admit a default initial value and so programmers must provide them.  Refer to Section [?] which describes types to see which types admit default initial values and if so, what those values are.

In the future, we expect additional qualifiers including "const" (immutable variables).  The restriction that all variables must be initialized is a painful one, especially with top-level variables of abstract type and we hope to lift this restriction some day.  However, the TAL module system only supports primitive data values (much like a standard assembler) and does not support function calls for initializing variables, hence the restriction.  In the meantime, it is relatively easy to work around the restriction.  For example, if you wish to define a variable x of abstract type FOO but don't have an initial value of type FOO, then you can instead define x to have type <FOO>Opt where Opt is a struct defined as:

  ?struct <'a>Opt { 'a v }

(For ML users, Opt is similar to the option datatype.)  Initially, x will have value null.  The value may then be later initialized (e.g., in main()) as follows:  If e is the expression that computes the initial FOO value, we assign:   "x = ^Opt(e)"  (i.e., x gets new Opt(e)).  To extract or modify the value from x, it is necessary to write "x.v" instead of just "x".   Note that the use of "x.v" causes an implicit check for null to be inserted by the compiler which is necessary to ensure that the variable is not used before it is initialized.  Finally, note that the Opt constructor is defined as above in <core.h>. 

Local Variable Declarations:

  <var-decl> ::= <type-exp> <id>[<type-modifiers>][= <exp>];              

Similar to top-level declarations except that there are no scope qualifiers.  Unlike Java, we require all variables to be initialized (or else insert an implicit initialization to the default value for the variable's type.)  We expect to lift this restriction someday, by performing a simple flow-insensitive dataflow analysis (a la JVML) to ensure that variables are initialized before they are used.

Struct Declarations:

  <struct-decl> ::= [<scope>] [?] struct [<<type-var>,...,<type-var>>] {
                      [const] <type-exp> <id>; ... [const] <type-exp> <id>; }
  struct point {int x; int y; }  // a record of two integer fields, x & y
  public struct point {int x,y; } // same as above
  struct point {const int x; const int y; } // same but x and y are immutable 
  ?struct point {int x; int y; } // same as above but point values can be null
  ?struct pointlist {point p; pointlist next; }  // lists of points
  static struct point {int x,y;} // the type declaration is private
  abstract struct point {int x,y; } // the type declaration is exported abstractly
  ?struct <'a>list {'a hd; <'a>list tl; } // polymorphic lists
  ?struct <'a,'b,'c>triple {'a x; 'b y; 'c z; } // polymorphic triple
  extern struct point {int x; int y;} // point is defined publically elsewhere
  extern point;                       // point is defined abstractly elsewhere
  extern ?point;                      // point is an ? struct defined elsewhere

Struct declarations are essentially type declarations for "pointers to records of values".  If no question mark appears before the struct declaration, then values of that type are always valid pointers to records (i.e., never null).  If a question mark appears before the type (so-called option-structs), then the pointer may be null.  When dereferencing an option-struct value, the compiler inserts implicit checks for null and raises the ??? exception if the value is found to be null.  Hence, before setting or reading a field in a struct, a programmer should always check to see whether the value is null. 

Only option-structs have default initializers and obviously, the default is null.

Before the struct declaration, one can write an optional scope qualifier.   The qualifier public indicates that the type declaration is to be fully exported to the outside world.  The qualifier static indicates that the type declaration is to be fully hidden to the outside world.  The qualifier abstract indicates that the name of the type constructor (and the number of type arguments) is exported to the outside world, but the definition is not.  The extern qualifier is used when some other compilation unit defines the struct (and exports it at least in part.)  If no qualifier is present, then "public" is assumed.  (Probably a bad idea.) 

The fields of the struct are given types and names as identifiers and may be optionally prefixed with "const" to indicate that the field is immutable.   Fields are accessed using the "dot" notation.  For example, to access the x component of a point  p we would write "p.x".  To create a struct value, we write "new <struct-name>(<exp1>,...,<expn>)" or "^<struct-name>(<exp1>,...,<expn>)" where the expression arguments are used as the initial values of the fields.  Thus, it is impossible to create a struct value with uninitialized fields.  Someday soon we will support initializing the struct fields by name so that the order of the declaration is not so important.  We may also eliminate the need to write "new" or "^" in order to have a more abbreviated syntax.  

Structs may be recursive as in the pointlist and list examples above.   Thus, the scope of a struct name includes its definition. 

Structs may also be polymorphic in any number of type arguments.  For parsing reasons, we placed the type parameters to the left of the constructor instead of the right.  This allows us to easily determine type expressions from polymorphic function declarations.  In the list example above, a value of type <int>list is either null, or a pointer to a struct that contains an int in its hd field, and an <int>list in its tail field.  Any type but void may be used to instantiate the type variables. 

There is no need (and in fact, no way) to explicitly instantiate the type variables of a polymorphic struct.  Rather, the type checker automatically determines the instantiation through unification in a style similar to ML type inference.   Similarly, when one writes "null", it can represent any option-struct -- there is no need to explicitly say which type.  We expect that sometime in the future, we will provide syntax for explicitly instantiating a polymorphic type constructor and for explicitly writing at which type null is meant to be.  As a simple example, the following polymoprhic function calculates the length of a list:

  int list_length(<'a>list x) {
    int len = 0;

    while (x != null) {
      len++;
      x = x.tl;  // implicit check for null here
    }
    return(len)
  }

Notice that the code is able to calculate the length regardless of the type of the components in the list.  Notice also that we explicitly check for null before dereferencing the list.  Currently, the popcorn compiler inserts an implicit check for null at the point where "x.tl" is dereferenced, but as the optimizations in the compiler are improved, we expect this overhead to disappear.  As another example, the following code computes and returns a list of the first n integers:

  <int>list f(int n) {
    <int>list r;  // implicitly initialized to null

    while (n != 0) {
      r = ^list(n,r);  // new list struct created here
      n--;
    }
    return(r);
  }

Notice that the variable r is implicitly initialized to the value null.   In practice, implicit initialization like this is not a good idea, and one should write "<int>list r = null" to make the code more clear.  Notice also that in the body of the while loop, a new list node is created using "^list(n,r)".  One could also write "new list(n,r)" to make the code appear a bit more like Java (at the price of more typing.) 

Type equivalence on structs is not done structurally as in Standard ML or in C.  Rather, struct equivalence is by name.  Thus, if one defines:

  struct point1 {int x; int y;}
  struct point2 {int x; int y;}
  int f(point1 p) 
  { return(p.x + p.y) }

Then one may pass a point1 value to f, but not a point2 value.  The decision to use name equivalence instead of structural equivalence makes supporting recursive struct definitions easier at the TAL level and it makes it easier to build a type-preserving compiler, because we need not expand type definitions eagerly.   However, it interferes with the addition of several desirable features including sub-typing so this decision will perhaps be re-visited in the future. 

There is no current support for "flattening" one struct into another, nor for creating arrays of flattened structs.  Rather, as in Java,  we are forced to manipulate pointers (or references) to the record.  Though TAL supports flattened records and structs, we have not yet found a simple, elegant way to export this functionality to a source language like Popcorn but hope to some day. 

Union Declarations:

  <union-decl> ::= [<scope>] union[<<type-var>,...,<type-var>>] {
                     <type-exp> <id>; ... <type-exp> <id>; }
  union <'a>list { void Null; *('a,<'a>list) Cons; }
    // the list declaration above is similar to the ?struct 
    // declaration given in the previous section, and approximates
    // the ML datatype:  datatype 'a list = Nil | Cons of 'a*'a list
    // however, the union approach requires an extra level of indirection
    // for each Cons cell.

  union exp { string Var; *(string,exp) Lambda; *(exp,exp) App; }
    // a recursive tagged union representing lambda expressions
   

Unions in Popcorn are more like ML datatypes than like C unions.   That is, union values are fundamentally disjoint or tagged sums.  The tag is necessary so that we can dynamically determine the actual type carried by the union value.   As with structs (and ML-style datatypes), unions can be recursive and polymorphic, and their visibility can be controlled via public, static, abstract, and extern qualifiers.  (See Section [?].) 

The field names of a union are used as the datatype constructors.  To create a value of a union type we write "^<union-name>.<field-name>(<arg1>,...,<argn>)" (or use the keyword "new" instead of "^".)  For example, to create an exp value representing the variable x, we would write "^exp.Var("x")".  

Union fields are not mutable.

When we have a value x of some union type, then we can use a switch expression to determine what its tag and underlying value is, similar to the case-expression and pattern matching constructs of SML.  For instance, the following function prints out exp values:

  void print_exp(exp e) {
    switch (e) {
      case Var(x):  printf("%s",x);
      case Lam(p):  printf("(lambda %s ",p.1);
                    print_exp(p.2);
                    printf(")");
      case App(p):  printf("("); print_exp(p.1); printf(" ");
                    print_exp(p.2); printf(")");
    }
  }

A switch on a union type requires cases for each of the fields (or else a default: case).  Each field's case also defines a local variable that has the type of the field which may be used locally within the switch-case-clause.  For instance, in the example above, the Var field has type string, so the x variable associated with the Var case has type string within the block of code corresponding to that case.  The Lam field has type *(string,exp) (a tuple of a string and expression) so p is assigned this type within the Lam case.  Finally, the App field has type *(exp,exp) so p is assigned this type within the App case.  Notice that unlike C, switch cases do not implicitly fall through to the next case, primarily because the scope of variables bound within the cases does not extend across cases.  Correspondingly, break does not cause control to be transferred out of the switch but rather out of the nearest enclosing loop.  In the future, we may change this by forcing programmers to explicitly write "break" to make it easier to port C code. 

It is also possible to deconstruct a union value by simply using the "dot" notation.  For instance, we can write "e.Var" and this is equivalent to writing "switch (e) { case Var(x): return(x); default: raise ???}". 

Currently, switch expressions that de-construct union types only allow one level of pattern, but we expect to add support for nested patterns, integer and tuple patterns, etc. in the near future. 

Exception Declarations:

  <exn-decl>  ::= [<scope>] exception <qid>; | 
                  [<scope>] exception <qid>(<type-exp>);

Exceptions may only be declared at the top-level.  As with identifiers and other values, the exception's scope can be controlled by an optional scope qualifier.  Exceptions can carry values of a specific type, in which case the second syntactic form should be used. 

Function Declarations:

Very similar to C except that there are provisions for polymorphism.


Expressions:

Most expressions have the same syntax and (intended) semantics as in C or Java.  We detail the various expression forms here for completeness and give examples.  The grammar for the abstract syntax of expressions is as follows:

  <exp> ::=  <const-exp> | <qid> | <if-exp> | <assign-exp> | <inst-exp> |
             <new-exp> | <dot-exp> | <subscript-exp> | <raise-exp> | 
             <seq-exp> | <cast-exp>

Constant Expressions:

  <const-exp> ::= true | false | <number> | <char> | <string> | null |
                  <fun-call> | <fun-inst> |  
                  ^*(<const-exp>,...,<const-exp>) | 
                  ^<qid>[.<id>](<const-exp>,...,<const-exp>) | 
                  {<const-exp>,...,<const-exp>} | {:<type-exp>}

Constant expressions include boolean constants (true, false), integer constants (e.g., 345, 0xfeedface), character constants (e.g., 'a', 'b', '\n', '\000), string constants (e.g., "foo", "bar\n"), and null.  Constructed constant expressions include tuples of constant expressions (e.g., *(<const-exp>,...,<const-exp>)), new structs and unions of constant expressions (e.g., list(3,null)), and array constant expressions (e.g., {3,4,5,6}).   The key use of constant expressions is to serve as initializers for top-level data structures. 

In the future, we may allow things like arithmetic on constant expressions and a few other forms. 

Variables:

  <qid> ::= [<id>::]<id>
  <id>  ::= [a-zA-Z_][a-zA-Z0-9_]*

Value variables are composed of a sequence of upper and lower-case letters, digits, and underscore.  Qualified identifiers are prefixed with an identifier (usually with the first-letter in uppercase) followed by "::".   When used on the right-hand side of an expression, they denote the value contained in the variable.  When used as the left-hand-side of an assignment, they denote the location of the variable (as in C). 

Conditional Expressions:

  <if-exp> ::= <exp> ? <exp> : <exp>

The first expression must have type bool and the other two expressions must have the same type.  When evaluated, if the first-expression yield true, then the second expression is evaluated and the result is returned as the value of the entire expression.  Otherwise, when the first-expression yields false, the third expression is evaluated.

Assignment Expressions:

  <assign-exp> ::= <lhs-exp> = <exp> | <lhs-exp> <arithop>= <exp> |
                   <lhs-exp>++ | <lhs-exp>-- | ++<lhs-exp> | --<lhs-exp>
  <lhs-exp>    ::= <qid> | <lhs-path>
  <lhs-path>   ::= <exp>.<id> | <exp>.<num> | <exp>[<exp>] | 
                   <lhs-path>.<id> | <lhs-path>[<exp>]

The normal case of an assignment is of the form <id> = <exp> and simply evaluates the given expression and assigns the resulting value to the given variable.  The value of the assignment is the value placed into the variable as in C.   In general, the left-hand-side can be a field of a struct or tuple or some subscripted element of an array.  In addition, Popcorn supports a number of the shortcut assignment operators provided by C and Java.  For instance, the expressions "x += 1" and "++x" are both equivalent to "x = x + 1".  

Function Calls:

  <fun-call> ::= <exp>(<exp>,...,<exp>)

The first expression must evaluate to a function.  If the function is not polymorphic, then the argument expressions must have types equal to the formal argument types of the function, and the type of the entire expression is the result type of the function.  If the function is polymorphic, then there must exist some instantiation of the function's type variables such that the argument types match.   The same instantiation is used to calculate the result type.  For example, in a context where the function f has type 'a (*('a,'b)) (a polymorphic function accepting 2-tuples with any pair of types returning something of the same type as the first component of the tuple), then the function call f(^*(3,"foo")) is type-correct and has type int. 

Similar to ML, polymorphic functions are implicitly instantiated when they are called.  They are not instantiated in any other context unless done explicilty.  

Instantiation Expressions:

  <inst-exp> ::= <exp> @ <<type-exp>,...,<type-exp>>

The expression must evaluate to a polymorphic function.  The function is instantiated with the given type parameters.  For instance, if f is a function that has type 'a <'a,'b>(*'a,'b)), then the expression f @ <int,string> has type int (int,string) (i.e., a function from an int and a string to an int.)  

New Expressions:

  <new-exp> ::= new <qid>(<exp>,...<exp>) | ^<qid>(<exp>,...,<exp>) | 
                new <qid>.<id>[(<exp>)] | ^<qid>.<id>[(<exp>)] |
                new_array(<exp>,<exp>) | new_string(<exp>)

There are four basic forms of new expressions:  one for struct types, one for union types, one for array types, and one for string types.  For both structs and unions, either the keyword "new" or a carat "^" may be used and the carat is preferred but new makes it easier to port Java and C++ code.

For structs, new requires the name of the struct and an argument expression for each field of the struct (in the order that the fields are declared.)  

For unions, new requires the name of the union followed by a period and the name of the field (i.e., the data constructor).  If the field carries void type, then no argument expression should be supplied.  Otherwise, if the field carries type T, then an argument expression of type T should be applied. 

For new_array, the first expression should evaluate to an unsigned integer indicating the size of the array, and the second expression should be a value of the array type.  This value is used to initialize the contents of the array.

For new_string, the argument should evaluat to an unsigned integer indicating the desired size of the string.  The string is filled with unspecified values. 

Dot Expressions:

  <dot-exp> ::= <exp>.<id>

A dot expression is used to extract a field from a struct or to implicitly check that a union value is of a particular type.  In the former case, <exp> must be of a struct type with a field named by <id>.  The type of the resulting expression is the type of the associated field.  If the struct is an option-struct, then executing the dot-expression will cause the exception ??? to be raised when the option-struct is null.  In the latter case, <exp> must be of a union type with a field named by <id>.  The resulting expression has the type associated with the field in the union declaration.  When executed, the tag on the union value is tested and the underlying value is extracted and returned if the tag matches the field specified.  Otherwise, the exception ??? is raised.  We discourage the use of dot-expressions for unions and encourage the use of switch statements instead.   However, the dot-notation is particularly convenient for porting C code. 

Subscript Expressions:

  <subscript-exp> ::= <exp>[<exp>]

The first expression must have an array type or be of type string, and the second expression should have type unsigned int.  (Note that signed integers are implicitly coerced to unsigned integers, and both shorts and bytes are implicitly coerced toi ints.)  If the second expression evaluates to the integer i, then the ith value is extracted from the array or string, as long as i is between 0 and the size of the array or string.  Otherwise, the exception ??? is raised. 

Raise Expressions:

  <raise-exp> ::= raise <qid>([<exp>]) | raise (<exp>)

In the first syntax, the qualified identifier must be an exception name in scope.  If the optional expression is present and has type T, then the exception must have been declared to carry type T.  In the second syntax, <exp> must have type exn.  As in Java, the semantics of a raise expression is to transfer control to the nearest (dynamically) enclosing try/handle statement.  The type of the raise expression is arbitrary.  Often, it is necessary to put a raise expression as the argument to a return statement to convince the type-checker to be happy. 

In the future, we may make raise a statement instead of an expression or, more likely, we may collapse the disintinction between exceptions and statements so that this won't matter.  We are also likely to make the application of an exception constructor a general expression form in which case, the first syntactic form will be unecessary.  Note that for parsing reasons, parentheses are required regardless as to whether an exception carries values or not.     

Expression Sequences:

  <seq-exp> ::= <exp> , <exp>

As in C, the first expression is evaluated and its result is discarded.   Then the secon expression is evaluated and its result is returned as the value of the compound expression.  It is often necessary to insert parentheses around the expression to disambiguate the parsing in contexts that also use commas as a separator (e.g., tuple expressions or initialization arguments for structs or constant arrays.)  

Cast Expressions:

  <cast-exp> ::= (:<type-exp>)<exp>

Cast expressions allow values of one type to be explicitly coerced to another type.  Currently, casting is only supported on integral types.  We chose to require a colon before the type expression, unlike C, in order to simplify parsing.  


Statements:

  <stmt> ::= <null-stmt> | <exp>; | <stmt> <stmt> | <if-stmt> |
             <return-stmt>; | <while-stmt> | <do-stmt> | <for-smt> |
             <break-stmt> | <continue-stmt> | <switch-stmt> | <try-stmt> |
             <labelled-stmt> | <block-stmt> | <printf-stmt>

Null Statement:

  <null-stmt> ::= ;

An empty (null, skip) statement is represented by a semi-colon. 

Expression Statements:

An expression may be used in any syntactic context for statements.   The value of the expression is discarded.

Compound Statements:

Two adjacent statements are executed in sequential order. 

Conditional Statements:

  <if-stmt> ::= if (<exp>) <stmt> | if (<exp>) <stmt> else <stmt>

An if statement requires that the <exp> argument have type bool.   The expression is evaluated and if it results in true, then the first <stmt> is executed.  In the first syntactic case, if the expression evaluates to false, then the statement has no (additional) effect.  In the second syntactic case, the second <stmt> is executed.  Thus, "if (e) s" is equivalent to "if (e) s else ;".  Nested if-expressions are disambiguated as in C.

Return Statements:

  <return-stmt> ::= return; | return <exp>;

Return is used to terminate execution of a function (immediately) and to optionally return a value to the caller.  If the enclosing function has result type void, then the first form must be used.  If the enclosing function has a type other than void, then the second form must be used and <exp> must have that type.   The compiler attempts to ensure that along all control-flow paths within a function, there is a return statement.  However, the analysis is fairly simple and flow-insensitive so it is sometimes necessary for programmers to insert extra return statements even when the programmer is sure that the return will never be executed.   In such situations, it is usual to write "return (raise Foo())" where Foo is some exception constructor. 

While Statements:

  <while-stmt> ::= while (<exp>) <stmt>

The <exp> must have type bool.  Execution of the statement proceeds by first evaluating the expression.  If it evaluates to true, then the body <stmt> of the loop is executed, followed again by the entire while expression.   Thus, the statement is roughly equivalent to "if (<exp>) { <stmt>; while (<exp>) <stmt> }" (ignoring break and continue.) 

For Statements:

  <for-stmt> ::= for ([<exp>] ; [<exp>]; [<exp>]) <stmt> |
                 for (<var-decl> [<exp>]; [<exp>]) <stmt>

For-statements are meant to be as similar to C as possible, while admitting a certain convenience from Java for declaring a loop index variable within the statement.  As in C, the first expression (or variable declaration) is executed initially.  Then the second expression, which must have type bool, is evaluated.   If the result is false, then execution of the statement terminates.   Otherwise, the statement body is executed followed by the third expression.   Then the second test expression is evaluated again, followed if true by the body, etc.  Thus, the for statement "for (e1;e2;e3) s" is equivalent to "e1; while (e2) { s; e3 }" modulo break and continue issues. 

Do Statements:

  <do-stmt> ::= do <stmt> while (<exp>)

Similar to the while statement, except that the body of the loop is executed before the test expression.  Thus, ignoring break and continue, the construct is equivalent to "<stmt>; while (exp) <stmt>". 

Break Statements:

  <break-stmt> ::= break; | break <id>;

Execution of a simple break statement causes control to be transferred out of the nearest lexically enclosing while, for, or do loop.  Control is transferred to the statement immediately succeeding the enclosing loop.  If no such statement occurs, then the break causes an implicit return which only makes sense in the context of a void-returning function.  If the break does not occur within a loop, then an error will be signalled at compile time.  Note that break is not used to transfer control out of switch statements as is done in C, C++, and Java.

Execution of break <id> causes control to be transferred to the statement labelled with the identifier <id>.  Control may only be transferred within the same block or to an outer block.  That is, it is impossible to jump into a block (which may potentially define new variables). 

Continue Statements:

  <continue> ::= continue; | continue <id>;

Execution of a simple continue statement causes control to be transferred to the "test" portion of the nearest lexically enclosing loop.  If the continue does not occur within a loop, then an error will be signalled at compile time.  

Execution of continue <id> causes control to be transferred to the statement labelled with the identifier <id> and has restrictions similar to that of break <id>.  In fact, as I'm writing this, I'm not sure why we provide both labelled break and continue instead of just goto and expect that Fred will clear this up for me. 

Switch Statements:

  <switch-stmt> ::= <int-switch> | <char-switch> | 
                    <union-switch> | <exn-switch>
  <int-switch> ::= switch (<exp>) { case <num>: <stmt> ... 
                                    case <num>: <stmt>
                                    default: <stmt> }
  <char-switch> ::= switch (<exp>) { case <char>: <stmt> ...
                                    case <char>: <stmt> 
                                    default: <stmt> }
  <union-switch> ::= switch (<exp>) { case <id>[(<id>)]: <stmt> ...
                                      case <id>[(<id>)]: <stmt>
                                      [default: <stmt>] }
  <exn-switch> ::= switch (<exp>) { case <qid>[(<id>)]: <stmt> ...
                                    case <qid>[(<id>)]: <stmt>
                                    default: <stmt> }

There are four different kinds of switch expressions.  The <int-switch> is used to test integers, <char-switch> to test characters, <union-switch> to test values of union type and to extract the values carried by a data constructor, and the <exn-switch> to test and extract values carried by an exception constructor.  The syntax for all four kinds of switches is quite similar and the semantics is close to, but subtly different than C.

In particular, switch does not support implicit fall-through from one case to the next.  Thus, it is as if an implicit C "break" is inserted before each case or default label.  Furthermore, a break statement within a switch does not cause control to be transferred out of the nearest lexically enclosing switch, but rather the nearest lexically enclosing loop.  This was always a brain-damaged part of C (witness the AT&T switch crash) and we are happy to drop it.  In addition, as explained in Section [?], cases for union switches and exception switches introduce bound variables whose scope is the associated case's <stmt>.   Supporting implicit fall-through would thus allow a statement to access an undefined variable.

In all of the switch constructs, it is necessary for the cases to be disjoint and complete.  This constraint is enforced by the type-checker.  A switch is disjoint is no case matches the same integer, character, union field, or exception constructor.  A switch is complete if every integer, character, union field, or exception has a case.  For exception switches, this is impossible without a default case as the set of exception names can grow arbitrarily at link-time (and the checking occurs at compile time.)

Union switches allow a primitive form of pattern matching in the style of ML.  In particular, if the argument union expression evaluates to a value with a tag corresponding to field f, then the case for f will be selected when the switch is executed.  If f carries type void, then it is an error for the case to define the optional argument identifier.  If the field carries any type  T besides void, then the case must include the optional identifier.  This identifier will be treated as if it has type T and will be initialized with the T value carried by the union value.   The scope of the identifier is the statement that goes with the case. 

Exception switches are similar to union switches except that instead of matching on fields within a union, the switch takes an exn value and matches cases with exception constructor names.  As with unions, if an exception carries no type, then no optional identifier should be supplied.  Otherwise, if the exception is declared to carray values of type T, then an optional identifier should be supplied and that identifier will be assigned the type T and initialized with the value carried by the exception constructor.  The scope of the identifier is the statement corresponding to the case clause. 

Try Statements:

  <try-stmt> ::= try <stmt> handle <id> <stmt>

Execution proceeds by evaluating the first statement.  If this does not cause an uncaught exception to be raised, then the statement terminates without executing the second statement.  If an exception is raised during evaluation of the first statement and is not caught by an intervening (dynamic) handler, then control is transferred to the second statement.  The exception value of type exn passed to raise is bound to the identifier <id> within the scope of the second statement.   Typically, the second statement performs an exception switch on the identifier to determine what exception was raised and to take appropriate action. 

In the near future, we expect to add a more convenient form for integrating the union swith with the handler.  We may also add a try/handle/finally statement as in Java and change the keyword "handle" to "catch".  

Labelled Statements:

  <label-stmt> ::= <id> : <stmt>

Labels are placed on statements to support easy break and continue to a specified location. 

Block Statements:

  <block-stmt> ::= { <var-decl> ... <var-decl> <stmt> }

A block statement supports both nested variable declarations and a way to disambiguate the parsing of certain statement patterns (e.g., the dangling else problem with if.)   Control may only be transferred into the beginning of a block (see break and continue). 

Printf and Fprintf Statements:

    <printf-stmt> ::= printf(<desc>,<exp>,...<exp>) | 
                      fprintf(<exp>,<desc>,<exp>,...,<exp>)

These statements are used for printing integers, strings, and characters to files or terminals, similar to the printf and frpintf functions of the C standard library.  The <desc> argument must be a string literal and is parsed at compile time to determine the appropriate number and types of the additional arguments to printf.   In the case of fprintf, the first argument expression is meant to be a FILE value such as tal_stdout.  Descriptors follow the conventions of C.  So for instance, "%d" is used for integers, "%s" for strings, and "%c" for characters. Currently, there is no support for precision in the output, other types, etc.   These statements were included purely to make porting C code easier and to simplify mundane output tasks. 


Program Structure:


BNF Grammar:


The Popcorn Library:


Core


List


Dict


Queue


Set


Lexing


The Popcorn Tools:


The popcorn Compiler:


The talc Type Checker and Assembler:


The poplex Lexer Generator:


The popbison Parser Generator: