Ironic Space Lisp Part 5

2018/08/19

Let’s talk about environmental bindings. I’m taking the unusual (I think) approach of sharing environment bindings code between the VM and the interpreter. Unfortunately, I wrote the environment code at the same time as the VM, and fit the code a little too closely to the requirements of the VM, and didn’t think enough about what the interpreter would require. In the process of writing the interpreter, I encountered a strong disconnect between the semantics that I wanted and the semantics I had. Additionally, the first version of the environment used mem::replace. mem::replace is weird, I’ve used it a couple of times, but mostly as a sort of band-aid, and I’ve always ended up refactoring code away from it. It’s ended up being a personal code smell of sorts. Luckily, the new version doesn’t use it.

mem::replace can sometimes seem like a way to get around borrow restrictions. It takes a mutable reference, and a value, and swaps the old value for the new, returning the old value. It doesn’t violate the borrow checker, but it bends it, and you can feel the unsafe blocks lurking in the background. It’s based on the swap<T>(x: &mut T, y: &mut T) function, which

Swaps the values at two mutable locations, without deinitializing either one.

This is noteworthy because it can do this to types which don’t implement Copy. It uses unsafe blocks to bypass the borrow checker. Under normal circumstances, I think the function would have to take ownership of the values to swap them, but with unsafe blocks, they can be swapped directly.

The Old Version

pub struct Environment {
    bindings: HashMap<String, Rc<data::Literal>>,
    parent: Option<Box<Environment>>,
}

This looks reasonably normal for lexical scoping (sort of). You have local bindings in a hashmap of names to values, and a pointer to a parent representing higher level bindings. That pointer is a Box, however, which means only 1 pointer to this parent environment can really exist. Box is less a pointer and more a value in heap memory: cloning the box will clone the inner value and box it.

This value represents a particular environment, so put just inserts directly into the local bindings hashmap.

  pub fn get(&self, k: &String) -> Result<Rc<data::Literal>> {
      if let Some(v) = self.bindings.get(k) {
          return Ok(Rc::clone(v));
      }

      if let Some(ref p) = self.parent {
          return p.get(k);
      }

      Err(format!("Binding not found for {:}", k).into())
  }

The binding hashmap maps strings to reference counted pointers to data, so we return a cloned pointer. Additionally, if this environment has a parent (and therefor isn’t the root environment), we delegate to our parent.

pub fn pop(&mut self) -> Result<()> {
    let parent = mem::replace(&mut self.parent, None);
    let parent = parent.ok_or("Attempted to pop root environment")?;

    *self = *parent;
    Ok(())
}

This function first takes ownership of the environment’s parent by using mem::replace to replace the parent with None, then it sets itself to its parent with *self = *parent. In practical terms, this patterns is unknown to me. Although the practice of replacing yourself makes sense if you’re writing instance methods on effectively immutable values (like numbers or booleans), I don’t really know if anybody does this with structs. I wasn’t even sure this code would compile.

pub fn push(&mut self) {
    let n = Environment::new();
    let p = mem::replace(self, n);
    self.parent = Some(Box::new(p));
}

This uses mem::replace to execute a sort of bait and switch, replacing ourselves with the new empty parent-less environment, then setting ourselves as the parent.

This upshot of all this swapping is that the user of this struct sees what appears to be a normal linked-list type of environment binding, but calling push and pop swaps the value in place, so the Environment struct acts more like a container for all those nested environments, simultaneously managing the pushing and popping of the Environments, and the looking up.

This works pretty well for the VM implementation. It only has to keep track of one environment, and it only pushes and pops the environment stack in response to instructions: it’s someone else’s job to push and pop environments at the right time.

It didn’t make quite as much sense for the interpreter. In the let interpretation, the interpreter should make a new environment parented to its own environment, load the local bindings into the new environment, then evaluate the body with the new environment. However, pushing and poping those environments is effectively done on the same struct, it just gets replaced. And you have to pass the environment as a mutable reference to allow further local bindings down the AST tree, so the upper levels are holding onto an Environment that very much isn’t for them. If the body of the let pops the environment one too many times, the entire stack gets screwed. The interpreter also maintains a global environment, but there’s no concept of separation between those environments, they’re treated as the same value, so the global environment gets pushed down the environment stack, and the struct points to environments evaluating deep in the AST. As long as the environment doesn’t get “overpopped”, it won’t break, but the substantial difference between these two uses felt like a hint that I was missing an important difference in my data model.

The New Version

This particular technique has been on my mind for a while. In the simplest self-hosting lisp interpreters, environment bindings are implemented with “assoc” lists: lists of cons of keys and values ((a . 1) (b . 2) (c . 3)), or “alists”, lists of lists of keys and values ((a 1) (b 2) (c 3)). As a map data structure, these have performance on the order of degenerate binary trees: O(n) for search at worst. They do have a couple of useful properties: they’re very easy to implement, because they are lists you can just cons more pairs onto them, and because of the way cons-lists share data, you can keep track of parent environments by maintaining a pointer to cons you consider as represents the head of your environment bindings, and generate child environments by consing additional pairs on without risking mutating to your pointer and without copying data needlessly. It also let you keep track of the child environments over the long term, making lexical lambda bindings very easy to implement.

I obviously want to avoid the O(n) lookup times, so I want to stick with hashmaps for the bindings themselves. I don’t want to copy large environments regularly, so I want to let child environments share parent environments too. However, environments need to stay mutable to allow parent environments to move on from evaluating children and add more bindings. Rust basically won’t let us leak memory, and we need to maintain multiple references to the environments, so we need to use reference counted pointers. Rc won’t let us mutate its value, so we need to use a RefCell instead. RefCell enforces borrowing semantics at runtime, allowing either 1 mutable reference or multiple immutable references at one time. We have children pointing to environment, so we can’t actually ever mutate it with RefCell. Were we going to replace the parent environment with another child to continue binding in the global environment? Deeply nested environments like this steadily lose their performance advantage as you’re forced to check a longer and longer linked list of hash maps. We’d be introducing a new child binding to be the new global binding with every local environment, and we’d need to do it at non-local levels too.

This is getting cumbersome, but it’s also beginning to remind me of immutable data structures. Immutable data structures are data structures that implement mostly normal data structure semantics, but in an immutable way. They are perhaps best well known for their use in Clojure, where they back almost every data structure in the language. The immutable data structures allows for hashmap “mutation” while also allowing you to maintain the old version of the hashmap unchanged, which is basically exactly what we want for our environments. There are immutable data structures for many different types of data structures, and I was planning on making Ironic Space Lisp ultimately backed the rust implementation of immutable data structures, im-rs. I’m not sure if I want ISL to have cons-lists or just have immutable vectors. While cons-lists are easier to implement, immutable vectors are already implemented, and probably have better performance characteristics. Changing this will probably mean changing, at least slightly, all parts of the program (parser, AST, interpreter, and VM), but I always envisioned ISL without mutability, in much the same way as Erlang avoids mutability, and immutable data structures seem like a great way to do that without being horribly inefficient and copying everything around, or attempting to implement my own bad versions of existing immutable data structures.

Anyway, back to environments. im-rs’s HashMap matches the semantics I want for environments pretty perfectly: hash maps with the option to insert “mutably” and to produce new hashmaps with new values inserted to use as child environments. I don’t feel any particular need to wrap the HashMap, so my new basic environment data structures is just this:

pub type Env = HashMap<String, Rc<data::Literal>>;

I’m continuing to use reference counted pointers to the bound values, but this doesn’t really agree with my philosophy that Literals should be either easily clonable or reference counted anyway.

This single type isn’t entirely sufficient, though. Although the interpreter keeps track of nested environments with its rust-level program stack, the VM doesn’t do that. I wasn’t particularly eager to implement the environment stack pushing and popping in the VM directly, so I decided to implement that functionality in another type in the environment module, EnvStack:

pub struct EnvStack{
    envs: Vec<Env>
}

EnvStack implements many of the old Environment methods, like get, put (which Env calls insert or update depending on whether you want to update in place or get a new hashmap), push, and pop. Although EnvStack does own all the Envs, it does let its users borrow the top of the stack with peek, and I couldn’t think of a good reason to allow for arbitrary borrowing, so you can’t. Finally, push and pop are adorably pedestrian compared to their mem::replace ancestors:

pub fn push(&mut self) {
    let n = match self.envs.last() {
        Some(e) => e.clone(),
        None => Env::new(),
    };

    self.envs.push(n);
}

pub fn pop(&mut self) -> Result<()> {
    self.envs.pop().ok_or("Attempted to pop empty environment stack")?;
    Ok(())
}

Because the external interface changed, I had to change the test suite substantially, and update the VM, but the interpreter was mid-rewrite when I realized the need for a new environment implementation, so no interpreter code got commited using the old environment before and the new AST.

In lieu of any serious conclusion, I’ll leave you with a mangled proverb: “In the land of the borrow checker, the immutable man is king."


rust code language