Week 9 – NeurIPS 2020, Variable Binding


In the past two weeks, I have focused on:

  • NeurIPS 2020
  • Variable Binding — literature review and project feasibility study
  • Alternative project proposals — literature review on contrastive learning, risk minimization, and Vision Transformers. I will include my solidified thoughts in the next post.

NeurIPS 2020

NeurIPS 2019 was my first time attending a computer science conference. I was quite overwhelmed by its extensive program and learned that it was impossible to participate in every event. I attended workshops and tutorials relevant to my work but missed most of the spotlights and posters. This year, once the list of accepted papers was released, I skimmed through all the paper titles with my peers, read the abstracts of those that excited me, and studied 6-9 of them in greater detail. I curated my calendar to include 3-4 events each day of the week and attended most of them. Most of the events I attended were related to either Neuro-Symbolic methods or Multimodal Navigation. Overall, it was a great learning experience, and I am grateful for OpenAI’s sponsorship.

Variable Binding

In the past year, I came across the term “Neuro-Symbolic” at MSR and a few conferences. I did not know the topic, but its premise piqued my interest. At the time, I was working on Vision-Language Navigation, where generalization to unseen environments was an open problem. I noticed that agents tend to pick up spurious visual signals from the scenes rather than perform reasoning at the level of spatial semantics. Even if we feed an agent ground-truth room labels, without the support of explicit mapping modules it still struggles to develop search heuristics. It seemed to my mentors and me that a well-designed computer program could capture and execute these search heuristics, using the current room type, doorway directions, and navigation history as typed arguments. If we had pursued this approach, the research problem would have been how a DNN can be modularized to learn and solve these symbolic, discrete, logical programs on the fly.

On the other hand, the TP-transformer paper (Schlag et al. 2019) seems to have picked up some traction. It also has a symbolic flavor but attempts to address the problem from within the connectionist framework (contrary to the modular approach above). This transformer achieves state-of-the-art results on math problems, and its attention patterns show that arguments are more correctly bound to their relations. For example, in the division a/b, a can correctly attend to b as its denominator, and this property is preserved even when the problem becomes more nested, as in (a/b)/(c/d). Given OpenAI’s heavy interest in transformer models, I decided to investigate possible project directions extending from this paper.

One critical notion mentioned in this paper is Variable Binding. In the problem a/b, the filler b should be bound to a through the denominator role. Standard multi-head attention layers already allow this by letting the parameters W_Q, W_K, and W_V in each head learn to bind fillers for one role (Goyal and Bengio 2020 use the terminology “typed argument”, which is analogous to the arguments accepted by program functions). However, this binding mechanism is limited by the number of heads. The TP-transformer authors introduced a specific role-binding mechanism, which allows for finer role learning and binding within each head. Out of curiosity, I tested this fraction simplification problem on the GPT-3 playground:

This is a program that simplifies the division of fractions. The variables are a, b, c, and d. Terms within parentheses are simplified first.

PROMPT: a / b
RETURN: (a) / (b)
PROMPT: b / c
RETURN: (b) / (c)
PROMPT: (a / b) / c 
RETURN: (a * c) / (b) 
PROMPT: (c / b) / b
RETURN: (c * b) / (b)
PROMPT: (c / d) / (a /a)
RETURN: (c * a) / (d * a)
PROMPT: (d / b) / (c /a)
RETURN: (d * c) / (b * a)
PROMPT: ((a / b) / c) / (a / b)
RETURN: (a* c * b) / (b * a)
PROMPT: ((b / d) / a) / (c / d)
RETURN: (b * c * d) / (a * c * d)

Given the demonstration, the model returned the correct answers for the first three problems, which establishes its ability to recognize math symbols and perform basic fraction simplification. However, its failure on the last problem shows that it cannot bind variables recursively when the problem is sufficiently nested. While it’s unfair to evaluate GPT-3 on math problems, this observation, along with TP-transformer’s results, led to the question of whether transformer models can be benchmarked on their ability to bind variables.

However, it’s not straightforward to evaluate variable binding. Suppose (a/b)/(c/d) is encoded into a vector; what should we do with this vector? We can use this information bottleneck to simplify the fraction into (a*c)/(b*d), but this metric conflates the model’s ability to come up with a logical internal representation by binding with its ability to solve that logical representation using learned arithmetic. A fairer approach is to query for the numerator (role) of the numerator (role) and see if the model returns a (filler). This process is called unbinding. I did some literature review on this. Surprisingly, the most influential set of literature came from the 1990s, when connectionists (most of whom were in computer science) and symbolists (most of whom were in cognitive science) attempted to imbue distributional representations (like word embeddings) with structural properties (like logical expressions). Here, I discuss the one I find most representative.

In 1990, Paul Smolensky introduced tensor-products for variable binding. The following description is based on figures extracted directly from his paper.

Here, a role is defined as a vector in \mathbb{R}^5, and a filler as a vector in \mathbb{R}^4. The tensor-product of this binding is a matrix in \mathbb{R}^{4 \times 5} (20 elements), with each element at position (\phi, \rho) computed as the product of two scalars: unit \phi of the filler and unit \rho of the role. This binding is a rank-2 tensor.
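As a minimal sketch of this operation (with random vectors standing in for the paper’s hand-drawn filler and role units), the binding is just an outer product:

```python
import numpy as np

rng = np.random.default_rng(0)

# A filler in R^4 bound to a role in R^5 via the tensor (outer) product.
filler = rng.standard_normal(4)   # e.g. the symbol "a"
role = rng.standard_normal(5)     # e.g. the "numerator" role

binding = np.outer(filler, role)  # rank-2 tensor in R^{4x5}, 20 elements

# Each element is the product of one filler unit and one role unit.
assert binding.shape == (4, 5)
assert np.isclose(binding[2, 3], filler[2] * role[3])
```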

Treating each tensor-product as a filler for higher-level binding operations, we can also bind recursively to construct representations for hierarchical structures—for instance, a complex, nested expression in lambda calculus. One caveat is that the rank of this resultant tensor-product increases — every time we bind a rank H filler with a rank 1 role, the operation returns a rank H + 1 tensor-product.

At any particular level of a hierarchical structure, the conjunction of bindings is computed as the element-wise sum of all bindings.
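Both properties, conjunction as an element-wise sum and recursive binding raising the rank, can be sketched in numpy (the fillers and roles here are hypothetical random vectors, not learned embeddings):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(4), rng.standard_normal(4)            # fillers
num_role, den_role = rng.standard_normal(5), rng.standard_normal(5)

# Conjunction of the two bindings in "a / b": element-wise sum.
expr = np.outer(a, num_role) + np.outer(b, den_role)

# Recursive binding: treat the rank-2 expression as a filler and bind
# it to a new rank-1 role, producing a rank-3 tensor.
outer_role = rng.standard_normal(5)
nested = np.multiply.outer(expr, outer_role)
assert nested.ndim == expr.ndim + 1   # rank grows by 1 per binding
```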

1) Exact Unbinding Procedure
2) Self-addressing Procedure

Theorems 3.1, 3.2, and 3.3 show two ways to unbind. The underlying assumption is that all role vectors, whether designed by hand or learned, are linearly independent. This property allows us to unbind by taking the dot product between the bound tensor-product and either 1) the dual basis vector of the query role, which returns the exact filler of interest, or 2) the query role vector itself, which returns the sum of the filler of interest and a noise term.
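Both unbinding options can be sketched in numpy. Here the dual basis is obtained with a pseudo-inverse, which is one standard way to construct it for linearly independent roles (my choice, not necessarily Smolensky’s exact construction):

```python
import numpy as np

rng = np.random.default_rng(2)
f_num, f_den = rng.standard_normal(4), rng.standard_normal(4)  # fillers a, b
roles = rng.standard_normal((2, 5))   # numerator role, denominator role

# Bound representation of "a / b": sum of the two tensor-product bindings.
T = np.outer(f_num, roles[0]) + np.outer(f_den, roles[1])

# 1) Exact unbinding: dot with the dual basis vector of the query role.
dual = np.linalg.pinv(roles)          # columns satisfy roles @ dual = I
recovered = T @ dual[:, 0]            # exactly recovers the numerator filler
assert np.allclose(recovered, f_num)

# 2) Self-addressing: dot with the query role itself; this returns the
#    filler of interest plus a noise term from the roles' non-orthogonality.
noisy = T @ roles[0]
```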

One undesirable property of this binding mechanism is the tensor-product’s rank explosion when the problem’s structure becomes deeply nested. To mitigate this, the TP-transformer authors (who include Smolensky himself) suggested using the Hadamard product as an approximation. With this approximation, binding an \mathbb{R}^5 role vector with an \mathbb{R}^5 filler vector returns an \mathbb{R}^5 representation. We can also arrive at this representation by taking the diagonal of the full tensor-product representation. To unbind, we take the Hadamard product of the bound \mathbb{R}^5 representation and the \mathbb{R}^5 query role vector itself.
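A small sketch of the Hadamard approximation (using ±1 role entries is my assumption so that unbinding is exact in this toy example; the TP-transformer instead learns and normalizes its role vectors):

```python
import numpy as np

rng = np.random.default_rng(3)
filler = rng.standard_normal(5)
role = rng.choice([-1.0, 1.0], size=5)   # assumed +/-1 entries for exactness

binding = filler * role   # Hadamard product: stays in R^5, no rank growth

# Equivalently, the diagonal of the full tensor-product representation.
assert np.allclose(binding, np.diag(np.outer(filler, role)))

# Unbind by taking the Hadamard product with the query role again;
# with +/-1 roles, role * role = 1 and the filler is recovered exactly.
assert np.allclose(binding * role, filler)
```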

Besides Smolensky’s tensor-product, other influential works include Holographic Reduced Representations (Plate, 1994), Spatter Code (Kanerva, 1994), and the Braiding operator (Gayler, 1998). Across the board, researchers recognized that systems for composing structural representations need to support binding, unbinding, conjunction, and recursive construction/deconstruction. Also, note that each unbinding method is specific to its respective binding method. It wouldn’t be fair to evaluate one system’s ability to bind using another system’s unbinding operator.

In addition to this set of literature, I also reviewed the following to understand the definition of variable binding:

Variable Binding Tree and Lambda Calculus (Church, 1936): I find the variable binding tree’s formalism to be useful for generating synthetic data of different levels of binding complexity. The number of leaves corresponds to the number of terminal values bound as variables. The height of the tree corresponds to the number of recursive binding operations. The width of the tree at any level corresponds to the number of conjunctions at the level.
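To make the three correspondences concrete, here is a hypothetical generator sketch for such synthetic data (the names, fillers, and sampling choices are mine, not part of Church’s formalism):

```python
import random

random.seed(0)

FILLERS = "abcd"  # terminal values, as in the fraction problems above

def gen_tree(height, max_width=2):
    """Sample a variable-binding tree. Leaves are terminal values bound
    as variables; tree height = number of recursive binding operations;
    width at a level = number of conjunctions at that level."""
    if height == 0:
        return random.choice(FILLERS)
    return [gen_tree(height - 1, max_width)
            for _ in range(random.randint(1, max_width))]

def num_leaves(tree):
    """Count terminal values bound as variables."""
    return 1 if isinstance(tree, str) else sum(num_leaves(c) for c in tree)
```

Raising `height` and `max_width` then yields problems of increasing binding complexity.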

On the Binding Problem in Artificial Neural Networks (Greff et al. 2020): I find their characterization of the binding problem as three subproblems (segregation, representation, and composition) to be insightful. They also provide a good survey of existing methods that attempt to perform binding within the connectionist framework, in contrast to modular hybrid approaches, which designate some modules to deal with symbolic information and others to deal with neural information.

Measuring Compositionality in Representation Learning (Andreas, 2019): The author proposed a general procedure for measuring any encoding function’s ability to compose by learning an approximation function that has access to the ground-truth structure of a problem. If the encoding function doesn’t encode compositionally, the encoded representation has no compositional properties, making it impossible for the approximation function to match it using ground-truth information. While this procedure makes intuitive sense, I’m skeptical of certain aspects of how it is operationalized in the experiments. First, the composition function (*) is assumed to be addition in all experiments; no explanation is given for this collapse of conjunction and recursive binding into a single operator. Second, in some experiments, it is unclear what the encoding function and the approximation function are; no explanation is given for this either. These two aspects make extending this paper quite challenging.
