Tom McCoy, Tal Linzen, Ewan Dunbar, Paul Smolensky
This page accompanies the paper found here. The code is available here.
Neural networks are widely celebrated for their ability to handle complex tasks. However, they have another ability that might be even more impressive because it is more surprising: They can also solve tasks that they seem extremely illsuited for.
Consider the task of copying a sequence of digits. Intuitively, this task seems to require compositional representations that indicate the sequence’s fillers (that is, which specific digits are in the sequence) and those fillers’ roles (that is, their positions). The standard way of writing a sequence is shorthand for such a fillerrole representation:
4 , 3 , 7 , 9 , 6 = 4 : first + 3 : second + 7 : third + 9 : fourth + 6 : fifth
Neural networks do not appear to use any such representation of structure, so asking a neural network to perform a symbolic task might seem like using a Phillips screwdriver on a flathead screw. Nonetheless, the neural network below is able to perform the copying task very successfully:
This neural network has two components: the encoder, which encodes the sequence into a vector of continuous values (represented here as shades of blue); and the decoder, which generates the output sequence based on the vector encoding.
This network is not perfect—for example, it fails on most singledigit sequences—but it does perform extremely well, with over 99% accuracy on a heldout test set. The question, then, is this: How can neural networks use continuous vector representations to perform symbolic tasks that depend on compositional structure? In this project, we investigate the following hypothesis:
In the rest of this project description, we first discuss fillerrole representations in more detail. Next, we investigate analogies as a preliminary source of evidence for filler/role representations in neural networks. Finally, we introduce a novel architecture, the Tensor Product Decomposition Network (TPDN) for decomposing learned vector representations into fillerrole representations. The initial sections on filler/role representations and analogies are not essential to the main experiments, so you can safely skip ahead to the main experiments if you wish.
Suppose you want to represent the abstract numerical quantity "the number of hours in a leap year." The standard way of writing this number is 8,784, which can be viewed as a fillerrole representation that uses digits as fillers and powers of 10 as roles:
eight : thousands + seven : hundreds + eight : tens + four : ones
When we describe this number in words, the roles are actually made explicit. That is, the phrase "eight thousand seven hundred eightyfour" transparently corresponds to:
eight : thousand + seven : hundred + eight : ty + four : ∅
The only minor complications are the fact that the "ty" suffix in "eighty" corresponds to the "tens" role and the fact that there is no pronounced instantiation of the "ones" role corresponding to the filler "four" at the end of the number.
With digits, the fact that positions are used as a shorthand for roles means that the order of the digits matters; the two copies of "8" in "8,784" mean different things (eight thousand vs. eighty). However, when using explicit fillerrole pairs as in word representations, order no longer matters (this is partly why we have been using the + sign for linking together fillerrole pairs, since order also does not matter for addition). Thus, even though Malagasy (the national language of Madagascar) uses a word order opposite that of English, it is still using fundamentally the same fillerrole scheme (for clarity, hyphens have been added to the Malagasy words):
Malagasy:  efatra  amby  valo  folo  sy  fiton  jato  sy  valo  arivo 
English:  four  and  eight  ty  and  seven  hundred  and  eight  thousand 
Fillers and roles:  four : ∅ + eight : ty + seven : hundred + eight : thousand 
Fillerrole representations are very flexible. There are many possible schemes for organizing the fillers, and many possible schemes for organizing the roles. For example, English standardly refers to the number 87 as “eightyseven”, which uses powers of ten as roles:
87 = eight : tens + seven : ones
However, Abraham Lincoln used a different role scheme based on twenties, instead of tens, when he referred to 87 as “four score and seven” in the Gettysburg Address:
87 = four : twenties + seven : ones
It is also possible to alter the set of fillers. English standardly uses the digit words from “one” to “nine” as its fillers (with the absence of a filler being interpreted as the “zero”). However, in The Lord of the Rings, Bilbo Baggins refers to his age of 111 as "eleventyone", a usage which expands the standard filler set to also include eleven:
111 = eleven : tens + one : ones
To give some more serious examples: Using bases besides 10 requires altering both the fillers and roles (e.g. base 7 uses only the fillers from 0 through 6, and its roles correspond to the decimal numbers 1, 7, 49, 343, etc.). Roman numerals also involve a complex set of fillers and roles that are difficult to characterize concisely. Moving beyond numbers, various types of linguistic units can be couched in terms of fillers and roles. In phonology, syllables involve the roles of onset, nucleus, and coda (with phonemes as fillers). In semantics, a sentence’s meaning involves roles such as AGENT and PATIENT, which are matched to specific entities as their fillers. Meanwhile, in syntax, nounphrase fillers are associated with sentenceposition roles such as subject and direct object.
One popular method for investigating vector representations is through the use of analogies, which can be expressed as equations relating vectors to each other. For example, the analogy “king is to queen as man is to woman” can be expressed as:
king  queen = man  woman
Mikolov et al. (2014) found that this analogy does (approximately) hold for the word representations learned by word2vec. That is, the vector representing king minus the vector representing queen is approximately equal to the vector representing man minus the vector representing woman.
Such analogies are compatible with filler role representations. Suppose that we decompose the meanings of the words king, queen, man, and woman as follows:
king = royal : status + male : gender
queen = royal : status + female : gender
man = commoner : status + male : gender
woman = commoner : status + female : gender
Using these representations, we get:
king  queen = ( royal : status + male : gender )  ( royal : status + female : gender )
man  woman = ( commoner : status + male : gender )  ( commoner : status + female : gender )
Both of these equations reduce to male : gender  female : gender , meaning that the equation king  queen = man  woman is expected to hold under this representation of word meaning. Thus, this fillerrole representation provides a possible explanation for why the analogy holds.
We can similarly use analogies to investigate whether the representations learned by our copying RNN are fillerrole representations. This RNN is copying sequences such as 4,3,2,1,6. We will assume that the fillers it would use are the digits in the sequence, but there are multiple possible ways that it could represent the roles. For example, it could use positions counting from left to right, or positions counting from right to left, or a bidirectional role scheme that incorporates both directions:
4 , 3 , 7 , 9 = 4 : first + 3 : second + 7 : third + 9 : fourth
4 , 3 , 7 , 9 , 6 = 4 : fourthtolast + 3 : thirdtolast + 7 : secondtolast + 9 : last
4 , 3 , 7 , 9 , 6 = 4 : (first, fourthtolast) + 3 : (second, thirdtolast) + 7 : (third, secondtolast) + 9 : (fourth, last)
Certain analogies will only work for some role schemes but not others. For example, the following analogy works with lefttoright roles because both sides reduce to 2 : third . However, it does not hold with righttoleft roles or bidirectional roles because in those cases it does not reduce in any clean way:
4,6,2  4,6 = 3,9,2  3,9
On the other hand, the next analogy works with righttoleft roles (because both sides reduce to 4 : thirdtolast ) but not for lefttoright roles or bidirectional roles:
4,6,2  6,2 = 4,3,9  3,9
This distinction can also be seen with whole numbers vs. numbers written after a decimal. Whole numbers use a righttoleft arrangement of digits, so the equation 462  62 = 439  39 is true. On the other hand, numbers after a decimal use a lefttoright arrangement of digits, so .462  .62 = .439  .39 is false, while it is true that .462  .46 = .392  .39.
Below, you can try different analogies using the vector representations generated by our copying RNN. This works by providing three sequences as three of the four elements of an analogy. The vector encodings of these three sequences are then used to determine the expected vector encoding of the fourth sequence. This expected encoding is then decoded using the decoder, and you can see whether the decoder produces the correct sequence to complete the analogy.
You can click "Generate lefttoright analogy" or "Generate righttoleft analogy" to automatically fill in analogies consistent with just one role scheme. The "Generate bidirectional analogy" button will automatically fill in an analogy that is consistent with any of these role schemes (lefttoright, righttoleft, or bidirectional). Alternately, you can click "Generate ambiguous analogies" to create an input set for which lefttoright and righttoleft role schemes would predict different answers, so that you can differentiate the two role schemes by seeing which of the two answers it does generate. Finally, you can manually type in sequences to create your own analogies.
Lefttoright prediction:  N/A 
Righttoleft prediction:  N/A 
Decoder's output:  N/A 
The most salient points to notice from these analogies are the following:
The fact that the bidirectional analogies always succeed while neither unidirectional analogy succeeds as reliably provides evidence that this model is using a fillerrole representation with bidirectional roles. However, the fact that lefttoright roles are more successful than righttoleft roles suggests that, within the model's bidirectional role scheme, the lefttoright direction is weighted more heavily than the righttoleft direction.
These analogies offer some preliminary evidence that this neural network's representations are some sort of fillerrole representation and that it uses a bidirectional role scheme. In the next section, we seek to investigate this hypothesis in a more comprehensive manner with a novel analysis tool, the Tensor Product Decomposition Network (TPDN).
For a neural network to be implementing a fillerrole representation, it must have some method for using a single, continuouslyvalued vector to represent the fillers and roles. We hypothesize that neural networks will accomplish this using Tensor Product Representations, which Smolensky (1990) introduced as a principled method for converting fillerrole representations into vectors. Our implementation of Tensor Product Representations (which is a slightly modified version of the formulation in Smolensky (1990)) works as follows:
There is nothing in the structure of a neural network that restricts it to create a Tensor Product Representation. Nonetheless, Tensor Product Representations are a powerful technique for representing compositional structure in continuous space, which is why we hypothesize that neural networks may implicitly implement this formalism when they need to represent compositional structure.
To test whether our RNN's representations can be viewed as an example of a Tensor Product Representation, we introduce a novel architecture called the Tensor Product Decomposition Network (TPDN), which attempts to fit a Tensor Product Representation onto a set of vector representations such that the vector encodings produced by the TPDN are as close as possible to the vector encodings produced by the RNN.
As input, the TPDN requires that the sequence be represented as a set of filler/role pairs. We assume that the digits in the sequence are the fillers, but there are multiple plausible techniques for assigning fillers to these roles. We investigate six possible role schemes:
Lefttoright roles: Each filler's role is its index in the sequence, counting from left to right. 
Type a sequence:


Righttoleft roles: Each filler's role is its index in the sequence, counting from left to right. 
Type a sequence:


Bidirectional roles: Each filler's role is an ordered pair containing its lefttoright index and its righttoleft index. 
Type a sequence:


Wickelroles: Each digit’s role is the digit before it and the digit after it. 
Type a sequence:


Tree positions: Each digit’s role is its position in a tree, such as RRL (left child of right child of right child of root). The tree structures are determined by an algorithm that clusters digits based on their values. 
Type a sequence:


Bagofwords: All digits have the same role. We call this a bagofwords because it represents which digits (“words”) are present and in what quantities, but ignores their positions. 
Type a sequence:

We evaluate the TPDN by seeing whether the RNN's decoder—which was trained on the encodings produced by the RNN encoder, not the encodings produced by the TPDN—can successfully produce the correct output sequence when fed an encoding from the TPDN. If it can, that is evidence that the TPDN is indeed providing a close approximation of the RNN encoder.
For each of the six role schemes above, we fitted a TPDN using that role scheme onto our copying RNN. You can test out these TPDNs in the following cell, which allows you to visually compare the vector encodings produced by the TPDN to the vector encodings produced by the RNN encoder, as well as to see if the RNN decoder can successfully decode from the TPDN's encoding:
The following chart gives the overall results. The bagofwords bar cannot be seen because its value is essentially zero:
The differences between role schemes may be hard to discern when only looking at the vectors (though upon closer inspection differences can be found). However, looking at the output sequences or the results in the chart reveals the differences more clearly. Some of the most salient points from the chart and the simulation are:
This asymmetry suggests that the model is implementing what we call mildly bidirectional roles: the representations are bidirectional in nature, but within this bidirectional role scheme the lefttoright direction is favored over the righttoleft direction. This conclusion is in line with the observations made about the analogies above.
The fact that this copying RNN appears to be implementing bidirectional roles provides a possible explanation for why it struggles with copying singledigit sequences. If a sequence is only a single digit long, then the bidirectional role scheme will assign that digit the role (1,1). However, this role occurs very rarely in the training set, since there are so few single digits; therefore, the RNN struggles with sequences involving this role.
We can use the TPDN to delve deeper into different aspects of RNN training to see how those aspects affect the representations that are learned. First, we can see how the training task affects the representations. So far we've focused on one task (copying, aka autoencoding), but now we expand the set of tasks to include three others:
The following chart displays how the training task affects the learned representation (this chart also allows you to vary the architecture being considered. See the next section for descriptions of the architectures):
There are several points to note about this graphic (all focusing on the unidirectional/unidirectional architecture):
Finally, we also investigate how the model architecture affects the learned representations. We investigate three model topologies: unidirectional, bidirectional, and treebased. Each topology can be used for the encoder and/or the decoder. The following graphic shows, for a given task, how the encoder and decoder affect the learned representations. Each architecture is denoted as encoder/decoder (e.g., "Bi/Tree" means a model with a bidirectional encoder and a treebased decoder):
There are two major things to notice from these graphics:
Analysis using TPDNs shows that neural networks can learn to use structured, compositional representations to solve symbolic tasks. However, just because they can generate such representations does not mean that they always do. The results described here all deal with heavily structuredependent tasks operating over sequences of digits. When we extend the results to sentence representations based on corpora of naturallyoccurring text, the TPDN analysis suggests that popular sentenceencoding models generally lack any sort of robust, systematic structure (see the full paper for more details). This finding is in line with what we observed on the models trained to perform sorting: Because this task does not depend on structure, the models trained to perform this task learned to discard most structural information in the input, with the result that these models' representations could be wellapproximated with bagofwords roles.
Therefore, we view our experiments on how the representation is affected by the architecture and the training task as a preliminary investigation of the situations under which neural networks will learn structured, compositional representations. Our results indicate the importance of the architecture (particularly the decoder) as well as the training task in determining the types of representations a model will learn. Extending such work may provide insight for improving current neural models to make them less brittle and better able to generalize to novel domains.
Questions? Comments? Email tom.mccoy@jhu.edu.