Universal linguistic inductive biases via meta-learning


Tom McCoy, Erin Grant, Paul Smolensky, Tom Griffiths, Tal Linzen

This post accompanies the paper found here and the code found here.

Introduction

Language acquisition involves a complex interplay between the data and the learner. The importance of data is clear: we can only learn a language if we have experience with it. Less obviously, acquisition is also guided by properties of the learner called inductive biases, which determine how the learner will generalize beyond the utterances they have encountered. As an example, consider the following pattern:

What output should replace the question marks? You probably answered .ta.ra.va. even though other answers are conceivable; e.g., the provided outputs are consistent with a rule that involves reversing the input, so the answer could instead be .va.ra.ta. If you answered .ta.ra.va. instead of .va.ra.ta., it suggests that you have an inductive bias for preserving the input order.

Which inductive biases enable humans to acquire language? The answer is hotly debated. To facilitate computational modeling aimed at answering this question, we introduce a framework for giving particular linguistic inductive biases to a model. Such a model can then be used to explore the effects of those inductive biases, and to see which biases yield the most human-like generalization behavior.

In our framework, the inductive biases are encoded in the initial state of a neural network. This initial state is found with meta-learning, a technique through which a model discovers how to acquire new languages more easily via exposure to many possible languages. By controlling the properties of the languages that are used during meta-learning, we can control the set of inductive biases that meta-learning imparts.

Try it out!

To demonstrate this framework, we use the linguistic domain of syllable structure. For a given language, the model must learn how to map a sequence of sounds (e.g., tarava) into a sequence of syllables (e.g., .ta.ra.va.). Each language has restrictions on what types of syllables are allowed, and sounds may need to be inserted or deleted to meet these restrictions. For example, if a language requires all syllables to end with a vowel, the input kep might map to .ke.pa. (if the language uses insertion) or .ke. (if the language uses deletion).
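
To make the mapping concrete, here is a minimal sketch in Python (not the code used in the paper) of one such language: a toy language in which every syllable must be of the form .CV. or .V., and consonants that cannot be syllabified are repaired by deletion. The vowel inventory below is an assumption chosen for illustration.

```python
# Toy syllabifier for ONE illustrative language: syllables must be .CV. or .V.,
# and stranded consonants are repaired by deletion. This is a sketch for
# exposition, not the paper's implementation; the vowel set is assumed.
VOWELS = set("aeiouAEOU")

def syllabify(sounds: str) -> str:
    syllables = []
    i = 0
    while i < len(sounds):
        if sounds[i] in VOWELS:                      # a lone vowel forms a .V. syllable
            syllables.append(sounds[i])
            i += 1
        elif i + 1 < len(sounds) and sounds[i + 1] in VOWELS:
            syllables.append(sounds[i:i + 2])        # consonant + vowel forms a .CV. syllable
            i += 2
        else:                                        # consonant with no following vowel: deleted
            i += 1
    return ("." + ".".join(syllables) + ".") if syllables else ""

assert syllabify("tarava") == ".ta.ra.va."
assert syllabify("kep") == ".ke."   # the stranded final consonant is deleted
```

Other languages in the space differ in which syllable types they allow and in whether they repair violations by deletion or by insertion; the neural network must learn whichever mapping its language defines from example pairs alone.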

The demo below shows how a neural network initialized with meta-learning can learn syllable structure mappings much more quickly than a neural network with a standard random initialization. This demo trains the models in your browser using randomly-generated data, so none of the results that you see are cherry-picked.

Pattern to be learned: Delete sounds as necessary to ensure that the output is pronounceable, but otherwise change nothing; add periods to mark syllable boundaries.

Define a language based on the factors in Fig. 3 of the paper (middle panel): choose properties for the language from the demo's fields (constraint ranking, set of consonants, set of vowels, consonant for insertion, and vowel for insertion).

Manually define a training and test set: enter your training and test examples in the demo's text boxes (one for the training set, one for the test set), using one line per input-output pair with a comma (no spaces) separating the input and output.

Investigate absolute universals: Certain inputs will map to the same output regardless of language. Thus, our models will ideally be able to handle these inputs even without any training. Click these buttons to test the models on such inputs. C stands for consonant, and V stands for vowel; any CV input should map to .CV., while any CVCV input should map to .CV.CV.
Training examples: 0

Test set predictions after training on 0 examples

Input   Correct output   Meta-initialized model's output   Randomly-initialized model's output
rOau    .rO.a.u.         .rO.a.u.                          .rO.a.u.
axxaO   .a.xa.O.         .a.xa.O.                          .a.xa.O.
rxxa    .xa.             .xa.                              .xa.
axrxu   .a.xu.           .a.xu.                            .a.xu.
ttxaO   .xa.O.           .xa.O.                            .xa.O.
rOau    .rO.a.u.         .rO.a.u.                          .rO.a.u.

How to use: Click "Train on one example" to show the models a single input-output pair (the training is done in your browser, so none of the results that you see have been pre-computed). Click "Random new language" to restart the demo with a new language.

The meta-initialized model can typically learn any of these syllable structure mappings from fewer than 100 examples—sometimes substantially fewer. In contrast, after the same number of examples, the outputs of the randomly initialized model are usually nowhere near the correct answers.

In the class of languages that we include in our experiment, certain inputs will map to the same output across all languages; such mappings are therefore absolute universals, or properties that all languages possess. For example, an input of the form CVCV will always map to .CV.CV. (where C stands for any consonant, V stands for any vowel, and periods indicate word and syllable boundaries). Even though some languages allow syllables of the form .CVC. and .V., the input CVCV will only ever map to .CV.CV., never to .CVC.V. The meta-initialized model typically gets the correct output for such inputs without any training in the language, indicating that the meta-learning process has imparted these absolute universals. You can test these absolute universals by clicking "Show advanced settings" in the demo.

How many examples does the randomly-initialized model need before it begins performing well? The table below should give some sense of this; it contains data from one language for each of the 8 abstract categories of languages that we use in our experiments. On these languages, the randomly-initialized model typically needs a few thousand examples, compared to a few dozen for the meta-initialized model.

Click to select a language type (each type is defined by a constraint ranking); each of the tables below corresponds to one language type.

input correct output 0 iterations 1,000 iterations 2,000 iterations 3,000 iterations 4,000 iterations 5,000 iterations 6,000 iterations 7,000 iterations 8,000 iterations 9,000 iterations 10,000 iterations
uttxr .u. pzrrrrrrrr .u.xu. .u. .u. .u. .u. .u. .u. .u. .u. .u.
raxur .ra.xu. pzrrrrrrrr .ra.xu. .ra.xu. .ra.xu. .ra.xu.u. .ra. .ra.xu. .ra.xu. .ra.xu. .ra.xu. .ra.xu.
Oar .O.a. pzrrrrrrrr .O.ra. .O.a. .O.a. .O.a. .O.a. .O.a. .O.a. .O.a. .O.a. .O.a.
atrau .a.ra.u. pzrrrrrrrr .a.ra. .a.ra.u. .a.ru.u. .a.ra.u. .a.ra. .a.ra.u. .a.ra.u. .a.ra.u. .a.ra.u. .a.ra.u.
tOuxa .tO.u.xa. pzrrrrrrrr .tO.xo.x. .tO.u.xa. .tO.u.xa. .tO.u.xa. .tO.u.xa. .tO.u.xa. .tO.u.xa. .tO.u.xa. .tO.u.xa. .tO.u.xa.
xruto .ru.tO. prrrrrrrrr .tu.xO. .tu. .ru.tO. .ru.tO.O. .ru.tO. .ru.tO. .ru.tO. .ru.tO. .ru.tO. .ru.tO.
xOxra .xO.ra. pzrrrrrrrr .xO.xa. .xO.xa. .xO.ra. .xO.ra. .xO.ra. .xO.ra. .xO.ra. .xO.ra. .xO.ra. .xO.ra.
uuuuO .u.u.u.u.O. xhUjjjjjjj .u.uu.O.. .u.u.u.u. .u.u.u.u.O. .u.u.u.O.O. .u.u.u.O. .u.u.u.u. .u.u.u.u.O. .u.u.u.u.O. .u.u.u.u.u. .u.u.u.u.O.
xt <empty> prrrrrrrrr <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty>
aata .a.a.ta. pai.iii.ii .a.ta. .a.a.ta. .a.a.ta. .a.a.ta. .a.a.ta. .a.a.ta. .a.a.ta. .a.a.ta. .a.a.ta. .a.a.ta.
input correct output 0 iterations 1,000 iterations 2,000 iterations 3,000 iterations 4,000 iterations 5,000 iterations 6,000 iterations 7,000 iterations 8,000 iterations 9,000 iterations 10,000 iterations
UUEm .U.U.Em. emmm....Ok .U.mU.E.. .U.E.m .U.U.Em. .U.U.Em. .U.U.Em. .U.U.Em. .U.U.E.Em. .U.U.Em. .U.U.Em. .U.U.Em.
pppE .pEp.pE. emmm....Ok .pE.pE.EE.. .pEp.pE.. .pEp.pE. .pE..pE. .pEp.pE. .pEp.pE. .pEp.pE. .pEp.pE. .pEp.pE. .pEp.pE.
UmE .U.mE. emmm....Ok .U.mE.. .U.mE. .U.mE. .U.mE. .U.mE. .U.mE. .U.mE. .U.mE. .U.mE. .U.mE.
<empty> <empty> emmm....Ok .EE. .EE. .EE. .EE. .xE. .cE. .xE. .xE. .xE. .cE.
xEpcU .xEp.cU. eNmmm....O .xE....... .xE..pE..U. .xE..pU. .xE..cc.UU. .xEp.cU. .xEp.cU. .xEp.cU. .xEp.cU. .xEp.cU. .xEp.cU.
mExE .mE.xE. emmm....Ok .mE.EE.E. .mExxE.. .mE.xE. .mE.xE. .mE.xE. .mE.xE. .mE.xE. .mE.xE. .mE.xE. .mE.xE.
UUUEE .U.U.U.E.E. emmm....Ok .U.UU.E.E.. .U.U.EE. .U.U.U.E. .U.U.U.E.E. .U.U.U.E.E. .U.U.U.E.E. .U.U.U.E.E. .U.U.U.E.E. .U.U.U.E.E. .U.U.U.E.E.
Exppp .E.xEp.pEp. eNmmm....O .E.xEppEpp. .Ex.pEppEp. .E.xEp.pEp. .E.xEp.pEp. .E.xEp.pEp. .E.xEp.pEp. .E.xEp.pEp. .E.xEp.pEp. .E.xEp.pEp. .E.xEp.pEp.
UpEUc .U.pE.Uc. emmm....O .U..pE.E... .U.pE.U.. .U.pE.cc. .U.pE.Uc. .U.pE.Uc. .U.pE.Uc. .U.pE.Uc. .U.pE.Uc. .U.pE.Uc. .U.pE.Uc.
EUUpm .E.U.U.pEm. emmm....Ok .U.E..E..E. .E.U.pEp. .E.U.U.pEm. .E.U.U.pEm. .E.U.U.pEm. .E.U.U.pEm. .E.U.U.pEm. .E.U.U.pEm. .E.U.U.pEm. .E.U.U.pEm.
input correct output 0 iterations 1,000 iterations 2,000 iterations 3,000 iterations 4,000 iterations 5,000 iterations 6,000 iterations 7,000 iterations 8,000 iterations 9,000 iterations 10,000 iterations
Aktt .kAk.tUt. U.kA.kkU.. .kA.tUt. .kAk.UUt. .kAk.tUt. .kAk.tUt. .kAk.tUt. .kAk.tUt. .kAk.tUt. .kAk.tUt. .kAk.tUt. .kAk.tUt.
aukk .ka.ku.kUk. UU.kkU..k .ka.kukk .ka.ku.kUk. .ka.ku.kUk. .ka.ku.kUk. .ka.ku.kUk. .ka.ku.kUk. .ka.ku.kUk. .ka.ku.kUk. .ka.ku.kUk. .ka.ku.kUk.
Uta .kU.ta. UU.kkU..kk .kU.ka.. .kU.ta. .kU.ta. .kU.ta. .kU.ta. .kU.ta. .kU.ta. .kU.ta. .kU.ta. .kU.ta.
tak .tak. UU.kkU..kk .tU.. .ta.k .tak. .tak. .tak. .tak. .tak. .tak. .tak. .ta.kak.
ttktA .tU.tUk.tA. nkaaaaaaaa .tU.tUt. .tU..Ut.kA. .tU.tUt.tA. .tU.tUk.tA. .tU.tUk.tA. .tU.tUk.tA. .tU.tUk.tA. .tU.tUk.tA. .tU.tUk.tA. .tU.tUk.tA.
UtAUa .kU.tA.kU.ka. UU.kkU..kk .kU.kU.ka.ka. .kU.tA.kA.ka. .kU.tA.kU.ka. .kU.tA.kU.ka. .kU.tA.kU.ka. .kU.tA.kU.ka. .kU.tA.kU.ka. .kU.tA.kU.ka. .kU.tA.kU.ka. .kU.tA.kU.ka.
kuUta .ku.kU.ta. U.kU.kkU.. .kU.kU.ka.. .ku.kU.tU. .ku.kU.ta. .ku.kU.ta. .ku.kU.ta. .ku.kU.ta. .ku.kU.ta. .ku.kU.ta. .ku.kU.ta. .ku.kU.ta.
katta .kat.ta. U.kU.kkU.. .ka.tat.. .kat.ta. .kat.ta. .kat.ta. .kat.ta. .kat.ta. .kat.ta. .kat.ta. .kat.ta. .kat.ta.
U .kU. UUkkkaaaaa .kU. .kU. .kU. .kU. .kU. .kU. .kU. .kU. .kU.
kattu .kat.tu. U.kkU..kkU .ka.tut .kat.ku. .kat.tu. .kat.tu. .kat.tu. .kat.tu. .kat.tu. .kat.tu. .kat.tu. .kat.tu.
input correct output 0 iterations 1,000 iterations 2,000 iterations 3,000 iterations 4,000 iterations 5,000 iterations 6,000 iterations 7,000 iterations 8,000 iterations 9,000 iterations 10,000 iterations
btubo .bU.tu.bo. uttjutjutj .bU.bo.fo. .bU.tu.bo. .bU.tu.bo. .bU.tu.bo. .bU.tu.bo. .bU.tu.bo. .bU.tu.bo. .bU.tu.bo. .bU.tu.bo. .bU.tu.bo.
uuutb .fu.fu.fu.tU.bU. tNujutjutj .fu.fu.fu.bU.bU. .fu.fu.tU.bU.bU. .fu.fu.tU.bU.bU. .fu.fu.fu.tU. .fu.fu.fu.tU.bU. .fu.fu.tU.bU.bU. .fu.fu.tU.bU.bU. .fu.fu.fu.tU.bU. .fu.fu.fu.tU.bU. .fu.fu.fu.tU.bU.
fffbf .fU.fU.fU.bU.fU. jutjutjutj .fU.fU.bU.bU. .fU.fU.bU.fU. .fU.fU.fU.bU.fU. .fU.fU.fU.fU.fU. .fU.fU.bU.fU.fU. .fU.fU.bU.bU.fU. .fU.fU.fU.bU.fU. .fU.fU.fU.bU.fU. .fU.fU.fU.bU.fU. .fU.fU.fU.bU.fU.
votuv .vo.tu.vU. uttjutjutj .vo.vo.vU.vU. .vo.tu.vU. .vo.tu.vU. .vo.tu.vU. .vo.tu.vU. .vo.tu.vU. .vo.tu.vU. .vo.tu.vU. .vo.tu.vU. .vo.tu.vU.
uoUUf .fu.fo.fU.fU.fU. jutjutjutj .fu.fo.fU.fU.fU. .fu.fo.fU.fU. .fu.fo.fU.fU.fU. .fu.fo.fU.fU.fU. .fu.fo.fU.fU.fU. .fu.fo.fU.fU.fU. .fu.fo.fU.fU.fU. .fu.fo.fU.fU.fU. .fu.fo.fU.fU.fU. .fu.fo.fU.fU.fU.
vuufu .vu.fu.fu. tNujutjutj .vu.fu.fu. .vu.fu.fu. .vu.fu.fu. .vu.fu.fu. .vu.fu.fu. .vu.fu.fu. .vu.fu.fu. .vu.fu.fu. .vu.fu.fu. .vu.fu.fu.
ubuu .fu.bu.fu. rZAbwjutju .fu.bu.fu. .fu.bu.fu. .fu.bu.fu. .fu.bu.fu. .fu.bu.fu. .fu.bu.fu. .fu.bu.fu. .fu.bu.fu. .fu.bu.fu. .fu.bu.fu.
fvuof .fU.vu.fo.fU. jutjutjutj .fU.fo.fU.fU. .fU.vu.fo.fU. .fU.vu.fo.fU. .fU.vu.fo.fU. .fU.vu.fo.fU. .fU.vu.fo.fU. .fU.vu.fo.fU. .fU.vu.fo.fU. .fU.vu.fo.fU. .fU.vu.fo.fU.
vuofu .vu.fo.fu. tNujutjutj .vu.fo.fu.fU. .vu.fo.fu. .vu.fo.fu. .vu.fo.fu. .vu.fo.fu. .vu.fo.fu. .vu.fo.fu. .vu.fo.fu. .vu.fo.fu. .vu.fo.fu.
ootUU .fo.fo.tU.fU. ffffffffff .fo.fo.fU.tU.tU. .fo.fo.tU.fU. .fo.fo.tU.fU. .fo.fo.tU.fU. .fo.fo.tU.fU. .fo.to.tU.fU. .fo.to.tU.fU. .fo.to.tU.fU. .fo.to.tU.fU. .fo.to.tU.fU.
input correct output 0 iterations 1,000 iterations 2,000 iterations 3,000 iterations 4,000 iterations 5,000 iterations 6,000 iterations 7,000 iterations 8,000 iterations 9,000 iterations 10,000 iterations
Ebhhh .Eh. iibEuccEcc .Eh. .Eh.hh. .Eh. .Eh. .Eh. .Eh. .Eh. .Eh. .Eh. .Eh.
AAr .A.Ar. nquccEccEc .A.AA. .A.A.Ar. .A.Ar. .A.Ar. .A.Ar. .A.Ar. .A.Ar. .A.Ar. .A.Ar. .A.Ar.
EhrA .Eh.rA. bEuccEccEc .Er.AA. .Er.rA. .Eh.hA. .Er.rA. .Eh.rA. .Eh.rA. .Eh.rA. .Eh.rA. .Eh.rA. .Eh.rA.
AhrhA .Ar.hA. xuccEccEcc .A.AA. .Ah.hA..A. .Ah.hA. .Ah.rA. .Ah.hA. .Ar.hA. .Ar.hA. .Ar.hA. .Ar.hA. .Ar.hA.
AAhEE .A.A.hE.E. ujccEccEcc .A.A.EE. .A.A.rE.E. .A.A.hE. .A.A.hE.E. .A.A.hA.E. .A.A.hE.E. .A.A.hE.E. .A.A.hE.E. .A.A.hE.E. .A.A.hE.E.
rhA .hA. xuccEccEcc .hA. .hA. .hA. .hA. .hA. .hA. .hA. .hA. .hA. .hA.
EArrA .E.Ar.rA. xuccEccEcc .E.AA.AA. .E.Ar.rA. .E.Ar.rA. .E.Ar.rA. .E.Ar.rA. .E.Ar.rA. .E.Ar.rA. .E.Ar.rA. .E.Ar.rA. .E.Ar.rA.
AEErE .A.E.E.rE. ibEuccEccE .A.E.E.. .A.E.rE.E. .A.E.EE.E. .A.E.E.rE. .A.E.E.rE. .A.E.E.rE. .A.E.E.rE. .A.E.E.rE. .A.E.E.rE. .A.E.E.rE.
EEEE .E.E.E.E. ujccEccEcc .E.E.E.E. .E.E.E.E. .E.E.E.E.. .E.E.E.E. .E.E.E.E. .E.E.E.E. .E.E.E.E. .E.E.E.E. .E.E.E.E. .E.E.E.E.
hrAA .rA.A. xxuccEccEc .rA.A. .rA.A. .rA.A. .rA.A. .rA.A. .rA.A. .rA.A. .rA.A. .rA.A. .rA.A.
input correct output 0 iterations 1,000 iterations 2,000 iterations 3,000 iterations 4,000 iterations 5,000 iterations 6,000 iterations 7,000 iterations 8,000 iterations 9,000 iterations 10,000 iterations
rUUUr .rU.U.U.re. fddddddddd .rU.U.re. .rU.U.U.re. .rU.U.U.re. .rU.U.U.re. .rU.U.U.U.e. .rU.U.U.re. .rU.U.U.re. .rU.U.U.re. .rU.U.U.re. .rU.U.U.re.
UU .U.U. xcrfdddddd .U.U. .U.U. .U.U. .U. .U.U. .U. .U. .U. .U. .U.
heee .he.e.e. xbxbkxbkxb .he.e.e. .he.e.e. .he.e.e. .he.e.e. .he.e.e.e. .he.e.e. .he.e.e. .he.e.e. .he.e.e. .he.e.e.
eger .e.ge.re. ehcrfddddd .e.ge.re. .e.ge.re. .e.ge.re. .e.ge.re. .e.ge.re. .e.ge.re. .e.ge.re. .e.ge.re. .e.ge.re. .e.ge.re.
heeUU .he.e.U.U. xSxcrfdddd .he.e.e... .he.e.U.U. .he.e.U.U. .he.e.U.U. .he.e.U.U. .he.e.U.U. .he.e.U.U. .he.e.U.U. .he.e.U.U. .he.e.U.U.
heUh .he.U.he. xtxOkkxkkx .he.h..e. .he.U.he. .he.U.he. .he.U.he. .hU.U.e.he. .he.U.he. .he.U.he. .he.U.he. .he.U.he. .he.U.he.
rrUU .re.rU.U. xtxOkkxkkx .re.rU.UU. .re.rU.U. .re.rU.U. .re.rU.U. .re.rU.U.U. .re.rU.U. .re.rU.U. .re.rU.U. .re.rU.U. .re.rU.U.
gr .ge.re. fddddddddd .ge.re. .ge.re. .ge.re. .ge.re. .ge.re. .ge.re. .ge.re. .ge.re. .ge.re. .ge.re.
grreg .ge.re.re.ge. fddddddddd .ge.re.ge.ge. .ge.re.re.ge. .ge.re.re.ge. .ge.re.ge.ge. .ge.re.re.ge.ge. .ge.re.re.ge. .ge.re.re.ge.ge. .ge.re.re.ge. .ge.re.re.ge. .ge.re.re.ge.
UeerU .U.e.e.rU. xtxOkkxkkx .U.U.rU.U. .U.e.e.rU. .U.e.e.rU. .U.e.e.rU. .U.e.e.rU. .U.e.e.rU. .U.e.e.rU. .U.e.e.rU. .U.e.e.rU. .U.e.e.rU.
input correct output 0 iterations 1,000 iterations 2,000 iterations 3,000 iterations 4,000 iterations 5,000 iterations 6,000 iterations 7,000 iterations 8,000 iterations 9,000 iterations 10,000 iterations
Uuru .ru. ssssssssss .ru. .ru. .ru. .ru. .ru. .ru. .ru. .ru. .ru. .ru.
rUj .rU. jfffffffff .rU. .rU. .rU. .rU. .rU. .rU. .rU. .rU. .rU. .rU.
udub .du. jfffffffff .du. .du. .du. .du. .du. .du. .du. .du. .du. .du.
uUUUU <empty> kmffffffff <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty>
jUu .ju. kmffffffff .ju. .ju. .ju. .ju. .ju. .ju. .ju. .ju. .ju. .ju.
uUuUu <empty> ssssssssss <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty>
bbudU .bu.dU. kmffffffff .bu.dU. .bu.dU. .bu.dU. .bu.dU. .bu.dU. .bu.dU. .bu.dU. .bu.dU. .bu.dU. .bu.dU.
Uubjj <empty> jfffffffff <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty>
UU <empty> kmffffffff <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty>
juUb .jU. kmffffffff .jU. .jU. .jU. .jU. .jU. .jU. .jU. .jU. .jU. .jU.
input correct output 0 iterations 1,000 iterations 2,000 iterations 3,000 iterations 4,000 iterations 5,000 iterations 6,000 iterations 7,000 iterations 8,000 iterations 9,000 iterations 10,000 iterations
fUxa .fU.xa. g.xwiiss.l .fU.fU.. .fU..a. .fU.xa. .fU.xa. .fU.xa. .fU.xa. .fU.xa. .fU.xa. .fU.xa. .fU.xa.
ama .ma. ..is.isrg. .ma. .ma. .ma. .ma. .ma. .ma. .ma. .ma. .ma. .ma.
xxx <empty> Oeg.xwiiss <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty> <empty>
faxUU .fa.xU. eg.xwiisse .fU..U. .fU.fU. .fa.xU. .fa.xU. .fa.xU. .fa.xU. .fa.xU. .fa.xU. .fa.xU. .fa.xU.
axUUU .xU. egk.xiiss. .xU..UU. .xU. .xU. .xU. .xU. .xU. .xU. .xU. .xU. .xU.
axU .xU. eg.xwiisse .xU. .xU. .xU. .xU. .xU. .xU. .xU. .xU. .xU. .xU.
Ufaaa .fa. ..xiiss.ri .fa. .fa. .fa. .fa. .fa. .fa. .fa. .fa. .fa. .fa.
xdUd .dUd. Uk.isrg.xi .dUd. .dUd. .dUd. .dUd. .dUd. .dUd. .dUd. .dUd. .dUd. .dUd.
UxfdU .dU. f.r.iiss.r .dU. .dU. .dU. .dU. .dU. .dU. .dU. .dU. .dU. .dU.
mxma .ma. ..is.isrg. .ma. .ma. .ma. .ma. .ma. .ma. .ma. .ma. .ma. .ma.
The performance of a randomly-initialized model after being trained on various numbers of examples. You may need to scroll to the right to see the full table.

Generalization

The main goal of our framework is to give a model a certain set of inductive biases. While the ability to learn from a small number of examples is one piece of evidence that a model has acquired some useful inductive biases, a more direct way to study a learner's inductive biases is through generalization: how does the learner handle novel types of examples?

The figure below shows how meta-learning facilitates three types of generalization: generalization to a novel length, generalization to novel symbols, and generalization to novel input structures (we refer to examples of the last category as implicational universals). In each case, we train our model on a training set which has certain types of examples withheld, then evaluate it on the withheld class of examples. Click the tabs above the figure to explore these three types of generalization.
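
Concretely, each of these evaluations amounts to withholding a class of examples when building the training set. The sketch below (not the paper's code) illustrates the idea; the placeholder pairs are borrowed from examples elsewhere in this post, and the length cutoff, familiar-symbol set, and vowel inventory are assumptions for illustration.

```python
# Sketch of how held-out evaluation sets can be built by filtering (input, output)
# pairs. The pairs below are illustrative placeholders drawn from different
# examples in this post; in the experiments, each split uses a single language.
pairs = [("Ej", ".Ej."), ("is", ".i."), ("fir", ".fi."), ("tarava", ".ta.ra.va.")]
VOWELS = set("aeiouAEOU")   # assumed vowel inventory, used to determine an input's shape

def shape(sounds: str) -> str:
    """Describe an input as a string of C (consonant) and V (vowel) symbols."""
    return "".join("V" if ch in VOWELS else "C" for ch in sounds)

# Novel lengths: train on inputs of length at most 4, test on longer inputs.
train_length = [p for p in pairs if len(p[0]) <= 4]
test_length = [p for p in pairs if len(p[0]) > 4]

# Novel symbols: train on inputs built only from the familiar symbols,
# test on inputs containing at least one unfamiliar symbol.
familiar = set("EAjxpc")
train_symbols = [p for p in pairs if set(p[0]) <= familiar]
test_symbols = [p for p in pairs if not set(p[0]) <= familiar]

# Implicational universals: train on vowel-consonant inputs,
# test on consonant-vowel-consonant inputs.
train_implicational = [p for p in pairs if shape(p[0]) == "VC"]
test_implicational = [p for p in pairs if shape(p[0]) == "CVC"]
```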

Generalization to a novel length: Models are trained on inputs of length at most 4, then evaluated on inputs of the type they have seen before or on longer inputs.

Predictions for inputs of length 4 or less (the models have seen inputs of these lengths, but not these particular inputs)

Input   Correct output   Meta-initialized model's output   Randomly-initialized model's output
jkk     .ja.ka.ka.       .ja.ka.ka.                        .ja.ka.ka.
iji     .ji.ji.          .ji.ji.                           .ji.ji.
iEEE    .ji.jE.jE.jE.    .ji.jE.jE.jE.                     .ji.jE.jE.jE.
Oij     .jO.ji.ja.       .jO.ji.ja.                        .jO.ji.ja.
xaOO    .xa.jO.jO.       .xa.jO.jO.                        .xa.jO.jO.
axEk    .ja.xE.ka.       .ja.xE.ka.                        .ja.xE.ka.

Predictions for inputs of length 5 (models have not seen any inputs of this length)

Input   Correct output   Meta-initialized model's output   Randomly-initialized model's output
jxaai   .ja.xa.ja.ji.    .ja.xa.ja.ji.                     .ja.xa.ji.
jjEjO   .ja.jE.jO.       .ja.jE.jO.                        .ja.jE.jO.
jOOja   .jO.jO.ja.       .jO.jO.ja.                        .jO.jO.ja.
jkkaa   .ja.ka.ka.ja.    .ja.ka.ka.ja.                     .ja.ka.ja.
Ejjaa   .jE.ja.ja.ja.    .jE.ja.ja.ja.                     .jE.ja.ja.
OEOja   .jO.jE.jO.ja.    .jO.jE.jO.ja.                     .jO.jE.jO.


Comments: During meta-training, all languages featured inputs up to length 5. Thus, this experiment only requires generalization within the bounds seen during meta-learning. We also tested how models generalized from lengths at most 5 to length 6; in this case, the model initialized with meta-learning did not perform as well as it does in generalizing to length 5, suggesting that its generalization capabilities are restricted to the types of patterns seen during meta-learning.
Implicational universal: Models are trained on mappings of the form vowel-consonant → vowel, then evaluated on generalization to inputs of the form consonant-vowel-consonant. In the framework we use, any language that maps vowel-consonant to vowel will also map consonant-vowel-consonant to consonant-vowel.

Predictions for inputs of the form vowel-consonant (models have seen this type of input, but not these particular examples)

Input   Correct output   Meta-initialized model's output   Randomly-initialized model's output
is      .i.              .i.                               .i.
Um      .U.              .U.                               .U.
iq      .i.              .i.                               .i.
or      .o.              .o.                               .o.
Er      .E.              .E.                               .E.
Ex      .E.              .E.                               .E.

Predictions for inputs of the form consonant-vowel-consonant (models have not seen this type of input during training)

Input   Correct output   Meta-initialized model's output   Randomly-initialized model's output
fir     .fi.             .fi.                              .o.
bit     .bi.             .bi.                              .i.
bac     .ba.             .ba.                              .e.
hoh     .ho.             .ho.                              .o.
vAk     .vA.             .vA.                              .A.
cOx     .cO.             .cO.                              .O.


Comments: The randomly-initialized model appears to have learned a constraint on the form of the output, namely that it must consist of a single vowel. This constraint holds for the training set but not for the generalization cases that we test, which is why this model generalizes poorly.
Generalization to novel symbols: Models are trained on inputs containing only the symbols E, A, j, x, p, and c, then are tested on generalization to inputs containing novel symbols.

Predictions for inputs composed of familiar sounds

Input   Correct output   Meta-initialized model's output   Randomly-initialized model's output
Ejpjj   .E.jEp.jEj.      .E.jEp.jEj.                       .E.jEp.jEj.
AjjjE   .A.jEj.jE.       .A.jEj.jE.                        .A.jEj.jE.
jxAxp   .jE.xA.xEp.      .jE.xA.xEp.                       .jE.xA.xEp.
EAjEA   .E.A.jE.A.       .E.A.jE.A.                        .E.A.jE.A.
Ej      .Ej.             .Ej.                              .Ej.
pEA     .pE.A.           .pE.A.                            .pE.A.

Predictions for inputs composed of unfamiliar sounds

Input   Correct output   Meta-initialized model's output   Randomly-initialized model's output
eXr     .e.XEr.          .e.XEr.                           .A.E.XE.
Oenw    .O.e.nEw.        .O.e.nEw.                         .A.A.cEc.
UUAfi   .U.U.A.fi.       .U.U.A.fi.                        .cE.cA.Ej.
kdAri   .kE.dA.ri.       .kE.dA.ri.                        .cE.Aj.
eh      .eh.             .eh.                              .A.A.
tnsvU   .tE.nEs.vU.      .tE.nEs.vU.                       .jE.xEj.jEc.


Comments: The randomly-initialized model has no hope of success in this case, since it has no way to know whether each novel symbol is a consonant or vowel, which is critical information for processing each input. However, the meta-initialized model could have learned this information during its meta-training phase, and its successful generalization shows that it has indeed done so.

Across all three cases, the meta-initialized model generalizes well while the randomly-initialized model generalizes poorly. These results indicate that meta-learning has successfully imparted a bias for the length-invariance of syllable patterns, a bias for assuming two universal classes of symbols (namely, consonants and vowels), and a bias for associating certain mappings with certain other mappings (a.k.a. implicational universals).

Linguistic universals are typically divided into two categories: absolute universals, which are properties that all languages possess, and implicational universals, which are cases where the presence of one property in a language implies the presence of another property. The previous section showed that meta-learning had instilled some absolute universals in the model, while this section shows that it has also instilled some implicational universals. Thus, meta-learning can impart both of the major categories of linguistic universals.

See the paper for more discussion of these experiments and for an investigation of three other inductive biases using an alternative technique, namely ease of learning (that is, which types of languages are easiest for a model to learn?).

How does it work?

Our method can be summarized as follows:

  1. Determine the set of linguistic inductive biases that you wish to impart to a model.
  2. Define a space of languages encoding those inductive biases (see the sketch after this list).
  3. Have a model meta-learn from this space of languages; this process is intended to give the model an initial parameter setting from which it can easily learn any language in the space of languages you have defined.
  4. Analyze the model to verify that meta-learning has given it the intended inductive biases.
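
As a toy illustration of step 2, a "language" can be represented as a bundle of sampled properties, loosely mirroring the factors used in the demo above (a constraint ranking, symbol inventories, and segments used for insertion). The pools and constraint names below are illustrative assumptions, not the paper's definitions.

```python
# A toy sketch of a parameterized space of languages. Sampling a language means
# sampling its properties; the pools and constraint names are illustrative only.
import random

CONSONANT_POOL = list("ptkbdgmnrx")    # assumed pool of possible consonants
VOWEL_POOL = list("aeiouAEOU")         # assumed pool of possible vowels
CONSTRAINTS = ["Onset", "NoCoda", "Max", "Dep"]   # illustrative constraint names

def sample_language(rng: random.Random) -> dict:
    """Sample one language: a constraint ranking plus symbol inventories."""
    ranking = CONSTRAINTS.copy()
    rng.shuffle(ranking)
    return {
        "constraint_ranking": ranking,
        "consonants": rng.sample(CONSONANT_POOL, 4),
        "vowels": rng.sample(VOWEL_POOL, 2),
        "consonant_for_insertion": rng.choice(CONSONANT_POOL),
        "vowel_for_insertion": rng.choice(VOWEL_POOL),
    }

rng = random.Random(0)
languages = [sample_language(rng) for _ in range(3)]   # meta-learning samples many such languages
print(languages[0])
```

Each meta-learning episode then trains on input-output pairs generated by one sampled language.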

The core component of our method is meta-learning; the particular type of meta-learning that we use is MAML (Model-Agnostic Meta-Learning). In MAML, a model is exposed to a variety of tasks, each of which comes with a limited amount of data. After exposure to each task, the model's weights are adjusted so that, if it were taught the same task again, it would perform better. As MAML proceeds, the model converges to an initial state from which it can readily learn any task in the distribution of tasks it has been shown. The demo below gives an example of MAML.

Demonstration of meta-learning using MAML. The model is a simple linear function with 2 parameters (a slope and an intercept): the green star in the left plot shows the parameters, and the solid green line in the right plot shows the function defined by these parameters. This model is exposed to many target tasks, each a linear function defined by a purple square in the left plot. All of these purple squares are drawn from the perimeter of the circle, providing some similarity across tasks that the meta-learning process can exploit. For each target task, a copy of the model is trained on 10 examples (the purple points in the right plot), yielding the parameters at the orange circle in the left plot and the function shown by the orange dotted line in the right plot. Based on how close this copy is to the target, the model's initialization is adjusted. After many iterations, the model's initialization reaches a point from which the model can learn any target task—somewhere in the middle of the circle in the left plot.

If meta-learning is successful, the model's initialization will encode some set of inductive biases that are useful for learning the space of tasks that the model is being shown. In the demo, the initialization that the model converges to encodes a bias for the slope to be close to -3 and for the intercept to be close to -2.
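
To show the mechanics, here is a minimal PyTorch sketch of this toy setting (not the demo's actual code). The circle's center is set to (-3, -2) to match the bias described above; the radius, learning rates, and number of meta-training steps are assumptions chosen for illustration.

```python
import math
import torch

torch.manual_seed(0)
theta = torch.zeros(2, requires_grad=True)    # [slope, intercept]: the meta-learned initialization
meta_opt = torch.optim.SGD([theta], lr=0.01)
CENTER = torch.tensor([-3.0, -2.0])           # matches the bias described in the text
RADIUS = 2.0                                  # assumed radius, for illustration

def sample_task():
    """Pick a target linear function whose (slope, intercept) lies on the circle."""
    angle = 2 * math.pi * torch.rand(1).item()
    return CENTER + RADIUS * torch.tensor([math.cos(angle), math.sin(angle)])

def sample_batch(true_params, n=10):
    """Generate n (x, y) examples from the target function (10 per task, as in the demo)."""
    x = torch.randn(n)
    return x, true_params[0] * x + true_params[1]

def loss(params, x, y):
    return ((params[0] * x + params[1] - y) ** 2).mean()

for step in range(2000):
    true_params = sample_task()
    x_support, y_support = sample_batch(true_params)   # data for the inner (adaptation) step
    x_query, y_query = sample_batch(true_params)       # data for evaluating the adapted copy

    # Inner loop: adapt a copy of the initialization with one gradient step.
    grad = torch.autograd.grad(loss(theta, x_support, y_support), theta, create_graph=True)[0]
    adapted = theta - 0.1 * grad

    # Outer loop: update the initialization so the adapted copy would have done better.
    meta_opt.zero_grad()
    loss(adapted, x_query, y_query).backward()
    meta_opt.step()

print(theta.detach())   # should end up near the circle's center, roughly (-3, -2)
```

The inner loop adapts a copy of the current initialization to one sampled task; the outer loop then nudges the initialization so that the adapted copy would have done better, which is what pulls the initialization toward the center of the circle.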

In this simple example, it would have been easy to set the model's parameters by hand in a way that encoded this bias. However, such hand-coding of biases is not practical with standard neural networks, which typically have far more parameters (often tens of millions, or even 175 billion) that interact with each other in difficult-to-interpret ways. In such cases where manual parameter setting is impractical, meta-learning can automatically discover parameter settings that encode the desired biases.

In our application of meta-learning, each “task” is a language, and the goal is to find a set of initial parameter values that allow the model to quickly learn any language. We hypothesize that, if we carefully construct the space of languages used during meta-learning to encode a desired set of inductive biases, we can impart these inductive biases in order to study how they affect generalization behavior.

Conclusions and future directions

We have shown how meta-learning can impart inductive biases specified by the modeler. While the meta-learned biases are not as transparent as those encoded in probabilistic symbolic models, analysis of the model's learning behavior can be used to evaluate whether meta-learning has produced the desired biases, as we have shown. Our case study demonstrates that linguistic inductive biases that have previously been framed in symbolic terms can be reformulated in the context of neural networks, facilitating cognitive modeling that combines the power of neural networks with the controlled inductive biases of symbolic methods.

In future work, we plan to use this approach to study open questions in cognitive science. For example, by creating a model that has a particular bias which has been argued to be important for human cognition, we can empirically test whether the bias has the explanatory power that it has been argued to have. Alternatively, while our case study involved giving a model certain pre-selected biases, this approach could instead be applied to naturally-occurring linguistic data to lend insight into the inductive biases that shaped this data. Finally, this approach might have applications in artificial intelligence: by giving targeted, cognitively-motivated learning biases to models, we may be able to decrease the gap between the learning capabilities of AI models and humans.

See the full paper for more details. If you have questions or comments, email tom.mccoy@jhu.edu.

*****

Some fine print about this website: The demos differ from the experiments in the paper in a few ways designed to decrease the computational load for the website. First, the models on the website are GRUs, while in the paper we used LSTMs. Second, training in the demos uses a batch size of 1 with stochastic gradient descent, while the paper uses a batch size of 10 with the Adam optimizer. Thus, the performance of the demo models is not guaranteed to be the same as the performance of the models used in the paper. Finally, on this page, we have excluded any examples containing lowercase L or capital i, as the similarity of these letters sometimes makes examples hard to read, but both letters were available in the experiments in the paper.