
Jay McClelland’s developmental model of Baby Vowel Learning

Author: nick @ July 26th, 2007 1 Comment

Article (Scientific American)

A few things about the model as reported surprise me. Jay argues that a feature of his model is that each sound is heard in isolation, sequentially (rather than in batch), but it is unfortunate that he has manually done the job of chopping the audio up into discrete vowel events: it seems he does a lot of hand segmenting and preprocessing to produce his examples. While “motherese” may help to emphasize vowels and make them easier for a baby to segment, a baby’s brain still faces a continuous audio stream to make sense of.

I want to preface the following remarks with the caveat that Jay’s paper is not yet published, and I am reacting merely to its simplified description in SciAm:

It is known that English vowels are fairly easy to cluster on the first and second formants. Before having a computer solve the problem, then, it seems that Jay transformed rich real-world data into a problem that is already known to be easy for computers to solve.
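To illustrate just how easy that formulation is: once hand-segmented vowel tokens are reduced to (F1, F2) pairs, even textbook batch k-means separates the categories cleanly. A minimal sketch on synthetic data (the formant centers below are rough, illustrative approximations of adult English vowel values, not measurements from any study):

```python
import random

# Rough, illustrative (F1, F2) vowel centers in Hz (approximate textbook
# values for adult speech, used here only to generate a synthetic demo).
CENTERS = {"i": (270, 2290), "a": (730, 1090), "u": (300, 870)}

def dist2(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def synth_tokens(n_per_vowel=50, spread=60.0, seed=0):
    """Scatter noisy (F1, F2) tokens around each vowel center."""
    rng = random.Random(seed)
    return [(rng.gauss(f1, spread), rng.gauss(f2, spread))
            for (f1, f2) in CENTERS.values()
            for _ in range(n_per_vowel)]

def kmeans(points, k=3, iters=20):
    """Plain batch k-means, with farthest-point initialization so the
    demo is deterministic."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(dist2(p, c) for c in centers)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: dist2(p, centers[j]))].append(p)
        centers = [(sum(p[0] for p in cl) / len(cl),
                    sum(p[1] for p in cl) / len(cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers

found = kmeans(synth_tokens())
print(found)  # three centers, each close to one of the true vowel centers
```

With vowel categories hundreds of Hz apart and token noise of only tens of Hz, the clusters are so well separated that almost any clustering method recovers them; that is exactly why the hard part is the segmentation and feature extraction done by hand beforehand.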

I feel that an alternative approach, one that would better push the boundaries of our knowledge of development, would be to tackle the hard problem of going from continuous, real-time, highly variable, real-world sensory data to very clean, distinct vowel representations. How difficult is this problem? How much easier does “motherese” make it? Does the benefit of motherese come more from exaggerated sounds, or from exaggerated lip movements? That last is a very interesting question. It is easy to come up with a model that you think answers it; indeed, it is easy to come up with different models that yield different answers. It is much harder to demonstrate an answer on actual sound and video of mothers speaking naturally to their infants.

It feels like the current study eschews these important questions, choosing instead to demonstrate that the general problem of clustering discrete data points can be solved in an online, “neurally plausible” fashion. Surely this has been demonstrated many times before?
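For what it’s worth, that general capability is easy to reproduce. Here is a minimal sketch of MacQueen-style online k-means, in which each incoming point nudges only its nearest prototype, one example at a time, with no batch processing. (This is a generic textbook algorithm, not McClelland’s actual model; the formant centers and the initialization are made-up values for the demo.)

```python
import random

# Illustrative (F1, F2) vowel centers in Hz (rough approximations, for demo only).
TRUE = [(270, 2290), (730, 1090), (300, 870)]

def online_kmeans(stream, protos):
    """MacQueen-style online k-means: each incoming point moves only its
    nearest prototype, with step size 1 / (number of points it has won).
    Each prototype thus tracks the running mean of the points assigned to it."""
    protos = [list(p) for p in protos]
    counts = [0] * len(protos)
    for x in stream:
        j = min(range(len(protos)),
                key=lambda i: (x[0] - protos[i][0]) ** 2
                            + (x[1] - protos[i][1]) ** 2)
        counts[j] += 1
        eta = 1.0 / counts[j]
        protos[j][0] += eta * (x[0] - protos[j][0])
        protos[j][1] += eta * (x[1] - protos[j][1])
    return protos

rng = random.Random(1)
stream = [(rng.gauss(f1, 60), rng.gauss(f2, 60))
          for _ in range(200) for (f1, f2) in TRUE]
rng.shuffle(stream)
# Start each prototype at a perturbed copy of a true center, just to keep
# the demo deterministic; real init would use e.g. the first few tokens.
init = [(f1 + 200, f2 - 200) for (f1, f2) in TRUE]
print(online_kmeans(stream, init))
```

Each point is seen once, in isolation, and discarded; after a few hundred tokens the prototypes sit near the true vowel centers. Whether such an update counts as “neurally plausible” is debatable, but the sequential, one-example-at-a-time property is not in itself novel.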

As a minor point, it is strange to me that the first two dominant frequencies are the ones presented. American vowels cluster fairly nicely on the second and third frequencies, while the first frequency (the fundamental) corresponds to voice pitch, which is irrelevant to vowel identity and averages out across speakers. Perhaps this is a confusion in the SciAm article, since these frequencies are often referred to as F0, F1, and F2.