Psycholinguistics/Connectionist Models

Connectionist models of the mind (a subclass of which is neural networks) can be used to model a number of different behaviors, including language acquisition. They consist of a number of nodes that interact via weighted connections, which the system can adjust in various ways, the most common being backpropagation of error.

Connectionism as a Cognitive Theory
Connectionism is a cognitive model that grew out of the need for a model that allowed for, and built upon, the interaction between the biologically coded aspects of the brain and the learned aspects that humans receive from their environment. Connectionist models focus on the concept of neural networks, in which nodes or units function as neurons with many connections to other nodes. A connectionist model is based on the idea that very simple rules applied to a large set of nodes can account for the rules and categories that exist in other models, such as associationist models, spreading activation models, or hierarchical models. Instead of starting out with initial categories in the brain, connectionist models allow categories to emerge from experience. The units of these models are all very simple, consisting of an activation threshold, weights for their connections with other units, and a function for determining how those weights change during error correction. One of the main benefits of the connectionist model is that it can be simulated on computers, and it has been found to closely mirror effects found in humans, such as the U-shaped performance curve for learning, and deficits after node removal similar to those found when a brain is lesioned.
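A single unit of the kind described above can be sketched in a few lines of Python. This is only an illustration, not code from any particular published model; the function name and the choice of a sigmoid squashing function are assumptions made here for clarity.

```python
import math

def unit_activation(inputs, weights, threshold):
    """Weighted sum of incoming activations, squashed by a sigmoid
    centred on the unit's activation threshold."""
    net = sum(i * w for i, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-(net - threshold)))

# Two incoming connections: one excitatory (+0.8), one inhibitory (-0.3).
print(unit_activation([1.0, 1.0], [0.8, -0.3], threshold=0.0))
```

The unit itself knows nothing about letters, words, or categories; everything interesting comes from how many such units are wired together and how their weights are adjusted.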

The simple functioning units of the connectionist model are often based on the neuron, although many connectionist models do not operate on a low enough level to have a plausible one-to-one relationship between unit and neuron (most high-level cognitive models do not work on this level). Many models for language do not function on this level either, instead treating stimuli as already processed: for example, taking a whole letter as input when modeling reading, rather than also training the network to recognize individual letters.

An important aspect of the connectionist model is the way it can explain the emergence of particular properties from only simple learning rules, instead of requiring bioprogrammed (highly specific and genetically determined) areas of the brain such as Chomsky's Language Acquisition Device. There has been a push to consider the interaction between the individual and the environment as an important aspect of innateness: simple rules likely evolved that, because of the environment, allow very complex networks to form. Connectionist models trained on tasks like reading have mirrored the natural progressions of individuals, and also mirror human behaviour in the ways that they process new words. This also means that, to be usable as comparison models, these networks have to be given input similar to what humans receive.

Layers in a model for word recognition
Connectionist models have their units organized into layers, and the connections between these layers are how the output is calculated. There is a layer of input units, which are activated depending on what characteristics are present in the input stimulus, here the word to be recognized. The connections that the activated units have with the next layer determine the activation of that layer; if that is the output layer, they determine what the system takes the word to be. Hidden layers allow for better discrimination between similar words.
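The layer-to-layer calculation can be sketched as follows. The network shape, weights, and function names here are invented for illustration; a real word-recognition model would have hundreds of units per layer.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(activations, weights):
    """Each downstream unit's activation is the squashed weighted
    sum of the upstream layer's activations."""
    return [sigmoid(sum(a * w for a, w in zip(activations, row)))
            for row in weights]

# Toy network: 3 input units -> 2 hidden units -> 2 output units.
inputs = [1.0, 0.0, 1.0]          # which features are present in the stimulus
w_in_hidden = [[0.5, -0.2, 0.3],  # weights into hidden unit 0
               [-0.4, 0.1, 0.6]]  # weights into hidden unit 1
w_hidden_out = [[1.0, -1.0],      # weights into output unit 0
                [-1.0, 1.0]]      # weights into output unit 1
hidden = layer_forward(inputs, w_in_hidden)
output = layer_forward(hidden, w_hidden_out)
print(output)
```

The output pattern here is just whatever the current weights produce; training consists of nudging those weights until the pattern matches the desired one.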

Input Units
Input units are where the network first accesses the stimuli. This is the first analysis of the information, and models vary widely in how processed the information is to begin with. Input can be in the form of letters that are present, sounds that are present, or even whole words, depending on what kind of processing the system is expected to do. A model that attempts to analyze sentences is likely to have words, or at least morphemes, as inputs, rather than letters or letter strings such as were used in Seidenberg and McClelland (1989), discussed later.

Output Units
The output units are the most important for evaluation, because this is how the model's efficacy is judged. It is also where correct/incorrect judgments are made, and this feedback is essential for the system's learning. The output can be a discrete unit, so that the result is the single active (or most active) unit, or it can be a pattern of active units. To judge differences between patterns, a hyperplane in the activation space is used, and distances from it are calculated to determine how far off the output was from what was desired.
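One simple way to quantify how far a produced pattern is from the desired one is Euclidean distance between the two activation vectors. This is a sketch of the general idea rather than the specific measure used by any one model; the function name is illustrative.

```python
import math

def pattern_error(output, target):
    """Euclidean distance between the produced activation pattern
    and the desired one; zero means a perfect match."""
    return math.sqrt(sum((o - t) ** 2 for o, t in zip(output, target)))

produced = [0.9, 0.1, 0.8]
desired  = [1.0, 0.0, 1.0]
print(pattern_error(produced, desired))
```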

Hidden Units
Hidden units connect the input and output units. The most important reason to have them is to allow for non-linear distinctions between outputs. The added layer between input and output provides an extra dimension in which to contrast outputs, so each additional hidden unit, and each additional level of hidden units (the number of layers an input passes through before the signal reaches the output), increases the possible dimensions for differentiating the input. The weights of these hidden units can be adjusted through backpropagation, which uses an algorithm to recalculate weights based on the distance between the output and the expected output. Often, the weights of these units are randomly generated to be within ±0.5, and as the system is given thousands of learning trials, it adjusts the weights so as to reach the correct answers. This has led to the suggestion that connectionism equates the mind with a tabula rasa.
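The random ±0.5 initialization mentioned above can be sketched like this (the function name and network size are illustrative; the fixed seed is only there to make the example reproducible):

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

def init_weights(n_in, n_out, spread=0.5):
    """Random starting weights drawn uniformly from [-spread, +spread]."""
    return [[random.uniform(-spread, spread) for _ in range(n_in)]
            for _ in range(n_out)]

w = init_weights(3, 2)  # weights into 2 units from 3 upstream units
print(w)
```

Learning then consists entirely of moving these randomly chosen starting values toward ones that produce correct outputs.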

Semantic/Meaning and Context Units
Some connectionist models integrate another module of computation, such as a semantics or meaning module. These modules help the system differentiate between heteronyms, words that have the same spelling but different sounds, like 'lead' the metal and 'lead' the verb. One method, used by Joanisse and Seidenberg, had a semantic module that contained 600 verbs, and a past tense node to assist the system in learning the correct past tense of verbs. Despite having all of these preprogrammed verbs, the model was still capable of learning novel words, including non-words, and gave past tenses similar to human responses (e.g. "wug" and "wugged").

Interactions
Units interact along connections that have particular weights. This means that the effect of any particular unit (a) is defined by its relation with another unit (b), cannot be reduced to anything within (a), and is not directly affected when any other connections are changed.

'Recurrent networks' have connections that can feed back to the original unit. This can be direct, in that the unit, when activated, modifies its own activation, or indirect, in that activating it activates other units that in turn modify the first unit again. This makes it difficult to determine which units are responsible for any particular output: although a certain number of units were activated on the first step, determining which were activated because of the inputs and which because of the recurrent propagation is difficult. This in turn makes it difficult to come up with an equation that will properly adjust the weights.

A 'feed forward' network does exactly what its name says: it feeds activation forwards. During a particular calculation, a unit cannot affect itself, because each unit only affects downstream units, and there is no looping backwards. The inputs feed the next level of units, and activation proceeds to the outputs. Such a network is simple to inspect: by examining the weights, it is easy to see which units are driving particular calculations. However, because there is no looping back, it has no way of using the connections themselves to adjust the weights, which is how the system learns from its errors. This can be solved through the use of an algorithm to calculate the backpropagation of error.
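The contrast between the two architectures can be sketched in a few lines. The function names are illustrative; the point is only that the recurrent unit's state keeps changing across time steps even though the external input stays the same.

```python
def feedforward_step(x, w):
    # Activation only flows downstream; the unit never sees itself.
    return sum(xi * wi for xi, wi in zip(x, w))

def recurrent_step(x, w, prev_self, self_weight):
    # The unit's previous activation is fed back in as an extra input.
    return feedforward_step(x, w) + prev_self * self_weight

state = 0.0
for t in range(3):  # same input each tick, but the recurrent state drifts
    state = recurrent_step([1.0], [0.5], state, self_weight=0.5)
    print(t, state)
```

With a feed-forward step, the same input always gives the same output; with the recurrent step, the output on each tick depends on the whole history of activations, which is exactly why attributing an output to particular units becomes difficult.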

Backpropagation of error is a method through which a network can be trained. The desired output is compared with the actual output; if they are the same, the connections that were used are strengthened. If they differ, and the network was wrong, the connections are weakened in proportion to how active they were: a connection with a large weight will be decreased much more than one with only a small weight. In this way the network can learn without requiring connections that loop back within the network, maintaining its feed-forward nature.
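The error-driven update described above can be sketched as a simple delta-rule step: each weight is nudged in proportion to how active its sending unit was and how far off the output was. This is a simplification of full backpropagation (which also propagates error through hidden layers); the function name and values are illustrative.

```python
def update_weights(weights, activations, error, rate=0.1):
    """Nudge each weight in proportion to its sending unit's activity
    and the output error (a delta-rule sketch, not full backprop)."""
    return [w + rate * error * a for w, a in zip(weights, activations)]

weights = [0.4, 0.0, 0.6]
acts = [1.0, 0.0, 1.0]      # the middle connection was inactive
target, output = 1.0, 0.3   # the network undershot the desired output
new_w = update_weights(weights, acts, target - output)
print(new_w)                # only the active connections change
```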

Another method of modifying weights is Hebbian learning, named after Donald Hebb, who first described it. The basic rule is that connections between units that fire together are strengthened, while units that fire singly have their weights decreased. Units that usually fire at the same time are likely firing because of the same stimulus, so giving them more weight allows the system to maintain its structure. This makes the system less susceptible to change if a number of anomalous stimuli occur in succession.
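One simple way to realize the rule just described (strengthen when both units fire, weaken when only one does) is sketched below. The exact form of the weakening term is an assumption made here; Hebb's original formulation only specified the strengthening half.

```python
def hebbian_update(w, pre, post, rate=0.1):
    """Strengthen when the two units fire together, weaken when
    only one of them fires (the decay term is an assumption)."""
    if pre > 0 and post > 0:
        return w + rate * pre * post      # fire together: strengthen
    if pre > 0 or post > 0:
        return w - rate * max(pre, post)  # fire singly: weaken
    return w                              # neither fired: no change

w = 0.2
w = hebbian_update(w, pre=1.0, post=1.0)  # both active: weight rises
w = hebbian_update(w, pre=1.0, post=0.0)  # only one fires: weight falls
print(w)
```

Note that, unlike backpropagation, no target output appears anywhere: the weights change purely as a function of local co-activation.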

McClelland and Seidenberg Model
McClelland and Seidenberg built a connectionist model for word recognition, trained on four-letter monosyllabic words. Designed in 1989, it was capable not only of reading words from orthographic to phonological representation, but its learning also mirrored a number of human characteristics, such as having the same difficulties with hard-to-process words, transferring from basic to skilled reading, pronouncing novel items (ones not presented during training), and showing similar differences in performance on naming and lexical decision tasks. Originally, their model for reading included a number of interconnected levels: the orthographic input, the phonological output, a 'meaning' level, three hidden levels between each of these, as well as a 'context' level connected to the 'meaning' level. All of these levels are theoretically capable of altering, and being altered by, any level they are directly connected to. In practice, only the input and output layers and one hidden layer were used, in large part because of the length of time it takes to simulate parallel processes on a serial processing computer.

The input level for this model consisted of units sensitive to triples of characters: if a given triple was present in the word's orthography, the corresponding unit became activated. There were 400 input units, and each contained a list of ten possible letters for each of the first, second, and third positions of the triple, so 10 × 10 × 10 = 1,000 possible three-letter strings could activate a particular unit. Although this might seem to make words hard to differentiate, each three-letter string activates only about 20 units, and there was effectively no chance of two different strings activating exactly the same units, allowing the model to distinguish between them effectively.
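This encoding scheme can be sketched as follows. The 27-character alphabet (with '_' standing in for a word boundary), the random construction of the letter tables, and all function names are assumptions made here for illustration, not details from Seidenberg and McClelland's implementation.

```python
import random

random.seed(1)
LETTERS = "abcdefghijklmnopqrstuvwxyz_"   # '_' as a word-boundary filler

def make_unit():
    """A unit is a triple of ten-letter sets, one per position;
    10 x 10 x 10 = 1000 strings could activate it."""
    return tuple(frozenset(random.sample(LETTERS, 10)) for _ in range(3))

def activates(unit, triple):
    # The unit fires if each character falls in that position's set.
    return all(ch in pos for ch, pos in zip(triple, unit))

units = [make_unit() for _ in range(400)]
active = [u for u in units if activates(u, "mak")]  # a triple from 'make'
print(len(active))   # on average about 20 of the 400 units match
```

Because each string lights up its own roughly-20-unit subset, two different strings are overwhelmingly likely to produce different activation patterns, which is what lets a distributed code discriminate between them.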

The output units were triples of types of phonemes, so that a unit for the string "vowel, fricative, word boundary" would be activated by both 'laugh' and 'scoff'. There were 460 of these units, and each phoneme string activated 16 of them on average. Reporting this distribution shows distributed activation at work in this model: no particular unit corresponds to a particular letter or sound; instead there is a particular activation pattern.

Over 150,000 training trials, this simulation learned to correctly distinguish 2,820 of the 2,897 words it was trained on. The model also correctly read novel words, and showed traits similar to human readers when irregular pronunciations were presented: given PINT, for example, it pronounced it as rhyming with LINT. Non-words were also pronounced the way humans generally pronounce them, so that MAVE rhymed with GAVE, which is the more common pattern, rather than HAVE.

Pathological Evidence
Another piece of strong evidence for connectionist models is their ability to replicate deficits found in humans with particular brain injuries. Joanisse and Seidenberg found that they could recreate the double dissociation between correct conjugation of irregular past tense verbs and conjugation of non-words by 'lesioning' the model, which consisted of severing a number of connections after training. The results were consistent with individuals with anterior lesion aphasia, who showed phonological deficits, and with posterior lesion aphasics, who had semantic deficits. Patterson et al. had also shown similarities in effect between lesioned brains and lesioned models: lesioning their models produced deficits similar to those of individuals with surface dyslexia.
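'Lesioning' a trained network can be as simple as severing (zeroing) a random subset of its connections. The sketch below illustrates the general technique; the function name, lesion fraction, and weight values are invented for the example, not taken from the studies above.

```python
import random

random.seed(2)

def lesion(weights, fraction=0.3):
    """Sever a random subset of connections by zeroing their weights,
    leaving the rest of the trained network intact."""
    return [[0.0 if random.random() < fraction else w for w in row]
            for row in weights]

trained = [[0.5, -0.2, 0.8],   # weights into output unit 0
           [0.3, 0.9, -0.7]]   # weights into output unit 1
damaged = lesion(trained)
print(damaged)
```

Running the damaged network on the original test items, and comparing which items it now fails, is what allows the model's deficits to be compared with those of patients.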

Nativism vs. Tabula Rasa
Connectionist models are often viewed as inherently accepting the theory that our minds are blank slates (tabula rasa). This view makes sense given that initial weights are generally randomized; if they were not, it would take a long time to program them, and the models would not show the emergent learning properties that make connectionist models so interesting. However, this is not necessarily the case. Some connectionists suggest that initial weights may favour particular pathways such that, barring extraordinary intervention, a particular network will form. On this view, while the particular networks are not innate, normal interactions between the organism and the environment mean that the genetically determined neurons will almost always form that network. An example of this is how Roe et al. redirected a pathway from the eyes to the auditory cortex, and a retinotopic map formed. In this experiment, the retinal inputs of a number of ferrets were redirected from the occipital lobe to the temporal lobe. When this happened, the new input changed the structure of the auditory cortex so that it resembled a visual cortex. This shows that despite the apparent natural organization of the brain, it can adapt to rerouting, suggesting that it is the processing and connections of neurons that give shape to the networks. The input that neurons get, and the feedback they receive, is instrumental in the formation of the brain.

Localization vs. Distributed Activation
The debate about the function of single units in connectionist models has its roots in the search for an engram, a place where a specific memory is stored. Karl Lashley, attempting to locate the engram by lesioning rats' brains in the hope that they would forget a task they had learned, concluded that it did not exist, and that memories consist of neurons throughout the brain. The colloquial "Grandmother Cell" (GMaC) is a similar idea: a particular concept is encoded, or located, in a single neuron. The name comes from the idea that there could be a cell that fires at the memory of one's grandmother, and that this firing is responsible for one's remembering. The opposing view is that variations in activation over distributed neurons form patterns; the neurons in these patterns have particular weights associated with their firings, and this weighted pattern of activation is what constitutes a memory. The localist view must be distinguished here: it is not merely that if the GMaC had not fired in this pattern, then perhaps a memory of fruitcake or one's grandfather would have been recalled. Rather, the GMaC is a dedicated cell for that particular idea; it will not fire as part of a pattern for another idea, nor require other neurons to fire to differentiate the idea of grandmother from a different idea. This argument stems in large part from single cell recordings in which particular stimuli elicited highly selective responses from particular cells. The theory is also suggested by a problem sometimes encountered in connectionist models called "superposition", where two different stimuli result in the same output pattern, even though this pattern can arise from different activations of preceding units. A few prominent connectionists reject the notion that there are dedicated cells for particular ideas, words, or concepts.
The possibility that one neuron could be dedicated to an idea raises the question of what constitutes an idea. If there were a grandmother cell, what about a 'grandmother's ear' cell? And if there were one, why isn't the idea of 'grandmother' made up of all these parts of her? These arguments, while certainly not definitive, show the value of distributed, weighted activation. That the activation is weighted is especially important, because one heavily weighted cell could affect the whole pattern, while a highly active one could have little effect if it had no weight.

Conclusion
Connectionist models, utilizing very simple rules, are capable of simulating very complex behaviours and learning progressions that have been observed in humans. That these models have been so successful without requiring many specific rules suggests that connectionist methods could be good approximations of what occurs in the human brain. This does not mean that the connections simulate exactly what neurons do, particularly in language processing models, but they could be similar to the activity of groups of neurons. The really important part of connectionist models is not how well a model can mimic the function of a brain, but how simple the rules are that are required to produce this simulation, and that complex connections can emerge from randomly generated ones.

Learning Exercises
These two activities are designed to facilitate the understanding of two important concepts for connectionist models. The first one will involve the importance of the hidden units for differentiating between inputs. The second involves emergent properties in the relationship between a structure and its environment.

Hidden Units
Step one is to take a sheet of graph paper and three different coloured markers. It is preferable if you can at least partially see through the paper. For the first example, draw two groups of dots in different colours, and try to separate them with a straight line.

Next, draw a series of dots in the centre in colour A, and groups of colour B dots in the upper right and lower left corners. If you only have one colour, different shapes can be used. Now, as before, attempt to draw a single straight line to separate the two. Then try folding the paper, bottom left up to top right. This is an example of how adding a hidden layer can add another dimension to the calculation of weights, allowing a system to differentiate between two stimuli that might share a variable. Now unfold the paper and draw two groups of dots on the remaining two corners.
 * If you cannot do this, why not?
 * How does this relate to the problem of having only a set of connected inputs and outputs?
 * Try drawing a line to separate the two colours again, is it possible?
 * Is it possible to draw just two straight lines to separate the three colours?
 * Does one set of hidden units help? Might adding a second layer of units help, allowing you to add an even further dimension?
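The paper-folding exercise above can also be checked computationally. The sketch below uses an XOR-like layout (one colour in two opposite corners, the other colour in the remaining corners) and brute-force search to show that no single straight line separates the two groups, while a "fold" feature, playing the role of a hidden unit, does. The point layout and the |x − y| fold are choices made for this illustration.

```python
import itertools

# XOR-like layout: colour A in two opposite corners, colour B in the others.
a_points = [(0, 0), (1, 1)]
b_points = [(0, 1), (1, 0)]

def separable_by_line(a, b, w1, w2, bias):
    """True if the line w1*x + w2*y + bias puts all of A strictly on
    one side and all of B on the other."""
    side = lambda p: w1 * p[0] + w2 * p[1] + bias > 0
    return all(side(p) for p in a) and not any(side(p) for p in b)

# Brute-force search over many candidate lines: none of them works.
grid = [i / 4 for i in range(-8, 9)]
found = any(separable_by_line(a_points, b_points, w1, w2, b)
            for w1, w2, b in itertools.product(grid, grid, grid))
print("single straight line works:", found)   # False

# The 'fold' adds a new dimension, like a hidden unit: |x - y|.
fold = lambda p: abs(p[0] - p[1])
print("fold separates them:",
      max(fold(p) for p in a_points) < min(fold(p) for p in b_points))  # True
```

A single layer of connected inputs and outputs can only draw straight lines; the fold, like a hidden unit, gives the network a new coordinate in which the two groups fall cleanly apart.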

Constraints and Innateness
This exercise will get you to think about the ways the environment can put constraints on inputs without those constraints having to be programmed genetically. This can probably be done just by thinking about how things might change, but it will be more amusing if you actually try it. Take 30 roughly spherical objects; marbles work well, as do jellybeans. Now try to form them into as high a pile as you can on a regular tabletop. Try to list five properties each for the marbles and the environment. What about making the stack on a sandy or dirt surface, like a beach or a garden bed, or even a snow bank? What if the stack were made in a different fluid besides air: might the pile's maximum height change under water? You don't have to try this one, but if you made the pile in molasses, do you think it would be a lot higher? In this exercise, the properties of the marbles, or whatever you are stacking, can be considered what is programmed genetically: they do not change over the course of the experiments. On the other hand, there are different things that can be done to increase the size of the pile. Changes in the environment evidently contribute to the way the marbles can interact with each other. This is quite simplistic, but it hopefully gets the point across that the height of a pile of marbles is determined not merely by the properties of the marbles, but also by the properties of the environment they are piled in. Write down a list for each of the following questions.
 * How high can you get the pile?
 * What attributes of the marbles (or whatever) that you are using might be limiting or helping the stack grow larger?
 * What aspects of the environment are contributing to the size of the stack?
 * How has the change in the environment changed the height of the stack?
 * How does this relate to connectionist models?
 * What parts of the model are analogous to the marble or neuron, and what parts to the environment?
 * Where does the weighting, input, output, and changing of weights lie in this dichotomy?
 * If you had a language learning program that was able to interpret words and give the correct orthography, what constraints that are a part of language would help the program to learn?
 * What aspects might cause difficulty?