Back in 2003, Anna Szekely, Liz Bates, and colleagues from around the world initiated the International Picture Naming Project, collecting timed picture naming norms in seven languages for 520 classic black-and-white line drawings. Such pared-back stimuli are useful: while there is a movement toward using more photorealistic colour images (and the lack of colour can clearly hinder object identification when colour is a diagnostic feature), adding more detail introduces more factors to control, and more ways to get the detail wrong (a line drawing of a dog can hit the centre of 'dogness' without veering off into 'labradoodleness' territory). Such naming norms have usefully facilitated stimulus selection in subsequent experiments, and provided predictors of naming difficulty for researchers to control and experimentally manipulate (e.g. name agreement). Unfortunately, they only included norms for US English, leaving researchers in the UK to guess about how well they might apply across the pond.
Seven years later, Johnston and colleagues (2010) partially filled this gap, collecting untimed written norms from one small set of UK English speakers, and naming latencies from a different small set. Their norms suggested a number of potential differences from the US norms, but the small sample size and novel methods introduced concerns about how well they might predict responses under time pressure (e.g. "chest of drawers" violates the maxim of manner, while "cockerel" and "bungalow" descend below the basic level that would be expected in a timed experiment). Although such untimed written (or typed) norms are quickly becoming de rigueur (e.g. Duñabeitia et al, 2016), they are obviously suboptimal for use in spoken production experiments: the best predictor of timed picture naming is timed picture naming.
Our new timed naming norms (Oppenheim, in prep) address this problem and others. Timed picture naming provides valid predictors for timed picture naming experiments. Two groups of fifty native British English speakers provide both a very large basis for estimating predictors, and a means to evaluate their replicability. And detailed consideration of individual non-dominant responses provides the first appropriate basis for assessing competition in picture naming norms. Please watch this space for the final citation, or email me at email@example.com and I'll update you upon publication.
Beyond simply identifying dominant names for each picture, these norms were also designed to assess the role of lexical competition in determining word production latencies. At least one major model (LRM, 1999) uses competitive lexical selection as a foundational assumption, and most active researchers currently assume such competition to be a core feature of the production system. But the clearest evidence for such competition comes from experimental paradigms (i.e. picture-word interference) that are quite far removed from normal production, and interpreting those data as evidence of lexical competition in production requires many very strong assumptions about wordform perception and comprehension and their interaction with the production system. To bring the conversation back to data that clearly reflect processing within the production system, our goal with these norms was to assess evidence that having a strong alternative name hinders selecting a picture's dominant name in simple picture naming.
Researchers have, over the years, attempted to identify data from norms that might speak to the question of lexical competition, but these efforts lacked serious consideration of what kinds of effects competitive accounts would specifically require. For instance, shorter naming latencies for pictures with higher simple name agreement -- a larger proportion of participants producing the dominant name -- are often cited as evidence of competition between the picture's various names. But many non-lexical factors can contribute to low name agreement (see e.g. Vitkovitch & Tyrrell, 1995), and even within lexical factors one does not need to resort to competition to explain the effect: as Oppenheim, Dell, & Schwartz (2010) illustrated, unlearning and divided practice can produce response time effects that superficially resemble those of online competition.
Calculating a picture's entropy (sometimes called an H-statistic) additionally takes into account the likelihoods of each alternative response -- not just the dominant. Thus it seems like a better candidate for revealing competition between possible responses. Unfortunately it inherits the limitations of simple name agreement, both conceptually and mathematically. Conceptually, there is still no reason that entropy should specifically reflect lexical sources of name disagreement. Mathematically, entropy is almost perfectly (though inversely) correlated with simple name agreement (r~-.95 for Szekely et al's 2003 IPNP norms for US English).
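To make the two statistics concrete, here is a minimal sketch (my own illustration, with invented responses and function names) of how simple name agreement and the H-statistic are computed from a set of norming responses:

```python
import math
from collections import Counter

def name_agreement(responses):
    """Proportion of participants producing the modal (dominant) name."""
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)

def h_statistic(responses):
    """Shannon entropy (in bits) over the full response distribution."""
    counts = Counter(responses)
    n = len(responses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A high-agreement picture yields near-zero entropy
print(h_statistic(["zebra"] * 99 + ["horse"]))  # ~0.08 bits
# A split picture yields low agreement and higher entropy
print(name_agreement(["couch", "sofa", "settee", "couch"]))  # 0.5
```

Because both measures are dominated by the modal response's share, their near-perfect inverse correlation in real norms is unsurprising.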
A third approach has been to count the number of distinct names that each picture elicits; pictures that elicit more different names also elicit slower RTs for the dominant name, presumably reflecting competition among those candidates (Szekely et al, 2003). But this statistic also has mathematical and conceptual difficulties. Mathematically, it is again highly (inversely) correlated with simple name agreement, because pictures with high simple name agreement limit the opportunities for observing alternative names (a picture that 99/100 participants name as 'zebra' (dominant) can elicit one alternative, at most), and pictures with low simple name agreement impose a lower bound on how many alternatives the picture must elicit (assuming no non-responses, a picture that 11/100 participants name as 'electric can opener' (dominant) must also elicit no fewer than 9 alternatives, because no alternative could elicit more than 11/100 responses without becoming the dominant name).
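That lower bound follows from simple arithmetic; a quick sketch (my own illustration, not a statistic from the norms):

```python
import math

def min_alternatives(n_total, n_dominant):
    """Lower bound on the number of distinct alternative names a picture
    must elicit, assuming no non-responses: each alternative can occur at
    most n_dominant times without overtaking the dominant name."""
    return math.ceil((n_total - n_dominant) / n_dominant)

print(min_alternatives(100, 99))  # 1: the 'zebra' case
print(min_alternatives(100, 11))  # 9: the 'electric can opener' case
```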
The larger problem with deriving 'competitive selection' predictions for measurements like entropy or number of names is one of face validity. Competition, as we would normally imagine it, refers most specifically to competition between similarly strong (or 'good') alternatives: this is a condition of 'too much activation', where similar responses are similarly above an activation baseline, and the objective is to select the one best response. Selection among similarly poor (or weakly activated) responses is quite a different thing: this is a condition of 'too little activation', where disparate responses are similarly near an activation baseline, and the objective is to avoid an omission. Such measures are maximised in the latter case, when no two responses to a stimulus are alike. Detecting competition between strong responses requires something different.
A better way to assess the effect of having strong alternatives depends on better specifying what exactly we mean by 'competitive lexical selection', to delineate specific predictions that necessarily follow from the assumption. At their core, competitive algorithms for lexical selection assume that the goal of lexical selection is to choose the single best word, and the selection process grows slower as it gets harder to choose the best. Competitive algorithms can take many forms, including a range of competitiveness. For instance, comparing the most active option to the mean of the others sounds competitive, but with many inactive options (e.g. given a large vocabulary) it will approximate a simple absolute threshold (Oppenheim et al, 2010). To actually be meaningfully 'competitive', in the traditional sense of producing longer RTs when some finite subset of one or more alternatives are more active, an algorithm must weight the strongest alternatives most heavily, meaning that competition should be predominantly between the most active word and the second most active word (presumably with decreasing importance assigned to each next-most-active word beyond that). Thus, meaningfully competitive algorithms predict that competition -- and therefore lexical selection latencies (ceteris paribus) -- should be greater when alternative strength is consolidated into one very strong alternative (as estimated by its frequency of occurrence as an alternative response) than when it is distributed among many weak alternatives.
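As a toy illustration of that prediction (entirely hypothetical numbers and functional form, not any published model), consider a selection time that depends on the target's lead over its strongest alternative:

```python
def competitive_latency(target, alternatives):
    """Toy 'competitive' selection time: grows as the strongest
    alternative's activation approaches the target's."""
    return 1.0 / (target - max(alternatives))

target = 0.5
consolidated = [0.30, 0.01, 0.01]  # one very strong alternative
distributed = [0.11, 0.11, 0.10]   # similar total strength, spread thin

print(competitive_latency(target, consolidated))  # 5.0 (slower)
print(competitive_latency(target, distributed))   # ~2.56 (faster)
```

The consolidated case yields the longer latency even though the summed alternative activation is nearly identical, which is exactly the signature that secondary name agreement is designed to test.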
To estimate the influence of lexical competition on target naming
latencies, it is therefore crucial to estimate the activation of
alternative names while retrieving a picture's dominant name. We can
estimate these activations by quantifying how often each alternative
name emerges in picture naming norms, on the assumption that these
probabilities reflect within-subject activation arrays. (Note that
researchers commonly assume that norms reflect within-subject
co-availability of alternative responses; we specifically verified this
assumption in Balatsou, Fischer-Baum, & Oppenheim, in prep). With
such production frequencies for alternatives, it should be relatively
straightforward to assess the effect of strong alternatives on dominant
name retrieval latencies.
Unfortunately, no previous studies have actually reported the
necessary statistics to allow such assessments, so in this study I
report, for the first time, the specific probability of the
second-most-common response for each stimulus, as a new
statistic: secondary name agreement (n.b. I
similarly report tertiary name agreement, as well as proportions for
each remaining alternative, but robust estimation of e.g. septenary
name agreement would require a much larger dataset). I also show that
secondary name agreement is quite a robust measure (r~.9 correlating
secondary response rates between the first and second fifty
participants). Secondary name
agreement is of course mathematically dependent on primary name
agreement (previously just 'name agreement'), but this problem can be
easily remedied via multiple regression, residualisation, or deductive
mathematical correction (preferred).
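The residualisation option can be sketched as follows (simulated data with an assumed bound structure; variable names are my own, and this is an illustration rather than the analysis from the norms):

```python
import numpy as np

rng = np.random.default_rng(0)
primary = rng.uniform(0.3, 1.0, 200)  # simulated primary name agreement
# Secondary agreement is mechanically bounded by what primary leaves over
secondary = rng.uniform(0, 1, 200) * (1 - primary)

# Regress secondary on primary and keep the residuals, which are
# uncorrelated with primary by construction
X = np.column_stack([np.ones_like(primary), primary])
beta, *_ = np.linalg.lstsq(X, secondary, rcond=None)
secondary_resid = secondary - X @ beta

print(abs(np.corrcoef(primary, secondary_resid)[0, 1]) < 1e-8)  # True
```

The residualised predictor can then be entered into an RT model without its effect being confounded with primary name agreement.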
So to recap, all else equal, to the extent that lexical selection is accomplished by a competitive mechanism -- comparing candidate words' activations to find the best -- dominant name retrieval latencies should increase as secondary name agreement increases (because it means the strongest competitor is getting stronger). But do they really?
Weirdly, no. Pretty much any way you squint at it, stronger secondary name agreement is actually associated with faster dominant name retrieval, exactly opposite the prediction from 'competitive' selection algorithms. It doesn't seem to matter how you subset the data, or what covariates you add into the model. And tertiary name agreement seems to show basically the same facilitatory effect on dominant name RTs. In short, having more good options just seems to be... good.
So why might strong alternatives make it easier to select a picture's dominant name? I (2017) recently wrote about an effect of errors on observed correct naming latencies, and I think it might apply here, too. Let's imagine for a moment that, instead of aiming to retrieve the BEST word in any situation, you mostly just settle for retrieving a GOOD word. And you generally trust your converging activation from semantic features to get you a pretty good word. So now you're trying to retrieve a name for a soft, multiple-occupancy, seating object. Do you want couch? Sofa? Bench? Chair? Sitting-thing? Any of them should do the job of communicating the idea of a place to sit. And being able to choose any of them means you'll not only be faster overall (because you can grab whatever option is most accessible at the moment), but you should even be fast to retrieve couch when you calculate its RTs alone (because the slowest couches will end up being sofas or benches instead). It's a cheap heuristic that supports fluent communication: anything activated by all of those semantic features should probably tend to be close enough to what you want, so you can save your efforts for where they (rarely?) really matter, and you don't need to exhaust yourself struggling to find the one PERFECT word fifteen thousand times each day.
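A quick simulation of that last point (all numbers hypothetical): if the speaker simply produces whichever 'good enough' name finishes first, the observed RTs for couch alone come out faster than couch's underlying retrieval times, because the slowest would-be couches surface as sofas instead:

```python
import random
random.seed(0)

def race_trial():
    # Each name's finishing time varies from trial to trial;
    # the speaker produces whichever word finishes first
    times = {"couch": random.gauss(600, 100), "sofa": random.gauss(650, 100)}
    winner = min(times, key=times.get)
    return winner, times[winner]

trials = [race_trial() for _ in range(10000)]
couch_rts = [rt for word, rt in trials if word == "couch"]
mean_couch = sum(couch_rts) / len(couch_rts)

# Observed couch RTs beat couch's underlying 600 ms mean, because
# conditioning on winning the race censors the slowest couch trials
print(mean_couch < 600)  # True
```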
Does this mean that lexical selection isn't
'competitive' at all? The mechanism outlined above fits better with a
noncompetitive mechanism than with a competitive one, in terms of
producing net facilitation; it could be combined with competitive
selection, but then it would merely produce a bit of 'facilitation' to
offset the inhibition that strong competitors should presumably create.
In any case, there are lots of different possible flavours of
'competitive' selection, and I get the impression that there's little
consensus among researchers about just what the right ones might be. And as I
pointed out above, there are plenty of algorithms that seem competitive
at first blush, but might not be that competitive when you scale them
up even a teensy bit; I wouldn't rule out such algorithms. There
are also instances of explicit directed executive control that some
people call 'lexical competition' (competition in PWI might be an example); I'd assume that we can do such
things with language, but my hunch is that these are much rarer
in normal communicative production than models relying on certain
paradigms might lead us to believe. In sum, instead of asking, "Is
lexical selection (non)competitive?", it's better to ask, "How
competitive is normal lexical selection?" The answer here would be,
"Not very competitive, if at all."
An attentive reader will note a profound leap in logic in the previous study of the 'effects' of secondary name agreement. Following convention, I assumed that the distribution of responses in picture naming norms reveals the distribution of lexical activations within each speaker's head, but norms actually give us the wrong information to estimate these quantities. When using name agreement to estimate lexical competition, researchers typically assume (without acknowledging it) that sampling 50 people just once gives you the same results that you'd get by sampling 1 person 50 times. The implicit assumption is that lexical selection operates via something like a Luce choice rule: each time you name a picture of a couch, you will choose couch with the independent probability p(couch) = a(couch) / Σ a(w), summing over couch, settee, divan, canapé, chesterfield, davenport, and so on. In fact, much empirical research assumes this relationship, but strangely it has never before been tested (perhaps because the conclusions were acceptable enough that the premises never previously warranted scrutiny).
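Under that independence assumption, the Luce rule can be sketched as follows (activations invented for illustration): sampling 50 speakers once and 1 speaker 50 times are then statistically equivalent, differing only by sampling noise:

```python
import random
from collections import Counter

activations = {"couch": 0.50, "sofa": 0.35, "settee": 0.15}

def luce_choice(rng):
    """One naming trial: choose each word with p = a(word) / sum of all a."""
    return rng.choices(list(activations), weights=activations.values(), k=1)[0]

rng = random.Random(42)
fifty_people_once = Counter(luce_choice(rng) for _ in range(50))
one_person_fifty = Counter(luce_choice(rng) for _ in range(50))

# Under independence, both sampling schemes draw from the same distribution
print(fifty_people_once)
print(one_person_fifty)
```

The 'sticky favourites' alternative described below breaks this equivalence: a population mixture can reproduce the norms without any within-speaker variability at all.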
At the other end of the spectrum is the possibility that each person just has their favourite words for each concept, and they stick with them. 'Norms' just show us the mixture of such people in our population. Although the distinction makes little difference when using norms for their original purpose (picking reliable stimuli for an experiment), its importance is magnified when assuming that norms reflect within-subject ambiguity or variation. In other words, although name agreement can predict RT, its mechanism for doing so is less clear if the 50% of people who choose couch would never actually choose sofa in a million years.
The IRB rejected our initial plan to repeatedly test participants for a million years, so in this study we tested them just twice. The question was the same, though: how 'sticky' are participants' responses in norming studies? Does choosing couch the first time you name the picture identify you as 'a couch person' who will always choose couch over its various synonyms, or do you actually choose between couch et al each time you choose, independently of anything you've done in the past (as typically assumed)?
In this study, we found evidence for both possibilities, suggesting some midpoint between them. If a person used couch the first time they named a picture, they were more likely to name it as couch
a week later than if they had not. But norms from a population also
predicted how likely participants were to deviate from their previous
selections: they were more likely to switch between couch and sofa (about equally likely, for this example) than between sheep and lamb (where sheep is much more likely).
It strikes us that these new by-item 'stickiness' norms may be useful to others, so here they are. The data column reports the proportion of 25 participants using the image's most common name in both sessions: