I wrote an article for The Toast on the phonological constraints that allow you to identify Bandicoot Cumbersnatch, Bendandsnap Candycrush, and even Wimbledon Tennismatch as synonyms for the same long-faced British actor, by analyzing all of the names from the Benedict Cumberbatch name generator.
You should go read it there first to see what the constraints are and how I got them, and then come back here for a bonus in-depth investigation into how we can model them using constraint rankings loosely inspired by Harmonic Grammar (don’t worry if you don’t know what that is).
So the key constraints that I identified are:
- Has initial stress (97%)
- 3 syllables (98%)
- Begins with b or (hard) c, as appropriate (64%)
- Ends in a consonant (91%)
- Ends in a voiceless obstruent, e.g. t, s, k (59%)
- Nasal between the first two syllables (42%)
- Ends in a fricative or affricate (40% of last names)
- Last syllable has /æ/ (27% of last names)
Clearly, some of these constraints are more important than others: a name that doesn’t have initial stress, 3 syllables, or b/c at the beginning is going to sound much less prototypical than a name that lacks a nasal, fricative/affricate, and /æ/ in the desired positions. However, it generally seems that a name that is lacking in one of the really key features is more likely to have some of the less-important ones, whereas if a name has the main ones it can lack the lesser ones with impunity.
So if we’re trying to figure out whether a name sounds like a reasonable synonym, we want to take both types of constraints into consideration, but we want to rank them differently. How shall we rank our constraints? We could list them in order of importance and assign them numbers between 1 and 8 or in some other arbitrary range, but since we already have data on the frequency of these factors in a fairly large data set, I’m going to use the frequencies themselves as weights. Let’s see how this works by computing the score for a sample name, like Rinkydink Cabbagepatch.
Rinkydink satisfies the constraints initial stress, three syllables, ends in consonant, ends in voiceless obstruent, and nasal between first two syllables, and does not satisfy the constraints begins with b, ends in fricative or affricate, and last syllable has /æ/. So I’m going to assign it a score of 1 for each constraint it satisfies and a score of 0 for each constraint that it does not, as we can see in the first row of the table below. Now, I’m going to multiply each of those 1s and 0s by the proportion of all first names that satisfy the constraint (bottom row), the results of which we can see in the middle row. As we can see in the last column, the sum of all the first name weights, which is the highest possible first name score, is 4.74, and the sum of the weights for the constraints that Rinkydink satisfies is 3.75.
We can do the same thing for Cabbagepatch, adjusting the constraint weighting to reflect the results for last names: note that last names are considerably more likely to end in a fricative/affricate and have /æ/ in the final syllable, so they have a higher overall possible score at 5.30, out of which Cabbagepatch scores 4.89.
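To make the arithmetic concrete, here is a minimal sketch of the scoring step in Python. The function and constraint names are my own labels, not anything from the original analysis. One loud assumption: the weights below are the pooled percentages from the constraint list (two of which are stated for last names only), because the separate first name and last name weights from the tables aren’t all spelled out in the text. That means the totals here come out a little different from the 3.75 and 4.89 above; swap in the per-category weights to reproduce those.

```python
# Assumption: pooled percentages from the constraint list, used as weights.
# The tables in the post use separate first-name and last-name weights,
# so these totals will not match 3.75 / 4.74 or 4.89 / 5.30 exactly.
WEIGHTS = {
    "initial_stress": 0.97,
    "three_syllables": 0.98,
    "b_or_c_initial": 0.64,
    "consonant_final": 0.91,
    "voiceless_obstruent_final": 0.59,
    "nasal_between_first_two": 0.42,
    "fricative_or_affricate_final": 0.40,
    "ae_in_last_syllable": 0.27,
}


def score(satisfied, weights=WEIGHTS):
    """Add up the weights of the satisfied constraints (1 * weight);
    unsatisfied constraints contribute 0 * weight, i.e. nothing."""
    unknown = set(satisfied) - set(weights)
    if unknown:
        raise KeyError(f"unknown constraints: {sorted(unknown)}")
    return round(sum(weights[c] for c in satisfied), 2)


def max_score(weights=WEIGHTS):
    """Highest possible score: a name that satisfies every constraint."""
    return round(sum(weights.values()), 2)


# Constraints satisfied by each half of the sample name, as described above.
RINKYDINK = {
    "initial_stress",
    "three_syllables",
    "consonant_final",
    "voiceless_obstruent_final",
    "nasal_between_first_two",
}
# Cabbagepatch misses only the nasal between the first two syllables.
CABBAGEPATCH = set(WEIGHTS) - {"nasal_between_first_two"}

print(score(RINKYDINK), "out of", max_score())     # 3.87 out of 5.18
print(score(CABBAGEPATCH), "out of", max_score())  # 4.76 out of 5.18
```

Passing a different `weights` dictionary for first names and for last names is all it would take to match the two tables.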
Now if we do the same thing for all of the names in the dataset, we can look at what kinds of scores we end up with as a whole. First, let’s look at summary stats for the names as a whole. This gives us an effective threshold for what the lowest possible score is for a name that made its way into the generator: 1.61 for first names and 2.46 for last names, and also tells us that there was at least one name that achieved the highest possible score in each category. I have a hunch that the minimum score for a full name must be higher than 4.07 (the two minimums added together, 1.61 + 2.46), because perhaps a slightly-less-ideal first name can be compensated for by a really prototypical last name, or vice versa, but I’d need a dataset with name pairings in order to test it.
We can also test how good our constraint weighting is by running a name that doesn’t sound like a good synonym for Benedict Cumberbatch, such as Umbrella Falafel:
We can see that Umbrella gets a score of 1.39, which is lower than the lowest first name score (1.61) and that Falafel gets a score of 1.97, which is lower than the lowest last name score (2.46). Together, these are 3.36, which is also lower than our minimum hypothesized joint threshold (4.07). Interestingly, Falafel would pass the first name test, since the first name weights give it 0.96 for 3 syllables + 0.84 for consonant-final = 1.80, which is above the minimum score of 1.61. But is the lowest score really a good measurement for whether a name is a satisfactory synonym, or would it be better to look at how the name fits within the distribution of names as a whole?
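The minimum-score test is simple enough to write down as a quick sketch. The function name is hypothetical; the thresholds and scores are the ones quoted in the text (minimums of 1.61 and 2.46, and the scores for Umbrella Falafel and Rinkydink Cabbagepatch).

```python
# Lowest scores observed in the generator data, as quoted in the text.
MIN_FIRST = 1.61
MIN_LAST = 2.46
MIN_FULL = round(MIN_FIRST + MIN_LAST, 2)  # hypothesized joint floor, 4.07


def clears_minimums(first_score, last_score):
    """Report which of the three minimum thresholds a name pair reaches."""
    return {
        "first": first_score >= MIN_FIRST,
        "last": last_score >= MIN_LAST,
        "full": round(first_score + last_score, 2) >= MIN_FULL,
    }


print(clears_minimums(1.39, 1.97))  # Umbrella Falafel: fails all three
print(clears_minimums(3.75, 4.89))  # Rinkydink Cabbagepatch: passes all three
```

This only checks against the floor of the data, which is exactly the limitation the next question raises.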
Let’s look at this distribution on a Histogram Cumbergraph:
We can see a few things: firstly, that the 1.61 minimum is probably an anomaly, and a better minimum might be closer to 2 or even 2.4 for first names. Secondly, there’s a sharp increase in the number of names right at the 3.3 mark, so another good conclusion to draw from this data might be that a good minimum score for first name + last name might be 6.6 (that is, 3.3 for each name): you can have one of those names be below 3.3, but the more below 3.3 it is, the higher the other score needs to be in order for other people to recognize it as a synonym for Beetlejuice Animorph. If anyone has a good, large sample of fan-created full names, it would be interesting to see if that statistic works out.
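The compensation idea can be sketched as a tiny check: a pair passes if the two scores add up to 6.6, and a weak name tells you how strong its partner has to be. The function names are hypothetical, and the 6.6 threshold is only the guess proposed above, not something tested against paired data.

```python
PER_NAME_MARK = 3.3  # where the histogram jumps, per the text
JOINT_MIN = 6.6      # proposed minimum for first + last (3.3 each)


def passes_joint_threshold(first_score, last_score):
    """True if the pair reaches the proposed joint minimum."""
    return round(first_score + last_score, 2) >= JOINT_MIN


def required_partner_score(score):
    """Lowest score the other name needs for the pair to reach JOINT_MIN."""
    return round(max(JOINT_MIN - score, 0.0), 2)


print(passes_joint_threshold(3.75, 4.89))  # Rinkydink Cabbagepatch -> True
print(passes_joint_threshold(1.39, 1.97))  # Umbrella Falafel -> False
print(required_partner_score(1.61))        # weakest first name needs 4.99
```

Note that under this rule the weakest first name in the data (1.61) would need a last name scoring 4.99, which is still within the 5.30 maximum for last names.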
For more information: here’s the article on The Toast where I explain the constraints, here’s the Benedict Cumberbatch name generator where I got the data, and here, here, and here is more information on Harmonic Grammar, which is a linguistic theory based on assigning constraints a numerical ranking score, often expressed as a percentage (compare Optimality Theory, where you rank constraints in order but there is no way to specify that Constraint X is twice as important as Constraint Y but five times more important than Constraint Z).