Many datasets of composers tell us relatively little about them, so we sometimes have to guess details from the information available – such as the composer’s name. Forenames, for example, are often a good indicator of gender, as described in this previous article. Titles – associated with the church, aristocracy or royalty – can also reveal gender, and tell us about occupation or social class. This article looks at what names can tell us about nationality – based on a recent attempt to identify Italian composers among the many obscure and unknown names listed in the British Library’s music catalogue.
Surnames and forenames can sometimes be very good indicators of where somebody is from. Gabriel Fauré, for example, with the give-away acute accent, is very likely to be French. Sergei Rachmaninov is clearly Russian, as the distinctive ‘-ov’ ending (or ‘-off’, if you prefer that spelling) strongly suggests. Similarly, Antonio Vivaldi has to be Italian, and Gustav Mahler and Johannes Brahms must be from Germany or Austria.
Before we get too carried away, however, it is worth pointing out three reasons why this does not always work. Firstly, names point towards linguistic regions rather than specific countries. It can be hard, without further information, to decide whether a name comes from France or Belgium, for example, or from Portugal or Brazil.
Secondly, people (and their ancestors) often move about and take their names with them. Some of these movements date back many centuries – many English names have French origins dating from the Norman conquest. International migration has accelerated significantly over the last 100 years or so – any of the typically Italian, Russian or German names mentioned above could easily belong to twenty-first century US citizens, for example.
Thirdly, there are many names (especially short ones) that do not contain typical linguistic markers. Elgar, Ravel, Chopin, Grieg (and many others) do not contain many clues as to their origins.
So it is highly likely, when estimating the nationalities of unknown composers using only their names, that the results will be very approximate.
One approach I have used is to take a list of surnames of people with known nationality, and split them into pairs of letters. So ‘Verdi’ becomes `$v`, `ve`, `er`, `rd`, `di`, `i$`, where `$` marks the beginning or end of the name. Using these letter pairs, and the known nationalities, we can use a technique of ‘recursive partitioning’ to create a decision tree, which might start as follows…
- if it contains `i$` (i.e. it ends with an ‘i’), then the name is probably Italian
- otherwise, if it contains `o$`, then it is probably Italian
- otherwise, if it contains `v$`, it is probably Russian
- …and so on
This decision tree can then be used to assign nationalities to names for which we don’t have that information.
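The letter-pair split and the recursive partitioning described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code used for the analysis: the function names, the use of Gini impurity to choose each split, and the tiny list of well-known composers are all assumptions made for the example, so the resulting tree differs from the one outlined above.

```python
from collections import Counter

def letter_pairs(name):
    """Split a surname into overlapping letter pairs, with '$' marking both ends."""
    padded = "$" + name.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def gini(labels):
    """Gini impurity of a list of labels (0 means they are all the same)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow(rows):
    """Recursive partitioning. rows is a list of (set_of_pairs, nationality)."""
    labels = [lab for _, lab in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:
        return majority                       # pure leaf: nothing left to split
    best = None
    for pair in sorted(set().union(*(p for p, _ in rows))):
        yes = [lab for p, lab in rows if pair in p]
        no = [lab for p, lab in rows if pair not in p]
        if not yes or not no:
            continue                          # this pair does not divide the names
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
        if best is None or score < best[0]:
            best = (score, pair)
    if best is None:
        return majority                       # no pair separates the remaining names
    pair = best[1]
    return (pair,
            grow([r for r in rows if pair in r[0]]),
            grow([r for r in rows if pair not in r[0]]))

def predict(tree, name):
    """Follow the 'contains this pair?' questions down to a leaf."""
    pairs = set(letter_pairs(name))
    while isinstance(tree, tuple):
        pair, yes, no = tree
        tree = yes if pair in pairs else no
    return tree

# Toy training data (illustrative only, far too small for real use).
names = ["Verdi", "Puccini", "Rossini", "Vivaldi", "Monteverdi",
         "Rachmaninov", "Prokofiev", "Glazunov",
         "Wagner", "Schumann", "Telemann"]
labels = ["Italian"] * 5 + ["Russian"] * 3 + ["German"] * 3

tree = grow([(set(letter_pairs(n)), lab) for n, lab in zip(names, labels)])
print(letter_pairs("Verdi"))   # ['$v', 've', 'er', 'rd', 'di', 'i$']
print(tree)                    # ('i$', 'Italian', ('v$', 'Russian', 'German'))
print(predict(tree, "Bellini"))
```

Each tuple in the tree reads as "if the name contains this pair, go to the first branch, otherwise the second". With so few names, anything that ends in neither ‘i’ nor ‘v’ falls through to the last leaf, which is a small-scale version of the mis-allocation problem discussed later in the article.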
A refinement of this idea is to use a ‘Random Forest’, which is a large set of decision trees (typically several hundred), each based on a random subset of letter pairs. The trees in the forest combine to give a much more robust model than is achievable with just a single tree.
Applying a random forest to a list of known composers results in the following list of letter pairs emerging as the most important (in descending order): `i$`, `o$`, `v$`, `a$`, `ov`, `ni`, `er`, `an`, `n$`, `ar`, `$b`, `in`, `en`, `r$`, `$a`.[2] The endings of names are clearly the most important indicators of nationality, as we might expect from the examples given above.
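To make the idea of a forest concrete, here is a rough pure-Python sketch, again under stated assumptions rather than a reproduction of the original analysis. Each tree sees a random half of the letter pairs (the subset size is an arbitrary choice) and a bootstrap sample of the names (a standard random-forest ingredient that the description above does not mention). The `split_counts` helper is a crude stand-in for variable importance: it simply counts how often each pair is used as a split, which is not the importance measure that statistical packages usually report.

```python
import random
from collections import Counter

def letter_pairs(name):
    padded = "$" + name.lower() + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow(rows, allowed):
    """Grow one tree, splitting only on the letter pairs in `allowed`."""
    labels = [lab for _, lab in rows]
    majority = Counter(labels).most_common(1)[0][0]
    best = None
    if len(set(labels)) > 1:
        for pair in sorted(allowed):
            yes = [lab for p, lab in rows if pair in p]
            no = [lab for p, lab in rows if pair not in p]
            if yes and no:
                score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(rows)
                if best is None or score < best[0]:
                    best = (score, pair)
    if best is None:
        return majority                       # leaf: pure, or no usable split
    pair = best[1]
    return (pair,
            grow([r for r in rows if pair in r[0]], allowed),
            grow([r for r in rows if pair not in r[0]], allowed))

def grow_forest(names, labels, n_trees=200, seed=0):
    rng = random.Random(seed)
    rows = [(letter_pairs(n), lab) for n, lab in zip(names, labels)]
    all_pairs = sorted(set().union(*(p for p, _ in rows)))
    k = max(1, len(all_pairs) // 2)           # assumed subset size: half the pairs
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]   # bootstrap sample of names
        allowed = rng.sample(all_pairs, k)          # random subset of letter pairs
        forest.append(grow(sample, allowed))
    return forest

def predict(forest, name):
    """Majority vote across all trees; also returns the vote tally."""
    pairs = letter_pairs(name)
    votes = Counter()
    for tree in forest:
        while isinstance(tree, tuple):
            pair, yes, no = tree
            tree = yes if pair in pairs else no
        votes[tree] += 1
    return votes.most_common(1)[0][0], votes

def split_counts(forest):
    """Crude importance: how many times each pair is used as a split."""
    counts = Counter()
    stack = list(forest)
    while stack:
        node = stack.pop()
        if isinstance(node, tuple):
            counts[node[0]] += 1
            stack.extend(node[1:])
    return counts

# Toy data (illustrative only).
names = ["Verdi", "Puccini", "Rossini", "Vivaldi", "Monteverdi",
         "Rachmaninov", "Prokofiev", "Glazunov",
         "Wagner", "Schumann", "Telemann"]
labels = ["Italian"] * 5 + ["Russian"] * 3 + ["German"] * 3

forest = grow_forest(names, labels)
print(predict(forest, "Bellini"))
print(split_counts(forest).most_common(5))
```

The vote tally returned by `predict` gives a rough sense of how confident the forest is about a name, which a single tree cannot provide.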
The nationalities predicted best by this technique are Italian (13% error),[3] German (37%), Russian (40%) and French (42%). Those that do not work so well include Spanish (78% error) and British (80%). The worst are the nationalities with small numbers of composers (Portuguese, Danish, etc.) or that share a language with a larger neighbour (Irish, Austrian, etc.) – all of these have error rates of around 90% or more.
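The error rates quoted here are, for each nationality, the proportion of composers who actually have that nationality but were classified as something else. A minimal sketch of that calculation follows; the function name and the six made-up labels are purely illustrative and are not results from the analysis.

```python
from collections import Counter

def error_rates(actual, predicted):
    """For each actual nationality, the share of its names predicted as something else."""
    totals = Counter(actual)
    wrong = Counter(a for a, p in zip(actual, predicted) if a != p)
    return {nat: wrong[nat] / totals[nat] for nat in totals}

# Made-up labels, purely to show the arithmetic.
actual    = ["Italian", "Italian", "Italian", "Italian", "British", "British"]
predicted = ["Italian", "Italian", "Italian", "German", "American", "British"]

print(error_rates(actual, predicted))  # {'Italian': 0.25, 'British': 0.5}
```

Note that this measure says nothing about names wrongly assigned to a nationality, only about names wrongly assigned away from it.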
These high error rates reflect the problems mentioned above. Many mis-allocated names end up being classified as American, due to the particularly cosmopolitan selection of names found in the US, combined with a large population of composers (especially post-1900).
It might be possible to improve these estimates. There are several alternatives to random forests (such as ‘neural networks’, or ‘support vector machines’) that might give better results. Letter pairs are not necessarily the best predictor – perhaps letter triplets would be better, or a system that took the position of letters into account.[4] A better model might use both the first name and the surname, rather than just the surname. It might also be worth restricting the dates if possible – in particular limiting the data to people born before about 1900, when the linkage between names and nationalities was more homogeneous.
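The alternative predictors suggested above are easy to generate, even if it is unclear whether they would help. The sketch below shows one possible way to build letter triplets, position-aware features for the last few letters, and combined forename and surname features. All of the function names, the `f:` and `s:` prefixes and the `$` padding for short names are assumptions made for this example.

```python
def letter_ngrams(name, n=2):
    """Overlapping groups of n letters, with '$' marking both ends (n=3 gives triplets)."""
    padded = "$" + name.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def last_letters(name, k=5):
    """The last k letters by position, counted from the end; '$' pads short names."""
    tail = name.lower()[-k:].rjust(k, "$")
    return {f"last{k - i}": ch for i, ch in enumerate(tail)}

def name_features(forename, surname, n=2):
    """Letter groups from both names, prefixed so that the two are kept apart."""
    return ({"f:" + g for g in letter_ngrams(forename, n)} |
            {"s:" + g for g in letter_ngrams(surname, n)})

print(letter_ngrams("Verdi", 3))   # ['$ve', 'ver', 'erd', 'rdi', 'di$']
print(last_letters("Cui"))
print(sorted(name_features("Giuseppe", "Verdi")))
```

Any of these feature sets could be fed to the same tree-growing procedure in place of the plain letter pairs.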
These difficulties make this approach, as it stands, of limited use in most situations. For the purpose of identifying Italian composers among the 100,000+ names listed in the British Library music catalogue, it worked moderately well, although the likely errors are still significant.[5] It would clearly be unwise to rely on this approach, without further research, for specific individuals. Nevertheless, in the absence of other information about these numerous but obscure figures in music history, even approximate methods such as this can be of some use in revealing – or at least suggesting – overall trends and patterns that can be investigated in more detail.
- [2] This used the list compiled for this analysis of British composers mentioned by Hofmeister.
- [3] This is the proportion of actual Italians who would have been classified as something else.
- [4] An approach using a random forest on the last five letters of the name, in order, did not do as well as using letter-pairs, but a combination of these might be worth a try.
- [5] It is hard to be precise about the number of names in the BL catalogue, as there is a lot of duplication, with variations of spelling, format and punctuation, along the lines discussed in this article.