Deduplication is an important, though often messy and time-consuming, part of many statistical investigations. It is usually required when data comes from several different sources, to identify all of the records that actually refer to the same thing. For example, I have recently been deduplicating the names appearing in the ‘women composers’ sources listed in this previous article. Deduplication may also be needed where several publications of the same work are described in different ways in a library catalogue.
Here are the different listings for the same person in the various ‘women composers’ sources…
[1] "Maria Antonia Walpurgis, Kurfürstin von Sachsen" [2] "Maria-Antonia (Grande Duchesse de Saxe)" [3] "Walpurgis, Maria Antonia" [4] "Maria Antonia Kurfürstin von Sachsen" [5] "Maria Antonia Walpurgis, Electress Of Saxony" [6] "Maria Antonia Walpurgis of Bavaria" [7] "Maria Walpurgis" [8] "Maria WALPURGIS" [9] "Maria Antonia Walpurgis"
Maria Walpurgis (1724-1780) was a German composer who wrote operas and other works. The variations in her name across the different sources include
- changes of order (surname-forename or forename-surname)
- with or without middle names
- differences of capitalisation and punctuation (commas, hyphens, parentheses)
- with and without titles, and with different titles (she was both princess of Bavaria and duchess of Saxony)
- language differences (
"Kurfürstin von Sachsen", "Grande Duchesse de Saxe", "Electress Of Saxony"
)
Other variations that may be found include
- variable use of accents (
"Bonis, Melanie", "Bonis, Mélanie"
) - pseudonyms or variant names
- initials or shortened forms (
"Melanie Bonis", "Mel Bonis", "M Bonis", "Bonis"
) - married or maiden names (
"Fanny Hensel", "Fanny Mendelssohn", "Fanny Hensel-Mendelssohn", "Fanny Hensel (née Mendelssohn)"
or even forms along the lines of"Mrs Wilhelm Hensel"
) - spelling or scanning mistakes.
If we are researching a single composer, these different representations are not usually a problem, but with a list of several thousand names,1 we need a way to do at least some of this deduplication automatically. What we are aiming for is a ‘lookup table’ giving the original names, and the deduplicated names, which can be used to link each entry to a single standardised name, such as “Walpurgis, Maria”.
One tool that can be useful for this sort of thing is OpenRefine. There are instructions on the OpenRefine website for how to use it,2 but of particular interest is the description of the techniques it uses for ‘clustering’, i.e. identifying possible duplicates. A number of methods are described for comparing text strings (using ‘fingerprints’, various distance measures, phonetic transformations, etc), which between them cover many of the common sources of variation mentioned above. OpenRefine does not automatically deduplicate a list for you, but it makes it very much easier!
The ‘distance’ between two character strings is a useful concept for deduplication. One common measure – the Levenshtein Distance – is the minimum number of edits (insertions, deletions, or substitutions) needed to turn one word into another. Other measures exist, which can be more suitable for certain purposes, such as comparing names, or picking up common spelling or typing errors. A small distance between two strings can be an indicator that they might be duplicates.
Having used OpenRefine to do the first wave of deduplication of women composers, I used four further passes – each of which found some duplicates missed by previous attempts…
- When extracting the data, I had also identified birth years for many composers, so I used a composite measure (based on Levenshtein distance plus difference in birth years) to identify some further potential duplicates.
- Some sources also contained variant names and pseudonyms. I was able to extract these and find a few more duplicates.
- For each composer, I created a list of all of the names appearing in any of the sources (including the pseudonyms found above), ignoring capitalisation, punctuation and accents, and removing words of three letters or fewer. For Maria Walpurgis, for example, this list included
"maria", "antonia", "walpurgis", "kurfurstin", "sachsen", "saxony", "duchesse", etc
. I then wrote a program to find composers with at least two words in common. This produced a few more possible duplicates, which were reviewed manually. - The final phase was to load everything into a spreadsheet and carry out a manual review. Sorting by birthdate, alphabetically by forename or surname, and looking for anything suspicious (such as very long entries, or names with numbers in) will usually reveal a few new duplicates, non-duplicates, and other errors in the data. It was at this stage that I found a man lurking among the female composers!
Deduplication is far from being an exact science that can be readily automated, but there are techniques that can identify potential duplicates for further consideration. There are certainly still a few errors in my deduplicated list, but they are probably relatively few. Some might come to light later, others will not, but the effect on the subsequent analysis should be minimal. It took perhaps three or four days of work to reduce the list of almost 11,000 names down to 6,148 deduplicated women composers, which is time well spent in terms of the quality and credibility of the subsequent analysis.3
- The ‘women composers’ sources resulted in a list of almost 11,000 names, including duplicates.
- Broadly the process is to start with the deduplicated names equal to the original names, to do a ‘text facet’ on the deduplicated names, and then use the ‘cluster’ button. Be careful not to deduplicate the original names column, as you will need this intact to do the lookup afterwards.
- As an aside, it is worth noting that most of the duplicates were found among a relatively small number of ‘big names’ – Maria Walpurgis had nine variants, and she was by no means the highest. 74% of the 6,148 deduplicated names only appeared in one source.