Song Lyrics 8: Parts of Speech Tagging

In this previous post in the series, we used capitalisation to identify proper nouns (names, places, etc) in our dataset of song lyrics. Other parts of speech – verbs, adjectives, etc – are not so easy to identify, although software exists to do just that.

The R package koRpus will take a piece of text and run it through TreeTagger, which is a standard piece of software for a process known as ‘parts of speech (POS) tagging’.¹ The output of the process is a tag associated with every word – marking it as, for example, a noun, adjective, or conjunction.

For those of us who speak English fluently, assigning these tags is not a problem, but for a computer the task is not trivial. To tell whether the word “light”, for example, is a noun, a verb or an adjective, we have to look at other words around it in order to work out the context in which it is being used. For example: “turn on the light” (noun), “you light up my life” (verb), “this bag is very light” (adjective). There are very many English words that are just as hard to categorise – “close”, “fine”, “home”, “like”, “mine”, “mean”, “minute”. We also have to recognise plurals, possessives, and the many forms and tenses of verbs. Even for software like TreeTagger, therefore, this can be quite time-consuming. On my (admittedly rather ageing) PC, it would have taken over two days to tag all of the song lyrics in the dataset. Instead, I tagged only the lyrics of the well-known artists used in some previous articles. This cut the processing time down to about an hour.
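To illustrate why the surrounding words matter for a word like “light”, here is a toy sketch in Python. It is not how TreeTagger works (TreeTagger relies on a statistical model trained on annotated text); the word lists and the function name `guess_class_of_light` are invented for this illustration, and the rule only looks at the single word before “light”.

```python
# Toy illustration only: guess the word class of "light" from the word
# immediately before it. The word lists below are assumptions made for this
# example, not part of any real tagger.

DETERMINERS = {"the", "a", "an", "this", "that", "my", "your"}
PRONOUNS = {"i", "you", "we", "they"}
INTENSIFIERS = {"very", "so", "too", "quite"}


def guess_class_of_light(sentence: str) -> str:
    """Return a rough word class for the first 'light' in the sentence."""
    words = sentence.lower().split()
    position = words.index("light")  # raises ValueError if "light" is absent
    previous = words[position - 1] if position > 0 else ""
    if previous in DETERMINERS:
        return "noun"
    if previous in PRONOUNS:
        return "verb"
    if previous in INTENSIFIERS:
        return "adjective"
    return "unknown"


for example in [
    "turn on the light",
    "you light up my life",
    "this bag is very light",
]:
    print(example, "->", guess_class_of_light(example))
```

A rule this crude breaks easily: in “the light blue bag” it would call “light” a noun. That fragility is part of why real taggers need more context and more computation.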

Tagging actually produces two levels of detail. At the high level, there are broad categories – adjective, adverb, pronoun, etc – and then subcategories within these. Adjectives, for example, include ordinary adjectives (“good”), comparatives (“better”) and superlatives (“best”). The full set of 58 detailed tags can be found here.²
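The two levels can be pictured as a lookup from detailed tag to broad category. The Python sketch below covers only a handful of tags, taken from the adjective example above and from the tagged output shown later in this article; it is a partial, hand-built table for illustration, not the full tag set and not koRpus's own mapping. The names `TAG_TO_CLASS`, `broad_class` and `tags_in_class` are made up for this example.

```python
# Partial, hand-built mapping from detailed tags to broad word classes.
# Only tags that appear in this article are included; the full tag set
# is much larger.
TAG_TO_CLASS = {
    "JJ": "adjective",   # ordinary adjective ("good")
    "JJR": "adjective",  # comparative ("better")
    "JJS": "adjective",  # superlative ("best")
    "RB": "adverb",
    "IN": "preposition",
    "DT": "determiner",
    "NN": "noun",
    "NP": "name",
    "PP": "pronoun",
    "VBZ": "verb",
    "VVD": "verb",
}


def broad_class(tag: str) -> str:
    """Look up the broad category for a detailed tag."""
    return TAG_TO_CLASS.get(tag, "unknown")


def tags_in_class(word_class: str) -> list[str]:
    """List the detailed tags that fall under one broad category."""
    return sorted(t for t, c in TAG_TO_CLASS.items() if c == word_class)


print(broad_class("JJR"))
print(tags_in_class("adjective"))
```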

There are a few reasons why the tagging of song lyrics is not always accurate or meaningful. They don’t always make grammatical sense, so the text parsing rules might not be able to work out the context of each word accurately. Lyrics are often not punctuated as normal sentences: the line/verse structure can be used to indicate the breaks, with full stops and other punctuation omitted (even though they are implied). There can be a lot of (semantically unnecessary) repetition. Songs often use shortened forms of words (“jammin”, “gonna”) or nonsense filler words (“yeah”, “ah”). Some of these will be correctly tagged (e.g. as UH – interjection), but others, such as “la”, are often tagged as FW – foreign words.

Nevertheless, it is possible to come up with tagged song lyrics. Here is an example of the start of one of the more grammatically coherent songs: “Somewhere Over the Rainbow”…

    Ref     token     tag     lemma lttr      wclass  stop idx sntc
 1  938 Somewhere      RB somewhere    9      adverb  TRUE   1    1
 2  938      over      IN      over    4 preposition  TRUE   2    1
 3  938       the      DT       the    3  determiner  TRUE   3    1
 4  938   rainbow      NN   rainbow    7        noun FALSE   4    1
 5  938         ,       ,         ,    1       comma FALSE   5    1
 6  938       Way      NP       Way    3        name  TRUE   6    1
 7  938        up      RB        up    2      adverb  TRUE   7    1
 8  938      high      JJ      high    4   adjective  TRUE   8    1
 9  938         .    SENT         .    1    fullstop FALSE   9    1
 10 938     There      EX     there    5 existential  TRUE  10    2
 11 938        's     VBZ        be    2        verb FALSE  11    2
 12 938         a      DT         a    1  determiner  TRUE  12    2
 13 938      land      NN      land    4        noun FALSE  13    2
 14 938      that IN/that      that    4 preposition  TRUE  14    2
 15 938         I      PP         I    1     pronoun  TRUE  15    2
 16 938     heard     VVD      hear    5        verb FALSE  16    2
 17 938        of      IN        of    2 preposition  TRUE  17    2
 18 938         ,       ,         ,    1       comma FALSE  18    2
 19 938      Once      RB      once    4      adverb  TRUE  19    2
 20 938        in      IN        in    2 preposition  TRUE  20    2
 21 938         a      DT         a    1  determiner  TRUE  21    2
 22 938   lullaby      NN   lullaby    7        noun FALSE  22    2
 23 938         .    SENT         .    1    fullstop FALSE  23    2

We get one row for each word or punctuation mark. Ref is the identification number of this song in the dataset. token is the word or punctuation mark, and tag is the POS tag assigned to it.³ The lemma is a standardised form from which the word is derived – such as the infinitive of a verb, or the singular of a noun. lttr is the number of letters, and wclass is the coarser classification of the tag. stop identifies stopwords,⁴ idx is an identification number of the token, and sntc is the sentence to which the token belongs.
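As a sketch of how these columns can be used, the Python snippet below re-enters the first thirteen rows of the table above by hand and then groups tokens by sentence and picks out the lemmas of non-stopword, non-punctuation tokens. It does not use koRpus; the class `TaggedToken` and the helper functions are hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class TaggedToken:
    ref: int      # song identification number
    token: str    # word or punctuation mark
    tag: str      # detailed POS tag
    lemma: str    # standardised form of the word
    lttr: int     # number of letters
    wclass: str   # broad word class
    stop: bool    # True if the token is a stopword
    idx: int      # token number
    sntc: int     # sentence number


# First thirteen rows of the output shown above, typed in by hand.
ROWS = [
    TaggedToken(938, "Somewhere", "RB", "somewhere", 9, "adverb", True, 1, 1),
    TaggedToken(938, "over", "IN", "over", 4, "preposition", True, 2, 1),
    TaggedToken(938, "the", "DT", "the", 3, "determiner", True, 3, 1),
    TaggedToken(938, "rainbow", "NN", "rainbow", 7, "noun", False, 4, 1),
    TaggedToken(938, ",", ",", ",", 1, "comma", False, 5, 1),
    TaggedToken(938, "Way", "NP", "Way", 3, "name", True, 6, 1),
    TaggedToken(938, "up", "RB", "up", 2, "adverb", True, 7, 1),
    TaggedToken(938, "high", "JJ", "high", 4, "adjective", True, 8, 1),
    TaggedToken(938, ".", "SENT", ".", 1, "fullstop", False, 9, 1),
    TaggedToken(938, "There", "EX", "there", 5, "existential", True, 10, 2),
    TaggedToken(938, "'s", "VBZ", "be", 2, "verb", False, 11, 2),
    TaggedToken(938, "a", "DT", "a", 1, "determiner", True, 12, 2),
    TaggedToken(938, "land", "NN", "land", 4, "noun", False, 13, 2),
]

PUNCTUATION_CLASSES = {"comma", "fullstop"}


def tokens_by_sentence(rows):
    """Group the tokens by their sentence number (the sntc column)."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row.sntc, []).append(row.token)
    return grouped


def content_lemmas(rows):
    """Lemmas of tokens that are neither stopwords nor punctuation."""
    return [
        row.lemma
        for row in rows
        if not row.stop and row.wclass not in PUNCTUATION_CLASSES
    ]


print(tokens_by_sentence(ROWS))
print(content_lemmas(ROWS))
```

Note that the lemma column turns the contracted “'s” into “be”, which is what makes lemmas more useful than raw tokens for counting.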

There are some problems here. The first “sentence” is not a grammatical sentence, as it has no verb. TreeTagger marks it as a sentence because of the full stop – which should probably be a comma. Such punctuation inconsistencies are very common in song lyrics. Also, TreeTagger gets confused by mid-sentence capital letters (which are actually just the start of new lines): “Way” is wrongly classified as a name for this reason.

POS tagging enables us to do much more detailed analysis than we have been able to so far. We can look at specific types of word – for example:

comparative adjectives JJR: the top five are “better”, “stronger”, “older”, “worse”, “higher” (in descending order)
numbers CD: “one”, “two”, “three”, “four”, “thousand”
plural nouns NNS: “things”, “eyes”, “people”, “days”, “dreams”
verbs in the past tense V*D: “was”, “got”, “did”, “went”, “saw”
possessive pronouns PP$: “my”, “your”, “his”, “her”, “their”
question words W*: “when”, “what”, “where”, “how”, “who”
modal verbs MD: “will”, “can”, “could”, “would”, “should”.
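Counts like those in the list above, including the wildcard groups V*D and W*, can be produced with a simple pattern match on the tag column. The Python sketch below uses a tiny made-up sample of (token, tag) pairs rather than the real dataset, so the counts it prints mean nothing in themselves; `SAMPLE` and `top_tokens` are hypothetical names.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Made-up (token, tag) pairs for illustration only; not real lyrics data.
SAMPLE = [
    ("I", "PP"), ("was", "VBD"), ("better", "JJR"), ("when", "WRB"),
    ("you", "PP"), ("got", "VVD"), ("stronger", "JJR"),
    ("What", "WP"), ("was", "VBD"), ("better", "JJR"),
    ("who", "WP"), ("was", "VBD"), ("my", "PP$"), ("got", "VVD"),
]


def top_tokens(pairs, tag_pattern, n=5):
    """Most common lower-cased tokens whose tag matches a wildcard pattern.

    The pattern uses shell-style wildcards, so "V*D" matches VBD and VVD,
    while "PP" matches only the exact tag PP (not PP$).
    """
    counts = Counter(
        token.lower() for token, tag in pairs if fnmatchcase(tag, tag_pattern)
    )
    return counts.most_common(n)


for pattern in ["JJR", "V*D", "W*", "PP$"]:
    print(pattern, top_tokens(SAMPLE, pattern))
```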

Perhaps even more powerful is the ability to look for combinations – such as adjective + noun, pronoun + verb, etc. And of course we could use any of these to look for differences and similarities between artists, or trends over time.
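A minimal Python sketch of the combination idea: scan consecutive tokens and keep the pairs whose broad word classes match a chosen pattern. The tagged phrase here is invented and hand-labelled for the example, and `find_pairs` is a hypothetical helper, not a koRpus function. A fuller version would probably also avoid pairing tokens across sentence boundaries.

```python
# Invented, hand-labelled (token, word class) sequence for illustration.
TAGGED = [
    ("I", "pronoun"), ("saw", "verb"), ("a", "determiner"),
    ("blue", "adjective"), ("moon", "noun"), (",", "comma"),
    ("you", "pronoun"), ("saw", "verb"), ("a", "determiner"),
    ("lonely", "adjective"), ("heart", "noun"),
]


def find_pairs(tagged, first_class, second_class):
    """Return adjacent token pairs whose word classes match the two given."""
    pairs = []
    for (word1, class1), (word2, class2) in zip(tagged, tagged[1:]):
        if class1 == first_class and class2 == second_class:
            pairs.append((word1.lower(), word2.lower()))
    return pairs


print(find_pairs(TAGGED, "adjective", "noun"))
print(find_pairs(TAGGED, "pronoun", "verb"))
```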

As usual, there are reasons to be cautious when using this technique on song lyrics. Nevertheless, POS tagging is a handy tool to be aware of, and may be more useful in other contexts, for example in analysing writing about music (concert reviews, composer biographies, or academic articles).

Cite this article as: Gustar, A.J. 'Song Lyrics 8: Parts of Speech Tagging' in Statistics in Historical Musicology, 23rd November 2019, https://musichistorystats.com/song-lyrics-8-parts-of-speech-tagging/.

¹ See here for details of TreeTagger.
² There are other tagging systems in use. This Wikipedia article gives a good overview of the different approaches.
³ See previous link.
⁴ See this previous article for an explanation of stopwords.
