
In this previous post in the series, we used capitalisation to identify proper nouns (names, places, etc) in our dataset of song lyrics. Other parts of speech – verbs, adjectives, etc – are not so easy to identify, although software exists to do just that.
The R package koRpus will take a piece of text and run it through TreeTagger, which is a standard piece of software for a process known as ‘part-of-speech (POS) tagging’ [1]. The output of the process is a tag associated with every word – marking it as, for example, a noun, adjective, or conjunction.
For those of us who speak English fluently, this is not a problem, but for a computer the task is not trivial. To tell whether the word “light”, for example, is a noun, a verb or an adjective, we have to look at the words around it to work out the context in which it is being used: “turn on the light” (noun), “you light up my life” (verb), “this bag is very light” (adjective). Many other English words are just as hard to categorise – “close”, “fine”, “home”, “like”, “mine”, “mean”, “minute”. We also have to recognise plurals, possessives, and the many forms and tenses of verbs. Even for software like TreeTagger, therefore, this can be quite time-consuming. On my (admittedly rather ageing) PC, it would have taken over two days to tag all of the song lyrics in the dataset. Instead, I tagged only the songs by the list of well-known artists used in some previous articles. This cut the processing time down to about an hour.
Tagging actually produces two levels of detail. At the high level, there are broad categories – adjective, adverb, pronoun, etc – and then subcategories within these. Adjectives, for example, include ordinary adjectives (“good”), comparatives (“better”) and superlatives (“best”). The full set of 58 detailed tags can be found here [2].
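The relationship between the two levels can be illustrated with a small lookup. This is only a sketch in Python (not the R used for the actual tagging), and it covers just a handful of the tags that appear in this article. The names `COARSE_CLASS` and `coarse_class` are hypothetical, invented for illustration; they are not part of koRpus or TreeTagger.

```python
# Hypothetical, partial mapping from detailed tags to broad word classes.
# Only tags mentioned in this article are included; the real tagset is larger.
COARSE_CLASS = {
    "JJ": "adjective",   # ordinary adjective, e.g. "good"
    "JJR": "adjective",  # comparative, e.g. "better"
    "JJS": "adjective",  # superlative, e.g. "best"
    "NN": "noun",
    "NNS": "noun",
    "RB": "adverb",
    "DT": "determiner",
    "IN": "preposition",
}


def coarse_class(tag):
    """Return the broad class for a detailed tag, or 'unknown' if not listed."""
    return COARSE_CLASS.get(tag, "unknown")


for word, tag in [("good", "JJ"), ("better", "JJR"), ("best", "JJS")]:
    print(word, tag, coarse_class(tag))
```

All three adjectives map to the same broad class, while the detailed tag keeps the distinction between ordinary, comparative and superlative forms.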
There are a few reasons why the tagging of song lyrics is not always accurate or meaningful. Lyrics don’t always make grammatical sense, so the text parsing rules might not be able to work out the context of each word accurately. Lyrics are often not punctuated as normal sentences: the line/verse structure can be used to indicate the breaks, with full stops and other punctuation omitted (even though they are implied). There can be a lot of (semantically unnecessary) repetition. Songs often use shortened forms of words (“jammin”, “gonna”) or nonsense filler words (“yeah”, “ah”). Some of these will be correctly tagged (e.g. as `UH`, interjection), but others, such as “la”, are often tagged as `FW` (foreign word).
Nevertheless, it is possible to come up with tagged song lyrics. Here is an example of the start of one of the more grammatically coherent songs: “Somewhere Over the Rainbow”…
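| | Ref | token | tag | lemma | lttr | wclass | stop | idx | sntc |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 938 | Somewhere | RB | somewhere | 9 | adverb | TRUE | 1 | 1 |
| 2 | 938 | over | IN | over | 4 | preposition | TRUE | 2 | 1 |
| 3 | 938 | the | DT | the | 3 | determiner | TRUE | 3 | 1 |
| 4 | 938 | rainbow | NN | rainbow | 7 | noun | FALSE | 4 | 1 |
| 5 | 938 | , | , | , | 1 | comma | FALSE | 5 | 1 |
| 6 | 938 | Way | NP | Way | 3 | name | TRUE | 6 | 1 |
| 7 | 938 | up | RB | up | 2 | adverb | TRUE | 7 | 1 |
| 8 | 938 | high | JJ | high | 4 | adjective | TRUE | 8 | 1 |
| 9 | 938 | . | SENT | . | 1 | fullstop | FALSE | 9 | 1 |
| 10 | 938 | There | EX | there | 5 | existential | TRUE | 10 | 2 |
| 11 | 938 | 's | VBZ | be | 2 | verb | FALSE | 11 | 2 |
| 12 | 938 | a | DT | a | 1 | determiner | TRUE | 12 | 2 |
| 13 | 938 | land | NN | land | 4 | noun | FALSE | 13 | 2 |
| 14 | 938 | that | IN/that | that | 4 | preposition | TRUE | 14 | 2 |
| 15 | 938 | I | PP | I | 1 | pronoun | TRUE | 15 | 2 |
| 16 | 938 | heard | VVD | hear | 5 | verb | FALSE | 16 | 2 |
| 17 | 938 | of | IN | of | 2 | preposition | TRUE | 17 | 2 |
| 18 | 938 | , | , | , | 1 | comma | FALSE | 18 | 2 |
| 19 | 938 | Once | RB | once | 4 | adverb | TRUE | 19 | 2 |
| 20 | 938 | in | IN | in | 2 | preposition | TRUE | 20 | 2 |
| 21 | 938 | a | DT | a | 1 | determiner | TRUE | 21 | 2 |
| 22 | 938 | lullaby | NN | lullaby | 7 | noun | FALSE | 22 | 2 |
| 23 | 938 | . | SENT | . | 1 | fullstop | FALSE | 23 | 2 |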
We get one row for each word or punctuation mark. `Ref` is the identification number of this song in the dataset. `token` is the word or punctuation mark, and `tag` is the POS tag assigned to it [3]. The `lemma` is a standardised form from which the word is derived, such as the infinitive of a verb or the singular of a noun. `lttr` is the number of letters, and `wclass` is the coarser classification of the `tag`. `stop` identifies stopwords [4], `idx` is an identification number of the token, and `sntc` is the sentence to which the token belongs.
There are some problems here. The first “sentence” is not a grammatical sentence, as it has no verb. TreeTagger marks it as a sentence because of the full stop – which should probably be a comma. Such punctuation inconsistencies are very common in song lyrics. Also, TreeTagger gets confused by mid-sentence capital letters (which are actually just the start of new lines): “Way” is wrongly classified as a name for this reason.
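One way to spot this kind of problem automatically is to look for “sentences” that contain no verb at all. The following is a minimal Python sketch, assuming the tagged output has been reduced to (token, wclass, sntc) triples; the rows are copied from the table above, and the function name `sentences_without_verb` is made up for this example. It checks only for the word class `verb`, so it is a rough heuristic rather than a real grammar check.

```python
from collections import defaultdict

# (token, wclass, sntc) for the rows shown in the table above.
ROWS = [
    ("Somewhere", "adverb", 1), ("over", "preposition", 1),
    ("the", "determiner", 1), ("rainbow", "noun", 1), (",", "comma", 1),
    ("Way", "name", 1), ("up", "adverb", 1), ("high", "adjective", 1),
    (".", "fullstop", 1),
    ("There", "existential", 2), ("'s", "verb", 2), ("a", "determiner", 2),
    ("land", "noun", 2), ("that", "preposition", 2), ("I", "pronoun", 2),
    ("heard", "verb", 2), ("of", "preposition", 2), (",", "comma", 2),
    ("Once", "adverb", 2), ("in", "preposition", 2), ("a", "determiner", 2),
    ("lullaby", "noun", 2), (".", "fullstop", 2),
]


def sentences_without_verb(rows):
    """Return the sentence numbers that contain no token of class 'verb'."""
    classes_by_sentence = defaultdict(set)
    for _token, wclass, sntc in rows:
        classes_by_sentence[sntc].add(wclass)
    return sorted(
        sntc for sntc, classes in classes_by_sentence.items()
        if "verb" not in classes
    )


print(sentences_without_verb(ROWS))  # [1]
```

Only the first “sentence” is flagged, which matches the observation that it has no verb.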
POS tagging enables us to do much more detailed analysis than we have been able to so far. We can look at specific types of word – for example:
- comparative adjectives (`JJR`): the top five are “better”, “stronger”, “older”, “worse”, “higher” (in descending order)
- numbers (`CD`): “one”, “two”, “three”, “four”, “thousand”
- plural nouns (`NNS`): “things”, “eyes”, “people”, “days”, “dreams”
- verbs in the past tense (`V*D`): “was”, “got”, “did”, “went”, “saw”
- possessive pronouns (`PP$`): “my”, “your”, “his”, “her”, “their”
- question words (`W*`): “when”, “what”, “where”, “how”, “who”
- modal verbs (`MD`): “will”, “can”, “could”, “would”, “should”.
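A “top five” list of this kind boils down to filtering tokens by tag and counting them. Here is a hedged Python sketch of that idea, not the code used to produce the figures above. It treats `*` in a tag pattern as a shell-style wildcard, so `V*D` matches tags such as `VBD` and `VVD`. The function name `top_words` and the sample data are invented for illustration.

```python
from collections import Counter
from fnmatch import fnmatchcase


def top_words(tagged, pattern, n=5):
    """Most common lower-cased tokens whose tag matches a wildcard pattern."""
    counts = Counter(
        token.lower() for token, tag in tagged if fnmatchcase(tag, pattern)
    )
    return [word for word, _count in counts.most_common(n)]


# Made-up (token, tag) pairs, purely for illustration; not the real dataset.
SAMPLE = [
    ("I", "PP"), ("was", "VBD"), ("there", "RB"), ("when", "WRB"),
    ("you", "PP"), ("went", "VVD"), ("and", "CC"), ("she", "PP"),
    ("was", "VBD"), ("better", "JJR"), ("What", "WP"), ("was", "VBD"),
    ("that", "DT"),
]

print(top_words(SAMPLE, "V*D"))  # ['was', 'went']
print(top_words(SAMPLE, "JJR"))  # ['better']
print(top_words(SAMPLE, "W*"))   # the two question words in the sample
```

Lower-casing the tokens before counting means that “What” at the start of a line and “what” mid-line are counted as the same word.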
Perhaps even more powerful is the ability to look for combinations – such as adjective + noun, pronoun + verb, etc. And of course we could use any of these to look for differences and similarities between artists, or trends over time.
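As a rough illustration of what looking for combinations might involve, the Python sketch below counts adjacent pairs of tokens with given word classes, ignoring pairs that straddle a sentence boundary. It assumes (token, wclass, sntc) triples like those in the tagged output; the function name `class_bigrams` and the sample rows are made up for the example.

```python
from collections import Counter


def class_bigrams(rows, first, second):
    """Count adjacent token pairs, within one sentence, with the given classes."""
    counts = Counter()
    for (tok_a, cls_a, sent_a), (tok_b, cls_b, sent_b) in zip(rows, rows[1:]):
        if sent_a == sent_b and cls_a == first and cls_b == second:
            counts[(tok_a.lower(), tok_b.lower())] += 1
    return counts


# Made-up (token, wclass, sntc) rows, purely for illustration.
SAMPLE = [
    ("blue", "adjective", 1), ("skies", "noun", 1), ("and", "conjunction", 1),
    ("blue", "adjective", 1), ("skies", "noun", 1), (".", "fullstop", 1),
    ("Cold", "adjective", 2), ("heart", "noun", 2),
]

print(class_bigrams(SAMPLE, "adjective", "noun").most_common())
# [(('blue', 'skies'), 2), (('cold', 'heart'), 1)]
```

The same function works for other pairings, such as pronoun + verb, by changing the two class names.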
As usual, there are reasons to be cautious when using this technique on song lyrics. Nevertheless, POS tagging is a handy tool to be aware of, and may be more useful in other contexts, for example in the analysis of writing about music (concert reviews, composer biographies, or academic articles).
1. See here for details of TreeTagger.
2. There are other tagging systems in use. This Wikipedia article gives a good overview of the different approaches.
3. See previous link.
4. See this previous article for an explanation of stopwords.