In the previous article in this series we looked at counting the frequency of words in a dataset of song lyrics. This time we will look at combinations of words – or n-grams.
N-grams are groups of consecutive words taken n at a time. For the first sentence of this article, the 2-grams (or ‘bigrams’) are “in the”, “the previous”, “previous article”, “article in”, “in this”, etc. The 3-grams (‘trigrams’) are “in the previous”, “the previous article”, “previous article in”, etc.
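The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the article's analysis: the `ngrams` helper is a hypothetical name, and the input is a short fragment of the first sentence, lower-cased and split on spaces.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a space-joined string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Illustrative fragment of the article's first sentence, split on spaces.
words = "we looked at counting the frequency of words".split()

bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)

print(bigrams)    # 7 bigrams from 8 words, starting with "we looked"
print(trigrams)   # 6 trigrams, starting with "we looked at"
```

A text of N words yields N - n + 1 n-grams, so asking for n-grams longer than the text simply returns an empty list.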
Before leaping into the analysis, we need to prepare the data. Last time we discussed the removal of ‘stopwords’. With n-grams we must consider not only whether to remove stopwords, and which list to use, but how and when to remove them. The ‘when’ question is whether to take out stopwords before or after cutting the text into n-grams. It would normally make sense to create the n-grams first, but in some circumstances it might be more useful to start by taking out the stopwords. “Article series”, for example, would be one of the bigrams of the first sentence if the stopwords were removed first.
Assuming we cut the text into n-grams first, there are two options for how we remove stopwords: we could just delete n-grams consisting wholly of stopwords (“in the”), or we could remove those that include any stopword (“at counting”), which would only keep those n-grams containing no stopwords. This choice can make quite a difference to the analysis and needs to be considered carefully.
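The ‘when’ and ‘how’ choices can be compared side by side. The following Python sketch is illustrative only: the four-word stopword list is a toy assumption (not any standard list), and `ngrams` is a hypothetical helper.

```python
STOPWORDS = {"the", "of", "in", "a"}  # tiny illustrative list, not a standard one

def ngrams(words, n):
    """Return every run of n consecutive words as a space-joined string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "the frequency of words in a dataset of song lyrics".split()

# 'When', option 1: remove stopwords first, then build bigrams.
# This creates pairs such as "frequency words" that never appear in the text.
content_words = [w for w in words if w not in STOPWORDS]
stop_first = ngrams(content_words, 2)

# 'When', option 2: build bigrams first, then filter them.
all_bigrams = ngrams(words, 2)

# 'How', option 1: drop only bigrams made up wholly of stopwords ("in a").
drop_all_stop = [b for b in all_bigrams
                 if not all(w in STOPWORDS for w in b.split())]

# 'How', option 2: drop bigrams containing any stopword.
drop_any_stop = [b for b in all_bigrams
                 if not any(w in STOPWORDS for w in b.split())]

print(stop_first)
print(len(all_bigrams), len(drop_all_stop), len(drop_any_stop))  # 9 8 1
```

On this fragment the stricter filter keeps only “song lyrics”, while the looser one removes just “in a”, which shows how much the choice can change the data that goes into the analysis.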
We must also decide whether to include n-grams that cut across sentences or paragraphs. Would we include “lyrics this” among the bigrams of the first paragraph, for example? In practice, the n-grams that straddle sentences or paragraphs will hopefully be sufficiently rare to not materially affect the analysis.
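One way to see the effect of sentence boundaries is to build bigrams twice: once over the whole text, and once sentence by sentence. The sketch below is an assumption-laden illustration: the regular expressions for tokenising and for splitting sentences are deliberately crude, and `tokenize` and `ngrams` are hypothetical helpers.

```python
import re

def tokenize(text):
    """Lower-case the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(words, n):
    """Return every run of n consecutive words as a space-joined string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

text = "a dataset of song lyrics. This time we will look at combinations of words."

# Bigrams over the whole text, ignoring sentence boundaries.
across = ngrams(tokenize(text), 2)

# Bigrams built separately within each sentence (crude split on . ! ?).
sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
within = [b for s in sentences for b in ngrams(tokenize(s), 2)]

# The bigrams that exist only because they straddle a sentence boundary.
straddling = set(across) - set(within)
print(straddling)  # {'lyrics this'}
```

Song lyrics are often stored without reliable punctuation, so in practice line breaks may be a more useful boundary than full stops.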
The number of bigrams in a text is one less than the number of words, but the variety of possible bigrams is very much larger. Some combinations might be quite common (“in the”, “we looked”), whereas others are unusual (“series we”). With 3-grams and higher the range of possibilities becomes even greater, whilst the number of common n-grams reduces substantially.
Any analysis done with single words (1-grams) could in principle be done with n-grams. We could repeat the approach in the previous article, and generate wordclouds, counts, and tf-idf analyses based on n-grams. However, n-grams reflect more than just a collection of words – they also represent some of the relationships between them. It is therefore often useful to consider subsets of n-grams in order to investigate a particular topic.
For example, we could just consider n-grams containing the word “love”. The following chart shows the top 10 “love” bigrams by decade (no stopwords removed):
“Love you” just beats “I love”, apart from in the 1950s, where the order is reversed. Note that “love I” appears in the top 10 in the 1980s and 2010s, which is surely an example of a bigram straddling lines or sentences. The 1950s’ “love baby” is probably another example.
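A ranking of this kind can be produced by counting, for each decade, only the n-grams that contain the word of interest. The Python sketch below shows the idea under stated assumptions: the three lyric lines are invented purely for illustration (they are not from the dataset), and `top_ngrams_with` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Invented toy data: (decade, lyrics) pairs, NOT taken from the real dataset.
songs = [
    (1950, "i love you and you love me"),
    (1950, "love you baby"),
    (1980, "i love you i love you"),
]

def ngrams(words, n):
    """Return every run of n consecutive words as a space-joined string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def top_ngrams_with(word, songs, n=2, k=10):
    """For each decade, return the k most common n-grams containing `word`."""
    counts = defaultdict(Counter)
    for decade, lyrics in songs:
        for gram in ngrams(lyrics.lower().split(), n):
            if word in gram.split():  # match whole words only, no stopwords removed
                counts[decade][gram] += 1
    return {decade: c.most_common(k) for decade, c in counts.items()}

top = top_ngrams_with("love", songs)
for decade, ranked in sorted(top.items()):
    print(decade, ranked)
```

Calling `top_ngrams_with("love", songs, n=3)` would give the corresponding trigram ranking. Matching on whole words means that “lovely” or “loved” are not counted as “love”.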
We could also look at “love” trigrams. Unsurprisingly, “I love you” has a strong lead in every decade. However, there are fewer than 80 occurrences of this most common trigram in the 425 songs from the 1950s in the dataset. Compare that with 150 of the most common “love” bigram, or over 600 of the most common word.[1] As n-grams get longer, their frequencies get lower, and the statistical significance of our analysis can quickly deteriorate.
Another instructive set of bigrams is the set beginning with “I”: what are the verbs of which the singer is the subject? The following chart shows (among other things) that “I love” has fallen in use over time, while “I don’t” has risen. Again, no stopwords have been removed.
Methodologically, there is a problem here with “I am”. This will often be abbreviated to “I’m”, but in this analysis, where words are defined by the spaces separating them, “I’m” is treated as a single word, not as a bigram. Resolving this issue is not necessarily as simple as splitting apostrophised words into bigrams – if we did that, we would also lose the important distinction between “I can” and “I can’t”.
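The tokenisation problem can be made concrete with a short Python sketch. It is illustrative only: the lyric line is invented, the expansion table is a hypothetical one-entry example, and straight apostrophes are assumed (real lyrics may also use curly ones).

```python
import re

line = "i'm sure i can't stay but i can try"  # invented example line

def i_bigrams(tokens):
    """Return the bigrams whose first word is exactly 'i'."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:]) if a == "i"]

# 1. Words defined by spaces: "i'm" is one word, so "i am" is never counted.
space_tokens = line.split()
print(i_bigrams(space_tokens))     # ["i can't", 'i can']

# 2. Naive fix: split at every apostrophe. "can't" becomes "can" + "t",
#    so "i can't" is now indistinguishable from "i can".
naive_tokens = re.split(r"[\s']+", line)
print(i_bigrams(naive_tokens))     # ['i m', 'i can', 'i can']

# 3. Selective fix: expand only listed contractions (hypothetical table).
EXPANSIONS = {"i'm": ["i", "am"]}
expanded = [t for w in space_tokens for t in EXPANSIONS.get(w, [w])]
print(i_bigrams(expanded))         # ['i am', "i can't", 'i can']
```

The selective approach keeps “I can’t” distinct from “I can”, at the cost of having to decide, contraction by contraction, which ones to expand.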
An additional difficulty is that many of these phrases really need a third word to work out what the singer is up to. I don’t what? I can what? I will what? Simply extending the analysis to trigrams would partially solve the problem, but would also introduce further complications. Later in this series of articles we will see some other techniques for addressing this sort of question.
A bigram represents a connection between two words, and therefore can be used as the basis of a network diagram (or ‘graph’). The graph below shows the bigrams appearing more than 150 times in the dataset (excluding those containing any stopwords). The words are shown as red dots, which are joined by blue lines if they form a common bigram. The thickness of the blue lines reflects the number of times each bigram appears. The lines curve to the right to indicate direction (“human race”, “lonely nights”, rather than “race human” or “nights lonely” which would involve a curve to the left).
Several words appear in more than one bigram (indicated by groups of three or more connected points). There is a large cluster of connected words on the left, which includes the most common words mentioned in the previous article – “love”, “yeah”, “baby”, “time”, “life”, etc. The top bigrams (indicated by the thickest lines) are “deep inside”, “uh huh”, “true love”, “broken heart” and “coming home”.
The effect of the no-stopwords policy is apparent in the bigram “york city” (top right): there should certainly be a “New” attached to “York”, but this has been removed as a stopword!
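The data behind a graph like this is a list of directed, weighted edges. The Python sketch below shows one way to build such a list under stated assumptions: the lyric lines and the stopword list are invented for illustration (“new” is included to mirror the “york city” effect), and the threshold is 1 rather than 150 because the toy data is so small. It covers only the edge list, not the drawing or layout.

```python
from collections import Counter

# Illustrative stopword list; "new" is included to reproduce the "york city" effect.
STOPWORDS = {"the", "in", "a", "of", "new"}

# Invented toy lyrics, NOT taken from the real dataset.
lyrics = [
    "true love in new york city",
    "true love in the city",
    "a broken heart in new york city",
]

threshold = 1  # keep bigrams appearing more than this many times

# Count (first word, second word) pairs, skipping any pair with a stopword.
counts = Counter()
for line in lyrics:
    words = line.split()
    for a, b in zip(words, words[1:]):
        if a not in STOPWORDS and b not in STOPWORDS:
            counts[(a, b)] += 1

# Each surviving bigram is a directed edge a -> b weighted by its count.
edges = {pair: c for pair, c in counts.items() if c > threshold}
nodes = sorted({word for pair in edges for word in pair})

print(edges)  # {('true', 'love'): 2, ('york', 'city'): 2}
print(nodes)  # ['city', 'love', 'true', 'york']
```

The edge weights correspond to line thickness in the chart, and the order of each pair gives the direction. “York” is left without its “new” because any bigram containing a stopword is discarded before counting.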
Network graphs can be a useful way to visualise the links between data – in this case 2-grams, but also in many other situations.[2] There are many ways of laying out the points and lines, or using size and colour to reflect different variables, and such visualisations can often reveal or suggest otherwise hidden relationships in the data. However, network graphs can be quite difficult to interpret: like wordclouds, they often work best as a high-level overview, with more detailed analysis being required before drawing firm conclusions.[3]
[1] See the previous article.
[2] For example, I have used network graphs to show the cities that composers have travelled to and from, with the strength of links reflecting the number of journeys; and to show composer networks, where the links between them reflect the number of years they spent in the same city at the same time.
[3] The graph above was produced with the ggraph package in R. The layout of the points uses the ‘Fruchterman-Reingold’ algorithm, which is one of several methods that assume that the links form an attractive force between the points.