Song Lyrics 3: Repetition and Compression

We all know that a good song depends on repetition – both of the tune and the lyrics. Too much repetition and it is just boring; too little, and it can lack structure. This article looks at different aspects of repetition in song lyrics.

In the first article in this series, we counted words, ignoring their relationship to each other. The second article started to consider these relationships by investigating groups of words. In this article we build on this, then move on to consider larger scale structure.

Let’s start with repeated words. Using the approach from the previous article, we can analyse the bigrams that consist of a repeated word. Which are the most common?
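As a rough illustration of the counting involved, here is a minimal Python sketch. It is not the code behind the chart: the tokenising rule (lower-case letters and straight apostrophes only) and the sample line are assumptions made up for this example.

```python
import re
from collections import Counter

def repeated_word_counts(lyrics: str) -> Counter:
    """Count bigrams whose two words are identical, e.g. ("yeah", "yeah")."""
    # Simplifying assumption: a word is a run of letters or straight apostrophes.
    words = re.findall(r"[a-z']+", lyrics.lower())
    # Pair each word with its successor and keep only the matching pairs.
    return Counter(a for a, b in zip(words, words[1:]) if a == b)

# Invented sample line, not taken from any real song.
sample = "Yeah, yeah, yeah! Oh, oh, la la la la, hey you"
counts = repeated_word_counts(sample)
print(counts.most_common(3))
```

Note that a run of three identical words counts as two repeated bigrams, so long runs such as “la la la la” weigh heavily.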

The following chart shows the top ten repeated words in songs from each decade. The frequencies, relative to the most repeated word in each decade, are indicated by the height of the text.

As we might expect, most of these are not proper words at all. “Hey, hey” was top in the 1950s but had fallen out of favour by the 1970s. “No, no” was common in the 1950s, but was overtaken by “yeah, yeah” in the 1960s. Since the 1970s, “la, la”, “oh, oh”, and “yeah, yeah” have consistently been the top three. Looking at the relative text sizes, in the 1950s all ten words seem to have been quite common. In the 1960s, “la, la” was noticeably more dominant than the rest, but by the 1970s a more even balance had returned. In the last decade, “oh, oh” appears to be some way ahead of all the rest.

Compression Ratios

What about repetition in the structure of the songs themselves? Many song lyrics include repeated phrases, refrains or choruses, or even whole sections. Is it possible to quantify this repetition?

One approach is to measure how much the lyrics can be compressed. That is, rather than writing out the full lyrics, including all repeats, we look for the shortest set of instructions that would allow us to reproduce them. Consider, for example, a song consisting of the word “love” repeated 100 times. The entire lyrics would require 499 characters (100 times “love” plus 99 spaces). We could, however, easily recreate the song from the instruction “love*100”, which is just eight characters long, a 98.4% reduction. The amount of compression is a measure of the degree of repetition – the more repetitive a text, the more it can be compressed.
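The arithmetic in this example can be checked with a few lines of Python. This is only a toy: the function names `encode_repeat` and `expand` are invented for the illustration, and the “word*count” notation is not a real compression format.

```python
def encode_repeat(word: str, count: int) -> str:
    """Write 'word repeated count times' as the instruction 'word*count'."""
    return f"{word}*{count}"

def expand(instruction: str) -> str:
    """Rebuild the full lyrics from a 'word*count' instruction."""
    word, count = instruction.rsplit("*", 1)
    return " ".join([word] * int(count))

full = " ".join(["love"] * 100)       # 100 words plus 99 spaces
short = encode_repeat("love", 100)    # "love*100"
assert expand(short) == full          # the instruction loses nothing

reduction = 1 - len(short) / len(full)
print(len(full), len(short), f"{reduction:.1%}")  # 499 8 98.4%
```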

There are complex and sophisticated algorithms for compressing different types of data.1 These use a variety of approaches, but are often based on generalisations of the example above. One common general-purpose method is “GZIP” compression, which is based on the same DEFLATE algorithm that is used in most computer “.zip” files. Because compression/decompression is so useful, computers are fast at it, and it is easy to compress a set of lyrics and calculate the percentage reduction in length.2
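For readers who want to try this, the calculation might look something like the following in Python. This is a sketch using Python's standard `gzip` module rather than the R function used for the analysis here, so the exact figures may differ slightly; the two sample texts are invented for the illustration.

```python
import gzip

def compression_ratio(text: str) -> float:
    """Compressed size as a proportion of the original size (both in bytes)."""
    raw = text.encode("utf-8")
    if not raw:
        raise ValueError("cannot compute a ratio for empty text")
    return len(gzip.compress(raw)) / len(raw)

def percentage_reduction(text: str) -> float:
    """Percentage by which GZIP shortens the text (negative if it grows)."""
    return 100 * (1 - compression_ratio(text))

# Invented samples: one highly repetitive, one ordinary sentence.
repetitive = " ".join(["love"] * 100)
varied = ("The morning train was late again, so Anna bought a coffee, found "
          "a quiet bench, and watched pigeons quarrel over crumbs beside "
          "the ticket office.")

for label, text in (("repetitive", repetitive), ("varied", varied)):
    print(f"{label}: ratio {compression_ratio(text):.2f}, "
          f"reduction {percentage_reduction(text):.0f}%")
```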

The chart below shows the compression ratio (i.e. the compressed size as a proportion of the original size) of all of the song lyrics (in red), and of samples of ordinary English prose (in blue).3

There are several important things to notice about this chart. The first is that compression ratios tend to improve for longer texts. This is clear by analogy from the fact that “love*100” is only 1.6% the length of its full version, whereas “love*50” is 2.8% (i.e. 7/249), and “love*3” is 43% (i.e. 6/14).4
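The same length effect shows up with real GZIP compression. The sketch below, again using Python's standard `gzip` module as a stand-in for whatever tool you prefer, compresses “love” repeated different numbers of times; the function name `gzip_ratio` is just a label chosen for this example.

```python
import gzip

def gzip_ratio(text: str) -> float:
    """GZIP-compressed size divided by original size, in bytes."""
    raw = text.encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)

for n in (1, 3, 50, 100):
    text = " ".join(["love"] * n)
    print(f"{n:>3} repeats: {len(text):>3} chars, ratio {gzip_ratio(text):.0%}")
```

The very shortest inputs come out larger than they went in, because every GZIP output carries a fixed header and trailer regardless of how little content there is.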

The second thing to notice is that the compressibility of ordinary prose of a given length is quite consistent, and falls within a narrow band. This reflects the normal frequencies of words in English. Compression of prose can only really operate on common words or combinations of letters, rather than take advantage of deliberate repetition. GZIP can compress 1,000 characters of normal English text down to about 50-60% of its original length, and the result rarely falls far outside this range.

Song lyrics, on the other hand, are almost always more compressible than prose, reflecting the fact that they have more deliberate repetition. 1,000 characters of song lyrics can typically be compressed to 25-50% of their original length, and even lower values are not uncommon. So, not only are song lyrics more repetitive than prose, but they are also much more variable, covering a wide range from very repetitive (10% compression ratio at 1,000 characters) through to prose-like (60%).

If you don’t like GZIP compression, you can get a similar result by counting the number of unique words in a text (i.e. each distinct word is counted once, however many times it appears), and dividing this by the total number of words. The result is shown below. In this case, we see that 1,000 characters of prose typically has 60-70% as many unique words as total words, whereas song lyrics are more likely to be in the range 20-60%.
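The unique-word measure is simple enough to compute by hand, but a short Python sketch makes the definition precise. The tokenising rule (lower-cased runs of letters and straight apostrophes) is an assumption of this example, and a different rule would give slightly different percentages.

```python
import re

def unique_word_ratio(text: str) -> float:
    """Number of distinct words divided by the total number of words."""
    # Simplifying assumption: case is ignored and punctuation is dropped.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    return len(set(words)) / len(words)

print(unique_word_ratio("love love love"))          # 1 distinct word out of 3
print(unique_word_ratio("The cat sat on the mat"))  # 5 distinct words out of 6
```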

I have investigated both of these measures of compressibility for different periods, and there does not appear to be any significant difference between the decades, either in the typical values, or in their spread. It seems that, from that point of view at least, songwriting has not changed much over the last 70 years.

Let’s look at a few examples of what these compressibility statistics mean in practice. To do this we can use a nice visualisation technique that seems to have been invented by Colin Morris. The idea is to line up the lyrics along the sides of a grid, and to colour the squares that have the same word horizontally and vertically. The result is a pattern showing the location of repeated words throughout the song.
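The grid itself is easy to build. Here is a minimal text-only sketch in Python, assuming the lyrics have already been split into words; the function names and the sample line are invented for the illustration, and a real chart would use a plotting library rather than printed characters.

```python
def repetition_grid(words):
    """grid[i][j] is True where word i and word j are the same."""
    return [[row_word == col_word for col_word in words] for row_word in words]

def render(grid):
    """Draw the grid as text: '#' for a matching pair, '.' otherwise."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in grid)

# Invented sample line, not taken from any real song.
words = "oh oh baby oh oh baby stay".split()
grid = repetition_grid(words)
print(render(grid))
```

The main diagonal is always filled, and a repeated phrase appears as a shorter line of marks parallel to it.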

A song with quite a low level of compressibility, and therefore quite ‘prose-like’, is ‘Viva Las Vegas’ by Elvis Presley. It has a GZIP compression ratio of 49%, or 59% as many unique words as total words. Here is the chart:

Each unit on the horizontal and vertical axes represents one word of the lyrics. A dot is coloured blue where the horizontal and vertical words are the same. There is always a central diagonal line, because each word is the same as itself. Reading from the bottom left, the song starts “Bright light city gonna set my soul, Gonna set my soul on fire…”. The four dots running parallel to the main diagonal on either side are the repetition of the words “gonna set my soul”. There are other small repetitions about 60 and 120 words in, and a final outro repeating the phrase “Viva Las Vegas” (ending with a “Viva, Viva Las Vegas”). Otherwise, there is little repetition other than scattered individual words (mainly common ‘stopwords’ such as “a”, “the” and “there”).

A more typical compression ratio of 34% is exhibited by ABBA’s ‘Money Money Money’…

There are several small repetitions (“must be funny”, “rich man’s world”, etc), as well as the distinctive larger squares of the repeated words of the title. This forms the basis of the chorus that appears a third of the way through, and is repeated twice towards the end.

Another typical example, also with a compression ratio of 34%, is Billy Joel’s ‘Uptown Girl’…

Here there is quite a lot of small-scale repetition, such as repeated phrases at the end of lines, many mentions of “uptown girl”, and a not-quite-exactly repeated second verse before the final refrain, which is itself repeated four times.

There is a more organised repetitive structure in The Beatles’ ‘Ticket to Ride’, which can be compressed down to 22% of its original length:

This is a verse+chorus form, with plenty of repetition within the choruses. A bridge section (starting about 100 words in) is repeated exactly after the following verse+chorus, with final repeats of “my baby don’t care” to close.

Compressibility is highest where a song consists largely of repetition. There are some extreme (but rather dull) examples of this, but a more interesting one is ‘The Twelve Days of Christmas’, which can be compressed to 13% of its full length…

Apart from the direct repetition, with one line added each time, the individual dots scattered on this pattern seem to all be the word “a”.

These charts are an excellent way of visualising the repetition within individual songs, but it is hard to see how they could be used to assess or compare large numbers of songs, for example in order to identify changing trends over time. The measures of compression are less meaningful for individual songs, but at least have the advantage that they can be directly compared across large datasets.

Cite this article as: Gustar, A.J. 'Song Lyrics 3: Repetition and Compression' in Statistics in Historical Musicology, 9th August 2019,

  1. Examples are “mp3” compression for audio files, “mp4” for video, and “jpeg” for images.
  2. In R there is a built-in function to do it, memCompress, which I have used here.
  3. The prose is randomly selected extracts from a corpus of English novels that I happened to have available.
  4. For very short texts, the GZIP algorithm can actually increase the size. This is because the ‘instructions’ part of a compressed file tends to have an effective minimum size, even if the ‘content’ part is quite short. By analogy, “love*1” is longer than its realisation.
