The Google Books Ngram Viewer is a powerful tool for analysing historical text data. It uses the enormous corpus of books scanned by Google to analyse the frequency of words and phrases over time. An n-gram is just a sequence of n words – so a single word is a 1-gram, a pair of words a 2-gram, etc. The Google viewer has data up to 5-grams.
This has potential uses in many fields – including musicology. Here we will use the Ngram Viewer to analyse the rise and fall of ragtime music.
Google’s help page describes the various corpuses of books, and the syntax of using the Ngram Viewer to perform different types of query. There is also a useful R package, ngramr, which I have used for this article.
Let’s start with a simple chart – looking at all mentions of the word “ragtime” in the corpus of all English language books. (Just click the link at the very top of this article, if you haven’t already.) The Wikipedia article on ragtime says that the first ragtime composition was published in 1895, and this is indeed when the curve on the graph (below) starts to rise above the zero line. It peaks in 1915 at just over 0.3 words per million, and then falls back sharply to 0.1 wpm in the early 1930s.1
The peak of 0.3 words per million in 1915 represents 3,195 occurrences out of a total of almost 10 billion words in the “eng_2019” corpus for that year. So, although not all books have been scanned by Google, this is a pretty good sample size.2
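As a rough check on these figures, here is a minimal Python sketch of the arithmetic. It assumes the corpus total for 1915 is exactly 10 billion words (the text only says "almost 10 billion") and treats the count as a Poisson variable, so that its standard deviation is the square root of the count. The function names are my own, purely for illustration.

```python
import math

occurrences = 3195               # "ragtime" in 1915, as quoted above
total_words = 10_000_000_000     # assumption: "almost 10 billion", rounded up


def words_per_million(count, total):
    """Relative frequency, expressed as occurrences per million words."""
    return count / total * 1_000_000


def relative_margin(count, z=1.96):
    """Relative half-width of an approximate 95% interval.

    Assumes the count behaves like a Poisson variable, so its
    standard deviation is sqrt(count).
    """
    return z * math.sqrt(count) / count


rate = words_per_million(occurrences, total_words)
sd = math.sqrt(occurrences)
margin = relative_margin(occurrences)

print(f"frequency: {rate:.2f} words per million")
print(f"standard deviation: {sd:.1f} occurrences")
print(f"95% margin: {margin:.1%} of the peak value")
```

Under these assumptions the margin comes out at roughly 3.5%, so the peak value is reasonably well determined despite the word being rare.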
Actually, this is not the whole picture. We have searched for “ragtime”, and although the “case-insensitive” option will automatically include “Ragtime”, “RAGTIME”, etc, it does not cover the variants “rag time” and “rag-time”. It turns out that these are significant…
So “rag-time” was actually the most common form in the early days. It was overtaken by “ragtime” in 1904, but remained quite popular until the early 1930s, after which it failed to enjoy the same resurgence as “ragtime”.
Ragtime developed partly from the “cake walk”, and was closely related to the “Dixieland” style of jazz. We can compare these terms (including hyphenated and separated versions) with the Ngram Viewer…
This confirms that the cake walk preceded ragtime, and peaked around 1901 (at 0.1 wpm), after which it was largely supplanted by ragtime. Dixieland grew more steadily (and indeed was already under way before 1895) and peaked at around the same time as ragtime, but did not fall off as steeply. It perhaps represents a bridge between the short-term fashion for ragtime and the development of other jazz styles in the first half of the twentieth century.
We can use the Ngram Viewer to compare different corpuses. For example, the following chart shows the frequency of “ragtime” and its variants among British and American English books, and corpuses in French, German, Italian and Spanish. The peak was highest, as we might expect, in the US, and it also kept going a little longer – peaking in 1916. In Britain, ragtime was slow to get off the ground, only really catching on after 1909, peaking in 1915, then falling back sharply.
In other languages, ragtime made less of an impact (the French and Italians were keenest, but only at about a tenth of the level of the Americans), and it didn’t make much impression until the early 1920s.3
Around 3% of the “eng_2019” corpus in the period 1915-1920 was fiction writing, included as “eng_fiction_2019”. As the following chart shows, ragtime and its variants peaked among fiction writers at almost double the overall level, but about five years later – by which time ragtime’s popularity was already in decline.
- whether ragtime is used as a noun (“ragtime_NOUN”) or an adjective (“ragtime_ADJ”) – it turns out that the vast majority of occurrences are classified as nouns;
- which adjectives are used to modify ragtime (“*_ADJ ragtime”) – the top ones being “American”, “popular”, “latest” and “new”;
- which nouns are modified by ragtime when it is used as an adjective (“*_NOUN=>ragtime”) – “ragtime music”, “ragtime song”, “ragtime tune”, “ragtime band”, etc.
It is also possible to add terms together, divide them, etc, to produce more sophisticated analyses within the Ngram Viewer itself.
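The query patterns above are just strings, so they are easy to generate in code. The following Python sketch builds them with a few small helpers. The helper names are hypothetical, and the spelling of the `+` and `/` operators and the use of parentheses are my reading of Google's help page, so check there before relying on them.

```python
def tagged(word, pos):
    """Restrict a word to one part of speech, e.g. ragtime_NOUN."""
    return f"{word}_{pos}"


def modifiers_of(word, pos="ADJ"):
    """Wildcard query for words of a given type preceding `word`."""
    return f"*_{pos} {word}"


def heads_modified_by(word, pos="NOUN"):
    """Dependency query for words of a given type that `word` modifies."""
    return f"*_{pos}=>{word}"


def total_of(terms):
    """Add several terms together into one series (assumed '+' syntax)."""
    return "(" + " + ".join(terms) + ")"


def ratio(numerator, denominator):
    """Divide one expression by another (assumed '/' syntax)."""
    return f"({numerator} / {denominator})"


variants = ["ragtime", "rag-time", "rag time"]
all_ragtime = total_of(variants)

print(tagged("ragtime", "NOUN"))
print(modifiers_of("ragtime"))
print(heads_modified_by("ragtime"))
print(all_ragtime)
print(ratio("rag-time", all_ragtime))
```

The last query would show the hyphenated form as a share of all three spellings, which is one way of tracking the shift from “rag-time” to “ragtime” in a single curve.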
More generally, this is an example of time series data, which I have not covered much on this website, but which forms a huge topic in itself. Watch this space!
- I have used smoothing of 2 years either side (i.e. a five-year centred moving average), in order to make the curves smoother.
- If the count is treated as a Poisson variable, its standard deviation is roughly equal to the square root of the number of occurrences – or about 56 in this case. So, with 95% confidence, we can expect the peak value to be within about 4% of the true value.
- There are also Chinese, Russian and Hebrew corpuses, although these will use different character sets, so searching for “ragtime” in them doesn’t make much sense.
- See this previous article on POS tagging.
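The smoothing described in the first note above can be sketched in a few lines of Python. This is a minimal illustration, not Google's own code: it assumes that windows at either end of the series are simply truncated, and the input values are made up for the example.

```python
def centred_moving_average(values, smoothing=2):
    """Average each value with up to `smoothing` neighbours on each side.

    smoothing=2 gives a five-year centred moving average. Near the ends
    of the series the window is truncated (an assumption), so fewer
    values are averaged there.
    """
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - smoothing)
        hi = min(n, i + smoothing + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


# Made-up yearly frequencies, purely to exercise the function.
raw = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
smoothed = centred_moving_average(raw, smoothing=2)
print(smoothed)
```

A larger smoothing value gives cleaner curves but blurs the timing of peaks, which matters when comparing peak years that are only a year or two apart.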