I have recently been trying to collect data from the Listening Experience Database (LED) in order to put together a proposal for a conference paper. The LED is a nicely constructed database using linked open data and a structure based on something called the ‘Semantic Web’. Unlike traditional databases, which have a hierarchical ‘tree’ structure, the Semantic Web is a true ‘network’, where anything can be linked to anything else. The LED, for example, includes links to data on a number of other databases. Have a look at the LED and follow a few links and you will see what this means – a very rich and flexible means of linking data together.
As an open system, the entire LED database can be downloaded for free as a compressed ‘RDF’ (Resource Description Framework) file, which consists of a long list of so-called ‘triples’ that define the links that form the dataset. Each triple consists of a subject (the item being referred to), a predicate (a property of the subject), and an object (the value of that property). So if the subject is ‘the Eroica Symphony’, the predicate might be ‘composed by’, and the object would be ‘Beethoven’. As you might expect, this is not quite as straightforward as it sounds. Many objects also appear on the list as subjects (linking to, in this example, Beethoven’s date of birth, other works, etc.). A subject–predicate pair might appear several times with different values (perhaps referring to different publications of the symphony, for example). Many of the triples simply describe what sort of data it is (date, place, name, etc.). And all of the entries in a triple (except for objects that are simply values) are expressed as URLs (web addresses) that refer to the LED or to external databases and their definition documents, known as ‘ontologies’.
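To make the triple structure concrete, here is a minimal sketch in Python representing triples as tuples. The identifiers are hypothetical, not real LED URLs, but they show how an object in one triple (the symphony, the composer) can reappear as the subject of others:

```python
# A sketch of RDF-style triples as (subject, predicate, object) tuples.
# The identifiers below are invented for illustration, not real LED URIs.
triples = [
    ("led:experience/123", "led:hasWork", "led:work/eroica"),
    ("led:work/eroica", "dc:title", "Eroica Symphony"),
    ("led:work/eroica", "led:composedBy", "led:person/beethoven"),
    ("led:person/beethoven", "foaf:name", "Ludwig van Beethoven"),
]

# An object can itself be a subject elsewhere: 'led:work/eroica' is both
# the object of the first triple and the subject of the next two.
subjects = {s for s, p, o in triples}
objects_that_are_subjects = {o for s, p, o in triples if o in subjects}
# objects_that_are_subjects == {"led:work/eroica", "led:person/beethoven"}
```

Terminal objects like ‘Ludwig van Beethoven’ are plain literal values; everything else in a real RDF dump would be a URL pointing back into the LED or out to an external ontology.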
This sort of structure is great for a database that is intended to be browsed and searched for specific information. It all works behind the scenes to make the LED quick and easy to use, with a wealth of interesting and relevant information. The sophisticated structure is irrelevant as far as the user is concerned. When it comes to harvesting large amounts of data in order to do some statistical analysis, however, things get a little tricky. For those of us looking to investigate the downloaded RDF ‘dump’ of the whole database, the structure is very relevant, as it needs to be converted into a form suitable for statistical analysis.
Statistical analysis is easiest if data is in a ‘tidy’ format – think of a spreadsheet where each row is one element of the dataset (a listening experience), and the columns represent different variables (names, dates, places, the text of the listening experience itself, etc). Semantic Web data does not readily fall into this tidy format. There is probably a better way to do it, but I managed to extract data in a tidy format by chaining the triples together (i.e. following up those objects that also appeared as subjects) until all the chains were complete (about six links in some cases), and then considering the chains of predicates to identify what the final object of each chain represents. A lot of these could be ignored. Others had to be combined (as the same property could be defined in several different ways, not all of which were valid for every entry). I now have a spreadsheet-like table with some useful data on over 8,800 listening experiences, ready for some analysis. Watch this space!
In fact, many datasets present this sort of problem. The transcription of Hofmeister, for example, can be downloaded in XML format, which does not translate readily into ‘rectangular’ data. With the continued development of data structures such as XML and the Semantic Web, I suspect this is going to be an increasing challenge for those of us wanting to get hold of datasets for statistical analysis. Website designers have long been moving away from simple data structures that can be easily scraped or downloaded, towards structures that enable a richer and more flexible user experience – where the ‘user’ is almost always considered to be not the statistical historian, but someone who wants to browse or search the database for particular pieces of information.
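For XML the flattening step is at least mechanical once you know the schema. The real Hofmeister transcription will be structured differently, but a hypothetical sketch with Python’s standard-library XML parser shows the shape of the conversion – one nested entry in, one rectangular row out:

```python
# Hypothetical XML standing in for the real Hofmeister schema, which differs.
import xml.etree.ElementTree as ET

xml_text = """
<catalogue>
  <entry>
    <composer>Beethoven</composer>
    <work><title>Symphony No. 3</title><year>1806</year></work>
  </entry>
  <entry>
    <composer>Chopin</composer>
    <work><title>Nocturnes Op. 9</title><year>1833</year></work>
  </entry>
</catalogue>
"""

root = ET.fromstring(xml_text)
# Flatten each nested <entry> into one dictionary (i.e. one spreadsheet row).
rows = [
    {
        "composer": e.findtext("composer"),
        "title": e.findtext("work/title"),
        "year": e.findtext("work/year"),
    }
    for e in root.findall("entry")
]
# rows[0] == {"composer": "Beethoven", "title": "Symphony No. 3", "year": "1806"}
```

The difficulty in practice is not the parsing but deciding, as with the RDF chains, which nested elements map onto which columns when entries vary in structure.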
I suppose this is fair enough, as there aren’t many of us who want to analyse historical musical data in bulk – although the numbers are increasing, and ‘big data’ is a hugely active area of research in other fields. Perhaps a measure of statistical music history coming of age will be the point at which database designers routinely make it easy to download extracts or samples of data in a tidy format. I won’t hold my breath!