
Category: Biodiversity Informatics

Biodiversity Informatics is the poor, country cousin of Bioinformatics. Where bioinformatics is concerned with the computational aspects of genes and their expression within organisms (usually in the lab), biodiversity informatics is concerned with how we handle data about the occurrence and identity of whole organisms out there in the wild or dead in collections of voucher specimens.

Taxonomy, Nomenclature and PESI – An explanation for mortals.

I just wrote 500 words explaining the relationship between Taxonomy, Nomenclature and PESI for use in the PESI portal. Here they are:

The process of creating a classification of life is split into two parts. Firstly experts decide which species exist. This process is called taxonomy. Secondly the experts work out what to call the species they recognise. This is called nomenclature.

The relationship between taxonomy and nomenclature is complex.

Words About Names – What I do for a living?

The Frog in the Pond

Sometimes two things cross your desk at the same time and they say more than either one of them would on their own.

Firstly I was looking for a list of British birds and happened across the British Ornithologists’ Union (BOU) list of bird names and how they have changed between 1923 and 2007. This is a most delightful list as it shows the English names are as stable as the scientific names – or both are equally unstable. If it hadn’t been for an attempt to standardise the use of the hyphen, the English names would have been much more stable in my opinion (though by no means totally static). Here is a quote:

Are author names really necessary?

Although there are standards for abbreviation of author names (notably Brummitt in botany) these are not always followed and are often embellished. Furthermore it is believed that the added nomenclatural precision author names add is not worth the cost of their inclusion. If author names were included then every variation of authority string would result in a new URI implying the existence of a new taxon. This would defeat the principal goal of – to get people using the same URIs for the same things. Homonyms are rare, and it is even rarer that they cause problems outside of taxonomy and nomenclature.
Consider the following classification of confidence limits from the Intergovernmental Panel on Climate Change (taken from here)
virtually certain – more than 99%
extremely likely – more than 95%
very likely – more than 90%
likely – more than 60%
more likely than not – more than 50%
unlikely – less than 33%
very unlikely – less than 10%
extremely unlikely – less than 5%
Now consider the estimate in Paton et al (2008) Taxon 57:602-611 that 4.1% of plant names have homonyms i.e. it is “extremely unlikely” that any one name is a homonym. Also consider the following list of kinds of homonyms:
Nomenclatural Artefacts: These occur where the same taxon is published multiple times. Perhaps the same publication comes out in two languages or is published a second time with a slightly different title and set of authors. For all intents and purposes these do not matter as the names are intended to refer to the same taxon.
Competitive Publication: New material is found. Two authors publish accounts based on it using the same names. The taxa are substantively the same.
Quickly Synonymised: An author publishes a new species only for someone to quickly realise that this is a homonym and publish the fact. Subsequent publications place it in synonymy and it is never widely used. The name in circulation will almost always refer to the correct taxon but the homonym will be kept in circulation due to always being mentioned as being a homonym in monographs, floras and faunas. Modern indexing will exacerbate this situation.
Back From The Dead: Everyone is happy using a junior (or later) homonym without knowing it when a taxonomist finds a publication containing the senior (earlier) homonym and overturns the nomenclatural apple cart. The rules of nomenclature say that the taxon now needs a new name even if the senior homonym is not currently the name of an accepted taxon. There is a case for nomenclatural conservation of the junior homonym or rejecting the senior homonym. Either way the original usage of the name is the most common.
Problematic Homonyms: The same name string is widely used for multiple taxon concepts. This is rarer in terms of nomenclatural homonyms (where different names have actually been published) than it is where authors have simply used the same name in different senses (taxon concepts and/or misapplied names). This is particularly common with European names being used for the “wrong” taxa in the New World. Author strings are of no help here as the nomenclature is correct and only the usage is incorrect. A full-blown taxon concept based approach is needed to handle these situations.
takes the premise that names specified to nomenclatural code, rank, spelling and, in the case of zoological names, year are “virtually certain” to be referring to the same general taxon.
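The step from the 4.1% estimate to the label “extremely unlikely” can be checked mechanically. The following is a minimal sketch that encodes the thresholds exactly as quoted in the list above (they have not been re-checked against the IPCC text); the function name likelihood_label is made up for illustration.

```python
# Thresholds exactly as quoted in the post's list; names here are illustrative.
UPPER_BANDS = [  # label applies when probability is MORE than the bound
    ("virtually certain", 0.99),
    ("extremely likely", 0.95),
    ("very likely", 0.90),
    ("likely", 0.60),
    ("more likely than not", 0.50),
]
LOWER_BANDS = [  # label applies when probability is LESS than the bound
    ("extremely unlikely", 0.05),
    ("very unlikely", 0.10),
    ("unlikely", 0.33),
]


def likelihood_label(probability):
    """Return the strongest label from the quoted scale, or None if none applies."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    for label, bound in UPPER_BANDS:
        if probability > bound:
            return label
    for label, bound in LOWER_BANDS:
        if probability < bound:
            return label
    return None  # the quoted scale has no label between 33% and 50%


homonym_rate = 0.041  # estimate from Paton et al (2008)
print(likelihood_label(homonym_rate))      # a given name being a homonym
print(likelihood_label(1 - homonym_rate))  # a given name not being a homonym
```

On these figures alone, a name not being a homonym only reaches “extremely likely”. The step up to “virtually certain” depends on also specifying code, rank, spelling and (for zoological names) year, which this sketch does not model.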

It is customary for scientists to cite the author of a scientific name whenever that name is used. Indeed it is considered grossly amateurish in some circles to omit such details. This causes problems because, although there are standards for abbreviation of author names (notably Brummitt in botany), these are not always followed and are often embellished. This means that the entire string of name characters is never guaranteed to be unique. To a machine every variation of authority string would result in a new combination of characters and imply the existence of a new taxon.
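To make the machine's point of view concrete, here is a small sketch that builds identifiers for one name cited with three variants of the same authority. The base URI and both function names are hypothetical; only the behaviour of string comparison is being illustrated.

```python
from urllib.parse import quote

BASE = "http://example.org/names/"  # hypothetical base URI, for illustration only


def uri_with_author(genus, epithet, author):
    """Identifier built from the full name string, authority included."""
    return BASE + quote(f"{genus} {epithet} {author}")


def uri_without_author(genus, epithet):
    """Identifier built from the name alone, ignoring the authority."""
    return BASE + quote(f"{genus} {epithet}")


# The same name cited with three renderings of the same author.
citations = [
    ("Bellis", "perennis", "L."),
    ("Bellis", "perennis", "Linn."),
    ("Bellis", "perennis", "Linnaeus"),
]

with_author = {uri_with_author(*c) for c in citations}
without_author = {uri_without_author(genus, epithet) for genus, epithet, _ in citations}

print(len(with_author), "distinct URIs when the authority is part of the string")
print(len(without_author), "distinct URI when it is left out")
```

Three citations of one name yield three identifiers in the first case and one in the second, which is the whole argument in miniature.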

What if we just stopped using author strings (other than in monographs) and ignore them when other people use them?

Biodiversity Informatics – A ‘sackable offence’

Tremendous energy is required to re-animate the dead.

At last month’s TDWG2009 conference I was on a panel for a brief discussion at the end of a session. There were around 200 people in the audience and a handful of us up front as lambs for the slaughter.

One of the questions from the floor concerned the automation of the taxonomic process. I don’t recall the precise question but it triggered one of my (probably boring) canned responses.

I pointed out that the usual practice in software engineering, when asked to automate a system, is to produce a Domain Model based on an analysis of some Use Cases that then leads on to some Object Model or implementation model that is actually created in software. The assumption behind this is that whatever was being done was good but needs to be done faster – with computers!

In biodiversity informatics, and particularly in biological taxonomy, this is not such a good idea. Current working practice was developed in the light of the prevailing technology of the time. If computers and the internet had been available from the start things would probably have been done differently. The worst thing we can do now is automate a paper based system.

Synonyms Are SubClasses And Higher Taxa Are Just Tags

I have been wrestling for some time with how to handle taxonomic hierarchies when combining multiple classifications. This is partly motivated by a pressure to produce consensus hierarchies for navigation (a task that I think is probably not worth doing but which is beyond the scope of this post) and partly from a need to carry out inference over multiple classifications using OWL (something that I think is an important research topic if we are to overcome the ‘taxonomic impediment’).

Take the simplest scenario where we have classification C1 that contains family Z with two genera X and Y that contain a total of three species Xa, Xb and Yc. Now let there be another classification C2 that is identical but for the species Xb being moved to the genus Y as Yb.
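The scenario can be written down as data. This is only a sketch: the nested-dictionary layout and the function names are assumptions, and it relies on the epithet staying the same when a species changes genus, as it does in this example (Xb becomes Yb).

```python
# Classification C1: family Z, genera X and Y, species Xa, Xb and Yc.
C1 = {"Z": {"X": ["Xa", "Xb"], "Y": ["Yc"]}}
# Classification C2: identical except that Xb has moved to genus Y as Yb.
C2 = {"Z": {"X": ["Xa"], "Y": ["Yb", "Yc"]}}


def placements(classification):
    """Map each species epithet to the (family, genus) it is placed in."""
    result = {}
    for family, genera in classification.items():
        for genus, species_names in genera.items():
            for name in species_names:
                epithet = name[len(genus):]  # "Xb" -> "b"
                result[epithet] = (family, genus)
    return result


def moved_species(first, second):
    """Return {epithet: (genus in first, genus in second)} for species placed differently."""
    p1, p2 = placements(first), placements(second)
    return {
        epithet: (p1[epithet][1], p2[epithet][1])
        for epithet in p1.keys() & p2.keys()
        if p1[epithet] != p2[epithet]
    }


print(moved_species(C1, C2))  # {'b': ('X', 'Y')}
```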

Episode 987: Cabinets – A Taxonomic Soap Opera.

In this episode of our longest running soap opera Terry & Tina confuse Eric who takes off with Malcolm.

“Cabinets” is a public service broadcast with the aim of promoting community understanding of complex taxonomic issues.

— Cue opening credits —

The story so far: Terry is a taxonomist and he works very hard to produce a classification of the family Z. It includes two genera, X and Y, and three species A, B and C. Here is a picture of his classification.

Managing The Managing Of The TDWG Ontology

Several years ago I was involved in developing the “TDWG Ontology”. Quite what the TDWG Ontology was/is remains an enigma for many. Around 2005/6 we tried to move away from modeling things in XML Schema and into some form of frame based modeling with well defined classes and properties – as opposed to the document structures implied by XML Schema. With the help of Jessie Kennedy’s team at Napier and people around the world we started building an OWL ontology of the whole domain – then ran out of money.

We still needed basic terms for use in LSID RDF metadata. This led to the development of the LSID Vocabularies. These were very lightweight “ontologies” but were still an attempt at defining terms using OWL.

In all our efforts there was a problem. There was no continuity of resourcing. For two years no one has been paid to manage the TDWG Ontology even though there is an increasing need for the disparate biodiversity informatics projects to have a formal mechanism for defining shared terms. Because the resource is seen as common no one feels responsible for committing resources to manage it.

In the last few days I have been doing some work with Kehan Harman on establishing a technical fix for this.

Nomenclature is Dead! Long Live Barcode Taxa!

Over the past few months I have been working on how to represent biological taxonomy and nomenclature using Description Logics. Here I combine these thoughts with a rather naive view of DNA Barcoding to suggest a new approach to taxonomy.

Description Logic (DL) is an extension of frame based languages (such as those used in object-oriented programming paradigms) and semantic networks (e.g. WordNet) to link them to first-order predicate logic, thus enabling the representation of application domains in formal, well understood ways that can be reasoned over by machines. DL has come to the fore in recent years with the advent of the Web Ontology Language (OWL) from the World Wide Web Consortium (W3C), two subsets of which, OWL-DL and OWL-Lite, are based on DL. Notably these two sub-languages guarantee decidability within finite time. From now on I’ll use the terminology of OWL-DL and OWL-Lite rather than generic DL terms. The OWL terms are more likely to be understood by a general reader who can read the OWL documentation as background. A concept in DL is referred to as a class in OWL. A role in DL is a property in OWL.

There are three principal features within OWL:

  • Classes are groups of individuals that belong together typically because they share some properties or property values.
  • Individuals are instances of classes.
  • Properties are statements of relationships between individuals or from individuals to data values.

There are other features within the language that allow the expression of things such as equivalence, cardinality and the domains and ranges of properties. Using OWL principally involves asserting specialization hierarchies of classes and inferring unknown subclass relationships and class membership using an inference engine such as FaCT++. A set of OWL assertions is frequently referred to as an ontology.
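As a very rough illustration of what “inferring subclass relationships and class membership” means, the sketch below follows asserted subclass links transitively. It is a toy in plain Python, not OWL and not a DL reasoner: it handles none of equivalence, cardinality, domains or ranges, and the class and individual names are invented for the example.

```python
# Asserted facts (invented for illustration): class -> its direct superclasses.
SUBCLASS_OF = {
    "FlyAgaric": {"Amanita"},
    "Amanita": {"Fungus"},
}
# Asserted class membership: individual -> the classes it was declared in.
INSTANCE_OF = {
    "specimen_1": {"FlyAgaric"},
}


def superclasses(cls, subclass_of):
    """All classes reachable from cls by following subclass links (transitive)."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def inferred_classes(individual, instance_of, subclass_of):
    """Asserted classes of an individual plus every superclass of those classes."""
    asserted = set(instance_of.get(individual, ()))
    inferred = set(asserted)
    for cls in asserted:
        inferred |= superclasses(cls, subclass_of)
    return inferred


print(sorted(superclasses("FlyAgaric", SUBCLASS_OF)))
print(sorted(inferred_classes("specimen_1", INSTANCE_OF, SUBCLASS_OF)))
```

Nobody asserted that specimen_1 is a Fungus; that membership falls out of the two subclass statements. A real inference engine does the same kind of thing over a much richer set of axioms.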

My part in GBIF’s Role in Persistent Resolvable Identifiers

Last week I took part in a meeting at GBIF in Copenhagen to discuss the role GBIF could play in Persistent Resolvable Identifiers (the technology formerly known as GUIDs and often confused with UUIDs. Perhaps they should be called PRIs – pronounced ‘prize’ – just kidding.) This is the culmination of the LGTG (a.k.a. the Less Than Greater Than group). Thanks are due to Éamonn O Tauma and the team at the GBIF Secretariat for being wonderful hosts and to my fellow participants for being such good company.

This was a two and a half day meeting that involved a group of us working on a document full of recommendations (to be published in the next month or so). As part of my contribution I came up with a slightly more detailed plan for how GBIF would interact with data suppliers and consumers. For a brief time this formed part of the final document but was then cut because it was too detailed. It may still make it back into the appendix but may also drop out completely so I thought I would present it here for posterity.

These are more or less just a series of notes and diagrams but they should be understandable to anyone involved in the field. I use the term GUID as this was before we changed to calling them persistent identifiers.

Note that what I present here is what I presented to the group and does not necessarily reflect the views of the group which will officially be published later.

Calling Time on Biological Nomenclature

Gathering Storm

I was writing a report on the role of nomenclators in PESI when I realized that (with a little tweaking and injection of dangerous opinions) one section would make a good blog post.

In order to facilitate the accurate exchange of taxonomic information, both within the taxonomic community and more widely in the biological and environmental sciences, the e-infrastructure needs to provide two dictionary functions for scientific names of organisms, i.e.

  1. A recognized list of the names used. To establish that any two studies are actually using the same names whilst accounting for spelling variants and homonyms as well as to facilitate consistency in spelling and presentation.
  2. A mapping between the names and descriptions of the taxa they are used for. To establish that any two studies are using the names in the same sense or compatible senses.

If the ICBN and ICZN codes required all names to be registered in a single or limited number of places then this would effectively fulfil the first function. Unfortunately neither the ICBN nor the ICZN code requires names to be registered. Neither do they require names to be published in a particular list of journals. They merely set out the conditions for effective publication. The publications in which new names appear could be published anywhere and deposited in any library. There is no requirement for them to be peer reviewed.

GUID Persistence as Zen kōan


Most people are familiar with a few Zen kōans – the ‘nonsense’ sayings of the great Zen masters that are designed to make us think or rather not think. Their aim is to point more directly to what can’t be said in words. Examples include: “What is the sound of one hand clapping?” and “Does a dog have Buddha nature?”. Sitting silently and bearing a kōan in mind can be a powerful means of expanding our understanding. A kōan that would be useful for those of us involved in the discussions on Globally Unique Identifiers (GUIDs)  at the moment is: What is it that persists when a GUID is persistent? I have been dwelling on this for a while now and I’d like to share some of my thoughts.

A Position on LSIDs


I recently took part in a very long discussion on LSIDs on the TDWG-TAG mailing list. This seems to have been a perpetual discussion over the past four years. On reflection I realised that over two posts I had produced a kind of personal position paper on LSIDs and that it would be worth capturing the text in a blog post so it didn’t disappear into the mailing list archives. People often ask about LSIDs and it would be useful to have somewhere to point them to. Note that this text is off a technical discussion list and not newbie friendly. It assumes you know about LSIDs as a technology.

One issue that repeatedly comes up with LSIDs is that they may be more permanent than URIs. They offer a sociological advantage in that they are separate from ephemeral HTTP URLs that are used for everything on the web. The act of minting an LSID indicates that you intend to try to make it permanent or at least never re-use it for another resource.

The barrier to everyone hosting LSIDs is that they don’t all have access to DNS servers and can’t host the relevant SRV records. There are other barriers to do with binding LSIDs to particular institutional domains that may change. A solution to this may be to have a central service that hosts DNS records, and it is implied that this would help with persistence. But just hosting SRV records or supplying a redirect service does not actually provide any persistence at all to the data/metadata. Persistence of a GUID that resolves to a 500 error rather than a ‘not found’ is not helpful.
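For readers who want to see where DNS comes into it, here is a minimal sketch that splits an LSID into its parts and derives the SRV record name a resolver would look up. It assumes the usual urn:lsid:authority:namespace:object[:revision] layout and the _lsid._tcp SRV convention as I understand them; the example identifier and the function names are hypothetical, and no DNS lookup is performed.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID of the form urn:lsid:authority:namespace:object[:revision]."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if not all(parts[2:5]):
        raise ValueError(f"empty component in LSID: {text!r}")
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return Lsid(parts[2], parts[3], parts[4], revision)


def srv_record_name(lsid):
    """DNS name whose SRV record points at the authority's resolution service."""
    return f"_lsid._tcp.{lsid.authority}"


example = parse_lsid("urn:lsid:example.org:names:12345")  # hypothetical identifier
print(example)
print(srv_record_name(example))
```

The authority part is a domain name, which is why whoever mints the LSID needs control over DNS for that domain, and why a change of institutional domain is a problem.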

Identifiers, Identity and Me


The nice thing about blogging is that you get to mix-n-match your thoughts together in a way that you couldn’t do in the constituent parts of your life. This post brings together the notion of Globally Unique Identifiers (GUIDs) from my world of work and Buddhist notions of identity. It isn’t really acceptable to talk Buddhist spirituality in biodiversity informatics meetings and bringing up techie stuff when talking to Buddhist friends doesn’t help communication much either, but here I can bravely attempt to mash the two together and, I hope, shed light on both.

Buddhism is widely and erroneously believed to propose the notion of anatman meaning ‘no soul’. Atman figures big in Hinduism and in Abrahamic faiths as ‘soul’. Buddhism has a different spin on the soul and this is where the error often comes in. Generally different-from-having-something is considered to be not having it. Therefore it is concluded that there are no souls in Buddhism – but this is confused thinking.

“Do you have a soul?” is a loaded question. It assumes firstly that the world can be split into things, secondly that these things can have possessive type relationships and thirdly there are two things ‘you’ and ‘soul’ that may have this relationship. If you have a problem with any of these assumptions it is difficult to say anything in response to the question. Any notion of a self or even a thing is totally contingent on everything else in space time. Buddhism finds it difficult to locate ‘you’ and ‘soul’ and so impossible to express an opinion on their relationship.

This is exactly where we arrive at biodiversity informatics and the problems we have with GUIDs.

SpeciesIndex: A waste of midnight oil?

Back last year at TDWG2008 in Fremantle there was a Wild Ideas session where people could propose crazy things that might not be serious or urgent. I gave a presentation called SpeciesIndex?: A practical alternative to fantasy mashups. This was meant to be a bit of fun but actually went down quite well, with a few people coming up to me afterward who were interested in it. A wiki page called SpeciesPages was created to flesh out the ideas.

The idea presented in the paper to the conference and on the wiki is that each publisher of species pages (i.e. anyone with a web site that has a page per species approach to taxonomy) should produce a SiteMap file that contains a list of just those pages and submit the location of the SiteMap to a register so that the pages can be indexed and other services built around them.
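A sitemap of this kind is a small XML file. The sketch below builds one for a list of species pages using the standard sitemaps.org namespace; the page URLs and the function name build_sitemap are placeholders, and submitting the sitemap's location to a register is not shown.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(page_urls):
    """Return sitemap XML (as text) listing one <url><loc> entry per page."""
    ET.register_namespace("", SITEMAP_NS)  # serialise without a namespace prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page in page_urls:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs: one page per species on a hypothetical site.
species_pages = [
    "http://example.org/species/bellis-perennis",
    "http://example.org/species/amanita-muscaria",
]

sitemap_xml = build_sitemap(species_pages)
print(sitemap_xml)
```

The point of listing only the species pages is that an indexer can treat every URL in the file as a page about one species, without having to crawl and guess.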

Over the intervening months I got to thinking about the idea some more and playing around in the evenings with some code.