I put this talk together for a meeting just in case I needed to elaborate on a point in one of my reports. I never used it but…
Category: Biodiversity Informatics
Biodiversity Informatics is the poor country cousin of Bioinformatics. Where bioinformatics is concerned with the computational aspects of genes and their expression within organisms (usually in the lab), biodiversity informatics is concerned with how we handle data about the occurrence and identity of whole organisms out there in the wild or dead in collections of voucher specimens.
I just wrote 500 words explaining the relationship between Taxonomy, Nomenclature and PESI for use in the PESI portal. Here they are:
The process of creating a classification of life is split into two parts. Firstly experts decide which species exist. This process is called taxonomy. Secondly the experts work out what to call the species they recognise. This is called nomenclature.
The relationship between taxonomy and nomenclature is complex.
Sometimes two things cross your desk at the same time and they say more than either one of them would on their own.
Firstly I was looking for a list of British birds and happened across the British Ornithologists’ Union (BOU) list of bird names and how they have changed between 1923 and 2007. This is a most delightful list as it shows the English names are as stable as the scientific names – or both are equally unstable. If it hadn’t been for an attempt to standardise the use of the hyphen, the English names would have been much more stable in my opinion (though by no means totally static). Here is a quote:
I finally submitted deliverable D4.3 for the PESI project and in the great tradition of putting my outputs on my blog here is a PDF…
You can know the name of a bird in all the languages of the world, but when you’re finished, you’ll know absolutely nothing whatever about…
It is customary for scientists to cite the author of a scientific name whenever that name is used. Indeed it is considered grossly amateurish in some circles to omit such details. This causes problems because, although there are standards for abbreviation of author names (notably Brummitt in botany), these are not always followed and are often embellished. This means that the entire string of name characters is never guaranteed to be unique. To a machine every variation of authority string would result in a new combination of characters and imply the existence of a new taxon.
What if we just stopped using author strings (other than in monographs) and ignore them when other people use them?
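To make the machine's view concrete, here is a minimal sketch, assuming simple binomials where the first two tokens are the genus and the epithet and anything after them is the author string. The helper `strip_authority` and the example usages are illustrative assumptions, not part of any existing tool.

```python
def strip_authority(name_string: str) -> str:
    """Return only the first two whitespace-separated tokens.

    Assumption: the input is a plain binomial optionally followed by an
    author string. Infraspecific ranks, hybrids and similar cases are
    not handled by this sketch.
    """
    return " ".join(name_string.split()[:2])


# Four usages of one name, differing only in how the author is cited.
usages = [
    "Bellis perennis L.",
    "Bellis perennis Linn.",
    "Bellis perennis Linnaeus",
    "Bellis perennis",
]

# Compared as raw strings, each variant looks like a different name.
distinct_strings = set(usages)

# Compared with the author string ignored, they collapse to one.
distinct_names = {strip_authority(usage) for usage in usages}

print(len(distinct_strings))  # 4
print(distinct_names)         # {'Bellis perennis'}
```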
At last month’s TDWG2009 conference I was on a panel for a brief discussion at the end of a session. There were around 200 people in the audience and a handful of us up front as lambs for the slaughter.
One of the questions from the floor concerned the automation of the taxonomic process. I don’t recall the precise question but it triggered one of my (probably boring) canned responses.
I pointed out that the usual practice in software engineering, when asked to automate a system, is to produce a Domain Model based on an analysis of some Use Cases that then leads on to some Object Model or implementation model that is actually created in software. The assumption behind this is that whatever was being done was good but needs to be done faster – with computers!
In biodiversity informatics, and particularly in biological taxonomy, this is not such a good idea. Current working practice was developed in the light of the prevailing technology of the time. If computers and the internet had been available from the start things would probably have been done differently. The worst thing we can do now is automate a paper based system.
Keeping up with the nearly-year-old tradition of putting all outputs on my blog here are the latest two reports I have submitted as part of…
I have been wrestling for some time with how to handle taxonomic hierarchies when combining multiple classifications. This is partly motivated by a pressure to produce consensus hierarchies for navigation (a task that I think is probably not worth doing but which is beyond the scope of this post) and partly from a need to carry out inference over multiple classifications using OWL (something that I think is an important research topic if we are to overcome the ‘taxonomic impediment’).
Take the simplest scenario where we have classification C1 that contains family Z with two genera X and Y that contain a total of three species Xa, Xb and Yc. Now let there be another classification C2 that is identical but for the species Xb being moved to the genus Y as Yb.
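The scenario can be written down as data. This is only a sketch: it assumes each species can be tracked by its epithet (a, b, c) across both classifications, which is exactly the knowledge that is hard to come by in practice. The dictionary layout and the function `moved_species` are hypothetical.

```python
# Each classification maps a species epithet to its (family, genus) placement.
# Assumption: the epithet identifies the same taxon in both classifications.
C1 = {"a": ("Z", "X"), "b": ("Z", "X"), "c": ("Z", "Y")}
C2 = {"a": ("Z", "X"), "b": ("Z", "Y"), "c": ("Z", "Y")}


def moved_species(first, second):
    """Map each epithet placed differently to its (old name, new name)."""
    moves = {}
    for epithet, placement in first.items():
        other = second.get(epithet)
        if other is not None and other != placement:
            # A name here is simply genus + epithet, e.g. "X" + "b" -> "Xb".
            moves[epithet] = (placement[1] + epithet, other[1] + epithet)
    return moves


print(moved_species(C1, C2))  # {'b': ('Xb', 'Yb')}
```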
“Cabinets” is a public service broadcast with the aim of promoting community understanding of complex taxonomic issues.
— Cue opening credits —
The story so far: Terry is a taxonomist and he works very hard to produce a classification of the family Z. It includes two genera, X and Y, and three species A, B and C. Here is a picture of his classification.
Several years ago I was involved in developing the “TDWG Ontology”. Quite what the TDWG Ontology was/is remains an enigma for many. Around 2005/6 we tried to move away from modeling things in XML Schema and into some form of frame based modeling with well defined classes and properties – as opposed to the document structures implied by XML Schema. With the help of Jessie Kennedy’s team at Napier and people around the world we started building an OWL ontology of the whole domain – then ran out of money.
We still needed basic terms for use in LSID RDF metadata. This led to the development of the LSID Vocabularies. These were very lightweight “ontologies” but were still an attempt at defining terms using OWL.
In all our efforts there was a problem. There was no continuity of resourcing. For two years no one has been paid to manage the TDWG Ontology even though there is an increasing need for the disparate biodiversity informatics projects to have a formal mechanism for defining shared terms. Because the resource is seen as common no one feels responsible for committing resources to manage it.
In the last few days I have been doing some work with Kehan Harman on establishing a technical fix for this.
Over the past few months I have been working on how to represent biological taxonomy and nomenclature using Description Logics. Here I combine these thoughts with a rather naive view of DNA Barcoding to suggest a new approach to taxonomy.
Description Logic (DL) is an extension of frame based languages (such as those used in object orientated programming paradigms) and semantic networks (e.g. WordNet) that links them to first-order predicate logic, thus enabling the representation of application domains in formal, well understood ways that can be reasoned over by machines. DL has come to the fore in recent years with the advent of the Web Ontology Language (OWL) from the World Wide Web Consortium (W3C), two subsets of which, OWL-DL and OWL-Lite, are based on DL. Notably these two sub-languages guarantee decidability within finite time. From now on I’ll use the terminology of OWL-DL and OWL-Lite rather than generic DL terms. The OWL terms are more likely to be understood by a general reader who can read the OWL documentation as background. A concept in DL is referred to as a class in OWL. A role in DL is a property in OWL.
There are three principal features within OWL:
- Classes are groups of individuals that belong together typically because they share some properties or property values.
- Individuals are instances of classes.
- Properties are statements of relationships between individuals or from individuals to data values.
There are other features within the language that allow the expression of things such as equivalence, cardinality and the domains and ranges of properties. Using OWL principally involves asserting specialization hierarchies of classes and inferring unknown subclass relationships and class membership using an inference engine such as Fact++. A set of OWL assertions is frequently referred to as an ontology.
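As a toy illustration of asserting a class hierarchy and then inferring class membership, here is a sketch in plain Python. It is emphatically not a DL reasoner and does nothing like what an engine such as Fact++ does; it only follows asserted subclass links transitively. The class names (`SpeciesXa`, `GenusX`, `FamilyZ`) and the individual `specimen1` are hypothetical.

```python
# Asserted specialization hierarchy: class -> set of direct superclasses.
asserted_subclass_of = {
    "SpeciesXa": {"GenusX"},
    "GenusX": {"FamilyZ"},
}

# Asserted class membership: individual -> the class it was asserted into.
asserted_type = {"specimen1": "SpeciesXa"}


def superclasses(cls, subclass_of):
    """Collect every class reachable by following subclass links upwards."""
    found = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def inferred_types(individual):
    """The asserted class plus every superclass it implies."""
    direct = asserted_type[individual]
    return {direct} | superclasses(direct, asserted_subclass_of)


# Nobody asserted that specimen1 belongs to FamilyZ; it follows from the hierarchy.
print(sorted(inferred_types("specimen1")))  # ['FamilyZ', 'GenusX', 'SpeciesXa']
```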
Last week I took part in a meeting at GBIF in Copenhagen to discuss the role GBIF could play in Persistent Resolvable Identifiers (the technology formerly known as GUIDs and often confused with UUIDs. Perhaps they should be called PRIs – pronounced ‘prize’ – just kidding). This is the culmination of the LGTG (a.k.a. the Less Than Greater Than group). Thanks are due to Éamonn O Tauma and the team at the GBIF Secretariat for being wonderful hosts and to my fellow participants for being such good company.
This was a two and a half day meeting that involved a group of us working on a document full of recommendations (to be published in the next month or so). As part of my contribution I came up with a slightly more detailed plan for how GBIF would interact with data suppliers and consumers. For a brief time this formed part of the final document but was then cut because it was too detailed. It may still make it back into the appendix but may also drop out completely, so I thought I would present it here for posterity.
These are more or less just a series of notes and diagrams but they should be understandable to anyone involved in the field. I use the term GUID as this was before we changed to calling them persistent identifiers.
Note that what I present here is what I presented to the group and does not necessarily reflect the views of the group which will officially be published later.
Here is the first draft of a book chapter I have written for an upcoming Systematics Association volume. My intention with this work is to…
I was writing a report on the role of nomenclators in PESI when I realized that (with a little tweaking and injection of dangerous opinions) one section would make a good blog post.
In order to facilitate the accurate exchange of taxonomic information, both within the taxonomic community and more widely in the biological and environmental sciences, the e-infrastructure needs to provide two dictionary functions for scientific names of organisms i.e.
- A recognized list of the names used. To establish that any two studies are actually using the same names whilst accounting for spelling variants and homonyms as well as to facilitate consistency in spelling and presentation.
- A mapping between the names and descriptions of the taxa they are used for. To establish that any two studies are using the names in the same sense or compatible senses.
If the ICBN and ICZN codes required all names to be registered in a single or limited number of places then this would effectively fulfil the first function. Unfortunately neither the ICBN nor the ICZN code requires names to be registered. Neither do they require names to be published in a particular list of journals. They merely set out the conditions for effective publication. The publications in which new names appear could be published anywhere and deposited in any library. There is no requirement for them to be peer reviewed.
I just gave a talk at e-Biosphere ’09. This was a 15 minute talk to possibly the largest audience I have ever addressed (400ish).…
Until December 2010 I am employed as the project officer for Work Package 4 of the PESI project. My particular responsibility is the adoption of…
This talk has been put together for the LifeWatch WP5 workshop “Semantic Data Integration” taking place in…
Most people are familiar with a few Zen kōans – the ‘nonsense’ sayings of the great Zen masters that are designed to make us think or rather not think. Their aim is to point more directly to what can’t be said in words. Examples include: “What is the sound of one hand clapping?” and “Does a dog have Buddha nature?”. Sitting silently and bearing a kōan in mind can be a powerful means of expanding our understanding. A kōan that would be useful for those of us involved in the discussions on Globally Unique Identifiers (GUIDs) at the moment is: What is it that persists when a GUID is persistent? I have been dwelling on this for a while now and I’d like to share some of my thoughts.
I recently took part in a very long discussion on LSIDs on the TDWG-TAG mailing list. This seems to have been a perpetual discussion over the past four years. On reflection I realised that over two posts I had produced a kind of personal position paper on LSIDs and that it would be worth capturing the text in a blog post so it didn’t disappear into the mailing list archives. People often ask about LSIDs and it would be useful to have somewhere to point them to. Note that this text is from a technical discussion list and is not newbie friendly. It assumes you know about LSIDs as a technology.
One issue that repeatedly comes up with LSIDs is that they may be more permanent than URIs. They offer a sociological advantage in that they are separate from ephemeral HTTP URLs that are used for everything on the web. The act of minting an LSID indicates that you intend to try to make it permanent or at least never re-use it for another resource.
The barrier to everyone hosting LSIDs is that they don’t all have access to DNS servers and can’t host the relevant SRV records. There are other barriers to do with binding LSIDs to particular institutional domains that may change. A solution to this may be to have a central service that hosts DNS records, and it is implied that this would help with persistence. But just hosting SRV records or supplying a redirect service does not actually provide any persistence at all to the data/metadata. A GUID that persistently resolves to a 500 error rather than a not found is not helpful.
The nice thing about blogging is that you get to mix-n-match your thoughts together in a way that you couldn’t do in the constituent parts of your life. This post brings together the notion of Globally Unique Identifiers (GUIDs) from my world of work and Buddhist notions of identity. It isn’t really acceptable to talk Buddhist spirituality in biodiversity informatics meetings and bringing up techie stuff when talking to Buddhist friends doesn’t help communication much either, but here I can bravely attempt to mash the two together and I hope shed light on both.
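For readers less familiar with where the SRV records come in, here is a minimal sketch of the first step of LSID resolution: splitting the identifier and working out which DNS name a client would query. It assumes the `urn:lsid:authority:namespace:object[:revision]` layout and the `_lsid._tcp.` SRV prefix from the LSID specification. The example LSID and the function names are hypothetical, and no DNS lookup is actually performed.

```python
def parse_lsid(lsid):
    """Split an LSID into its parts, raising ValueError if it is malformed."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6):
        raise ValueError("expected urn:lsid:authority:namespace:object[:revision]")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID: " + lsid)
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object": parts[4],
        "revision": parts[5] if len(parts) == 6 else None,
    }


def srv_query_name(lsid):
    """DNS name whose SRV record points at the authority's resolution service."""
    return "_lsid._tcp." + parse_lsid(lsid)["authority"]


# Hypothetical identifier, used for illustration only.
example = "urn:lsid:example.org:names:12345"
print(srv_query_name(example))  # _lsid._tcp.example.org
```

Note that everything here hangs off the authority's domain name: if the institution loses or changes that domain, the SRV lookup has nowhere to go, and a record that does resolve says nothing about whether the data behind it still exists.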
Buddhism is widely and erroneously believed to propose the notion of anatman meaning ‘no soul’. Atman figures big in Hinduism and in Abrahamic faiths as ‘soul’. Buddhism has a different spin on the soul and this is where the error often comes in. Generally different-from-having-something is considered to be not having it. Therefore it is concluded that there are no souls in Buddhism – but this is confused thinking.
“Do you have a soul?” is a loaded question. It assumes firstly that the world can be split into things, secondly that these things can have possessive type relationships and thirdly there are two things ‘you’ and ‘soul’ that may have this relationship. If you have a problem with any of these assumptions it is difficult to say anything in response to the question. Any notion of a self or even a thing is totally contingent on everything else in space time. Buddhism finds it difficult to locate ‘you’ and ‘soul’ and so impossible to express an opinion on their relationship.
This is exactly where we arrive at biodiversity informatics and the problems we have with GUIDs.
Back last year at TDWG2008 in Fremantle there was a Wild Ideas session where people could propose crazy things that might not be serious or urgent. I gave a presentation called SpeciesIndex: A practical alternative to fantasy mashups. This was meant to be a bit of fun but actually went down quite well, with a few people coming up to me afterward who were interested in it. A wiki page called SpeciesPages was created to flesh out the ideas.
The ideas presented in the paper to the conference and on the wiki are that each publisher of species pages (i.e. anyone with a web site that has a page per species approach to taxonomy) should produce a SiteMap file that contains a list of just those pages and submit the location of the SiteMap to a register so that the pages could be indexed and other services built around them.
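A minimal sketch of what producing such a file might look like, assuming the standard Sitemaps XML format (a `urlset` of `url`/`loc` elements). The page URLs and the function `build_sitemap` are made up for illustration, and the step of submitting the file's location to a register is not shown.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemaps protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(page_urls):
    """Return a sitemap document listing only the given species pages."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in page_urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page_url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical species pages on a publisher's site.
species_pages = [
    "https://example.org/species/bellis-perennis",
    "https://example.org/species/quercus-robur",
]

sitemap_xml = build_sitemap(species_pages)
print(sitemap_xml)
```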
Over the intervening months I got to thinking about the idea some more and playing around in the evenings with some code.
I have been wanting to push people in the direction of semantic technologies for quite a while now. Mainly this has taken the form of…
With apologies to René Magritte. Imagine you are a judge in a small court and I am the accused. I have been caught stealing coconuts…
I have been doing some thinking about capturing images of herbarium specimens so as to facilitate the “taxonomic process” – whatever that might be. The…
There is no doubt that Globally Unique Identifiers (GUIDs) are important and not just because I have been hammering on about them for the last…
What’s the Problem? Imagine you are a biologist looking at a group of organisms. You may be interested in these organisms because they occupy similar…