
Category: Biodiversity Informatics

Biodiversity Informatics is the poor country cousin of Bioinformatics. Where bioinformatics is concerned with the computational aspects of genes and their expression within organisms (usually in the lab), biodiversity informatics is concerned with how we handle data about the occurrence and identity of whole organisms, out there in the wild or dead in collections of voucher specimens.

Hierarchies Make Monographs Obsolete. Fact Sheets Are The Future.

Whilst I have been working on digitizing the Rhododendron monographs I have also been providing some technical help for Stuart Lindsay, who is producing a series of fact sheets for the Ferns of Thailand. This has helped crystallize my thoughts on monographs and how we migrate them into the digital age.

This post is a follow-on from a previous one in which I discussed mapping the Rhododendron monographs to EoL. It is an opinionated rant, but I offer it in the hope that it will be of some use.

When monographs/floras/faunas are mentioned in the context of digitization people usually chirp up with PDF or, if they are more clued up on biodiversity informatics, TaXMLit and INOTAXA (hi to Anna if you are reading) or TaxonX and Plazi.org (hi to Donat). The point I am going to make in this post is not against these ways of marking up taxonomic literature but about the nature of the monographic/floristic/faunistic taxonomic product itself. I am far more familiar with the botanical side of things, so apologies to zoologists in advance.

Square Peg Into A Round Hole?

I’ve had my head down, work-wise, for the past few weeks trying to get the Rhododendron monograph markup finished. I now have a little database with some 821 species accounts in it plus a few hundred images – mainly of herbarium specimens. The workflow has been quite simple but very time consuming.

  1. Text is obtained from the source monograph, either via OCR or from the original word-processor documents.
  2. The text is topped-and-tailed to remove the introduction and any appendices and indexes.
  3. Text is converted to UTF-8 if it isn’t already.
  4. An XML header and footer are put in place and any non-XML characters are escaped – this actually came down to just replacing & with &amp;.
  5. The text is now in a well formed XML document.
  6. A series of custom regular-expression-based replacements is carried out to put XML tags at the start of each of the recognizable ‘fields’ in the species accounts (a rough sketch of this step is given after this list). These have to be fine-tuned to each document, as the styles of the monographs are subtly different; even monographs published in the same journal had some differences. It is not possible to identify the start and end of each document element automatically. This is for three reasons:
    1. OCR errors mean the punctuation, some letters and line breaks are inconsistent.
    2. Original documents have typos in them. A classic is a period appearing inside, outside, or both inside and outside a closing parenthesis.
    3. There are no consistent markers in the source documents’ structure for some fields. For example, the final sentence of the description may contain a description of the habitat, frequency and altitude, but the order and style may vary, presumably to make the text more pleasant to read. The only way to resolve this is by human intervention.
  7. The text is no longer in a well formed XML document!
  8. The text is manually edited whilst consulting the published hard copy to insert missing XML tags and correct really obvious OCR errors. In some places actual editing of the text is needed to get it to fit a uniform document structure as in the habitat example above.
  9. The text is now back to being a well formed XML document.
  10. An XSL transformation is carried out on the XML to turn it into ‘clean’ species accounts and alter the structure slightly.
  11. An XSL transformation is carried out to convert the clean species accounts into SQL insert statements for a simple MySQL database (sketched after this list). The structure of this database is very like an RDF triple store (actually a quad store, as there is a column for the source). A canonical, simplified taxon name (without authority or rank) is used as the equivalent of a URI to identify each ‘object’ in the database. Putting the data in a database makes it much easier to clean up and to extract some additional data. An alternative would be to keep a single large XML document and write XPath queries.
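As a rough illustration of step 6, here is a minimal Python sketch of the kind of regular-expression pass involved. The field names and patterns are invented for illustration, not taken from the actual workflow, and the real patterns have to be fine-tuned to each monograph.

```python
import re

# Hypothetical field labels and patterns; the real patterns need tuning to
# each monograph and still miss things because of OCR noise.
FIELD_PATTERNS = [
    ("habitat",      re.compile(r"^Habitat[.:]\s*", re.MULTILINE)),
    ("distribution", re.compile(r"^Distribution[.:]\s*", re.MULTILINE)),
    ("specimens",    re.compile(r"^Specimens examined[.:]\s*", re.MULTILINE)),
]

def tag_fields(account_text):
    """Insert an opening XML tag at the start of each recognised field.

    Only the starts of fields are marked; closing tags and anything the
    regexes miss still have to be added by hand, which is why the result
    is not yet well-formed XML (steps 7-9 above).
    """
    tagged = account_text
    for name, pattern in FIELD_PATTERNS:
        tagged = pattern.sub(lambda m, n=name: "<" + n + ">" + m.group(0), tagged)
    return tagged

print(tag_fields("Habitat. Open hillsides, 2400-3000 m.\nDistribution: Yunnan."))
```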
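And as a rough illustration of step 11, a sketch of what a ‘triple store with a source column’ could look like. It uses SQLite in memory for brevity rather than the MySQL database actually used, and the table and column names are my own invention, not the real schema.

```python
import sqlite3

# A quad-store-like table: subject (canonical taxon name), predicate (field),
# object (the field's text) and source (which monograph said it).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE statements (subject TEXT, predicate TEXT, object TEXT, source TEXT)"
)
conn.execute(
    "INSERT INTO statements VALUES (?, ?, ?, ?)",
    (
        "Rhododendron arboreum",  # canonical name, no authority or rank
        "habitat",
        "Open hillsides and forest margins, 1500-3000 m.",
        "Edinburgh Rhododendron monograph, part 1",
    ),
)

# Everything said about one taxon, whichever monograph said it.
for row in conn.execute(
    "SELECT predicate, object, source FROM statements WHERE subject = ?",
    ("Rhododendron arboreum",),
):
    print(row)
```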

HTML5 Geolocation Data Sucks

I have long been excited about HTML5 having access to geolocation data. It should make it possible to build a whole range of cross-platform applications for phones and other devices that make use of the user’s location. Unfortunately reality bites when you actually try to build an application on the technology.

I have been working with Sencha Touch and the Ext.util.Geolocation object but am having problems with accuracy. I have noted the following behaviour.

When I call for a location on the iPhone (3G) and iPad (v1) I get a fix with around 1.3 km accuracy. Basically it places me at one of two spots about 1 km apart. If I switch to the native maps app it places my position within 10 m of where I am standing – that “wow, it knows where I am” accuracy. Switch back to my web app and the first call to the Geolocation object returns similar accuracy, but any subsequent calls return the old, inaccurate positions.

Extracting Data From the Rhododendron Monographs

This post deals with the semantics of extracting data from the Rhododendron monographs. Another post will deal with the technicalities of the actual extraction.

The image above shows a species description entry. It was chosen as a small and simple example for illustrative purposes. I have marked up the bits I am interested in extracting: red indicates important fields, blue unimportant ones, and yellow something in between – but why those bits and those priorities? Monographs contain a great deal of other material, such as keys, descriptions of higher taxa and discussions. We could argue for hours about what should be extracted and never come to a conclusion unless we have some guiding principles for what we are trying to do. I have therefore developed five guiding principles for the project that are probably fairly general and may be applicable to other such projects:

Links To All Curtis Botanical Magazine Illustrations in BHL

William Curtis (1746-1799)

This is a sideline to my work on the Edinburgh Rhododendron monographs.

The monographs often quote references to illustrations (icons) of species. This is useful because we know these illustrations have been determined by the author of the account and are therefore “correctly” determined. What a shame we only have an abbreviated text string that can really only be understood by a human. An example might be “Rhododendron & Camellia Yearbook 25: f.58 (1970)”. Because these are in the botanical monographic style it is near impossible even to turn them into an OpenURL that a resolver could make sense of – so we have a bit of a challenge.
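To make the shape of the problem concrete, here is a minimal Python sketch that pulls journal, volume, figure or plate number and year out of strings in that style. It is illustrative only – the pattern, and the second example string, are my own inventions, not part of the original workflow.

```python
import re

# Illustrative pattern only: real citation strings vary far more than this
# (and carry OCR noise), so no single regex will catch them all.
ICON_PATTERN = re.compile(
    r"^(?P<journal>.+?)\s+(?P<volume>\d+)\s*:\s*"
    r"(?P<kind>[ft])\.?\s*(?P<number>\d+)\s*\((?P<year>\d{4})\)"
)

def parse_icon(citation):
    m = ICON_PATTERN.match(citation.strip())
    return m.groupdict() if m else None

print(parse_icon("Rhododendron & Camellia Yearbook 25: f.58 (1970)"))
print(parse_icon("Bot. Mag. 96: t.5905 (1870)"))  # invented example in the same style
```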

For the just-under-four-hundred species accounts I have extracted from the first two monographs, I have 445 icon strings. Of these, 144 contain ‘Bot. Mag.’ – Curtis’ Botanical Magazine – so they look like a good set to try to parse and link up. The Biodiversity Heritage Library has digitized the portion of Bot. Mag. published before 1920 that is out of copyright, thanks to Missouri Botanic Gardens. I just need to join it all up. In fact I could download the relevant images and embed them in my data, because they are out of copyright.

So a happy afternoon was spent learning about the BHL API and writing XSLT and regular expressions to parse the strings I had. The result was a match-up of just 59 illustrations – about the same number I could have done manually in an afternoon! The rest of my Bot. Mag. references are post-1920 and so locked up in copyright.

But a happy by-product of the process was that I downloaded and parsed all the metadata for Bot. Mag. in BHL and extracted the item IDs (books) and page IDs for what I believe are all the illustrations – a total of 8,215. So if you are faced with the same issue as me, you don’t have to go to the bother of doing it yourself. Here is a CSV file of the full list.

All Curtis Illustrations In BHL (CSV)

I have included the URLs to the resources in BHL, although these are just trivial concatenations of an http prefix and the page or item IDs.
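If it is useful, here is a short Python sketch of reading such a file and rebuilding the URLs. The file name and column names (ItemID, PageID) are assumptions – check them against the header row of the actual CSV – but the /item/ and /page/ URL patterns are the standard BHL ones.

```python
import csv

# File name and column names are assumed; adjust to match the downloaded CSV.
with open("curtis_illustrations_bhl.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        item_url = "https://www.biodiversitylibrary.org/item/" + row["ItemID"]
        page_url = "https://www.biodiversitylibrary.org/page/" + row["PageID"]
        print(page_url, item_url)
```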

European Natural History Collections – What’s Missing?

I am working on improving the metadata on European natural history collections as part of the Synthesys project. In an earlier post (Big Collections First) I did an analysis of the data in the Biodiversity Collections Index. I am now building a more detailed list of the large collections (the ones believed to contain more than a million ‘specimens’), of which there appear to be around sixty. These account for most of the biodiversity material in museums in Europe.

As I worked through the list I began to match them up against data sources in the GBIF Data Portal, but this task became tricky because some data sources in GBIF carried the names of museums yet were clearly the results of observational studies rather than catalogues of specimens. I decided to break off and do an analysis of what was in the GBIF Data Portal by way of specimens residing in Europe. This post presents the results of that analysis.

Failing To Use OWL To Merge Occurrence Ontologies

I am not sure how to say this. Either:

  1. I just had my first Nature Precedings paper published.
  2. I just published my first paper on Nature Precedings.

The distinction is a big one. Saying I had a paper published implies the blessings of my peers. This is more like vanity publishing, or even ‘stupid’ vanity publishing, as the whole thing is open to comments and voting. My inflated but fragile ego will likely get punctured by negative comments or, worse (and more likely), by total apathy.

I did some work last year on merging occurrence-status vocabularies for the PESI project and, as I really wanted to make use of OWL in some way, I attempted to do this by creating a set of related ontologies and then using inference to produce a magical semantic merging of them all. As this was a new thing for me, I wrote it up as I went along.
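The write-up has the details, but the core idea can be sketched in a few lines. This is not the tooling or the vocabularies from the actual work – the namespaces and terms below are invented, and I am using rdflib with the owlrl reasoner purely for illustration – but it shows the shape of the approach: publish the vocabularies as OWL classes, add bridging axioms between them, and let inference produce the merged classification.

```python
from rdflib import Graph, Namespace, RDF, RDFS, OWL
import owlrl

# Two invented occurrence-status vocabularies standing in for the real ones.
A = Namespace("http://example.org/vocabA#")
B = Namespace("http://example.org/vocabB#")

g = Graph()

# Vocabulary A: Native is a kind of OccurrenceStatus.
g.add((A.OccurrenceStatus, RDF.type, OWL.Class))
g.add((A.Native, RDF.type, OWL.Class))
g.add((A.Native, RDFS.subClassOf, A.OccurrenceStatus))

# Vocabulary B: Indigenous, defined independently.
g.add((B.Indigenous, RDF.type, OWL.Class))

# The 'merge': a bridging axiom saying the two terms mean the same thing.
g.add((A.Native, OWL.equivalentClass, B.Indigenous))

# A record that only uses vocabulary B.
g.add((B.record123, RDF.type, B.Indigenous))

# Materialise the OWL RL consequences.
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

# The record is now also an A.Native and an A.OccurrenceStatus: the two
# vocabularies behave as one merged classification.
print((B.record123, RDF.type, A.Native) in g)
print((B.record123, RDF.type, A.OccurrenceStatus) in g)
```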