If I start to talk about how the world is and therefore how we should best live there is a danger you will dismiss what I say as either playing with ideas that have no relation to real life (philosophy) or trying to impose some mumbo jumbo from a possible imaginary deity (religion). Many people are reluctant to explore this stuff because it will either prove a complete waste of time or overturn a belief system that they have accepted since childhood.

Despite this I do need to create a narrative explanation of why you should try mindfulness meditation. The rationality at the heart of our culture requires that this comes first. Please treat what follows as a pragmatic way of viewing the world for the purpose of living the good life rather than just a set of ideas or a religious doctrine.

Things arise in dependence on conditions and when those conditions cease the things cease. This is the root of the philosophy. This is easy to accept because when we look we can see it is true. This should not be confused with “cause and effect” which is more a product of language. To have a “cause” and an “effect” we need to define one thing as being the cause and something else as being the effect which is useful when we want to use words to represent these things but involves isolating them from the rest of the universe. Drawing a line around them if you like. So we could talk about it raining because it is cloudy but this conveniently leaves out the causes of clouds and the processes within the clouds.

This post deals with the semantics of extracting data from the Rhododendron monographs. Another post will deal with the technicalities of the actual extraction.

The image above shows a species description entry. It was chosen as being a small and simple example for illustrative purposes. I have marked up the bits I am interested in extracting. Red indicates important fields, blue unimportant and yellow something in between – but why those bits and those priorities? Monographs contain a great deal of other stuff such as keys and descriptions of higher taxa and discussions. We could argue for hours about what should be extracted and never come to a conclusion unless we have some guiding principles on what we are trying to do. I have therefore developed five guiding principles for the project that are probably pretty general and may be applicable to other such projects.

William Curtis (1746-1799)

This is a sideline to my working on the Edinburgh Rhododendron monographs.

The monographs often quote references to illustrations (icons) of species. This is useful as we know that these are illustrations that have been determined by the author of the account and are therefore “correctly” determined. What a shame we only have an abbreviated text string that can really only be understood by a human. An example might be “Rhododendron & Camellia Yearbook 25: f.58 (1970)”. Because these are in the botanical monographic style it is near impossible even to turn them into an OpenURL that a resolver could make sense of – so we have a bit of a challenge.

For the just-under-four-hundred species accounts I have extracted from the first two monographs I have 445 icon strings. Of these, 144 contain ‘Bot. Mag.’ (for Curtis’s Botanical Magazine), so they look like a good set to try to parse and link up. The Biodiversity Heritage Library has digitized the portion of Bot. Mag. published before 1920, which is out of copyright, thanks to Missouri Botanical Garden. I just need to join it all up. In fact I could download the relevant images and embed them in my data because they are out of copyright.

So a happy afternoon was spent learning about the BHL API and writing XSLT and regular expressions to parse the strings I had. The result was a match up of just 59 illustrations. About the same number I could have done manually in an afternoon! The rest of my Bot. Mag. references are post 1920 and so locked up in copyright.
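As a rough illustration of the kind of parsing involved, here is a minimal Python sketch. It assumes every icon string follows the shape of the single example quoted above ("title volume: locator (year)"); real strings will vary more than this, and the function names and the 1920 cut-off parameter are made up for the example rather than taken from the scripts actually used.

```python
import re

# Assumed shape, based only on the one example quoted in the post:
#   "<title> <volume>: <locator> (<year>)"
ICON_RE = re.compile(
    r"^(?P<title>.+?)\s+(?P<volume>\d+):\s*(?P<locator>[^()]+?)\s*\((?P<year>\d{4})\)$"
)


def parse_icon(text):
    """Split an icon string into its parts, or return None if it does not fit."""
    match = ICON_RE.match(text.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["volume"] = int(record["volume"])
    record["year"] = int(record["year"])
    return record


def is_bot_mag(record):
    """True if the parsed title starts with the 'Bot. Mag.' abbreviation."""
    return record["title"].startswith("Bot. Mag.")


def before_cutoff(record, cutoff=1920):
    """True if the year falls before the (assumed) digitization cut-off."""
    return record["year"] < cutoff


if __name__ == "__main__":
    print(parse_icon("Rhododendron & Camellia Yearbook 25: f.58 (1970)"))
```

Strings that do not fit the pattern come back as `None`, which makes it easy to count how many still need a human eye.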

But a happy by-product of the process was the fact that I downloaded and parsed all the metadata for Bot. Mag. in BHL and extracted the item IDs (books) and page IDs for what I believe are all the illustrations – a total of 8,215. So if you are faced with the same issue as me you don’t have to go to the bother of doing it. Here is a CSV file of the full list.

All Curtis Illustrations In BHL (CSV)

I have included the URLs to the resources in BHL although these are just trivial concatenations of the page IDs or item IDs and an http prefix.
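For anyone rebuilding the links themselves, the concatenation can be sketched as below. The base address and the `/page/` and `/item/` path segments are assumptions based on how BHL addresses look today, so check them against the CSV before relying on them.

```python
# Assumed prefix; the post only says "an http prefix".
BHL_BASE = "https://www.biodiversitylibrary.org"


def page_url(page_id):
    """Build the address of a single page from its BHL page ID."""
    return f"{BHL_BASE}/page/{int(page_id)}"


def item_url(item_id):
    """Build the address of a whole item (volume) from its BHL item ID."""
    return f"{BHL_BASE}/item/{int(item_id)}"
```

Passing the IDs through `int()` means a stray space or non-numeric value in the data fails loudly instead of producing a broken link.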

The first two parts of the monograph to be looked at were published in Notes from the Royal Botanic Garden Edinburgh – the house journal of the gardens until 1990.

  • Cullen, J. (1980) Revision of Rhododendron. I. subgenus Rhododendron sections Rhododendron and Pogonanthum. Notes from the Royal Botanic Garden Edinburgh. 39:1-207.
  • Chamberlain, D.F. (1982) A revision of Rhododendron. II. Subgenus Hymenanthes. Notes from the Royal Botanic Garden Edinburgh. 39:209-486.

Between them these publications cover 544 species – more or less half the genus.

The entire run of the Notes has now been digitized to page images for BHL-Europe and so I have access to good quality pictures of the text. We have an in-house OCR service that I can drop these images into to create text or other outputs. I started by dropping all 200+ images from the first publication into the OCR and creating 200+ text files but this didn’t make sense because many of the species accounts ran across multiple pages. What I needed was the contiguous text for the whole publication. I could have concatenated the text files but I figured the OCR software would do a better job if it was working through one big document as it would learn from previous pages – OK maybe this is fantasy but it is worth a try. By using Preview (the Mac’s default PDF and image viewer) I created a single PDF containing all the images and put that through the OCR processor. The result was not only a single text file but also a PDF of the whole publication including OCR’d text. Job done! Can I stop now?
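For comparison, the concatenation route that was not taken might look like the Python sketch below. It assumes one `.txt` file per page with the page number as the last run of digits in the file name; that naming scheme and the function names are hypothetical.

```python
import re
from pathlib import Path


def page_key(name):
    """Sort key: the last run of digits in the file name, read as a number."""
    digits = re.findall(r"\d+", Path(name).stem)
    return int(digits[-1]) if digits else 0


def join_pages(pages):
    """Join a {file name: text} mapping into one string in page order."""
    ordered = sorted(pages, key=page_key)
    return "\n".join(pages[name].rstrip("\n") for name in ordered)


def join_page_files(folder):
    """Read every .txt file in a folder and join them in page order."""
    pages = {p.name: p.read_text(encoding="utf-8") for p in Path(folder).glob("*.txt")}
    return join_pages(pages)
```

Sorting on the number rather than the raw name matters because a plain alphabetical sort would put page 10 before page 2.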

This process showed how easy it is to create digital versions of publications. The PDFs produced are not very friendly, being almost 100 MB in size for each of the two publications, but they can be read online and indexed, so they do fulfill the basic requirements of making ‘legacy’ publications available. Because of their size I do not attach the PDFs here.

Two points jump to mind:

  1. The accuracy of the OCR is masked because the text is hidden behind the page images. Although the document is searchable, when a search term is not found we can’t be sure whether that is because it isn’t there or because the OCR failed for that word in that location. This digitization process is likely to engender a false sense of security.
  2. The PDFs of the publications do not enable re-mixing or querying of the data beyond simple text searching. Questions like “What species occur in Yunnan, China?” can only be answered by working through the text manually – something that might be quicker with the printed version.

Making text available to read online is useful in that it facilitates distribution and discovery of that text but that is all it does.

The next step is to try and turn what is basically a descriptive narrative into more useful information that can be used to answer the simple questions people are likely to ask about biodiversity. At the least it has to be massaged into a set of web pages, one for each species, for use in EOL. There are two aspects to this process:

  • Syntax – this is really the easy bit although time consuming. The text of the monograph has a particular syntax – ordering of characters into words and sentences. We need to mark up the document with another syntax that will allow a machine to extract chunks of information. This isn’t too difficult to do at a coarse level because the monographs are highly structured but it becomes harder the more finely granular the syntax becomes. It inevitably involves a lot of manual work and I’ll cover it in another post.
  • Semantics – this involves tougher decisions but isn’t that time consuming. We need to decide what chunks of information in the document we want to extract and what chunks we can practically extract and reach some kind of compromise. Different chunks of text can be seen in the document. Some of these chunks have no biological meaning at all e.g. a page or a paragraph. Others have useful biological meaning e.g. a distribution string like “NE Burma, China (Yunnan, Sichuan, W Guizhou)” in the context of a species description. The decisions made about what to extract will affect the syntax used and how long it will take to impose that syntax on the raw text of the document. Making these decisions will be the subject of another blog post.
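To make the idea of a chunk with biological meaning concrete, here is a minimal Python sketch that turns the distribution string quoted above into a small structure that can be queried. It assumes areas are separated by commas and that sub-areas sit in a single level of parentheses; the function names are invented for the example and this is not the markup scheme the project ends up using.

```python
def split_top_level(text, sep=","):
    """Split on separators that are not inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [part for part in parts if part]


def parse_distribution(text):
    """Turn 'A, B (x, y)' into a list of areas, each with its sub-areas."""
    areas = []
    for part in split_top_level(text):
        name, _, rest = part.partition("(")
        subareas = split_top_level(rest.rstrip(")")) if rest else []
        areas.append({"area": name.strip(), "subareas": subareas})
    return areas


def occurs_in(areas, place):
    """True if the place exactly matches an area or one of its sub-areas."""
    return any(place == a["area"] or place in a["subareas"] for a in areas)


if __name__ == "__main__":
    parsed = parse_distribution("NE Burma, China (Yunnan, Sichuan, W Guizhou)")
    print(parsed)
    print(occurs_in(parsed, "Yunnan"))
```

The matching is deliberately exact, so a qualified name such as “W Guizhou” will not match a search for “Guizhou”. Deciding how to treat compass qualifiers like that is exactly the sort of semantic decision described above.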

The Royal Botanic Garden Edinburgh has a history of research into the genus Rhododendron stretching back over 100 years. The legacy of this work is a herbarium that contains many type specimens, an amazing living collection and a set of monographs that cover the whole genus. My contribution back in the 1990s was via my PhD thesis which looked at the use of emerging molecular techniques.

The bulk of the work done on Rhododendron occurred just before the digital age kicked in and so the material is not integrated in a way that can be re-used and re-purposed. An example of this is what could be called The Edinburgh Rhododendron Monograph which covers 1,027 recognized species. This is actually spread over seven publications that came out over the course of 26 years in two journals and a book and is not available in a single form anywhere. The publications are:

  • Cullen (1980) Subgenus: Rhododendron Sections: Rhododendron & Pogonanthum 231 species
  • Argent (2006) Subgenus: Rhododendron Section: Vireya 313 species
  • Chamberlain (1982) Subgenus: Hymenanthes Section: Ponticum 302 species
  • Chamberlain & Rae (1990) Subgenus: Tsutsusi 117 species
  • Kron (1993) Subgenus: Pentanthera Section: Pentanthera 23 species
  • Judd & Kron (1995) Subgenus: Pentanthera Sections: Rhodora, Viscidula & Sciadorhodion 7 species
  • Philipson & Philipson (1986) Subgenera: Azaleastrum, Therorhodion, Mumeazalea & Candidastrum 34 species
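As a quick arithmetic check, the per-publication counts in the list above can be added up in a few lines of Python. The dictionary keys are just labels chosen for this example.

```python
# Species counts as given in the list above, keyed by publication.
species_counts = {
    "Cullen (1980)": 231,
    "Argent (2006)": 313,
    "Chamberlain (1982)": 302,
    "Chamberlain & Rae (1990)": 117,
    "Kron (1993)": 23,
    "Judd & Kron (1995)": 7,
    "Philipson & Philipson (1986)": 34,
}

total = sum(species_counts.values())
print(total)  # 1027, matching the 1,027 recognized species quoted above
```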

Last year I was fortunate to be awarded an Encyclopedia of Life Rubenstein Fellowship to create a species page in the encyclopedia for each of the species covered by the Edinburgh monograph – the text of all seven publications now being available electronically in various forms. The award funds me for a total of 100 days to process the OCR’d or PDF text into the EOL transfer format and to link it in to as much additional data as possible. I hope to blog my experiences good and bad.

References

  • Argent, G. (2006). Rhododendrons of subgenus Vireya. Royal Horticultural Society, London.
  • Chamberlain, D.F. (1982). A revision of Rhododendron. II. Subgenus Hymenanthes. Notes from the Royal Botanic Garden Edinburgh. 39:209-486.
  • Chamberlain, D.F. & Rae, S.J. (1990). A revision of Rhododendron. IV. Subgenus Tsutsusi. Edinburgh Journal of Botany. 47(2):89-200.
  • Cullen, J. (1980). Revision of Rhododendron. I. subgenus Rhododendron sections Rhododendron and Pogonanthum. Notes from the Royal Botanic Garden Edinburgh. 39:1-207.
  • Judd, W.S. & Kron, K.A. (1995). A revision of Rhododendron sections Sciadorhodion, Rhodora and Viscidula. Edinburgh Journal of Botany. 52:1-54.
  • Kron, K.A. (1993). A revision of Rhododendron section Pentanthera. Edinburgh Journal of Botany. 50:249-364.
  • Philipson, W.R. & Philipson, M.N. (1986). A revision of Rhododendron. III subgenera Azaleastrum, Mumeazalea, Candidastrum and Therorhodion. Notes from the Royal Botanic Garden Edinburgh. 44:1-23.