<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>&#160;Roger Hyam &#187; Biodiversity Informatics</title>
	<atom:link href="http://www.hyam.net/blog/archives/category/biodiv/feed" rel="self" type="application/rss+xml" />
	<link>http://www.hyam.net/blog</link>
	<description>&#34;truly pathetic verbiage&#34;</description>
	<lastBuildDate>Tue, 31 Jan 2012 09:19:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.3</generator>
		<item>
		<title>Hierarchies Make Monographs Obsolete. Fact Sheets Are The Future.</title>
		<link>http://www.hyam.net/blog/archives/1522</link>
		<comments>http://www.hyam.net/blog/archives/1522#comments</comments>
		<pubDate>Mon, 05 Dec 2011 12:58:53 +0000</pubDate>
		<dc:creator><span property="dc:creator" resource="http://www.hyam.net/blog/archives/1522">Roger Hyam</span></dc:creator>
				<category><![CDATA[Biodiversity Informatics]]></category>
		<category><![CDATA[Rhododendron Monographs]]></category>

		<guid isPermaLink="false">http://www.hyam.net/blog/?p=1522</guid>
		<description><![CDATA[Whilst I have been working on digitizing the Rhododendron monographs I have also been providing some technical help for Stuart Lindsay who is producing a series of fact sheets for the Ferns of Thailand. This has helped crystallize my thoughts regarding monographs and how we migrate them into the digital age. This post is a <a href='http://www.hyam.net/blog/archives/1522'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hyam.net/blog/wp-content/uploads/2011/12/466555.jpg"><img class="alignleft size-medium wp-image-1531" title="466555" src="http://www.hyam.net/blog/wp-content/uploads/2011/12/466555-519x640.jpg" alt="" width="310" height="382" /></a>Whilst I have been working on digitizing the <em>Rhododendron</em> monographs I have also been providing some technical help for Stuart Lindsay who is producing a series of fact sheets for the Ferns of Thailand. This has helped crystallize my thoughts regarding monographs and how we migrate them into the digital age.</p>
<p>This post is a follow on from a previous one where I discuss mapping the <a href="http://www.hyam.net/blog/archives/1498">Rhododendron monographs to EoL</a>. It is an opinionated rant but I offer it in the hope that it will be of some use.</p>
<p>When monographs/floras/faunas are mentioned in the context of digitization people usually chirp up with PDF or, if they are more clued up on biodiversity informatics,  <a href="http://www.inotaxa.org/">TaXMLit and INOTAXA</a> (Hi to Anna if you are reading) or <a href="http://www.plasi.org">TaxonX and Plazi.org</a> (Hi to Donat).  The point I am going to make in this blog post is not against these ways of marking up taxonomic literature but more the nature of the monographic/floristic/faunistic taxonomic product itself. I am far more familiar with the botanical side of things so apologies to zoologists in advance.<span id="more-1522"></span></p>
<p>The problem is what I call the &#8220;narrative form&#8221; of monographic data, not whether it is available in print, pdf, ebook or lovingly marked up XML. These publications are arranged hierarchically. There is introductory material, family descriptions, generic descriptions, species descriptions, subspecies descriptions. These descriptions are nested within each other and it isn&#8217;t always clear what information in one level of the hierarchy is repeated at lower levels. Descriptions are diagnostic within the frame of reference of that treatment i.e. they provide enough detail to separate that taxon from the others at that level in that particular hierarchy. Differentiating the taxon from other taxa in other treatments of the same or overlapping groups is usually relegated to notes.</p>
<p>Today we talk of monographs having this form because they reflect phylogeny. Previously they reflected a more ill-defined &#8216;affinity&#8217; or natural ordering. Originally hierarchies were used as an <em>aide-mémoire</em>. This results in a mishmash of concepts that is difficult to decode. Phylogenies do not have ranks or a linear order yet monographs do. So what do the order and rank in a phylogenetically based monograph represent? If these things exist as an <em>aide-mémoire</em> then why aren&#8217;t they totally arbitrary &#8211; merely picking out the easiest to remember characteristics of different groups and making no attempt to represent evolutionary history?</p>
<p>Imagine approaching a monograph/flora/fauna with a species name. You look it up in the index. Turn to the right page and then have to assemble a description of the taxon by reading back up the taxonomic hierarchy &#8211; unless of course the author has redundantly repeated all the descriptive data in the species level account which begs the question of what the higher level accounts are for.</p>
<p>Now suppose you have in your hand an unknown specimen. First thing you need to do is to know the family and possibly the genus so you can find the right work to look it up in. There is rarely a multi access approach to getting you near to a taxon such as &#8220;deciduous trees with palmate leaves&#8221;. You have to be a taxonomist, have fertile material and more or less know what the thing is before you even start. By definition the monographs are not optimized for identification of specimens.</p>
<p>This means that these works are mainly of use to taxonomists who are familiar with the groups concerned. But what do they use them for?</p>
<p>If a taxonomist is working on a new revision they won&#8217;t be consulting current, extant monographs very much. That work has been done and shouldn&#8217;t need revising for decades. They will be working on material that hasn&#8217;t been monographed for decades if ever and needs to be classified. If they are finding new species within a recently monographed group then they will be turning over the apple cart because the descriptions in that monograph are now out of date because the monographic form is designed to be comprehensive.</p>
<p>What is more likely is that a taxonomist will use existing monographs to produce secondary taxonomic products such as field guides &#8211; and this is where my key point comes in.</p>
<h2><strong>Remixing<br />
</strong></h2>
<p>Suppose you want to produce a secondary taxonomic product. Say a guide to the lowland trees of a country. Even if you had a checklist of all the species of the country how would you know which were lowland trees? That kind of habit character is likely to be buried in descriptions. Even if you had your list of the species how would you build your guide? How would you pull together free standing descriptions of each taxon? The only way at the moment is to roll your sleeves up and become a taxonomist. Start writing new descriptions based on the contents of monographs (in which the descriptions are designed to differentiate your target taxa from taxa that will not be included in your guide). This kind of thing should really be done automatically. We should be able to do a search for all the species that are considered trees and occur below a certain altitude and find free standing descriptions of these species that we can load on our phone or tablet or print in a booklet and take into the field. The stuff that taxonomists currently produce does not support this kind of behaviour.</p>
<p>What about putting the monograph on the web? If someone links to a species what do we show on the page that is displayed? Do we include the genus description as well? What if the species description doesn&#8217;t mention the generic characteristics? Do we include the subspecific taxa? What if the subspecific taxon is only defined in terms of its minor differences to the main species &#8211; &#8220;var. alba&#8221;?</p>
<p>Two years ago I discussed how difficult it is to handle hierarchies in <a href="http://www.hyam.net/blog/archives/707">Synonyms Are SubClasses And Higher Taxa Are Just Tags</a> which is a little more technical than this piece but makes similar points.</p>
<h2><strong>Tough Love</strong></h2>
<p>Producing electronic versions of narrative monographic works is OK  from a political point of view and, if you are doing a print copy you may as well do an ebook and pdf but from the point of view of a non-taxonomist it is of little value and we shouldn&#8217;t kid ourselves that we are increasing accessibility very much. It may even be counter productive because people think they have produced an electronic resource when all they have produced is a facsimile of the paper one that is probably slower to use.</p>
<h2><strong>My Suggestion</strong></h2>
<p>Taxonomy needs to move to a <strong>One Species Per Publication</strong> model &#8211; I call this a fact sheet based approach. Instead of producing monographs of groups taxonomists should produce single free standing publications, one per single species with a global scope. If their primary interest is in phylogeny then they produce separate papers that only discuss the relationships between species that are already described in the free standing publications. This approach is far more appropriate for this digital age for the following reasons:</p>
<ul>
<li><strong>Referable -</strong> Single species can be used and referenced like other scientific or web resources. It is possible to refer to the use of a species in a study or in legislation and reference a single source that just describes that species. A lawyer can write a document that says we want to conserve species X as described in publication Y and that statement is not entangled with all the other taxa and data that are presented in publication Y.</li>
<li><strong>Remixable -</strong> It is possible to pull together a set of species descriptions to form a new resource. This may be done either automatically, say from a list of occurrence records for a region or habitat, or on a pick&#8217;n&#8217;mix basis (see taggable below).</li>
<li><strong>Granular Versionability (= Stability) -</strong> It is possible to replace individual species definitions in a set of definitions without having to reversion the whole lot. A new phylogeny or new species in a genus need not change other species in the genus that may be subjects of legal protection etc.</li>
<li><strong>Data transparent -</strong> In a typical monograph the data is of varying quality. One species may be based on five hundred specimens and another on only five. This isn&#8217;t always clear from casual use of the monograph where specimens examined and data analysis are typically presented separately from the main treatment. If all the data used to define a species is presented in a single publication then things become a great deal clearer.</li>
<li><strong>Granular Peer Review -</strong> Not all monographs are peer reviewed. Those that are are taken all or nothing. Suppose a monograph of twenty species is presented. It may be very good and have a good phylogenetic analysis etc. Perhaps two of the species are not particularly well defined but it is of high enough merit as a whole to be published. The result is that 10% of species are not particularly well defined! It would have been better to pass eighteen species and reject two. Taxonomy is riddled with such species. You only need to read a monograph that is sinking ill-defined species from the previous monograph that probably shouldn&#8217;t have been published in the first place &#8211; whilst creating new ill-defined species of its own.</li>
<li><strong>Taggable -</strong> Anything that can be reliably referenced can be tagged. This means that it becomes possible to build meaningful lists of species that can be pulled together into useable products. The tagging does not have to be done by the authors. For example IUCN tags species with conservation status and a group working on functional ecology may tag them with their functions in the environment. It is then possible to pull together a list (with descriptions) of endangered species that perform a certain role in the environment. Currently this process only leads to a list of names that can be handed off for someone to work on trying to establish what the different sources meant by those names.</li>
<li><strong>Faster More Agile Development -</strong> We can&#8217;t describe all the species on earth in the way we have been doing with the resources available in a reasonable time. This is <strong>not</strong> an unusual problem. All domains are faced with challenges that can&#8217;t be addressed by the resources available. In software engineering the &#8216;agile&#8217; approach to this problem is to prioritize development of important, doable things to build an initial working system and then revisit and re-prioritize what needs doing next. Publish results quickly and often. In taxonomy the opposite approach is often taken. A group is selected for monograph and worked on until resources are exhausted and the monograph is then published. By adopting a One Species Per Publication approach the &#8216;easy&#8217; species would be published as soon as the researcher was sure they were &#8216;good&#8217; taxa making their work available for others to use and give feedback on years sooner than is traditionally the case and whilst resources are still available to respond. Should the project stall or fail to complete then possibly the most valuable results will already be in circulation and not lost to science. Those enormous genera that are a lifetime&#8217;s work for someone could be chipped away at by the army of short term employees who are replacing career scientists!</li>
<li><strong>It would make the job of aggregators like EoL much, much easier!</strong> If we accept the fact that we need projects like EoL (which I think we all do) then we must also accept that we need to produce data in a form that they can use.</li>
</ul>
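<p>To make the &#8220;remixable&#8221; and &#8220;taggable&#8221; points concrete, here is a minimal Python sketch, not a real system. It assumes each fact sheet has a stable identifier and that third parties publish (identifier, tag) pairs. The identifiers, tags and sheet texts are all invented for illustration.</p>

```python
from collections import defaultdict

# Hypothetical fact sheets keyed by a stable identifier.
fact_sheets = {
    "sp-001": "Fact sheet for species one",
    "sp-002": "Fact sheet for species two",
    "sp-003": "Fact sheet for species three",
}

# Hypothetical tags supplied by different groups, not by the authors.
taggings = [
    ("sp-001", "endangered"),
    ("sp-002", "endangered"),
    ("sp-002", "nitrogen-fixer"),
    ("sp-003", "nitrogen-fixer"),
]


def build_index(pairs):
    """Group species identifiers by tag."""
    index = defaultdict(set)
    for species_id, tag in pairs:
        index[tag].add(species_id)
    return index


def remix(sheets, index, wanted_tags):
    """Return the fact sheets that carry every one of the wanted tags."""
    ids = set(sheets)
    for tag in wanted_tags:
        ids = ids.intersection(index.get(tag, set()))
    return [sheets[i] for i in sorted(ids)]


index = build_index(taggings)
print(remix(fact_sheets, index, ["endangered", "nitrogen-fixer"]))
```

<p>The point of the sketch is that the result is a list of descriptions rather than a list of bare names, which only works because each species is a separately referable unit.</p>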
<h2>Summary</h2>
<p>This is a long post so to summarize my proposal</p>
<ol>
<li>Stop writing monographs or  floristic or faunistic regional accounts of taxonomic groups.</li>
<li>Produce individual, self contained fact sheets of single species that are global in scope.</li>
<li>Use &#8216;Agile&#8217; development techniques to produce and update these rapidly.</li>
<li>Treat phylogenies as separate products that handle the arrangement of the entities described in fact sheets.</li>
</ol>
<p>I am sure this will put a lot of people&#8217;s backs up. Please leave a comment if you agree as well as if you want to see me lynched.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.hyam.net/blog/archives/1522/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Square Peg Into A Round Hole?</title>
		<link>http://www.hyam.net/blog/archives/1498</link>
		<comments>http://www.hyam.net/blog/archives/1498#comments</comments>
		<pubDate>Fri, 02 Dec 2011 10:33:44 +0000</pubDate>
		<dc:creator><span property="dc:creator" resource="http://www.hyam.net/blog/archives/1498">Roger Hyam</span></dc:creator>
				<category><![CDATA[Biodiversity Informatics]]></category>
		<category><![CDATA[Rhododendron Monographs]]></category>

		<guid isPermaLink="false">http://www.hyam.net/blog/?p=1498</guid>
		<description><![CDATA[I&#8217;ve had my head down work wise for the past few weeks trying to get the Rhododendron monograph markup finished. I now have a little database with some 821 species accounts in it plus a few hundred images &#8211; mainly of herbarium specimens. The workflow has been quite simple but very time consuming. Text is <a href='http://www.hyam.net/blog/archives/1498'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hyam.net/blog/wp-content/uploads/2011/12/470111.jpg"><img class="size-medium wp-image-1514 alignright" title="470111" src="http://www.hyam.net/blog/wp-content/uploads/2011/12/470111-385x640.jpg" alt="" width="115" height="192" /></a>I&#8217;ve had my head down work wise for the past few weeks trying to get the <em>Rhododendron</em> monograph markup finished. I now have a little database with some 821 species accounts in it plus a few hundred images &#8211; mainly of herbarium specimens. The workflow has been quite simple but very time consuming.</p>
<ol>
<li>Text is obtained from the source monograph either via OCR or access to the original word processor documents.</li>
<li>The text is topped-and-tailed to remove the introduction and any appendices and indexes.</li>
<li>Text is converted to UTF-8 if it isn&#8217;t already.</li>
<li>An XML header and footer are put in place and any non-XML characters are escaped &#8211; this actually came down to just replacing &amp; with &amp;amp;</li>
<li>The text is now in a well formed XML document.</li>
<li>A series of custom regular expression based replacements are carried out to put XML tags at the start of each of the recognizable &#8216;fields&#8217; in the species accounts. These have to be fine tuned to the document as the styles of the monographs are subtly different. Even the monographs published in the same journal had some differences. It is not possible to identify the start and end of each document element automatically. This is for three reasons:
<ol>
<li>OCR errors mean the punctuation, some letters and line breaks are inconsistent.</li>
<li>Original documents have typos in them. A classic is a period appearing inside or outside or inside and outside a closing parenthesis.</li>
<li>There are no consistent markers in the source documents&#8217; structure for some fields. For example the final sentence of the description may contain a description of the habitat, frequency and altitude but the order and style may vary, presumably to make the text more pleasant to read. The only way to resolve this is by human intervention.</li>
</ol>
</li>
<li>The text is no longer in a well formed XML document!</li>
<li>The text is manually edited whilst consulting the published hard copy to insert missing XML tags and correct really obvious OCR errors. In some places actual editing of the text is needed to get it to fit a uniform document structure as in the habitat example above.</li>
<li>The text is now back to being a well formed XML document.</li>
<li>An XSL transformation is carried out on the XML to turn it into &#8216;clean&#8217; species accounts and alter the structure slightly.</li>
<li>An XSL transformation is carried out to convert the clean species accounts into SQL insert statements for a simple MySQL database. The structure of this database is very like an RDF triple store (actually a quad store as there is a column for source). A canonical, simplified taxon name (without authority or rank) is used as the equivalent of the URI to identify each &#8216;object&#8217; in the database. Putting the data in a database makes it much easier to clean up and to extract some additional data. An alternative would be to have a single large XML document and write XPath queries.<span id="more-1498"></span></li>
</ol>
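<p>As a rough illustration of steps 4 to 9, the Python sketch below escapes the text, wraps it in a header and footer, opens a tag with a regular expression and then checks well-formedness. The sample account, the <code>accounts</code> and <code>type</code> element names and the single regex are all invented for the example; the real replacements were custom and tuned per monograph.</p>

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Invented text standing in for one OCR'd species account.
raw = "Rhododendron exemplum Smith & Jones. Shrub to 2 m. Type: Thailand, Smith 123."


def wrap(text):
    """Step 4: put an XML header and footer around the text."""
    return '<?xml version="1.0" encoding="UTF-8"?>\n<accounts>' + text + "</accounts>"


def is_well_formed(doc):
    """Return True if the document parses as XML."""
    try:
        ET.fromstring(doc.encode("utf-8"))
        return True
    except ET.ParseError:
        return False


# Step 4: escape the ampersand (and any angle brackets) in the raw text.
escaped = escape(raw)
print(is_well_formed(wrap(escaped)))   # step 5

# Step 6: open a tag at the start of a recognisable field.
tagged = re.sub(r"\bType:", "<type>Type:", escaped)
print(is_well_formed(wrap(tagged)))    # step 7: the tag is never closed

# Step 8 was manual editing; here the missing close tag is simply appended.
fixed = tagged + "</type>"
print(is_well_formed(wrap(fixed)))     # step 9
```

<p>The regex can find where a field starts but not reliably where it ends, which is why the document stops being well formed until a person has been through it.</p>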
<p>By writing queries that join the <em>Rhododendron</em> database to institutional databases I can create lists of living and dead specimens at Royal Botanic Garden Edinburgh and extract images from the herbarium digitisation project. Previously I extracted images from BHL that I can also join in. I can do things like &#8216;tag&#8217;  species with the ISO country codes, whether they are epiphytes, their altitude range &#8211; all interesting facts. I can imagine someone asking a real question such as &#8220;Give me accounts for all the rhododendrons that occur above X meters in Thailand&#8221;.</p>
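<p>A question like that can be sketched against a quad-style table. The example below uses Python with SQLite standing in for the MySQL database. The table name, column names, predicate names and species are assumptions made up for the illustration, and &#8220;occurs above X meters&#8221; is read here as &#8220;maximum recorded altitude is greater than X&#8221;.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per fact: subject, predicate, object, plus the source (the fourth column).
conn.execute(
    "CREATE TABLE facts (subject TEXT, predicate TEXT, object TEXT, source TEXT)"
)

# Invented rows; the subject is a simplified taxon name used as the identifier.
rows = [
    ("Rhododendron alpha", "occursInCountryIso", "TH", "monograph-1"),
    ("Rhododendron alpha", "altitude-max", "2500", "monograph-1"),
    ("Rhododendron beta", "occursInCountryIso", "TH", "monograph-1"),
    ("Rhododendron beta", "altitude-max", "900", "monograph-1"),
    ("Rhododendron gamma", "occursInCountryIso", "CN", "monograph-2"),
    ("Rhododendron gamma", "altitude-max", "3200", "monograph-2"),
]
conn.executemany("INSERT INTO facts VALUES (?, ?, ?, ?)", rows)


def species_above(connection, country_iso, metres):
    """Names of taxa tagged with the country whose maximum altitude exceeds metres."""
    sql = """
        SELECT c.subject
        FROM facts AS c
        JOIN facts AS a ON a.subject = c.subject
        WHERE c.predicate = 'occursInCountryIso' AND c.object = ?
          AND a.predicate = 'altitude-max' AND CAST(a.object AS INTEGER) > ?
        ORDER BY c.subject
    """
    return [row[0] for row in connection.execute(sql, (country_iso, metres))]


print(species_above(conn, "TH", 1000))
```

<p>Because every fact is a row, each extra condition is just another self join on the same table.</p>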
<p>I could write a bespoke front end to the database that enables this functionality but this wouldn&#8217;t help someone answer the question &#8220;Give me accounts for all the XYZs that occur above X meters in Thailand&#8221;. Let&#8217;s face it only a small bunch of taxonomists and enthusiasts are interested in data that <strong>only</strong> includes rhododendrons. For most people this data will  never be the whole answer. I am being funded to do this work so that we can get the information from the Edinburgh <em>Rhododendron</em> monographs into the Encyclopedia of Life. There it can be mixed in with data from many other sources and so move towards answering the questions &#8220;most people&#8221; are likely to ask.</p>
<p><strong>&#8216;Properties&#8217; I Have Captured</strong></p>
<p>From the workflow described above you can see that the  properties I have in my database <strong>have</strong> to represent the document structure of the monographs &#8211; plus some tags extracted by very simple data mining.  The properties are:</p>
<ul>
<li><strong>altitude (around, max, min, range)</strong> &#8211; Many accounts include a range of numbers or maybe a single &#8216;circa&#8217; number.</li>
<li><strong>description</strong> &#8211; Diagnostic description. Importantly this may or may not include characters that have been mentioned higher up the taxonomy in a group, subsection, section or subgenus description.</li>
<li><strong>distribution</strong> &#8211; Usually the country and province. Sometimes individual mountains or parks.</li>
<li><strong>habitat</strong> &#8211; a sentence describing the habitat but often including whether it is epiphytic or terrestrial (shouldn&#8217;t this be habit?) and also whether it is common or not.</li>
<li><strong>icon-ref</strong> &#8211; a citation of where an image can be found in the literature.</li>
<li><strong>image</strong> &#8211; a link to a Curtis image.</li>
<li><strong>name (author, cite, formatted)</strong> &#8211; Three properties breaking down the name</li>
<li><strong>type -</strong> The type citation string for the accepted name of the taxon.</li>
<li><strong>note</strong> &#8211; All sorts of things in here! Almost all accounts have some comment varying from &#8220;Known only from type collection&#8221; to several paragraphs of text. May include derivation of name. May occur multiple times  for a species.</li>
<li><strong>occursInCountryIso</strong> &#8211; this was extracted by simply looking for country names in the distribution field</li>
<li><strong>rank -</strong> the database contains facts about the subspecies, varieties and even forma that occur in the monographs (see below)</li>
<li><strong>subgenus</strong> &#8211; a single word indicating which subgenus the species is in. Subgenera in Rhododendron could be thought of as genera &#8230;</li>
<li><strong>synonyms</strong> &#8211; this is a block of text representing all the names, types and citations that came in the synonyms paragraph.</li>
</ul>
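<p>The &#8220;very simple data mining&#8221; behind the altitude and occursInCountryIso properties could look something like the Python sketch below. The country lookup is a tiny invented subset, the regular expressions are guesses at typical phrasing, and the sample sentences are made up; none of it is taken from the monographs.</p>

```python
import re

# Hypothetical lookup; a real one would cover every ISO 3166-1 country name.
COUNTRY_ISO = {"Thailand": "TH", "China": "CN", "Vietnam": "VN", "Nepal": "NP"}

# A range such as "1200-1800 m" (hyphen or en dash) or a single "c. 2000 m".
ALT_RANGE = re.compile(r"(\d{3,4})\s*[-\u2013]\s*(\d{3,4})\s*m\b")
ALT_CIRCA = re.compile(r"(?:c\.|circa)\s*(\d{3,4})\s*m\b")


def countries_in(distribution):
    """ISO codes for every known country name found in the distribution text."""
    return sorted(iso for name, iso in COUNTRY_ISO.items() if name in distribution)


def altitude_in(text):
    """Extract min/max for a range, 'around' for a circa value, else nothing."""
    match = ALT_RANGE.search(text)
    if match:
        return {"min": int(match.group(1)), "max": int(match.group(2))}
    match = ALT_CIRCA.search(text)
    if match:
        return {"around": int(match.group(1))}
    return {}


print(countries_in("N Thailand (Chiang Mai), SW China (Yunnan)."))
print(altitude_in("Epiphytic in montane forest, 1200-1800 m."))
print(altitude_in("Open ridges, c. 2000 m."))
```

<p>Plain substring matching like this will miss historical or variant country names, so the resulting tags are a starting point for clean up rather than finished data.</p>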
<p>To get these properties into EoL I need to squeeze them into the <a href="http://eol.org/info/create_xml">EoL Transfer Schema</a>. (Here I need to make a declaration of interest in that I think I was in on the original design of this at a workshop at GBIF a few years ago.) The basic structure is like this:</p>
<ul>
<li>Document
<ul>
<li>Taxon
<ul>
<li>Name</li>
<li>Source</li>
<li>Other Metadata&#8230;</li>
<li>DataObject
<ul>
<li>Type</li>
<li>Source</li>
<li>Other metadata&#8230;</li>
<li>Value</li>
</ul>
</li>
<li>DataObject
<ul>
<li>..</li>
</ul>
</li>
</ul>
</li>
<li>Taxon
<ul>
<li>&#8230;</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>So a document contains a number of taxa and each taxon contains some metadata plus a number of DataObjects. Each DataObject is of a &#8216;type&#8217; and has its own metadata plus a value of some kind &#8211; such as text or a link to an object. This is a very generic data structure that allows for expansion by adding new types of DataObject.</p>
<p>All I need to do is hack together a PHP script to map my properties to the DataObject types and I can go back to trying to clean up the data. This is what the types look like:</p>
<blockquote><p>Associations, Behaviour, Biology, Conservation, ConservationStatus, Cyclicity, Cytology, Description, DiagnosticDescription, Diseases, Dispersal, Distribution, Ecology, Evolution, GeneralDescription, Genetics, Growth, Habitat, Key, Legislation, LifeCycle, LifeExpectancy, LookAlikes, Management, Migration, MolecularBiology, Morphology, Physiology, PopulationBiology, Procedures, Reproduction, RiskStatement, Size, TaxonBiology, Threats, Trends, TrophicStrategy, Uses</p></blockquote>
<p>These map to subject types on taxon pages within EoL. There is a description of these <a href="http://eol.org/info/toc_subjects">on the EoL help pages</a>.</p>
<p>This is where I run into a problem. My properties don&#8217;t map to these subject types. The only matches I really have are Distribution, Description and Habitat. The advice is to put &#8220;note&#8221; type data under &#8220;Description&#8221; so probably 90% of what I have goes into &#8220;Description&#8221; DataObjects. Why have I just spent the last umpteen weeks marking all this stuff up?</p>
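<p>The mismatch is easy to see if the mapping is written down. The sketch below is in Python rather than the PHP mentioned above and only illustrates the idea. The three direct matches and the advice about notes come from the text; treating every other text property as &#8220;Description&#8221; is an assumption, and the dictionary keys and sample facts are invented.</p>

```python
# Properties with an obvious EoL subject type, as named in the post.
DIRECT = {
    "distribution": "Distribution",
    "description": "Description",
    "habitat": "Habitat",
}


def eol_type(prop):
    """Subject type for a property; anything unmatched falls back to Description (assumed)."""
    return DIRECT.get(prop, "Description")


def to_data_objects(facts, source):
    """Turn (property, value) pairs into DataObject-like dictionaries."""
    return [
        {"type": eol_type(prop), "source": source, "value": value}
        for prop, value in facts
    ]


# Invented facts for a single taxon.
facts = [
    ("habitat", "Epiphytic in montane forest."),
    ("note", "Known only from type collection."),
    ("synonyms", "A block of synonym text."),
]

for data_object in to_data_objects(facts, "monograph-1"):
    print(data_object["type"], "|", data_object["value"])
```

<p>Two of the three sample facts end up as &#8220;Description&#8221;, which is the same collapse of carefully marked up fields described above.</p>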
<p>There are interesting and important questions here:</p>
<ul>
<li>How important is semantic mark up of this kind of data? What advantages are gained over just treating each species treatment as a single block of text? I could still pull out all the ones that occur in China etc.</li>
<li>If the EoL Subject Types are a list of the kinds of information people want to see on species pages and they don&#8217;t match the data that is captured in a monograph should we continue to produce monographs in their current form? Who is driving the production of data, the users or tradition?</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.hyam.net/blog/archives/1498/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>HTML5 Geolocation Data Sucks</title>
		<link>http://www.hyam.net/blog/archives/1432</link>
		<comments>http://www.hyam.net/blog/archives/1432#comments</comments>
		<pubDate>Fri, 02 Sep 2011 11:44:58 +0000</pubDate>
		<dc:creator><span property="dc:creator" resource="http://www.hyam.net/blog/archives/1432">Roger Hyam</span></dc:creator>
				<category><![CDATA[Biodiversity Informatics]]></category>
		<category><![CDATA[Misc]]></category>
		<category><![CDATA[Technolust]]></category>

		<guid isPermaLink="false">http://www.hyam.net/blog/?p=1432</guid>
		<description><![CDATA[I have long been excited about HTML5 having access to geolocation data. It should make it possible to build a whole range of applications for phones and other devices that are cross platform but make use of the user&#8217;s location. Unfortunately reality bites when you try and actually build an application based on the <a href='http://www.hyam.net/blog/archives/1432'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hyam.net/blog/wp-content/uploads/2011/09/geotest.png"><img class="alignleft size-full wp-image-1433" style="border: 1px solid black;" title="geotest" src="http://www.hyam.net/blog/wp-content/uploads/2011/09/geotest.png" alt="" width="253" height="313" /></a>I have long been excited about HTML5 having access to geolocation data. It should make it possible to build a whole range of applications for phones and other devices that are cross platform but make use of the user&#8217;s location. Unfortunately reality bites when you try and actually build an application based on the technology.</p>
<p>I have been working with <a href="http://www.sencha.com/products/touch/">Sencha Touch</a> and the Ext.util.Geolocation object but am having problems with accuracy. I have noted the following behaviour.</p>
<p>When I call for a location on iPhone (3G) and iPad (v1) I get one with around 1.3km accuracy. Basically it places me at one of two spots about 1km apart. If I switch to the native maps app then it places my position within 10m of where I am standing &#8211; that &#8220;wow it knows where I am&#8221; accuracy. Switch back to my web app and the first call to the GeoLocation returns similar accuracy. Any subsequent calls return the old inaccurate positions.<span id="more-1432"></span></p>
<p>I have tested this briefly on Android (Google&#8217;s first phone) and have similar behaviour.</p>
<p><a href="http://www.hyam.net/geotest/"><img class="alignright size-full wp-image-1434" title="geotest_qr" src="http://www.hyam.net/blog/wp-content/uploads/2011/09/geotest_qr.png" alt="" width="248" height="248" /></a>I am setting allowHighAccuracy=true and I have experimented with different age and time out durations to no effect. I have also tried calling the underlying JavaScript methods with similar results so I don&#8217;t believe it is the Touch libraries &#8211; but would welcome your thoughts.</p>
<p>My conclusion is that both iOS and Android only pass cell tower level location accuracy to the browser and do not ever use GPS &#8211; at least not in an urban environment. Basically enableHighAccuracy from the <a href="http://www.w3.org/TR/geolocation-API/">Geolocation API spec</a> is ignored. Is this a correct assumption? I hope it isn&#8217;t because it effectively cripples the HTML5+GPS application market.</p>
<p>I wrote <a href="http://www.hyam.net/geotest/">a test application</a> that you can use to check this behaviour for yourself. Remember this is a mobile app. You can test it in Chrome or Safari on the desktop but will need to load it on your phone and be <strong>outside</strong> to benefit from GPS! Dink the QR Code with your phone and take a stroll if you are reading this indoors.</p>
<p>The application allows you to either poll the location by tapping the &#8220;Update Location&#8221; button or ask the browser to continuously update your location using autoUpdate. The maximum age parameter is hardwired to zero so any requests should attempt to fetch new location data. The time out is set to 20 seconds. You will find that if you keep punching the &#8220;Update Location&#8221; button when it isn&#8217;t returning new locations you will just get a series of time out alerts 20 seconds later.</p>
<p>Finally the &#8220;Show Map&#8221; button will launch the maps app on iOS devices and put a place marker on where HTML5 thinks you are. You can then compare it with where the maps app thinks you are. They are often different immediately or, after a few seconds, the maps app will move your current location to a far more accurate spot as the GPS kicks in. On Android the behaviour of the &#8220;Show Map&#8221; button is less obvious as it doesn&#8217;t always appear to open the maps app and sometimes just opens the Google Maps application. You will have to open the maps app manually.</p>
<p>I find it amazing that there are so many posts out there raving about location info in the browser yet when I try and use it I find it actually sucks. Maybe OK for finding your nearest Starbucks but not a lot else. If you have read <a href="http://www.amazon.com/Inmates-Are-Running-Asylum-Products/dp/0672326140"><em>The Inmates Are Running the Asylum</em></a> by Alan Cooper then you will be familiar with the notion of &#8220;Dancing Bear Software&#8221; &#8211; it is amazing the bear can dance, what a shame it doesn&#8217;t dance very well!</p>
<p>I do hope I am doing something really really dumb here and am totally wrong. Please look at the code in my app page and tell me so.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.hyam.net/blog/archives/1432/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Extracting Data From the Rhododendron Monographs</title>
		<link>http://www.hyam.net/blog/archives/1352</link>
		<comments>http://www.hyam.net/blog/archives/1352#comments</comments>
		<pubDate>Tue, 09 Aug 2011 12:41:05 +0000</pubDate>
		<dc:creator><span property="dc:creator" resource="http://www.hyam.net/blog/archives/1352">Roger Hyam</span></dc:creator>
				<category><![CDATA[Biodiversity Informatics]]></category>
		<category><![CDATA[Rhododendron Monographs]]></category>

		<guid isPermaLink="false">http://www.hyam.net/blog/?p=1352</guid>
		<description><![CDATA[This post deals with the semantics of extraction of data from the Rhododendron monographs. Another post will deal with the technicalities of the actual extraction. The image above shows a species description entry. It was chosen as being a small and simple example for illustrative purposes. I have marked up the bits I am interested <a href='http://www.hyam.net/blog/archives/1352'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p style="text-align: center;"><a href="http://www.hyam.net/blog/wp-content/uploads/2011/08/rhodo_page.png"><img class="aligncenter size-large wp-image-1353" style="border: 1px solid black;" title="rhodo_page" src="http://www.hyam.net/blog/wp-content/uploads/2011/08/rhodo_page-1024x580.png" alt="" width="695" height="393" /></a></p>
<p>This post deals with the semantics of extraction of data from the <em>Rhododendron</em> monographs. Another post will deal with the technicalities of the actual extraction.</p>
<p>The image above shows a species description entry. It was chosen as being a small and simple example for illustrative purposes. I have marked up the bits I am interested in extracting. Red indicates important fields, blue unimportant and yellow something in between &#8211; but why those bits and those priorities? Monographs contain a great deal of other stuff such as keys and descriptions of higher taxa and discussions. We could argue for hours about what should be extracted and never come to a conclusion unless we have some guiding principles on what we are trying to do. I have therefore developed five guiding principles for the project that are probably pretty general and may be applicable to other such projects:<span id="more-1352"></span></p>
<ol>
<li><strong>We are extracting data NOT trying to create a digital facsimile of the document.</strong> We already have the document in its physical form, as page images, as OCR&#8217;d text and as a PDF combining the two. We can read the document whenever we need to and it is highly unlikely we will want to edit it like a word processor file. We don&#8217;t therefore need to capture anything that is to do with document layout such as fonts, line breaks, paragraphs and section orders.</li>
<li><strong>We are NOT trying to extract ALL the data in one pass. </strong>The process is not destructive and we can always return at a later date. We will probably be better at interpreting the document in the future and we may have a different perspective on how to interpret it. We should only pull out what we need today not what we think we might need at some point in the future &#8211; but see point 5.</li>
<li><strong>Use-case driven.</strong> We should have some use-case (or story) about how the data will be used. This is the basis on which we can make all the smaller decisions.</li>
<li><strong>Opportunistic.</strong> If it really is free then we will have it! Points 2 &amp; 3 should not prevent us from capturing data that comes for free as a by-product of the process. We should be careful we are not kidding ourselves though and drop anything that is taking time but does not contribute to the use-case.</li>
<li><strong>Provenance should only be to a useful point in the document. </strong>It is easy to get carried away with capturing provenance metadata about where in the document a piece of information comes from and what has happened to it since then. In the example above we could capture that the fact <em>R. johnstoneanum</em> occurs in <strong>India</strong> is defined by the first word of the fifth paragraph of the species account, rather than having been extracted from a map or the type locality, but before we know it we are effectively recreating the document (point 1). Instead we merely tag data with the species account (document section) the information comes from and the page on which the account occurs. If someone wants to track down where a fact came from they can be taken to the relevant section of the document and read it for themselves.</li>
</ol>
<p>If these are general principles what&#8217;s the use-case for this project, what do we actually want to do today?</p>
<ol>
<li>Create species page data for Encyclopedia of Life.</li>
<li>Data should be of biological interest &#8211; about a species of organism not about the processes of taxonomy and nomenclature. It is taken as read that the nomenclature has been sorted out by the experts and we are just dealing with the products of that process.</li>
<li>We are only interested in &#8216;species&#8217; &#8211; or basic units of diversity &#8211; and so not in capturing higher levels of taxonomy such as groups, subsections, sections and subgenera.</li>
<li>We are interested in tagging the species with facts that can be extracted e.g. occurs in India; Habit Shrub; Habit Tree; etc &#8211; these are largely opportunistic.</li>
</ol>
<p>The result is to extract the following fields of data for each species account:</p>
<ul>
<li><strong>altitude-around</strong> = When the description contains a single altitude value in meters, often with a circa</li>
<li><strong>altitude-max</strong> = The max altitude in meters</li>
<li><strong>altitude-min</strong> = The minimum altitude in meters</li>
<li><strong>description</strong> = The bulk of the descriptive text &#8211; possibly for further data extraction later</li>
<li><strong>distribution</strong> = A fairly uniform string describing the country and province distribution</li>
<li><strong>habitat</strong> = The habitat string &#8211; this is very variable</li>
<li><strong>name</strong> = The actual name expanded to include <em>Rhododendron</em> rather than R. but without the author string</li>
<li><strong>name-author</strong> = The author string</li>
<li><strong>name-cite</strong> = The protolog (original place of publication of the name)</li>
<li><strong>name-type</strong> = A description of where the type is located</li>
<li><strong>note</strong> = Most species have one or two paragraphs of notes</li>
<li><strong>volume-part-page</strong> = A pointer to the page the species account starts on</li>
<li><strong>icon-ref</strong> = A reference to an image of the specimen</li>
<li><strong>synonyms</strong> = All the synonymy as a single block of text</li>
</ul>
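<p>As a rough illustration only, the field list above could be held in a record like the one below. This is a sketch, not the format used in the project: the class name, the Python types, the choice of which fields are required and the placeholder page pointer are all my assumptions. Only the field names (with hyphens turned into underscores) come from the list.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpeciesAccount:
    """One extracted species account (hypothetical container, field names from the list)."""
    name: str                               # expanded name, without the author string
    volume_part_page: str                   # pointer to the page the account starts on
    name_author: Optional[str] = None       # the author string
    name_cite: Optional[str] = None         # protolog
    name_type: Optional[str] = None         # where the type is located
    synonyms: Optional[str] = None          # all the synonymy as one block of text
    description: Optional[str] = None       # bulk of the descriptive text
    distribution: Optional[str] = None      # country and province string
    habitat: Optional[str] = None           # very variable free text
    altitude_min: Optional[int] = None      # metres
    altitude_max: Optional[int] = None      # metres
    altitude_around: Optional[int] = None   # single "circa" value, metres
    icon_ref: Optional[str] = None          # reference to an illustration
    note: Optional[str] = None              # notes paragraphs


# Illustrative values only; the page pointer is a placeholder, not real data.
account = SpeciesAccount(
    name="Rhododendron johnstoneanum",
    volume_part_page="(placeholder)",
    distribution="India",
)
```

<p>Leaving most fields optional fits principle 2: a record can start sparse and be filled in on a later pass.</p>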
<p>I&#8217;ll talk about the mechanism of doing the extraction in another post.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.hyam.net/blog/archives/1352/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Links To All Curtis Botanical Magazine Illustrations in BHL</title>
		<link>http://www.hyam.net/blog/archives/1341</link>
		<comments>http://www.hyam.net/blog/archives/1341#comments</comments>
		<pubDate>Mon, 08 Aug 2011 09:57:46 +0000</pubDate>
		<dc:creator><span property="dc:creator" resource="http://www.hyam.net/blog/archives/1341">Roger Hyam</span></dc:creator>
				<category><![CDATA[Biodiversity Informatics]]></category>
		<category><![CDATA[Rhododendron Monographs]]></category>

		<guid isPermaLink="false">http://www.hyam.net/blog/?p=1341</guid>
		<description><![CDATA[This is a sideline to my working on the Edinburgh Rhododendron monographs. The monographs often quote references to illustrations (icons) of species. This is useful as we know that these are illustrations that have been determined by the author of the account and are therefore &#8220;correctly&#8221; determined. What a shame we only have an abbreviated <a href='http://www.hyam.net/blog/archives/1341'>[...]</a>]]></description>
			<content:encoded><![CDATA[<div id="attachment_1343" class="wp-caption alignleft" style="width: 308px"><a href="http://www.hyam.net/blog/wp-content/uploads/2011/08/curtis.jpg"><img class="size-medium wp-image-1343" title="curtis" src="http://www.hyam.net/blog/wp-content/uploads/2011/08/curtis-518x640.jpg" alt="" width="298" height="369" /></a><p class="wp-caption-text">William Curtis (1746-1799)</p></div>
<p>This is a sideline to my working on the Edinburgh Rhododendron monographs.</p>
<p>The monographs often quote references to illustrations (icons) of species. This is useful as we know that these are illustrations that have been determined by the author of the account and are therefore &#8220;correctly&#8221; determined. What a shame we only have an abbreviated text string that can really only be understood by a human. An example might be &#8220;Rhododendron &amp; Camellia Yearbook 25: f.58 (1970)&#8221;. Because these are in the botanical monographic style it is near impossible even to turn them into an <a href="http://en.wikipedia.org/wiki/OpenURL">OpenURL </a>that a resolver could make sense of &#8211; so we have a bit of a challenge.</p>
<p>For the just-under-four-hundred species accounts I have extracted from the first two monographs I have 445 icon strings. Of these 144 contain &#8216;Bot. Mag.&#8217; &#8211; for <a href="http://en.wikipedia.org/wiki/Curtis's_Botanical_Magazine">Curtis&#8217;s Botanical Magazine</a> &#8211; and so they look like a good set to try and parse and link up. The <a href="http://www.biodiversitylibrary.org/">Biodiversity Heritage Library</a> have digitized the portion of <a href="http://www.biodiversitylibrary.org/item/91677#page/1/mode/2up">Bot. Mag. prior to 1920</a> that is out of copyright, thanks to the <a href="http://www.mobot.org/">Missouri Botanical Garden</a>. I just need to join it all up. In fact I could download the relevant images and embed them in my data because they are out of copyright.</p>
<p>So a happy afternoon was spent learning about the BHL API and writing XSLT and regular expressions to parse the strings I had. The result was a match up of just 59 illustrations. About the same number I could have done manually in an afternoon! The rest of my Bot. Mag. references are post 1920 and so locked up in copyright.</p>
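<p>To give a flavour of the parsing step, here is a minimal sketch in Python rather than the XSLT actually used. It assumes a Bot. Mag. icon string has the shape &#8220;Bot. Mag. volume: t.plate (year)&#8221;, which is a guess at the common pattern and not a description of every string in the data. The function names and the sample values are made up for illustration.</p>

```python
import re

# Assumed shape of a citation: "Bot. Mag. 100: t.1234 (1900)"
# captured groups: volume, plate (tab.) number, year
BOT_MAG = re.compile(r"Bot\.\s*Mag\.\s*(\d+)\s*:\s*t\.\s*(\d+)\s*\((\d{4})\)")


def parse_bot_mag(icon_string):
    """Return (volume, plate, year) as ints, or None if the string does not match."""
    match = BOT_MAG.search(icon_string)
    if match is None:
        return None
    volume, plate, year = (int(group) for group in match.groups())
    return volume, plate, year


def may_be_in_bhl(year):
    """Only the volumes prior to 1920 are out of copyright and so in BHL."""
    return year < 1920
```

<p>Strings that do not fit the assumed pattern simply return None and would need handling by hand or with further expressions.</p>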
<p>But a happy by-product of the process was the fact that I downloaded and parsed all the metadata for Bot. Mag. in BHL and extracted the item IDs (books) and page IDs for what I believe are all the illustrations &#8211; a total of 8,215. So if you are faced with the same issue as me you don&#8217;t have to go to the bother of doing it. Here is a CSV file of the full list.</p>
<p><a href="http://www.hyam.net/blog/wp-content/uploads/2011/08/all_curtis.csv">All Curtis Illustrations In BHL (CSV)</a></p>
<p>I have included the URLs to the resources in BHL although these are just trivial concatenations of the page IDs or item IDs and an http prefix.<span id="more-1341"></span></p>
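<p>The concatenation amounts to something like the sketch below. The item form matches the BHL link earlier in this post; the page form is my assumption about the matching prefix, and the function names are invented for the example.</p>

```python
BHL = "http://www.biodiversitylibrary.org"


def item_url(item_id):
    """URL of a whole item (book), as in the link to item 91677 above."""
    return f"{BHL}/item/{item_id}"


def page_url(page_id):
    """URL of a single page; the /page/ prefix is assumed here."""
    return f"{BHL}/page/{page_id}"
```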
<div id="attachment_1346" class="wp-caption alignright" style="width: 258px"><a href="http://www.hyam.net/blog/wp-content/uploads/2011/08/470082.jpg"><img class="size-medium wp-image-1346" title="470082" src="http://www.hyam.net/blog/wp-content/uploads/2011/08/470082-381x640.jpg" alt="" width="248" height="417" /></a><p class="wp-caption-text">Rhododendron neriiflorum</p></div>
<p>Unfortunately the names of the taxa are not included as I was solving the problem of getting from a citation to an image &#8211; I already had the name. It would be tempting to try and calculate the names for each of the illustrations but I can&#8217;t justify doing this right now so it is an exercise left to the reader.  The problem is the illustration in Curtis may come before or after the text on the species although on the plus side there is only one species per page. What I would try doing is:</p>
<ul>
<li>Call for the OCR of the BHL page <strong>ids</strong> immediately before and after the page id of the illustration.</li>
<li>See if the first few lines contain the page <strong>number</strong> (plate or tab. number) of the illustration.</li>
<li>If they do then we have found the page for the plate, so use Taxon Finder to extract the names from that page (there may be a BHL API call for this).</li>
</ul>
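<p>The first two steps above can be sketched roughly as follows. This is untested against BHL itself: <code>fetch_ocr</code> is a hypothetical callable standing in for whatever returns the OCR text of a page id, and I assume the page ids of an item are available as an ordered list. The name-finding step is left out.</p>

```python
import re


def find_text_page(page_ids, illustration_id, plate_number, fetch_ocr, head_lines=5):
    """Return the id of the neighbouring page whose first lines mention the plate number.

    page_ids is the ordered list of page ids for the item (assumed available);
    fetch_ocr is a hypothetical function mapping a page id to its OCR text.
    """
    index = page_ids.index(illustration_id)
    neighbours = []
    if index > 0:
        neighbours.append(page_ids[index - 1])
    if index + 1 < len(page_ids):
        neighbours.append(page_ids[index + 1])

    # Match the plate number as a whole number, so 42 does not match 420.
    pattern = re.compile(rf"\b{plate_number}\b")
    for page_id in neighbours:
        head = "\n".join(fetch_ocr(page_id).splitlines()[:head_lines])
        if pattern.search(head):
            return page_id
    return None
```

<p>Passing the OCR lookup in as a parameter keeps the logic testable with a plain dictionary before any web service is involved.</p>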
<p>I am still left with 85 Bot. Mag. references I can&#8217;t link to anything because Bot. Mag. is safely locked away <a href="http://www.kew.org/about-kew/kew-publishing/journals/curtis-botanical/index.htm">behind copyright at Kew and Blackwell Publishing</a>. Wouldn&#8217;t it be nice if they created a web page for every illustration that contained at least a low resolution version?</p>
<p>I hope the attached file is of some use to someone and also that no one points out I could have done this more quickly and easily by some other route. If you use it please post a comment &#8211; thanks.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.hyam.net/blog/archives/1341/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Rhodo Monographs &#8211; Clean Up 1</title>
		<link>http://www.hyam.net/blog/archives/1327</link>
		<comments>http://www.hyam.net/blog/archives/1327#comments</comments>
		<pubDate>Tue, 26 Jul 2011 11:14:38 +0000</pubDate>
		<dc:creator><span property="dc:creator" resource="http://www.hyam.net/blog/archives/1327">Roger Hyam</span></dc:creator>
				<category><![CDATA[Biodiversity Informatics]]></category>
		<category><![CDATA[Rhododendron Monographs]]></category>

		<guid isPermaLink="false">http://www.hyam.net/blog/?p=1327</guid>
		<description><![CDATA[The first two parts of the monograph to be looked at were published in Notes from the Royal Botanic Garden Edinburgh &#8211; the house journal of the gardens until 1990. Cullen, J. (1980) Revision of Rhododendron. I. subgenus Rhododendron sections Rhododendron and Pogonanthum. Notes from the Royal Botanic Garden Edinburgh. 39:1-207. Chamberlain, D.F. (1982) A <a href='http://www.hyam.net/blog/archives/1327'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p>The first two parts of the monograph to be looked at were published in <em>Notes from the Royal Botanic Garden Edinburgh</em> &#8211; the house journal of the <a href="http://www.rbge.org.uk/">gardens</a> until 1990.</p>
<ul>
<li><strong>Cullen, J. (1980)</strong> Revision of Rhododendron. I. subgenus Rhododendron sections Rhododendron and Pogonanthum. Notes from the Royal Botanic Garden Edinburgh. 39:1-207.</li>
<li><strong>Chamberlain, D.F. (1982)</strong> A revision of Rhododendron. II. Subgenus Hymenanthes. Notes from the Royal Botanic Garden Edinburgh. 39:209-486.</li>
</ul>
<p>Between them these publications cover 544 species &#8211; more or less half the genus.</p>
<p>The entire run of the <em>Notes</em> has now been digitized to page images for <a href="http://www.bhl-europe.eu/">BHL-Europe</a> and so I have access to good quality pictures of the text. We have an in-house OCR service that I can drop these images into to create text or other outputs. I started by dropping all 200+ images from the first publication into the OCR and creating 200+ text files but this didn&#8217;t make sense because many of the species accounts ran across multiple pages. What I needed was the contiguous text for the whole publication. I could have concatenated the text files but I figured the OCR software would do a better job if it was working through one big document as it would learn from previous pages &#8211; OK maybe this is fantasy but it is worth a try. By using <strong>Preview</strong> (the Mac&#8217;s default PDF and image viewer) I created a single PDF containing all the images and put that through the OCR processor. The result was not only a single text file but also a PDF of the whole publication including OCR&#8217;d text. Job done! Can I stop now?</p>
<p>This process showed how easy it is to create digital versions of publications. The PDFs produced are not very friendly, being almost 100 MB in size for each of the two publications, but they can be read online and indexed so do fulfill the basic requirements of making &#8216;legacy&#8217; publications available. Because of their size I do <strong>not</strong> attach the PDFs here.</p>
<p>Two points jump to mind:</p>
<ol>
<li>The accuracy of the OCR is masked because the text is hidden behind the page images. Although the document is searchable, when a search term is not found we can&#8217;t be sure whether that is because it isn&#8217;t there or because the OCR failed for that word in that location. This digitization process is likely to engender a false sense of security.</li>
<li>The PDFs of the publications do not enable re-mixing or querying of the data beyond simple text searching. Questions like &#8220;What species occur in Yunnan, China?&#8221; can only be answered by working through the text manually &#8211; something that might be quicker with the printed version.</li>
</ol>
<p>Making text available to read online is useful in that it facilitates distribution and discovery of that text but that is all it does.</p>
<p>The next step is to try and turn what is basically a descriptive narrative into more useful information that can be used to answer the simple questions people are likely to ask about biodiversity. At the least it has to be massaged into a set of web pages, one for each species, for use in EOL. There are two aspects to this process:</p>
<ul>
<li><strong>Syntax</strong> &#8211; this is really the easy bit although time consuming. The text of the monograph has a particular syntax &#8211; ordering of characters into words and sentences. We need to mark up the document with another syntax that will allow a machine to extract chunks of information. This isn&#8217;t too difficult to do at a coarse level because the monographs are highly structured but it becomes harder the more finely granular the syntax becomes. It inevitably involves a lot of manual work and I&#8217;ll cover it in another post.</li>
<li><strong>Semantics</strong> &#8211; this involves tougher decisions but isn&#8217;t that time consuming. We need to decide what chunks of information in the document we <strong>want to extract</strong> and what chunks we <strong>can practically extract</strong> and reach some kind of compromise. Different chunks of text can be seen in the document. Some of these chunks have no biological meaning at all e.g. a page or a paragraph. Others have useful biological meaning e.g. a distribution string like &#8220;NE Burma, China (Yunnan, Sichuan, W Guizhou)&#8221; in the context of a species description. The decisions made about what to extract will affect the syntax used and how long it will take to impose that syntax on the raw text of the document. Making these decisions will be the subject of <a href="http://www.hyam.net/blog/archives/1352">another blog post</a>.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.hyam.net/blog/archives/1327/feed</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>EOL Rhododendron Monographs &#8211; Getting Started</title>
		<link>http://www.hyam.net/blog/archives/1317</link>
		<comments>http://www.hyam.net/blog/archives/1317#comments</comments>
		<pubDate>Mon, 25 Jul 2011 13:58:08 +0000</pubDate>
		<dc:creator><span property="dc:creator" resource="http://www.hyam.net/blog/archives/1317">Roger Hyam</span></dc:creator>
				<category><![CDATA[Biodiversity Informatics]]></category>
		<category><![CDATA[Rhododendron Monographs]]></category>

		<guid isPermaLink="false">http://www.hyam.net/blog/?p=1317</guid>
		<description><![CDATA[The Royal Botanic Garden Edinburgh has a history of research into the genus Rhododendron stretching back over 100 years. The legacy of this work is a herbarium that contains many type specimens, an amazing living collection and a set of monographs that cover the whole genus. My contribution back in the 1990&#8242;s was via my <a href='http://www.hyam.net/blog/archives/1317'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.rbge.org.uk/">Royal Botanic Garden Edinburgh</a> has a history of research into the genus <em>Rhododendron</em> stretching back over 100 years. The legacy of this work is a herbarium that contains many type specimens, an <a href="http://www.rbge.org.uk/the-gardens/rhododendrons">amazing living collection</a> and a set of monographs that cover the whole genus. My contribution back in the 1990s was via <a href="http://www.hyam.net/blog/archives/914">my PhD thesis</a> which looked at the use of emerging molecular techniques.</p>
<p>The bulk of the work done on <em>Rhododendron</em> occurred just before the digital age kicked in and so the material is not integrated in a way that can be re-used and re-purposed. An example of this is what could be called <strong>The Edinburgh <em>Rhododendron</em> Monograph</strong> which covers 1,027 recognized species. This is actually spread over seven publications that came out over the course of 26 years in two journals and a book and is not available in a single form anywhere. The publications are:</p>
<ul>
<li><strong>Cullen (1980)</strong> Subgenus: <em>Rhododendron</em> Sections: <em>Rhododendron</em> &amp; <em>Pogonanthum</em> 231 species</li>
<li><strong>Argent (2006)</strong> Subgenus: <em>Rhododendron</em> Section: <em>Vireya</em> 313 species</li>
<li><strong>Chamberlain (1982)</strong> Subgenus: <em>Hymenanthes</em> Section: <em>Ponticum</em> 302 species</li>
<li><strong>Chamberlain &amp; Rae (1990)</strong> Subgenus: <em>Tsutsusi</em> 117 species</li>
<li><strong>Kron (1993)</strong> Subgenus: <em>Pentanthera</em> Section: <em>Pentanthera</em> 23 species</li>
<li><strong>Judd &amp; Kron (1995)</strong> Subgenus: <em>Pentanthera</em> Sections: <em>Rhodora</em>, <em>Viscidula</em> &amp; <em>Sciadorhodion</em> 7 species</li>
<li><strong>Philipson &amp; Philipson (1986)</strong> Subgenera: <em>Azaleastrum</em>, <em>Therorhodion</em>, <em>Mumeazalea</em> &amp; <em>Candidastrum</em> 34 species</li>
</ul>
<p>Last year I was fortunate to be awarded an <a href="http://www.eol.org/">Encyclopedia of Life</a> &#8211; <a href="http://www.eol.org/content/page/172">Rubenstein Fellowship</a> to create a species page in the encyclopedia for each of the species covered by the Edinburgh monograph &#8211; the text of all seven publications now being available electronically in various forms. The award funds me for a total of 100 days to process the OCR&#8217;d or PDF text into the EOL transfer format and to link it in to as much additional data as possible. I hope to blog my experiences good and bad.</p>
<h2>References</h2>
<ul>
<li>Argent, G. (2006). Rhododendrons of subgenus Vireya. Royal Horticultural Society, London.</li>
<li>Chamberlain, D.F. (1982). A revision of Rhododendron. II. Subgenus Hymenanthes. Notes from the Royal Botanic Garden Edinburgh. 39:209-486.</li>
<li>Chamberlain, D.F. &amp; Rae, S.J. (1990). A revision of Rhododendron. IV. Subgenus Tsutsusi. Edinburgh Journal of Botany. 47(2):89-200.</li>
<li>Cullen, J. (1980). Revision of Rhododendron. I. subgenus Rhododendron sections Rhododendron and Pogonanthum. Notes from the Royal Botanic Garden Edinburgh. 39:1-207.</li>
<li>Judd, W.S. &amp; Kron, K.A. (1995). A revision of Rhododendron sections Sciadorhodion, Rhodora and Viscidula. Edinburgh Journal of Botany. 52:1-54.</li>
<li>Kron, K.A. (1993). A revision of Rhododendron section Pentanthera. Edinburgh Journal of Botany. 50:249-364.</li>
<li>Philipson, W.R. &amp; Philipson, M.N. (1986). A revision of Rhododendron. III subgenera Azaleastrum, Mumeazalea, Candidastrum and Therorhodion. Notes from the Royal Botanic Garden Edinburgh. 44:1-23.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://www.hyam.net/blog/archives/1317/feed</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Digitisation of European Collections By Country</title>
		<link>http://www.hyam.net/blog/archives/1313</link>
		<comments>http://www.hyam.net/blog/archives/1313#comments</comments>
		<pubDate>Tue, 28 Jun 2011 15:13:53 +0000</pubDate>
		<dc:creator><span property="dc:creator" resource="http://www.hyam.net/blog/archives/1313">Roger Hyam</span></dc:creator>
				<category><![CDATA[Biodiversity Informatics]]></category>
		<category><![CDATA[Synthesys]]></category>

		<guid isPermaLink="false">http://www.hyam.net/blog/?p=1313</guid>
		<description><![CDATA[Following on from my previous post, European Natural History Collections &#8211; What&#8217;s Missing, it is simple to create a ranked list of countries and an estimate of the number of specimen records they have in GBIF. Countries with a score of zero don&#8217;t appear in the list. Country Specimens Spain 2,277,428 France 2,081,208 United Kingdom <a href='http://www.hyam.net/blog/archives/1313'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p>Following on from my previous post, <a href="http://www.hyam.net/blog/archives/1277">European Natural History Collections &#8211; What&#8217;s Missing</a>, it is simple to create a ranked list of countries and an estimate of the number of specimen records they have in GBIF. Countries with a score of zero don&#8217;t appear in the list.</p>
<table border="0" cellspacing="0" frame="VOID" rules="NONE">
<colgroup>
<col width="110"></col>
<col width="88"></col>
</colgroup>
<tbody>
<tr>
<td width="110" height="17" align="LEFT"><strong>Country</strong></td>
<td style="text-align: right;" width="88" align="RIGHT"><strong>Specimens</strong></td>
</tr>
<tr>
<td height="17" align="LEFT">Spain</td>
<td align="RIGHT">2,277,428</td>
</tr>
<tr>
<td height="17" align="LEFT">France</td>
<td align="RIGHT">2,081,208</td>
</tr>
<tr>
<td height="17" align="LEFT">United Kingdom</td>
<td align="RIGHT">1,038,133</td>
</tr>
<tr>
<td height="17" align="LEFT">Germany</td>
<td align="RIGHT">747,159</td>
</tr>
<tr>
<td height="17" align="LEFT">Poland</td>
<td align="RIGHT">304,798</td>
</tr>
<tr>
<td height="17" align="LEFT">Netherlands</td>
<td align="RIGHT">263,655</td>
</tr>
<tr>
<td height="17" align="LEFT">Belgium</td>
<td align="RIGHT">217,417</td>
</tr>
<tr>
<td height="17" align="LEFT">Denmark</td>
<td align="RIGHT">161,163</td>
</tr>
<tr>
<td height="17" align="LEFT">Slovenia</td>
<td align="RIGHT">160,757</td>
</tr>
<tr>
<td height="17" align="LEFT">Finland</td>
<td align="RIGHT">146,845</td>
</tr>
<tr>
<td height="17" align="LEFT">Switzerland</td>
<td align="RIGHT">86,675</td>
</tr>
<tr>
<td height="17" align="LEFT">Austria</td>
<td align="RIGHT">82,861</td>
</tr>
<tr>
<td height="17" align="LEFT">Portugal</td>
<td align="RIGHT">63,218</td>
</tr>
<tr>
<td height="17" align="LEFT">Norway</td>
<td align="RIGHT">34,498</td>
</tr>
<tr>
<td height="17" align="LEFT"><strong>Total</strong></td>
<td align="RIGHT"><strong>7,665,815</strong></td>
</tr>
</tbody>
</table>
<p>I am still looking for a flaw in how I have calculated these numbers and would welcome suggestions.</p>
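<p>For anyone wanting to check the arithmetic, the ranking and total in the table can be reproduced with a few lines of Python. The counts are copied straight from the table above; how they were derived from GBIF is not shown here, and the variable names are just for illustration.</p>

```python
# Estimated specimen records per country, copied from the table above.
specimens = {
    "Spain": 2277428,
    "France": 2081208,
    "United Kingdom": 1038133,
    "Germany": 747159,
    "Poland": 304798,
    "Netherlands": 263655,
    "Belgium": 217417,
    "Denmark": 161163,
    "Slovenia": 160757,
    "Finland": 146845,
    "Switzerland": 86675,
    "Austria": 82861,
    "Portugal": 63218,
    "Norway": 34498,
}

# Rank countries from most to fewest records and add up the total.
ranked = sorted(specimens.items(), key=lambda item: item[1], reverse=True)
total = sum(specimens.values())

for country, count in ranked:
    print(f"{country:15} {count:>10,}")
print(f"{'Total':15} {total:>10,}")
```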
]]></content:encoded>
			<wfw:commentRss>http://www.hyam.net/blog/archives/1313/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>European Natural History Collections &#8211; What&#8217;s Missing?</title>
		<link>http://www.hyam.net/blog/archives/1277</link>
		<comments>http://www.hyam.net/blog/archives/1277#comments</comments>
		<pubDate>Tue, 21 Jun 2011 16:20:28 +0000</pubDate>
		<dc:creator><span property="dc:creator" resource="http://www.hyam.net/blog/archives/1277">Roger Hyam</span></dc:creator>
				<category><![CDATA[Biodiversity Informatics]]></category>
		<category><![CDATA[Synthesys]]></category>

		<guid isPermaLink="false">http://www.hyam.net/blog/?p=1277</guid>
		<description><![CDATA[I am working on improving the metadata on European natural history collections as part of the Synthesys project. In an earlier post  (Big Collections First) I did an analysis of the data in the Biodiversity Collections Index. I am now building a more detailed list of those large collections (the ones believed to contain more <a href='http://www.hyam.net/blog/archives/1277'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p>I am working on improving the metadata on European natural history collections as part of the <a href="http://www.synthesys.info/">Synthesys</a> project. In an earlier post  (<a href="http://www.hyam.net/blog/archives/1235">Big Collections First</a>) I did an analysis of the data in the <a href="http://www.biodiversitycollectionsindex.org/">Biodiversity Collections Index</a>. I am now building a more detailed list of those large collections (the ones believed to contain more than a million &#8216;specimens&#8217;) of which there appear to be around sixty. These  account for most of the biodiversity material in museums in Europe.</p>
<p>As I worked through the list I began to match them up against data sources in the <a href="http://data.gbif.org/">GBIF Data Portal</a> but this task became tricky as there were data sources in GBIF that had the names of museums but were clearly the results of observational studies and not catalogues of specimens. I decided to break off and do an analysis of what was in the GBIF Data Portal by way of specimens residing in Europe. This post presents the results of that analysis.<br />
<span id="more-1277"></span></p>
<h2>GBIF Metadata Web Services</h2>
<p>GBIF provides web services to access metadata on the data it harvests.  I extracted details of data sets hosted in Europe from these services. GBIF use three terms:</p>
<ul>
<li><strong>Data Resource</strong> (a.k.a. Data Set) &#8211; the actual data set containing occurrence/specimen records. <a href="http://data.gbif.org/ws/rest/resource/help/">Metadata available through this web service</a>.</li>
<li><strong>Data Provider</strong> &#8211; the organisation supplying the data. Each data resource belongs to a single data provider. <a href="http://data.gbif.org/ws/rest/provider/help/">Metadata available through this web service</a>.</li>
<li><strong>Data Network</strong> &#8211; a collection of data resources often arranged on a geographic or political basis. A data resource can belong to multiple networks.</li>
</ul>
<p>I was interested in the actual data resources but they don&#8217;t have a notion of where they are located so I started by querying the data provider service to get a list of data providers (and their resources) for each of the countries in Europe. I then called the resource service to get the metadata for each resource thus building a list of all the resources in Europe. Here is a list of all the countries and the number of resources they have.</p>
<table border="1">
<tbody>
<tr>
<th width="202" height="17" align="LEFT"><strong>Country</strong></th>
<th width="88" align="LEFT"><strong>Code</strong></th>
<th width="88" align="RIGHT"><strong>Resources</strong></th>
</tr>
<tr>
<td height="17" align="LEFT">Germany</td>
<td align="LEFT">DE</td>
<td align="RIGHT">8660</td>
</tr>
<tr>
<td height="17" align="LEFT">United Kingdom</td>
<td align="LEFT">GB</td>
<td align="RIGHT">368</td>
</tr>
<tr>
<td height="17" align="LEFT">Spain</td>
<td align="LEFT">ES</td>
<td align="RIGHT">155</td>
</tr>
<tr>
<td height="17" align="LEFT">Poland</td>
<td align="LEFT">PL</td>
<td align="RIGHT">98</td>
</tr>
<tr>
<td height="17" align="LEFT">Norway</td>
<td align="LEFT">NO</td>
<td align="RIGHT">68</td>
</tr>
<tr>
<td height="17" align="LEFT">Finland</td>
<td align="LEFT">FI</td>
<td align="RIGHT">45</td>
</tr>
<tr>
<td height="17" align="LEFT">Denmark</td>
<td align="LEFT">DK</td>
<td align="RIGHT">44</td>
</tr>
<tr>
<td height="17" align="LEFT">Netherlands</td>
<td align="LEFT">NL</td>
<td align="RIGHT">41</td>
</tr>
<tr>
<td height="17" align="LEFT">France</td>
<td align="LEFT">FR</td>
<td align="RIGHT">38</td>
</tr>
<tr>
<td height="17" align="LEFT">Ireland</td>
<td align="LEFT">IE</td>
<td align="RIGHT">38</td>
</tr>
<tr>
<td height="17" align="LEFT">Belgium</td>
<td align="LEFT">BE</td>
<td align="RIGHT">36</td>
</tr>
<tr>
<td height="17" align="LEFT">Austria</td>
<td align="LEFT">AT</td>
<td align="RIGHT">13</td>
</tr>
<tr>
<td height="17" align="LEFT">Switzerland</td>
<td align="LEFT">CH</td>
<td align="RIGHT">11</td>
</tr>
<tr>
<td height="17" align="LEFT">Andorra</td>
<td align="LEFT">AD</td>
<td align="RIGHT">7</td>
</tr>
<tr>
<td height="17" align="LEFT">Portugal</td>
<td align="LEFT">PT</td>
<td align="RIGHT">7</td>
</tr>
<tr>
<td height="17" align="LEFT">Slovenia</td>
<td align="LEFT">SI</td>
<td align="RIGHT">5</td>
</tr>
<tr>
<td height="17" align="LEFT">Iceland</td>
<td align="LEFT">IS</td>
<td align="RIGHT">4</td>
</tr>
<tr>
<td height="17" align="LEFT">Estonia</td>
<td align="LEFT">EE</td>
<td align="RIGHT">2</td>
</tr>
<tr>
<td height="17" align="LEFT">Luxembourg</td>
<td align="LEFT">LU</td>
<td align="RIGHT">1</td>
</tr>
<tr>
<td height="17" align="LEFT">Slovakia</td>
<td align="LEFT">SK</td>
<td align="RIGHT">1</td>
</tr>
<tr>
<td height="17" align="LEFT">Sweden</td>
<td align="LEFT">SE</td>
<td align="RIGHT">1</td>
</tr>
<tr>
<td height="17" align="LEFT">Åland Islands</td>
<td align="LEFT">AX</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Albania</td>
<td align="LEFT">AL</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Belarus</td>
<td align="LEFT">BY</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Bosnia and Herzegovina</td>
<td align="LEFT">BA</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Bulgaria</td>
<td align="LEFT">BG</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Croatia</td>
<td align="LEFT">HR</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Czech Republic</td>
<td align="LEFT">CZ</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Faroe Islands</td>
<td align="LEFT">FO</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Gibraltar</td>
<td align="LEFT">GI</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Greece</td>
<td align="LEFT">GR</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Guernsey</td>
<td align="LEFT">GG</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Holy See (Vatican City State)</td>
<td align="LEFT">VA</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Hungary</td>
<td align="LEFT">HU</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Isle of Man</td>
<td align="LEFT">IM</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Italy</td>
<td align="LEFT">IT</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Jersey</td>
<td align="LEFT">JE</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Latvia</td>
<td align="LEFT">LV</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Liechtenstein</td>
<td align="LEFT">LI</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Lithuania</td>
<td align="LEFT">LT</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Macedonia, the Former Yugoslav Republic of</td>
<td align="LEFT">MK</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Malta</td>
<td align="LEFT">MT</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Moldova, Republic of</td>
<td align="LEFT">MD</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Monaco</td>
<td align="LEFT">MC</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Montenegro</td>
<td align="LEFT">ME</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Romania</td>
<td align="LEFT">RO</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Russian Federation</td>
<td align="LEFT">RU</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">San Marino</td>
<td align="LEFT">SM</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Serbia</td>
<td align="LEFT">RS</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Svalbard and Jan Mayen</td>
<td align="LEFT">SJ</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">Ukraine</td>
<td align="LEFT">UA</td>
<td align="RIGHT">0</td>
</tr>
</tbody>
</table>
<p>&nbsp;</p>
<p>The first big surprise is that Germany has almost nine times as many resources as all the other countries put together! It turns out that this is largely due to a single data provider (<a href="http://www.pangaea.de/">Pangaea.de</a>) which publishes a very large number of small data sets, many from individuals or small projects. These should shake out later in the analysis. Otherwise the distribution of resources by country isn&#8217;t too surprising: larger, richer countries have more. But why none in Italy and only one in Sweden?</p>
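<p>As a quick sanity check on that ratio, here is a minimal Python sketch that re-adds the table above. The counts are copied from the table (countries with zero resources are left out); the function name <em>share_of_largest</em> is just an illustrative helper, not part of any GBIF tooling.</p>

```python
# Resource counts per country, copied from the table above (zero rows omitted).
RESOURCES = {
    "DE": 8660, "GB": 368, "ES": 155, "PL": 98, "NO": 68, "FI": 45,
    "DK": 44, "NL": 41, "FR": 38, "IE": 38, "BE": 36, "AT": 13,
    "CH": 11, "AD": 7, "PT": 7, "SI": 5, "IS": 4, "EE": 2,
    "LU": 1, "SK": 1, "SE": 1,
}


def share_of_largest(counts):
    """Return (code, count, total of all others, ratio) for the biggest country."""
    code = max(counts, key=counts.get)
    largest = counts[code]
    rest = sum(counts.values()) - largest
    return code, largest, rest, largest / rest


code, largest, rest, ratio = share_of_largest(RESOURCES)
print(f"{code}: {largest} resources vs {rest} elsewhere ({ratio:.1f}x)")
```

<p>The grand total from this table should also match the sum of the basis of record counts further down, which is a handy cross-check that no rows were lost in the harvest.</p>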
<h2>Teasing Out the Specimens</h2>
<p>The data harvested was for all data sets but many of these are observational data and not museum and herbarium catalogues. Fortunately there is a field in the metadata that gives the default basis of record. It may be that some data sets are mixed but in practice this is unlikely. Unfortunately this field isn&#8217;t always filled in.</p>
<table>
<tbody>
<tr>
<td width="119" height="17" align="LEFT"><strong>Basis Of Record</strong></td>
<td width="88" align="RIGHT"><strong>Count</strong></td>
</tr>
<tr>
<td height="17" align="LEFT">observation</td>
<td align="RIGHT">6718</td>
</tr>
<tr>
<td height="17" align="LEFT">unknown</td>
<td align="RIGHT">2821</td>
</tr>
<tr>
<td height="17" align="LEFT">specimen</td>
<td align="RIGHT">101</td>
</tr>
<tr>
<td height="17" align="LEFT">living</td>
<td align="RIGHT">3</td>
</tr>
</tbody>
</table>
<p>It looks as though we only have 101 data sets for collections, but some of those 2,821 unknowns must be collections. Can we guess which ones from some other characteristics of the data set?</p>
<ul>
<li>Firstly we are only interested in larger data sets &#8211; the bigger collections first strategy &#8211; so we could forget everything with fewer than 1,000 occurrence records. This should also clear out many of the personal data sets in Pangaea.de.</li>
<li>Then we might presume that observation data sets will have more occurrences per taxon than specimen data sets. Instead of collecting representatives of species they are collecting distribution or ecological data.</li>
<li>Then we might presume that specimen data is old and probably won&#8217;t yet be geocoded, whereas for observation data location is everything and they will frequently be geocoded.</li>
</ul>
<p>Here is a query ignoring data sets with fewer than 1,000 occurrence records and showing the occurrences per taxon and the percentage geocoded.</p>
<table>
<tbody>
<tr>
<td width="120" height="17" align="LEFT"><strong>Basis of Record</strong></td>
<td width="116" align="RIGHT"><strong># Data Sets</strong></td>
<td width="146" align="RIGHT"><strong>Occurrences/Taxon</strong></td>
<td width="88" align="RIGHT"><strong>% Geocoded</strong></td>
</tr>
<tr>
<td height="17" align="LEFT">living</td>
<td align="RIGHT">3</td>
<td align="RIGHT">2</td>
<td align="RIGHT">0</td>
</tr>
<tr>
<td height="17" align="LEFT">observation</td>
<td align="RIGHT">624</td>
<td align="RIGHT">240</td>
<td align="RIGHT">99</td>
</tr>
<tr>
<td height="17" align="LEFT">specimen</td>
<td align="RIGHT">90</td>
<td align="RIGHT">13</td>
<td align="RIGHT">45</td>
</tr>
<tr>
<td height="17" align="LEFT">unknown</td>
<td align="RIGHT">774</td>
<td align="RIGHT">584</td>
<td align="RIGHT">75</td>
</tr>
</tbody>
</table>
<p>It looks to me as though we could presume that if an unknown data set has an average of fewer than 50 occurrences per taxon and is less than 50% geocoded then it is probably a specimen data set. In SQL we consider a data set to be a specimen collection WHERE basisOfRecord = 'specimen' OR (basisOfRecord = 'unknown' AND occurrenceCount &gt; 1000 AND occurrencesPerTaxon &lt; 50 AND percentGeocoded &lt; 50).</p>
<p>Doing this gives us a list of <strong>276 data sets</strong> and a total of <strong>7,665,815</strong> specimens.</p>
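<p>For anyone who prefers code to SQL, the rule above can be sketched in Python roughly like this. The <em>DataSet</em> class and its field names are my own assumptions for illustration (they mirror the column names in the WHERE clause) and the example data sets are made up, not real GBIF resources.</p>

```python
from dataclasses import dataclass


@dataclass
class DataSet:
    """Hypothetical summary row for one GBIF data resource."""
    basis_of_record: str          # "specimen", "observation", "living" or "unknown"
    occurrence_count: int         # number of occurrence records
    occurrences_per_taxon: float  # average occurrences per taxon
    percent_geocoded: float       # 0-100


def is_specimen_collection(ds):
    """Apply the heuristic from the WHERE clause above."""
    if ds.basis_of_record == "specimen":
        return True  # explicitly declared as specimens
    # Otherwise only large, sparsely sampled, poorly geocoded unknowns qualify.
    return (
        ds.basis_of_record == "unknown"
        and ds.occurrence_count > 1000
        and ds.occurrences_per_taxon < 50
        and ds.percent_geocoded < 50
    )


# Made-up examples, purely to exercise the rule.
examples = [
    DataSet("specimen", 500, 300.0, 99.0),   # declared specimen: always counted
    DataSet("unknown", 5000, 12.0, 30.0),    # looks like a herbarium catalogue
    DataSet("unknown", 5000, 584.0, 75.0),   # looks like observation data
    DataSet("unknown", 800, 12.0, 30.0),     # too small to consider
]
for ds in examples:
    print(ds.basis_of_record, ds.occurrence_count, is_specimen_collection(ds))
```

<p>Note that, as in the SQL, the size threshold only applies to the unknown data sets. Anything that declares itself a specimen data set is counted regardless of size.</p>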
<p>In my <a href="http://www.hyam.net/blog/archives/1235">previous blog post</a> I did not attempt to estimate the number of specimens in Europe but I did total the number mentioned in the Biodiversity Collections Index as 334 million. Working on the basis that many collections are missing from that list, but that the missing collections are unlikely to be the BIG collections, it seems reasonable to assume the total number of specimens in Europe will not top 500 million. If you doubt this then think where we might find another 160 collections containing a million specimens each. Arturo Ariño wrote a paper in 2010 (<a href="https://journals.ku.edu/index.php/jbi/article/view/3991">Biodiversity Informatics, 7 2010, pp. 81-92</a>) where he estimated total units (specimens/lots/countable things) to be 1.2 to 2.1 billion and that GBIF had mobilized only 3% of these. My personal feeling is that these are overestimates but not by orders of magnitude.</p>
<p>If there are 334 million specimens in Europe then 7.7 million records in GBIF is 2.3%. On the other hand, if the Biodiversity Collections Index undercounts and there are 0.5 billion specimens in Europe, then GBIF only has 1.5% of them. <strong>Whichever you choose it is not a lot</strong>.</p>
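<p>The arithmetic is simple enough to check in a few lines of Python. The figures are the ones quoted in this post; the variable and function names are just illustrative.</p>

```python
GBIF_SPECIMEN_RECORDS = 7_665_815

# Two estimates of the number of specimens held in Europe, from the text above.
ESTIMATES = {
    "Biodiversity Collections Index total": 334_000_000,
    "Generous upper bound": 500_000_000,
}


def percent_mobilized(records, total_specimens):
    """Percentage of the estimated specimens that have records in GBIF."""
    return 100 * records / total_specimens


for label, total in ESTIMATES.items():
    share = percent_mobilized(GBIF_SPECIMEN_RECORDS, total)
    print(f"{label}: {share:.1f}% in GBIF")
```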
<p>Take just six big collections, for example:</p>
<ul>
<li>Muséum National d&#8217;Histoire Naturelle &#8211; Entomology</li>
<li>Royal Belgian Institute of Natural Sciences</li>
<li>Natural History Museum London, Department of Entomology</li>
<li>Royal Museum for Central Africa</li>
<li>Naturhistorisches Museum Wien</li>
<li>Natural History Museum of Denmark</li>
</ul>
<p>If we were to digitize just 5% of their holdings, we would more than double the number of European specimens in GBIF.</p>
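<p>I have not listed the holdings of each collection here, but the claim can be turned around: how big do the six have to be, combined, for 5% of them to match what GBIF already has? A rough Python sketch (the function name is my own, and the only input is the record count given above):</p>

```python
GBIF_SPECIMEN_RECORDS = 7_665_815


def break_even_holdings(current_records, percent_digitized):
    """Combined holdings at which digitizing the given percentage yields
    as many new records as GBIF already has (rounded up)."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-current_records * 100 // percent_digitized)


threshold = break_even_holdings(GBIF_SPECIMEN_RECORDS, 5)
print(f"Combined holdings must exceed {threshold:,} specimens")
print(f"Average needed per collection across six: {threshold / 6:,.0f}")
```

<p>Anything above that combined figure and 5% of the holdings more than doubles the current count.</p>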
<p>What am I missing?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.hyam.net/blog/archives/1277/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Big Collections First</title>
		<link>http://www.hyam.net/blog/archives/1235</link>
		<comments>http://www.hyam.net/blog/archives/1235#comments</comments>
		<pubDate>Fri, 17 Jun 2011 16:27:15 +0000</pubDate>
		<dc:creator><span property="dc:creator" resource="http://www.hyam.net/blog/archives/1235">Roger Hyam</span></dc:creator>
				<category><![CDATA[Biodiversity Informatics]]></category>
		<category><![CDATA[Synthesys]]></category>

		<guid isPermaLink="false">http://www.hyam.net/blog/?p=1235</guid>
		<description><![CDATA[I&#8217;ve been slow to blog on my day job recently. Sometimes the dead ends are so embarrassing they are better not shared. One thing worth sharing is a report I did for Synthesys on improving the quality of metadata on European biodiversity collections. It includes analysis of the data in the Biodiversity Collections Index and <a href='http://www.hyam.net/blog/archives/1235'>[...]</a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.hyam.net/blog/wp-content/uploads/2011/06/graph.png"><img class="alignleft size-full wp-image-1236" style="padding: 5px; border: 1px solid black;" title="graph" src="http://www.hyam.net/blog/wp-content/uploads/2011/06/graph.png" alt="" width="350" height="227" /></a>I&#8217;ve been slow to blog on my day job recently. Sometimes the dead ends are so embarrassing they are better not shared.</p>
<p>One thing worth sharing is a report I did for <a href="http://www.synthesys.info/">Synthesys</a> on improving the quality of metadata on European biodiversity collections. It includes analysis of the data in the <a href="http://www.biodiversitycollectionsindex.org">Biodiversity Collections Index</a> and other sources and comes to some conclusions about how we could increase our knowledge of specimens held within Europe.</p>
<p>In summary &#8211; the majority of specimens are in a few large collections. If we improved the coverage of a few dozen major collections then we could cover the majority of specimens held. This is important because the economies of scale kick in with larger collections. One techie guy can support the digitisation of a collection containing many millions of specimens almost as easily as a collection with only a couple of hundred thousand. This is not to say smaller collections aren&#8217;t important; it is merely a numbers game.</p>
<p>You can read a PDF of the full report, but please remember this isn&#8217;t a scientific paper; it is a quick look at the data to think about what to do next. <a href="http://www.hyam.net/blog/wp-content/uploads/2011/06/report_02.pdf">NA3 Task 2.3 &#8211; Metadata on European Collections – Report and Forward Plan</a> &#8211; PDF</p>
]]></content:encoded>
			<wfw:commentRss>http://www.hyam.net/blog/archives/1235/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

