Background and preamble

The 'balkanisation' of the literature is due in part to its sheer volume: some 25,000 journals currently publish 2.5 million peer-reviewed papers per year, i.e. ~5 per minute [1], and the count at PubMed/Medline alone (http://www.nlm.nih.gov/bsd/medline_cit_counts_yr_pub.html) is increasing by something approaching 2 per minute. In addition, the disconnect between the papers in the literature (usually as PDF files) and the metadata describing them (author, journal, year, pages, etc.) is acute and badly needs bridging [2]. Without solving this problem, and without automating the reading, interpretation and exploitation of this literature and its metadata in a digital format, we cannot make use of the existing tools for text mining and natural language processing (e.g. [35]), for joining disparate concepts [6], for literature-based discovery (e.g. [