One of the most untapped and certainly unsearched sources of chemical literature on the web is journal articles in image PDF format. I am using ImageMagick and Tesseract to get round this but, having no experience of ‘image indexing’, I am discovering how memory-intensive and painfully slow the process is.

A source which would obviously benefit from this is the Acta Chemica Scandinavica archive put together by the Danish, Swedish, Norwegian and Finnish chemical societies which has extremely high quality image PDFs that lend themselves readily to this process. We will then have full text searchable functionality for this archive - will be interesting to test the quality of the free tools I am using for this as well. Could take weeks though!
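For anyone curious what the ImageMagick and Tesseract step might look like, here is a minimal sketch and not the actual indexing code. It only builds the shell commands, one PDF page at a time, on the assumption that rasterising a single page per call keeps memory use down. The function names, the 300 dpi setting and the file paths are all made up for illustration, and on ImageMagick 7 the `convert` command is normally called `magick`.

```php
<?php
// Sketch only: these functions build shell command strings; nothing is executed here.
// Function names, paths and the 300 dpi default are illustrative assumptions.

// ImageMagick: rasterise ONE page of the PDF (the "[n]" suffix selects a page)
// to a greyscale TIFF at the given resolution.
function convertPageCommand(string $pdf, int $pageIndex, string $tiff, int $dpi = 300): string
{
    return sprintf(
        'convert -density %d %s -colorspace Gray %s',
        $dpi,
        escapeshellarg($pdf . '[' . $pageIndex . ']'),
        escapeshellarg($tiff)
    );
}

// Tesseract: reads the TIFF and writes the recognised text to "<outBase>.txt".
function tesseractCommand(string $tiff, string $outBase): string
{
    return sprintf('tesseract %s %s', escapeshellarg($tiff), escapeshellarg($outBase));
}

// Both commands for one page, using a zero-padded page number in the file names.
function ocrCommandsForPage(string $pdf, int $pageIndex, string $workDir): array
{
    $base = rtrim($workDir, '/') . sprintf('/page-%03d', $pageIndex);
    return [
        convertPageCommand($pdf, $pageIndex, $base . '.tif'),
        tesseractCommand($base . '.tif', $base),
    ];
}

// Print the two commands for the first page of a hypothetical article.
foreach (ocrCommandsForPage('article.pdf', 0, '/tmp/ocr') as $command) {
    echo $command, PHP_EOL;
}
```

The resulting text files can then be fed to the same full text indexer as any other article.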

One of the surprises when indexing the huge array of literature available on the web is that many major names, that is the ones who are associated with the traditional closed model, pop up as by far and away the biggest contributors to open access works (defined here as those that are downloadable in their entirety free of charge or other barrier such as login giving away substantial personal info).

American Society for Biochemistry and Molecular Biology (100,000+ free articles)

Royal Society of Chemistry (70,000+ free articles) - trawled, but not yet added to lit search.

National Academy of Sciences of the USA (50,000+ free articles)

The observation is that around 99% of the open access works in chemistry indexed by ChemSpider are supported financially by the subscription model, and we can suppose that open access works support subscriptions by attracting unsubscribed readers too.

As we see above, this is not theory, it has been happening for years, it is a real world material contribution to openness in chemistry that has crazily not attracted any attention on the blogosphere as far as I can tell.

There is a continued focus on relabelling data produced by others as “open data” - but this data has already been labelled and licensed by the original producer, so this could be misleading. I’ve always thought that building searchable indices that link back, as do the major search engines, is the best way to build a resource through which users can discover works and where data producers are not undercut.

In a recent article in Chemistry World, the declining support for PhDs and postdocs was highlighted. The explanation given is that this is a result of “changes to the way research costs are calculated”.

Can’t argue with that, but it is also true that if PhD/Postdocs were all that valuable, industry would fund them regardless. The truth about chemistry degrees, at least in England, is that they are all about sticking to the recipe given out by the lecturer. “Experiments” consist of following a procedure to the letter. Whilst this is a valuable skill to have, these are not “experiments” at all. In fact they are anti-experiments because should you deviate from the plan given to you, your grades will fall. If you do not *think* or question a procedure, you will get straight As.

Given that it is the straight A students who are likely to go on and do PhDs/postdocs, these qualities will persist at this level, and industry is right to be unimpressed by this, since industry is all about the ability to think and innovate.

Vested interest: I only got a 2.2 so maybe I would say this ;)

…. and it did. But not quite in the way that Cambridge had imagined. Over the last few weeks around 200,000 articles from contributing publishers have been added to ChemSpider’s literature search (as ChemRefer is now styled), though even this is not in the final form which we imagine.

Another 40,000 articles or so are following next week as this resource grows. The indexer is running hot 24 hours a day, seven days a week. Tens of thousands more articles will follow after that. On top of that, we now have the capability to index text from image PDFs (many journal articles are still in this form), which also opens up the possibility of users sending in scanned images of their data rich documents as a form of submission of chemical information to ChemSpider.

The main issue now is not having the time/resources to index everything we have permission for. We have still barely scratched the surface of Highwire, for instance, and adding updates from the resources we already index is not yet implemented properly. But, these are nice problems to have.

When we do have the critical mass of text journal articles indexed, the “cited in” feature can be implemented and we can open up the chemical names from the indexed content for downloading and curation by the ChemSpider community… and that’s when things get really interesting.

We are still on track, with just scant resources, to create a community curated cheminformatics-text search that we hope will eventually gain unstoppable momentum thanks to our community backing. Mozilla Firefox competes with Microsoft’s Internet Explorer because it has user and developer community backing and that is worth consideration as a role model for ChemSpider and the chemistry world as a whole.

The turnaround that has occurred in terms of the interest in having published materials text indexed is highly significant in the long run, since thousands of references will pour into ChemSpider structure records to enhance the usefulness of the database.

These, of course, will be free for anyone to download, so will make a material contribution to the openness of chemical data (which is what I want Open Chemistry Web to be all about) as opposed to talking about definitions/licenses/copyrights and other such distractions (as I see them) surrounding open access and open data.

Some of the richest sources of chemical information are research group websites. Some time ago, I indexed primary literature PDFs from many such websites into the legacy (now non-existent) ChemRefer index.

I then received this correspondence from a major publisher. I submitted it to Chilling Effects to see what the various legal ins and outs of all of this meant.

Please read the letter and then the rest of this post.

It is worth pointing out here that the publisher may well have been right, but there is no way to confirm this since I am not (and should not be) able to access author-publisher contracts.

In any case, the result was that I stopped linking to research group website PDFs (the “just in case” approach). Was that the best course of action?  Comments welcome.

PNAS - National Academy of Sciences of the USA on Highwire:

78,379 articles indexed; 2 indices created to be linked into the literature search; linking strategy: pnas.org URL; indexing type: link back to full text (chemical name to structure conversions to follow)

ChemBlink:

18,453 web pages scanned; ChemSpider structure records to be created; linking strategy: chemblink.com URL; indexing type: Chemical name, structure, synonyms and property data and link back to original web page.

Having blogged on this before, I think it important to emphasise that you CAN spider PubMed Central. They even have their own utilities designed specifically for the mass downloading of articles in the form of an OAI feed. What you cannot do is spider the article URLs directly (you must use the XML) because this is forbidden in robots.txt and you will be blocked on this basis.
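To give a rough idea of what harvesting through an OAI feed involves, here is a minimal sketch that builds OAI-PMH `ListRecords` request URLs and pulls the `resumptionToken` out of a response so the next batch can be requested. The base URL is a placeholder and the function names are my own for illustration; check the PubMed Central documentation for the real endpoint and the metadata prefixes and sets it supports.

```php
<?php
// Sketch only. The base URL used below is a placeholder, not PMC's real endpoint.

// Build a ListRecords request. Under OAI-PMH a resumptionToken is an exclusive
// argument: when present it replaces every argument except the verb.
function oaiListRecordsUrl(
    string $base,
    string $metadataPrefix,
    ?string $set = null,
    ?string $resumptionToken = null
): string {
    if ($resumptionToken !== null) {
        $params = ['verb' => 'ListRecords', 'resumptionToken' => $resumptionToken];
    } else {
        $params = ['verb' => 'ListRecords', 'metadataPrefix' => $metadataPrefix];
        if ($set !== null) {
            $params['set'] = $set;
        }
    }
    return $base . '?' . http_build_query($params);
}

// Return the token for the next batch, or null when the list is complete
// (an empty or missing resumptionToken element).
function extractResumptionToken(string $xml): ?string
{
    if (preg_match('@<resumptionToken[^>]*>([^<]+)</resumptionToken>@', $xml, $m)) {
        return trim($m[1]);
    }
    return null;
}

// First request, then the follow-up request for a hypothetical response fragment.
$base = 'https://example.org/oai'; // placeholder endpoint (assumption)
echo oaiListRecordsUrl($base, 'oai_dc'), PHP_EOL;
$token = extractResumptionToken('<resumptionToken>batch-2</resumptionToken>');
echo oaiListRecordsUrl($base, 'oai_dc', null, $token), PHP_EOL;
```

A real harvester would fetch each URL, store the records and loop until no token comes back, pausing between requests.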

PubMed Central is one of the most innovative and open chemistry resources on the web with fantastic metadata and article retrieval tool sets designed to facilitate (not prevent) the spread of chemical information at no cost.

HighWire hosted journal texts are to be indexed and linked back to by ChemSpider, and structure records linking to their content will be deposited here as well. HighWire will be indexed in accordance with their robots.txt protocol (the conventional web publishing standard for stating indexing permissions).
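As a rough illustration of what respecting robots.txt means for a spider, here is a minimal sketch that collects the `Disallow` prefixes applying to a given user agent and checks a path against them. It is a simplification: it ignores `Allow` lines and wildcard patterns, the function names are invented for this example, and the sample rules are made up rather than taken from any real site.

```php
<?php
// Sketch only: handles User-agent groups and plain Disallow prefixes.
// Allow lines and wildcard patterns are deliberately ignored.

// Collect the Disallow path prefixes that apply to $agent ("*" = any robot).
function disallowedPrefixes(string $robotsTxt, string $agent = '*'): array
{
    $prefixes = [];
    $applies = false;      // does the current group apply to our agent?
    $lastWasAgent = false; // consecutive User-agent lines share one group
    foreach (preg_split('/\R/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            $match = (strcasecmp($value, $agent) === 0);
            $applies = $lastWasAgent ? ($applies || $match) : $match;
            $lastWasAgent = true;
            continue;
        }
        $lastWasAgent = false;
        if ($field === 'disallow' && $applies && $value !== '') {
            $prefixes[] = $value;
        }
    }
    return $prefixes;
}

// A path is allowed unless it starts with one of the disallowed prefixes.
function isAllowed(string $path, array $prefixes): bool
{
    foreach ($prefixes as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}

// Made-up rules for demonstration.
$rules = "User-agent: *\nDisallow: /cgi/\nDisallow: /search # no crawling\n";
$blocked = disallowedPrefixes($rules);
var_dump(isAllowed('/content/full/1/2/3', $blocked));
var_dump(isAllowed('/cgi/reprint/1/2/3', $blocked));
```

Checking every URL this way before fetching it is what keeps the spider inside the permissions the host has stated.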

From the website:

“HighWire-hosted publishers have collectively made 1,873,044 articles free” [and with their partner publishers] “produce 71 of the 200 most-frequently-cited journals.”

We would like to thank them for one of the most phenomenal academic publishing indexing/structure deposition permissions we have received and we expect it will greatly enhance the discoverability of their partner publishers’ works through our free cheminformatics and text search.

The first part of the first build of the Open Chemistry Web project is now available for viewing and testing here.

It lacks the advanced and substructure capabilities at the moment but these are well on the way. Currently, it more closely resembles the old text search over at ChemRefer.com and we have actually been asked to preserve that metadata format, although there are some changes already implemented.

These include an effort to clarify metadata by standardising citation data to (or as closely as possible to) Journal name, Year, Volume, Issue, and page (all explicitly stated).
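A minimal sketch of that kind of standardisation is shown below. It is not the code behind the index: the array keys, the output layout and the function name are assumptions made for illustration, and the journal in the example is invented. Every field is labelled explicitly, and fields that could not be extracted are simply left out.

```php
<?php
// Sketch only: the key names and the "Label: value" layout are assumptions.

// Turn whatever citation fields were extracted into one explicitly labelled line.
function formatCitation(array $citation): string
{
    $labels = [
        'journal' => 'Journal',
        'year'    => 'Year',
        'volume'  => 'Volume',
        'issue'   => 'Issue',
        'page'    => 'Page',
    ];
    $parts = [];
    foreach ($labels as $key => $label) {
        $value = isset($citation[$key]) ? trim((string) $citation[$key]) : '';
        if ($value !== '') {
            $parts[] = $label . ': ' . $value; // skip fields that are missing or blank
        }
    }
    return implode(', ', $parts);
}

// Hypothetical article; the journal name is invented.
echo formatCitation([
    'journal' => 'Example Journal of Chemistry',
    'year'    => 2007,
    'volume'  => 12,
    'issue'   => 3,
    'page'    => '45-52',
]), PHP_EOL;
```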

The idea behind this is that people will take citations from the primary source (not ChemRefer) so citations in search results should serve only to be as clear and easily readable to the user as possible.

Soon to be implemented (for some publishers) is Digital Object Identifier linking - at the request of the publisher so far. Search engines periodically refresh all their links anyway, so the link permanency issues that apply to databases (which DOI solves) do not apply here, and so there is no policy on this at the moment.
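For what it is worth, DOI linking itself is simple: a DOI becomes a link by appending it to the doi.org resolver address. The sketch below shows one way this might be done; the function name is invented, and stripping a leading "doi:" label is my own assumption about how DOIs may appear in harvested metadata.

```php
<?php
// Sketch only: turns a DOI string into a resolver link.

function doiUrl(string $doi): string
{
    // Drop surrounding whitespace and an optional "doi:" label.
    $doi = preg_replace('/^doi:\s*/i', '', trim($doi));
    // Percent-encode each path segment but keep the "/" separators.
    $segments = array_map('rawurlencode', explode('/', $doi));
    return 'https://doi.org/' . implode('/', $segments);
}

echo doiUrl('doi:10.1000/182'), PHP_EOL;
```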

There will be a SIMPLE user interface. One text box, one applet on one page (preferably with very little else). We want to be addictively usable and deliver useful search results quickly. We do not want to build some all-singing-all-dancing and yet overly complex system that no-one without a Masters in cheminformatics will ever be able to decipher.

There are around 150,000 articles on the new index in comparison to ~50,000 in ChemRefer’s index of 12 months ago. Around half are open access (meaning you can download the full work in its entirety for free), and the full text of articles has been indexed to maximise the depth of the search (so even if you cannot access the full text for free, you are still searching the full text).

There is an enormous analytical and life sciences bias at the moment but these are often the most searched for chemical topics on the web due to their scope and importance.

For general interest, ChemRefer differs in structure from ChemSpider in that it is a search engine not a database. That means:

- ChemSpider exists as a website: you can link to it, bookmark it etc. Its purpose is to refer you to useful and curated resources but also to provide information on the ChemSpider.com web resource

- ChemRefer is just a searchable index. You cannot link to ChemRefer (unless you want to link to constantly changing search result pages). Its purpose is to get you off the website and to the useful primary source. Articles and metadata are spidered but this is dynamic so can hardly be described as curation. Systems have been set up to allow the curation of chemical structures from this raw full text index into ChemSpider in an accurate way but also quickly (luckily Tony Williams is a human Xerox). ChemRefer also now serves not just as a full text indexer, but also to mass harvest chemical data from selected web resources and deliver it to ChemSpider.

So, the robot is often used to deliver data for curation so that it can be processed, and not necessarily (as I initially assumed) just to be fed into the name-to-structure conversion software.

Any and all feedback welcome.

PHP is great. This quote which appears, ironically, on ASP.net explains why in a nutshell:

“I think PHP is great if you don’t wanna spent alot of time and ENERGY to become a web developer and still have some power”

Now, the literature indexing is built partly with PHP for this very reason. I am not a programmer but I want to program and I want to do it quickly because it’s just a means to an end, i.e. building an index. Whether that’s an index of data from articles, CIFs, catalogs etc. is immaterial with PHP. So, it seems to me that for librarians or information professionals anywhere, this is a great tool and you don’t have to have any extra money (PHP is free, as is the Apache webserver). Just determination and constant access to this resource: php.net .

Of course, ChemSpider is .NET, so this can create some difficulties whenever something indexed by me has to be implemented at ChemSpider.com, and I rarely have any idea what they are talking about when I hear words like SQL Server and so on. On the whole though, I am increasingly using a combination of free and came-with-my-computer Microsoft tools, e.g. the indexing runs on a WAMP server.

Snippets of the indexing code include just basic commands e.g. this for matching all URLs on a page including “/catalog/”.

<?php

// $get holds the HTML of the fetched page; the matched URLs end up in $outurls[0].
preg_match_all('@/catalog/[^"]+@', $get, $outurls);

?>
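To show the snippet above in context, here is a small self-contained sketch that runs the same pattern over a made-up fragment of HTML and removes duplicate matches. The function name and the sample markup are hypothetical, and a real spider would fetch the page first (for example with `file_get_contents`).

```php
<?php
// Sketch only: the sample HTML and the function name are invented for illustration.

// Return every distinct path on the page that starts with "/catalog/".
function catalogPaths(string $html): array
{
    // Match from "/catalog/" up to (not including) the closing double quote.
    preg_match_all('@/catalog/[^"]+@', $html, $outurls);
    return array_values(array_unique($outurls[0]));
}

$get = '<a href="/catalog/item-1.html">One</a> <a href="/about">About</a> '
     . '<a href="/catalog/item-2.html">Two</a> <a href="/catalog/item-1.html">One again</a>';

print_r(catalogPaths($get));
```

Note that the pattern only keeps the part of each URL from "/catalog/" onwards, so the site address has to be added back before the pages can be fetched.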

So, what does this mean for chemistry libraries? Well, having someone at your library with this knowhow is a must. Indexing and organising literature, and more effectively complementing it with data indexed from the WWW, can help a library to make up for the fact that it cannot afford all the subscriptions it would like. And, in an era where these once complex, labour intensive and expensive activities are now free and dynamic, even the smallest library can use this to its advantage. You’re not getting any more money in your budget and your subscriptions aren’t getting any cheaper? … well the solution is still free.