In a recent article in Chemistry World, the declining support for PhDs and postdocs was highlighted. The explanation given is that this is a result of “changes to the way research costs are calculated”.

Can’t argue with that, but it is also true that if PhD/Postdocs were all that valuable, industry would fund them regardless. The truth about chemistry degrees, at least in England, is that they are all about sticking to the recipe given out by the lecturer. “Experiments” consist of following a procedure to the letter. Whilst this is a valuable skill to have, these are not “experiments” at all. In fact they are anti-experiments because should you deviate from the plan given to you, your grades will fall. If you do not *think* or question a procedure, you will get straight As.

Given that it is the straight-A students who are likely to go on to do PhDs/postdocs, these qualities will persist at this level, and industry is right to be unimpressed by this, since industry is all about the ability to think and innovate.

Vested interest: I only got a 2.2 so maybe I would say this ;)

…. and it did. But not quite in the way that Cambridge had imagined. Over the last few weeks around 200,000 articles from contributing publishers have been added to ChemSpider’s literature search (as ChemRefer is now styled), though even this is not in the final form which we imagine.

Another 40,000 articles or so are following next week as this resource grows. The indexer is running hot 24 hours a day, seven days a week. Tens of thousands more articles will follow after that, and on top of that we now have the capability to index text from image PDFs (many journal articles are still in this form), which also opens up the possibility of users sending in scanned images of their data-rich documents as a form of submission of chemical information to ChemSpider.

The main issue now is not having the time/resources to index everything we have permission for. We have still barely scratched the surface of HighWire, for instance, and adding updates from the resources we already index is not yet implemented properly. But these are nice problems to have.

When we do have the critical mass of text journal articles indexed, the “cited in” feature can be implemented and we can open up the chemical names from the indexed content for downloading and curation by the ChemSpider community… and that’s when things get really interesting.

We are still on track, with just scant resources, to create a community curated cheminformatics-text search that we hope will eventually gain unstoppable momentum thanks to our community backing. Mozilla Firefox competes with Microsoft’s Internet Explorer because it has user and developer community backing and that is worth consideration as a role model for ChemSpider and the chemistry world as a whole.

The turnaround that has occurred in the level of interest in having published materials text-indexed is highly significant in the long run, since thousands of references will pour into ChemSpider structure records to enhance the usefulness of the database.

These, of course, will be free for anyone to download, so will make a material contribution to the openness of chemical data (which is what I want Open Chemistry Web to be all about) as opposed to talking about definitions/licenses/copyrights and other such distractions (as I see them) surrounding open access and open data.

Some of the richest sources of chemical information are research group websites. Some time ago, I indexed primary literature PDFs from many such websites into the legacy (now non-existent) ChemRefer index.

I then received this correspondence from a major publisher. I submitted it to Chilling Effects to see what the various legal ins and outs of all of this meant.

Please read the letter and then the rest of this post.

It is worth pointing out here that the publisher may well have been right, but there is no way to confirm this since I am not (and should not be) able to access author-publisher contracts.

In any case, the result was that I stopped linking to research group website PDFs (the “just in case” approach). Was that the best course of action?  Comments welcome.

PNAS - National Academy of Sciences of the USA on HighWire:

78,379 articles indexed; 2 indices created to be linked into the literature search; linking strategy: pnas.org URL; indexing type: link back to full text (chemical name to structure conversions to follow)

ChemBlink:

18,453 web pages scanned; ChemSpider structure records to be created; linking strategy: chemblink.com URL; indexing type: Chemical name, structure, synonyms and property data and link back to original web page.

Having blogged on this before, I think it important to emphasise that you CAN spider PubMed Central. They even have their own utilities designed specifically for the mass downloading of articles, in the form of an OAI feed. What you cannot do is spider the article URLs directly (you must use the XML), because this is forbidden in robots.txt and you will be blocked on this basis.
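For anyone wanting to try the feed route, here is a rough sketch of building an OAI-PMH harvesting request in PHP. The verb and parameter names (ListRecords, metadataPrefix, from, set, resumptionToken) come from the OAI-PMH protocol itself; the base URL is only a placeholder, so check PubMed Central's own documentation for the current endpoint, and treat both function names as my own inventions.

```php
<?php
// Build an OAI-PMH ListRecords request URL.
// $baseUrl is an assumption: look up the real endpoint in the repository's documentation.
function buildOaiListRecordsUrl(string $baseUrl, string $metadataPrefix, ?string $from = null, ?string $set = null): string
{
    $params = ['verb' => 'ListRecords', 'metadataPrefix' => $metadataPrefix];
    if ($from !== null) {
        $params['from'] = $from; // only records changed since this date (YYYY-MM-DD)
    }
    if ($set !== null) {
        $params['set'] = $set;   // restrict the harvest to one named set
    }
    return $baseUrl . '?' . http_build_query($params);
}

// Large result lists are paged: each response may carry a resumptionToken
// that has to be sent back to fetch the next page.
function extractResumptionToken(string $xml): ?string
{
    if (preg_match('@<resumptionToken[^>]*>([^<]+)</resumptionToken>@', $xml, $m)) {
        return $m[1];
    }
    return null; // no token means the list is complete
}

// Placeholder endpoint, for illustration only.
echo buildOaiListRecordsUrl('https://example.org/oai', 'oai_dc', '2008-01-01'), "\n";
```

The follow-up request for the next page carries only the verb and the resumptionToken, with no other arguments.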

PubMed Central is one of the most innovative and open chemistry resources on the web with fantastic metadata and article retrieval tool sets designed to facilitate (not prevent) the spread of chemical information at no cost.

HighWire-hosted journal texts are to be indexed and linked back to by ChemSpider, and structure records linking to their content will be deposited here as well. HighWire will be indexed in accordance with their robots.txt file (the conventional web publishing standard for stating indexing permissions).
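To give an idea of what respecting robots.txt involves, here is a deliberately simplified sketch, not the indexer's actual code. It reads only the Disallow lines that apply to all robots (User-agent: *) and treats each as a plain path prefix. Allow lines, wildcards and robot-specific groups are ignored, and both function names are made up for the example.

```php
<?php
// Collect the Disallow prefixes that apply to every robot ("User-agent: *").
// Simplification: Allow lines, wildcards and agent-specific groups are ignored.
function disallowedPrefixes(string $robotsTxt): array
{
    $prefixes = [];
    $applies = false;      // are we inside a group that applies to us?
    $lastWasAgent = false; // consecutive User-agent lines share one group
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            $applies = ($lastWasAgent && $applies) || $value === '*';
            $lastWasAgent = true;
        } else {
            if ($field === 'disallow' && $applies && $value !== '') {
                $prefixes[] = $value;
            }
            $lastWasAgent = false;
        }
    }
    return $prefixes;
}

// A path is allowed unless it starts with one of the disallowed prefixes.
function isPathAllowed(string $path, array $prefixes): bool
{
    foreach ($prefixes as $prefix) {
        if (strpos($path, $prefix) === 0) {
            return false;
        }
    }
    return true;
}
```

The spider would call isPathAllowed() on every URL path before fetching it and skip anything that returns false.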

From the website:

“HighWire-hosted publishers have collectively made 1,873,044 articles free” [and with their partner publishers] “produce 71 of the 200 most-frequently-cited journals.”

We would like to thank them for one of the most phenomenal academic publishing indexing/structure deposition permissions we have received and we expect it will greatly enhance the discoverability of their partner publishers’ works through our free cheminformatics and text search.

The first part of the first build of the Open Chemistry Web project is now available for viewing and testing here.

It lacks the advanced and substructure capabilities at the moment but these are well on the way. Currently, it more closely resembles the old text search over at ChemRefer.com and we have actually been asked to preserve that metadata format, although there are some changes already implemented.

These include an effort to clarify metadata by standardising citation data to (or as closely as possible to) Journal name, Year, Volume, Issue, and page (all explicitly stated).
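As an illustration of what “explicitly stated” could look like in practice, here is a minimal sketch that turns whatever citation fields are available into one labelled string. The function name, the array keys and the output format are my own assumptions for the example, not the format used in the live search results.

```php
<?php
// Format citation fields as an explicitly labelled string, skipping any that are missing.
// The keys (journal, year, volume, issue, page) are hypothetical names for this sketch.
function formatCitation(array $c): string
{
    $labels = [
        'journal' => 'Journal',
        'year'    => 'Year',
        'volume'  => 'Volume',
        'issue'   => 'Issue',
        'page'    => 'Page',
    ];
    $parts = [];
    foreach ($labels as $key => $label) {
        if (isset($c[$key]) && trim((string) $c[$key]) !== '') {
            $parts[] = $label . ': ' . trim((string) $c[$key]);
        }
    }
    return implode(', ', $parts);
}

// Made-up citation with no issue number.
echo formatCitation([
    'journal' => 'Example Journal of Chemistry',
    'year'    => 2007,
    'volume'  => 12,
    'issue'   => '',
    'page'    => '345',
]), "\n";
```

Missing fields are simply left out, which is the “as closely as possible” part.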

The idea behind this is that people will take citations from the primary source (not ChemRefer) so citations in search results should serve only to be as clear and easily readable to the user as possible.

Soon to be implemented (for some publishers) is Digital Object Identifier linking, so far only at the request of the publisher. Search engines periodically refresh all their links anyway, so the link permanency issues that apply to databases (which DOI solves) do not apply here, and so there is no policy on this at the moment.

There will be a SIMPLE user interface. One text box, one applet on one page (preferably with very little else). We want to be addictively usable and deliver useful search results quickly. We do not want to build some all-singing-all-dancing and yet overly complex system that no-one without a Masters in cheminformatics will ever be able to decipher.

There are around 150,000 articles on the new index in comparison to ~50,000 in ChemRefer’s index of 12 months ago. Around half are open access (meaning you can download the full work in its entirety for free), and the full text of articles has been indexed to maximise the depth of the search (so even if you cannot access the full text for free, you are still searching the full text).

There is an enormous analytical and life sciences bias at the moment but these are often the most searched for chemical topics on the web due to their scope and importance.

For general interest, ChemRefer differs in structure from ChemSpider in that it is a search engine not a database. That means:

- ChemSpider exists as a website: you can link to it, bookmark it etc. Its purpose is to refer you to useful and curated resources but also to provide information on the ChemSpider.com web resource

- ChemRefer is just a searchable index. You cannot link to ChemRefer (unless you want to link to constantly changing search result pages). Its purpose is to get you off the website and to the useful primary source. Articles and metadata are spidered but this is dynamic so can hardly be described as curation. Systems have been set up to allow the curation of chemical structures from this raw full text index into ChemSpider in an accurate way but also quickly (luckily Tony Williams is a human Xerox). ChemRefer also now serves not just as a full text indexer, but also to mass harvest chemical data from selected web resources and deliver it to ChemSpider.

So the robot is often used to deliver data for curation so that it can be processed, not necessarily (as I initially assumed) just to feed the name-to-structure conversion software.

Any and all feedback welcome.

PHP is great. This quote which appears, ironically, on ASP.net explains why in a nutshell:

“I think PHP is great if you don’t wanna spent alot of time and ENERGY to become a web developer and still have some power”

Now, the literature indexing is built partly with PHP for this very reason. I am not a programmer but I want to program and I want to do it quickly because it’s just a means to an end, i.e. building an index. Whether that’s an index of data from articles, CIFs, catalogs etc. is immaterial with PHP. So, it seems to me that for librarians or information professionals anywhere, this is a great tool and you don’t have to have any extra money (PHP is free, as is the Apache webserver). Just determination and constant access to this resource: php.net .

Of course, ChemSpider is .NET, so this can create some difficulties whenever something indexed by me has to be implemented at ChemSpider.com, and I rarely have any idea what they are talking about when I hear words like SQL Server and so on. On the whole though, I am increasingly using a combination of free and came-with-my-computer Microsoft tools, e.g. the indexing runs on a WAMP server.

Snippets of the indexing code include just basic commands e.g. this for matching all URLs on a page including “/catalog/”.

<?PHP

// Collect every "/catalog/..." URL on the fetched page ($get holds the page HTML) into $outurls.
preg_match_all('@/catalog/[^"]+@', $get, $outurls);

?>
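To try that match on its own, here is a small self-contained version. The page source is a made-up string standing in for whatever the indexer has fetched, and the de-duplication step is my own addition for the example.

```php
<?php
// Hypothetical page source; in the indexer, $get would hold the fetched HTML.
$get = '<a href="/catalog/item1.html">One</a> '
     . '<a href="/about.html">About</a> '
     . '<a href="http://www.example.com/catalog/item2.html">Two</a>';

// Match "/catalog/" plus everything up to the next double quote.
preg_match_all('@/catalog/[^"]+@', $get, $outurls);

// $outurls[0] holds the full matches; drop duplicates before queueing them.
$catalogUrls = array_values(array_unique($outurls[0]));

print_r($catalogUrls);
```

Note that the pattern only captures from “/catalog/” onwards, so absolute links lose their host name and would need it added back before fetching.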

So, what does this mean for chemistry libraries? Well, having someone at your library with this knowhow is a must. Indexing and organising literature and more effectively complementing it with data indexed from the WWW can help a library to make up for the fact that it cannot afford all the subscriptions it would like. And, in an era where these once complex, labour-intensive and expensive activities are now free and dynamic, even the smallest library can use this to its advantage. You’re not getting any more money in your budget and your subscriptions aren’t getting any cheaper? … well the solution is still free.

I read this post on whether DOI is a good identifier or not. My feeling is that it has the following weaknesses:

It cannot (normally) be generated from citation information (a big disadvantage for an identifier) - you have to resolve them at e.g. CrossRef. This kills it as a way to communicate articles effectively.

If you want to resolve lots of them, you have to pay (there is no real value in this.. except that they have the identifiers and you do not).

It does not replace the URL, it is simply a redirect. This makes it hard to bookmark and those unfamiliar with the system who think they have bookmarked it have in fact bookmarked the URL.

Also, publishers have to pay for it too (though it’s possible they may receive money from CrossRef too). Essentially, all they are paying for is an unintuitive link that does not break provided they keep the redirect up to date.

Hence OpenURL.

It creates a persistent link as DOI does except it actually exists as a webpage (it is not a redirect) and can therefore be bookmarked easily and it CAN be generated from citation information without permissions. Here is a useful implementation.

A note on the CrossRef website caught my eye. It states that OpenURL is not competitive with DOI. This, of course, is nonsense (since it addresses link permanency). Apparently:

“An OpenURL link that contains a DOI is similarly persistent.” [as a link]

Why would an OpenURL pointing to a publisher website not be persistent without a DOI? OpenURL can be created with citation data so it is TOTALLY persistent. With DOI, you need to fill in a form at CrossRef or Doi.org which you do not need to do with OpenURL.
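To show what generating a link from citation data means, here is a minimal sketch that assembles an OpenURL-style query string from nothing but the citation. The key names (genre, issn, title, volume, issue, spage, date, aulast) follow the original OpenURL 0.1 convention; the resolver address is a placeholder and the function name is hypothetical.

```php
<?php
// Build an OpenURL-style link from citation data alone (no lookup service needed).
// $resolverBase is an assumption: substitute the address of a real OpenURL resolver.
function buildOpenUrl(string $resolverBase, array $citation): string
{
    // Keys from the OpenURL 0.1 convention; anything else in $citation is ignored.
    $allowed = ['genre', 'issn', 'title', 'volume', 'issue', 'spage', 'date', 'aulast'];
    $params = [];
    foreach ($allowed as $key) {
        if (!empty($citation[$key])) {
            $params[$key] = $citation[$key];
        }
    }
    return $resolverBase . '?' . http_build_query($params);
}

// Made-up citation, for illustration only.
echo buildOpenUrl('https://resolver.example.org/openurl', [
    'genre'  => 'article',
    'issn'   => '1234-5678',
    'volume' => '12',
    'spage'  => '345',
    'date'   => '2007',
]), "\n";
```

Everything in the link comes straight from the reference list entry, which is the whole point.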

It is DOIs that need third party ‘resolving’, not URLs and especially not OpenURLs which require no link up to a database (a restricted one in the case of CrossRef) for generation.

So, it is a shame that only a few publishers have taken it up. Surely, it is a competitive advantage to use a totally freely available URL structure that anyone can generate? After all, the worst that could happen is that someone might find your articles more easily.

PDFs are fantastic as a format in many ways. They store the position of their elements (unlike HTML) so allowing easy extraction of metadata (like titles and authors etc) for display in search results. There are a variety of free tools available to convert PDF files to text format and so the perception that Adobe rule the world of PDFs is false.

Most of these tools have simple ways to undo the potential damage caused by the double-columned PDF, especially with long chemical names. Another common problem with chemical name extraction from PDF is that you often read this: “5-diphenyl” ….. but end up extracting this: “5diphenyl” …. not fantastic (although whether an Adobe tool would produce a better result I don’t know), but easily solvable with things like regex.

So for PDFs: I find these free/three things surprisingly useful:

1) PDFtoText (for extraction) 

2) PHP (to generate output)

3) Regex (for name matching/repairing)
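As a rough sketch of the regex repair step, the function below puts a hyphen back between a locant number and a following name fragment. It assumes the text has already been extracted (for instance with PDFtoText), and the list of fragments is a hand-picked illustration rather than anything complete; a real list would need curating, and the function name is my own.

```php
<?php
// Re-insert a lost hyphen between a locant number and a known name fragment,
// e.g. "5diphenyl" becomes "5-diphenyl".
// $fragments is a hypothetical, hand-picked list; numbers followed by anything else are left alone.
function restoreLocantHyphens(string $text, array $fragments): string
{
    $alternatives = implode('|', array_map(function ($f) {
        return preg_quote($f, '/');
    }, $fragments));

    // A run of digits at a word boundary, immediately followed by one of the fragments.
    $pattern = '/\b(\d+)(?=(?:' . $alternatives . '))/i';

    return preg_replace($pattern, '${1}-', $text);
}

$fragments = ['di', 'tri', 'methyl', 'ethyl', 'phenyl', 'chloro'];
echo restoreLocantHyphens('2,5diphenyl oxazole, 25 mg', $fragments), "\n";
```

Because it only fires when a known fragment follows the number, things like quantities and page numbers are left untouched, but it will also miss any name that starts with a fragment not on the list.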