The Advisory Group for ChemSpider expands yet again today, and we are glad to welcome Soaring Bear as a member. I have exchanged many emails with him over the past few months and he clearly has an eye for quality, having worked on the MeSH team. Bear has joined us with a Natural Products focus…

From the Advisory group page:

“Soaring Bear overhauled the chemistry and pharmacology parts of MeSH to improve search results for all who use the PubMed.gov version of Medline (and UMLS derivatives). He developed the HerbMed.org web site which contains over ten thousand quick summaries and links into Medline about botanical medicines. His QSAR and molecular modeling experience was with topoisomerase and its inhibitors. His degrees in Biochemistry and Pharmaceutical Sciences were earned at University of Arizona. He sees the need for improvements in structure search and presentation (in both 2D and 3D) and relating/linking to structure and literature of molecular targets and physiological effects and would like to contribute his background in structure, activity and clearing up ambiguities to help Chemspider continue to improve.”

Buy me a Coffee

In an ongoing commentary about the DailyMed dataset (1,2) I have been showing some of the struggles involved in creating curated datasets from publicly available data. This post shows an example of what happens when trade names collide. The DailyMed record for Sclerosol shows no chemical structure in the label but describes the compound as follows:

“Sclerosol® Intrapleural Aerosol (sterile talc powder 4 g) is a sclerosing agent for intrapleural administration supplied as a single-use, pressurized spray canister with two delivery tubes of 15 cm and 25 cm in length. Each canister contains 4.0 g of talc, either white or off-white to light grey, asbestos-free, and brucite-free grade of talc of controlled granulometry. The composition of the talc is ≥ 95% talc as hydrated magnesium silicate. The empirical formula is Mg3 Si4 O10 (OH)2 with molecular weight of 379.3.”
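As a quick sanity check on the label text, the quoted molecular weight can be recomputed from the empirical formula. The sketch below assumes rounded standard atomic masses, which are my own inputs rather than anything taken from the label.

```python
# Rounded standard atomic masses in g/mol (assumed inputs, not from the label).
ATOMIC_MASS = {"Mg": 24.305, "Si": 28.086, "O": 15.999, "H": 1.008}

# Mg3 Si4 O10 (OH)2 expanded into element counts: O = 10 + 2, H = 2.
TALC = {"Mg": 3, "Si": 4, "O": 12, "H": 2}


def molecular_weight(composition):
    """Sum atomic mass times count over every element in the formula."""
    return sum(ATOMIC_MASS[element] * count for element, count in composition.items())


talc_mw = molecular_weight(TALC)
print(f"Talc molecular weight: {talc_mw:.1f}")
```

With these masses the sum rounds to the 379.3 given on the label, so at least the formula and the molecular weight on the label agree with each other.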

Sclerosol is talc. A search on Sclerosol online, however, brings up numerous hits for dimethyl sulfoxide on ChemIndustry, the Comparative Toxicogenomics Database and MeSH. So, is Sclerosol also DMSO?

The PubChem record merges the relationship between Talc and DMSO rather well. Visit the record here. The substance summary is as follows:

“A highly polar organic liquid, that is used widely as a chemical solvent. Because of its ability to penetrate biological membranes, it is used as a vehicle for topical application of pharmaceuticals. It is also used to protect tissue during CRYOPRESERVATION. Dimethyl sulfoxide shows a range of pharmacological activity including analgesia and anti-inflammation.”

Further information is given in the MeSH details shown below.

The image of the associated structure is shown below…notice it’s representative of talc.

It appears that DMSO and Talc were meshed somehow.

Sclerosol on ChemSpider is talc. I am not stating that the structure representation of talc is appropriate, but it IS the same as the one displayed on PubChem. DMSO on ChemSpider is here and never had the name Sclerosol associated with it. Since we derived some of our data from PubChem I am not sure how we managed to separate the DMSO and Sclerosol association in our processes… but we did.

So, MAYBE Sclerosol is a name for DMSO… but I don’t think so.

Why is this important? We are working on text mining and will use a lookup dictionary of chemical names and structures as part of the process, so we are putting in the work to create a high-quality dictionary. It’s important for us moving forward.
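To make the dictionary problem concrete, here is a minimal sketch of how a name collision like this one could be flagged before it reaches a lookup dictionary. The record list and the plain-text structure labels are hypothetical; a real dictionary would more likely key on a structure identifier such as an InChI.

```python
from collections import defaultdict

# Hypothetical (name, structure label) pairs, for illustration only.
RECORDS = [
    ("Sclerosol", "talc"),                # as described on the DailyMed label
    ("Sclerosol", "dimethyl sulfoxide"),  # as implied by the other search hits
    ("Talc", "talc"),
    ("DMSO", "dimethyl sulfoxide"),
]


def find_collisions(records):
    """Return names that map to more than one distinct structure."""
    structures_by_name = defaultdict(set)
    for name, structure in records:
        # Compare names case-insensitively so "sclerosol" and "Sclerosol" merge.
        structures_by_name[name.strip().lower()].add(structure)
    return {
        name: sorted(structures)
        for name, structures in structures_by_name.items()
        if len(structures) > 1
    }


collisions = find_collisions(RECORDS)
for name, structures in collisions.items():
    print(f"{name}: needs curation, maps to {structures}")
```

A check like this cannot say which association is right. It only surfaces the names that a curator needs to look at.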


I’ve started a review of the DailyMed dataset as it is representative of some of the struggles with preparing a curated dataset of chemical structures, chemical names and trade names. In the first comment I pointed to issues with structure representations. I believe one of the worst is shown for Qvar to the left. An examination of the Qvar record gives the name as beclomethasone dipropionate. This particular compound has the chemical structure shown below. Not only is the stereochemistry missing from the structure on DailyMed but half of a ring has also been lost, maybe during a scanning process. I wonder whether the label circulating out there to the public has this issue? Would the public care? Probably not. But when trying to build a curated dataset it’s rather important.



The Advisory Group for ChemSpider is starting to expand again as we expand our efforts in the domain of text-mining and Wiki’ing chemistry. We’re happy to welcome Chris Singleton to our advisory group and to bring his passion for Wiki’ing Chemistry to bear on our projects. I’ve had the pleasure of working with Chris on an NMR structure elucidation problem, that of hexacyclinol. Chris has been one of the biggest contributors of analytical data to ChemSpider and has submitted hundreds of spectra and URL links to chromatography separation conditions.

From the Advisory group page:

“Chris Singleton has joined the Chemspider Advisory Group on Wiki-Based Chemistry to further the growth of open science and open knowledge. The free availability of open knowledge and analytical data is vital to this effort, and Chris is focusing on Wiki-Based Chemistry in this regard. He has worked with NMR, LC-MS, chiral HPLC, GC and chiral GC, and a variety of mass spectrometric techniques in the past. He is currently a bioanalytical chemist and specializes in LC/MS-MS for quantitative measurement of small molecules from biological matrices.”

Welcome Chris!


I’ve been looking at various forms of communication to help people understand a little more about ChemSpider. I am presently investigating the production of online movies to help users understand how to use the system to full effect and hope to roll out a few examples shortly. In parallel I’ve been looking at podcasting technology.

Serendipitously, I was approached by Nature to be involved in one of their podcasts and went through the experience with them. Though you’d never know it from the podcast, it was recorded during vacation, with our boisterous twin boys in the room and the background noise of the ocean crashing on the shore. There are worse ways to be involved in a podcast for sure… a nice view of the sea, two little boys desperately trying to stay quiet, and the professionalism and speed of Geoff Brumfiel at Nature made this a very pleasant experience.

If you’re interested you can check out the podcast here. Based on the feedback we might add podcasting as one more way to communicate with our users. Thoughts?


The DailyMed website is a valuable resource when it comes to chemical names, trade names, drug names and chemical structures. What is interesting is the quality of the information on the website. We were originally interested in using the website to expand our dictionary of drug names and associated chemical structures and exercising our text-mining tools to recognize chemical structures. In order to test our text-mining capabilities we had to examine every record for accuracy and appropriateness and to tune the algorithms. This amounted to over 3000 records. During this process we were able to review every chemical structure diagram and the appropriateness of these diagrams. As part of the process we were able to build a highly validated dataset of chemical structures and their chemical/trade/drug names. These will be exposed on ChemSpider in the near future.

For now, let’s examine the quality of information on DailyMed.

The website is advertised as:

“DailyMed provides high quality information about marketed drugs. This information includes FDA approved labels (package inserts). This Web site provides health information providers and the public with a standard, comprehensive, up-to-date, look-up and download resource of medication content and labeling as found in medication package inserts. ”

So, what type of materials can we find on DailyMed?

Look at Soltamox here. What do you think about the chemical structure image below? Do you think that was drawn with a structure drawing software package?

What about the one for Clindamycin phosphate? Do you think there might be a lack of stereochemistry on this structure of norethindrone below? Same question for Trobicin.

My favorite “not drawn by a chemist” chemical structure is the one for cefobid shown below.

Many chemical structures on DailyMed are imperfect. What is quite shocking is that many of these are not even drawn with structure drawing packages. There are other issues…more to come.

COMMENT: DailyMed is a delivery vehicle for content provided by vendors (I believe). The site is a valuable public service and is to be applauded. The hope is that the work we are doing on DailyMed will be of similar value and might encourage the “clean up” of some of the labels.


Here at ChemSpider we’ve been working for almost a year and a half to build a structure-centric community for chemists. During this time we have been dabbling, in the background, with ChemSpider being not so structure-centric, but this has not been exposed yet. Of late we have been attracted to the possibilities around text-mining and mark-up of articles.

We are well underway in terms of providing tools for markup and they will be released incrementally. We have a lot more ideas and are interested in participating in the Article 2.0 contest to see what we can do. What is Article 2.0? Article 2.0 was announced by Elsevier here with the following statement:

“We’ve worked hard to build the Article 2.0 dataset, and now we’re opening it up to developers via a simple, straightforward REST API. We will provide contestants with access to approximately 7,500 full-text XML scientific articles (including images) and challenge each contestant to be the publisher. In other words, each contestant will have complete freedom for how they would like to present the scientific research articles contained in the Article 2.0 dataset. We will encourage the use of XQuery, but this will not be a mandate. By leveraging these APIs, the contestant becomes the publisher and can render scientific articles to meet their needs including integrating the article into existing applications or combining it with other web service APIs.”

7500 articles and complete freedom to present the articles as we see fit. Enticing! What do we already have on ChemSpider that we could reuse?

1) Structure deposition

2) Analytical data and image deposition

3) Integration to other data via URLs

4) Add comments/description

5) Text markup with “Chemical enhancements”

6) A dataset of >21 million structures and integration to over 120 data sources

7) Good ideas …

Article 2.0 looks interesting…we hope to be involved


I have been in discussion with Christoph Steinbeck and colleagues from the European Bioinformatics Institute. Specifically, we are interested in linking up to AND embedding the text from their ChEBI Entities of the Month. So, as is my preferred manner of not assuming everything is Open Data but rather asking for permission, I approached Christoph. I asked for permission to copy the text for the Entities of the Month onto the appropriate record view in ChemSpider. When I asked the question we were not yet ready to accept rich text format with embedded hyperlinks, a strength of many of the articles on ChEBI’s Entity of the Month.

I am happy to announce that as part of our ongoing effort to Wikify ChemSpider and allow people to add descriptions to the individual record views we have added a rich text editor and are presently testing it. At present we have rolled out the FULL implementation of the editor. This means it has lots of capabilities/buttons and the entire editor is being tested by curators. But, when rolled out to users, there will be a Simple mode and an Advanced mode for the editor.

Click on the thumbnail below to see the Text Editor in action. Don’t forget, it is the “Full-powered” implementation for now. In this case all I did was copy and paste the text from the ChEBI website and insert the ChEBI article link back to the original article on the ChEBI site.

In the Text editor we are in the process of inserting new capabilities that will facilitate mark up of articles. Since we will be hosting a number of Open Access articles shortly we will be experimenting on those articles with our new markup capabilities.

When this is all rolled out we will have the majority of capabilities necessary for people to track their research online if they wish: online submission of structures, text deposition with full editing capabilities, submission and tracking of analytical data and images, and linking to external sites and data. It’s probably an 80% solution for right now since we are missing some capabilities and have some workflow issues. Examples are poor support for polymers and organometallics and, specifically, the structure-centric nature of the solution, which insists that a structure be submitted before data and text can be associated with it. In the future we will allow “sample-submission” where the structure is not known but the data, images and experimental details of synthesis and analysis are available. Clearly the standard workflow for synthetic chemists is to synthesize first and then confirm by analysis what the products are. This typical workflow will need to be supported. It’s coming…

Some of you might be asking:

1) will we support versioning of the articles as people modify/edit the article (as is done with Wikipedia)? Yes, we will. Soon.

2) will curators have the ability to lock articles? Yes, in the future we will introduce this if it’s deemed appropriate.

3) will it be possible to allow only one individual (or group) to edit an article? Yes, one of the future directions is to allow an individual or group to perform Open Notebook Science in front of the public but not allow the public to edit the results. They would of course be allowed to comment on the research. Future development…
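For readers who like to see ideas as code, the three answers above (versioning, locking and restricted editing) can be pictured as a single small data structure. This is purely a hypothetical sketch for illustration: the `Article` class and its fields are my own invention and say nothing about how ChemSpider will actually implement these features.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical record description with history, locking and an editor list."""
    text: str = ""
    locked: bool = False                         # would be set by a curator
    editors: set = field(default_factory=set)    # empty set means anyone may edit
    history: list = field(default_factory=list)  # earlier versions, oldest first

    def edit(self, user, new_text):
        """Save the current text to history, then apply the edit if permitted."""
        if self.locked:
            raise PermissionError("article is locked")
        if self.editors and user not in self.editors:
            raise PermissionError(f"{user} may not edit this article")
        self.history.append(self.text)
        self.text = new_text

    def revert(self):
        """Restore the most recent earlier version, if there is one."""
        if self.history:
            self.text = self.history.pop()


notebook = Article(editors={"alice"})
notebook.edit("alice", "Day 1: synthesis attempted.")
notebook.edit("alice", "Day 1: synthesis attempted, NMR recorded.")
notebook.revert()
print(notebook.text)
```

In this sketch an Open Notebook page is simply an article whose `editors` set names the researcher or group, while comments from the public would live elsewhere.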


We are adding the finishing touches to some markup tools for Open Access articles at present and they will be unveiled shortly. In parallel we’ve been manually curating a series of articles about drugs, about 3000 of them, and will roll out these articles with similar markup using the tools we have developed. When rolled out we will have extended our ChemSpider toolkit to facilitate integration between “documents” and ChemSpider - watch this space…
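To give a flavour of what dictionary-based markup can look like, here is a minimal sketch that wraps known chemical names in links. It is not our markup tool: the name list is hypothetical, and the link format is an assumption based on the `http://www.chemspider.com/q/...` search links used on this blog.

```python
import re
from html import escape
from urllib.parse import quote

# Hypothetical dictionary of names to recognise; a real one would be curated.
KNOWN_NAMES = ["dimethyl sulfoxide", "thymol blue", "talc"]


def search_url(name):
    """Build a search link; the /q/ pattern is an assumption (see lead-in)."""
    return "http://www.chemspider.com/q/" + quote(name)


def mark_up(text, names=KNOWN_NAMES):
    """Wrap each whole-word occurrence of a known name in an HTML link."""
    # Longest names first so a longer name wins over a shorter one it contains.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b",
        re.IGNORECASE,
    )

    def link(match):
        found = match.group(0)
        return f'<a href="{search_url(found.lower())}">{found}</a>'

    # Escape the plain text first so the only markup in the result is ours.
    return pattern.sub(link, escape(text))


marked = mark_up("Thymol blue was dissolved in dimethyl sulfoxide.")
print(marked)
```

Plain string matching like this is only as good as the dictionary behind it, which is why the manual curation effort matters so much.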


Most blog readers will likely be aware of the recent article written in Nature about ChemSpider. PMR has recently commented on what he said to the Nature reporter who interviewed him that did not make it into print.

I’ll clarify some of Peter’s statements and differentiate judgments from truths. Some of this is a repeat, again.

1) “Firstly to say that I commented to Geoff before Chemspider’s announcement that it was adopting CC-SA licences. This is a major advance and has enhanced the importance of Chemspider.”

We have now REMOVED these licenses after the rather interesting situation that resulted, and Peter had already commented on his own blog: “I agree with John. Licences are not appropriate for data (and when I applauded Chemspider it was for the motivation rather than the actual mechanism - CC-SA is conformant to the OK definition, but difficult to operate for re-use). That’s why we use the OKF’s OpenData sticker on CrystalEye.”

2)  “It’s (now) based on Web 2.0 principles in that it uses social computing for some of its content and can and has reacted to external changes.” ChemSpider has been based on Web 2.0 principles since the first rollout and I have commented on this previously.

3)  “It’s not, however, based on semantic web technology such as RDF and XML and this may be a future limitation in managing some of the more complex content.” We use XML in many places on our site and some of this will be exposed in the future. We have discussed RDF’ing our system with Egon Willighagen but it’s not a priority for us at present. It’s on the list though.

4) “Although I’m not party to the internal design I’d guess it has a relational database, most of whose primary keys are the identifiers for chemical compounds. These identifiers map onto canonicalised chemical structures (one serialization of which is the InChI) and this is the primary mechanism for indexing compounds.” Yes, it’s a relational database, on Microsoft SQL Server. Primary keys are structures and we do use InChIs, a lot.

5) “CS has ca 20 million compounds and the only way to manage these is robotically.” We have a hybrid model of robotic handling and human intervention and interaction with the data. To see human interaction in action visit the feedback page.

6) “there is no guarantee that the computation of properties is free from error - indeed it cannot be. Many physical properties depend on the physical form of the compound and this is often not recorded. I suspect most of the properties are computed by heuristic means (“QSPR”) rather than QM calculations. And many of them fail to take things like chemical stability and reactivity into account. (Examples are boiling points for compounds that decompose, flashpoints for things that could never burn). But how do you tell this robotically - I don’t have a good suggestion. But one can guarantee that in 20 million calculations some will be meaningless.”

I agree with the scientific statement that properties depend on the physical form of the compound. None of the predictions are QM-based: that is definitely not feasible with 20 million compounds, less because of lack of access to software than because of the time required, as discussed previously in regard to QM NMR predictions. I have 15 years of experience with QSPR-type predictions; they are fast and are generally applied by the majority of chemists at the desktop in Life Science environments (and others) for the prediction of logP, solubility, logD, pKa, NMR etc. I GUARANTEE that in 20 million compounds some will be meaningless. This definitely doesn’t mean the predicted values across the DB are of no value.

Despite some of the previous comments about the properties, in the vast majority of cases property prediction is valid. See such discussions here: Calcium Carbonate is not soluble and can’t have a logP PLUS Lipinski says Calcium Carbonate CAN have a logP

We are presently adding MORE predicted properties. Check out the “EPI Summary” at the bottom of the page for this record and you will see this (scroll inside the box):

Log Octanol-Water Partition Coef (SRC):
    Log Kow (KOWWIN v1.67 estimate) =  0.85

 Boiling Pt, Melting Pt, Vapor Pressure Estimations (MPBPWIN v1.42):
    Boiling Pt (deg C):  290.82  (Adapted Stein & Brown method)
    Melting Pt (deg C):  80.58  (Mean or Weighted MP)
    VP(mm Hg,25 deg C):  3.99E-005  (Modified Grain method)
    Subcooled liquid VP: 0.000135 mm Hg (25 deg C, Mod-Grain method)

 Water Solubility Estimate from Log Kow (WSKOW v1.41):
    Water Solubility at 25 deg C (mg/L):  3.574e+004
       log Kow used: 0.85 (estimated)
       no-melting pt equation used

 Water Sol Estimate from Fragments:
    Wat Sol (v1.01 est) =  1e+006 mg/L

 ECOSAR Class Program (ECOSAR v0.99h):
    Class(es) found:
       Neutral Organics-acid

 Henrys Law Constant (25 deg C) [HENRYWIN v3.10]:
   Bond Method :   9.70E-009  atm-m3/mole
   Group Method:   Incomplete
 Henrys LC [VP/WSol estimate using EPI values]:  2.176E-010 atm-m3/mole

 Log Octanol-Air Partition Coefficient (25 deg C) [KOAWIN v1.10]:
  Log Kow used:  0.85  (KowWin est)
  Log Kaw used:  -6.402  (HenryWin est)
      Log Koa (KOAWIN v1.10 estimate):  7.252
      Log Koa (experimental database):  None

 Probability of Rapid Biodegradation (BIOWIN v4.10):
   Biowin1 (Linear Model)         :   0.7245
   Biowin2 (Non-Linear Model)     :   0.7196
 Expert Survey Biodegradation Results:
   Biowin3 (Ultimate Survey Model):   3.1842  (weeks       )
   Biowin4 (Primary Survey Model) :   3.9956  (days        )
 MITI Biodegradation Probability:
   Biowin5 (MITI Linear Model)    :   0.6808
   Biowin6 (MITI Non-Linear Model):   0.7604
 Anaerobic Biodegradation Probability:
   Biowin7 (Anaerobic Linear Model):  0.5224
 Ready Biodegradability Prediction:   YES

Hydrocarbon Biodegradation (BioHCwin v1.01):
    Structure incompatible with current estimation method!

 Sorption to aerosols (25 Dec C)[AEROWIN v1.00]:
  Vapor pressure (liquid/subcooled):  0.018 Pa (0.000135 mm Hg)
  Log Koa (Koawin est  ): 7.252
   Kp (particle/gas partition coef. (m3/ug)):
       Mackay model           :  0.000167
       Octanol/air (Koa) model:  4.39E-006
   Fraction sorbed to airborne particulates (phi):
       Junge-Pankow model     :  0.00598
       Mackay model           :  0.0132
       Octanol/air (Koa) model:  0.000351 

 Atmospheric Oxidation (25 deg C) [AopWin v1.92]:
   Hydroxyl Radicals Reaction:
      OVERALL OH Rate Constant =  24.3848 E-12 cm3/molecule-sec
      Half-Life =     0.439 Days (12-hr day; 1.5E6 OH/cm3)
      Half-Life =     5.264 Hrs
   Ozone Reaction:
      No Ozone Reaction Estimation
   Fraction sorbed to airborne particulates (phi): 0.00957 (Junge,Mackay)
    Note: the sorbed fraction may be resistant to atmospheric oxidation

 Soil Adsorption Coefficient (PCKOCWIN v1.66):
      Koc    :  1
      Log Koc:  0.000 

 Aqueous Base/Acid-Catalyzed Hydrolysis (25 deg C) [HYDROWIN v1.67]:
    Rate constants can NOT be estimated for this structure!

 Bioaccumulation Estimates from Log Kow (BCFWIN v2.17):
   Log BCF from regression-based method = 0.500 (BCF = 3.162)
       log Kow used: 0.85 (estimated)

 Volatilization from Water:
    Henry LC:  9.7E-009 atm-m3/mole  (estimated by Bond SAR Method)
    Half-Life from Model River: 7.347E+004  hours   (3061 days)
    Half-Life from Model Lake : 8.016E+005  hours   (3.34E+004 days)

 Removal In Wastewater Treatment:
    Total removal:               1.88  percent
    Total biodegradation:        0.09  percent
    Total sludge adsorption:     1.78  percent
    Total to Air:                0.00  percent
      (using 10000 hr Bio P,A,S)

 Level III Fugacity Model:
           Mass Amount    Half-Life    Emissions
            (percent)        (hr)       (kg/hr)
   Air       0.189           10.5         1000
   Water     37              360          1000
   Soil      62.7            720          1000
   Sediment  0.0722          3.24e+003    0
     Persistence Time: 547 hr
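Several of the numbers in the listing above are derived from one another, so they can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check under stated assumptions, not a description of how EPI Suite works internally. The gas constant, the 298.15 K temperature and the textbook relationships (log Koa = log Kow - log Kaw, Kaw = H/RT, half-life = ln 2 / (k × [OH])) are my assumptions; every other value is copied from the listing.

```python
import math

# Values copied from the EPI Summary listing.
log_kow = 0.85        # KOWWIN estimate
log_kaw = -6.402      # HenryWin estimate, as used by KOAWIN
henry_bond = 9.70e-9  # atm-m3/mole, bond method
k_oh = 24.3848e-12    # cm3/molecule-sec, overall OH rate constant
oh_conc = 1.5e6       # OH radicals per cm3
log_bcf = 0.500

# Assumed constants (not in the listing): gas constant and 25 deg C in kelvin.
R_ATM_M3 = 8.205e-5   # atm-m3/(mol K)
T_KELVIN = 298.15

# Log Koa as the difference of the octanol-water and air-water values.
log_koa = log_kow - log_kaw

# Dimensionless air-water partition coefficient from the Henry's law constant.
log_kaw_from_henry = math.log10(henry_bond / (R_ATM_M3 * T_KELVIN))

# Pseudo-first-order half-life against OH radicals, converted to hours.
half_life_hr = math.log(2) / (k_oh * oh_conc) / 3600

# BCF is the antilog of log BCF.
bcf = 10 ** log_bcf

print(f"Log Koa        = {log_koa:.3f}")
print(f"Log Kaw (calc) = {log_kaw_from_henry:.3f}")
print(f"OH half-life   = {half_life_hr:.3f} hr")
print(f"BCF            = {bcf:.3f}")
```

Under these assumptions the recomputed values match the listed 7.252, -6.402, 5.264 hr and 3.162 to the printed precision. That says the listing is internally consistent. It says nothing about whether the underlying predictions are accurate.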

7) “Chemspider is using social computing (crowdsourcing) to clean up (curate) the information in the database. This works in Wikipedia, although the number of chemicals is in the thousands, not the millions, and there are still many data and chemical problems. Moreover WP shows that there are compounds - e.g. aluminium chloride - where there is no single structure.” Social computing curation is working well. It’s working on Wikipedia too… I am in the middle of that effort. There is no reason that ChemSpider cannot support multiple species for one compound either. For example, see the structure of Thymol Blue on Wikipedia and then look at this search: http://www.chemspider.com/q/thymol%20blue on ChemSpider. Two of the three structures in the scheme are noted on ChemSpider. The third can be added. For aluminium chloride we link to Wikipedia to explain this… at present only the lede of the article, but we could host the entire article. Why not?

8) “What is Chemspider now is and where it may be going? It’s difficult to predict anything on the web but it’s also clear that chemists are one of the most conservative disciplines. Why use a free service when you can get your library to pay (a lot of money) for ACS or Beilstein services? So I wouldn’t predict explosive growth like Flickr or Google” Yup, I’d agree. But it’s not only conservatism. It’s marketing (we don’t do any paid marketing) and ChemSpider is for chemists. Flickr’s for everybody, so is Google. How can it be as explosive? But can it and is it growing? Yup.

9) “Nick found 26 sites displaying staurosporine and there were 19 different structures given. Some were incomplete and several were just crazily wrong. Clearly many chemical suppliers, journal editors, etc. do not care about chemical structures. So there is a huge amount of rubbish out there.” I’ve said the same many times (1,2, and others). But does it mean we should stop? I don’t think so…

and to conclude

10) “PMR: At some stage, therefore, the community will react against this centralisation of information, but it could be a long time. I don’t think anyone should set up to duplicate what ACS does - I think we should use modern thinking to do things quicker, smarter, cheaper and in tune with the modern Web. Chemspider may have to make some choices soon - is it a company or a voluntary activity? does it concentrate on high volume and variable quality, or low volume and high quality - it cannot do both? What is the particular USP of its repository service ?- there may well be a role for a specialist chemical repository service but when? Is it different from Pubchem, and how…?”

ChemSpider is not a company. ChemZoo is. We ARE using modern thinking in tune with the modern web. Probably one of the fastest moving efforts in this area… are there others moving as fast at depositing? curating? integrating? So, we are a company, and the service comes at no cost to the users. Volunteers are helping. We are working on BOTH high volume and high quality. It is work. We are being successful on both. The Wikipedia collection, when finished, will only be a subset of ChemSpider. But structures and associated information (other than predictions!) are validated daily at present. And crowdsourcing can speed it up. And there WILL be disagreements between chemists… just like on Wikipedia! I am in those conversations too. I think there is a role for a free access chemical repository now. We may be surpassed at any time but for now our efforts are valid and valiant, in my opinion… what say you?


Since ChemSpider went live in Spring 2007 we have received a lot of support, feedback and guidance from our Advisory Group. The advisory group was set up as a rotating group of advisors and it is now time for a “changing of the guard”. If you have an interest in becoming a member of the advisory group please send me a note to antonyDOTwilliamsATchemspiderDOTcom.

For the next year we will be focusing our efforts on: supporting Open Access publishers (see later post), text-mining, document mark-up, working with chemical vendors, enhancement of web services, and extending our penetration into the world of Wiki-based chemistry. If any of these areas are of interest let me know!


Will Griffiths has posted at Open Chemistry Web a post entitled “Chemrefer could disappear tomorrow”.

He’s not talking about ChemRefer disappearing, quite the contrary. He is talking about how the combination of ChemRefer/ChemSpider is powering ahead with our indexing of Open Access articles, with tens of thousands of new articles added to ChemSpider text-searching capabilities so far and many more coming soon.

I guess now we have to consider that “ChemSpider could disappear tomorrow” too. I hope we disappear in the SAME way! By that I mean I hope that some organization sees the value of what we are doing and will want to collaborate with us in order to make an even bigger impact. One thing about what we are doing, as I commented during my presentation at the Whitney Symposium at GE, is “We are upsetting a lot of people – evangelists, cheminformatics system vendors, publishers, data content providers”. This is NOT intentional but what we are doing is disruptive, we understand. We haven’t focused on talking about what’s possible but on getting on with doing it, sometimes with warts and all. Not all players in these areas see us as a threat but, based on direct feedback, some do. It’s a shame. We have a lot of “birthmarks” on us at present… We are upsetting a lot of people.


I spent two days in Albany, New York this week at the GE Research Center. The Whitney Symposium was focused on “Networks” and invited speakers from Harvard University, Caltech, MIT, Yahoo research and the like to talk about their views of networks. These included power networks, biological networks, socio-economic networks and so on. I spoke in the Social Networking section and a link to the presentation is below: Crowd Sourcing to Build a Structure Crowd-Centric Community for Chemists

I have not added text to each of the slides but hope they will be rather self-explanatory.


A biweekly update of new blog postings on the ChemConnector Blog that might be of interest to ChemSpider readers.

Books I am reading - The Autoimmune Epidemic

Invited Symposium Speaker at a Fortune 500 Company

New Shower Curtains and Our Health

Petaflops and Cell Processors


As we continue to add data sources to ChemSpider…and it’s going on almost weekly at present, it is clear that we have to make it easier for the users of ChemSpider to know what each of the Data Sources is. We’ve been doing some developments in the background for a couple of collaborations that have required the development of certain components and we’re layering one of them on here. We are using a callout balloon to display description details of the data source. Just hover over the name of the Data Source and you will see the description as shown below.

Clicking on the More Details link at the bottom right hand side of the callout balloon takes you to the details page. If any of you readers are DEPOSITORS on the ChemSpider system please note that we would love you to maintain your own page. Contact me and I will guide you through the process. This is the first of many enhancements to help navigate Data Sources.


Yay..I am on my way to SciFoo in August to hang out with lots of people I know and lots more I don’t. It’s a Science camp with an intention of “encouraging collaboration between scientists who would not typically work together.”

As mentioned in the invite to me… “The Economist said that it “capture[s] the essence of innovation”; in a photo essay for Edge, George Dyson wrote of “the impossible choice” when deciding which sessions to attend; another attendee described it simply as “The best gathering ever. Period.””

I am really excited to participate and there are already conversations afoot regarding getting a group of us together to discuss extending ChemSpider to become an even better platform for “Open Notebook Science”.

This is going to be great!


The past couple of days have seen an interesting exchange going on over on the SimBioSys blog.

Zsolt Zsoldos is someone I respect, not only for his passion for his science but also for his desire to educate others about the challenges of what he does in developing software. I believe his blog post entitled “Crystal Structure Errors in CSD too” was an honest attempt to tell people to be “careful” when using data from databases. I don’t care whether the database is ChemSpider, PubChem, the CAS Registry or any of the other databases available via free access or commercial transaction, they ALL have errors. It is inevitable. Zsolt’s attempt to highlight that such errors exist was done, I believe, with pedagogical intent.

“J” then came back and gave some appropriate comments in response to Zsolt’s post and they should be consumed in series. It appears there was some type of backroom conversation, likely with the CCDC,  about how these comments were not prominent enough. Zsolt then posted this:

Update: Since the posting of this blog entry, we have received 2 public comments — displayed in a standard way as all comments by the WordPress blog software, and some private emails originating from CCDC. One of the complaints from CCDC was that the second comment — which explains the problems and directs the blame on my naivity for my wrong expectations about the data — was not displayed as prominently as the original article.”

He then posted the comment into the original article. Huh? I am not sure why Zsolt should have felt obliged to do this for anyone: how comments are displayed is a WordPress issue, not something he needed to fix by inserting the text into the article. Zsolt then went on to comment about the licence agreement and permission to use the CSD. What is more interesting to me is his view here:

“On a personal opinion: such restrictions on the use of scientific facts do not seem to make much sense to me. As the IUCr position paper explains: There is a long-standing acceptance within crystallography of the principle that such primary data sets should be freely available for sharing and re-use (with appropriate credit) within the structural science community. Also the FAQ on the CrystalEye site explains: “As this supplementary data is a set of facts and is not part of the article full-text it does not fall under the copyright, and it should therefore be free to both view and download”. Nevertheless, CCDC has the legal right to stop us from using the data, since we signed a licensing agreement containing such conditions. That was a mistake on our part, one that we have to live with now. Let this case be a warning for others who have not yet made such mistake to sign the draconian agreement.”

Those of you who have been watching the discussion between myself and ACS over the past few months will know I have been trying to get confirmation that “supplementary data” are Open Data and that we could scrape the CIFs if we chose to… it’s a MANY-month conversation at this point. The Unilever School at Cambridge, via Nick Day’s work, has generated CrystalEye and, after many conversations, we were provided the data source and have it on ChemSpider now. We are awaiting constructive feedback from Nick and Peter Murray-Rust regarding our implementation of their data on our site. This is especially important when there are licensing issues as appear to have been enforced on SimBioSys, evidenced by this Public Apology to CCDC. Read the post for details. It is Zsolt’s concluding statement that feeds directly into the value of Open Data in science and the value of CrystalEye to the community.

He comments: “One lesson I learned from this exchange is the importance of Open Data for scientific advancement (some scientists believe that research data must be free), e.g. such that is available from CrystalEye. When even non-profit organizations (registered as a charity) use draconian license agreements protecting data created and published by others, then fully commercial entities (like pharmaceutical companies) must be guarding their own data even stronger. It makes it difficult to make scientific progress if a single blog mention of an error in a data entry invites the wrath of the company who sells services on the data.”

As efforts like CrystalEye prevail, as the copyrightability of supplementary data and the position of publishers regarding it are resolved, and as the efforts of groups such as ChemSpider are applied to gathering Open Data and developing algorithms from these data, we are likely to see increasing tension of the kind on display here.


A few days ago I blogged about the removal of the NMR predictor link from ChemSpider and committed to follow up with the developers of the algorithm. They are clearly my type of people…they have moved quickly and have already fixed a couple of bugs. If you check my original post above you will see my comments about the NMR spectrum of benzene. Check below for the NMR spectrum now after their bug-fix. Looks fine to me. They commented they have some additional work to do but it looks like we might be reconnecting to the service shortly.

The benefit of having a community test a software product/service like this is that the developers get the feedback and can go to work. Everybody wins. I look forward to their further comments on this blog post but I can say I am impressed with how fast they mobilized to fix this!


There has been a response to my post about Chemical Names and Structures here.

PMR>”For certain purposes, it is valuable to collect as many names as possible, for example for location of lookup. But these should be accompanied with metadata. A similar example is from ChemSpiderMan (ed.):

On a record view we list “Names and Synonyms”. The question marks Peter sees are for a French name shown here: Looks fine in my broswer and pasted in here too: N-{2-[({5?-[(dim�th?ylamino)m?�thyl]fur?an-2-yl}m?�thyl)sul?fanyl]�th?yl}-N’-m�?thyl-2-ni?tro�th�ne?-1,1-diam?ine. So, not junk (saying that the French name is junk would offend the Parisians). Notice that the Z- has been removed (for now) and that the name is labeled French on the record. If any of you are seeing issues in your browser let us know and we will investigate at our end.

PMR: Without the metadata giving the language, information is lost. For example what does “pain” mean? If the language is not given there is a tendency to interpret this as English. We have to acknowledge that the language of science is currently English (it wasn’t when I started and we had to read French and German papers). So RDF, for example, provides a language qualifier (e.g. @en or @fr). The addition of that qualifier transforms the information from junk to meaningful.”

First of all, it’s interesting to note that the French name has been rendered as “junk” in Peter’s blog as shown here.

This probably relates to his original comment that the name is junk in his browser too… but acceptable in mine. On the other hand, his blog post may look fine to him yet looks bad in my browser! Oh those dependencies… I see similar things show up in WordPress regularly.
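One plausible cause of this kind of “junk”, and it is only an assumption here since neither blog's actual encoding settings are known, is text written in one character encoding and read back in another. A minimal Python sketch, using a fragment of the French name:

```python
# Assumption for illustration: the bytes were written as Latin-1 but read as UTF-8.
name = "diméthylamino"

raw = name.encode("latin-1")                     # é is stored as the single byte 0xE9
garbled = raw.decode("utf-8", errors="replace")  # 0xE9 followed by 't' is invalid UTF-8
restored = raw.decode("latin-1")                 # the matching codec recovers the name

print(garbled)
print(restored)
```

The same bytes are perfectly good data under one codec and show a replacement character under the other, which is consistent with a name looking fine in one browser and broken in another.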

Peter suggests that there should be metadata giving the language information. Good idea. See my previous blog post about that particular issue and the fact that we allow curators to layer on metadata AND we capture and retain it WHEN it is available.

If you look at this record you will see that there are names labeled as Polish, German and Dutch.

Chloroprene [Wiki]

1,3-Butadiene, 2-chloro-

126-99-8 [RN]

204-818-0 [EINECS]

2-Chloor-1,3-butadieen [Dutch]

2-Chlor-1,3-butadien [German]

2-Chlorbuta-1,3-dien [German]

2-Chloro-1,3-butadiene

2-Chlorobutadiene

Chloropren [Polish]

Most labels were captured during the deposition process. One was added manually. Notice also the direct links to Wikipedia, the Registry number link to perform a search of PubChem and the link to EINECS.
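To make the idea of names carrying their labels concrete, here is a minimal Python sketch built from the chloroprene names listed above. The `Synonym` type, the helper functions and the label-to-code table are hypothetical illustrations, not ChemSpider's actual schema; the two-letter codes are the standard ISO 639-1 codes for Dutch, German and Polish.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical record type for illustration; not ChemSpider's real data model.
@dataclass(frozen=True)
class Synonym:
    name: str
    label: Optional[str] = None  # e.g. "German", "RN", "EINECS"; None when unlabeled


CHLOROPRENE = [
    Synonym("Chloroprene", "Wiki"),
    Synonym("1,3-Butadiene, 2-chloro-"),
    Synonym("126-99-8", "RN"),
    Synonym("204-818-0", "EINECS"),
    Synonym("2-Chloor-1,3-butadieen", "Dutch"),
    Synonym("2-Chlor-1,3-butadien", "German"),
    Synonym("2-Chlorbuta-1,3-dien", "German"),
    Synonym("2-Chloro-1,3-butadiene"),
    Synonym("2-Chlorobutadiene"),
    Synonym("Chloropren", "Polish"),
]

# ISO 639-1 codes for the languages that appear on this record.
LANG_CODES = {"Dutch": "nl", "German": "de", "Polish": "pl"}


def names_with_label(synonyms, label):
    """Return every name carrying exactly this label."""
    return [s.name for s in synonyms if s.label == label]


def as_tagged_literal(s):
    """Render an RDF-style literal, adding @lang only when the language is known."""
    code = LANG_CODES.get(s.label)
    return f'"{s.name}"@{code}' if code else f'"{s.name}"'


print(names_with_label(CHLOROPRENE, "German"))
print(as_tagged_literal(CHLOROPRENE[-1]))
```

Keeping the label next to the name means an unlabeled name stays visibly unlabeled instead of being silently read as English.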

As I commented in my post on ranitidine, and extracting from Peter’s post: “Notice …….. that the name is labeled French on the record.” So, what Peter suggests is already in place on ChemSpider. I display below what is presently available to curators to label the names with. Notice this includes language, EINECS numbers, CAS Registry Numbers, INNs, JANs etc.


The list of languages is easy to expand. Anybody have any requests?

A further comment: “PMR: I very much like the idea of regarding chemical names as social identifiers. But, of course, that only works for humans. The machines can aggregate the tags but they cannot make inferences from them. The problem is that when they are put into databases they lose their social context and are managed by hard boolean logic. That fails immediately and often dramatically. A major cause is the loss of metadata and authorities. In this world you cannot use voting (which is why Chempedia cannot be seen as an authority for CAS numbers, only a useful guide). We have to use authorities (provenance) in our information. Thus the statements: Ranitidine is the Z-isomer and Ranitidine is the E-isomer may be seen as contradictory. That’s why people have suggested that RDF should have quads, not triples, such as

Antony_Williams asserts ranitidine hasIsomer Z

Wikipedia asserts ranitidine hasIsomer E

Both these are true. That is the language we should use in the semantic web.

PeterMR still deliberately fails to make an assertion about this isomerism and is waiting to see what others think.”
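The quads in Peter's comment can be sketched in a few lines of Python. This is only an illustration of the idea, with a hypothetical `Quad` type and `claims` helper rather than any real RDF library: each statement carries its source, so two conflicting statements can sit side by side without either being overwritten.

```python
from collections import defaultdict
from typing import NamedTuple


# Hypothetical type for illustration: a triple plus the source asserting it.
class Quad(NamedTuple):
    source: str     # who makes the assertion (the provenance)
    subject: str
    predicate: str
    obj: str


quads = [
    Quad("Antony_Williams", "ranitidine", "hasIsomer", "Z"),
    Quad("Wikipedia", "ranitidine", "hasIsomer", "E"),
]


def claims(quads, subject, predicate):
    """Group the asserted values for one subject/predicate by who asserted them."""
    by_value = defaultdict(list)
    for q in quads:
        if q.subject == subject and q.predicate == predicate:
            by_value[q.obj].append(q.source)
    return dict(by_value)


result = claims(quads, "ranitidine", "hasIsomer")
print(result)
```

Nothing here decides which isomer is right; it only records who said what, leaving that judgement to a reader or a curator.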

This leads us into a deeper discussion about retention of metadata and authorities. We retain metadata when it is deposited or we can harvest it. Let’s consider the information below extracted from the same compound on ChemSpider:

Notice all of the

and note that they all link through to the original source of information, in this case NIOSH.

  • Appearance: Colorless liquid with a pungent, ether-like odor.

  • First Aid: Eye: Irrigate immediately; Skin: Soap wash immediately; Breathing: Respiratory support; Swallow: Medical attention immediately