Making Openness Work: An interview with Barry Bunin of Collaborative Drug Discovery

I recently had the chance to sit down with Barry Bunin to talk about his new drug discovery platform, Collaborative Drug Discovery (CDD). As you may guess from the name, he’s taking a novel approach to drug discovery. Modern drug discovery faces huge challenges because the process is economically inefficient: hundreds of millions of dollars must be spent to discover one new drug. The current model also makes it difficult to capitalize on all the interesting but not immediately drug-relevant data that’s generated along the way. Traditionally, different teams at different companies repeat much of the same work. CDD’s approach promotes collaboration instead, on the premise that companies will share information that leads to a mutual benefit, provided there’s an easy and secure way to do so. I’m delighted to share this interview with yet another company showing how openness and collaboration work for business.

WG: What was the inspiration for CDD?

BB: I wanted to do something that mattered for individual scientists and for the overall efficiency of drug discovery. It is a field with lots of redundancy and waste. Some of it is necessary, but I saw a technical and logical way to avoid most of that waste. I also wanted to do work that encouraged collaboration, which is a synonym for leverage, which in turn is a synonym for efficiency. We saw collaboration as the key to turning around the macroeconomic challenges of the drug discovery business. We wanted to create a platform whose value accrues with more software, data, and collaborations. Finally, I wanted the platform to be equally applicable to neglected-disease and commercial collaborative drug discovery. With CDD, we have the opportunity to broadly transform how research is done. That is inspiring and intellectually interesting.

How does CDD fit within the ecosystem of collaborative tools for research? Mendeley helps people share collections of research, there’s Figshare for figures, Labguru for lab management, etc.

BB: CDD is focused on biological and chemical data. Our place in this ecosystem is twofold: CDD provides a much easier system to archive and mine data typically found in Excel (or .csv or .sdf) files, and it has unique collaborative data partitioning and sharing features. On the collaborative side, it is secure yet usable within scientists’ regular workflows for sharing the range of data arising from biological experiments, from an individual experimental measurement on one object (e.g. a molecule or antibody) to batch sharing of results from high-throughput experiments.

Because CDD is web-based, one can upload almost any data from heterogeneous biological experiments into a highly configurable display, including active hyperlink readouts, in private or public Vaults. CDD has a reference module that integrates with PubMed, we have a mirror of ChemSpider, which links to over 30M molecules on the internet, including PubChem (both within CDD’s secure environment), and we have links from other databases to CDD, like John Irwin’s ZINC database. You can think of CDD as an inside-out version of SharePoint or Dropbox, except that instead of being simply a file sharing system, it is a database. CDD can attach files to objects, protocols, runs of protocols, or a secure message board, but it is much more valuable when fully adopted for R&D. An analogy we often use to explain data in a collaborative database is that it is a bit like how folks tag photos in Flickr, except that instead of saying “cute kittens”, one designates for each set of data the project(s) (private, collaborative, or public) within which one wants to selectively and securely share it. CDD has a REST-based architecture, makes it easy to upload or export data from the application, and is platform independent, so we can securely link to or from any of the examples you mention with a unique URL. CDD is interesting in that we have multiple levels of security and have passed big pharma and government audits for private data, yet it also touches on all the exciting collaborative innovation on the internet.
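Editor’s note: the Flickr-style tagging idea can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not CDD’s actual data model or API: the names `Project`, `Record`, `can_view`, and `visible_records`, the sample projects, and the sample values are all hypothetical. It assumes each record is tagged with zero or more projects, and that a user can see a record if any of its projects is public or lists that user as a member.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Project:
    """Hypothetical project tag; visibility is 'private', 'collaborative', or 'public'."""
    name: str
    visibility: str
    members: frozenset = frozenset()


@dataclass
class Record:
    """Hypothetical data record (for example one measurement on one molecule)."""
    molecule: str
    ic50_um: float
    projects: list = field(default_factory=list)


def can_view(user, record):
    """A user sees a record if any project it is tagged with is public
    or lists the user as a member (an assumption made for this sketch)."""
    return any(
        project.visibility == "public" or user in project.members
        for project in record.projects
    )


def visible_records(user, records):
    """Return the subset of records the given user is allowed to see."""
    return [record for record in records if can_view(user, record)]


# Made-up sample data for illustration only.
hdac_private = Project("HDAC program", "private", frozenset({"alice"}))
malaria_collab = Project("Malaria combinations", "collaborative", frozenset({"alice", "bob"}))
tb_public = Project("TB screening set", "public")

records = [
    Record("CMPD-1", 0.8, [hdac_private]),
    Record("CMPD-2", 12.0, [malaria_collab]),
    Record("CMPD-3", 3.5, [tb_public, malaria_collab]),
    Record("CMPD-4", 40.0, []),  # untagged: visible to no one in this sketch
]

for user in ("alice", "bob", "carol"):
    print(user, [record.molecule for record in visible_records(user, records)])
```

In this sketch the same record can carry several tags, so one compound can be public in one project and shared with named collaborators in another, which is the point of the “cute kittens” analogy.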

We often describe the sharing on Mendeley in terms of Flickr as well, except for us it’s research documents, so it’s interesting that you bring that up. How are researchers using CDD Vaults?

BB: Half a dozen representative case studies are publicly disclosed online on our case studies page.

  1. Acetylon Pharmaceuticals: Harvard spinout company uses CDD to manage academic-industry and China CRO collaborations. In 18 months since their last financing, Acetylon files an IND for their selective HDAC inhibitor, and raises a $27M Series-B Financing.
  2. The NIH Neuroscience Blueprint with 7 leading academic laboratories, 4 CROs (including CDD), and ex-pharma drug development consultants working as a “virtual pharma” to develop compounds from chemical optimization through Phase I clinical testing.
  3. The Bill & Melinda Gates Foundation $3M CDD TB Database Projects: 250 users, 58 labs, and 20 collaborations. The two-year project was extended to five years, and three projects supported by CDD were partnered with three big pharmas. The project was nominated by TB Alliance and NIAID and won the prestigious 2011 BioIT Best Practices Award.

More generally, researchers log in daily to archive, mine, and collaborate around their drug discovery data. There are both collaborative and non-collaborative use cases within scientists’ natural workflows, things like visualizations, calculations, QC, etc. The main differentiators between us and traditional, complex registration and data mining platforms are the ease of adoption and the collaborative possibilities. We focused on collaborative drug discovery, but the same approach of allowing data within natural workflows to be private, collaborative, or public can be extended to many types of research data that sit in .csv or Excel files today and would be more useful in a CDD-type collaborative database. To give you a sense of the activity levels we’re seeing, CDD has had over 32,000 logins over the last 12 months, and that number has been growing geometrically.

So that’s daily use by a committed core of people, which is where everything has to start. How much data, in terms of GBs, number of datasets, etc., does CDD host?

BB: CDD handles the range of low-throughput, medium-throughput, and high-throughput data. Unlike bioinformatics gene data, experimental drug discovery data requires people to buy expensive reagents and molecules for assays, so the numbers are not so huge. On the other hand, CDD has been broadly adopted, so even from these relatively small data sets (testing, say, ten 96-well plates in a screen), CDD has already amassed over 160,000,000 datapoints. And this is the critical data folks end up hanging their hat on for IP (composition of matter and utility patents for drugs, not just broadly available genomics data to help guide target ID). So comparing it with the vastly larger bioinformatics databases is a bit like comparing oil and diamonds. To play with the analogy a bit, they are both made of the same stuff (data, carbon), and they are both valuable, but the value per datapoint or mg is very different.

We come across heterogeneous biological data from our collaborators who work on target validation and assay development, so we do have researchers using CDD for more diverse types of biological data now. There is a joke I heard from a hardware vendor: “we made this powerful server, but if you want to use it to hold your door open, go ahead”. On a more serious note, since CDD is web-based, there are some elegant ways to use it for the summary data and to link to a server holding larger data sets than one would typically keep in CDD, such as large numbers of images, genomics data, etc. We are generally agnostic about build, buy, or partner, and if something is better for our customer to do in another technology, we let them know.

We have not put limitations on the number of data points folks can upload. If files are particularly big for the internet upload process, one can zip or gzip them. Sometimes researchers break a big sdfile up, too. Our philosophy is to architect for the long term, but to right-size performance. We don’t get complaints about performance, and performance is one of those things that are best when folks don’t notice it.
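Editor’s note: for readers who want to try the “break a big sdfile up and gzip it” route, here is a minimal Python sketch. It relies on the standard SD file convention that each record ends with a line containing only `$$$$`. The function names `split_sdf` and `write_gzipped_chunks`, the output file naming, and the placeholder records are assumptions made for illustration. This is not a CDD tool, and the placeholder records are not valid molfiles.

```python
import gzip
from pathlib import Path


def split_sdf(text, records_per_chunk):
    """Split SD file text into chunks of at most records_per_chunk records.

    Each record is assumed to end with a line containing only '$$$$'.
    """
    records, current = [], []
    for line in text.splitlines(keepends=True):
        current.append(line)
        if line.strip() == "$$$$":
            records.append("".join(current))
            current = []
    if "".join(current).strip():
        # Keep a trailing record that is missing its '$$$$' delimiter.
        records.append("".join(current))
    return [
        "".join(records[i:i + records_per_chunk])
        for i in range(0, len(records), records_per_chunk)
    ]


def write_gzipped_chunks(chunks, out_dir, stem):
    """Write each chunk to <out_dir>/<stem>_NNN.sdf.gz and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, chunk in enumerate(chunks, start=1):
        path = out_dir / f"{stem}_{number:03d}.sdf.gz"
        # newline="" writes the text exactly as given, without newline translation.
        with gzip.open(path, "wt", encoding="utf-8", newline="") as handle:
            handle.write(chunk)
        paths.append(path)
    return paths


# Placeholder records for illustration only (not real molecule blocks).
SAMPLE = "".join(f"mol{i}\n  placeholder\n\nM  END\n$$$$\n" for i in range(1, 6))
```

A call such as `write_gzipped_chunks(split_sdf(SAMPLE, 2), "upload_parts", "screen")` would write three compressed files, two records in each of the first two and one in the last. Splitting on record boundaries keeps every chunk a well-formed SD file on its own, so each part can be uploaded independently.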

Isn’t that the truth! So how many datasets have researchers using CDD Vault made publicly available?

BB: Hundreds. Years ago we passed the 1 million molecule mark. Any securely accessed private compound that happens to be in CDD Public gets a CDD number associated with it (and thus, if it doesn’t have a CDD Public number, it is more likely novel). We’ve put lots of work into the GUI to make it easy to zoom in for the microscopic view or zoom out for the telescopic view of data sets without getting overwhelmed. All related data is always just a click away, without cluttering your current view. Making an expanding universe of complex private and public data sets simply accessible is a challenging but solvable problem with the right design.

How many datasets do you expect to have? What’s the growth of the public data like?

BB: There is no limit and it is going up geometrically. The ratio tends to be about 20:1 (private to public), so that gives a sense of how much of the iceberg is below the surface.

So what motivates researchers to make this data available?

BB: It is similar to what motivates researchers to publish papers. It is part of what makes us human, the wish to engage with others. It is worth mentioning that even with IP and patents, researchers do release sensitive compound and biodata. CDD is very analogous, except that since folks also use it for private data, it includes the negative results as well as the few positive ones that typically see the light of day.

We often describe this, as Newton did, as “standing on the shoulders of giants”. Have there been any success stories of researchers using this data?

BB: Many. Some of the public ones we or others have published on are in the malaria and TB space. In 2009, we had collaborators identify both new compounds and known drugs that reverse resistance to chloroquine when tested as combinations in human red blood cells with resistant strains of malaria. Most commercial applications go unmentioned, but one spinout of Harvard that has worked with CDD from day one raised $27M to develop selective HDAC inhibitors for cancer. We have half a dozen case studies listed online, including a campus-wide deployment at UCLA, an MM4TB consortium with two big pharmas, and a number of others.

As I’m sure you know, there’s recently been quite a lot of interest in data citation. Are you planning to present the metadata describing each dataset in a structured format, so that it can be cited properly by researchers and so that the publishers of the data can get their deserved credit for making it available? And what sort of provenance information do you have on the datasets?

BB: All data in CDD Public (even data in private Vaults, for that matter) has an audit trail for attribution regarding who uploaded or shared what, when, and with whom. Every public data set has not only a point of contact but also a description of the data. With CDD, researchers can statistically explore and mine the data with our tools, and we also give any researcher or organization that shares data publicly an opportunity to do a guest post on our blog. For added color, we do CDD spotlight interviews of researchers and their research, like you’re doing here. CDD URLs are unique and permanent for references. We have many examples of researchers referencing CDD in their papers, and upon request we can make custom landing pages at our own expense if folks wish to give people a place to find just their group’s data in CDD. We’ve done this for William Scott’s unique work at IUPUI with undergrads synthesizing small peptide mimetics that others can use, and we’ve done this for the useful PDSP database from Bryan Roth’s group at UNC, with more than 47,000 Ki values against 699 GPCRs (including reference ligand, tissue source, species, reference, and PubMed link). The GPCR data set is particularly interesting since historically about 40% of all drugs hit GPCRs.

All protocols and runs have places for procedures and hyperlinks to documents with meta-information. This means that the value of the data goes up as a function of all the data sets shared by the community, but as with the internet, one must use one’s own filters to judge the information. All data is attributed, and some of it is more than just information, such as molecules that can be purchased from vendors for testing hypotheses.

Most of your services are designed around these private Vaults, which contain SAR data and such, and the main thing you want researchers to know is that their data is as safe in these Vaults as if they were storing it locally, right?

BB: Yes, I couldn’t have said it better myself. Securely archiving data in the CDD Vault is the foundation for all the other cool, useful things folks can do with data in CDD. Paradoxically, the more you give researchers data partitioning, security, and selective sharing capabilities, the more often they share or collaborate (privately or publicly). The reason is that half the barrier is simply archiving the data, not just the question of sharing that everyone focuses on. At the same time, since the data is on the web, you’re also making it easier for researchers to share it.

So that is an interesting finding, which suggests that privacy or competitive concerns aren’t the only reasons people don’t share. Part of it is just that the tools aren’t yet good enough. Could secure sharing start to encourage more open sharing as well?

BB: An entrepreneur’s job is to bring the real world closer to the ideal (or ideally efficient) world. New technologies, services, and processes are the efficiency catalysts. For drug discovery, in an ideal world (from a rate of learning and research progress perspective), all information would be shared instantaneously and progress would accelerate. However, in the real world, researchers in many places look to develop proprietary IP to demand a premium. It’d be nice to accelerate the learning from the losses, in aggregate across our industry, to make everyone more efficient, while maintaining the investment (and the focus that the profit motive enables) for the winners. So those are some provocative thoughts to get folks thinking about “what is and what should be”. CDD is unusual in that we’ve found a sustainable business model that facilitates new modes of collaboration, of openness, even of thinking about progress in the internet era, and that is in harmony with the traditional, proprietary model. We are still tied to the almighty dollar, but we’ve been able to push the envelope in our space around collaborative workflows thanks to grants from organizations like the Gates Foundation, the NIH, and the EU. Also, our stakeholders have some unusual (for a for-profit company) values and the patience to think long term, which is really rare and fortunate and a blessing. In terms of our values, we would rather do something that in the foreseeable future benefits everyone else more than our company. Of course, in the long run it also benefits the company, so we’re not just being nice. The airlines are a good example of this type of business: other than Southwest, they just break even, yet they provide a huge multiplier effect for society.

One of our board members, Alpheus Bingham, once wisely remarked to me that he doesn’t want money in the company mission statement. Of course every company needs to make money, but the key is to create enough value for society to allow you to keep some. So beyond all the philosophy and values, it leads to a pragmatic solution for private data, with optional mechanisms to partition or share data that encourage greater openness (while providing freedom, privacy, security, etc.) through advances in technology and services. Today, CDD Public is free for any data shared with the public. Essentially, we “only” charge for privacy in the CDD Vaults.

How do you see CDD supporting open science in the future?

BB: We’ll never force folks to be open or to do any experiments, but the platform gently guides researchers to do better and more collaborative experiments (to be efficient), while always giving the user the freedom to be as open or closed as they wish. Scientists can have some data closed, say for a cancer project, and other data open for a malaria project. The breakthrough is that anyone can look at private, collaborative, and/or public data together, with fine-grained control of each molecule, IC50, file, etc. CDD balances the need for privacy with the efficiency of working more openly, or collaboratively, as we call it. When I started the company, I was going to call it Open Source Drug Discovery (OSDD), but that was too radical for the mainstream market to adopt, so instead it is Collaborative Drug Discovery (CDD), to support the broader range of private, collaborative, and open use cases.

Thanks so much, Barry! It’s inspiring to see these examples of how enlightened self-interest can align with openness through the creation of better tools for scientists.