Digging For Big Data Gold: Data Mining As A Route To Drug Development Success
By Suzanne Elvidge, contributing editor
The term Big Data is on everyone's lips, from retailers to healthcare providers. But what actually is it, and how can it help biopharma R&D? There are many definitions of Big Data, but perhaps the simplest is a collection of data that is larger than about one terabyte and/or too big to handle using standard software and analytical processes.
Big Data is becoming a major part of all facets of healthcare as physicians' notes on patients, test results, prescription records, and even imaging results (e.g. X-rays, MRIs) are being included in electronic medical records (EMRs). There are many electronic public databases that are part of biobanks and national healthcare studies. In addition, more and more clinical trial results and other drug development and approval documents are being stored electronically.
Picking Out The Nuggets Of Data
Drug development costs are skyrocketing, yet attrition rates continue to climb, with drugs still failing in late-stage clinical trials. Accessing this treasure trove of Big Data could help by improving compound selection and refining clinical trials. But how to find the gold among the dross?
"The challenge of Big Data is the number of combinations of factors involved. For example, if you have 1,000 patients, you could have 1,000 genomes, 1,000 sets of comorbidities, 1,000 phenotypes — the list could go on," says Steve Gardner, partner at Biolauncher, a United Kingdom-based biopharma consultancy. It's situations like this where data mining, which uses software algorithms to analyze and summarize the data, comes to the rescue. GenoKey, which provides analytics solutions for healthcare Big Data, has developed an array-based technology to solve very large combinatorial problems in case-control data, finding patterns in the data using massively parallel GPU processing.
NuMedii, a start-up based in Menlo Park, CA, is using data mining to correlate disease information and drug data to predict drug efficacy. The company's database includes billions of points of disease, pharmacological, and clinical data, and it mines this using network-based algorithms. This should de-risk the development process, increasing the chance of drugs making it through to the market.
The U.K. start-up MedChemica is at the core of a collaboration designed to speed drug development by mining shared precompetitive data while maintaining the security of each individual partner's intellectual property. As Hans-Joachim Boehm, head of small molecule research at Roche (one of MedChemica's collaborators), explains, the driver behind the collaboration was that many companies have a lot of preclinical data, but the challenge is how to analyze it and make practical use of it.
"Drug development is an iterative process, and you learn at each stage. You start with a target and a molecule that hits the target. You then characterize the interaction and the molecule, find out what the activity and the issues are, and then make modifications, creating a new molecule. Then you start the process again," Boehm says.
This is a time-consuming process that generates a lot of data. The collaboration, based on MedChemica's matched molecular pair analysis technology, aims to make it more efficient, using existing information to reduce the number of steps between hit and candidate. MedChemica's algorithms mine the partners' databases of molecules generated during the iterative process to find pairs that are very closely matched. The software then analyzes the differences between the in vitro data from the pairs of molecules and maps these to the structural changes in the molecules. The output from the analysis is then used to create rules that can be applied to virtual molecules to predict the impacts of similar structural changes. When drugs fail at a late stage of development, it's generally because of safety issues, so toxicity data is particularly valuable for "designing out" issues at a much earlier stage.
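The general idea behind matched molecular pair analysis can be sketched in a few lines of Python. This is only an illustrative toy, not MedChemica's software: the molecules are reduced to a hypothetical (core, substituent, measured value) record, the numbers are invented, and the names `MOLECULES`, `derive_rules`, and `predict` are assumptions made for the example. Molecules that share a core but differ in one substituent are treated as a matched pair, the average change in the measured value becomes a "rule," and the rule is then applied to a virtual molecule.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

# Hypothetical toy records: (shared core, substituent, measured in vitro value).
# The values are invented purely for illustration.
MOLECULES = [
    ("core1", "H", 5.0),
    ("core1", "F", 5.6),
    ("core2", "H", 6.1),
    ("core2", "F", 6.5),
    ("core3", "H", 4.0),
    ("core3", "Cl", 4.9),
]

def derive_rules(molecules):
    """Map each (from_substituent, to_substituent) change to its mean value shift."""
    by_core = defaultdict(list)
    for core, substituent, value in molecules:
        by_core[core].append((substituent, value))

    deltas = defaultdict(list)
    for members in by_core.values():
        # Two molecules with the same core and different substituents
        # form a matched pair; record the shift in both directions.
        for (sub_a, val_a), (sub_b, val_b) in combinations(members, 2):
            if sub_a != sub_b:
                deltas[(sub_a, sub_b)].append(val_b - val_a)
                deltas[(sub_b, sub_a)].append(val_a - val_b)
    return {change: mean(values) for change, values in deltas.items()}

def predict(rules, known_value, change):
    """Estimate a virtual molecule's value by applying a learned shift."""
    return known_value + rules[change]

rules = derive_rules(MOLECULES)
# Virtual molecule: an H-substituted compound measured at 7.0, with H swapped for F.
estimate = predict(rules, 7.0, ("H", "F"))
print(f"Mean shift for H to F: {rules[('H', 'F')]:.2f}; predicted value: {estimate:.2f}")
```

With only two H-to-F pairs behind it, the rule in this toy is fragile, which illustrates why such an approach is described as data-hungry: each rule is only as reliable as the number of pairs that support it.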
"We originally created the matched molecular pairs technology at AstraZeneca. However, this is a very data-hungry process, and we realized that there just wouldn't be enough data in any one individual company. MedChemica was formed as a neutral intermediary with the idea of bringing multiple companies together and acting as the hub of the consortium, and AstraZeneca strongly bought into this opportunity," says Al Dossetter, founder and managing director of MedChemica. "AstraZeneca has been joined by Roche/Genentech, and the database contains around 1.2 million data points so far. However, the more data there is to mine, the better the results will be."
The consortium is open to other large biopharma companies, and discussions are ongoing. As a consortium, all partners have a say and can suggest where additional data could improve the dataset overall, even agreeing to share costs where further testing would be advantageous or to match additions of data with equivalent amounts. There will be no "reach-through" claims or tiebacks for any molecules generated as a result of the collaboration. "More companies will create bigger databases and, therefore, better rules. This should be synergistic rather than additive," says Boehm.
There will also be opportunities for collaborations with academia. The benefits of these will be two-way — for both the academic researchers and the science behind the database. "We plan to have an online tool available by the end of 2013. This could give academia and small companies access to the technology on a pay-as-you-go basis. This would support research and provide us with another revenue stream," says Dossetter.
As with all precompetitive collaborations, security is an important issue. However, Boehm is reassuring, saying, "The beauty of the collaboration is that the data is extracted and analyzed in such a way that we share the rules but not the structures of the molecules. Many companies are recognizing the advantages of precompetitive collaborations, and I expect to see more in the future. I look forward to seeing what comes out of this collaboration. It could be a big step in drug development."
NextBio has created a database with billions of data points from a range of different types of information, such as genomic, proteomic and metabolomic data, molecular profiles, and clinical trial results from public and private databases, as well as clinical data from individual patients. The company analyzes the data using its proprietary algorithms.
"One of the drivers for the advances in Big Data in healthcare research is the improved efficiency in producing molecular profiles, as sequencing costs are falling," says Saeid Akhtari, cofounder, president, and CEO of NextBio. "Each patient whose data is added to the system makes it smarter."
Cutting Through The Rock Face: The Big Data Challenges
Data mining and Big Data bring with them many challenges. One of the biggest is ensuring the consistency of data that can come from many sources. As Akhtari explains, consistency matters because it reduces the risk of false positives, and this is the point where a human touch can be essential for quality control. "There are many data repositories worldwide containing a lot of heterogeneous data. This data has to be standardized and indexed to be searchable, and results from queries need to be returned in real time via an intuitive interface to enable scientists to continue their research," Akhtari adds.
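What "standardized and indexed" might mean in practice can be shown with a minimal Python sketch. It is not NextBio's system: the field names, the alias table `FIELD_ALIASES`, and the sample records are all hypothetical. Records from differently formatted sources are first mapped onto one shared set of field names and value spellings, and then an inverted index is built so that a query becomes a direct lookup rather than a scan of every record.

```python
from collections import defaultdict

# Hypothetical mapping from source-specific field names to one shared schema.
FIELD_ALIASES = {
    "Gene": "gene",
    "gene_symbol": "gene",
    "Disease": "disease",
    "dx": "disease",
}

def standardize(record):
    """Rename fields to the shared schema and normalize value spelling."""
    cleaned = {}
    for key, value in record.items():
        field = FIELD_ALIASES.get(key, key.lower())
        cleaned[field] = str(value).strip().lower()
    return cleaned

def build_index(records):
    """Build an inverted index: (field, value) -> set of record positions."""
    index = defaultdict(set)
    for record_id, record in enumerate(records):
        for field, value in record.items():
            index[(field, value)].add(record_id)
    return index

# Invented records from two imaginary repositories with different conventions.
raw_records = [
    {"Gene": "BRCA1 ", "Disease": "Breast cancer"},
    {"gene_symbol": "brca1", "dx": "breast cancer"},
    {"gene_symbol": "TP53", "dx": "Lung cancer"},
]

records = [standardize(r) for r in raw_records]
index = build_index(records)

# Once standardized, both BRCA1 records are found by a single lookup.
hits = index.get(("gene", "brca1"), set())
print(sorted(hits))
```

Without the standardization step, the first two records would be indexed under different keys and a search for one spelling would silently miss the other, which is the kind of inconsistency that curation is meant to catch.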
As Boehm explains, this isn't always as easy as it looks: "There have been some interesting papers on how to analyze Big Data, but when you look closely, you realize it takes a huge amount of curation and isn't necessarily scalable. What's possible on a thousand records won't necessarily work on millions. What's needed is a way to build compatible and well-annotated databases and analyze the databases using processes that can be scaled up."
Textual information makes up the bulk of the information generated by the biopharma industry, and one of the exciting possibilities for data mining would be to be able to link this with the other available information and analyze it. However, as Gardner explains, this has its own issues. "Analyzing text is challenging because so many meanings of words are changed by their context. You can't assume that two people using the same word will necessarily mean the same thing. It will be necessary to resolve issues at a very detailed level."
It is also important to know the data well, as this will influence how it is searched and analyzed and the quality of the outputs. Understanding the data also has an impact on the questions asked of the data. "For example, do you know the context in which the data was discovered? Have patients been diagnosed using a specific methodology or were they self-diagnosed? Were they given the same treatment protocol or even the same dose? Were the endpoints the same?" asks Gardner.
Another key challenge is data security. This is important both for patients and drug developers. "Data security and patient privacy is critical. We remove identifiers to protect privacy and store data in a private cloud to ensure it is secure and to provide confidence for our clients," says Akhtari.
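Removing identifiers can be illustrated with a short Python sketch. It is a simplified assumption of how such a step could look, not a description of NextBio's process, and it falls well short of a full de-identification standard: the field list `DIRECT_IDENTIFIERS`, the record layout, and the key are all hypothetical. Direct identifiers are dropped, and the patient ID is replaced with a keyed hash so that records for the same patient can still be linked without exposing the original ID.

```python
import hashlib
import hmac

# Hypothetical set of direct-identifier fields to drop from every record.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email"}

def deidentify(record, secret_key):
    """Drop direct identifiers and replace the patient ID with a keyed pseudonym."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "patient_id"
    }
    # A keyed hash (HMAC) gives the same pseudonym for the same patient,
    # but cannot be recomputed by anyone who lacks the secret key.
    digest = hmac.new(secret_key, str(record["patient_id"]).encode("utf-8"), hashlib.sha256)
    cleaned["pseudonym"] = digest.hexdigest()[:16]
    return cleaned

# Invented example record.
record = {
    "patient_id": 1042,
    "name": "Jane Doe",
    "email": "jane@example.org",
    "diagnosis": "asthma",
    "age": 54,
}

safe = deidentify(record, secret_key=b"replace-with-a-real-secret")
print(sorted(safe))
```

Fields such as age or diagnosis can still identify someone in combination, so a step like this would be one layer among several rather than a complete answer to patient privacy.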
The Future Of Big Data And Data Mining: The Route To The Mother Lode
If these challenges can be resolved and large sets of data (e.g. drug information, FDA-approval documentation, patents) can be combined successfully, then the future of Big Data and data mining could be very exciting. "The future of data mining, we believe, is in making data available to the community and connecting stakeholders," says Akhtari.