Bioinformatics: Finding life-saving needles in enormous data haystacks
Kyle Bone Oct 13, 2022
All scientific discoveries depend on one thing: data. These facts and statistics, produced through countless hours of lab and clinical work, are the backbone of our understanding of any scientific field. This certainly holds true for the musculoskeletal research performed at the Indiana Center for Musculoskeletal Health. Yet, despite its vital importance, data in and of itself is of little value. The analysis and interpretation of the data are where the true heart of scientific discovery lies.
The types and amount of data researchers can produce continue to grow exponentially, as advancements in technology open doors previously inaccessible. And while these ever-expanding data sets are undoubtedly advantageous, the sheer amount of information does produce new challenges. How best to store, organize, and analyze the mountains of information is a top-of-mind concern for most biological scientists. A primary solution to these challenges comes from a key tenet of computer science: informatics.
Informatics is fundamentally the integration and analysis of large amounts of diverse data. In a medical setting, this is referred to as health informatics or bioinformatics. The goal of bioinformatics is to develop tools and techniques that enable researchers and healthcare providers to analyze and use the data available to them to improve human health and the delivery of healthcare services. Understanding the immense impact this field can have on musculoskeletal research, the ICMH recently welcomed our first Bioinformatics Director, Gang Peng, Ph.D.
Gang Peng is a Doctor of Bioinformatics and Biostatistics and joins the ICMH after spending the past seven years at the Yale School of Medicine in New Haven, Connecticut. Dr. Peng will be tasked with developing a robust bioinformatics unit to collaborate with the world-class musculoskeletal researchers who make up the ICMH membership.
When asked what exactly medical bioinformatics is, Dr. Peng likened the field to finding needles in data haystacks. However, these needles are often life-saving or life-changing, and the data haystacks are often unimaginably large. The type and source of these datasets are highly variable and depend on the project that produced them. Dr. Peng’s goal is to integrate biological data from a given project and find meaningful insights from it. This process can prove immensely difficult but is undoubtedly worth the effort.
Dr. Peng’s early work focused on a rare genetic disorder called Li-Fraumeni Syndrome. This disease is caused by a single mutation in the TP53 gene and results in an increased risk of the carrier developing cancer. In fact, females born with this gene mutation have a nearly 100% chance of developing breast cancer. Because Li-Fraumeni Syndrome creates such a high cancer risk, diagnosing it early in a person’s life allows healthcare providers to monitor that patient closely and detect any cancerous formations early. This in turn leads to greater success rates in treatment.
There are several ways to detect this mutation in the TP53 gene, but all of them have limitations and inaccuracies. With the goal of improving upon these limitations, Dr. Peng began looking at large amounts of family history data. The idea was that, given the hereditary nature of the disease, studying family history data would allow physicians to infer which individuals were likely to be carriers of the mutation. This approach proved very accurate and can be used as a highly successful early detection method, leading to life-saving interventions for those afflicted. This method has been implemented in many commercial software programs, such as Progeny, PenRad, and CancerGene Connect. These software programs have been widely used at over 1,600 sites in more than 60 countries.
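The family-history idea described above can be illustrated with a small Bayesian sketch. It is not Dr. Peng’s method or the model used in any of the software named here; it is a minimal two-person example (one parent, one child) under loudly simplified assumptions: a hypothetical population carrier frequency, hypothetical cancer risks for carriers and non-carriers, no information about the other parent, and no new (de novo) mutations. The function name and all numbers are made up for illustration.

```python
from itertools import product


def child_carrier_probability(parent_affected, child_affected,
                              prior=0.001,          # assumed population carrier frequency
                              pen_carrier=0.5,      # assumed P(cancer | carrier)
                              pen_noncarrier=0.01): # assumed P(cancer | non-carrier)
    """Posterior probability that the child carries the mutation,
    given whether the parent and the child have had cancer."""

    def phenotype_likelihood(is_carrier, affected):
        # Probability of the observed cancer status given carrier status.
        p = pen_carrier if is_carrier else pen_noncarrier
        return p if affected else 1.0 - p

    joint = {}
    # Enumerate every combination of (parent carrier?, child carrier?).
    for parent, child in product([0, 1], repeat=2):
        p_parent = prior if parent else 1.0 - prior
        # Mendelian transmission: a carrier parent passes the mutation on
        # with probability 0.5; a non-carrier parent cannot (assumption).
        p_child_is_carrier = 0.5 if parent else 0.0
        p_child = p_child_is_carrier if child else 1.0 - p_child_is_carrier
        joint[(parent, child)] = (
            p_parent
            * p_child
            * phenotype_likelihood(parent, parent_affected)
            * phenotype_likelihood(child, child_affected)
        )

    total = sum(joint.values())
    # Sum over the parent's status to get the child's carrier probability.
    return (joint[(0, 1)] + joint[(1, 1)]) / total


neither = child_carrier_probability(parent_affected=False, child_affected=False)
parent_only = child_carrier_probability(parent_affected=True, child_affected=False)
both = child_carrier_probability(parent_affected=True, child_affected=True)
print(neither, parent_only, both)
```

Even in this toy setting, an affected parent raises the child’s estimated carrier probability, and an affected parent plus an affected child raises it much further. A real pedigree model would consider many relatives, ages at diagnosis, and cancer types at once.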
Dr. Peng’s projects since his work on Li-Fraumeni Syndrome highlight not only the potential of health informatics but also the flexibility of the approach. While at Yale, he worked extensively on newborn screening. This very successful public health service aims to detect hereditary disorders in infants within 24-48 hours of birth. This is accomplished by taking a blood sample and analyzing metabolites associated with certain diseases. Put simply, certain metabolites are indicators of gene mutations. In turn, these gene mutations are known factors in various diseases.
Typically, the newborn screening process looks at 1-2 specific metabolites associated with each of 40-50 diseases. Unfortunately, the approach is siloed, meaning the metabolites associated with each disease are evaluated separately and independently to determine if an indicator of the specific gene mutation is present. This method can be inaccurate and result in many false positives. However, with such a vast amount of data available from the millions of newborn screenings performed each year, Dr. Peng saw a way to greatly improve the accuracy of these analyses.
By combining the data sets for every disease, he was able to create a program in which all metabolites, for all 40-50 diseases, could be examined simultaneously and comprehensively. This gave a much more complete picture of the newborn’s biology, and the diagnostic accuracy of the screenings improved tremendously. It also allowed other data (such as height, weight, and ethnicity) to be integrated into the analysis.
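The difference between a siloed check and an integrated one can be sketched in a few lines. This is only a toy illustration, not the program described above: the marker names, cutoff, weights, threshold, and the four synthetic samples are all hypothetical, and a hand-weighted sum stands in for whatever model a real screening program would fit to its data.

```python
def siloed_flag(sample, cutoffs):
    """Flag the sample if any single marker exceeds its own cutoff."""
    return any(sample[marker] > cutoff for marker, cutoff in cutoffs.items())


def integrated_flag(sample, weights, threshold):
    """Flag the sample based on a combined score over all markers."""
    score = sum(weight * sample[marker] for marker, weight in weights.items())
    return score > threshold


def count_false_positives(samples, flag):
    """Count unaffected samples that the given rule flags anyway."""
    return sum(1 for s in samples if flag(s) and not s["affected"])


# Synthetic, made-up values purely for illustration.
samples = [
    {"marker_a": 5.5, "marker_b": 1.0, "affected": False},  # marker_a high on its own
    {"marker_a": 6.0, "marker_b": 4.0, "affected": True},
    {"marker_a": 3.0, "marker_b": 1.2, "affected": False},
    {"marker_a": 5.2, "marker_b": 3.8, "affected": True},
]

cutoffs = {"marker_a": 5.0}                    # hypothetical single-marker cutoff
weights = {"marker_a": 1.0, "marker_b": 1.0}   # hypothetical hand-picked weights
threshold = 8.0                                # hypothetical combined threshold

siloed_fp = count_false_positives(samples, lambda s: siloed_flag(s, cutoffs))
integrated_fp = count_false_positives(
    samples, lambda s: integrated_flag(s, weights, threshold)
)
print(siloed_fp, integrated_fp)
```

In this toy data the single-marker rule flags one unaffected newborn, while the combined rule flags only the two affected ones, because it also sees that the second marker is not elevated.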
About 1 in 200 newborns is flagged for a genetic disease through this screening process. Early detection can lead to treatment, and in many cases having this treatment outlook from birth is the key to the child leading a normal life. Between 100,000 and 200,000 newborns are saved every year in the U.S. thanks to this process, one made even more powerful with the application of informatics.
As Dr. Peng establishes the bioinformatics unit within the ICMH, he will begin by collaborating with many of the investigators within the Center. Given that his previous work focused on clinical applications, working with basic and translational researchers will provide a new and exciting outlet for his work. One area that is of great interest to the scientific community is the spatial distribution of particular cell types within a tissue or a tumor in three dimensions.
Mapping the spatial distribution of cells is proving to be important in following the progression and treatment of any disease. For example, in non-small cell lung cancer, immunotherapy is often a good treatment option. However, the success of this treatment is highly dependent on the distribution of immune cells within the cancer tissue. Researchers can use samples to take microscopic pictures of cancer cells and label them to map out their distribution in three dimensions. This data can then be integrated and scored. When these scores are compared across hundreds or thousands of samples, certain insights reveal themselves. These scores can tell researchers and clinicians which tumors are more responsive to the treatment and which ones are not. Ultimately, the hope is that this knowledge of the three-dimensional localization of specific cell types within a tumor or a diseased tissue can guide the treatment toward the most effective options.
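One simple way to picture such a score is sketched below. The article does not say how the scores are defined, so this is an assumption made purely for illustration: the score is the fraction of tumor cells that have at least one immune cell within a chosen radius, computed from hypothetical three-dimensional cell coordinates. The function name, coordinates, and radius are all made up.

```python
import math


def infiltration_score(tumor_cells, immune_cells, radius):
    """Fraction of tumor cells with at least one immune cell within `radius`.

    Cells are (x, y, z) coordinate tuples in the same units as `radius`.
    """
    if not tumor_cells:
        return 0.0
    near = sum(
        1
        for tumor in tumor_cells
        if any(math.dist(tumor, immune) <= radius for immune in immune_cells)
    )
    return near / len(tumor_cells)


# Hypothetical coordinates for illustration only.
tumor = [(0, 0, 0), (10, 0, 0), (0, 10, 0), (10, 10, 10)]
mixed_in = [(1, 0, 0), (10, 1, 0)]   # immune cells sitting among the tumor cells
kept_out = [(50, 50, 50)]            # immune cells far from the tumor

score_mixed = infiltration_score(tumor, mixed_in, radius=2.0)
score_excluded = infiltration_score(tumor, kept_out, radius=2.0)
print(score_mixed, score_excluded)
```

A sample where immune cells sit among the tumor cells scores higher than one where they are kept at a distance. Comparing a number like this across many samples is the kind of step at which patterns tied to treatment response could start to show.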
Cell structure is not only important in cancers, however. This type of data analysis can be key to understanding many musculoskeletal disorders as well. For his part, Dr. Peng is excited to apply his expertise in uncovering these insights. Due to the broad, multidisciplinary nature of musculoskeletal research, an almost limitless array of biological data can be produced. Dr. Peng is confident his work can act as an invaluable tool for the understanding of this data.
In addition to collaborating with biological researchers, Dr. Peng also leads many of his own projects. Often, interesting results or challenges found in the various projects he is supporting will raise new informatics questions Dr. Peng would like to see answered. Following these lines of inquiry in his own work leads to greater success in his collaboration with biological scientists, creating a synergistic process.
That doesn’t mean applying bioinformatics to the musculoskeletal research being performed at the ICMH will be without its challenges. With such a rich source of data to work with, Dr. Peng stresses it’s important that he is always very careful in his analysis and validates all findings. He says, “There are many different factors that can affect the findings we get. Different processes used in the labs to generate the data can produce different results. Sometimes multiple conclusions can be drawn from the data and it’s difficult to determine which, if any, are accurate. Sometimes we find something new that wasn’t known in biology before, and we must validate and discern if it’s real or an issue with the data analysis.”
This brings us back to our original analogy: With immense quantities of biological data produced in the lab, finding valid, useful, and novel insights can be as daunting as finding a single needle in a mountain of hay. However, as Dr. Peng’s previous work proves, it is not only possible but also extremely valuable. We are confident this value will become apparent in the musculoskeletal research being performed at the ICMH in the near future.
To learn more about Dr. Peng’s research, please find his publications below.
Peng, G., Bojadzieva, J., Ballinger, M.L., Li, J., Blackford, A.L., Mai, P.L., Savage, S.A., Thomas, D.M., Strong, L.C. and Wang, W., 2017. Estimating TP53 Mutation Carrier Probability in Families with Li–Fraumeni Syndrome Using LFSPRO. Cancer Epidemiology, Biomarkers & Prevention, 26(6), pp.837-844.
Peng, G., Fan, Y., Palculict, T.B., Shen, P., Ruteshouser, E.C., Chi, A.K., Davis, R.W., Huff, V., Scharfe, C. and Wang, W., 2013. Rare variant detection using family-based sequencing analysis. Proceedings of the National Academy of Sciences, 110(10), pp.3985-3990.
Peng, G., Shen, P., Gandotra, N., Le, A., Fung, E., Jelliffe-Pawlowski, L., Davis, R.W., Enns, G.M., Zhao, H., Cowan, T.M. and Scharfe, C., 2019. Combining newborn metabolic and DNA analysis for second-tier testing of methylmalonic acidemia. Genetics in Medicine, 21(4), pp.896-903.
Peng, G., Tang, Y., Cowan, T.M., Enns, G.M., Zhao, H. and Scharfe, C., 2020. Reducing false-positive results in newborn screening using machine learning. International Journal of Neonatal Screening, 6(1), p.16.
Peng, G., Tang, Y., Gandotra, N., Enns, G.M., Cowan, T.M., Zhao, H. and Scharfe, C., 2020. Ethnic variability in newborn metabolic screening markers associated with false‐positive outcomes. Journal of Inherited Metabolic Disease, 43(5), pp.934-943.
Peng, G., Wilson, R., Tang, Y., Lam, T.T., Nairn, A.C., Williams, K. and Zhao, H., 2019. ProteomicsBrowser: MS/proteomics data visualization and investigation. Bioinformatics, 35(13), pp.2313-2314.
Peng, G., Chai, H., Ji, W., Lu, Y., Wu, S., Zhao, H., Li, P. and Hu, Q., 2021. Correlating genomic copy number alterations with clinicopathologic findings in 75 cases of hepatocellular carcinoma. BMC Medical Genomics, 14(1), pp.1-9.