Research
As the lead bioinformatician on several NSF-funded Plant Genome Projects at Penn State, I developed bioinformatics pipelines and visualization tools for genome annotation and analysis in both model organisms and evolutionarily important non-model species. I helped plan and write NIH and NSF grant proposals and generated preliminary analyses for them. I trained and supervised dozens of undergraduate, graduate, and post-doctoral students in bioinformatics, database management, data science, and web development. I was also involved in the annotation, data coordination, data analysis, and publication of 3 publicly funded and sequenced plant genomes (poplar, papaya, Selaginella). This work helped identify some of the early genome duplications and early diversification events in flowering plants. I have skills in many areas of bioinformatics and have co-authored papers on databases and web tools, plant genomes, miRNA discovery, gene family studies, and tools to analyze next-generation sequencing (NGS) data, including statistical simulations to understand sequencing coverage.
Since 2009, I have worked in industry for BASF in various roles (bioinformatics, data science, digitalization) across both research and manufacturing, and I have helped lead the digital transformation in several key areas. I have been responsible for data management, analysis, and integration of output from systems biology, comparative genomics, and bioinformatics to study genes involved in plant development, regulation, and metabolism. I led the development of pipelines that analyze NGS data to annotate and characterize plant genomes. I also led teams that developed methodology, software, and databases to identify gene regulatory elements, coordinated gene family analysis, and integrated omics data from various projects, sources, and technologies to support gene nomination for field trials. Before Penn State, I worked at an internet start-up company focused on music technology business applications, where we designed, developed, and deployed significant components of a business-to-business digital media distribution system.
Although I have continued my academic research during my industry tenure, I am now in a position to dedicate more time to academic research. As a Visiting Assistant Professor at my alma mater LSU, I have the bioinformatics expertise, project management skills, motivation, and time to successfully support the proposed research project. I am also an Adjunct Professor in the Chemical Engineering Department at LSU, where we are using machine learning algorithms to predict quality and production soft sensors, predict anomalies in equipment sensor datasets, build dynamic dashboards for quality and production data management, and use unsupervised clustering to approximate manufacturing operating conditions.
In my research at Penn State University, I developed many bioinformatics pipelines, databases, and visualization tools for genome annotation and analysis. I built tools for transcriptomics research using expressed sequence tags (ESTs) and was involved in the early use of next-generation sequencing (NGS) technologies for transcriptomics. Many early questions around transcriptome sequencing needed to be addressed, including the sequencing depth and breadth of studies (ESTstat). I developed an approach to build automated gene family phylogenies that allowed our projects to automatically sort transcript sequences into putative gene families (PlantTribes). We used these families to publish many gene family papers on some of the most important plant transcription factor families (MADS-box, APETALA2-like) as well as other important gene families. I am currently collaborating with Professors Alyssa Johnson and Adam Bohnert in the Biological Sciences Department at LSU, using RNA-seq in D. melanogaster and C. elegans to study differentially expressed genes under various biological conditions. We have developed pipelines to look for enriched gene sets (KEGG, GO, etc.), to identify co-expressed gene networks, and then to look for conserved non-coding regulatory elements in such networks.
- Cui L, Veeraraghavan N, Richter A, Wall K, Jansen RK, Leebens-Mack J, Makalowska I, dePamphilis CW. ChloroplastDB: the chloroplast genome database. Nucleic Acids Res. 2006;34(Database issue):D692-6.
- Wall K, Leebens-Mack J, Müller K, Field D, Altman N, dePamphilis CW. PlantTribes: A gene family database for comparative genomics in plants. Nucleic Acids Res. 2008;36(Database issue):D970-6.
- Wall K, Leebens-Mack J, Chanderbali A, Barakat A, Wolcott E, Liang H, Landherr L, Tomsho L, Hu Y, Carlson J, Ma H, Schuster S, Soltis D, Soltis P, Altman N, dePamphilis C. Comparison of next generation sequencing technologies for de novo transcriptome characterization. BMC Genomics. 2009;10:347.
- Shan H, Zahn L, Guindon S, Wall PK, Kong H, Ma H, dePamphilis CW, Leebens-Mack J. Evolution of plant MADS box transcription factors: evidence for shifts in selection associated with early angiosperm diversification and concerted gene duplications. Mol Biol Evol. 2009 Oct;26(10):2229-44. Epub 2009 Jul 3.
Genome Annotation
I was involved in the annotation, data coordination, data analysis, and publication of two publicly funded and sequenced plant genomes (poplar and papaya). I was also involved in the early analysis of both Amborella and Selaginella, which laid the groundwork for their selection by DOE for genome sequencing. We used an early version of the PlantTribes database, built on the first two sequenced plant genomes (Arabidopsis and rice), to automate gene family phylogenetic analysis for poplar, papaya, and Selaginella.
- Ming et al. Genome of the transgenic tropical fruit tree papaya (Carica papaya L.). Nature. 2008;452(7190):991-6.
- Tuskan et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray ex Brayshaw). Science. 2006;313(5793):1596-604.
- Soltis D, Albert V, Leebens-Mack J, Wing R, dePamphilis C, Ma H, Carlson J, Altman N, Wall K, Zuccolo A, Soltis P. The Amborella genome: an evolutionary reference for plant biology. Genome Biol. 2008;9(3):402.
Genome Duplication
Another important part of my research was building tools to detect major duplication events in plants and to help reconstruct ancestral genome sequences. This work helped identify some of the early genome duplications and early diversification events in flowering plants. I also built many bioinformatics pipelines to analyze transcriptome data from various organisms and compare them to model organisms.
- Chanderbali et al. Conservation and canalization of gene expression during angiosperm diversification accompany the origin and evolution of the flower. Proc Natl Acad Sci U S A. 2010 Dec 28;107(52):22570-5. Epub 2010 Dec 13.
- Duarte JM, Wall PK, Edger PP, Landherr LL, Ma H, Pires JC, Leebens-Mack J, dePamphilis CW. Identification of shared single copy nuclear genes in Arabidopsis, Populus, Vitis and Oryza and their phylogenetic utility across various taxonomic levels. BMC Evol Biol. 2010 Feb 24;10:61.
- Zahn et al. Comparative transcriptomics among floral organs of the basal eudicot Eschscholzia californica as reference for floral evolutionary developmental studies. Genome Biol. 2010;11(10):R101. Epub 2010 Oct 15.
- Cui L, Wall K, Leebens-Mack JH, Lindsay BG, …, Ma H, dePamphilis CW. Widespread genome duplications throughout the history of flowering plants. Genome Res. 2006;16(6):738-49.
Small RNAs
I have developed bioinformatics pipelines to identify and analyze miRNAs in non-model organisms.
- Barakat A, Wall K, Diloreto S, dePamphilis CW, Carlson JE. Conservation and divergence of microRNAs in Populus. BMC Genomics. 2007;8:481.
- Barakat A, Wall K, Leebens-Mack J, Carlson J, dePamphilis C. Conservation and divergence of microRNAs in California poppy. Plant Journal. 2007;51(6):991-1003.
Chemical Production Clustering and Machine Learning
In my current work as the Digitalization Manager at BASF, I focus on chemical production data analytics. I am collaborating with Professors Jose Romagnoli and Xun Tang in the Chemical Engineering Department at LSU, where I am also an Adjunct Professor. I have been involved in developing pipelines that use machine learning algorithms to predict quality and production soft sensors, to predict anomalies in equipment sensor datasets, to build dynamic dashboards for quality and production data management, and to use unsupervised clustering to approximate manufacturing operating conditions.