- Title: MAIB-class-014: A Fundamental Model for Genetic Studies of Complex Diseases
- Date: 10:00 pm US Eastern time, 05/06/2023
- Date: 10:00 am Beijing time, 05/07/2023
- Zoom ID: 933 1613 9423
- Zoom PWD: 416262
- Zoom: https://uwmadison.zoom.us/meeting/register/tJcudu-prTIuGNda1MsF8PKyRQlnGn06TP2E
-
Momiao Xiong, Ph.D., is a Professor in the Department of Biostatistics and Data Science, University of Texas School of Public Health. Dr. Xiong graduated from the Department of Statistics at the University of Georgia in 1993. From 1993 to 1995, he was a postdoctoral fellow at the University of Southern California, working with Michael Waterman.
-
Research Interests: Causal Inference, Artificial Intelligence, Manifold Learning, Statistical Genetics, and Bioinformatics.
Background
This talk develops a genotype language model as a fundamental (foundation) model for genetic studies of complex diseases. Generative AI raises a great challenge in both philosophy and practice, “on a scale not experienced since the beginning of the Enlightenment.” AI-powered sequencers are now capable of whole-genome sequencing at $100 per individual, which allows a large amount of sequence data to be generated. The exponential growth of DNA and protein sequence data is paving the way for DNA and protein language models in genomics and biomedicine. DNA and protein sequences contain rich information about their evolution, fitness, protein structure and stability, mutation semantics, and mechanisms of disease.
Information about the biological properties of the sequences is encoded in the representations. The representations can be used for association and causal analysis of genetic variants, including QTL and eQTL analysis. One limitation of foundation models is the lack of hypothesis testing, which leads to nontransparent and unexplainable results. To overcome this limitation, I will first develop a general framework for hypothesis testing theory in artificial intelligence in general and in foundation models in particular. I will view the transformer as a universal approximator of sequence-to-sequence functions and use nonlinear testing theory in statistics to define the null hypothesis and test statistics and to derive their distributions. The developed testing theory is then applied to genome-wide association studies.
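Artificial intelligence has made enormous progress over the past decade, so much so that some scientists assess its impact on modern scientific research mainly in terms of its negative aspects. Opacity, unreliability, and poor interpretability are among the main arguments they raise against it. One of the main tools of AI research is prediction, and it is prediction that gives rise to these frequently criticized shortcomings. Prediction essentially computes the probability that an event occurs. An event involves many factors; some of them matter and some do not. Because a neural network is in many cases a black box, it generally does not, and in many cases cannot, identify which factors played an important role in the prediction. In statistics, hypothesis testing is as important as prediction. Lehmann wrote two books for graduate students in statistics: the first on estimation and the second on hypothesis testing. Hypothesis testing is also one of the cornerstones that Fisher laid for statistics. Hypothesis testing identifies the factors that cause an event to occur. In classical statistics, the theory of hypothesis testing is developed in Euclidean space.
Wherever prediction appears in the main models of AI, we will explore establishing a theory of hypothesis testing, including the null hypothesis, the test statistic, and the probability distribution of the statistic under the null hypothesis, and we will design numerical simulations to compute the type I error. Foundation models are the main theory of AI, so we begin with foundation models and study hypothesis testing under them.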
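As a rough illustration of what a numerical simulation of the type I error can look like, the sketch below repeatedly generates genotypes and phenotypes under a null hypothesis of no association and counts how often a test rejects at a chosen level. It is not the testing theory of the talk: a classical score test for linear association (statistic n·r², referred to a chi-square distribution with 1 degree of freedom) stands in for the test, and the function names, sample size, minor allele frequency, and number of replicates are all assumptions made for the example.

```python
import math
import random


def score_test_pvalue(genotypes, phenotypes):
    """P-value for H0: no linear association between genotype and phenotype.

    Uses the score statistic n * r^2, which is approximately chi-square
    with 1 degree of freedom under H0 (a stand-in test, assumed here).
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sgy = sum((g - mean_g) * (y - mean_y) for g, y in zip(genotypes, phenotypes))
    sgg = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    if sgg == 0.0 or syy == 0.0:
        return 1.0  # monomorphic variant or constant phenotype: no evidence
    stat = n * sgy * sgy / (sgg * syy)
    # Chi-square(1) survival function: P(X > s) = erfc(sqrt(s / 2))
    return math.erfc(math.sqrt(stat / 2.0))


def simulate_type1_error(n_individuals=200, n_replicates=2000,
                         maf=0.3, alpha=0.05, seed=0):
    """Estimate the type I error rate by simulating data under the null."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_replicates):
        # Genotype = number of minor alleles (0, 1, or 2) at one variant.
        genotypes = [int(rng.random() < maf) + int(rng.random() < maf)
                     for _ in range(n_individuals)]
        # Phenotype is drawn independently of genotype, so H0 is true.
        phenotypes = [rng.gauss(0.0, 1.0) for _ in range(n_individuals)]
        if score_test_pvalue(genotypes, phenotypes) < alpha:
            rejections += 1
    return rejections / n_replicates


if __name__ == "__main__":
    rate = simulate_type1_error()
    print(f"Estimated type I error at alpha=0.05: {rate:.3f}")
```

If the null distribution of the statistic is correct, the estimated rejection rate should be close to the nominal level alpha; a rate well above it would indicate an inflated type I error.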