24 January 2023
  • Speaker: Dr. Yi Wang, Fudan University
  • Title: MAIB-Talk-006: Toward Personal Language Models
  • Date: 9:00 pm US Eastern Time, 01/28/2023
  • Date: 10:00 am Beijing Time, 01/29/2023
  • Zoom ID: 933 1613 9423
  • Zoom PWD: 416262
  • Keywords: personal language models

Title: Toward Personal Language Models

Natural language models are another major approach to generative artificial intelligence. Diffusion models arise from continuous variables, whereas natural language models arise from discrete variables. One of the key technical pillars of natural language models is self-attention. Continuous and discrete modeling, Transformers, and autoregression over natural language are now converging, making this one of the main battlegrounds of AI, with wide applications in reasoning, intelligent activity, biology, health care, and medicine. After releasing ChatGPT, OpenAI has moved on to simulating human intellectual activity and thought processes. These areas will be the main content of our activities over the coming months. The field is broad, and papers appear by the hundreds and thousands, far faster than we can read them, but we will do our best to select representative papers to present. After that, we will discuss causal analysis and artificial intelligence, and then introduce the basics of differential manifolds and their applications in AI. Dr. Yi Wang's talk this week on personal language models opens this new generative-AI series. We believe this is one of the paths toward artificial general intelligence.

Abstract:

Language models provide the joint probability distribution of a symbolic sequence. A language model can generate novel sequences, which enables article writing and dialogue. It can also score the likelihood of given sequences, which enables blank filling, multiple-choice answering, and judging propositions. Thus, language models are key to future artificial general intelligence (AGI). Currently, huge language models dominate the field. They consume enormous computation, emit tons of CO2, require expensive GPU servers to deploy, and shut out small labs and individual researchers. In this study, I explored various technologies toward a personal language model that is small, elegant, cheap, fast, and affordable to everyone. These technologies include: (1) a simple bare CUDA/C++ implementation of every operator from scratch; (2) several novel candidate architectures; (3) a novel entropy-based sampling method for text generation, a.k.a. Top-E sampling; (4) elegant designs such as byte-level modeling, an extremely deep and narrow architecture, and single-head batch computation; (5) quantization with VNNI instructions. I open-sourced the June version with two pretrained models: a PubMed English model and a WuDao Chinese model. A more recent Traditional Chinese Medicine model is also available on WeChat, based on a state-of-the-art model with only 3 million parameters.
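
As a brief refresher on the framing in the opening sentences (not part of the speaker's abstract): the two capabilities paired above, generation and likelihood scoring, both follow from the standard autoregressive factorization of the joint distribution,

P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1}),
\qquad
\log P(x_{1:T}) = \sum_{t=1}^{T} \log P(x_t \mid x_{<t}).

Generation samples x_t from each conditional in turn; scoring evaluates the log-sum for a fixed sequence, which is what blank filling, multiple-choice answering, and proposition judging reduce to.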

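The abstract names an entropy-based sampling method (Top-E) but does not spell out its rule; the definition lives in the talk and the open-sourced code. Purely as a sketch of what entropy-based truncation can look like, the following assumes one plausible rule: keep only candidate tokens whose surprisal -log p does not exceed the distribution's entropy H, renormalize, and sample. The function name and the threshold choice are illustrative assumptions, not the speaker's method.

import numpy as np

def entropy_truncated_sample(logits, rng=None):
    # Hypothetical entropy-based truncation sampling (an assumption;
    # not necessarily the Top-E rule from the talk): keep tokens whose
    # surprisal -log p is at most the entropy H, renormalize, sample.
    rng = rng or np.random.default_rng()
    z = logits - logits.max()          # numerically stable softmax
    p = np.exp(z)
    p /= p.sum()
    h = -(p * np.log(p + 1e-12)).sum() # Shannon entropy in nats
    keep = -np.log(p + 1e-12) <= h     # argmax always qualifies, since
                                       # H is an average of surprisals
    if not keep.any():                 # degenerate fallback
        keep[p.argmax()] = True
    q = np.where(keep, p, 0.0)
    q /= q.sum()
    return int(rng.choice(len(q), p=q))

# Example: a byte-level vocabulary of 256 symbols, as in the talk's design.
logits = np.random.default_rng(0).normal(size=256)
next_byte = entropy_truncated_sample(logits)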

Bio:

Dr. Yi Wang is a Youth Research Associate in the School of Life Sciences at Fudan University. He obtained his Bachelor's degree and PhD from the same institution. During his postdoc at the Human Genome Sequencing Center at Baylor College of Medicine, he joined the 1000 Genomes Project; his SNPTools package gained community consensus and produced the Phase I imputation results of the project. His XiaoBu AI doctor was well received and was deployed at the Children's Hospital of Fudan University.

HealthScienceHub, https://space.bilibili.com/2056525058


