CSDS600: Large Language Models

Xiaotian (Max) Han

Fall 2024, M/W 3:20–4:35 PM

Course Description

This course offers an in-depth exploration of large language models (LLMs), which have revolutionized natural language processing and, more broadly, artificial intelligence. It introduces LLMs from their foundations to practical applications, covering topics such as model architecture, system design, and training methodologies including pre-training, fine-tuning, and instruction tuning. The course is highly research-oriented and is designed for advanced undergraduate and graduate students in computer science who intend to pursue research on LLMs.

Course Components & Grading Policy

The grading policy is subject to minor change.

Students’ grades will be calculated from the following components:

  • Paper Presentation (30%):
    • 20-min presentation on a research paper (30%)
  • Class participation (10%)
    • Feedback forms submitted for paper presentations (1% each, 10 in total)
  • Final Project (60%):
    • A 2-page proposal (10%)
    • A 6-page final report (40%)
    • Project presentation (10%)

Paper Presentation Format:

  • Select a paper from the provided list
  • 20-minute presentation on the selected paper
  • 10-minute Q&A session
  • Each student must present at least once
  • Audience members will submit questions and rate the presenter using a provided form.

Final Project Format:

  • 3 students in each group
  • Select a topic related to LLMs
  • Use the ICLR 2024 LaTeX format for the proposal and final report (see the template sketch after this list)
  • In-class presentation on the final project
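Below is a minimal, unofficial sketch of how a proposal or final report might be set up with the ICLR 2024 template. It assumes you have downloaded the official ICLR 2024 author kit (iclr2024_conference.sty and related files) and placed it next to your .tex file; the file names report.tex and references.bib and the placeholder author/affiliation text are illustrative only.

    % report.tex: minimal skeleton for the proposal / final report (placeholder file name).
    % Assumes the ICLR 2024 style files (e.g., iclr2024_conference.sty and
    % iclr2024_conference.bst) from the official author kit are in the same directory.
    \documentclass{article}
    \usepackage{iclr2024_conference,times} % ICLR 2024 conference style
    \usepackage{hyperref}
    \usepackage{url}

    \title{Project Title}
    \author{Student One, Student Two \& Student Three \\
    Your Department, Your University}

    % \iclrfinalcopy % the style typically anonymizes authors unless this is uncommented

    \begin{document}
    \maketitle

    \begin{abstract}
    One-paragraph summary of the project.
    \end{abstract}

    \section{Introduction}
    Problem statement, motivation, and contributions.

    \bibliographystyle{iclr2024_conference}
    \bibliography{references} % placeholder references.bib

    \end{document}

Compile with pdflatex and bibtex (or latexmk), and check the compiled PDF against the 2-page proposal and 6-page report limits.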

Grading Criteria:

  • Paper Presentation (30%):
    • Completeness and quality of the presentation (15%)
    • Average peer rating (15%)
  • Class Participation (10%):
    • Feedback forms submitted for paper presentations (1% each; more than 10 presentations are scheduled, and you must submit at least 10 forms)
  • Final Project (60%):
    • Assessed on the quality of the proposal, final report, and presentation.

Course Schedule

The course schedule is subject to minor change.

Each session is listed as date: topic, followed by the assigned papers (if any).

  • 08/26: Course overview, introduction to LLMs
  • 08/28: Basic deep learning (MLP, word2vec, LayerNorm)
  • 09/02: No class (Labor Day Holiday)
  • 09/04: Language modeling & transformers
  • 09/09: LLM architecture & training
  • 09/11: Training data
    • The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
    • Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
    • Finetuned Language Models Are Zero-Shot Learners
    • RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models
  • 09/16: Evaluation
    • Measuring Massive Multitask Language Understanding
    • MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
  • 09/18: In-context learning
    • Language Models are Few-Shot Learners
    • In-context Learning and Induction Heads
    • Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
    • Transformers Learn In-Context by Gradient Descent
    • Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
    • Many-Shot In-Context Learning
  • 09/23: Chain-of-thought & reasoning
    • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
    • Self-Consistency Improves Chain of Thought Reasoning in Language Models
    • Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
  • 09/25: Scaling laws
    • Scaling Laws for Neural Language Models
    • Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
    • Training Compute-Optimal Large Language Models
  • 09/30: Emergent abilities
    • Emergent Abilities of Large Language Models
    • Are Emergent Abilities of Large Language Models a Mirage?
  • 10/02: Positional embedding
    • Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
    • RoFormer: Enhanced Transformer with Rotary Position Embedding
    • The Impact of Positional Encoding on Length Generalization in Transformers
    • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  • 10/07: Long-context ability
    • YaRN: Efficient Context Window Extension of Large Language Models
    • Efficient Streaming Language Models with Attention Sinks
    • LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
    • LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models
  • 10/09: KV cache
    • Efficient Memory Management for Large Language Model Serving with PagedAttention
    • Llama 2: Open Foundation and Fine-Tuned Chat Models (grouped-query attention)
    • DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (multi-head latent attention)
  • 10/14: Efficient attention
    • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
    • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
    • FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision
  • 10/16: No class (Fall Break)
  • 10/21: Efficient architecture
    • Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
    • Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
    • RWKV: Reinventing RNNs for the Transformer Era
    • Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  • 10/23: Retrieval-augmented generation (RAG)
    • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    • Retrieval meets Long Context Large Language Models
    • From Local to Global: A Graph RAG Approach to Query-Focused Summarization
  • 10/28: Specialized LLMs
    • Code Llama: Open Foundation Models for Code
    • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
    • BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains
    • HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
  • 10/30: Guest lecture (TBD)
  • 11/04: Guest lecture (TBD)
  • 11/06: Guest lecture (TBD)
  • 11/11: Guest lecture (TBD)
  • 11/13: Guest lecture (TBD)
  • 11/18-12/04: Project presentations