CSDS600: Large Language Models

Xiaotian (Max) Han

Fall 2024, M/W 3:20–4:35 PM

Course Description

This course offers an in-depth exploration of large language models (LLMs), which have revolutionized natural language processing and, more broadly, artificial intelligence. It introduces LLMs from their foundations to practical applications, covering topics such as model architecture, system design, and training methodologies including pre-training, fine-tuning, and instruction tuning. The course is highly research-oriented and is designed for advanced undergraduate and graduate students in computer science who intend to pursue research on LLMs.

Course Components & Grading Policy

The grading policy is subject to minor change.

Students’ grades will be calculated from the following components:

  • Paper Presentation (30%):
    • 20-min presentation on a research paper (30%)
  • Class participation (10%)
    • Feedback forms submitted for paper presentations (1% each, 10 in total)
  • Final Project (60%):
    • A 2-page proposal (10%)
    • A 6-page final report (40%)
    • Project presentation (10%)

Paper Presentation Format:

  • Select a paper from the provided list
  • 20-minute presentation on the selected paper
  • 10-minute Q&A session
  • Each student must present at least once
  • Audience members will submit questions and rate the presenter using a provided form.

Final Project Format:

  • 3 students in each group
  • Select a topic related to LLMs
  • Use the ICLR 2024 LaTeX format for the proposal and final report (see the template sketch after this list)
  • In-class presentation on the final project
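Below is a minimal, unofficial sketch of how a proposal or final report might be set up with the ICLR 2024 template. It assumes you have downloaded the official ICLR 2024 author kit (iclr2024_conference.sty and related files) and placed it next to your .tex file; the file names report.tex and references.bib and the placeholder author/affiliation text are illustrative only.

    % report.tex: minimal skeleton for the proposal / final report (placeholder file name).
    % Assumes the ICLR 2024 style files (e.g., iclr2024_conference.sty and
    % iclr2024_conference.bst) from the official author kit are in the same directory.
    \documentclass{article}
    \usepackage{iclr2024_conference,times} % ICLR 2024 conference style
    \usepackage{hyperref}
    \usepackage{url}

    \title{Project Title}
    \author{Student One, Student Two \& Student Three \\
    Your Department, Your University}

    % \iclrfinalcopy % the style typically anonymizes authors unless this is uncommented

    \begin{document}
    \maketitle

    \begin{abstract}
    One-paragraph summary of the project.
    \end{abstract}

    \section{Introduction}
    Problem statement, motivation, and contributions.

    \bibliographystyle{iclr2024_conference}
    \bibliography{references} % placeholder references.bib

    \end{document}

Compile with pdflatex and bibtex (or latexmk), and check the compiled PDF against the 2-page proposal and 6-page report limits.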

Grading Criteria:

  • Paper Presentation (30%):
    • Completeness and quality of the presentation (15%)
    • Average peer rating (15%)
  • Class Participation (10%):
    • Feedback forms submitted for paper presentations (1% each; more than 10 presentations are scheduled, and you must submit at least 10 forms)
  • Final Project (60%):
    • Assessed on the quality of the proposal, final report, and presentation.

Course Schedule

The course schedule is subject to minor change.

Each session is listed as date: topic, followed by the assigned papers (if any).

  • 08/26: Course overview, introduction to LLMs
  • 08/28: Basic deep learning (MLP, word2vec, LayerNorm)
  • 09/02: No class (Labor Day Holiday)
  • 09/04: Language modeling & transformers
  • 09/09: LLM architecture & training
  • 09/11: Training data
    • The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
    • Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
    • Finetuned Language Models Are Zero-Shot Learners
    • RedPajama-Data-v2: An open dataset with 30 trillion tokens for training large language models
  • 09/16: Evaluation
    • Measuring Massive Multitask Language Understanding
    • MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
  • 09/18: In-context learning
    • Language Models are Few-Shot Learners
    • In-context Learning and Induction Heads
    • Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
    • Transformers Learn In-Context by Gradient Descent
    • Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
    • Many-Shot In-Context Learning
  • 09/23: Chain-of-thought & reasoning
    • Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
    • Self-Consistency Improves Chain of Thought Reasoning in Language Models
    • Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective
  • 09/25: Scaling laws
    • Scaling Laws for Neural Language Models
    • Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
    • Training Compute-Optimal Large Language Models
  • 09/30: Emergent abilities
    • Emergent Abilities of Large Language Models
    • Are Emergent Abilities of Large Language Models a Mirage?
  • 10/02: Positional embedding
    • Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
    • RoFormer: Enhanced Transformer with Rotary Position Embedding
    • The Impact of Positional Encoding on Length Generalization in Transformers
    • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  • 10/07: Long-context ability
    • YaRN: Efficient Context Window Extension of Large Language Models
    • Efficient Streaming Language Models with Attention Sinks
    • LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning
    • LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models
  • 10/09: KV cache
    • Efficient Memory Management for Large Language Model Serving with PagedAttention
    • Llama 2: Open Foundation and Fine-Tuned Chat Models (grouped-query attention)
    • DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (multi-head latent attention)
  • 10/14: Efficient attention
    • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
    • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
    • FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision
  • 10/16: No class (Fall Break)
  • 10/21: Efficient architecture
    • Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
    • Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
    • RWKV: Reinventing RNNs for the Transformer Era
    • Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
  • 10/23: Retrieval-augmented generation (RAG)
    • Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
    • Retrieval meets Long Context Large Language Models
    • From Local to Global: A Graph RAG Approach to Query-Focused Summarization
  • 10/28: Specialized LLMs
    • Code Llama: Open Foundation Models for Code
    • DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
    • BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains
    • HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution
  • 10/30: Guest lecture (TBD)
  • 11/04: Guest lecture (TBD)
  • 11/06: Guest lecture (TBD)
  • 11/11: Guest lecture (TBD)
  • 11/13: Guest lecture (TBD)
  • 11/18-12/04: Project presentations