Conference: IJCAI 2025
Date: August 16th 9AM – 12:30PM ET
Venue: Palais des congrès Montréal, QC, Canada
The rise of powerful foundation models, particularly large language models (LLMs) built on Transformer architectures, has ushered in a new era of Generative AI, transforming various industries. These models have enabled a wide range of applications, including question answering, customer support, image and video generation, and code completion. However, modern LLMs consist of billions of parameters trained on trillions of tokens, making their development challenging in resource-constrained environments.
This tutorial provides a comprehensive exploration of deep learning training techniques optimized for AI accelerators, enabling faster, more memory-efficient, yet robust training of models with billions of parameters. We begin with an overview of Transformer architectures, deep learning optimization strategies, and system- and hardware-level techniques. We then discuss system optimization techniques such as fast attention computation and fault-tolerant training at scale. Leveraging modern deep learning frameworks, we illustrate the principles of scaling laws that enable the training of LLMs with hundreds of billions of parameters. Next, we delve into low-precision training methods (e.g., FP8 and FP4), highlighting techniques such as numerical error handling through scaling and stochastic rounding. Finally, we examine fine-tuning approaches such as low-rank adaptation combined with sparsity and quantization, which enable efficient model updates by modifying only a small subset of parameters.
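As a concrete illustration of the low-precision ideas above, the sketch below quantizes a tensor onto a low-bit grid using a per-tensor scale factor and stochastic rounding, which keeps the rounding error zero in expectation. This is a minimal NumPy sketch under assumptions of our own (a symmetric integer grid standing in for the FP8/FP4 formats, illustrative function names), not code from the tutorial itself.

```python
import numpy as np

def stochastic_round(x: np.ndarray) -> np.ndarray:
    # Round up with probability equal to the fractional part, down otherwise,
    # so the expected value of the rounded tensor equals the original.
    floor = np.floor(x)
    return floor + (np.random.rand(*x.shape) < (x - floor))

def quantize_with_scaling(x: np.ndarray, n_bits: int = 8) -> np.ndarray:
    # Per-tensor scaling maps values into the representable range before
    # rounding, then maps them back; this limits numerical error at low precision.
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-12
    q = np.clip(stochastic_round(x / scale), -qmax - 1, qmax)
    return q * scale

# Usage: quantized gradients remain unbiased estimates of the full-precision ones.
grads = np.random.randn(4, 4).astype(np.float32)
print(quantize_with_scaling(grads, n_bits=8))
```

Because stochastic rounding is unbiased, small gradient updates are not systematically lost to truncation, which is one reason it pairs well with aggressive formats such as FP8 and FP4.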
Decoder-only models with self-attention (e.g., LLaMA, Qwen)
Encoder-decoder models with vision encoder and cross-attention (e.g., LLaVA, LLaMA 4)
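To make the contrast between the two model families above concrete, here is a minimal PyTorch sketch (our own illustration, with assumed shapes and without masking, multi-head projections, or positional encodings): self-attention draws queries, keys, and values from the same token sequence, while cross-attention lets decoder queries attend to a vision encoder's output.

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # Scaled dot-product attention shared by both model families.
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

# Self-attention (decoder-only, e.g. LLaMA-style): q, k, v all come from
# the same token sequence x.
x = torch.randn(1, 16, 64)            # (batch, tokens, hidden dim)
self_out = attention(x, x, x)

# Cross-attention (vision-language models): queries come from the text
# decoder states, keys and values from the vision encoder's patch embeddings.
text = torch.randn(1, 16, 64)
image = torch.randn(1, 49, 64)        # e.g. a 7x7 grid of patch features
cross_out = attention(text, image, image)
print(self_out.shape, cross_out.shape)  # torch.Size([1, 16, 64]) for both
```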