Notes for “Chameleon: Mixed-Modal Early-Fusion Foundation Models”

Categories: personal notes, llm

Author: Gabriel Chua

Published: June 5, 2024

Here is the link to the original paper. These notes were prepared for my LLM Asia Paper Club sharing. Any feedback or suggestions for improvement would be most appreciated at cyzgab[at]gmail.com.

Key Points

  1. End-to-end multimodal tokens (i.e. no modality-specific encoder or decoder)

  2. Architectural innovations and training techniques to address training stability and scaling challenges:

    1. query-key normalization
    2. revised placement of layer norms
  3. Pre-training these models requires large datasets and significant computation.

    1. Dataset of 4.4T tokens (2.9T text-only, 1.5T text-image, and 400B interleaved text-and-image)
    2. 7B and 34B trained on 856K and 4.2M GPU hours respectively
  4. Performance:

    1. On visual Q&A, outperforms Flamingo and LLaVA-1.5
    2. On text-only benchmarks, still competitive with Mixtral 8x7B and Gemini-Pro
    3. On human pairwise comparisons, beats Gemini-Pro and GPT-4V

Image 1: Conceptual summary of multimodal training and generation from the paper

Image 2: Example generation from the paper

Late vs Early Fusion

A useful reference for me was this literature review by Wadekar et al.

Late fusion: done at the internal layers of the model (e.g. OpenFlamingo, LLaMA-Adapter-V2)


Image 3: Simplified architectural summary for late fusion - taken from Wadekar et al.

Early fusion: done at the input stage (e.g. LLaVA, Unified-IO-2, Chameleon, Gemini). A minimal sketch of the tokenised variant follows the two images below.


Image 4: Simplified architectural summary for non-tokenised early fusion - taken from Wadekar et al.

Image 5: Simplified architectural summary for tokenised early fusion - taken from Wadekar et al.
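A minimal sketch of the tokenised early-fusion idea, with made-up token ids (my own illustration, not from the paper): both text and images are mapped to discrete tokens and concatenated into one sequence, so a single transformer with fully shared weights models all modalities.

```python
# Hypothetical token ids for illustration only.
text_tokens = [17, 934, 201]          # from the BPE text tokenizer
image_tokens = [64121, 63987, 65000]  # from the image codebook (1024 of these per image in Chameleon)

# Tokenised early fusion: one interleaved sequence for one transformer,
# e.g. a caption, followed by an image, followed by more text.
sequence = text_tokens + image_tokens + [55, 873]
```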

Tokeniser

  • For images: trained a new image tokenizer based on Gafni et al. (2022), which encodes a 512 × 512 image into 1024 discrete tokens from a codebook of size 8192 (see the back-of-the-envelope sketch after this list)

Based on OAI’s pricing page (as of 5 June 2024), one image is ~170 tokens in GPT-4o.

  • For text: BPE tokenizer over a subset of training data, with a vocabulary size of 65K
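A quick back-of-the-envelope check on the image token budget (my own arithmetic; only the 512 × 512, 1024-token, and 8192-codebook figures come from the paper): 1024 tokens imply a 32 × 32 latent grid, i.e. a spatial downsampling factor of 16, and a codebook of 8192 means each image token carries 13 bits.

```python
import math

image_size = 512            # pixels per side (from the paper)
tokens_per_image = 1024     # from the paper
codebook_size = 8192        # from the paper

grid = math.isqrt(tokens_per_image)             # 32 x 32 latent grid
downsample = image_size // grid                 # each token covers a 16 x 16 pixel patch
bits_per_token = int(math.log2(codebook_size))  # 13 bits of information per image token

print(grid, downsample, bits_per_token)  # 32 16 13
```

At roughly six times the ~170 tokens GPT-4o charges per image, Chameleon's image representation is comparatively token-hungry.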

Ensuring Stability of Pre-Training

“We found that the standard LLaMa architecture showed complex divergences due to slow norm growth in the mid-to-late stages of training. We narrowed down the cause of the divergence to the softmax operation being problematic when training with multiple modalities of significantly varying entropy due to the translation invariant property of softmax (i.e., softmax(z) = softmax(z+c)). Because we share all weights of the model across modalities, each modality will try to “compete” with the other by increasing its norms slightly; while not problematic at the beginning of training, it manifests in divergences once we get outside the effective representation range of bf16… In a unimodal setting, this problem has also been named the logit drift problem.” (Page 6)
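A tiny numerical illustration of that translation-invariance point (my own, not from the paper): adding a constant to every logit leaves the softmax output unchanged, so logit norms can grow without affecting the loss, right up until bf16 runs out of precision.

```python
import torch

logits = torch.tensor([1.0, 2.0, 3.0])
shift = 100.0  # any constant

p1 = torch.softmax(logits, dim=-1)
p2 = torch.softmax(logits + shift, dim=-1)
print(torch.allclose(p1, p2))  # True: softmax(z) == softmax(z + c)

# But once logits drift far enough, bf16 can no longer resolve the differences:
print((logits + 1e4).bfloat16())  # all three values collapse to the same number
```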

  1. Query-key normalisation: applying layer norm to the query and key vectors within the attention (see the sketch after this list)

  2. Revised placement of layer norms for the 34B model

  3. Introducing z-loss regularisation: adding 10^-5 · log² Z to the loss, where Z is the softmax partition function (also included in the sketch below)
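A minimal sketch of interventions 1 and 3 (my own illustration; single-head, LayerNorm-based, and not the exact Chameleon implementation): queries and keys are normalised before the attention dot product, and a z-loss term penalises the log of the softmax partition function Z so logits cannot drift upward for free.

```python
import torch
import torch.nn as nn

class QKNormAttention(nn.Module):
    """Single-head self-attention with query-key normalisation (illustrative only)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        # Layer norm applied to queries and keys before the dot product,
        # so their norms cannot grow unboundedly during training.
        self.q_norm = nn.LayerNorm(dim)
        self.k_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.q_norm(self.q_proj(x))
        k = self.k_norm(self.k_proj(x))
        v = self.v_proj(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5), dim=-1)
        return attn @ v


def z_loss(logits: torch.Tensor, coeff: float = 1e-5) -> torch.Tensor:
    """Penalise log^2 of the partition function Z = sum(exp(logits))."""
    log_z = torch.logsumexp(logits, dim=-1)
    return coeff * (log_z ** 2).mean()
```

The training loss is then the usual cross-entropy plus z_loss(final_logits).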


Image 6: Training plots

Dropout was initially introduced after the attention and feed-forward layers for the 7B model, though it was subsequently found to be unnecessary. For the 34B model, dropout was neither sufficient nor necessary.

Summary of Pre-Training

| Model | Params | Context Length | GQA | Tokens | LR | Epochs | Dropout | Z-loss | QK-norm |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-1 | 7B | 2k | ✗ | 1.0T | 3.0 × 10^-4 | 1.0 | 0.0 | 0.0 | ✗ |
| LLaMA-1 | 33B | 2k | ✗ | 1.4T | 1.5 × 10^-4 | 1.0 | 0.0 | 0.0 | ✗ |
| LLaMA-2 | 7B | 4k | ✗ | 2.0T | 3.0 × 10^-4 | 1.0 | 0.0 | 0.0 | ✗ |
| LLaMA-2 | 34B | 4k | ✓ | 2.0T | 1.5 × 10^-4 | 1.0 | 0.0 | 0.0 | ✗ |
| Chameleon | 7B | 4k | ✗ | 4.4T | 1.0 × 10^-4 | 2.1 | 0.1 | 10^-5 | ✓ |
| Chameleon | 34B | 4k | ✓ | 4.4T | 1.0 × 10^-4 | 2.1 | 0.0 | 10^-5 | ✓ |

Taken from Table 1 of the paper

Challenges Associated with Inference

  1. When decoding, we need to check whether each generated token is a text or an image token
  2. Masking tokens from other modalities when exclusively generating for a particular modality (e.g. no text tokens when doing image-only generation); see the sketch after this list
  3. Token-based image generation produces a fixed-size block of tokens (1024 per image)
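A minimal sketch of modality-constrained decoding (my own illustration; the vocabulary layout, function names, and mode handling are assumptions, not Chameleon's actual inference code): logits for the disallowed modality are set to -inf before sampling, and once an image segment starts it is decoded for a fixed 1024 steps.

```python
import torch

VOCAB_SIZE = 65536                      # assumed: full vocabulary, including the image codebook ids
IMAGE_VOCAB = 8192                      # image codebook size (from the paper)
IMAGE_START = VOCAB_SIZE - IMAGE_VOCAB  # assumed: image token ids sit at the end of the vocab
IMAGE_BLOCK_LEN = 1024                  # an image is always a fixed-size block of 1024 tokens

def mask_logits(logits: torch.Tensor, mode: str) -> torch.Tensor:
    """Disallow tokens of the modality we are not currently generating."""
    masked = logits.clone()
    if mode == "text_only":
        masked[..., IMAGE_START:] = float("-inf")  # forbid image tokens
    elif mode == "image_only":
        masked[..., :IMAGE_START] = float("-inf")  # forbid text tokens
    return masked

def sample_token(logits: torch.Tensor, mode: str) -> int:
    probs = torch.softmax(mask_logits(logits, mode), dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# When an image-start marker is emitted, switch to "image_only" mode and decode
# exactly IMAGE_BLOCK_LEN tokens before switching back to text decoding.
```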

SFT/Alignment

The supervised fine-tuning dataset covered the following categories:

  • text
  • code
  • visual chat
  • image generation
  • interleaved text/image generation
  • safety (e.g. “I can’t help with that”)