Mechanistic Interpretability in Action: Understanding Induction Heads and QK Circuits in Transformers

Ayyüce Kızrak, Ph.D.
20 min read · Sep 28, 2024

This project, created for the AI Alignment Course of AI Safety Fundamentals (powered by BlueDot Impact), draws on a range of advanced resources to explore key concepts in mechanistic interpretability for transformers.

Acknowledgment: I would like to thank the AI Safety Fundamentals team, the facilitators, and all the participants in the cohorts I took part in, whose discussions helped develop new ideas. I am pleased to be part of this team.

Cover image source: Google DeepMind on Unsplash

To explore the practical implementation of the topic discussed in this blog post, check out my GitHub repository.👇

Introduction — Mechanistic Interpretability

In artificial intelligence (AI), mechanistic interpretability is the study of how artificial neural networks (ANNs), including deep learning models, work at the level of their individual components, such as neurons, circuits, and weights. AI models are often described…
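To make "individual components" more concrete, here is a minimal pure-Python sketch of inspecting a single attention head directly from its weights. It assumes one head with hypothetical query/key matrices `w_q`, `w_k` and a hypothetical token embedding matrix `w_e`, uses random weights as stand-ins for trained ones, and ignores layer norm, biases, and positional embeddings. Under those assumptions it computes the head's causal attention pattern, softmax(Q Kᵀ / √d_head), and its token-to-token QK circuit, W_E W_Q W_Kᵀ W_Eᵀ.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def causal_attention_pattern(x, w_q, w_k):
    """Attention pattern of one head: softmax(Q K^T / sqrt(d_head)) with a causal mask.

    x:        residual-stream vectors, shape [seq_len][d_model]
    w_q, w_k: query/key weights, shape [d_model][d_head]
    Row i of the result is a probability distribution over positions 0..i.
    """
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    d_head = len(w_q[0])
    scores = matmul(q, transpose(k))
    pattern = []
    for i, row in enumerate(scores):
        # Causal mask: position i may only attend to positions j <= i.
        logits = [s / math.sqrt(d_head) for s in row[: i + 1]]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        pattern.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return pattern


def qk_circuit(w_e, w_q, w_k):
    """Token-to-token QK circuit W_E W_Q W_K^T W_E^T, shape [vocab][vocab].

    Entry [a][b] is the unscaled attention score a query token a gives a key
    token b, under the simplifying assumptions above (no positions, no layer norm).
    """
    return matmul(matmul(matmul(w_e, w_q), transpose(w_k)), transpose(w_e))


def random_matrix(rows, cols):
    # Random stand-in weights; a real analysis would load trained ones.
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, d_head, vocab = 8, 4, 5
w_e = random_matrix(vocab, d_model)
w_q = random_matrix(d_model, d_head)
w_k = random_matrix(d_model, d_head)

tokens = [0, 3, 1, 0, 3, 2]           # a short token sequence
x = [w_e[t] for t in tokens]          # embed tokens into the residual stream
pattern = causal_attention_pattern(x, w_q, w_k)
qk = qk_circuit(w_e, w_q, w_k)
```

With trained weights in place of the random ones, large entries in `qk` would indicate which token pairs the head prefers to attend between. Under these simplifying assumptions, every attention logit in `pattern` is just an entry of `qk` divided by √d_head, which is why the QK circuit can be studied on its own, independent of any particular input sequence.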


Ayyüce Kızrak, Ph.D.

AI Specialist @Digital Transformation Office, Presidency of the Republic of Türkiye | Academics @Bahçeşehir University | http://www.ayyucekizrak.com/