Using Mechanistic Interpretability to Understand LLMs
I am a newbie to the field of mechanistic interpretability. Below are a few ideas I would like to try out using mechanistic interpretability to understand LLMs. The major theme is to take two models that are identical in every respect but trained separately, and use mechanistic interpretability to understand how features are encoded in each.
1. Understand the differences between pre-trained models and fine-tuned models.
2. Understand the differences between two LLMs (pre-trained or fine-tuned) trained on the same dataset but with different hyper-parameters (a rough comparison sketch follows below).
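
As a starting point for either comparison, here is a minimal sketch (not a settled method) of how one might line up two checkpoints with identical architecture and measure how differently they represent the same prompt, layer by layer. The checkpoint names are placeholders; in practice you would point them at your own base/fine-tuned or differently-trained pair.

```python
# Minimal sketch: compare per-layer hidden states of two same-architecture
# checkpoints on the same prompt. Checkpoint names below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "gpt2"    # placeholder: base / first checkpoint
OTHER = "gpt2"   # placeholder: fine-tuned or differently-trained checkpoint

tokenizer = AutoTokenizer.from_pretrained(BASE)
model_a = AutoModelForCausalLM.from_pretrained(BASE, output_hidden_states=True).eval()
model_b = AutoModelForCausalLM.from_pretrained(OTHER, output_hidden_states=True).eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    hidden_a = model_a(**inputs).hidden_states  # tuple of (n_layers + 1) tensors, each [1, seq, d_model]
    hidden_b = model_b(**inputs).hidden_states

# Per-layer cosine similarity of the last-token representation.
# A layer with low similarity is a candidate place where the two
# models encode features differently.
for layer, (h_a, h_b) in enumerate(zip(hidden_a, hidden_b)):
    sim = torch.cosine_similarity(h_a[0, -1], h_b[0, -1], dim=0)
    print(f"layer {layer:2d}: cosine similarity = {sim.item():.4f}")
```

This only flags where the representations diverge; pinning down which features diverge would need more targeted tools (probes, sparse autoencoders, activation patching, etc.).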
My ultimate goal with these methods is to have a mechanism to turn off certain features in a model. It is a very lofty goal, but progress in that direction would be a major step toward reliable systems.
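
To make the "turn off a feature" idea concrete, here is a minimal sketch assuming the feature is (approximately) a single direction in the residual stream, which is a strong simplifying assumption. The layer index is arbitrary and the direction is random here, standing in for a direction you would actually discover with interpretability tools.

```python
# Minimal sketch: "switch off" a feature by projecting a chosen direction
# out of the residual stream at one layer, via a forward hook on GPT-2.
# The layer index and feature direction are placeholders, not real findings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

d_model = model.config.n_embd
layer_idx = 6                                  # placeholder: layer to intervene on
feature_dir = torch.randn(d_model)             # placeholder: replace with a discovered feature direction
feature_dir = feature_dir / feature_dir.norm()

def ablate_direction(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0]
    # Remove the component of every position's activation along feature_dir.
    coeff = (hidden @ feature_dir).unsqueeze(-1)   # [batch, seq, 1]
    hidden = hidden - coeff * feature_dir
    return (hidden,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(ablate_direction)

inputs = tokenizer("The Eiffel Tower is in", return_tensors="pt")
with torch.no_grad():
    ablated_logits = model(**inputs).logits    # compare against logits without the hook

handle.remove()                                # restore normal behaviour
```

Comparing the model's outputs with and without the hook is one crude way to check whether the removed direction actually carried the behaviour you intended to switch off.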