Using Mechanistic Interpretability to Understand LLMs

I am a newcomer to the field of mechanistic interpretability. Below are a few ideas I would like to try out for understanding LLMs. The common theme is to take two models that are identical in every respect but trained separately, and to use mechanistic interpretability to understand how features are encoded in each.

1. Understand the differences between a pre-trained model and its fine-tuned counterpart.
2. Understand the differences between two LLMs (pre-trained or fine-tuned) trained on the same dataset but with different hyperparameters. A rough sketch of how such a comparison might look follows this list.
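As a first step in either direction, one simple experiment is to run the same prompts through both models and measure how similar their layer-by-layer representations are. The sketch below is a minimal, illustrative version of that idea, not an established method: the two checkpoint names are placeholders for whatever pair of models is being compared, and linear CKA is used because raw cosine similarity is not meaningful between independently trained runs (their hidden spaces can differ by an arbitrary rotation).

```python
# Minimal sketch: compare per-layer representations of two architecturally
# identical models on the same prompts using linear CKA (Kornblith et al., 2019).
# CHECKPOINT_A / CHECKPOINT_B are placeholders -- substitute any two models
# that share a tokenizer and architecture (e.g., two seeds, or base vs fine-tuned).
import torch
from transformers import AutoModel, AutoTokenizer

CHECKPOINT_A = "org/model-run-1"   # placeholder: first training run
CHECKPOINT_B = "org/model-run-2"   # placeholder: second training run

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT_A)
model_a = AutoModel.from_pretrained(CHECKPOINT_A, output_hidden_states=True).eval()
model_b = AutoModel.from_pretrained(CHECKPOINT_B, output_hidden_states=True).eval()

prompts = [
    "The capital of France is",
    "def fibonacci(n):",
    "Water boils at a temperature of",
]

def layer_activations(model, texts):
    """Per-layer, per-token hidden states: list of (n_tokens_total, d_model) tensors."""
    per_layer = None
    for text in texts:
        ids = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**ids).hidden_states   # tuple of (1, seq_len, d_model)
        if per_layer is None:
            per_layer = [[] for _ in hidden]
        for i, h in enumerate(hidden):
            per_layer[i].append(h[0])
    return [torch.cat(chunks, dim=0) for chunks in per_layer]

def linear_cka(x, y):
    """Linear CKA between two (n_samples, d) activation matrices; rotation-invariant."""
    x = x - x.mean(0, keepdim=True)
    y = y - y.mean(0, keepdim=True)
    num = (y.T @ x).norm() ** 2
    denom = (x.T @ x).norm() * (y.T @ y).norm()
    return (num / denom).item()

acts_a = layer_activations(model_a, prompts)
acts_b = layer_activations(model_b, prompts)

# Low CKA at a layer hints that the two runs encode features differently there.
for layer, (ha, hb) in enumerate(zip(acts_a, acts_b)):
    print(f"layer {layer:2d}: linear CKA = {linear_cka(ha, hb):.3f}")
```

With only a handful of prompts this is noisy; a real comparison would use a larger, more diverse prompt set and could go further with per-neuron or sparse-autoencoder-level analysis.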

My ultimate goal with these methods is to have a mechanism for turning off certain features in a model. It is a very lofty goal, but progress in that direction would be paramount for building reliable systems.
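To make "turning off a feature" concrete, here is a minimal sketch of one common form of intervention: projecting a single direction out of the residual stream with a forward hook. Everything specific here is an assumption for illustration: `gpt2` is just a stand-in model, the layer index is arbitrary, and `feature_direction` is a random placeholder where, in practice, a direction from a sparse autoencoder, a linear probe, or a difference between the two models being compared would go.

```python
# Minimal sketch: zero-ablate one "feature direction" in the residual stream
# via a PyTorch forward hook. Model, layer, and direction are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # stand-in model; any GPT-2-style causal LM with .transformer.h
LAYER = 6             # which transformer block's output to edit (arbitrary choice)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

d_model = model.config.hidden_size
# Placeholder: a random unit vector standing in for a learned feature direction.
feature_direction = torch.randn(d_model)
feature_direction = feature_direction / feature_direction.norm()

def ablate_direction(module, inputs, output):
    # GPT-2 blocks return a tuple; the first element is the residual stream.
    hidden = output[0]
    # Project out the feature direction at every position: h <- h - (h . v) v
    coeff = (hidden @ feature_direction).unsqueeze(-1)
    hidden = hidden - coeff * feature_direction
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(ablate_direction)

prompt = "The Eiffel Tower is located in"
ids = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(out[0]))

handle.remove()   # restore the unmodified model
```

Whether this actually "turns off" a behavior depends entirely on finding a direction that faithfully corresponds to the feature of interest, which is exactly what the comparison experiments above are meant to help with.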



