A Systematic Approach to Evaluating Large Language Models