Does Field Ordering Affect Model Performance?
I mostly wrote this post as an excuse to try the freshly minted and excellent pydantic-evals
framework for LLM evaluations, but one interesting question that arises when using Pydantic models for structured output in your AI applications is: what happens if you shuffle the order of fields in your schema?
Does it matter if your output type looks like this:
class AgentOutput(BaseModel):
    answer: str
    reasoning: str
or like this:
class AgentOutput(BaseModel):
    reasoning: str
    answer: str
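Note that field ordering is not cosmetic here: Pydantic preserves declaration order in the generated JSON schema, and (as the system prompt later in this post shows) that schema is exactly what the model sees. A quick check:

from pydantic import BaseModel

class AgentOutput(BaseModel):
    reasoning: str
    answer: str

# Pydantic keeps schema properties in declaration order
print(list(AgentOutput.model_json_schema()["properties"]))
# -> ['reasoning', 'answer']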
One line of argument goes something like this: for reasoning models, field ordering shouldn't matter much. But for non-reasoning models, the second ordering might actually help: by placing the reasoning field first, we could nudge the model into a chain-of-thought-like regime early on, encouraging it to construct a better-grounded answer. Contrast that with the first ordering, which asks the model to output the answer first and the explanation second. This could backfire if the model locks in a wrong answer and the reasoning becomes a post-hoc rationalisation.
Well, let's see!
Setup
We use the painting style classification task from HuggingFace (because it doesn't seem saturated by zero-shot models). We create two kinds of tasks: a simple classification task (just the standard task itself) and a hard task, which works as follows:
The model is given an image of a painting and a list of strings. It first has to assign the painting a numerical label according to its style (as per the classical task), and then it has to select the string from the list that has exactly i+2 letters, where i is the index of the label the model chose in the previous step.
This harder task seemingly requires more reasoning, so one might expect techniques that elicit reasoning in non-reasoning models to perform better on it.
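To make the indexing concrete, here is a minimal sketch (the list construction mirrors the dataset code below): since AUXILIARY_LS[j] has exactly j letters, a model that picks label i must answer with AUXILIARY_LS[i + 2].

import random
import string

# AUXILIARY_LS[j] is a random lowercase string with exactly j letters,
# so each length occurs exactly once
AUXILIARY_LS = ["".join(random.choices(string.ascii_lowercase, k=i)) for i in range(30)]

label_index = 3                         # suppose the model assigns label 3
answer = AUXILIARY_LS[label_index + 2]  # the unique five-letter string
assert len(answer) == label_index + 2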
The pydantic-evals code for creating this task looks roughly like this:
import random
import string

import PIL.Image
from datasets import load_dataset
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

PAINTING_DATASET = load_dataset("keremberke/painting-style-classification", "full")
LABEL_NAMES = PAINTING_DATASET["train"].features["labels"].names

# AUXILIARY_LS[j] is a random lowercase string with exactly j letters
AUXILIARY_LS = ["".join(random.choices(string.ascii_lowercase, k=i)) for i in range(30)]

# Hard task: input is a painting image, expected output is the auxiliary
# string whose length is the true label index + 2
HARD_PYDANTIC_PAINTING_DATASET = Dataset[PIL.Image.Image, str](
    cases=[
        Case(
            inputs=PAINTING_DATASET["test"][i]["image"],
            expected_output=AUXILIARY_LS[PAINTING_DATASET["test"][i]["labels"] + 2],
        )
        for i in range(len(PAINTING_DATASET["test"]))
    ],
    evaluators=[
        EqualsExpected(),
    ],
)
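As a quick sanity check (assuming the code above has run), each case pairs a painting image with the auxiliary string whose length encodes the label:

case = HARD_PYDANTIC_PAINTING_DATASET.cases[0]
print(type(case.inputs))          # a PIL image of the painting
print(case.expected_output)       # a random lowercase string
print(len(case.expected_output))  # the painting's label index + 2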
We also need a framework / logic for structured output. By far my favourite framework for that (and much more) is Pydantic AI, so we can define a few agents like so:
from pydantic import BaseModel
from pydantic_ai import Agent

class HardTaskAgentOutputAnswerFirst(BaseModel):
    answer: str
    reasoning: str

# MODEL_NAME is set per run, e.g. "openai:gpt-4.1"
ht_af_agent = Agent(
    model=MODEL_NAME,
    output_type=HardTaskAgentOutputAnswerFirst,
)
# _label_str enumerates the candidate style labels (exact formatting assumed)
_label_str = "\n".join(f"{i}: {name}" for i, name in enumerate(LABEL_NAMES))

@ht_af_agent.system_prompt
def ht_af_system_prompt() -> str:
    return f"""
    You are a helpful assistant.
    You first assign each image the number of the label that best describes the style of the painting by choosing from the following list:
    {_label_str}
    Then you provide as an answer the string that has exactly i+2 letters, where i is the index of the label you chose in the previous step.
    The answer should be selected from the following list:
    {AUXILIARY_LS}
    Provide your answer as a structured response in the following format:
    {HardTaskAgentOutputAnswerFirst.model_json_schema()}
    """
The full code, including scripts to run the experiments, is available on GitHub here.
Results
Easy Task
Model | Answer First | Reasoning First |
---|---|---|
gpt-4.1 | 0.5227 | 0.5040 |
gpt-4.1-mini | 0.4556 | 0.4515 |
gpt-4o | 0.4960 | 0.5103 |
gpt-4o-mini | 0.4205 | 0.4213 |
Hard Task
Model | Answer First | Reasoning First |
---|---|---|
gpt-4.1 | 0.0777 | 0.0647 |
gpt-4.1-mini | 0.0385 | 0.1017 |
gpt-4o | 0.0696 | 0.0684 |
gpt-4o-mini | 0.0607 | 0.0787 |
Conclusion?
It's hard to say exactly why something works the way it does with LLMs, but a couple of patterns stand out: on the easy task, field ordering barely moves the needle, while on the hard task the mini models (gpt-4.1-mini and gpt-4o-mini) do noticeably better with reasoning first, consistent with the chain-of-thought intuition, whereas the larger models show no clear benefit (gpt-4.1 actually does slightly worse). We've entered a new development paradigm, and it's worth paying attention to the emerging patterns and ideas, especially the subtle ones :)