Does Field Ordering Affect Model Performance?

I mostly wrote this post as an excuse to try the freshly-minted and excellent pydantic-evals framework for LLM evaluations, but one interesting question that arises when using Pydantic models to implement structured output in your AI applications is: what happens if you shuffle the order of fields in your schema?

Does it matter if your output type looks like this:

class AgentOutput(BaseModel):
    answer: str
    reasoning: str

or like this:

class AgentOutput(BaseModel):
    reasoning: str
    answer: str

One line of argument goes something like this: for reasoning models, field ordering shouldn't matter much. But for non-reasoning models, the second ordering might actually help: by placing the reasoning field first, we could nudge the model into a chain-of-thought-like regime early on, encouraging it to construct a better-grounded answer. Contrast that with the first ordering, which asks the model to output the answer first and the explanation second. This could backfire if the model locks in a wrong answer and the reasoning becomes a post-hoc rationalisation.
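A quick way to see that the two definitions really do differ on the wire: Pydantic preserves field declaration order in the JSON schema's properties, and that schema is what the structured-output machinery sends to the model. (The class names below are renamed purely for illustration.)

from pydantic import BaseModel

class AnswerFirst(BaseModel):
    answer: str
    reasoning: str

class ReasoningFirst(BaseModel):
    reasoning: str
    answer: str

# Declaration order is preserved in the schema's "properties" mapping
print(list(AnswerFirst.model_json_schema()["properties"]))     # ['answer', 'reasoning']
print(list(ReasoningFirst.model_json_schema()["properties"]))  # ['reasoning', 'answer']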

Well, let's see!

Setup

We use the painting style classification task from HuggingFace (because it doesn't seem saturated by zero-shot models). We create two kinds of tasks: a simple classification one (just the standard task itself) and a hard task, which works as follows:

The model is given an image of a painting and a list of strings. It first has to assign the painting a numerical label according to its style (as in the classic task), and then it has to select the string from the list that has exactly i+2 letters, where i is the index of the label the model chose in the previous step.

This harder task seemingly requires more reasoning, so one might expect techniques that elicit reasoning in non-reasoning models to do better at it.
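To make the mapping concrete, here is a tiny sketch with a made-up auxiliary list (in the real setup, generated below, the string at index j has exactly j letters, so AUXILIARY_LS[i + 2] is the unique correct answer):

aux = ["", "a", "bc", "def", "ghij", "klmno"]  # the string at index j has j letters

label_index = 3                    # say the model classifies the painting as style 3
target_length = label_index + 2    # it must then pick the string with exactly 5 letters
answer = next(s for s in aux if len(s) == target_length)
assert answer == aux[label_index + 2] == "klmno"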

The pydantic-evals code for the creation of this task looks roughly like this:

import random
import string

import PIL.Image
from datasets import load_dataset
from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import EqualsExpected

PAINTING_DATASET = load_dataset("keremberke/painting-style-classification", "full")
LABEL_NAMES = PAINTING_DATASET["train"].features["labels"].names
# Auxiliary strings: the entry at index i has exactly i letters
AUXILIARY_LS = ["".join(random.choices(string.ascii_lowercase, k=i)) for i in range(30)]

HARD_PYDANTIC_PAINTING_DATASET = Dataset[PIL.Image.Image, str](
    cases=[
        Case(
            inputs=PAINTING_DATASET["test"][i]["image"],
            expected_output=AUXILIARY_LS[PAINTING_DATASET["test"][i]["labels"] + 2],
        )
        for i in range(len(PAINTING_DATASET["test"]))
    ],
    evaluators=[
        EqualsExpected(),
    ],
)
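With the dataset in hand, pydantic-evals can score any task function against it. A minimal sketch, assuming an async hard_task function (sketched after the agent definition below) that maps a painting image to the model's string answer:

report = HARD_PYDANTIC_PAINTING_DATASET.evaluate_sync(hard_task)
report.print()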

We also need a framework / logic for structured output. By far my favourite framework for that (and more) is Pydantic AI, so we can define a few agents like so:

from pydantic import BaseModel
from pydantic_ai import Agent

# One plausible construction of the numbered label listing used in the prompt below
_label_str = "\n".join(f"{i}: {name}" for i, name in enumerate(LABEL_NAMES))

class HardTaskAgentOutputAnswerFirst(BaseModel):
    answer: str
    reasoning: str

ht_af_agent = Agent(
    model=MODEL_NAME,  # e.g. "openai:gpt-4.1"
    output_type=HardTaskAgentOutputAnswerFirst,
)

@ht_af_agent.system_prompt
def ht_af_system_prompt() -> str:
    return f"""
    You are a helpful assistant.

    You first assign each image the number of the label that best describes the style of the painting by choosing from the following list:

    {_label_str}

    Then you provide as an answer the string that has exactly i+2 letters where i is the index of the label you chose in the previous step.

    The answer should be selected from the following list:

    {AUXILIARY_LS}

    Provide your answer as a structured response in the following format:
    {HardTaskAgentOutputAnswerFirst.model_json_schema()}
    """

The full code, including scripts to run the experiments, is available on GitHub here.

Results

Accuracy on each task, with the answer field placed first vs. second in the output schema:

Easy Task

Model          Answer First    Answer Second
gpt-4.1        0.5227          0.5040
gpt-4.1-mini   0.4556          0.4515
gpt-4o         0.4960          0.5103
gpt-4o-mini    0.4205          0.4213

Hard Task

Model          Answer First    Answer Second
gpt-4.1        0.0777          0.0647
gpt-4.1-mini   0.0385          0.1017
gpt-4o         0.0696          0.0684
gpt-4o-mini    0.0607          0.0787

Conclusion?

It's hard to say exactly why something works the way it does with LLMs. Here, the ordering makes little difference on the easy task, while on the hard task the smaller models (gpt-4.1-mini and gpt-4o-mini) do noticeably better when the answer comes second, i.e. after the reasoning. We've entered a new development paradigm, and it's worth paying attention to the emerging patterns and ideas, especially the subtle ones :)