Codex vs Factory vs Cursor: Does the Harness Really Matter for MVP Prototyping?

When I first thought about this article, I had planned a headline like ‘Harness Matters’ or something cool like ‘A good harness is all you need’. To my own shock, the harness does not seem to make much of an impact, at least as far as MVP’ing a front-end wireframe goes. It may matter more further down the line, as the differing abilities of each harness compound, but if you are just starting out in this field and testing the thousand tools that access the same models, it does not really matter which one you opt for.

:::info
I must reiterate that this holds true only if you are not using them to their full power, as each holds special powers that can make it superior to the others.

For example: Claude Code has agents, plugins, and even memory for some lucky users. Codex has none of these: no memory (unless you manually build a memory.md and tell it how to fill it) and no real support for agents or plugins like Claude Code’s. Factory has its own concept called Droids, and Cursor can make use of all of that but does things slightly differently when it comes to combing through your directory (semantic search vs Claude’s grepping everything to maximize token usage).

:::

Experimentation strategy:

We cannot remove the non-determinism of these models, so we control what we can: the skills stay constant, the prompt stays constant, and we compare what the end product looks like. Each run creates a spec.md, a backend.md, and a wireframe HTML file that helps us visualize the model’s understanding of the task.
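To make the setup concrete, here is a minimal sketch of the run plan in Python. It is only an illustration of the design, not the tooling I actually used: the `Run` class, the helper names, and the `wireframe.html` filename are assumptions made up for this example, while the models, harnesses, and the two markdown files are the ones named in this article.

```python
from dataclasses import dataclass

# Artifacts each run is asked to produce. "spec.md" and "backend.md" are named
# in the article; "wireframe.html" is an assumed filename for the HTML wireframe.
EXPECTED_ARTIFACTS = ("spec.md", "backend.md", "wireframe.html")

# Model -> harnesses it is tried in, as described in this article.
RUNS = {
    "GPT 5.3 Codex (High)": ("Codex", "Factory", "Cursor"),
    "Claude Opus 4.6 (High)": ("Claude Code", "Factory", "Cursor"),
}


@dataclass(frozen=True)
class Run:
    """One model/harness pairing, always with the same prompt (hypothetical helper)."""
    model: str
    harness: str
    prompt: str
    attempts: int = 1  # one shot: no back and forth


def build_run_plan(prompt: str) -> list[Run]:
    """Pair every model with each of its harnesses, holding the prompt constant."""
    return [
        Run(model, harness, prompt)
        for model, harnesses in RUNS.items()
        for harness in harnesses
    ]


def missing_artifacts(produced: set[str]) -> list[str]:
    """Return the expected files that a run did not produce."""
    return [name for name in EXPECTED_ARTIFACTS if name not in produced]
```

The only things that vary between runs are the model and the harness; everything else is pinned, which is the whole point of the comparison.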

PROMPT:
I have an idea, and it is out of a pain point that I am suffering. I want to create a plan for it, and a wireframe or a MVP for the UI how it would look that is navigatable or clickable, its essentially a website and this is how it works:
This is generally for people who have friends, who are not big creators, and are trying to get a head start by asking their friends to engage with their content by liking or sharing it.
The primary issue by such small creators is that they often feel shy or embarassed by asking again and again from their friends to like or share their article, and especially if they crosspost across different platforms that is even more awkward. So how do you manage this?
The idea works around the fact that these friends trust each other enough to basically connect their platforms in this website such that: creator_friend posts at substack, posts at hackernoon, posts at X, posts at LinkedIn, posts at Facebook and posts at Threads. creator_friend then goes on to inform this website with urls, and it appears like a notification against their name.
The group_friends of this creator_friend can then automatically from their connected platforms run an automated like and share action on all these platforms. This is sort of a pledge network, where they have pledged their support to them by this action, and also to avoid a weekly flood of messages to like something or share something.
Remember: Ease for Supporter to pledge, connect platforms to set to auto-like and auto-share or just auto-like.
Create a spec sheet, detailed back end and before that, a wireframe sorta thing to showcase what the web would look like.

The skill we used is found here.

==GPT 5.3 Codex (High): Codex v Factory v Cursor==

This is a one-shot attempt, which means there is no back and forth.

The output from GPT 5.3 Codex was less than impressive in every case, and left me shocked. Initially I believed it was just Factory not being a great harness, but after running the same prompt in Codex itself, I came to believe that front-end design is simply not this model’s strength.

Codex:

GPT seems excited to have done a wireframe at all with its prototype note

Is this how it thinks a UI should look? What was it even trained on to have spat this out?

Factory:

Tabs move from vertical to horizontal

Besides opting for light mode, nothing really stands out.

Cursor:

The plight of ugly UI choices continues

And we are back to dark mode, with absolutely no red flags raised about this atrocious front end.

All in all, nothing impressive. In all of its responses, GPT 5.3 Codex seemed to be dying to jump into actually implementing the idea with different proposed tech stacks, rather than considering for a moment that what it had put together might just be the ugliest-looking wireframe in human history.

==Claude Opus 4.6 (High): Claude Code v Factory v Cursor==

Before I even show you what it looks like, let me just say that Opus 4.6 was visually impressive. The only other time I was this happy with a model’s front-end output was with Gemini Flash 3.0 Preview.

Claude Code:

It breaks the work down into multiple parts: a landing page, a hero section, a login page and so much more.

Landing Page

Dashboard; it already looks levels ahead of whatever GPT 5.3 Codex spat out.

Next up, we take a look at Factory’s attempt, and see how different it is from Claude Code’s MVP.

Factory:

Once again, we see the same trend. The model is the same, and the harness does not seem to have enough of an effect to override the bias of the training data and preference optimization.

We get the same breakdown: the landing page, a nice-looking hero section, a login page, etc.

Landing Page

It does one thing differently: the feed. It makes the dashboard a subtler element and gives the Feed higher importance. But given the non-deterministic nature of these models, I would have seen this level of variation even across separate conversations with the same model in the same harness.

Factory also seems to have taken more liberty around icons in comparison to what we got from Claude Code.

Finally, we take a look at how different it is inside Cursor.

Cursor

Cursor’s Opus produced the least content of the three, but this means almost nothing 😀 due to the non-deterministic nature of these models.

Landing Page

Feed, Dashboard and Icons!

Conclusion

Well, it seems obvious now that harnesses only start to matter when you actually use what they offer. In the case of Claude Code, you get hooks, sub-agents, and plugins. With Factory, you get Droids, MCP, and agents, and similarly with Cursor you get the suite of features it is designed around.

This is where the ‘difference’ starts to become obvious: your prompt is handled ‘differently’ by each of them, but only if you are exploiting all that they have to offer.

If you are just starting out, don’t really understand any of these choices, and are getting decision paralysis, fret no more. The point I am trying to make is that, for prototyping, it doesn’t really matter which harness you choose. Oh, and if what you want to build is going to be UI heavy, go for Claude.

:::tip
For anyone running local LLMs and fighting for performance from non-reasoning models, there’s an interesting trick floating around: putting your prompt in twice can improve the response.

You can read more in this research paper called ‘Prompt Repetition Improves Non-Reasoning LLMs’.

:::
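If you want to try the repetition trick from the tip above, a minimal sketch looks like this. The `repeat_prompt` helper and the blank-line separator are my own assumptions for illustration; I am not reproducing the exact formatting used in the paper, so treat it as a starting point to experiment with.

```python
def repeat_prompt(prompt: str, times: int = 2, separator: str = "\n\n") -> str:
    """Return the prompt repeated `times` times, joined by `separator`.

    Hypothetical helper: the separator and default count are assumptions,
    not values taken from the paper.
    """
    if times < 1:
        raise ValueError("times must be at least 1")
    # Strip surrounding whitespace so the copies are identical.
    return separator.join([prompt.strip()] * times)
```

You would then send the returned string to your local model in place of the original prompt and compare the two responses side by side.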

Signing Out,

Abdullah
