and the worst thing for me is that everything shows up as aggregate usage: total tokens, total cost, maybe per model.
So I ended up hacking together a thin layer in front of OpenAI where every request is forced to carry some context (agent, task, user, team). It logs and calculates the cost per call, and puts some basic limits on top so you can actually block something if it starts going off the rails. It’s very barebones, but even just seeing “this agent + this task = this cost” was a big relief.
It uses your own OpenAI key, so it’s not doing anything magical on the execution side, just observing and enforcing.
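For anyone who wants to picture the idea, here is a minimal sketch of that kind of per-call attribution and limit check. It is not the actual code behind the tool. The names (`CallContext`, `UsageLedger`, `BudgetExceeded`), the model name, and the prices are all made up for illustration, and the token counts are assumed to come from whatever usage data the API returns with each response.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical USD prices per 1M tokens; replace with your provider's real rates.
PRICES = {"example-model": {"input": 1.00, "output": 2.00}}


@dataclass(frozen=True)
class CallContext:
    """Context every request is forced to carry."""
    agent: str
    task: str
    user: str
    team: str


class BudgetExceeded(Exception):
    """Raised when an agent + task pair has used up its budget."""


class UsageLedger:
    def __init__(self, limits):
        # limits maps (agent, task) -> maximum spend in USD
        self.limits = limits
        self.spend = defaultdict(float)

    def check(self, ctx):
        """Call before sending a request; blocks once the limit is reached."""
        key = (ctx.agent, ctx.task)
        limit = self.limits.get(key)
        if limit is not None and self.spend[key] >= limit:
            raise BudgetExceeded(f"{ctx.agent} + {ctx.task} reached ${limit:.4f}")

    def record(self, ctx, model, input_tokens, output_tokens):
        """Call after a response; returns the cost of this one call in USD."""
        price = PRICES[model]
        cost = (input_tokens * price["input"]
                + output_tokens * price["output"]) / 1_000_000
        self.spend[(ctx.agent, ctx.task)] += cost
        return cost


if __name__ == "__main__":
    ledger = UsageLedger(limits={("researcher", "summarize"): 0.005})
    ctx = CallContext(agent="researcher", task="summarize", user="alice", team="growth")
    ledger.check(ctx)  # passes, nothing spent yet
    cost = ledger.record(ctx, "example-model", input_tokens=1000, output_tokens=2000)
    print(f"{ctx.agent} + {ctx.task} = ${cost:.4f}")
```

The real request would go between `check` and `record`, so the layer only observes and enforces and never touches how the call itself is executed.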
I want to know how you guys are dealing with this right now. Are you just watching aggregate usage and trusting it, or have you built something to break it down per agent / task?
If useful, here is the rough version I’m using: https://authority.bhaviavelayudhan.com/
I've had a Claude subscription for the past year, although I only really started properly using LLMs in the past couple of months. With Opus, I get about 5 messages every 5 hours (on a fairly small codebase); more with Sonnet. I cancelled that, since it's practically unusable, and got a ChatGPT sub about a week ago. I'm currently using it with 5.4 High and I haven't had to worry about limits. But the code it produces is definitely "different" and I need to plan more in advance. Its plan mode is also not as precise as Claude's (it doesn't lay out the method stubs it plans to implement, etc.), so I suppose I may need to change how I work with it? Lastly, for normal chats it produces significantly more verbose output (with personality set to Efficient) and is fast (with Thinking), but it often feels as though it's not as thorough as I'd like it to be.
My question: is this a "you're holding it wrong" type of situation, where I just need to get used to a different mode of interaction? Or are others noticing a material difference in quality? Ideally I'd like to stick with ChatGPT due to the borderline impractical limits with Anthropic.