Failures are fed back to the LLM so it can regenerate taking that feedback into account. People are much happier with it than I could have imagined, though it's definitely not cheap (but the cost difference is very OK for the tradeoff).
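Concretely it's just a retry loop; a minimal sketch, where call_model and validate stand in for whatever client and checks you actually use:

    def generate_with_feedback(prompt, call_model, validate, max_attempts=3):
        """Regenerate until the output passes validation, feeding failures back in."""
        feedback = ""
        for _ in range(max_attempts):
            full_prompt = prompt if not feedback else (
                prompt + "\n\nYour previous attempt failed these checks:\n" + feedback
            )
            output = call_model(full_prompt)   # hypothetical: whatever client you use
            errors = validate(output)          # hypothetical: returns a list of failure messages
            if not errors:
                return output
            feedback = "\n".join(f"- {e}" for e in errors)
        raise RuntimeError(f"still failing after {max_attempts} attempts:\n{feedback}")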
I know that consuming something and not thumbing it up/down sort of does that, but it's a vague enough signal (it could also mean "not close enough to the keyboard/remote to thumbs up/down") that recommendation systems can't count it as an explicit choice.
In practice, people generally didn't even vote with two options, they voted with one!
IIRC youtube did even get rid of downvotes for a while, as they were mostly used for brigading.
No, they got rid of them most likely because advertisers complained that when they dropped some flop they got negative press from media going "lmao 90% dislike rate on new trailer of <X>".
Stuff disliked to oblivion was either just straight-up bad or wrong (in the case of bad tutorials/info), and brigading was a very tiny percentage of it.
Visit a YouTube video today and you can still upvote and downvote with the exact same thumbs up or down; the site, however, only displays the upvote count. Channel owners/admins can still see the downvote count, and the downvotes presumably still inform YouTube's algorithms.
If you want downvote data to be more precise, do your part and install the extension! :-)
“You’re absolutely right! Nice catch how I absolutely fooled you”
By having independent tests and seeing whether the output passes them (yes or no), then rolling that up into a single score where some (more complicated) tasks are weighted more heavily than others.
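Roughly like this, as a sketch (the task names and weights are made up):

    # Each test is pass/fail; harder tasks carry more weight in the final score.
    results = [
        {"task": "simple lookup",        "weight": 1.0, "passed": True},
        {"task": "multi-step reasoning", "weight": 3.0, "passed": False},
        {"task": "edge-case handling",   "weight": 2.0, "passed": True},
    ]

    score = sum(r["weight"] for r in results if r["passed"]) / sum(r["weight"] for r in results)
    print(f"weighted pass rate: {score:.2f}")   # 0.50 for this example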
IME deep thinking has moved from upfront architecture to post-prototype analysis.
Pre-LLM: Think hard → design carefully → write deterministic code → minor debugging
With LLMs: Prototype fast → evaluate failures → think hard about prompts/task decomposition → iterate
When your system logic is probabilistic, you can't fully architect in advance—you need empirical feedback. So I spend most time analyzing failure cases: "this prompt generated X which failed because Y, how do I clarify requirements?" Often I use an LLM to help debug the LLM.
The shift: from "design away problems" to "evaluate into solutions."
Another big problem is that it's hard to set objectives in many cases; for example, maybe your customer service chat still passes but comes across worse with a smaller model.
I'd be careful, is all.
I'd push everyone to self-host models (even if it's on a shared compute arrangement), as no enterprise I've worked with is prepared for the churn of keeping up with the hosted model release/deprecation cadence.
(Potentially interesting aside: I’d say I trust new GLM models similarly to the big 3, but they’re too big for most people to self host)
For a medical use case, we tested multiple Anthropic and OpenAI models as well as MedGemma. We were pleasantly surprised when the LLM-as-Judge scored gpt5-mini as the clear winner; I don't think I would have considered using it for these specific use cases, having assumed higher reasoning was necessary.
Still waiting on human evaluation to confirm the LLM Judge was correct.
We have a hard OCR problem.
It's very easy to make high-confidence benchmarks for OCR problems (just type out the ground truth by hand), so it's easy to trust the benchmark. Think accuracy and token F1. I'm talking about highly complex OCR that requires a heavyweight model.
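For anyone unfamiliar, token F1 against the hand-typed ground truth is only a few lines; a sketch, with naive whitespace tokenization and no normalization:

    from collections import Counter

    def token_f1(prediction, ground_truth):
        """Token-level F1 between the OCR output and the hand-typed ground truth."""
        pred = prediction.split()
        gold = ground_truth.split()
        overlap = sum((Counter(pred) & Counter(gold)).values())
        if overlap == 0:
            return 0.0
        precision = overlap / len(pred)
        recall = overlap / len(gold)
        return 2 * precision * recall / (precision + recall)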
Scout (Meta), a very small/weak model, is outperforming Gemini Flash. This is highly unexpected and a huge cost savings.
Some problems aren't so easily benchmarked.
It's the hard part of using LLMs and a mistake I think many people make. The only way to really understand or know is to have repeatable and consistent frameworks to validate your hypothesis (or in my case, have my hypothesis be proved wrong).
You can't get to 100% confidence with LLMs.
Since building a custom agent setup to replace Copilot, adopting/adjusting Claude Code prompts, and giving it basic tools, gemini-3-flash is my go-to model unless I know it's a big and involved task. The model is really good at 1/10 the cost of pro, super fast by comparison, and some basic a/b testing shows little to no difference in output on the majority of tasks I used it for.
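The a/b testing is nothing fancy; a sketch, where run_task is a hypothetical wrapper around my agent setup:

    def ab_compare(tasks, run_task, model_a, model_b):
        """Run the same tasks through two models and collect outputs side by side."""
        rows = []
        for task in tasks:
            out_a = run_task(model_a, task)   # hypothetical wrapper around the agent/client
            out_b = run_task(model_b, task)
            rows.append({"task": task, "a": out_a, "b": out_b,
                         "identical": out_a.strip() == out_b.strip()})
        return rows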
Cut all my subs, spend less money, don't get rate limited
I've been using the smaller models ever since. Nano/mini, flash, etc.
I have found out recently that Grok-4.1-fast has similar pricing (in cents) but a 10x larger context window (2M tokens instead of the ~128-200k of gpt-4-1-nano). And ~4% hallucination, the lowest in blind tests in LLM arena.
I'm unwilling to look past Musk's politics, immorality, and manipulation on a global scale
I have yet to go back to small models; I'm waiting on the upstream feature, and the GPU provider has been seeing capacity issues, so I am sticking with the gemini family for now.
gemini-3-flash stands well above gemini-2.5-pro
1. There is still night and day difference
2. Local is slow af
3. The vast majority of people will not run their own models
4. I would have to spend more than $200+ a month on frontier AI to come close to the same price it would cost for any decent AI-at-home rig. Why would I not use frontier models at this point?
Presumably that'll be some sort of funnel for a paid upload of prompts.
What seems missing: I cannot see the answers from the different models; one has to rely on the "correctness" score.
Another minor thing: the scoring seems hardcoded to 50% correctness, 30% cost, 20% latency, which is OK, but in my case I care more about correctness and don't care about latency at all.
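Letting users override the weights would be enough, something like this (a sketch; metric names assumed from the description, each metric already normalized to 0..1):

    # The site's current hardcoded split.
    DEFAULT_WEIGHTS = {"correctness": 0.5, "cost": 0.3, "latency": 0.2}

    def blended_score(metrics, weights=DEFAULT_WEIGHTS):
        """metrics and weights are dicts keyed by metric name; higher metric values are better."""
        return sum(weights[k] * metrics[k] for k in weights)

    # What I'd rather use: correctness-heavy, latency ignored.
    my_weights = {"correctness": 0.8, "cost": 0.2, "latency": 0.0}
    print(blended_score({"correctness": 0.9, "cost": 0.4, "latency": 0.1}, my_weights))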
Wow! This was my test prompt:
You are an expert linguist and translator engine.
Task: Translate the input text from English into the languages listed below.
Output Format: Return ONLY a valid, raw JSON object.
Do not use Markdown formatting (no ```json code blocks).
Do not add any conversational text.
Keys: Use the specified ISO 639-1 codes as keys.
Target Languages and Codes:
- English: "en" (Keep original or refine slightly)
- Mandarin Chinese (Simplified): "zh"
- Hindi: "hi"
- Spanish: "es"
- French: "fr"
- Arabic: "ar"
- Bengali: "bn"
- Portuguese: "pt"
- Russian: "ru"
- German: "de"
- Urdu: "ur"
Input text to translate:
"A smiling boy holds a cup as three colorful lorikeets perch on his arms and shoulder in an outdoor aviary."Here's a bug report, by switching the model group the api hangs in private mode.
Where is TDD for prompt engineering? Does it exist already?
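What I have in mind is something like plain pytest around the prompt; a sketch, where my_agent.run_prompt is a hypothetical wrapper around the model call and the cases are invented:

    # test_support_prompt.py -- red/green cycle: write the case first, then tune the prompt.
    import pytest
    from my_agent import run_prompt   # hypothetical: your own wrapper around the model call

    CASES = [
        ("I bought this 10 days ago and want a refund.", "30-day return policy"),
        ("My package arrived damaged.", "replacement"),
    ]

    @pytest.mark.parametrize("user_message, must_mention", CASES)
    def test_reply_mentions_policy(user_message, must_mention):
        reply = run_prompt(user_message)
        assert must_mention.lower() in reply.lower()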
In most cases, e.g. with regular ML, evals are easy and not doing them results in inferior performance. With LLMs, especially frontier LLMs, this has flipped: not doing them will likely still give you alright performance, and at the same time proper benchmarks are tricky to implement.
What I did splurge on was brief OpenAI access for a subtitle translator program, and some DeepSeek API use. Actually, I think that $13 includes some as-yet-unused credits. :D
I'd be happy to provide details if CLIs are an option and you don't mind some sweatshop agent. :)
(I am just now noticing I meant to type 2 years not 3 above. Sorry about that.)
I get a good amount of non-agentic use out of them, and pay literally less than $1/month for GLM-4.7 on deepinfra.
I can imagine my costs might rise to $20-ish/month if I used that model for agentic tasks... still a very far cry from the $1000-$1500 some spend.
Sorry, this just makes no sense to start off with. What do you mean?
Stop prompt engineering, put down the crayons. Statistical model outputs need to be evaluated.
It’s shocking to me how often it happens. Aside from just the necessity to be able to prove something works, there are so many other benefits.
Cost and model commoditization are part of it, like you point out. There's also the potential for degraded performance because off-the-shelf benchmarks aren't generalizing the way you expect. Add to that an inability to migrate to newer models as they come out, potentially leaving performance on the table. There are like 95 serverless models in Bedrock now, and as soon as you can evaluate them on your task they immediately become a commodity.
But fundamentally you can’t even justify any time spent on prompt engineering if you don’t have a framework to evaluate changes.
Evaluation has been a critical practice in machine learning for years. IMO it is no less imperative when building with LLMs.
It sounds like he's building some kind of ai support chat bot.
I despise these things.
On the other hand, this would be interesting for measuring agents in coding tasks, but there's quite a lot of context to provide here, both input and output would be massive.
Any resources you can recommend to properly tackle this going forward?
- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N
Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps
Why: it reduces the volatility of the scoring while still maintaining the creativity (temperature) needed for good responses.
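A sketch of how that composes, where ask_judge is a hypothetical LLM-as-judge call that returns a strict Y or N for each check:

    RUBRIC = [
        ("accuracy",   0.5, "Did the reply cite the 30-day return policy? Answer Y or N."),
        ("tone",       0.3, "Is the tone professional and empathetic? Answer Y or N."),
        ("next_steps", 0.2, "Does the reply offer clear next steps? Answer Y or N."),
    ]

    def score_reply(reply, ask_judge):
        """ask_judge(question, reply) -> 'Y' or 'N'; a hypothetical LLM-as-judge call."""
        total = 0.0
        for _name, weight, question in RUBRIC:
            verdict = ask_judge(question, reply).strip().upper()
            total += weight if verdict.startswith("Y") else 0.0
        return total  # = 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps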