It doesn't matter really, what matters is our ability to stare into the void of what we don't know and start making progress.
Our ability to process and master new topics is part of the job.
I'm sure you've done that countless times.
I have to disagree and question what you mean by "optimization". It's very easy to write web code that technically accomplishes a task, but does so poorly. This is the natural consequence of having so many options available.
The vast majority of web devs with less than 5 years of experience simply don't understand plain javascript well enough. It's a longstanding problem that devs will reach for the most ergonomic tools, not the best tools.
Lacking sufficient experience, they can't help it. This happens in all programming languages and in all layers of software. AI slop is even worse because it tends towards the mean.
And the tools themselves are built by other engineers and they need new features, debugging, optimization etc. It is turtles all the way down.
But each layer has its own jargon, conventions, and unwritten hacks. That is where experience comes in. Once you get out of a rabbit hole or pothole, you are one step closer to becoming the “domain expert”. There is no shortcut.
they are never tested on it, and many won't dig that deep in the day-to-day. Whose fault is it that they don't know plain javascript well enough? That's the result of shipping "content" over any other metric of proper software engineering.
Funnily enough I did take a mini-course (not a week, but we're talking maybe 100 hours of work as a recreational online summer class) in plain javascript at my university. Quite the quirky language. But this was in ES3 or so, so maybe there's many more guard rails these days against the core jank that makes up JS
Isn't that mostly because, as you go up the abstraction layers, tools and docs to teach yourself the tricks of the trade quickly are in abundance (let alone for a popular layer like React)? Which in turn is likely a function of incentives and opportunities.
This was one of my gripes in college, why am I implementing something if I just need to understand what it does? I'm going to use the built-in version anyway.
And so you can write your own because you're probably going to want to sort data in a specific way. Sort doesn't mean in numerical increasing or decreasing order, it means whatever order you want. You're sorting far more often than you're calling the sort function.
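A quick Python sketch of what that looks like in practice (the record fields here are made up):

    # Hypothetical records; "sorted" means whatever order the task needs,
    # not just ascending numbers.
    orders = [
        {"customer": "b", "priority": 2, "total": 40.0},
        {"customer": "a", "priority": 1, "total": 99.0},
        {"customer": "a", "priority": 2, "total": 10.0},
    ]

    # Highest priority first, then biggest total, then customer name as tiebreak.
    orders.sort(key=lambda o: (-o["priority"], -o["total"], o["customer"]))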
It's almost wild to me that you never have.
Sometimes you need a better sort for just one task. Sometimes you need a parser because the data was never 100% standards compliant. Sometimes you need to reread Knuth for his line-breaking algorithm.
He was brought in by the state to do some coaching for existing software devs back in the 90s. When he was going over the various different basic algorithms (insertion sort, selection sort, etc.) one of the devs in the back of the class piped up with, "why are you wasting our time? C++ has qsort built in."
When you're processing millions of records, many of which are probably already sorted, using an insertion sort to put a few new records into a sorted list, or using selection sort to grab the few records you need to the front of the queue, is going to be an order of magnitude faster than just calling qsort every time.
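Roughly this, in Python (sizes and data made up, but the shape of the argument is the same):

    import bisect
    import random

    # A big list that is already sorted, plus a handful of new records.
    sorted_records = sorted(random.random() for _ in range(1_000_000))
    new_records = [random.random() for _ in range(5)]

    # Insertion-sort style: one binary search and one insert per new record,
    # instead of re-sorting all million-and-five records from scratch.
    for r in new_records:
        bisect.insort(sorted_records, r)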
Turned out he worked for department of revenue. So my teacher roasted him with "oh, so you're the reason it takes us so long to get our tax returns back."
Thinking that you can just scoot by using the built-in version is how we get to the horrible state of optimization that we're in. Software has gotten slow because devs have gotten lazy and don't bother to understand the basics of programming anymore. We should be running a machine shop, not trying to build a jet engine out of Lego.
funnily enough, this wasn't limited to contributing to some popular OS initiative. You can call YAGNI, but many companies do in fact have their own libraries to maintain internally. So it comes up more than you expect.
On a higher level, the time I took to implement a bunch of sorts helped me be able to read the docs for sort(), realize it's a quicksort implementation, and make judgements like
1. yeah, that works
2. this is overkill for my small dataset, I'll just whip up basic bubblesort
3. oh, there's multiple sort APIs and some sorts are in-place. I'll use this one (quick sketch of the difference after this list)
4. This is an important operation and I need a more robust sorting library. I'll explain it to the team with XYZ
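The point 3 distinction, spelled out in Python:

    data = [3, 1, 2]

    copy = sorted(data)   # allocates and returns a new sorted list; data is untouched
    data.sort()           # sorts the existing list in place; returns None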
The reasoning was the important lesson, not the ability to know what sorting is.
So you can pass job interviews, of course!
I'll take any interviews at this point in time.
But yes, every domain has its jargon. I work tangentially to this and quickly understood this as a GPGPU problem. A relatively elementary one if you studied this space, though a time limit of 2 hours seems overly restrictive if you aren't actively studying this stuff.
The task is to parallelize tree traversal, which is embarrassingly unparallel so it's tricky.
Is that really the case? My experience is fairly limited, but I've found that the LLM's willingness to fill in plausible sounding (but not necessarily at all accurate) numbers where it needs them to be a significant hindrance when asking it to think about performance.
However, when I hit "scratch_write" and it wasn't in the Machine class and it wasn't coming from some Decorator and it was getting defined and deleted by a member function ... I stopped. That's paying lip service to the variable typing that is scattered around and actively hampers even basic IDE usage. Probably the typing was added by AI/LLM after the fact, and it missed that unusual usage. The Python convention used to be that those kinds of variables got declared as "_scratch_write" with a leading underscore to flag that they were "private/internal".
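The pattern being complained about looks roughly like this (an illustrative sketch, not the actual code from the takehome):

    class Machine:
        def __init__(self) -> None:
            # The old convention: declare internal state up front, with a
            # leading underscore to flag it as private/internal.
            self._tmp_buffer: list[int] = []

        def run_pass(self) -> None:
            # The anti-pattern: an attribute that only exists for the duration
            # of one method call. No declaration, no annotation, invisible to
            # IDEs and type checkers.
            self.scratch_write = []
            self.scratch_write.append(42)
            del self.scratch_write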
That was the gigantic red "We write shitty code" signal or worse "We don't care about wasting your time" signal. Human review should have flagged that.
Shame. I was kinda looking forward to the technical problem, but I'm not going to spend a bunch of time using grep to untangle garbage code to get at it.
I suspect everything would actually be much clearer if you wrote it in SystemVerilog and tested with Cocotb. Let's see if their LLMs can handle that porting job. HAH!
A lot of people write Python code and then run "AI" on it to fill in the variable types. This, of course, is error prone and shitty. And the AI will miss strange usages like the one I flagged.
Although I am sorry for phrasing it as "variable typing". I can see how you might read that as "typing that varies" instead.
If you look at the top of perf_takehome.py then there is a brief comment saying the challenge is to optimize a kernel. Kernel in GPU land means a program that computes on data in parallel, it's not an OS kernel:
Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the
available time, as measured by test_kernel_cycles on a frozen separate copy
of the simulator.
However, this kernel doesn't run on an actual GPU. It runs on a little interpreter for a custom assembly language written in Python. Thus you will be optimizing the program built in-memory by the function on this line: https://github.com/anthropics/original_performance_takehome/...
This function is described only as:
Like reference_kernel2 but building actual instructions.
Scalar implementation using only scalar ALU and load/store.
The KernelBuilder class has some fields like "instrs" but we can't immediately see what they're meant to be because this is Python and types are optional. Nonetheless we can see that instructions are being added to a list, and below we can see the test_kernel_cycles function that runs the interpreter on the program. So our mission is to change the build_kernel function to make a better program. And it says this is an assembly version of the Python function reference_kernel2, which is found in problem.py.

What exactly is this kernel doing? The reference_kernel2 function doesn't explain itself either - it's some sort of parallel tree walk. Let's put that to one side for a second and explore the machine, which is defined in problem.py. The machine itself is also largely undocumented, but there's a brief description in a docstring on line 66.
At this point it helps to understand the design of exotic processors. The emulator is for a fictional CPU that uses a VLIW SIMD ISA. Normal programmers will never encounter such a chip. Intel tried to make such a machine decades ago and it never took off, since then the concept has been largely dead. I believe it's still used in some mobile DSPs like Qualcomm's Hexagon. Notably, NVIDIA PTX is not such an ISA so this seems to have been chosen just to make things harder. As the comment explains, in a VLIW machine multiple instructions are packed together into a "slot" and executed in parallel. In a normal CPU the hardware reads a serial stream of instructions and works out just in time which can be executed in parallel, using fancy out-of-order circuitry. In a VLIW machine that's done ahead of time by the compiler or (in this case) the humble programmer, you. But this isn't just a VLIW machine, it's also multi-core, and multi-"engine", so there are multiple levels of execution going on. And it's SIMD, meaning each instruction can itself operate on multiple bits of data simultaneously.
This machine doesn't have registers or cache but it does have "scratch space", and so you can use the vector instructions to load data into a series of 32 bit scratch words and then do things on them in parallel. And multiple vector instructions can also run in parallel. "Broadcasting a scalar" in SIMD-speak means taking a single value and repeating it over multiple scratch space slots (or register subwords in a real machine), so you take e.g. 0xFF and get 0xFFFFFFFFFFFFFFFF.
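In plain-Python terms the broadcast is just this (illustrative only, not the simulator's actual API):

    VLEN = 8                   # assumed vector width
    scalar = 0xFF
    lanes = [scalar] * VLEN    # broadcast: one value repeated across every lane

    # A vector instruction then conceptually applies to all lanes "at once".
    shifted = [lane << 8 for lane in lanes]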
And that's it, that's all we get. As the code says: "This comment is not meant to be full ISA documentation though, for the rest you should look through the simulator code". Possible point of confusion: real ISAs are serialized to bytes but this one is just Python tuples. The code is only partially typed; sometimes you're just left guessing.
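Concretely, "just Python tuples" means the program has roughly this shape (the opcode names and operand order here are made up, not the repo's real ISA):

    program = [
        ("load",  "scratch0", "mem", 0x100),            # scalar load into scratch
        ("vxor",  "scratch8", "scratch0", "scratch4"),  # SIMD op across a run of slots
        ("store", "mem", 0x200, "scratch8"),
    ]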
So to recap, the problem is to optimize an undocumented program expressed in undocumented data structures returned by a Python function whose result is interpreted by a partly documented Python class that simulates a fictional exotic CPU architecture using an abandoned design that gives a lot of parallel computational capacity, but which requires all parallelism to be statically declared ahead of time, whilst simultaneously reverse engineering the Python that does all this.
Does that help? Sounds like a fun exercise :)
Edit: I just checked and Google TPUs are much more VLIW like so perhaps this simulator is designed to match a TPU. I know Anthropic rely on TPUs for serving and have done some optimization for them.
Since the focus of the challenge appears(?) intended to be optimization, not reverse engineering, it's a bit odd that they don't give a clear statement of what the kernel is meant to be computing. Perhaps the challenge is intended to be a combination of the two, but then the correct reverse engineering part of it becomes a gate for the optimization part, else you'll be solving the wrong problem.
Given the focus on results achieved by Opus 4.5, maybe that's the main point - to show how well Opus can reverse engineer something like this. If they gave the actual clear problem statement, then maybe you could brute force an optimal solution using tree search.
"Can you "reverse engineer" what the kernel in this optimization exercise is actually doing - write a specification for it?
https://github.com/anthropics/original_performance_takehome"
Gemini says it's doing inference on a random forest - taking a batch of inputs, running each one through each decision tree, and for each input outputting the sum of these decision tree outputs - the accumulated evidence.
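If that read is right, the computation has roughly this shape (a sketch with made-up node types, not reverse-engineered from the repo):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        feature: int = 0
        threshold: float = 0.0
        left: Optional["Node"] = None
        right: Optional["Node"] = None
        value: float = 0.0          # output when the node is a leaf

        @property
        def is_leaf(self) -> bool:
            return self.left is None and self.right is None

    def predict_forest(forest: list[Node], inputs: list[list[float]]) -> list[float]:
        # For each input, walk every tree and sum the leaf outputs.
        results = []
        for x in inputs:
            total = 0.0
            for root in forest:
                node = root
                while not node.is_leaf:
                    node = node.left if x[node.feature] < node.threshold else node.right
                total += node.value
            results.append(total)
        return results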
It's doing some sort of binary tree traversal, but the hashing and wrap around looks weird - maybe just a made up task rather than any useful algorithm?
If you can't make sense of such a small codebase or don't immediately recognize the algorithm that's being used (I'm guilty of the latter) then you presumably aren't someone that they want to hire.
They then provide you with a very naive implementation that runs on their (very simple) VLIW architecture that you are to optimize.
If at the end of that someone is still lost, I think it is safe to say the intent was for that person to fail.
The problem is about pipelining memory loads and ALU operations, so why not just give clear documentation and state the task rather than "here's a kernel - optimize it"? ¯\_(ツ)_/¯
And perhaps a third purpose is to use the simulator to test your ability to reason about hardware that you are only just getting familiar with.
Maybe they specified the challenge in this half-assed way to deliberately test those sorts of skills (even if irrelevant to the job), or maybe it was just lazily put together.
The other thing to note is that if you look at what the reference_kernel() is actually doing, it really looks like a somewhat arbitrary synthetic task (hashes, wraparound), so any accurate task specification would really need to be a "line by line" description of the steps, at which point you may as well just say "here's some code - do this".
I think they do and his name is Claude ;)
this is what all specialized chips like TPU/Cerebras require today, and it allows for better optimization than a generic CPU since you can "waste" 30 min figuring out the perfect routing/sequencing of operations, instead of doing it in the CPU in nanoseconds/cycles
another benefit is you can throw away all the CPU out-of-order/branch prediction logic and put useful matrix multipliers in its place
I think I'd be able to make some progress optimizing this program in two hours but probably not much. I'm not a performance engineer but have designed exotic emulated CPU architectures before, so that helps a lot.
I gleaned about half of this comment in a few minutes of just skimming the code and reading the comments on the functions and classes. There's only 500 lines of code really (the rest is the benchmark framework).
On the whole I don't think I'd perform all that well on this task given a short time limit but it seems to me to be an extremely well designed task given the stated context. The reference kernel easily fits on a single screen and even the intrinsic version almost does. I think this task would do a good job filtering the people they don't want working for them (and it seems quite likely that I'm borderline or maybe worse by their metric).
From JAX to VLIW: Tracing a Computation Through the TPU Compiler Stack, https://patricktoulme.substack.com/p/from-jax-to-vliw-tracin...
Google’s Training Chips Revealed: TPUv2 and TPUv3, HotChips 2020, https://hc32.hotchips.org/assets/program/conference/day2/Hot...
Ten Lessons From Three Generations Shaped Google’s TPUv4i, ISCA 2021, https://gwern.net/doc/ai/scaling/hardware/2021-jouppi.pdf
The ISA in this Anthropic machine is actually both, VLIW and SIMD, and both are relevant to the problem.
> Sounds like a fun exercise :)
I'll be honest, that sounds like the opposite of fun since the worst parts of my job are touching the parts of a Python codebase that are untyped. The sad part is this work codebase isn't even that old, maybe a few years, and the developers definitely should have known better if they had anyone capable leading them. Alas, they're all gone now.

Harder than figuring out the instruction set for some exotic CPU are definitely the giant untyped dicts/lists common in data science code.
I think that's one of the intentional points. Being able to quickly understand what the provided source code is doing.
¹https://github.com/anthropics/original_performance_takehome/...
²https://github.com/anthropics/original_performance_takehome/...
Do you make a habit of not presuming even basic competence? You believe that Anthropic left the task running for hours, got a score back, and never bothered to examine the solution? Not even out of curiosity?
Also if it was cheating you'd expect the final score to be unbelievably low. Unless you also suppose that the LLM actively attempted to deceive the human reviewers by adding extra code to burn (approximately the correct number of) cycles.
How do you explain the specific score that was achieved if as you suggest the LLM simply copied the answer directly?
- Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator
It's not about you being average, just a different knowledge set.
But this is good. Staying humble makes you hungrier for learning.
For me, I've had that mentality for the longest time and I didn't get anything done because, well, "I'm just average".
For me, a little bit of arrogance (there's no way I couldn't do X, let's go do it), even if I end up "looking stupid" (see, I told you it was that hard!), was far more valuable to my development
Always room to learn in software :)
the hot take is, there are other games.
Yes, this applies to some simulated imaginary CPU with an artificial problem. Except that the job asked here is exactly the core of what a performance engineer will do at anthropic: optimize kernels for their fleet of GPUs. Is it simplified? Yes! (e.g. the simulator does not restrict memory access patterns)
This is a real-world problem adapted to a lab setting that can fit in one's head in a matter of hours. Leetcode would have you reimplement the hashmap used in there.
In every other field it's helpful to understand the basics. I don't think software is the exception here.
I see this directly in Gemini CLI as the harness detects loops and bails the reasoning. But I've also just occasionally seen it take 15m+ to do trivial stuff and I suspect that's a symptom of a similar issue.
Seems like capacity because it works a lot better late at night.
I don't see the same with the claude models in antigravity.
Sometimes Gemini tools will just randomly stop and pass the buck back to you. The last thing will be like "I will read the <blah> code to understand <blah>" and then it waits for another prompt. So I just type "continue" and it starts work again.
And, sometimes it will spit out the internal CoT directly instead of the text that's actually supposed to be user-visible. So sometimes I'll see a bunch of paragraphs starting with "Wait, " as it works stuff out and then at the end it says "I understand the issue" or whatever, then it waits for a prompt. I type "summarise" and it gives me the bit I actually wanted.
It feels like all these things are related and probably have to do with the higher-level orchestration of the product. Like I assume there are a whole bunch of models feeding data back and forth to each other to form the user-visible behaviour, and something is wrong at that level.
I suspect this is also something like the "inverse" of a prompt hijacking situation. Basically it's losing track of where its output is flowing to (whereas prompt injection is when it loses track of where its input is flowing from).
After ~40 minutes, it got to:
The final result is 2799 cycles, a 52x speedup over the baseline. I successfully implemented Register Residency, Loop Unrolling, and optimized Index Updates to achieve this, passing all correctness and baseline speedup tests. While I didn't beat the Opus benchmarks due to the complexity of Broadcast Optimization hazards, the performance gain is substantial.
It's impressive as I definitely won't be able to do what it did. I don't know most of the optimization techniques it listed there.
I think it's over. I can't compete with coding agents now. Fortunately I've saved enough to buy some 10 acre farm in Oregon and start learning to grow some veggies and raise chickens.
Maybe Claude will be able to do that soon, too.
you can't compete with an AI on doing an AI performance benchmark?
That would be impressive.
Each ran the same spec headlessly in their native harness (one shot).
Results:
Agent                         Cycles  Time
─────────────────────────────────────────────
gpt-5-2                        2,124  16m
claude-opus-4-5-20251101       4,973  1h 2m
gpt-5-1-codex-max-xhigh        5,402  34m
gpt-5-codex                    5,486  7m
gpt-5-1-codex                 12,453  8m
gpt-5-2-codex                 12,905  6m
gpt-5-1-codex-mini            17,480  7m
claude-sonnet-4-5-20250929    21,054  10m
claude-haiku-4-5-20251001    147,734  9m
gemini-3-pro-preview         147,734  3m
gpt-5-2-codex-xhigh          147,734  25m
gpt-5-2-xhigh                147,734  34m
Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".

The performance killer is the "random" access reads of the tree node data which the scalar implementation hides, together with the lack of load bandwidth, and to tackle that you'd have to rewrite the kernel to optimize the tree data loading and processing.
This is an interesting way to recruit. Much better than standard 2 leetcode medium/hard questions in 45 mins.
Then again, this may just be a way to get free ideas at optimising their product from outside the box.
Why are we still interviewing like its 1999?
I try give positive feedback for candidates who didn't know the problem but could make good use of hints, or had the right approach. But unfortunately, it's difficult to pass a Leetcode interview if you haven't seen a similar problem to what is asked before. Most candidates I interview nowadays seem to know all questions.
That's what the company has decided so we have to go along. The positive side is that if you do your part, you have good chances of being hired, even if you disagree with the process.
This type of individual is more likely to follow orders and work hard - and most importantly - be like the other employees you hired.
Now it means the company is stupid.
It's true that being ready for leetcode takes practice, but at least it's standard so you can re-use the skills to other interviews. Optimizing some generated code is certainly fun, but it's as useless as leetcode for your average programmer.
> I find it unreasonable to ask a candidate to spend that much time
And same for some reason does not apply to leetcode style interviews?
> It would take something like one week full time to work on this
I am not sure if this is satire or what? You need months of continuous preparation to be ready for the leetcode style interview.
> Optimizing some generated code is certainly fun, but it's as useless as leetcode for your average programmer.
No, it is not. This is specifically the type of job you would be doing tomorrow at Anthropic team if hired. And they are specifically hiring people who are already good enough at that very task. The same cannot be said for the leetcode, not even remotely comparable.
[1] https://en.wikipedia.org/wiki/Demoscene [2] https://en.wikipedia.org/wiki/Code_golf
It even uses Chrome tracing tools for profiling, which is pretty cool: https://github.com/anthropics/original_performance_takehome/...
But to be honest, I wonder what algorithm they implement. I have read the code for 2 minutes, and it sounds like random forest prediction. Does anyone know what the code does?
As a take home assignment though I would have failed as I would have probably taken 2 hours to just sketch out ideas and more on my tablet while reading the code before even changing it.
"before Claude Opus 4.5 started doing better than humans given only 2 hours"
"Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours"
"Claude Opus 4.5 after 2 hours in our test-time compute harness"
"Claude Sonnet 4.5 after many more than 2 hours of test-time compute"
So that does make one wonder where this comes from. Could just be LLM generated with a talking point of "2 hours", models can fall in love with that kind of stuff. "after many more than 2 hours" is a bit of a tell.
Would be quite curious to know though. How I usually design take home assignments is:
1. Candidate has several _days_ to complete (usually around a week).
2. I design the task to only _take_ 2-4 hours, informing the candidate about that, but that doesn't mean they can't take longer. The subsequent interview usually reveals if they went overboard or struggled more than expected.
But I can easily picture some places sending a candidate the assignment and asking them to hand in their work within two hours. Similar to good old coding competitions.
I think I'm going to get sub-900 since I just realized I can compute in parallel whether stage 5 of the hash is odd just by looking at bits 16 and 0 of stage 4, with less delay...
Let me put down my thought process: you have to start by designing a 6-slot x 8-lane vector pipeline doing 48 hashes in parallel, which needs at least 10 steps (if you convert three stages to multiply-adds and do parallel XORs for the other three). The problem with 10-cycle hashing is that you need to cram 96 scalar XORs alongside your vector pipeline, so that will use all 12 ALUs for 8 of those cycles, leaving you only 24 more scalar ops per hash cycle, which isn't enough for the 48 tree-value XORs.

So you must use at least 11 steps per hash, with 96 XORs (including the tree-value XOR) done in the scalar ALUs using 8 steps, giving 3*12 ALU ops per hash cycle. You need 12 more ops per hash to do odd/even, so it must be 12 stages: just do all of the hash ops in the vector ALU, with 4 cycles of 12 ALUs doing the modulo and 8 cycles x 12 ALUs free.

With 12 steps and 48 hashes in parallel, your absolute minimum could be 4096/48 x 12 = 1,024 cycles. Stage 10 can be optimized (you don't need the odd/even modulo cycle, and you can use some of those extra scalar cycles to pre-XOR the constant), which can save you ~10 cycles. 1,024 is gonna be real hard, but I can imagine shenanigans to get it down to 1,014; sub-1,000 is possible by throwing more XORs to the scalar ALUs.
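The arithmetic behind that floor, spelled out (same assumptions as above):

    # Back-of-envelope lower bound using the assumptions stated above:
    # 4096 hashes total, 48 hashes in flight at once, 12 steps per hash.
    TOTAL_HASHES = 4096
    PARALLEL_HASHES = 48
    STEPS_PER_HASH = 12

    iterations = TOTAL_HASHES / PARALLEL_HASHES    # ~85.3 batches of 48
    lower_bound = iterations * STEPS_PER_HASH      # 1024.0 cycles
    print(lower_bound)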
I performed a similar analysis to you and found it very difficult to imagine sub-1000. Your comment I think convinced me that it may be possible, though. Interesting.
I'm below the threshold for recruiting but not below Claude at the moment. Not sure where I am going wrong.
For the first several rounds (when every tree value is in use), combine the stage 5 XOR with the subsequent round's tree XORs. You can determine even/odd in hash stage 5 starting with a ^ (a>>16) without XORing in the constant, so you only need one XOR; this saves you a ton of XORs.
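To make the bit-twiddling concrete, a quick check in Python, assuming the final stage has the shape x ^ (x >> 16) ^ CONST (the constant below is a stand-in, not taken from the takehome):

    import random

    CONST = 0x9E3779B9  # stand-in constant, not the repo's

    for _ in range(10_000):
        x = random.getrandbits(32)
        y = (x ^ (x >> 16) ^ CONST) & 0xFFFFFFFF
        # Odd/even of the result depends only on bits 0 and 16 of x
        # (plus the constant's fixed low bit).
        predicted = (x & 1) ^ ((x >> 16) & 1) ^ (CONST & 1)
        assert (y & 1) == predicted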
Create separate instruction bundles for the first round, rounds 1-5 (combining hash stage 5's XOR with the next round's tree XORs), rounds 6-9 (not every tree node is used anymore), round 10, rounds 11-14, and round 15, and combine them.
you can use add_imm in parallel to load consts. In stage 0 you have to load the tree first and then the vals; by later stages, when everything is in scratch, you could use 12 scalar XORs and 6 vector XORs on scratch. Once you vload the vals you can start to do XORs, but you can only advance so much at a time, so I'm starting to work on moving hash stages to different rounds faster to hide the initial vloads, get to the heavy-load section sooner, and spread the load pain.
BROADCAST LOAD SCHEDULE
======================================================================
Round | Unique | Load Strategy
------|--------|------------------------------------------
0 | 1 | 1 broadcast → all 256 items
1 | 2 | 2 broadcasts → groups
2 | 4 | 4 broadcasts → groups
3 | 8 | 8 broadcasts → groups
4 | 16 | 16 broadcasts → groups
5 | 32 | 32 broadcasts → groups
6 | 63 | 63 loads (sparse, use indirection)
7 | 108 | 108 loads (sparse, use indirection)
8 | 159 | 159 loads (sparse, use indirection)
9 | 191 | 191 loads (sparse, use indirection)
10 | 224 | 224 loads (sparse, use indirection)
11 | 1 | 1 broadcast → all 256 items
12 | 2 | 2 broadcasts → groups
13 | 4 | 4 broadcasts → groups
14 | 8 | 8 broadcasts → groups
15 | 16 | 16 broadcasts → groups
Total loads with grouping: 839
Total loads naive: 4096
Load reduction: 4.9x
I know real people do sometimes use it, but it's a smell.
Oops I mean, you're absolutely right, those ARE hallmark signs of an LLM. Let me breakdown why this isn't just your imagination but actually...
The README only gives numbers without any information on what you’re supposed to do or how you are rated.
being cryptic and poorly specified is part of the assignment
just like real code
in fact, it's _still_ better documented and self-contained than most of the problems you'd usually encounter in the wild. Pulling on a thread to end up with a clear picture of what needs to be accomplished is like 90% of the job very often.
Basically it's a long enough problem that I'd be annoyed at being asked to do it at home for free, if what I wanted from that was a shot at an interview. If I had time on my hands though, it's something I could see trying for fun.
I suspect it would take me another hour to get it implemented. Leaving 30 minutes to figure out something clever?
Idk maybe I'm slow or really not qualified.
So yeah. They _could_ have written it much more clearly in the readme.
With a live interview, you get past a phone screening, and now the company is investing significant resources in the day or so of engineering time it takes to have people interview you. They won't do that unless they have a serious level of interest in you. The take-home means no investment for the company so there's a huge imbalance.
There's another thread about this article, which explains an analogous situation about being asked to read AI slop: https://zanlib.dev/blog/reliable-signals-of-honest-intent/
IMO the assignment('s purpose) could be improved by making the code significantly worse. Then you're testing the important stuff (dealing with ambiguity) that the AI can't do so well. Probably the reason they didn't do that is because it would make evaluation harder + more costly.
When I pointed out this contradiction via email, they ignored me completely and instead silently patched the README to retroactively enforce the rule.
It’s not just a bad test; it’s a massive red flag for their engineering culture. They wasted candidates' time on a "guess the hidden artificial constraint" game rather than evaluating real optimization skills.
They want to see how you handle low level optimizations, not get tripped over some question semantics.
I didn't simply "skip" the problem. I implemented a compiler that solves the problem entirely at build time, resulting in O(0) runtime execution.
Here is the actual "Theorem" I implemented in my solution. If a test penalizes this approach because it "goes against the spirit," then the test is fundamentally testing for inefficiency.
""" Theorem 1 (Null Execution): Let P: M → M be a program with postcondition φ(M). If ∃M' s.t. φ(M') ∧ M ≅ M', then T(P) = 0.
Complexity: O(n) compile-time, O(0) runtime """
If they wanted to test runtime loop optimizations, they should have made the inputs dynamic.
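The spirit of it, as a sketch (the tuple "instruction" format here is hypothetical, not the repo's actual API):

    def build_constant_folded_kernel(reference_kernel, fixed_inputs, out_base=0):
        # The inputs are static, so run the reference implementation once in
        # ordinary Python at build time...
        expected = reference_kernel(fixed_inputs)
        # ...then emit a "kernel" that does nothing but write the answers.
        return [("store", out_base + i, value) for i, value in enumerate(expected)]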
I understand that this test is intended to somehow test raw brainpower, the ability to tackle an unfamiliar and complicated domain, and to work under stress. But I hope it's not representative of the actual working conditions at Anthropic. It's like asking a candidate to play a Quake deathmatch when hiring for a special forces assault squad.
This is a valid way to solve the problem.
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
That doesn’t seem snarky to me. They said if you beat Opus, not their best solution. Removing “perhaps” (i.e. MAYBE) would be worse since that assumes everyone wants to interview at Anthropic. I guess they could have been friendlier: “if you beat X, we’d love to chat!”
"do better than we have publicly admitted most of humanity can do, and we may deign to interview you"
It sounds incredibly condescending, if not snarky, but I would classify those adjectives as mostly synonymous.
There's more to employees than their raw ability to go below some performance threshold. If somebody passes the test, but lives in an US sanctioned country with no plans to move, is well known for using the n-word on social media or has previously broken an NDA, Anthropic probably doesn't want to interview them.
Hiring and interviewing is in a weird place right now. We’re coming off of a period where tech jobs were easy to get and companies were competing for candidates. A lot of candidates quickly got used to the idea of companies working hard to charm and almost beg them to join. When those candidates encounter what it’s like to apply for highly competitive companies who have 1000x more applicants than they’d ever consider, the resulting straightforwardness can be shocking.
>If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
Not condescending
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code so we can schedule an interview.
> do better than we have publicly admitted most of humanity can do, and we may deign to interview you
If you tell someone who just passed a test that 99.999% of humanity cannot pass that they _may_ get an interview, you are being snarky/condescending.
You may want to consider the distribution and quantity of replies before stating that you WILL do something that might just waste more people’s time or not be practical.
The classy thing to do would be responding to every qualifying submission, even if it’s just to thank everyone and let some people know the field was very competitive if an interview won’t be happening.
Does that change the fact that they are condescending?
(yes, yes, not every human will try this test)
Do you think if the applicants are really in that level of demand that they would be getting a take home test instead of being actively recruited?
Legitimately lay out your understanding of a world where an employer chasing after in-demand employees gives them a test that is expected to take hours and hedges its wording, instead of saying "we will absolutely hire you if you pass X bar"?
If three people send them improvements, they'll probably get interviews. If three thousand do, the problem is easier than they thought or amenable to an LLM or one bright person figured out a trick and shared it with all his classmates or colleagues or all of GitHub.
If the models get a good feedback loop + easy (cheap) verification, they get to bang their tokens against the wall until they find a better solution.
Like optimizing for people who assume the start indices always will be zero. I am close to 100% sure that's required to get below 2096 total loads but it's just not fun
If it however had some kind of dynamic vector lane rotate that could have been way more interesting
Asked to generate drawio for the winner so I can grok it more easily, then I gave feedback.
Edit: 1121 cycles
Is this saying that Claude matched the best human performance, where the human had two hours? I think that is the correct reading, but I'm not certain they don't mean that Claude had two hours and matched the best human performance where the human had an arbitrary amount of time. The former is impressive, but the latter would be even more so.
Was the screening format here that this problem was sent out, and candidates had to reply with a solution within 2 hours?
Or, are they just saying that the latest frontier coding models do better in 2 hours than human candidates have done in the past in multiple days?
Does this confirm they actually do knee cap models after the launch period to save money, without telling users?
Something comes across really badly here for me. Some weird mix of bragging, mocking, with a hint of aloof.
I feel these top end companies like the smell of their own farts and would be an insufferable place to work. This does nothing but reinforce it for some reason.
Rant: On a similar note, I recently saw a post on Linkedin from Mistral, where they were bragging to recruit candidates from very specific schools. That sounded very pretentious (and also an HR mistake on several levels IMHO).
The current e-mail invitation in the README is just another avenue for exceptional people to apply. If someone is already highly qualified from their background and resume they can go through the front door (direct application). For those who have incredible talent but not necessarily the background or resume to unlock the front door yet, this is a fun way to demonstrate it.
The machine is fake and simulated: https://github.com/anthropics/original_performance_takehome/...
But presumably similar principles apply.
This is the general framework for reasoning about correct memory addressing in the presence of arbitrary constraints like those of hardware.
Anyone worth working with respected that and I landed several clients who forwent the assignment altogether. It's chump change in the grand scheme of things, and often a formality.
Does help that I have a very public web presence and portfolio, though.
I couldn't care less about getting paid for a few hours, what's truly annoying when you're job hunting is the company having an extremely high rejection rate even at the take-home stage. That's an inordinate waste of time multiplied by a lot of companies.
If you have a >50% chance of rejecting, don't even give the candidate a take-home. Be at least 90% sure you want them before you get to that stage.
Being told "here do this arbitrary thing that will take 4 hours of your time and maybe we'll look at it, and then if we even bother to do that, maybe we'll respond" is different than an interview where both parties invest their time face-to-face.
Worth mentioning that demanding to be paid to apply for a company is usually equivalent to rejecting the job. Most companies are going to end the interview there. Few HR departments would allow one applicant to be paid for the same interview loop as other candidates.
I was helping out in a mentoring program during the ZIRP period when the idea of charging companies for take-home interviews started to become popular. I can’t think of anyone it actually worked for in that group. I’ve heard anecdotes online of some people doing it with success, but any company like Anthropic is just going to close your application and move on if you request to be paid for applying. They have a zillion other qualified candidates in line.
If someone is giving a take-home problem that looks like you’re actually doing work for the company, that’s a different story. This problem is not actually work, obviously.
Sending a company a surprise bill that they didn't agree upon is bad practice. Interviews are customarily not compensated, so it's unreasonable to surprise bill someone for it.
If you send a company a surprise bill for the interview, it's going to give the HR people a good laugh as they cross you off the candidates list. Everyone involved is going to forever remember you as the person who tried surprise billing for the interview and make a mental note to never interview you again at future companies.
It's not a good thing to try.
i guess that ensures you either hire the childless
or those with children who are willingly fine with not being present for that long (so they are probably gonna be job-obsessed enough)
or they are currently unemployed so they won't have an existing job as anchoring leverage
well played, anthropic
nobody i know ever spends 4hrs uninterrupted working remotely lolol
This assumes that the candidate has a lot of time for playing other games.
Did you apply for a position? Did they send you the assignment without prior discussion?
Would you prefer C or C++?
"2) AI companies are content with slop and do not even bother with clear problem statements."
It's a filter. If you don't get the problem, you'll waste their time.
"3) LOC and appearance matter, not goals or correctness."
The task was goal+correctness.
"4) Anthropic must be a horrible place to work at."
Depends on what you do. For this position it's probably one of the best companies to work at.
I think they also have open positions for stealing other people's code and DDoS-ing other people's websites.
> Unironically, yes. Unless I never plan to look at that code again
And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.
You're both right and wrong. You're right in the sense that the sort of creativity the task is looking for isn't really possible in two hours. That's something that takes a lot of time and effort over years to be able to do. You're wrong because that's exactly the point. Being able to solve the problem takes experience. Literally. It's having tackled these sorts of problems over and over in the past until you can draw on that understanding and knowledge reasonably quickly. The test is meant to filter out people who can't do it.
I also think it's possible to interpret the README as saying humans can't do better than the optimizations that Claude does when Claude spends two hours of compute time, regardless of how long the human takes. It's not clear though. Maybe Claude didn't write the README.
> I was reminding them that there is always someone smarter.
And even with this comment you literally do not understand that you have some skewed view of the world. Do you have some high school trauma?
I am not sure ad personam is appropriate here
> And even with this comment you literally do not understand that you have some skewed view of the world.
I’m well aware I don’t have a perfect view of reality and the map isn’t the territory. Do you?
It's a take-home test, which means some people will spend more than a couple of hours on it to get the answer really good. They would have gone after those people in particular.
Good. That should be the minimum requirement.
Not another Next.js web app take home project.