It doesn't matter really, what matters is our ability to stare into the void of what we don't know and start making progress.
Our ability to process and master new topics is part of the job.
I'm sure you've done that countless times.
I have to disagree and question what you mean by "optimization". It's very easy to write web code that technically accomplishes a task, but does so poorly. This is the natural consequence of having so many options available.
The vast majority of web devs with less than 5 years of experience simply don't understand plain javascript well enough. It's a longstanding problem that devs will reach for the most ergonomic tools, not the best tools.
Lacking sufficient experience, they can't help it. This happens in all programming languages and in all layers of software. AI slop is even worse because it tends towards the mean.
And the tools themselves are built by other engineers and they need new features, debugging, optimization etc. It is turtles all the way down.
But each layer has its own jargon, conventions, and unwritten hacks. That is where experience comes in. Once you get out of a rabbit hole or pothole, you are one step closer to becoming the “domain expert”. There is no shortcut.
they are never tested on it, and many won't dig that deep in the day-to-day. Whose fault is it that they don't know plain javascript well enough? That's the result of shipping "content" over any other metric of proper software engineering.
Funnily enough I did take a mini-course (not a week, but we're talking maybe 100 hours of work as a recreational online summer class) in plain javascript at my university. Quite the quirky language. But this was in ES3 or so, so maybe there's many more guard rails these days against the core jank that makes up JS
Isn't that mostly because, as you go up the abstraction layers, tools and docs to teach yourself the tricks of the trade quickly are in abundance (let alone for a popular layer like React)? Which in turn is likely a function of incentives and opportunities.
This was one of my gripes in college, why am I implementing something if I just need to understand what it does? I'm going to use the built-in version anyway.
And so you can write your own because you're probably going to want to sort data in a specific way. Sort doesn't mean in numerical increasing or decreasing order, it means whatever order you want. You're sorting far more often than you're calling the sort function.
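A quick Python sketch of what that looks like in practice (the record fields here are made up):

    # Hypothetical records; "sorted" means whatever order the task needs,
    # not just ascending numbers.
    orders = [
        {"customer": "b", "priority": 2, "total": 40.0},
        {"customer": "a", "priority": 1, "total": 99.0},
        {"customer": "a", "priority": 2, "total": 10.0},
    ]

    # Highest priority first, then biggest total, then customer name as tiebreak.
    orders.sort(key=lambda o: (-o["priority"], -o["total"], o["customer"]))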
It's almost wild to me that you never have.
Sometimes you need a better sort for just one task. Sometimes you need a parser because the data was never 100% standards compliant. Sometimes you need to reread Knuth for his line-breaking algorithm.
He was brought in by the state to do some coaching for existing software devs back in the 90s. When he was going over the various different basic algorithms (insertion sort, selection sort, etc.) one of the devs in the back of the class piped up with, "why are you wasting our time? C++ has qsort built in."
When you're processing millions of records, many of which are probably already sorted, using an insertion sort to put a few new records into a sorted list, or using selection sort to grab the few records you need to the front of the queue, is going to be an order of magnitude faster than just calling qsort every time.
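Roughly this, in Python (sizes and data made up, but the shape of the argument is the same):

    import bisect
    import random

    # A big list that is already sorted, plus a handful of new records.
    sorted_records = sorted(random.random() for _ in range(1_000_000))
    new_records = [random.random() for _ in range(5)]

    # Insertion-sort style: one binary search and one insert per new record,
    # instead of re-sorting all million-and-five records from scratch.
    for r in new_records:
        bisect.insort(sorted_records, r)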
Turned out he worked for department of revenue. So my teacher roasted him with "oh, so you're the reason it takes us so long to get our tax returns back."
Thinking that you can just scoot by using the built-in version is how we get to the horrible state of optimization that we're in. Software has gotten slow because devs have gotten lazy and don't bother to understand the basics of programming anymore. We should be running a machine shop, not trying to build a jet engine out of Lego.
funnily enough, this wasn't limited to contributing to some popular OS initiative. You can call YAGNI, but many companies do in fact have their own libraries to maintain internally. So it comes up more than you expect.
On a higher level, the time I took to implement a bunch of sorts helped me be able to read the docs for sort(), realize it's a quicksort implementation, and make judgements like
1. yeah, that works
2. this is overkill for my small dataset, I'll just whip up basic bubblesort
3. oh, there's multiple sort APIs and some sorts are in-place. I'll use this one (quick sketch of the difference after this list)
4. This is an important operation and I need a more robust sorting library. I'll explain it to the team with XYZ
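The point 3 distinction, spelled out in Python:

    data = [3, 1, 2]

    copy = sorted(data)   # allocates and returns a new sorted list; data is untouched
    data.sort()           # sorts the existing list in place; returns None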
The reasoning was the important lesson, not the ability to know what sorting is.
So you can pass job interviews, of course!
I'll take any interviews at this point in time.
But yes, every domain has its jargon. I work tangentially to this and quickly understood this as a GPGPU problem. A relatively elementary one if you studied this space, though a time limit of 2 hours seems overly restrictive if you aren't actively studying this stuff.
The task is to parallelize tree traversal, which is embarrassingly unparallel so it's tricky.
Is that really the case? My experience is fairly limited, but I've found that the LLM's willingness to fill in plausible sounding (but not necessarily at all accurate) numbers where it needs them to be a significant hindrance when asking it to think about performance.
However, when I hit "scratch_write" and it wasn't in the Machine class and it wasn't coming from some Decorator and it was getting defined and deleted by a member function ... I stopped. That's paying lip service to the variable typing that is scattered around and actively hampers even basic IDE usage. Probably the typing was added by AI/LLM after the fact, and it missed that unusual usage. The Python convention used to be that those kinds of variables got declared as "_scratch_write" with a leading underscore to flag that they were "private/internal".
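The pattern being complained about looks roughly like this (an illustrative sketch, not the actual code from the takehome):

    class Machine:
        def __init__(self) -> None:
            # The old convention: declare internal state up front, with a
            # leading underscore to flag it as private/internal.
            self._tmp_buffer: list[int] = []

        def run_pass(self) -> None:
            # The anti-pattern: an attribute that only exists for the duration
            # of one method call. No declaration, no annotation, invisible to
            # IDEs and type checkers.
            self.scratch_write = []
            self.scratch_write.append(42)
            del self.scratch_write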
That was the gigantic red "We write shitty code" signal or worse "We don't care about wasting your time" signal. Human review should have flagged that.
Shame. I was kinda looking forward to the technical problem, but I'm not going to spend a bunch of time using grep to untangle garbage code to get at it.
I suspect everything would actually be much clearer if you wrote it in SystemVerilog and tested with Cocotb. Let's see if their LLMs can handle that porting job. HAH!
A lot of people write Python code and then run "AI" on it to fill in the variable types. This, of course, is error prone and shitty. And the AI will miss strange usages like the one I flagged.
Although I am sorry for phrasing it as "variable typing". I can see how you might read that as "typing that varies" instead.
If you look at the top of perf_takehome.py then there is a brief comment saying the challenge is to optimize a kernel. Kernel in GPU land means a program that computes on data in parallel, it's not an OS kernel:
Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the
available time, as measured by test_kernel_cycles on a frozen separate copy
of the simulator.
However, this kernel doesn't run on an actual GPU. It runs on a little interpreter for a custom assembly language written in Python. Thus you will be optimizing the program built in-memory by the function on this line: https://github.com/anthropics/original_performance_takehome/...
This function is described only as:
Like reference_kernel2 but building actual instructions.
Scalar implementation using only scalar ALU and load/store.
The KernelBuilder class has some fields like "instrs" but we can't immediately see what they're meant to be because this is Python and types are optional. Nonetheless we can see that instructions are being added to a list, and below we can see the test_kernel_cycles function that runs the interpreter on the program. So our mission is to change the build_kernel function to make a better program. And it says this is an assembly version of the Python function reference_kernel2, which is found in problem.py.

What exactly is this kernel doing? The reference_kernel2 function doesn't explain itself either - it's some sort of parallel tree walk. Let's put that to one side for a second and explore the machine, which is defined in problem.py. The machine itself is also largely undocumented, but there's a brief description in a docstring on line 66.
At this point it helps to understand the design of exotic processors. The emulator is for a fictional CPU that uses a VLIW SIMD ISA. Normal programmers will never encounter such a chip. Intel tried to make such a machine decades ago and it never took off, since then the concept has been largely dead. I believe it's still used in some mobile DSPs like Qualcomm's Hexagon. Notably, NVIDIA PTX is not such an ISA so this seems to have been chosen just to make things harder. As the comment explains, in a VLIW machine multiple instructions are packed together into a "slot" and executed in parallel. In a normal CPU the hardware reads a serial stream of instructions and works out just in time which can be executed in parallel, using fancy out-of-order circuitry. In a VLIW machine that's done ahead of time by the compiler or (in this case) the humble programmer, you. But this isn't just a VLIW machine, it's also multi-core, and multi-"engine", so there are multiple levels of execution going on. And it's SIMD, meaning each instruction can itself operate on multiple bits of data simultaneously.
This machine doesn't have registers or cache but it does have "scratch space", and so you can use the vector instructions to load data into a series of 32 bit scratch words and then do things on them in parallel. And multiple vector instructions can also run in parallel. "Broadcasting a scalar" in SIMD-speak means taking a single value and repeating it over multiple scratch space slots (or register subwords in a real machine), so you take e.g. 0xFF and get 0xFFFFFFFFFFFFFFFF.
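In plain-Python terms the broadcast is just this (illustrative only, not the simulator's actual API):

    VLEN = 8                   # assumed vector width
    scalar = 0xFF
    lanes = [scalar] * VLEN    # broadcast: one value repeated across every lane

    # A vector instruction then conceptually applies to all lanes "at once".
    shifted = [lane << 8 for lane in lanes]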
And that's it, that's all we get. As the code says: "This comment is not meant to be full ISA documentation though, for the rest you should look through the simulator code". Possible point of confusion: real ISAs are serialized to bytes but this one is just Python tuples. The code is only partially typed; sometimes you're just left guessing.
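Concretely, "just Python tuples" means the program has roughly this shape (the opcode names and operand order here are made up, not the repo's real ISA):

    program = [
        ("load",  "scratch0", "mem", 0x100),            # scalar load into scratch
        ("vxor",  "scratch8", "scratch0", "scratch4"),  # SIMD op across a run of slots
        ("store", "mem", 0x200, "scratch8"),
    ]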
So to recap, the problem is to optimize an undocumented program expressed in undocumented data structures returned by a Python function whose result is interpreted by a partly documented Python class that simulates a fictional exotic CPU architecture using an abandoned design that gives a lot of parallel computational capacity, but which requires all parallelism to be statically declared ahead of time, whilst simultaneously reverse engineering the Python that does all this.
Does that help? Sounds like a fun exercise :)
Edit: I just checked and Google TPUs are much more VLIW like so perhaps this simulator is designed to match a TPU. I know Anthropic rely on TPUs for serving and have done some optimization for them.
Since the focus of the challenge appears(?) intended to be optimization, not reverse engineering, it's a bit odd that they don't give a clear statement of what the kernel is meant to be computing. Perhaps the challenge is intended to be a combination of the two, but then the correct reverse engineering part of it becomes a gate for the optimization part, else you'll be solving the wrong problem.
Given the focus on results achieved by Opus 4.5, maybe that's the main point - to show how well Opus can reverse engineer something like this. If they gave the actual clear problem statement, then maybe you could brute force an optimal solution using tree search.
"Can you "reverse engineer" what the kernel in this optimization exercise is actually doing - write a specification for it?
https://github.com/anthropics/original_performance_takehome"
Gemini says it's doing inference on a random forest - taking a batch of inputs, running each one through each decision tree, and for each input outputting the sum of these decision tree outputs - the accumulated evidence.
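If that read is right, the computation has roughly this shape (a sketch with made-up node types, not reverse-engineered from the repo):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        feature: int = 0
        threshold: float = 0.0
        left: Optional["Node"] = None
        right: Optional["Node"] = None
        value: float = 0.0          # output when the node is a leaf

        @property
        def is_leaf(self) -> bool:
            return self.left is None and self.right is None

    def predict_forest(forest: list[Node], inputs: list[list[float]]) -> list[float]:
        # For each input, walk every tree and sum the leaf outputs.
        results = []
        for x in inputs:
            total = 0.0
            for root in forest:
                node = root
                while not node.is_leaf:
                    node = node.left if x[node.feature] < node.threshold else node.right
                total += node.value
            results.append(total)
        return results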
It's doing some sort of binary tree traversal, but the hashing and wrap around looks weird - maybe just a made up task rather than any useful algorithm?
If you can't make sense of such a small codebase or don't immediately recognize the algorithm that's being used (I'm guilty of the latter) then you presumably aren't someone that they want to hire.
They then provide you with a very naive implementation that runs on their (very simple) VLIW architecture that you are to optimize.
If at the end of that someone is still lost, I think it is safe to say the intent was for that person to fail.
The problem is about pipelining memory loads and ALU operations, so why not just give clear documentation and state the task rather than "here's a kernel - optimize it"? ¯\_(ツ)_/¯
And perhaps a third purpose is to use the simulator to test your ability to reason about hardware that you are only just getting familiar with.
Maybe they specified the challenge in this half-assed way to deliberately test those sorts of skills (even if irrelevant to the job), or maybe it was just lazily put together.
The other thing to note is that if you look at what the reference_kernel() is actually doing, it really looks like a somewhat arbitrary synthetic task (hashes, wraparound), so any accurate task specification would really need to be a "line by line" description of the steps, at which point you may as well just say "here's some code - do this".
I think they do and his name is Claude ;)
this is what all specialized chips like TPU/Cerebras require today, and it allows for better optimization than a generic CPU since you can "waste" 30 min figuring out the perfect routing/sequencing of operations, instead of doing it in the CPU in nanoseconds/cycles
another benefit is you can throw away all the CPU out-of-order/branch prediction logic and put useful matrix multipliers in its place
I think I'd be able to make some progress optimizing this program in two hours but probably not much. I'm not a performance engineer but have designed exotic emulated CPU architectures before, so that helps a lot.
I gleaned about half of this comment in a few minutes of just skimming the code and reading the comments on the functions and classes. There's only 500 lines of code really (the rest is the benchmark framework).
On the whole I don't think I'd perform all that well on this task given a short time limit but it seems to me to be an extremely well designed task given the stated context. The reference kernel easily fits on a single screen and even the intrinsic version almost does. I think this task would do a good job filtering the people they don't want working for them (and it seems quite likely that I'm borderline or maybe worse by their metric).
From JAX to VLIW: Tracing a Computation Through the TPU Compiler Stack, https://patricktoulme.substack.com/p/from-jax-to-vliw-tracin...
Google’s Training Chips Revealed: TPUv2 and TPUv3, HotChips 2020, https://hc32.hotchips.org/assets/program/conference/day2/Hot...
Ten Lessons From Three Generations Shaped Google’s TPUv4i, ISCA 2021, https://gwern.net/doc/ai/scaling/hardware/2021-jouppi.pdf
The ISA in this Anthropic machine is actually both, VLIW and SIMD, and both are relevant to the problem.
> Sounds like a fun exercise :)
I'll be honest, that sounds like the opposite of fun since the worst parts of my job are touching the parts of a Python codebase that are untyped. The sad part is this work codebase isn't even that old, maybe a few years, and the developers definitely should have known better if they had anyone capable leading them. Alas, they're all gone now.

Harder than figuring out the instruction set for some exotic CPU are definitely the giant untyped dicts/lists common in data science code.
I think that's one of the intentional points. Being able to quickly understand what the provided source code is doing.
¹https://github.com/anthropics/original_performance_takehome/...
²https://github.com/anthropics/original_performance_takehome/...
Do you make a habit of not presuming even basic competence? You believe that Anthropic left the task running for hours, got a score back, and never bothered to examine the solution? Not even out of curiosity?
Also if it was cheating you'd expect the final score to be unbelievably low. Unless you also suppose that the LLM actively attempted to deceive the human reviewers by adding extra code to burn (approximately the correct number of) cycles.
How do you explain the specific score that was achieved if as you suggest the LLM simply copied the answer directly?
- Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator
It's not about you being average, just a different knowledge set.
But this is good. Staying humble makes you hungrier for learning.
For me, I've had that mentality for the longest time and I didn't get anything done because, well, "I'm just average".
For me, a little bit of arrogance (there's no way I couldn't do X, let's go do it), even if I end up "looking stupid" (see, I told you it was that hard!), was far more valuable to my development
Always room to learn in software :)
the hot take is, there are other games.
Yes, this applies to some simulated imaginary CPU with an artificial problem. Except that the job asked here is exactly the core of what a performance engineer will do at anthropic: optimize kernels for their fleet of GPUs. Is it simplified? Yes! (e.g. the simulator does not restrict memory access patterns)
This is a real-world problem adapted to a lab setting that can fit in one's head in a matter of hours. Leetcode would have you reimplement the hashmap used in there.
In every other field it's helpful to understand the basics. I don't think software is the exception here.
I see this directly in Gemini CLI as the harness detects loops and bails the reasoning. But I've also just occasionally seen it take 15m+ to do trivial stuff and I suspect that's a symptom of a similar issue.
Seems like capacity because it works a lot better late at night.
I don't see the same with the claude models in antigravity.
Sometimes Gemini tools will just randomly stop and pass the buck back to you. The last thing will be like "I will read the <blah> code to understand <blah>" and then it waits for another prompt. So I just type "continue" and it starts work again.
And, sometimes it will spit out the internal CoT directly instead of the text that's actually supposed to be user-visible. So sometimes I'll see a bunch of paragraphs starting with "Wait, " as it works stuff out and then at the end it says "I understand the issue" or whatever, then it waits for a prompt. I type "summarise" and it gives me the bit I actually wanted.
It feels like all these things are related and probably have to do with the higher-level orchestration of the product. Like I assume there are a whole bunch of models feeding data back and forth to each other to form the user-visible behaviour, and something is wrong at that level.
I suspect this is also something like the "inverse" of a prompt hijacking situation. Basically it's losing track of where its output is flowing to (whereas prompt injection is when it loses track of where its input is flowing from).
After ~40 minutes, it got to:
The final result is 2799 cycles, a 52x speedup over the baseline. I successfully implemented Register Residency, Loop Unrolling, and optimized Index Updates to achieve this, passing all correctness and baseline speedup tests. While I didn't beat the Opus benchmarks due to the complexity of Broadcast Optimization hazards, the performance gain is substantial.
It's impressive as I definitely won't be able to do what it did. I don't know most of the optimization techniques it listed there.
I think it's over. I can't compete with coding agents now. Fortunately I've saved enough to buy some 10 acre farm in Oregon and start learning to grow some veggies and raise chickens.
Maybe Claude will be able to do that soon, too.
you can't compete with an AI on doing an AI performance benchmark?
That would be impressive.
Each ran the same spec headlessly in their native harness (one shot).
Results:
Agent                         Cycles  Time
─────────────────────────────────────────────
gpt-5-2                        2,124  16m
claude-opus-4-5-20251101       4,973  1h 2m
gpt-5-1-codex-max-xhigh        5,402  34m
gpt-5-codex                    5,486  7m
gpt-5-1-codex                 12,453  8m
gpt-5-2-codex                 12,905  6m
gpt-5-1-codex-mini            17,480  7m
claude-sonnet-4-5-20250929    21,054  10m
claude-haiku-4-5-20251001    147,734  9m
gemini-3-pro-preview         147,734  3m
gpt-5-2-codex-xhigh          147,734  25m
gpt-5-2-xhigh                147,734  34m
Clearly none beat Anthropic's target, but gpt-5-2 did slightly better in much less time than "Claude Opus 4 after many hours in the test-time compute harness".

The performance killer is the "random" access reads of the tree node data which the scalar implementation hides, together with the lack of load bandwidth, and to tackle that you'd have to rewrite the kernel to optimize the tree data loading and processing.
This is an interesting way to recruit. Much better than standard 2 leetcode medium/hard questions in 45 mins.
Then again, this may just be a way to get free ideas at optimising their product from outside the box.
Why are we still interviewing like its 1999?
I try give positive feedback for candidates who didn't know the problem but could make good use of hints, or had the right approach. But unfortunately, it's difficult to pass a Leetcode interview if you haven't seen a similar problem to what is asked before. Most candidates I interview nowadays seem to know all questions.
That's what the company has decided so we have to go along. The positive side is that if you do your part, you have good chances of being hired, even if you disagree with the process.
This type of individual is more likely to follow orders and work hard - and most importantly - be like the other employees you hired.
Now it means the company is stupid.
It's true that being ready for leetcode takes practice, but at least it's standard so you can re-use the skills to other interviews. Optimizing some generated code is certainly fun, but it's as useless as leetcode for your average programmer.
> I find it unreasonable to ask a candidate to spend that much time
And same for some reason does not apply to leetcode style interviews?
> It would take something like one week full time to work on this
I am not sure if this is satire or what? You need months of continuous preparation to be ready for the leetcode style interview.
> Optimizing some generated code is certainly fun, but it's as useless as leetcode for your average programmer.
No, it is not. This is specifically the type of job you would be doing tomorrow at Anthropic team if hired. And they are specifically hiring people who are already good enough at that very task. The same cannot be said for the leetcode, not even remotely comparable.
[1] https://en.wikipedia.org/wiki/Demoscene [2] https://en.wikipedia.org/wiki/Code_golf
It even uses Chrome tracing tools for profiling, which is pretty cool: https://github.com/anthropics/original_performance_takehome/...
But to be honest, I wonder what algorithm they implement. I have read the code for 2 minutes, and it sounds like random forest prediction. Does anyone know what the code does?
As a take home assignment though I would have failed as I would have probably taken 2 hours to just sketch out ideas and more on my tablet while reading the code before even changing it.
"before Claude Opus 4.5 started doing better than humans given only 2 hours"
"Claude Opus 4.5 in a casual Claude Code session, approximately matching the best human performance in 2 hours"
"Claude Opus 4.5 after 2 hours in our test-time compute harness"
"Claude Sonnet 4.5 after many more than 2 hours of test-time compute"
So that does make one wonder where this comes from. Could just be LLM generated with a talking point of "2 hours", models can fall in love with that kind of stuff. "after many more than 2 hours" is a bit of a tell.
Would be quite curious to know though. How I usually design take home assignments is:
1. Candidate has several _days_ to complete (usually around a week).
2. I design the task to only _take_ 2-4 hours, informing the candidate about that, but that doesn't mean they can't take longer. The subsequent interview usually reveals if they went overboard or struggled more than expected.
But I can easily picture some places sending a candidate the assignment and asking them to hand in their work within two hours. Similar to good old coding competitions.
I think I'm going to get sub-900 since I just realized I can compute in parallel whether stage 5 of the hash is odd just by looking at bits 16 and 0 of stage 4, with less delay...
Let me put down my thought process: you have to start by designing a 6-slot x 8-lane vector pipeline doing 48 hashes in parallel, which needs at least 10 steps (if you convert three stages to multiply-adds and do parallel XORs for the other three). The problem with 10-cycle hashing is that you need to cram 96 scalar XORs alongside your vector pipeline, so that will use all 12 ALUs for 8 of those cycles, leaving you only 24 more scalar ops per hash cycle, which isn't enough for the 48 tree-value XORs.

So you must use at least 11 steps per hash, with 96 XORs (including the tree-value XOR) done in the scalar ALUs using 8 steps, giving 3*12 ALU ops per hash cycle. You need 12 more ops per hash to do odd/even, so it must be 12 stages: just do all of the hash ops in the vector ALU, with 4 cycles of 12 ALUs doing the modulo and 8 cycles x 12 ALUs free.

With 12 steps and 48 hashes in parallel, your absolute minimum could be 4096/48 x 12 = 1,024 cycles. Stage 10 can be optimized (you don't need the odd/even modulo cycle, and you can use some of those extra scalar cycles to pre-XOR the constant), which can save you ~10 cycles. 1,024 is gonna be real hard, but I can imagine shenanigans to get it down to 1,014; sub-1,000 is possible by throwing more XORs to the scalar ALUs.
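The arithmetic behind that floor, spelled out (same assumptions as above):

    # Back-of-envelope lower bound using the assumptions stated above:
    # 4096 hashes total, 48 hashes in flight at once, 12 steps per hash.
    TOTAL_HASHES = 4096
    PARALLEL_HASHES = 48
    STEPS_PER_HASH = 12

    iterations = TOTAL_HASHES / PARALLEL_HASHES    # ~85.3 batches of 48
    lower_bound = iterations * STEPS_PER_HASH      # 1024.0 cycles
    print(lower_bound)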
I performed a similar analysis to you and found it very difficult to imagine sub-1000. Your comment I think convinced me that it may be possible, though. Interesting.
I'm below the threshold for recruiting but not below Claude at the moment. Not sure where I am going wrong.
For the first several rounds (when every tree value is in use), combine the stage 5 XOR with the subsequent round's tree XORs. You can determine even/odd in hash stage 5 starting with a ^ (a>>16) without XORing in the constant, so you only need one XOR; this saves you a ton of XORs.
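To make the bit-twiddling concrete, a quick check in Python, assuming the final stage has the shape x ^ (x >> 16) ^ CONST (the constant below is a stand-in, not taken from the takehome):

    import random

    CONST = 0x9E3779B9  # stand-in constant, not the repo's

    for _ in range(10_000):
        x = random.getrandbits(32)
        y = (x ^ (x >> 16) ^ CONST) & 0xFFFFFFFF
        # Odd/even of the result depends only on bits 0 and 16 of x
        # (plus the constant's fixed low bit).
        predicted = (x & 1) ^ ((x >> 16) & 1) ^ (CONST & 1)
        assert (y & 1) == predicted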
Create separate instruction bundles for the first round, rounds 1-5 (combining hash stage 5's XOR with the next round's tree XORs), rounds 6-9 (not every tree node is used anymore), round 10, rounds 11-14, and round 15, and combine them.
you can use add_imm in parallel to load consts. In stage 0 you have to load the tree first and then the vals; by later stages, when everything is in scratch, you could use 12 scalar XORs and 6 vector XORs on scratch. Once you vload the vals you can start to do XORs, but you can only advance so much at a time, so I'm starting to work on moving hash stages to different rounds faster to hide the initial vloads, get to the heavy-load section sooner, and spread the load pain.
BROADCAST LOAD SCHEDULE
======================================================================
Round | Unique | Load Strategy
------|--------|------------------------------------------
0 | 1 | 1 broadcast → all 256 items
1 | 2 | 2 broadcasts → groups
2 | 4 | 4 broadcasts → groups
3 | 8 | 8 broadcasts → groups
4 | 16 | 16 broadcasts → groups
5 | 32 | 32 broadcasts → groups
6 | 63 | 63 loads (sparse, use indirection)
7 | 108 | 108 loads (sparse, use indirection)
8 | 159 | 159 loads (sparse, use indirection)
9 | 191 | 191 loads (sparse, use indirection)
10 | 224 | 224 loads (sparse, use indirection)
11 | 1 | 1 broadcast → all 256 items
12 | 2 | 2 broadcasts → groups
13 | 4 | 4 broadcasts → groups
14 | 8 | 8 broadcasts → groups
15 | 16 | 16 broadcasts → groups
Total loads with grouping: 839
Total loads naive: 4096
Load reduction: 4.9x
I know real people do sometimes use it, but it's a smell.
Oops I mean, you're absolutely right, those ARE hallmark signs of an LLM. Let me breakdown why this isn't just your imagination but actually...
The README only gives numbers without any information on what you’re supposed to do or how you are rated.
being cryptic and poorly specified is part of the assignment
just like real code
in fact, it's _still_ better documented and self-contained than most of the problems you'd usually encounter in the wild. Pulling on a thread to end up with a clear picture of what needs to be accomplished is like 90% of the job very often.
Basically it's a long enough problem that I'd be annoyed at being asked to do it at home for free, if what I wanted from that was a shot at an interview. If I had time on my hands though, it's something I could see trying for fun.
I suspect it would take me another hour to get it implemented. Leaving 30 minutes to figure out something clever?
Idk maybe I'm slow or really not qualified.
So yeah. They _could_ have written it much more clearly in the readme.
With a live interview, you get past a phone screening, and now the company is investing significant resources in the day or so of engineering time it takes to have people interview you. They won't do that unless they have a serious level of interest in you. The take-home means no investment for the company so there's a huge imbalance.
There's another thread about this article, which explains an analogous situation about being asked to read AI slop: https://zanlib.dev/blog/reliable-signals-of-honest-intent/
IMO the assignment('s purpose) could be improved by making the code significantly worse. Then you're testing the important stuff (dealing with ambiguity) that the AI can't do so well. Probably the reason they didn't do that is because it would make evaluation harder + more costly.
When I pointed out this contradiction via email, they ignored me completely and instead silently patched the README to retroactively enforce the rule.
It’s not just a bad test; it’s a massive red flag for their engineering culture. They wasted candidates' time on a "guess the hidden artificial constraint" game rather than evaluating real optimization skills.
They want to see how you handle low level optimizations, not get tripped over some question semantics.
I didn't simply "skip" the problem. I implemented a compiler that solves the problem entirely at build time, resulting in O(0) runtime execution.
Here is the actual "Theorem" I implemented in my solution. If a test penalizes this approach because it "goes against the spirit," then the test is fundamentally testing for inefficiency.
""" Theorem 1 (Null Execution): Let P: M → M be a program with postcondition φ(M). If ∃M' s.t. φ(M') ∧ M ≅ M', then T(P) = 0.
Complexity: O(n) compile-time, O(0) runtime """
If they wanted to test runtime loop optimizations, they should have made the inputs dynamic.
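The spirit of it, as a sketch (the tuple "instruction" format here is hypothetical, not the repo's actual API):

    def build_constant_folded_kernel(reference_kernel, fixed_inputs, out_base=0):
        # The inputs are static, so run the reference implementation once in
        # ordinary Python at build time...
        expected = reference_kernel(fixed_inputs)
        # ...then emit a "kernel" that does nothing but write the answers.
        return [("store", out_base + i, value) for i, value in enumerate(expected)]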
I understand that this test is intended to somehow test raw brainpower, the ability to tackle an unfamiliar and complicated domain, and to work under stress. But I hope it's not representative of the actual working conditions at Anthropic. It's like asking a candidate to play a Quake deathmatch when hiring for a special forces assault squad.
This is a valid way to solve the problem.
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
That doesn’t seem snarky to me. They said if you beat Opus, not their best solution. Removing “perhaps” (i.e. MAYBE) would be worse since that assumes everyone wants to interview at Anthropic. I guess they could have been friendlier: “if you beat X, we’d love to chat!”
"do better than we have publicly admitted most of humanity can do, and we may deign to interview you"
It sounds incredibly condescending, if not snarky, but I would classify those adjectives as mostly synonymous.
There's more to employees than their raw ability to go below some performance threshold. If somebody passes the test, but lives in an US sanctioned country with no plans to move, is well known for using the n-word on social media or has previously broken an NDA, Anthropic probably doesn't want to interview them.
Hiring and interviewing is in a weird place right now. We’re coming off of a period where tech jobs were easy to get and companies were competing for candidates. A lot of candidates quickly got used to the idea of companies working hard to charm and almost beg them to join. When those candidates encounter what it’s like to apply for highly competitive companies who have 1000x more applicants than they’d ever consider, the resulting straightforwardness can be shocking.
>If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code (and ideally a resume) so we can be appropriately impressed and perhaps discuss interviewing.
Not condescending
> If you optimize below 1487 cycles, beating Claude Opus 4.5's best performance at launch, email us at performance-recruiting@anthropic.com with your code so we can schedule an interview.
> do better than we have publicly admitted most of humanity can do, and we may deign to interview you
If you tell someone who just passed a test that 99.999% of humanity cannot pass that they _may_ get an interview, you are being snarky/condescending.
You may want to consider the distribution and quantity of replies before stating that you WILL do something that might just waste more people’s time or not be practical.
The classy thing to do would be responding to every qualifying submission, even if it’s just to thank everyone and let some people know the field was very competitive if an interview won’t be happening.
Does that change the fact that they are condescending?
(yes, yes, not every human will try this test)
Do you think if the applicants are really in that level of demand that they would be getting a take home test instead of being actively recruited?
Legitimately lay out your understanding of a world where an employer chasing after in-demand employees gives them a test that is expected to take hours and hedges its wording, instead of saying "we will absolutely hire you if you pass X bar"?
If three people send them improvements, they'll probably get interviews. If three thousand do, the problem is easier than they thought or amenable to an LLM or one bright person figured out a trick and shared it with all his classmates or colleagues or all of GitHub.
If the models get a good feedback loop + easy (cheap) verification, they get to bang their tokens against the wall until they find a better solution.
Like optimizing for people who assume the start indices always will be zero. I am close to 100% sure that's required to get below 2096 total loads but it's just not fun
If it however had some kind of dynamic vector lane rotate that could have been way more interesting
Asked to generate drawio for the winner so I can grok it more easily, then I gave feedback.
Edit: 1121 cycles
Is this saying that Claude matched the best human performance, where the human had two hours? I think that is the correct reading, but I'm not certain they don't mean that Claude had two hours and matched the best human performance where the human had an arbitrary amount of time. The former is impressive, but the latter would be even more so.
Was the screening format here that this problem was sent out, and candidates had to reply with a solution within 2 hours?
Or, are they just saying that the latest frontier coding models do better in 2 hours than human candidates have done in the past in multiple days?
Does this confirm they actually do knee cap models after the launch period to save money, without telling users?
Something comes across really badly here for me. Some weird mix of bragging, mocking, with a hint of aloof.
I feel these top end companies like the smell of their own farts and would be an insufferable place to work. This does nothing but reinforce it for some reason.
Rant: On a similar note, I recently saw a post on Linkedin from Mistral, where they were bragging to recruit candidates from very specific schools. That sounded very pretentious (and also an HR mistake on several levels IMHO).
The current e-mail invitation in the README is just another avenue for exceptional people to apply. If someone is already highly qualified from their background and resume they can go through the front door (direct application). For those who have incredible talent but not necessarily the background or resume to unlock the front door yet, this is a fun way to demonstrate it.
The machine is fake and simulated: https://github.com/anthropics/original_performance_takehome/...
But presumably similar principles apply.
This is the general framework for reasoning about correct memory addressing in the presence of arbitrary constraints like those of hardware.
Anyone worth working with respected that and I landed several clients who forwent the assignment altogether. It's chump change in the grand scheme of things, and often a formality.
Does help that I have a very public web presence and portfolio, though.
I couldn't care less about getting paid for a few hours, what's truly annoying when you're job hunting is the company having an extremely high rejection rate even at the take-home stage. That's an inordinate waste of time multiplied by a lot of companies.
If you have a >50% chance of rejecting, don't even give the candidate a take-home. Be at least 90% sure you want them before you get to that stage.
Being told "here do this arbitrary thing that will take 4 hours of your time and maybe we'll look at it, and then if we even bother to do that, maybe we'll respond" is different than an interview where both parties invest their time face-to-face.
Worth mentioning that demanding to be paid to apply for a company is usually equivalent to rejecting the job. Most companies are going to end the interview there. Few HR departments would allow one applicant to be paid for the same interview loop as other candidates.
I was helping out in a mentoring program during the ZIRP period when the idea of charging companies for take-home interviews started to become popular. I can’t think of anyone it actually worked for in that group. I’ve heard anecdotes online of some people doing it with success, but any company like Anthropic is just going to close your application and move on if you request to be paid for applying. They have a zillion other qualified candidates in line.
If someone is giving a take-home problem that looks like you’re actually doing work for the company, that’s a different story. This problem is not actually work, obviously.
Sending a company a surprise bill that they didn't agree upon is bad practice. Interviews are customarily not compensated, so it's unreasonable to surprise bill someone for it.
If you send a company a surprise bill for the interview, it's going to give the HR people a good laugh as they cross you off the candidates list. Everyone involved is going to forever remember you as the person who tried surprise billing for the interview and make a mental note to never interview you again at future companies.
It's not a good thing to try.
i guess that ensures you either hire the childless
or those with children who are willingly fine with not being present for that long (so they are probably gonna be job-obsessed enough)
or they are currently unemployed so they won't have an existing job as anchoring leverage
well played, anthropic
nobody i know ever spends 4hrs uninterrupted working remotely lolol
This assumes that the candidate has a lot of time for playing other games.
Did you apply for a position? Did they send you the assignment without prior discussion?
Would you prefer C or C++?
"2) AI companies are content with slop and do not even bother with clear problem statements."
It's a filter. If you don't get the problem, you'll waste their time.
"3) LOC and appearance matter, not goals or correctness."
The task was goal+correctness.
"4) Anthropic must be a horrible place to work at."
Depends on what you do. For this position it's probably one of the best companies to work at.
I think they also have open positions for stealing other people's code and DDoS-ing other people's websites.
> Unironically, yes. Unless I never plan to look at that code again
And before some smart aleck says you can be creative on these types of optimization problems: not in two hours, it’s far too risky vs regurgitating some standard set of tried and true algos.
You're both right and wrong. You're right in the sense that the sort of creativity the task is looking for isn't really possible in two hours. That's something that takes a lot of time and effort over years to be able to do. You're wrong because that's exactly the point. Being able to solve the problem takes experience. Literally. It's having tackled these sorts of problems over and over in the past until you can draw on that understanding and knowledge reasonably quickly. The test is meant to filter out people who can't do it.
I also think it's possible to interpret the README as saying humans can't do better than the optimizations that Claude does when Claude spends two hours of compute time, regardless of how long the human takes. It's not clear though. Maybe Claude didn't write the README.
> I was reminding them that there is always someone smarter.
And even with this comment you literally do not understand that you have some skewed view of the world. Do you have some high school trauma?
I am not sure ad personam is appropriate here
> And even with this comment you literally do not understand that you have some skewed view of the world.
I’m well aware I don’t have a perfect view of reality and the map isn’t the territory. Do you?
It's a take-home test, which means some people will spend more than a couple of hours on it to get the answer really good. They would have gone after those people in particular.
Good. That should be the minimum requirement.
Not another Next.js web app take home project.