An alarming number of people don't understand that LLMs work via purely stochastic processes, so I'm happy to see in-depth pieces like this. I'm looking for a job and maybe this is why it's so hard to get a callback these days: resumes are just dumped in some LLM black hole and no one really knows how it works. The author says:
> temperature 0.1 — low, supposedly nudging the model toward deterministic outputs
This is not correct (and is briefly touched on later in the piece when he sets temperature to 0), temperature is not some kind of "deterministic" switch, but rather it affects the sampling distribution (which becomes more "spiky"—but is still very much a distribution).
miki123211 1 day ago
In theory, temperature 0 does make the LLM deterministic.
Well, in theory theory, temperature 0 doesn't really exist. Mathematically, as lim temperature->0, the distribution gets spikier and spikier, the most likely sample goes to almost-but-not-quite infinity and the rest go to almost-but-not-quite 0. In practice, temperature=0 is literally a separate branch of an if statement that just picks the most common sample (using the actual formula that works for non-zero values would cause a zero division).
However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs.
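A minimal sketch of the special-casing described above, assuming a plain list of logits and Python's standard `random` module. The function name `sample_token` and the lowest-index tie-break are illustrative choices, not taken from any particular inference engine.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Return the index of the next token for a list of logits."""
    if temperature == 0:
        # Separate branch: the general formula divides by the temperature,
        # so T == 0 is handled by taking the argmax directly.
        # Ties resolve to the lowest index (an assumption of this sketch).
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    # Still a draw from a distribution, however spiky it is.
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
print(sample_token([1.0, 3.0, 2.0], 0, rng))    # greedy pick
print(sample_token([1.0, 3.0, 2.0], 0.1, rng))  # sampled pick
```

This only covers the sampler. If the logits themselves differ from run to run, both branches can return different tokens.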
sigmoid10 1 day ago
>in theory theory, temperature 0 doesn't really exist.
It does exist very much, even if you go to pure math. Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice it is easier to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients. But setting T to zero, in both theory and practice, turns the usual probability function into greedy sampling.
317070 23 hours ago
> Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function.
In pure math, it does not always do that. It becomes a dirac-delta comb with equal weight on every maximum. There can be more than 1 maximum. Setting the temperature to zero turns into greedy sampling, but greedy sampling is not necessarily deterministic as you can have multiple equally optimal options.
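For reference, the limit being argued over can be written out. This is a sketch in notation introduced here: $z_1,\dots,z_n$ are the logits and $M$ is the set of indices attaining the maximum.

```latex
p_i(T) = \frac{\exp(z_i / T)}{\sum_{j=1}^{n} \exp(z_j / T)},
\qquad
\lim_{T \to 0^{+}} p_i(T) =
\begin{cases}
  \dfrac{1}{|M|} & \text{if } i \in M, \\[4pt]
  0 & \text{otherwise,}
\end{cases}
\qquad
M = \{\, i : z_i = \max_j z_j \,\}.
```

With a unique maximum, $|M| = 1$ and the limit is greedy argmax. With ties, the limit is uniform over the tied entries, so for logits $(1, 0, 1)$ it is $(1/2, 0, 1/2)$.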
sigmoid10 22 hours ago
That is not a problem for LLMs, because in practice floating point inaccuracies (in particular after exponentiation) prevent values from being exactly equal. That's why greedy sampling generally produces deterministic output for LLMs. The real gotchas are elsewhere (like with batch inference as we've seen with earlier GPTs). But unlike what the earlier comment says, this is a non-issue mathematically.
miki123211 3 hours ago
There's a difference between f(0) and lim t->0 f(t).
We just chose to define this function piecewise (as a "staircase function"): f(0) = lim t->0 f(t), and the general formula for f(t != 0).
thaumasiotes 23 hours ago
> It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice it is easier to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients.
I don't understand the distinction you're drawing. A Dirac delta function is a "simple if check".
sigmoid10 22 hours ago
The point is that the case T=0 doesn't just "exist" as a special code branch - it is still well defined mathematically without any change to the output function. What the above comment refers to with the extra "if" check is just a limitation of computers not liking to divide anything by zero, even if the actual function exists and is well behaved at zero. It is not some weird or special theoretical construction.
teiferer 21 hours ago
> Mathematically, as lim temperature->0, the distribution gets spikier and spikier, the most likely sample goes to almost-but-not-quite infinity and the rest go to almost-but-not-quite 0.
That's not how limits work. As the temperature goes to 0, the rest goes to 0. That's it. The "almost-but-not-quite" is part of the "goes to".
Let's say f(x) = 3x+1. It's a continuous function. If we let x go to 10, f(x) goes to 31. Not "almost-but-not-quite 31". No, to 31. (If you don't have a continuous function then it's the same argument, but less intuitive to illustrate.)
sobellian 15 hours ago
Even if it's deterministic that doesn't mean it isn't arbitrary. I can achieve determinism at any temperature by saving the seed. But that wouldn't make rejects feel much better knowing that if a bit was flipped in an arbitrary seed they would be scored differently.
msdz 21 hours ago
> However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs.
Exactly. While I’m assuming this won’t be news for most here, for those that are still new and/or curious about some more explanation on e.g. the floating-point imprecisions, see this nice article: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
johnsmith1840 13 hours ago
I ran large-scale tests at temp 0 and there was still randomness with identical prompt inputs.
I did this with several model APIs.
From what I read, GPU processing is not going to be the same from run to run, and the AI backend also does a lot of fancy batching, which adds another layer of randomness.
pmarreck 16 hours ago
It is not deterministic because the order of computations in a typical multithreaded system is not deterministic and also because when combined with the devil that is IEEE754, it gets even less deterministic.
lelandbatey 1 day ago
As I understood it, the "randomness" affecting what is selected at any temperature still comes from a PRNG or CSPRNG (or whatever RNG you want, maybe a hardware one), and if you were to swap that out with something deterministic you'd get the same results every time (barring non-determinism in other parts of the OS/drivers/maybe even hardware).
But theoretically, the output of every LLM is seed-driven (or could be if you wrote the software to isolate it) just like any computer software. It's just none of the software written (even llama.cpp AFAIK) chooses to support stable-seeding due to the changes in stuff like CPU/Vulkan/CUDA/Metal differences making it difficult to make consistent.
They could though! Hopefully one day someone implements it into the mainstream LLM-engine software and it gets exposed in the APIs serving the models. It'd do a lot to show folks the "internals" of these models.
toolslive 1 day ago
It's probably due to the fact that it's a cloud service. You have no guarantee that your next request will go to the same machine. So even with an identical seed, and temp 0 you might get different hardware and hence different accuracy/noise in the floating point operations.
rightbyte 23 hours ago
How can there be noise in floating point operations? I could buy something like completion order for parallelized batches, i.e. adding a+b+c is different from a+c+b etc.
microtonal 1 day ago
Stable seeding is not enough. A lot of modern, fast compute kernels are nondeterministic. Floating point multiplication/addition is not strictly associative and e.g. reductions can combine results from different threads in different orders (e.g. through atomic ops). You can write kernels to be deterministic, but it is generally less efficient.
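A small sketch of the non-associativity point, using ordinary Python floats (IEEE 754 doubles). The helper `reduce_in_order` is a hypothetical stand-in for a reduction whose accumulation order depends on thread scheduling. It is not how any real GPU kernel is written.

```python
def reduce_in_order(values, order):
    """Sum values in the given index order, as one thread ordering might."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Regrouping three ordinary decimals changes the last bit of the result.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# With widely different magnitudes the effect is no longer subtle.
values = [1e16, 1.0, -1e16]
print(reduce_in_order(values, [0, 1, 2]))  # 0.0: the 1.0 is absorbed by 1e16
print(reduce_in_order(values, [0, 2, 1]))  # 1.0: the large terms cancel first
```

When two logits are nearly tied, a difference in the last bits is enough to flip an argmax, and every later token is then conditioned on a different prefix.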
vlovich123 20 hours ago
They are only non-deterministic when you’re doing batching and a kernel ends up running across a “random” set of token streams. If you’re only processing one user’s request, they’re very much deterministic.
nok22kon 1 day ago
that's incorrect in the presence of batching. it's tough work making it truly deterministic:
https://x.com/FireworksAI_HQ/status/2069873437217276015
vidarh 1 day ago
It's not that hard. What is hard is making it truly deterministic and retain high throughput.
gaflo 20 hours ago
PRNG is deterministic.
nullc 23 hours ago
If you make an exact integer implementation and run with temp=0 it's deterministic.
You don't even need temperature 0, just make a random seed for the sampler part of the input and then it's deterministic as a function of the input.
But running autoregressive models at temp=0 tends to expose pathological behavior, because the training process produces a function with a lot of gain so it's prone to feedback on its own noise.
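The seeded-sampler point above can be sketched as follows, assuming a fixed toy distribution over three token ids instead of a real model. The `generate` function and the weights are hypothetical.

```python
import random


def generate(weights, seed, steps=20):
    """Draw `steps` token ids from a fixed distribution with a seeded sampler."""
    rng = random.Random(seed)  # the seed is treated as part of the input
    tokens = list(range(len(weights)))
    return [rng.choices(tokens, weights=weights, k=1)[0] for _ in range(steps)]


weights = [0.5, 0.3, 0.2]  # sampling at temperature > 0, not greedy
run_a = generate(weights, seed=1234)
run_b = generate(weights, seed=1234)
print(run_a == run_b)  # True: same seed, same draws
```

This only shows that the sampler is reproducible. In a real model the weights come from the forward pass, so the guarantee holds only if that computation is reproducible as well.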
chrisjj 1 day ago
> However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run
The implementation does not often differ run by run.
skissane 22 hours ago
> The implementation does not often differ run by run.
If you use a cluster, or even multiple clusters, and they have non-identical hardware, then two consecutive runs could end up being routed to nodes having different GPU models with slightly different floating point behaviour, or even software differences (e.g. a newer GPU offers some feature usable to speed up calculations which an older model lacked; the same code can use the feature when it is available and fall back to a slower alternative if it isn’t). The larger your scale, the greater the odds it will happen.
PaulHoule 16 hours ago
The whole problem of text understanding is a problem of reasoning under uncertainty, that is, you can't really be sure which witch people are talking about all the time. A person you might hire might be successful or unsuccessful at the role, no matter what hiring process you use. Two people might look at the same resume and come to different conclusions. Two patients with the same symptoms and clinical presentation might have different diseases, etc.
I don't buy the story that the old AI died primarily due to the cost of knowledge base maintenance [1], but rather the lack of a universal system of reasoning over uncertainty.
For me it's a running gag that Spock was always saying things like "Captain, we have a 21% probability of surviving this mission" when Bayes teaches us your probability distribution has a probability distribution, "we have a β(5,1) chance of surviving this mission" is more like it.
To that end it wouldn't be too crazy to run a resume through that machine 100 times and look at the probability distribution of the score.
[1] then again I am the kind of maniac who will sort images on a tablet lying in bed until my visual system malfunctions
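The repeated-scoring idea above could be sketched like this. Note that `score_resume` is a purely hypothetical stand-in (a base score plus Gaussian noise) for a call to an LLM scorer, so the numbers it produces illustrate the bookkeeping only and say nothing about any real model.

```python
import random
import statistics


def score_resume(resume_text, rng):
    """Hypothetical noisy scorer: NOT a real model, just base score plus noise."""
    base = 60 + (len(resume_text) % 20)
    return max(0.0, min(100.0, rng.gauss(base, 8.0)))


def score_distribution(resume_text, runs=100, seed=0):
    """Score the same resume many times and summarize the spread."""
    rng = random.Random(seed)
    scores = [score_resume(resume_text, rng) for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "min": min(scores),
        "max": max(scores),
    }


summary = score_distribution("example resume text")
print(summary)
```

Reporting the spread next to the mean makes it visible when a single score would have been mostly noise.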
vessenes 22 hours ago
To be clear, temperature 0 is deterministic and will produce the same output for exact duplicate inputs, across all seed choices.
Provided:
* If it’s MoE we are talking about, that the duplicate inputs are for the whole batch (yes, your batch neighbours can impact your choice of experts. Blergh.)
* Your kernels are deterministic
* There’s no system wide effort switch that responds to, e.g. work load across the cluster (for a thinking model)
Upshot:
Temperature 0 is not deterministic in probably any existing cloud infra, but it could be for edge inference pretty reliably.
To your quibble on 0.1 being more deterministic - I think it’s a pretty fair summary - we’re going to sample much more from the ‘temp 0’ answer at 0.1 than we would at temp 0.9, no?
Dylan16807 22 hours ago
Even then it's deterministic in the way a hash function is deterministic. Change one letter and you can get a completely different output. What people actually want is something continuous.
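To illustrate the hash analogy, here is a short sketch using SHA-256 from Python's standard `hashlib`. The two input strings are made up for the example.

```python
import hashlib


def digest(text):
    """Deterministic: the same text always yields the same hex digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


a = digest("resume of Alice")
b = digest("resume of Alicf")  # one letter changed

print(a == digest("resume of Alice"))  # True: fully reproducible
print(a == b)                          # False: no continuity in the input
```

Reproducibility and continuity are separate properties. The hash has the first and none of the second.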
vessenes 19 hours ago
Agreed on the desire for continuous behavior. That said, in a modern LLM, is this hash analogy accurate? I would be surprised if a single letter changed most zero temp force ranked outputs.
E.g:
“Where is the Eiffel Tower Located? One word only.”
“Where is the Effel Tower located? One word only.”
“Where is the Eiffel Tower located? One wor only.”
I’d be very surprised if those got different answers from even a small local model at temp 0.
guhcampos 22 hours ago
This is it. People mistake deterministic for precise/exact/correct. It's not.
aesthesia 1 day ago
A distribution with all probability mass on one outcome is deterministic, so in principle, setting temperature to 0 _should_ result in deterministic outputs. There are a few reasons it might not, but I don't think any of these apply when running a local model like the author did.
317070 1 day ago
> so in principle, setting temperature to 0 _should_ result in deterministic outputs
It is a common misconception, but it is not true even in principle. If I have 2 or more logits which are equal to the maximum of my logits, I will sample uniformly at random from them with any temperature, even zero. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly from the first or the last element.
Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs put the associativity of the sums in matrix multiplications in arbitrary order, and this has a huge impact on the logits coming out of the neural network.
jstanley 1 day ago
> "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs.
But this isn't a fundamental property of LLMs, it's just an implementation detail. It's pretty obvious that if you evaluate the matrix multiplications correctly and deterministically sample from the highest-probability outputs, you will have a deterministic LLM.
EvgeniyZh 1 day ago
You don't have to sample uniformly. You could take the lowest index of all maxima.
But yeah, the main source of randomness is non-deterministic matmul, and temperature does nothing with it
DougBTX 1 day ago
> GPUs put the associativity of the sums in matrix multiplications in arbitrary order
That’s user-controlled too, not an inherent property of GPUs:
https://docs.pytorch.org/docs/2.12/generated/torch.use_deter...
easygenes 1 day ago
There are. If the kernels are nondeterministic (e.g. timing issues) there are minor changes between runs, on a single system, even with greedy decoding enabled (typically what temperature=0 achieves).
IshKebab 1 day ago
Setting the temperature to 0 should give deterministic results but that's not any better - it's just hiding the huge variance by only taking one sample.
croes 1 day ago
So you would always get the same result, but it could be the wrong one
srdjanr 1 day ago
Of course, nothing can guarantee the right answer from LLMs
valzam 1 day ago
I mean the easiest explanation would be that the model harness doesn't always take the most likely token but does top-k sampling or similar. Temperature just means that probabilities get more and more equalized, boosting the chance that an unlikely token gets picked. But even with temp 0 you could have 0.8 T1, 0.19 T2, ... and sometimes sample T2.
aesthesia 1 day ago
No, this can't happen at temperature 0. The formula defining temperature-adjusted softmax isn't strictly defined at 0, but taking the limit (in the case where all logits are distinct) results in probability 1 being placed on the largest logit. Samplers will typically special case temperature 0 and pick the most likely token at each step.
mywittyname 16 hours ago
> This is not correct
Several of my claimed AI-expert colleagues repeat this as though it's gospel. I've heard "set the temperature to 0 so we get consistent results" more times than I can count.
Terr_ 15 hours ago
I imagine it's much like game-developers saying: "Set a fixed seed so the player gets consistent results."
Yeah, it can work, but it is subject to so many potential pitfalls that you can't casually assume it will. It's a property you have to actively design-for and rigorously test to be sure the system can deliver it for some particular scenario.
thesuitonym 14 hours ago
> resumes are just dumped in some LLM black hole and no one really knows how it works.
Not that I'm defending AI, but HR departments rarely knew how their ATS ranked and sorted applicants before they were AI powered.
margalabargala 16 hours ago
> I'm happy to see in-depth pieces like this
It's somewhat ironic that this "in depth" piece was written by an LLM as well.
lelanthran 20 hours ago
> temperature is not some kind of "deterministic" switch, but rather it affects the sampling distribution (which becomes more "spiky"—but is still very much a distribution).
You're correct. The confusion arises because we use the word "non-deterministic" when we mean "probabilistic".
I tried to explain it better: https://www.lelanthran.com/chap15/content.html
unknown 1 day ago
[deleted]
make3 1 day ago
A spikier distribution is exactly what makes the distribution closer to deterministic. That's not the point though. Even in greedy (deterministic) decoding, it is still a black box that reacts to its inputs in unpredictable ways. Switching one word around might lead to different scores, for example.
fluoridation 22 hours ago
Yeah, this is the forest that the people arguing about math trees are missing. It doesn't matter that the algorithm is deterministic if the algorithm passes the input through a cryptographic hash function to make a yes/no decision. The result may be perfectly reproducible and still nonsensical in its distribution with respect to its input domain.
Nimitz14 14 hours ago
He said it nudges it to be more deterministic. Your comment is not correct.
bhanu786 1 day ago
Agree
mtharrison 20 hours ago
Small refinement: the underlying model isn’t stochastic at all. The forward pass is a deterministic function of the weights and input, it just produces a probability distribution over the next token. The stochasticity is an optional sampling step layered on top, not something inherent to LLMs. Greedy/argmax decoding (or temperature 0) makes the whole thing deterministic.
So “purely stochastic” overstates it a bit: the distribution is computed deterministically, and you choose whether to sample from it or not.
simiones 19 hours ago
There are more layers to this problem, if we want to get into the details. The LLM is defined in terms of floating point operations, and those are not actually fully deterministic, on most hardware and in most performant implementations.
IEEE 754 only specifies precision requirements for certain operations, not precise bit patterns (e.g. for exponentials). So, at least in principle, the same hardware performing the same operation could produce different results at different times, as long as they are close enough to the theoretical answer. I'm not sure if any hardware actually works like this.
IEEE 754 also specifies that many of the basic arithmetic operations are not associative - so any reordering (which is common when batching multiple queries at the same time) will introduce indeterminacy from the perspective of your own query (that is the result for your query will change depending on what other query happens to be processed at the same time, which is not under your control).
Finally, even if we take the case when a query is processed alone, and even if one particular hardware is completely deterministic, the result will be different on different hardware - which can again look like non-determinism if you're sending your query to a load balancer.
So, the math for LLMs is deterministic in theory, but implemented with non-deterministic approximations & optimizations in practice, and their results are then normally used only as a probability distribution to be sampled from.
spwa4 1 day ago
[deleted]
mahogany 22 hours ago
Every time people point out a limitation or constraint of LLMs, I see a comment that is to the effect of “but humans…”. I don’t understand why this comparison is relevant to this particular thread. Is it just an amusing similarity?
efromvt 19 hours ago
I think it's often useful to push the conversation toward "we built a system for humans that dealt with this; what from that is or is not applicable for agents in the same context?" Humans randomizing resume review for screening is pretty well known; I've seen companies try to fight it with things like hiding information, panel reviews, etc - it's unclear to me how effective those would be for agents (honestly, it was unclear how effective those were for humans). I was depressed about the hiring process before we had AI screening and I remain depressed about it.
castlecrasher2 17 hours ago
It may seem trite but the point is that if separate humans were assigned the same task the LLM was here the results would be similarly non-deterministic.
smusamashah 1 day ago
We expect computers to be consistent on the other hand. A calculator will always give you the same answer unless some chip gets struck by a particle. LLMs are on computers and should be fairly consistent too.
vidarh 1 day ago
And this lies at the heart of the problem.
We expect computers to be consistent despite running programs that are not designed to be consistent.
This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs.
But for some reason some people choose to assume LLMs should act like a calculator instead of any of those programs.
miki123211 1 day ago
What's even worse, different humans have different weights.
If you train two different LLMs and replace what data they "see" in batch n, that doesn't affect the data they see in batch n+1, or any further batches. In LLMs, you can introduce "noise" into the training process, but that noise doesn't really compound.
Humans learn from experience, not from data, and their experiences at age n shape what experiences they seek (and hence train on) at age n+1. A small amount of "noise" injected into their "training", let's say hearing a group of friends discuss a movie while their identical twin goes to the bathroom, can compound into them watching that movie, which can compound into them forming an identity around that genre, and so on, until they're two completely different people, trained on completely different "data mixtures".
chrisjj 23 hours ago
> What's even worse, different humans have different weights.
Far worse would be different humans having the same weights.
thisisit 22 hours ago
The same person is not going to give you three different answers within span of minutes. Especially when nothing fundamentally has changed. People might or might not update their views depending on their biases.
rkuodys 21 hours ago
I'm pretty sure the personality tests are created specifically for the reason that a single person can hold fundamentally different (or conflicting) beliefs about himself in a matter of minutes. You can say "I am an honest person" and the next minute you can say "I never lie" - and both cannot be true for an average person.
mnky9800n 1 day ago
Test retest reliability is a thing in psychometrics.
spwa4 1 day ago
[deleted]
cyanydeez 23 hours ago
a studied example is sampling judicial decisions before lunch and after lunch. judges are more lenient on a full stomach.
unknown 18 hours ago
[deleted]
ThrowawayR2 18 hours ago
That was a single study and its finding is at the very least disputed, if not debunked, e.g. https://news.ycombinator.com/item?id=41091803
WhrRTheBaboons 22 hours ago
how did they account for sampling bias? a judge might leave easier cases for after lunch. people with control over their schedules usually ease themselves back into it after breaks.
nok22kon 1 day ago
it's a bad idea in general to use non-1.0 temperature. there is a reason labs are strongly recommending using 1.0.
using low temperature is more deterministic, but the cost is the model becomes "dumber"
tipsytoad 1 day ago
1.0 is actually pretty arbitrary and way too high as a general rule. Something like 0.3 is a more sensible default
programjames 17 hours ago
1.0 is "natural units". If your energy corresponds to nats, you should be using temperature 1.0. If your energy corresponds to bits, you should be using temperature ln(2) ~= 0.7. The optimization pressure is
max nats = max entropy + energy / temperature
Why might energy correspond to bits or nats? Imagine your goal is to play as many interesting games of chess as possible in a tournament. This implies you have to keep winning. If you look at the RL environment from the right perspective, you can turn it into optimizing bits or nats.
317070 23 hours ago
If RL was used to train the model, the model will have been trained on its own sequences. Those will have been generated with a temperature of 1.0. They must be, otherwise you would get a premature collapse or explosion of your entropy if the temperature was respectively lower or higher.
After that RL step, you want to stick to the RL distribution, and so keep a temperature of 1.0. Other temperatures will drive the model out-of-distribution.
That is why the sampling step for agents or thinking LLMs is usually kept at a temperature of 1.0.
zipy124 1 day ago
It really depends on the application does it not? I'm not an LLM guy, but for creative tasks like storytelling wouldn't you want a higher temperature usually? Happy to gain insight from anyone with experience here :)
embedding-shape 1 day ago
Heavily depends on the model architecture and the implementation though, I don't think you can say what values are better than others without first specifying those, otherwise it's straight up guessing, ironically.
nullc 22 hours ago
If you use a model in a configuration far from where it was RLed you get no warranty. (you also get no warranty the other way, however)
codeflo 1 day ago
It can be useful for pure translation tasks and stuff like that where you explicitly don't want creativity of any kind.
jldugger 18 hours ago
Would 1.0 have fixed the wide variance in scoring?
nok22kon 14 hours ago
temperature is the wrong tool
the variance is caused by the bad evaluation prompt
if you ask "what is the capital of France" you'll always get Paris, with any (non-extreme) temperature
vidarh 1 day ago
Plenty of setups default to lower values than 1.0.
bluechair 1 day ago
Willing to be corrected but I believe this type of automated resume filtering is illegal. Not saying it never happens but my understanding is it is not typical.
thayne 1 day ago
I would expect that to depend on jurisdiction.
I don't know for sure, but I would be surprised if it was illegal in my particular US state. You might be able to argue the AI has inherent biases that introduce illegal discrimination in the hiring process, but my understanding is that winning a case like that would be very difficult, especially since most employers are very cagey about their hiring process and why they made a decision.
small_scombrus 1 day ago
They don't need to actually filter/blackhole to have virtually the same effect.
Show someone a list of resumes with an "applicant score*" and they'll naturally ignore the ones with a low ranking
*scores are generated with AI, mistakes may be made, use only as a guide and verify results
ivan_gammel 1 day ago
In situations when you get hundreds of applications for one open position (real market now), whatever reduces your pool to the size a human can handle, works. You can preserve some diversity metrics in the process. This particular filtering is rather primitive, but LLM as a first filter can definitely do the job. You may burn less tokens than the hourly rate of your HR and it will be fairer than just dumping 50% of unread CVs in trash.
369548684892826 1 day ago
Great until someone realises you’ve filtered out minority groups from the application process (most developers are men so maybe the LLM decided they’re the best fit, but you’ll never know exactly why it screwed you over) and you suddenly have an expensive lawsuit.
elric 1 day ago
Under GDPR, you have the right to request manual processing whenever personal data is processed automatically to make a decision about you that has "significant impact". Not being hired seems like it would qualify.
dgellow 1 day ago
Illegal where?
dathinab 21 hours ago
And this + the tendency for AI to "prefer" AI-produced code + some other AI biases is why this is most likely highly illegal to use in the EU, as it violates anti-discrimination laws in multiple ways.
To be clear:
- randomly filtering "too many" resumes is pretty much allowed (I think)
- but must be actual random independent of the resume (and can be in multiple layers, i.e. random filter > pre-select > random filter > select)
- this isn't the case for AI, as the random aspect is not independent of the actual resume evaluation
- in general you can't make sure the AI doesn't apply systematic biases, and there is high indication that it does do so
- for humans you can train them and order them to ignore their biases. This won't work reliably either, _but now you have delegated the responsibility for illegal biases to the hiring personnel violating the order_. With AI usage you are responsible no matter what you tell it. Lastly, you can technically show/prove that a specific AI is highly biased in specific contexts, which for human employees is technically possible but not really practical. So this moves "specific, mostly deniable" cases into "systematic proven bias" territory. In other words, legal risk goes from "limited/no issue" to "people can systematically f-you over if they know you use AI for hiring".
jerf 20 hours ago
Everything is correlated to everything [1].
Which means there's a good chance this is somehow correlated in one way or another to race/gender/other protected classes in the US, just by the math of everything being correlated to everything.
Which means this is one good lawsuit away from being illegal in the US as well. It doesn't even necessarily have to "win", just do well enough in court to scare away anyone else from using this.
And boy oh boy would I hate to be on the receiving end of this lawsuit, trying to prove that my AI screener is completely in compliance with all hiring laws. That sounds like a nightmare.
[1]: https://gwern.net/everything
oceansweep 18 hours ago
Already happening with Workday in California:
https://news.bloomberglaw.com/litigation/workday-loses-bid-t...
torben-friis 19 hours ago
Would the accused party have to prove compliance? Or would non compliance have to be proved by the accuser?
Honest question, I'm not American.
jerf 18 hours ago
"Innocent until proven guilty" is a criminal court concept. This would be a civil suit. Those use different standards, like "preponderance of the evidence" [1]. I agree that if the claimant had to prove the AI system is violating employment law, that would be a hard bar to clear, but showing it on the preponderance of the evidence is something that would have me a lot more nervous if I was on the receiving end of the lawsuit.
This is a highly general answer to a complicated topic; my main point is more that this is not going to be held to the standard of "beyond reasonable doubt", which would be hard to meet.
[1]: https://www.law.cornell.edu/wex/preponderance_of_the_evidenc...
nonethewiser 16 hours ago
>Which means there's a good chance this is somehow correlated in one way or another to race/gender/other protected classes in the US, just by the math of everything being correlated to everything.
>Which means this is one good lawsuit away from being illegal in the US as well.
Uhh.. what? No that doesn't follow at all.
Screening resumes in a way that correlates to race, gender, etc. is not illegal. This is a fundamental distinction. The law is you cannot use those as filters. But the outcomes likely will be correlated. In fact to ensure they are not correlated you'd have to break the law and control for race, gender etc. Which is racism.
The models dont even get race as an input. If they did and they used it to select then yeah, that lawsuit sounds like it has merit. But a mere correlation in outcomes? In no way illegal what-so-ever.
tikhonj 13 hours ago
The US has a notion of "disparate impact"[1] that means you can be liable for discriminating based on a protected characteristic on the basis of correlation. This is why HR departments are very hesitant to use things like IQ tests for screening candidates, for example.
[1]: https://www.congress.gov/crs-product/IF13057
DIDiscourseFan17 小时前
I wouldn't doubt that lawsuits for employment discrimination for any company (and I suppose it was most of them) that used LLMs in hiring processes will become a very lucrative business. They are all open to civil suits at this point.
ANAnimalMuppet17 小时前
And, if there aren't enough lawyers to do all that work, you could use AI to file the suits.
I'll let you decide whether that's a dream or a nightmare...
COCobrastanJorji15 小时前
> randomly filtering "too many" resumes is pretty much allowed (I think)
It's totally fine to filter out resumes in a completely random, content-independent way. Grabbing the fourth resume down in the pile and offering them the job is a perfectly fair, albeit stupid, way to make a hiring decision. However, AIs are very, very good at capturing biases, and it would not at all surprise me if an AI told to filter resumes ended up filtering with some biases for things that you definitely do not want to filter on, like the name of the candidate. And it might be that every resume that claims it fixed a typo in a major open source project gets a pass, but resumes that only list their own projects get rejected 60% of the time, so you're losing more good candidates than bad.
DIDistrict552419 小时前
I'm not sure it is very easy to show this is a breach of non-discrimination requirements, like under Council Directive 2000/78/EC for employment.
Because it acts like an irrational gambling machine, I agree it can have an unwanted indirect discrimination effect in general. But it will probably not differentiate "on the grounds of religion or belief, disability, age or sexual orientation". It is possible, but that would take a lot of work for the lawyers to prove to the court.
I believe the more interesting part is the EU AI Act (still not in force in this regard until 2 December 2027). This will clearly be a high-risk AI system: "AI systems intended to be used for the recruitment or selection of natural persons, in particular to place targeted job advertisements, to analyse and filter job applications, and to evaluate candidates".
Which does not mean prohibited, but it could later turn out that LLMs will be excluded from being used in high-risk AI use cases (falling under article 6 with no exemptions).
Considering that none of the standards are published yet, I have absolutely no idea how they will ensure compliance with the following parts of Article 10 when using LLMs for such tasks:
"(f) examination in view of possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights or lead to discrimination prohibited under Union law, especially where data outputs influence inputs for future operations;
(g) appropriate measures to detect, prevent and mitigate possible biases identified according to point (f)"
I don't think it's technically possible to do that with LLMs in general at the moment, even with the full cooperation of the model providers. Maybe you can do some meaningful audits for smaller models. But the EU AI Act may end up excluding all the generic "using-LLM-but-not-entirely-sure-why" vibe coded approaches from high-risk use cases (in Annex III). Which would make sense.
https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng
DAdathinab17 小时前
The EU AI Act got hijacked by huge corporations with last-minute changes which moved it from "could probably work" to "catastrophe".
Even on 2 December 2027 it might intentionally not be enforced at all for a while because of that, though I think the goal is currently to amend it before then.
> that LLMs will be excluded from being used in high-risk AI use cases
no, it won't, I can guarantee you this. At best they will get additional restrictions over time, as things go wrong. Anyone who could make this happen has way too much interest in it not happening. (Most/all? EU country legal systems are overloaded to the point of not working correctly anymore, and have been since before AI-generated lawsuits and other AI nonsense started. I won't go into detail, but many believe AI assistance (for certain tasks, always with a human making any final decisions) is the only way to get out of this mess.)
> standards are published yet
or exist,
like seriously, this isn't a case of there being non-public WIP standards which will pin all the nitty-gritty details down, but a case of state agencies (and in the last instance judges) having to decide if a specific standard (or implementation) is sufficient or not.
but also to some degree it shouldn't be tightly coupled to tech standards as there are often many ways to implement the things the law requires and accepting only one is undesirable (and likely wouldn't legally hold up). But having tech standards which are a "guaranteed to be enough if you comply with" (but not the only valid way) would have been preferable, bringing us to the next point
> have absolutely no idea how they will ensure compliance
nor do they know. The original version, before the big-corpo hijack, had exceptions for most companies affected now. So it would only have affected a handful of huge companies, which have many of the things required already in place, in some form or another. Most likely this would have played out as these companies presenting how their measures are "sufficient" and the agencies then evaluating them and potentially requiring some changes, going back and forth over a longer duration, leading to documented cases and rough technical standards about "what is sufficient" that they could then pass to other organizations in the future. But now the law affects not just a handful of companies but thousands, if not tens of thousands. Many are not staffed in a way where such a process could work, or even able to do the necessary documentation to show "compliance"...
So from a practicability POV, if enforced starting 2027, it currently excludes close to _any_ (meaningful) use of AI, down to a trivial linear regression or similar. Including any "old school ML/AI" any Bank uses for risk assessment.
Banking grinding to a halt in December and there not being any (meaningful) AI startups or adoption at all is not something anyone (in power in any state organ) wants to see, so guess how much it will be enforced ;)
And as mentioned, the chance of AI as a technology being excluded "in general" is close to none. Maybe specific usages could be excluded (and/or are already excluded) but that's it.
Oh, and as a bonus, a malicious reading of f+g removes any proper privacy protections for any AI usage in a high-risk context, where it is often most relevant... (a saner reading allows it, with ... tricks).
BUbuzer20 小时前
> this is most likely highly illegal to use in the EU due to violating anti discrimination laws in multiple ways.
It's generally illegal under GDPR Article 22.
> The data subject shall have the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her.
Exceptions in 22(2) are unlikely to apply. It's hard to argue that it's truly necessary (a) and consent (c) is almost always unavailable in employment context. (b) might apply, but it requires specific law in EU or Member State to authorize it.
BLbluGill19 小时前
For C: I'm not sure how EU laws work, but ethics says that someone who needs a job cannot give consent since the possibility of a job if they give consent could be a bribe. See a lawyer for how it works in your country.
DAdathinab18 小时前
also not fully sure, but AFAIK there are limits to how far you can waive this right, in the context of things like TOS, simple opt-in fields on forms etc.
Like YT would have loved to make you opt out of it (and probably has it in their TOS) but there were multiple cases of courts forcing them to handle it properly in the past, as far as I remember.
My _guess_ is that, at least if you don't sign a proper contract, you can always force a human reevaluation. But also only that (so only semi-useful). Also, even with a proper contract it's unclear if it would be possible in this specific case, due to the contract being fundamentally one-sided/unfair and semi-forced on you if it were widespread on the market for the specific job you are trying to get.
BUbuzer18 小时前
That's why I said consent usually cannot be used in employment context. I wouldn't rule it out 100% for everything employment related, but application screening is unlikely to qualify for those rare cases.
DAdathinab18 小时前
this isn't quite how GDPR Article 22 works
There is a difference between
- having a right you can't waive
- which is very similar to something being forbidden
- but different to having a right you fully or partially can waive
Furthermore, to some degree you are only "subject to a decision based on ..." if the decision has an effect on you.
In practice wrt. Article 22 this means companies can make a "decision solely based on automated processing[..]" iff they give you a (realistic) chance to object to it in which case they will do a human review of the decision where a human confirms/changes this decision based on reviewing the involved information.
There is a lot of gray area around what a "chance to object" means and when a human review makes a decision no longer "solely based on automated processing" (a human just saying the AI was right clearly doesn't count, but a human constructing a case for why they would have decided the same way, based on why the AI made the decision, can count, iff it's reasonable to assume a human might have come to the same decision had it only been reviewed by a human).
Or in other words, GDPR Article 22 is just "so-so" meaningful in the context of hiring.
Like, if the AI made a mistake they have to reevaluate it, but as long as there are other similarly qualified competitors (whom they did hire/are in the process of hiring) it is quite easy to come up with a reason why those are a better choice for them. Or to go through the motions of you being in round 2 or 3 of hiring and then find an excuse not to hire you.
BUbuzer17 小时前
Mostly yes.
Note that the chance to object must be given before the decision is made, i.e. offering a human review only after the fact is not enough. The human must also have a meaningful chance to actually affect the decision.
If the decision is based on purely objective facts that are actually necessary (like you must have a certain license), then the human and the computer always coming to the same decision is likely correct and compliant. But as soon as you start putting in subjective criteria and the human agrees with 100% of the computer's denials, it becomes a lot harder to demonstrate that the human is actually able to affect the decision, as required by Article 5. Note that the burden of demonstration is on the controller, not on the data subject/DPA.
Objective criteria also aren't always enough by themselves. If both the human and the computer calculate the same credit score and you must score X points to get a loan, then the human isn't actually able to affect the decision. Essentially the credit score calculation itself ends up being the automated decision, rather than the formal rejection that is later given to the data subject.
RYryukoposting1 天前
At this point we might as well adopt that joke where you blindly throw away half the resumes because you don't want to hire unlucky people.
TAtaffronaut23 小时前
At one point in the past a major UK medical school adopted random selection for qualified candidates (Barts and The London School of Medicine and Dentistry - part of Queen Mary University of London). The approach benefitted qualified students from less well-off backgrounds vs those who can afford to win at the ever more elaborate (manual at the time) hurdles of resume assessment criteria and effectively game the system. There was an orchestrated campaign against the lottery around "Why gamble with would-be doctors?". Random selection was quietly dropped.
HEHerring15 小时前
That's probably a good litmus test for political capture by elites. The Netherlands introduced a weighted lottery for medical schools in 1972, abolished it in 2017 for basically the same reasons, studied the (worse) outcomes for a bit, then put it back in 2024.
AGagnosticmantis1 天前
A person's total luck is constant over a lifetime. The remaining half of the candidates already spent some of their luck in this selection, so they'll be on average less lucky than the discarded half.
TETerr_13 小时前
Normally we'd reject the first 37% [0] of candidates and then pick the next one that is above the average, but if all the unluckiest candidates show up first, then we need to sample even more in order to get an accurate baseline.
This may be compounded by the "Teela Brown" problem [1], where some candidates may be too lucky to end up with our company, causing them to appear later in the stream or not at all.
[0] https://en.wikipedia.org/wiki/Secretary_problem
[1] https://en.wikipedia.org/wiki/Ringworld
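For reference, the 37% rule from [0] can be checked with a quick simulation. This is only a sketch: the function names `secretary_pick` and `success_rate` are made up for illustration, and it uses the classic form of the rule (take the first candidate who beats everyone seen during the skipped phase) rather than "above the average".

```python
import math
import random

def secretary_pick(values, cutoff):
    """Skip the first `cutoff` values, then take the first one that beats them all."""
    best_seen = max(values[:cutoff]) if cutoff > 0 else float("-inf")
    for v in values[cutoff:]:
        if v > best_seen:
            return v
    return values[-1]  # nobody beat the baseline, so we are stuck with the last one

def success_rate(n, trials, rng):
    """Fraction of trials in which the rule picks the single best of n candidates."""
    cutoff = int(n / math.e)  # roughly the first 37%
    wins = 0
    for _ in range(trials):
        values = list(range(n))  # n - 1 is the best candidate
        rng.shuffle(values)      # candidates arrive in random order
        if secretary_pick(values, cutoff) == n - 1:
            wins += 1
    return wins / trials

print(success_rate(50, 5000, random.Random()))  # typically close to 0.37
```

The simulation assumes candidates arrive in uniformly random order, which is exactly the assumption the "unluckiest candidates show up first" scenario would break.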
T-t-31 天前
No, luck would be some expression of the difference between the average and the individual outcomes - it only exists relative to a population at the point in time when it is measured.
BEbee_rider19 小时前
But, however you structure the selection process the people who get picked are the ones who’ve expended some luck (like, if you throw away half the resumes, but then pick the resumes out of the trashcan, the ones you plucked out are still the lucky ones).
I see two possible solutions.
1) Most people won’t be using up most of their luck on this one thing. I mean they’ve got their whole lifetime worth of luck, so you just need to make sure to pick people who still have plenty left. In other words, ageism and/or picking people who’ve never accomplished much are the solutions!
2) We assume working for the company is a lucky outcome. If you make the company a really unpleasant place to work, people will have to use their luck to dodge it. However, luck can only be evaluated against other possible outcomes. The plan, then, should be to set up a competitor (possibly a front) that is a really nice place to work. They’ll act as the “lucky outcome expenditure dump.”
THthrowawaythekey1 天前
> A person's total luck is constant over a lifetime
Ah yes, the much revered cosmological fairness constraint.
CYcyanydeez23 小时前
everyone knows luck is tied to the wealth-gravity and increases as the inverse distance to the density of matter. but because it's relative, everyone thinks they have the same luck when not observing others.
LAlatexr1 天前
Even assuming that was genuinely how luck works, the conclusion does not follow from the premise because it’s obvious not everyone “starts with” the same amount of luck to spend.
ADaddandsubtract22 小时前
But assuming a random draw, you're more likely to select people with higher luck.
LOlobocinza15 小时前
assuming luck is spendable
SFsfn4217 小时前
This is not at all how probability works. Luck is not a resource one spends. If you flip heads 500 times in a row with a fair coin, the next coin flip is still 50/50.
ASaspenmayer2 小时前
Presupposing that the same coin is used for every flip (which is implicit in the example), it would be fair to question whether the coin could possibly be a fair coin after 500 heads in a row, even (and especially) if the flipping process were ideally fair.
I’m not a whiz with the math involved, but I am of the opinion that 500 consecutive same-side flips is a large enough sample size to calculate that the coin in question is biased, so it would be unreasonable to assume that the next flip is 50/50.
https://en.wikipedia.org/wiki/Checking_whether_a_coin_is_fai...
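To put rough numbers on that intuition, here is a minimal Bayesian sketch. It assumes only two hypotheses (a fair coin or a two-headed coin) and a made-up prior that one coin in a billion is two-headed; both the function `posterior_fair` and that prior are illustrative assumptions, not something from the linked article.

```python
from fractions import Fraction

def posterior_fair(n_heads, prior_biased):
    """Posterior probability that the coin is fair after n_heads heads in a row.

    Two hypotheses only (an illustrative simplification):
      - fair coin:       P(heads) = 1/2
      - two-headed coin: P(heads) = 1
    """
    prior_fair = 1 - prior_biased
    like_fair = Fraction(1, 2) ** n_heads  # P(data | fair)
    like_biased = Fraction(1)              # P(data | two-headed)
    evidence = prior_fair * like_fair + prior_biased * like_biased
    return prior_fair * like_fair / evidence

# Hypothetical prior: one coin in a billion is two-headed.
prior = Fraction(1, 10**9)
p_fair = posterior_fair(500, prior)
p_next_heads = p_fair * Fraction(1, 2) + (1 - p_fair) * 1

print(float(p_fair))        # astronomically small
print(float(p_next_heads))  # effectively 1.0
```

Both comments can be right at once: if the coin is known to be fair, the next flip is still 50/50, but 500 heads is overwhelming evidence against the premise that it is fair.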
CUCuriouslyC22 小时前
Donald Trump disproves the fixed luck hypothesis (and the Karma hypothesis!)
ZIzipy1241 天前
Or more to the point: there are generally far more qualified applicants than job roles. That is, training and education have greatly expanded over the last couple of decades to produce more and more job seekers, whilst job creation hasn't really kept pace.
PJpjio1 天前
This hurts more than it should.
CIcitrin_ru22 小时前
Maybe LLM resume screening is a symptom of a bigger problem - with tens of candidates per vacancy, employers can screen resumes badly and even throw half of the resumes away and still hire someone qualified.
ABAbsurdCensor19 小时前
That's really what it is, or at least what I've noticed.
Any position you have these days is inundated with applications. Most don't meet the qualifications (because in a lot of places, say in the US, you must apply to jobs to keep your benefits, regardless of what you are applying for), and among the rest, you'll find that there will always be some that are all similarly qualified. Who do you hire for one position? It sometimes just comes down to luck.
I can't imagine AI doing the job of filtering making the process easier, and more applications are just going to get tossed because of it.
LAlatortuga18 小时前
The author made this exact joke in TFA.
JEjerrythegerbil1 天前
> I fail 65% of the time. Same exact resume, different luck.
As someone who’s run hiring pipelines for technical roles in the past few years, that’s actually a fantastic number. I objectively hate saying that, but it’s true.
35% chance of elevating a technical individual to the next stage with no effort? I’ve seen as many as 100+ applicants an hour even when including a domain specific screener question. That’s 35 “screened” applicants in an hour. Were valid candidates screened out? Yes. Do you still have a candidate pool 35x larger than you need? Unfortunately, also yes.
The volume of applicants is SO HIGH such that your chances of getting moved to the next stage are actually markedly worse if AI isn’t involved. If you didn’t apply immediately (using an AI bot) there’s 50+ people ahead of you, and an exhausted technical leader if they ever make it to your resume.
Referral bonuses exist for a reason.
PUPufPufPuf1 天前
In that case, I have a pre-screening system to sell you. Through state of the art technology, it only lets through the best* 1% of applications.
*According to our proprietary, undisclosed, non-deterministic metric, which may or may not be Math.random
GRgroundzeros201516 小时前
I worked at a startup that judged their hiring pipeline quality using rejection rate criteria.
RVrvba1 天前
Reminds me of this
https://stackoverflow.com/questions/16833100/why-does-the-mo...
KYkyralis1 天前
Is it? Or is it a 65% chance of a resume getting ignored before a single human sees it, reducing your pipeline's likelihood of catching qualified candidates by the same?
Gates that reduce resume flow-through are only useful if their reduction is correlated with quality. Otherwise they're just dragging out your hiring process or unnecessarily causing you to ultimately lower your hiring bars.
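The point about correlation can be made concrete with a little arithmetic. This is a sketch with made-up numbers; the function `good_fraction_after_gate` and the 10% base rate are assumptions for illustration only.

```python
def good_fraction_after_gate(base_rate, pass_good, pass_bad):
    """Share of strong candidates among those who pass a screening gate.

    base_rate: fraction of all applicants who are strong
    pass_good: probability the gate passes a strong applicant
    pass_bad:  probability the gate passes a weak applicant
    """
    kept_good = base_rate * pass_good
    kept_bad = (1 - base_rate) * pass_bad
    return kept_good / (kept_good + kept_bad)

# Hypothetical pool: 10% of applicants are strong.
random_gate = good_fraction_after_gate(0.10, 0.35, 0.35)
useful_gate = good_fraction_after_gate(0.10, 0.70, 0.31)

print(round(random_gate, 2))  # about 0.10: same mix as before, just fewer resumes
print(round(useful_gate, 2))  # about 0.20: the gate actually enriched the pool
```

When the gate passes strong and weak applicants at the same rate, the pool shrinks but its composition does not change, so the later (more expensive) stages get no better return per resume reviewed.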
JEjerrythegerbil1 天前
> Gates that reduce resume flow-through are only useful if their reduction is correlated with quality.
The volume is infeasible to review everyone for quality, even at an hour scale. The conclusion and solution is inevitable, though I wish it were different. 35% is actually really good if you’re not coming in through a referral.
The current reality is <1% and the person reviewing you is exhausted.
FAfalsemyrmidon1 天前
You may as well just randomly pick 65 to discard, if your only goal is to reduce the number for review.
SEsevenzero1 天前
What an inhumane way of looking at this. Hiring is deeply flawed, you know it, and yet you keep job postings open for weeks/months in case "the one" magically appears on your doorstep instead of just interviewing 10-20 people and just picking one...
Corpo bullshittery at its finest.
BRBrian_K_White1 天前
This reasoning isn't.
BAbagels1 天前
The goal for the interviewer is to have a much higher ratio of good/bad candidates after the first screening. This means the more costly time you spend on the second step has a better return.
AEaesthesia1 天前
So the question is: is the score given by this system correlated with candidate quality? I don't think this post gives enough data to know.
LUludicrousdispla1 天前
So the logical solution is for candidates to submit multiple applications with slight variations to their contact info, "John Schmidt", "John J. Schmidt", "John J. J. Schmidt", "John Jacob J. Schmidt", "J. J. Jingleheimer Schmidt", etc.
YUyuliyp17 小时前
Hey, that's my name too!
TETerr_13 小时前
Whenever I send them out
The filters always route:
"Spammer: John Jacob Jingleheimer Schmidt"
[N/A] [N/A] [N/A] [N/A]
AMambicapter18 小时前
It's a good day to have 3 middle names.
RErecursivecaveat1 天前
If you have no requirements for accuracy, you can just advance 35% of applicants at random.
If the first 50 people who apply are all bots, why are you reading resumes in order of submission?
WOwodenokoto21 小时前
One of the first things you do when hiring is to set a period and randomize the order of resumes when reviewing, because early application is not a strong signal.
MRmrhottakes16 小时前
Sounds like you're pretty bad at hiring pipelines.
SPspike0211 天前
there have got to be better ways to optimize pipelines. maybe set a limit on number of applications for a role based on the number you/your team can reliably go through them. if more are needed then open the role for another wave of applications.
LOlowbloodsugar1 天前
Except the bit about ranking a decades long S3 engineer lower than an intern with GitHub repo.
ISIshKebab1 天前
I wonder if you could solve this for programming specifically as follows:
1. Give them some easy leetcode questions. Nothing that a competent programmer would have any problem with.
2. If they pass, ask for a deposit of like $20. Shouldn't be an issue for people who are actually serious.
3. Do more simple leetcode questions but this time on zoom so you can tell if they are using AI. If they pass that they get the deposit back.
(Yeah I know there are real-time interview cheat AI programs but based on what I've seen on demos of them it's super obvious when they're being used.)
Probably not practical but just a thought!
JGjghn18 小时前
I'm not going to do any of those 3 things for a would-be employer.
ISIshKebab17 小时前
They don't seem like unreasonable things to me so I guess it also helps filter out unreasonable people!
NEnever_inline21 小时前
This selects for desperation.
CMCM3023 小时前
I think what's more worrying to me (if other systems work like this ATS) is that it seems to judge based on a bunch of factors that will probably disqualify a ton of decent to good participants.
For example, 65 points are given for a mix of personal projects and open source contributions. Which is great if your one and only interest is in tech, and you don't have a family, dependents or a second/third job. If you have any of those other things, well the odds seem like they're incredibly stacked against you.
And it makes me wonder how many of these systems are stacked in favour of wealthy people with a near special interest level of obsession with tech and no worries outside of going to college/working a single job in their industry of choice.
THthewebguyd16 小时前
Yeah, the over valuing personal/open source projects is worrying and kind of sucks. I can use myself as an example, I don't do personal projects really, outside of work. My only actual programming work experience is during work hours for my employer. My hobbies are tech-adjacent (3D printing, some hardware/arduino stuff, photography) but they aren't "make a bunch of projects and put them on github" type hobbies. I'm certainly not going to make some BS fake CRUD or SaaS apps just to show off for potential employers, what a waste of time.
I, intentionally, have zero online presence in that regard. You won't find any public repos on my github, I don't blog, etc. It's even infected the ops/sysadmin side of the field (where I work), and that's somehow even worse. Like of course I don't have a bunch of environment specific scripts on my GH, why would I? It's irrelevant to anyone that doesn't work in my department at my current employer.
FIfireant6 小时前
In my experience personal projects are the greatest indicator of IC competence, especially for young people. You may not like it, but turns out that when you do a thing in your free time because you like it, you get better at the thing than the people that only do it because they have to.
BObob00123 小时前
[deleted]
DOdoodaddy20 小时前
I know that some think this is just some cold hard straight talk but this style of individualistic thinking lacks empathy. And more practically, it’s a trap.
In context, the “doing things” and “opportunities” that we’re talking about are jobs, careers. So by promoting the idea that one must work harder or longer to get or keep a career that they’ve already built sounds like a path to opt-in servitude.
SCSchiendelman21 小时前
In hiring, we pass laws to prevent abuses. In many countries and soon a few states, being asked to work outside of work hours is considered an abuse. Expecting that someone does work related activity outside of work hours is something I would actually consider regulating out of the application process!
DAdanmaz7422 小时前
Of course life isn't fair. But here the result is that companies will ignore potentially great candidates who dedicate all their programming time to their job and instead consider candidates who may be not just worse programmers, but also more interested in their hobbies (or padding their CV) than in doing their job.
I'm saying this as somebody who most of the time has some side project going on.
BObob00122 小时前
[deleted]
GRGrombobulous20 小时前
“Fair” is one thing, “systemically impossible to even approach fair” is another.
For example, you can’t “conscious long-term effort” your way out of being stop and frisked by cops because you were walking while black.
This setup isn’t even good for employers. Having your job as your hobby doesn’t automatically make you better at your job.
AUAurornis1 天前
> The default model is gemma3:4b
That’s a tiny model. No LLM is going to be a perfect and repeatable judge, but a tiny 4B model is like plugging an RNG into this system.
This whole exercise feels like someone vibe coded an ATS and got it to the point where the tests were passing because they decided they should have an open source ATS project.
DAdanpalmer1 天前
This sort of model is fine for small problems, when used in the right way. I think there's probably a version of resume analysis that would work well with this model, but "hey clanker, what projects has this person done" is not the way. You need extraction, cleanup, probably OCR to compare and further clean up, multiple analysis passes per signal with LLMs, judges, etc. None of that needs large models: you'd get marginally better performance from them, but there's very little context, so small models will perform well when used correctly.
ORorbital-decay22 小时前
This word (determinism) has a magical effect of warping any online posts it touches. Once you hear it you can almost guarantee it's going to be misguided. At least this time it's actual determinism (same input = same output), not arbitrary unrelated things.
Determinism matters for reproducibility, but do you really want these outputs to be reproducible in this particular case? Making LLM outputs deterministic is relatively trivial, you have to use batch-invariant kernels (if you use batching) and either set the temperature to 0 (don't do that, randomized sampling is here for a reason) or fix the seed (better). It's readily available in a few systems. But this won't make the result more useful, it will just obscure the fact that the agent is genuinely not sure about it - look at the range of the scores it gives! It still won't predict anything but the score will stay the same each time. Do you really want that?
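A toy sketch of the sampling step may help here. It assumes made-up logits for three tokens, and the names `sample_token` and `run` are hypothetical; it only covers sampling, not the batching and floating-point effects that can change the logits themselves between runs.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick an index from logits using temperature sampling.

    temperature == 0 is handled as a separate greedy branch (argmax),
    because the softmax formula would divide by zero.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens

def run(seed, temperature, n=10):
    """Draw n tokens with a private, seeded random generator."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(n)]

print(run(42, 1.0) == run(42, 1.0))  # True: a fixed seed reproduces the draws
print(run(0, 0.0))                   # greedy: index 0 every time
```

With a fixed seed the draws repeat exactly while the distribution keeps its shape, which is the sense in which fixing the seed is gentler than forcing temperature to 0.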
What happens here is they're supplying too little information (just a resume, which is almost at the noise level) and expecting a reply with too broad implications. This is a basic design mistake regardless of whether it uses LLMs. All surveys, tests, laws, and voting systems are extremely sensitive to framing because they work off too little information. But they also don't exist in vacuum, unlike this thing.
RURugnirViking21 小时前
This. Human judges and examiners are famously not deterministic even though we would wish it were so - we've probably all heard the thing of harsher sentences being given in the hour before lunch.
NOnonethewiser15 小时前
>we've probably all heard the thing of harsher sentences being given in the hour before lunch
That suggests determinism though.
I mean I agree with you overall. Either human decision-making is a system so complex it appears non-deterministic, or it is deterministic. Practically speaking, we are non-deterministic.
Let's not conflate non-deterministic with inaccurate though. Non-deterministic systems can be 100% accurate. https://en.wikipedia.org/wiki/Las_Vegas_algorithm
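A minimal sketch of a Las Vegas style algorithm, for anyone who hasn't met the term: the result is always correct and only the running time is random. The function `las_vegas_find` and the toy list are illustrative assumptions.

```python
import random

def las_vegas_find(items, target, rng):
    """Return (index, attempts) for some index i with items[i] == target.

    Probes random positions until it hits one. The answer is always correct;
    only the number of attempts varies. Assumes target is present in items,
    otherwise this loops forever.
    """
    attempts = 0
    while True:
        attempts += 1
        i = rng.randrange(len(items))
        if items[i] == target:
            return i, attempts

items = ["a", "b"] * 50
rng = random.Random()  # unseeded on purpose
idx, attempts = las_vegas_find(items, "a", rng)
print(items[idx], attempts)  # always "a"; the attempt count differs between runs
```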
GRgroundzeros201516 小时前
> harsher sentences being given in the hour before lunch.
Implicit bias theory sparked a massive number of studies that suggested everything influenced you from the color of the room, to what the person said to you before entering.
It’s been really hard to replicate and the conclusions that have been drawn are contradictory.
NOnonethewiser15 小时前
I made a similar comment on a different post. Non-determinism does not necessarily mean it cannot reliably reach the correct output (although sometimes it does mean that). Las Vegas algorithms are non-deterministic and 100% accurate. The tradeoff is that the time it takes to reach the correct answer is highly variable.
To contextualize this insight in your post and basically just repeat what you are saying: The mistake is not using a non-deterministic system. The mistake could be, in some sense, using it too little. Re-evaluating the same resume 5 times and seeing a high variance in scores is a more useful signal than evaluating it once.
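As a sketch of what "evaluate it five times" could look like: the scorer below is a hypothetical stand-in (`noisy_score` just adds Gaussian noise to an arbitrary base), not the ATS discussed in the article, and `evaluate` is a made-up helper.

```python
import random
import statistics

def noisy_score(resume_text, rng):
    """Hypothetical stand-in for one LLM scoring call: base score plus noise."""
    base = 60 + (len(resume_text) % 10)  # arbitrary deterministic part
    return base + rng.gauss(0, 8)        # arbitrary run-to-run noise

def evaluate(resume_text, runs=5, seed=None):
    """Score the same resume several times; return (mean, standard deviation)."""
    rng = random.Random(seed)
    scores = [noisy_score(resume_text, rng) for _ in range(runs)]
    return statistics.mean(scores), statistics.stdev(scores)

mean, spread = evaluate("same resume every time", runs=5, seed=1)
print(f"mean={mean:.1f} stdev={spread:.1f}")
```

A large spread relative to the pass threshold is itself information: it says the scorer cannot really separate this resume from its neighbours, which a single run would hide.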
PRprogramjames17 小时前
Nondeterminism is also a feature, not a bug. If you don't want people to optimize against your filtering process, you have to make it somewhat nondeterministic. For example, better candidates are exponentially more likely to pass the filter, instead of a hard cut-off at the top-100. Then it becomes no longer worthwhile to Goodhart the filtering process, because it barely increases your chances and there are so many more places you can use your time better.
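One possible reading of this, as a sketch: draw the shortlist at random with pass chances that grow exponentially with score, instead of always taking the top k. The names `stochastic_filter` and `hard_cutoff`, the `temperature` knob, and the scores are all made up for illustration.

```python
import math
import random

def stochastic_filter(scores, k, temperature, rng):
    """Pick k distinct candidates; the chance of being picked grows exponentially with score."""
    remaining = dict(scores)
    chosen = []
    for _ in range(min(k, len(remaining))):
        names = list(remaining)
        top = max(remaining.values())  # shift by the max for numerical stability
        weights = [math.exp((remaining[n] - top) / temperature) for n in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen

def hard_cutoff(scores, k):
    """Deterministic baseline: always the k highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

scores = {"ana": 90, "ben": 85, "cy": 84, "dee": 60}  # made-up metric values
print(hard_cutoff(scores, 2))                                   # ['ana', 'ben'] every time
print(stochastic_filter(scores, 2, 5.0, random.Random()))       # varies between runs
```

Under the hard cutoff, "cy" gains everything by squeezing out two more points; under the stochastic version the same two points shift the odds only slightly, which is the incentive change being described.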
1212_throw_away15 小时前
> If you don't want people to optimize against your filtering process, you have to make it somewhat nondeterministic.
I'm sorry, I'm not following this at all. When you say "better candidates are exponentially more likely to pass the filter", we're still talking about a metric, yes? A metric that can be optimized? Why would switching from a hard cutoff to some sort of stochastic filter weighted by this metric discourage optimization?
PRprogramjames11 小时前
Optimizing for the metric involves:
1. Optimizing for generally applicable skills that the metric is trying to measure.
2. Optimizing adversarially to hill-climb the metric.
You want candidates to do (1) and not (2). You can make them agnostic to the second by setting
d(expected gain)/d(opportunity cost) = 0
==>
expected gain \propto opportunity cost
It is the case that most metrics are logarithmic: it takes just as much effort to decrease one bit of error as the next bit. So
log(score) \propto (opportunity cost) \propto expected gain
Thus, for them to be agnostic, you should filter candidates proportional to their log-score on the metric (where 0 is a perfect score). Because generally applicable skills are generally applicable, they will still benefit from improving those, they just no longer benefit from adversarial optimization, unless your score function looks very similar to others who have not adopted this filtering process.
The issue with a hard cutoff is that people near the boundary are extremely incentivized to adversarially optimize, as it is usually cheaper than working on generally applicable skills and actually pays off for them. You see this phenomenon on AoPS where (esp. Californian) students talk about grinding for MATHCOUNTS instead of learning calculus.
SEseanieb 23 hours ago
It's always amazed me that a tech company will pay $300,000+ for a good engineer, because talent is so hard to find... meanwhile their recruiter operates unsupported and has a very different idea about what good looks like. Their ATS black-holes >50% of the resumes because its filtering heuristics are garbage, because recruiting selected the ATS because it has a Google Gmail integration or something, and the ATS's filtering technology was not reviewed by anyone in the engineering or data teams.
JOjoshmn 20 hours ago
I ran the ATS myself and had a similarly quirky experience. I was in the 70s because it couldn't find my GitHub profile, and then it didn't like some of the popular Ruby libraries I'm the author of.
After a few runs it picked things up appropriately. I always got dinged on formal education though.
This stuff is gross.
FEfernandopj 19 hours ago
Similar to my experience. It put me around 65 in some runs, because it didn't like that I don't have contributions to OSS.
Also, it doesn't pick up certifications or awards. I tried some PRs people are suggesting with enhancements (https://github.com/Zem-0/hiring-agent), it helps, but overall their ATS is hugely biased towards people with large GitHub contributions to OSS.
ROrobertlagrant 22 hours ago
I tried this with my CV, and it somehow scored me bonus points for GSoC!
BONUS POINTS: 5.0
------------------------------
Google Summer of Code (GSoC) participation: +5
Even though I've never done this, and don't claim to have done it in my CV.
FEfernandopj 19 hours ago
Happened to me as well. It is a known hallucination https://github.com/interviewstreet/hiring-agent/issues/240
ROrobertlagrant 17 hours ago
Thanks - interesting. Very odd, though.
0X0xbadcafebee 20 hours ago
This insanity only exists because the tech industry is standard-less. No formal education needed, no formal training requirement, no apprenticeship, no software building code, no professional organization. Resumes have never been a good predictor of success - and why would they be?? Even if they're truthful and it's "impressive looking", that doesn't give you any assurance of knowledge, of who they learned under, what they learned, that they passed some minimum criteria. We might as well be rolling dice. So why not an LLM that randomly assigns scores?
COconductr 18 hours ago
I have no data to lean on other than my experience and intuition but I’d say that’s not the case. My domain is corporate finance, which encompasses a lot of structured roles and certifications, yet I consistently feel the Resume is just a poor device for making any judgement calls. Having people summarize their career into 1-2 pages of bullet points just doesn’t mean much. Especially now that keyword packing is a thing. It’s just meant as an introduction/sniff test to open the door for a conversation. Then it allows for deeper more probing questions to be asked. This is where you’ll assess how impactful their contribution to a project actually was. Were they really living up to your definition of a manager, or were they more so an IC that had a lot of responsibility? Stuff like that.
> Resumes have never been a good predictor of success
Applies broadly to the world, it’s not unique to tech
0X0xbadcafebee 16 hours ago
The problem is we have too many applicants to phone screen them all. For a lot of jobs today you end up with 10,000 applications, which is why these automated resume-skimming systems exist, but unfortunately this page shows how they basically don't work.
COconductr 16 hours ago
People seem to hit a wall when flooded by resumes. They feel like there’s some needle in the haystack they need to find and it’s overwhelming. But you don’t have to read all of them. Or talk to all of them. Or use a system like this to filter.
If you know what you’re looking for, you just start skimming them and maybe ranking them based on your own rubric. If it’s an obvious “no” you can usually tell within a 5-second skim. Once you have a handful of high ranking ones, stop, and talk to them. Repeat as necessary until you have a short list of people you’d want to hire. There might be 9900/10000 resumes you never even looked at and maybe one of them would have been slightly better but you can’t let perfection be the enemy of progress. Stand by your conviction that the candidate is qualified, capable, and meets what you expect, hire them, and get back to business.
Having been in “talent shortage” mode for a long while I’d rather have 10000 resumes than 3. Having to pick one from a suboptimal selection is an awful position to be in, but sometimes a necessity.
GRgroundzeros2015 16 hours ago
Do you think fields that have formal criteria don’t use resumes with keywords? I bet Lawyers look for school names and big law firms all the time.
Credentialing helps maintain a quality floor. Does this person have basic employable skill? Nothing more. It actually doesn’t help you identify levels of talent and skill which is a universal hiring problem.
We do have a credential - a CS degree. And you can see it is a mixed signal. Employers can choose of their own free will to take risks on employees that do have this credential, or not.
Mandating by law that you must have a CS degree doesn’t seem to help our field as we famously have high performers across the spectrum of formal education.
A4a4isms 18 hours ago
Feels like "I Don't Hire Unlucky People" all over again, but with extra tokenmaxxing steps.
https://neonrocket.com/2014/05/rescued-from-the-ashes-i-dont...
ZXzx8080 19 hours ago
This is the new AI reality everyone around is wanting: nondeterministic computing.
There is another name for it: a waste of electricity.
But wait, not waste! Consumers paid for it fully, with nice profit margins.
You and me, paid.
Try using google flights, or booking.com: the prices shown in search results list are frequently significantly different from those in a single result. It's a nondeterministic compute when it's easy to spot it. But it's not always that easy.
It's all sad, to be honest.
REreactordev 19 hours ago
There should be laws against displaying wrong prices or different prices for who you are…
GSgs17 1 day ago
I'm a little confused, is this an ATS system that anyone actually uses? If not, I'm not sure how it's better than just asking ChatGPT to score your resume out of 100. Why would you want to optimize your resume for a system no one is using to score it?
BUBukhmanizer 1 day ago
I would assume at least hackerrank is?
I don’t think the point of a lot of this is to optimize your resume. It’s to show how arbitrary these systems are.
MAmarticode 23 hours ago
From my understanding this one is used for hiring tech workers only. The (very) widely used Workday application system, for example, seems to have its own built-in ATS.
PEpetesergeant 1 day ago
(Almost) everyone’s using some kind of ATS, every ATS is adding AI auto-ranking (and has been trying to for 15 years), and almost all HR people feel like they have too many obviously bad CVs to read. Whether or not someone is using this ATS specifically, if you submit several CVs to several places, your CV is going into at least one magical 8-ball.
4040four 1 day ago
“I'm a little confused, is this an ATS system that anyone actually uses?”
You read my mind. If the answer is “no”, then we can ignore this.
ANanother-dave 1 day ago
For one, if you go on to Hacker Rank's "Screen" page, they mention the product is used by Stripe/AirBnB/LinkedIn/Atlassian/IBM etc etc. I imagine that there's plenty more companies using it too.
But I'd also assume that their competitors are doing something similar so I don't think we as an industry can just ignore that it's happening.
4040four 20 hours ago
Interesting, thanks. I admittedly spent zero time looking into it :)
I’m surprised open source contributions count for so much. The first thing I thought was “is that something people actually list in a resume?”. But it looks like it pulls your GitHub account and appends that information.
That’s kind of unfortunate for anyone who doesn’t use GitHub.
GSgs17 17 hours ago
> HackerRank Screen compresses the top of the hiring funnel by replacing manual resume reviews and unstructured phone screens with structured, auto-scored assessments
That seems to be a different type of product.
BAbartread 22 hours ago
The takeaway from this for me is that, using an LLM to score anything takes multiple (maybe even many) runs and the result you’ll get is, at best, a sane-ish distribution.
Which sort of sounds workable until you scale it up to larger datasets, where at some point compute/time/energy costs will render it non-viable.
I am sure there’s some reasonable rule of thumb estimation on distribution that could be applied based off fewer runs per data artifact, but you’re always going to be trading off against confidence by doing this.
Beyond this, I’d bet that almost no implemented systems that use LLMs for scoring, ranking, or decision making use such a multi-run approach. Partly because people don’t understand their behaviour is stochastic, perhaps because a lot of people without a background in statistics don’t understand what stochastic actually means, and no doubt partly because of budget concerns: if you have to ask an LLM to do the same thing 10, 50, 100 times to get a sufficiently good result, then the cost saving argument is either weakened or completely destroyed.
There is at least one more aspect worth considering in the specific case of resumes/CVs: is the inconsistency of scoring by LLM worse than the inconsistency of scoring by a human following a similar process?
Because the reality is that, even for an experienced recruiter, reviewing hundreds or thousands of resumes or CVs gets pretty fatiguing. People get hungry, bored, tired, restless, irritable, etc.
That inevitably leads to inconsistencies creeping in, so there’s always an element of “luck” (or, perhaps better, uncertainty) as to whether your resume/CV passes screening.
So is that inconsistency better or worse with LLM screening? I don’t know. But, at least, if it’s not worse maybe it doesn’t matter for this specific use case. And if it’s notably better then maybe it’s raised the bar on what “good enough” screening looks like?
(And I’m sure other use cases warrant similar, “does it matter?”, questions, with the answers no doubt landing differently.)
CUCuriouslyC 22 hours ago
My experience with benchmarks and evals is that it can take ~20 runs of a problem for the distribution of answers to start to converge. Ideally you'd know the convergence properties of your algorithm ahead of time and make a Bayesian solution that makes the uncertainty explicit.
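The multi-run idea discussed in this sub-thread can be sketched roughly as follows. This is only an illustration: `summarize_runs` and `score_fn` are hypothetical names, `score_fn` stands in for whatever call produces one LLM score, and the plain standard-error summary is a simpler substitute for the Bayesian treatment suggested above.

```python
import math
import statistics


def summarize_runs(score_fn, n_runs):
    """Call score_fn n_runs times and summarize the spread of its outputs."""
    if n_runs < 2:
        raise ValueError("need at least 2 runs to estimate spread")
    scores = [score_fn() for _ in range(n_runs)]
    stdev = statistics.stdev(scores)  # sample standard deviation
    return {
        "mean": statistics.fmean(scores),
        "stdev": stdev,
        # Standard error of the mean shrinks like 1/sqrt(n_runs).
        "stderr": stdev / math.sqrt(n_runs),
        "min": min(scores),
        "max": max(scores),
    }


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    # Stand-in for a noisy scorer; a real one would call the model.
    fake_scorer = lambda: 70 + rng.gauss(0, 5)
    print(summarize_runs(fake_scorer, 20))
```

Because the standard error only falls with the square root of the number of runs, halving the uncertainty costs roughly four times the compute, which is the cost trade-off raised above.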
MAmakeavish 1 day ago
Hiring and job search has been so hard and AI has amplified the existing problems instead of solving any.
SEsevenzero 1 day ago
Wdym, can't you just litter your applications with buzzwords and other bs to automatically get a high score in these systems?
SZszszrk 1 day ago
The HR market is basically in an early Google rigging era, where you can place hundreds of keywords in the footer (white text on white background) to start popping up on random searches.
MAmakeavish 23 hours ago
I have been on both sides of the market, and it sucks so bad at both ends. Companies that deeply care about their next hire are struggling to hire, and actual great people looking for work are outcompeted by AI slop and AI bulk applying.
It is actually a very hard to solve problem.
KAkailpa1 1 day ago
From `resume_evaluation_system_message.jinja`
> *SCORES MUST NEVER DEPEND ON THE FOLLOWING FACTORS:*
> - College, university, or educational institution name
> - CGPA, GPA, or academic grades
I don't understand why they would omit these factors from the evaluation.
SWswiftcoder 1 day ago
> I don't understand why they would omit these factors from the evaluation.
Only hiring MIT graduates sounds great to a lot of tech folks! Automatically rejecting applicants from HBCUs, however, sounds like a lawsuit
As to the GPA thing, I think it's just to stop the LLM glomming onto an obvious numerical grade? LLMs like to rank things by obvious dimensions, and whether someone had a 4.0 or a 3.8 in grad school makes very little difference to their performance 10 years down the line.
CEceejayoz 19 hours ago
https://qz.com/1427621/companies-are-on-the-hook-if-their-hi...
> But it didn’t. After the company trained the algorithm on 10 years of its own hiring data, the algorithm reportedly became biased against female applicants. The word “women,” like in women’s sports, would cause the algorithm to specifically rank applicants lower. After Amazon engineers attempted to fix that problem, the algorithm still wasn’t up to snuff and the project was ended.
And in another org:
> After an audit of the algorithm, the resume screening company found that the algorithm found two factors to be most indicative of job performance: their name was Jared, and whether they played high school lacrosse. Girouard’s client did not use the tool.
https://www.npr.org/2024/04/11/1243713272/resume-bias-study-...
> Their working paper, published this month and titled "A Discrimination Report Card," found that the typical employer called back the presumably white applicants around 9% more than Black ones. That number rose to roughly 24% for the worst offenders.
It'll discriminate by proxy, basically.
BUbulder 22 hours ago
I don't understand why they'd hand those data points over to the model in the first place. If it's in the context window, it's impacting the output. To ensure that no weight is placed on those factors, they should be sanitizing them out before handing the data over to the model.
SPsph 1 day ago
Hopefully so that people like me, that dropped out of high school yet have had a successful career as a self-taught engineer, have a chance. [1]
Just kidding, my resumes are sent to /dev/null like everybody else’s.
——
1: In fact, I will be controversial and say that self-taught engineers tend to be the strongest in their own particular niche, because they are powered by sheer desire to learn and improve. I am routinely appalled by how many people go on forums to ask how to learn a new thing, completely unable to self-direct their learning. I blame the modern school system.
KAkailpa1 1 day ago
I'm a self-taught programmer as well, who dropped out of university, and these factors being omitted would benefit me as well, but I feel like good grades and a good university are still indicators of someone being, or being capable of becoming, a good programmer.
This system would drop a Harvard top graduate for someone having a year of experience in some outsourcing firm.
MKmk89 12 hours ago
At my company someone has introduced an internal tool that should help understand and give a "score" to design documents from teams.
Needless to say, this tool gives scores exactly like the article mentions. Same document, same LLM, same prompt, and different results. It becomes even more ridiculous once you switch to other models, or if you ask a model to review the work of another model.
I am not sure why we insist on making LLMs do the work they are not supposed to do and/or in a way they are not supposed to do.
The worst part is that people are aware of the problem but they just ignore it and consider it as "a reference number, just to have an understanding".
If it were like that, it would be less of a problem. The issue comes from the fact that eventually someone without enough knowledge will trust the output (so X points out of Y is how it is), or someone will stop challenging the output and consider it for their process - like in this unfortunate case of hiring.
At a certain point, people who don't know what they are doing give a tool that doesn't know what it's doing to people who don't know what they are doing. A pure mess. And everyone has to comply and applaud. If you go against it, you are against AI.
This is what I hate the most about AI. Not the tool, but the shortcuts we're willing to take to justify its existence.
TAtasuki 1 day ago
> Sometimes my projects “lack architectural complexity”
Well done you! It is difficult to avoid architectural complexity, but imho well worth it.
SAsaidnooneever 1 day ago
Count to three, no more, no less. Four shalt thou not count, neither count thou two—excepting that thou then proceed to three. Five is right out.
> Look at the softmax function and take the limit as T->0. It becomes a dirac-delta function.
In pure math, it does not always do that. It becomes a dirac-delta comb with equal weight on every maximum. There can be more than 1 maximum. Setting the temperature to zero turns into greedy sampling, but greedy sampling is not necessarily deterministic as you can have multiple equally optimal options.
That is not a problem for LLMs, because in practice floating point inaccuracies (in particular after exponentiation) prevent values from being exactly equal. That's why greedy sampling generally produces deterministic output for LLMs. The real gotchas are elsewhere (like with batch inference as we've seen with earlier GPTs). But unlike what the earlier comment says, this is a non-issue mathematically.
There's a difference between f(0) and lim t->0 f(t). We just chose to treat this function as a "staircase function" where f(0) = lim t->0 f(t), with the general formula used for f(t != 0).
> It becomes a dirac-delta function. I.e. in a discrete setting (like for LLMs with a finite set of output tokens), probability P becomes one for argmax and 0 for everything else. Only in coding practice it is easer to implement T=0 as a simple if check that directly chooses argmax instead of calculating the limit of some function that includes 1/T quotients.
I don't understand the distinction you're drawing. A Dirac delta function is a "simple if check".
The point is that the case T=0 doesn't just "exist" as a special code branch - it is still well defined mathematically without any change to the output function. What the above comment refers to with the extra "if" check is just a limitation of computers not liking to divide anything by zero, even if the actual function exists and is well behaved at zero. It is not some weird or special theoretical construction.
> Mathematically, as lim temperature->0, the distribution gets spikier and spikier, the most likely sample goes to almost-but-not-quite infinity and the rest go to almost-but-not-quite 0.
That's not how limits work. As the temperature goes to 0, the rest goes to 0. That's it. The "almost-but-not-quite" is part of the "goes to". Let's say f(x) = 3x+1. It's a continuous function. If we let x go to 10, f(x) goes to 31. Not "almost-but-not-quite 31". No, to 31. (If you don't have a continuous function then it's the same argument, but less intuitive to illustrate.)
Even if it's deterministic that doesn't mean it isn't arbitrary. I can achieve determinism at any temperature by saving the seed. But that wouldn't make rejects feel much better knowing that if a bit was flipped in an arbitrary seed they would be scored differently.
> However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run, so what you sample from it also differs.
Exactly. While I’m assuming this won’t be news for most here, for those that are still new and/or curious about some more explanation on e.g. the floating-point imprecisions, see this nice article: https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
I did large-scale tests at temp 0 and there was still randomness with the same prompt inputs coming in. I did this with several model APIs. From what I read, GPU processing is not going to be the same from run to run, but the AI backend is also doing a lot of fancy batching, resulting in another layer of randomness.
It is not deterministic because the order of computations in a typical multithreaded system is not deterministic and also because when combined with the devil that is IEEE754, it gets even less deterministic.
As I understood it, the "randomness" affecting what is selected at any temperature still comes from a PRNG or CSPRNG (or whatever RNG you want, maybe a hardware one), and if you were to swap out that with something deterministic you'd get the same results every time (barring non-determinism in other parts of the OS/drivers/maybe even hardware). But theoretically, the output of every LLM is seed-driven (or could be if you wrote the software to isolate it) just like any computer software. It's just none of the software written (even llama.cpp AFAIK) chooses to support stable-seeding due to the changes in stuff like CPU/Vulkan/CUDA/Metal differences making it difficult to make consistent. They could though! Hopefully one day someone implements it into the mainstream LLM-engine software and it gets exposed in the APIs serving the models. It'd do a lot to show folks the "internals" of these models.
It's probably due to the fact that it's a cloud service. You have no guarantee that your next request will go to the same machine. So even with an identical seed, and temp 0 you might get different hardware and hence different accuracy/noise in the floating point operations.
How can there be noise in floating point operations? I could buy something like completion order for parallelized batches, i.e. adding a+b+c is different from a+c+b, etc.
Stable seeding is not enough. A lot of modern, fast compute kernels are nondeterministic. Floating point multiplication/addition is not strictly associative and e.g. reductions can combine results from different threads in different orders (e.g. through atomic ops). You can write kernels to be deterministic, but it is generally less efficient.
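The non-associativity point above is easy to reproduce on a CPU with ordinary doubles. A minimal sketch (the helper name `sum_in_order` is made up for illustration; real GPU reductions are far more involved, this only shows that summation order alone changes the result):

```python
def sum_in_order(values, order):
    """Add values[i] into a running total, visiting indices in the given order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


if __name__ == "__main__":
    values = [0.1, 0.2, 0.3]
    left_to_right = sum_in_order(values, [0, 1, 2])  # (0.1 + 0.2) + 0.3
    reordered = sum_in_order(values, [1, 2, 0])      # (0.2 + 0.3) + 0.1
    print(left_to_right, reordered, left_to_right == reordered)
```

The two totals differ only in the last bit here, but in a network such differences pass through many layers and can occasionally flip which logit is largest.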
They are only non-deterministic when you’re doing batching and a kernel ends up running across a “random” set of token streams. If you’re only processing one user’s request, they’re very much deterministic.
that's incorrect in the presence of batching. it's tough work making it truly deterministic: https://x.com/FireworksAI_HQ/status/2069873437217276015
It's not that hard. What is hard is making it truly deterministic and retain high throughput.
PRNG is deterministic.
If you make an exact integer implementation and run with temp=0 it's deterministic. You don't even need temperature 0, just make a random seed for the sampler part of the input and then it's deterministic as a function of the input. But running autoregressive models at temp=0 tends to expose pathological behavior, because the training process produces a function with a lot of gain so it's prone to feedback on its own noise.
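The "seed as part of the input" idea can be sketched with Python's standard PRNG. This is a toy, assuming the token probabilities themselves are reproduced exactly (which, per the rest of the thread, is the hard part); `sample_tokens` is a hypothetical helper, not any inference engine's API.

```python
import random


def sample_tokens(probs, n, seed):
    """Sample n token indices from probs using a PRNG seeded with `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    return rng.choices(range(len(probs)), weights=probs, k=n)


if __name__ == "__main__":
    probs = [0.8, 0.19, 0.01]  # made-up next-token distribution
    print(sample_tokens(probs, 10, seed=42))
    print(sample_tokens(probs, 10, seed=42))  # identical to the line above
```

Sampling at a non-zero temperature is then a pure function of (probabilities, seed), so any run-to-run variation has to come from the probabilities changing, not from the sampler.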
> However, due to things such as batching and even different kinds of floating point imprecisions for different algorithm implementations, the probability distribution itself often differs run-by-run
The implementation does not often differ run by run.
> The implementation does not often differ run by run.
If you use a cluster, or even multiple clusters, and they have non-identical hardware, then two consecutive runs could end up being routed to nodes having different GPU models with slightly different floating point behaviour, or even software differences (e.g. newer GPU offers some feature usable to speed up calculations which older model lacked; same code can use the feature when it is available, fall back to slower alternative if it isn’t). The larger your scale, the greater the odds it will happen.
The whole problem of text understanding is a problem of reasoning under uncertainty, that is, you can't really be sure which witch people are talking about all the time. A person you might hire might be successful or unsuccessful at the role, no matter what hiring process you use. Two people might look at the same resume and come to different conclusions. Two patients with the same symptoms and clinical presentation might have different diseases, etc. I don't buy the story that the old AI died primarily due to the cost of knowledge base maintenance [1]; I think it was rather the lack of a universal system of reasoning over uncertainty. For me it's a running gag that Spock was always saying things like "Captain, we have a 21% probability of surviving this mission" when Bayes teaches us your probability distribution has a probability distribution, "we have a β(5,1) chance of surviving this mission" is more like it. To that end it wouldn't be too crazy to run a resume through that machine 100 times and look at the probability distribution of the score.
[1] then again I am the kind of maniac who will sort images on a tablet lying in bed until my visual system malfunctions
To be clear, temperature 0 is deterministic and will produce the same output for exact duplicate inputs, across all seed choices. Provided:
* If it’s MoE we are talking about, that the duplicate inputs are for the whole batch (yes, your batch neighbours can impact your choice of experts. Blergh.)
* Your kernels are deterministic
* There’s no system wide effort switch that responds to, e.g. work load across the cluster (for a thinking model)
Upshot: Temperature 0 is not deterministic in probably any existing cloud infra, but it could be for edge inference pretty reliably.
To your quibble on 0.1 being more deterministic - I think it’s a pretty fair summary - we’re going to sample much more from the ‘temp 0’ answer at 0.1 than we would at temp 0.9, no?
Even then it's deterministic in the way a hash function is deterministic. Change one letter and you can get a completely different output. What people actually want is something continuous.
Agreed on the desire for continuous behavior. That said, in a modern LLM, is this hash analogy accurate? I would be surprised if a single letter changed most zero temp force ranked outputs. E.g:
“Where is the Eiffel Tower Located? One word only.”
“Where is the Effel Tower located? One word only.”
“Where is the Eiffel Tower located? One wor only.”
I’d be very surprised if those got different answers from even a small local model at temp 0.
This is it. People mistake deterministic for precise/exact/correct. It's not.
A distribution with all probability mass on one outcome is deterministic, so in principle, setting temperature to 0 _should_ result in deterministic outputs. There are a few reasons it might not, but I don't think any of these apply when running a local model like the author did.
> so in principle, setting temperature to 0 _should_ result in deterministic outputs
It is a common misconception, but it is not true even in principle. If I have 2 or more logits which are equal to the maximum of my logits, I will sample uniformly random from them with any temperature, even zero. Sampling from softmax([1, 0, 1]) is still stochastic at temperature 0, because the limit is to sample uniformly from the first or the last element. Anyway: "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs. GPUs put the associativity of the sums in matrix multiplications in arbitrary order, and this has a huge impact on the logits coming out of the neural network.
> "GPUs don't do deterministic matrix multiplications" is the biggest source of randomness in LLMs.
But this isn't a fundamental property of LLMs, it's just an implementation detail. It's pretty obvious that if you evaluate the matrix multiplications correctly and deterministically sample from the highest-probability outputs, you will have a deterministic LLM.
You don't have to sample uniformly. You could take the lowest index of all maxima. But yeah, the main source of randomness is non-deterministic matmul, and temperature does nothing with it
> GPUs put the associativity of the sums in matrix multiplications in arbitrary order
That’s user-controlled too, not an inherent property of GPUs: https://docs.pytorch.org/docs/2.12/generated/torch.use_deter...
There are. If the kernels are nondeterministic (e.g. timing issues) there are minor changes between runs, on a single system, even with eager decode enabled (typically what temperature=0 achieves).
Setting the temperature to 0 should give deterministic results but that's not any better - it's just hiding the huge variance by only taking one sample.
So you would get always the same result, but it could be the wrong one
Of course, nothing can guarantee the right answer from LLMs
I mean the easiest explanation would be that the model harness doesn't always take the most likely token but does top-k sampling or similar. Raising the temperature just means that probabilities get more and more equalized, boosting the chance that an unlikely token gets picked. But even with temp 0 you could have 0.8 T1, 0.19 T2, ... and sometimes sample T2.
No, this can't happen at temperature 0. The formula defining temperature-adjusted softmax isn't strictly defined at 0, but taking the limit (in the case where all logits are distinct) results in probability 1 being placed on the largest logit. Samplers will typically special case temperature 0 and pick the most likely token at each step.
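For readers following the math in this sub-thread, here is a small sketch of temperature-scaled softmax and the usual temperature-0 special case. The function names are illustrative, and taking the lowest index among tied maxima is just one possible tie-break (the one suggested earlier in the thread), not what every sampler does.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return softmax(logits / temperature) for temperature > 0."""
    if temperature <= 0:
        raise ValueError("use greedy_pick for temperature 0")
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Temperature-0 branch: argmax, lowest index wins on ties."""
    return max(range(len(logits)), key=lambda i: logits[i])


if __name__ == "__main__":
    for t in (1.0, 0.1, 0.01):
        print(t, softmax_with_temperature([2.0, 1.0, 0.0], t))
    # With tied maxima the limit splits the mass instead of picking one token.
    print(softmax_with_temperature([1.0, 0.0, 1.0], 0.01))
    print(greedy_pick([1.0, 0.0, 1.0]))
```

With distinct logits the probability of the top token approaches 1 as the temperature shrinks, while tied maxima keep sharing the mass equally, which matches both sides of the argument above.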
> This is not correct
Several of my claimed AI-expert colleagues repeat this as though it's gospel. I've heard "set the temperature to 0 so we get consistent results" more times than I can count.
I imagine it's much like game-developers saying: "Set a fixed seed so the player gets consistent results." Yeah, it can work, but it is subject to so many potential pitfalls that you can't casually assume it will. It's a property you have to actively design-for and rigorously test to be sure the system can deliver it for some particular scenario.
> resumes are just dumped in some LLM black hole and no one really knows how it works.
Not that I'm defending AI, but HR departments rarely knew how their ATS ranked and sorted applicants before they were AI powered.
> I'm happy to see in-depth pieces like this
It's somewhat ironic that this "in depth" piece was written by an LLM as well.
> temperature is not some kind of "deterministic" switch, but rather it affects the sampling distribution (which becomes more "spiky"—but is still very much a distribution).
You're correct. The confusion arises because we use the word "non-deterministic" when we mean "probabilistic". I tried to explain it better: https://www.lelanthran.com/chap15/content.html
[deleted]
A spikier distribution is exactly what makes the distribution closer to deterministic. That's not the point, though. Even in greedy (deterministic) decoding, it is still a black box that reacts to its inputs in unpredictable ways. Switching one word around might lead to different scores, for example.
Yeah, this is the forest that the people arguing about math trees are missing. It doesn't matter that the algorithm is deterministic if the algorithm passes the input through a cryptographic hash function to make a yes/no decision. The result may be perfectly reproducible and still nonsensical in its distribution with respect to its input domain.
He said it nudges it to be more deterministic. Your comment is not correct.
Agree
Small refinement: the underlying model isn’t stochastic at all. The forward pass is a deterministic function of the weights and input, it just produces a probability distribution over the next token. The stochasticity is an optional sampling step layered on top, not something inherent to LLMs. Greedy/argmax decoding (or temperature 0) makes the whole thing deterministic. So “purely stochastic” overstates it a bit: the distribution is computed deterministically, and you choose whether to sample from it or not.
There are more layers to this problem, if we want to get into the details. The LLM is defined in terms of floating point operations, and those are not actually fully deterministic on most hardware and in most performant implementations. IEEE 754 only specifies precision requirements for certain operations, not precise bit patterns (e.g. for exponentials). So, at least in principle, the same hardware performing the same operation could produce different results at different times, as long as they are close enough to the theoretical answer. I'm not sure if any hardware actually works like this. Under IEEE 754, many of the basic arithmetic operations are also not associative, so any reordering (which is common when batching multiple queries at the same time) will introduce indeterminacy from the perspective of your own query (that is, the result for your query will change depending on what other query happens to be processed at the same time, which is not under your control). Finally, even if we take the case when a query is processed alone, and even if one particular piece of hardware is completely deterministic, the result will be different on different hardware, which can again look like non-determinism if you're sending your query to a load balancer. So, the math for LLMs is deterministic in theory, but implemented with non-deterministic approximations and optimizations in practice, and the results are then normally used only as a probability distribution to be sampled from.
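The non-associativity point is easy to check by hand. The sketch below is a contrived illustration, not a claim about any particular inference stack: the term values, the 0.5 logit, and the helper names are all made up. It shows the same numbers summed in two orders giving different floats, and how that can flip a greedy (argmax) choice.

```python
def sum_in_order(terms):
    """Add the terms left to right, the way a sequential reduction would."""
    total = 0.0
    for t in terms:
        total += t
    return total

def greedy_pick(logits):
    """Return the index of the largest logit (temperature-0 style decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Classic small example: regrouping changes the last bit of the result.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

# Contrived example: the same three terms, accumulated in two different orders.
terms = [1e16, 1.0, -1e16]      # the 1.0 is absorbed by 1e16 and lost
reordered = [1e16, -1e16, 1.0]  # the big terms cancel first, so 1.0 survives

logit_a_first = sum_in_order(terms)
logit_a_second = sum_in_order(reordered)
logit_b = 0.5  # hypothetical competing token

print(logit_a_first, logit_a_second)
print(greedy_pick([logit_a_first, logit_b]), greedy_pick([logit_a_second, logit_b]))
```

Real logits are not this extreme, but when two candidate tokens are nearly tied, a reordering of the reduction (for example due to batching) only has to move the last few bits to change which one greedy decoding picks.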
[deleted]
Every time people point out a limitation or constraint of LLMs, I see a comment that is to the effect of “but humans…”. I don’t understand why this comparison is relevant to this particular thread. Is it just an amusing similarity?
I think it is often useful to push the conversation toward "we built a system for humans that dealt with this; what from that is or is not applicable for agents in the same context?" Humans randomizing resume review for screening is pretty well known; I've seen companies try to fight it with things like hiding information, panel reviews, etc. It's unclear to me how effective those would be for agents (honestly, it was unclear how effective those were for humans). I was depressed about the hiring process before we had AI screening and I remain depressed about it.
It may seem trite but the point is that if separate humans were assigned the same task the LLM was here the results would be similarly non-deterministic.
We expect computers to be consistent on the other hand. A calculator will always give you the same answer unless some chip gets struck by a particle. LLMs are on computers and should be fairly consistent too.
And this lies at the heart of the problem. We expect computers to be consistent despite running programs that are not designed to be consistent. This despite the fact that we have lots of experience of programs running on computers that produce wildly inconsistent outputs. But for some reason some people choose to assume LLMs should act like a calculator instead of any of those programs.
What's even worse, different humans have different weights. If you train two different LLMs and replace what data they "see" in batch n, that doesn't affect the data they see in batch n+1, or any further batches. In LLMs, you can introduce "noise" into the training process, but that noise doesn't really compound. Humans learn from experience, not from data, and their experiences at age n shape what experiences they seek (and hence train on) at age n+1. A small amount of "noise" injected into their "training", let's say hearing a group of friends discuss a movie while their identical tween goes to the bathroom, can compound into them watching that movie, which can compound into them forming an identity around that genre, and so on, until they're two completely different people, trained on completely different "data mixtures".
> What's even worse, different humans have different weights. Far worse would be different humans having the same weights.
The same person is not going to give you three different answers within span of minutes. Especially when nothing fundamentally has changed. People might or might not update their views depending on their biases.
I'm pretty sure personality tests are designed the way they are precisely because a single person can hold fundamentally different (or conflicting) beliefs about himself within a matter of minutes. You can say "I am an honest person" and the next minute you can say "I never lie", and both cannot be true for an average person.
Test retest reliability is a thing in psychometrics.
[deleted]
a studied example is sampling judicial decisions before lunch and after lunch. judges are more lenient on a full stomach.
[deleted]
That was a single study and its finding is at the very least disputed, if not debunked, e.g. https://news.ycombinator.com/item?id=41091803
how did they account for sampling bias? a judge might leave easier cases for after lunch. people with control over their schedules usually ease themselves back into it after breaks.
It's a bad idea in general to use a non-1.0 temperature. There is a reason labs strongly recommend using 1.0. Using a low temperature is more deterministic, but the cost is that the model becomes "dumber"
1.0 is actually pretty arbitrary and way too high as a general rule. Something like 0.3 is a more sensible default
1.0 is "natural units". If your energy corresponds to nats, you should be using temperature 1.0. If your energy corresponds to bits, you should be using temperature ln(2) ~= 0.7. The optimization pressure is: max nats = max entropy + energy / temperature. Why might energy correspond to bits or nats? Imagine your goal is to play as many interesting games of chess as possible in a tournament. This implies you have to keep winning. If you look at the RL environment from the right perspective, you can turn it into optimizing bits or nats.
If RL was used to train the model, the model will have been trained on its own sequences. Those will have been generated with a temperature of 1.0. They must be, otherwise you would get a premature collapse or explosion of your entropy if the temperature was respectively lower or higher. After that RL step, you want to stick to the RL distribution, and so keep a temperature of 1.0. Other temperatures will drive the model out-of-distribution. That is why the sampling step for agents or thinking LLMs is usually kept at a temperature of 1.0.
It really depends on the application does it not? I'm not an LLM guy, but for creative tasks like storytelling wouldn't you want a higher temperature usually? Happy to gain insight from anyone with experience here :)
Heavily depends on the model architecture and the implementation though, I don't think you can say what values are better than others without first specifying those, otherwise it's straight up guessing, ironically.
If you use a model in a configuration far from where it was RLed you get no warranty. (you also get no warranty the other way, however)
It can be useful for pure translation tasks and stuff like that where you explicitly don't want creativity of any kind.
Would 1.0 have fixed the wide variance in scoring?
Temperature is the wrong tool. The variance is caused by the bad evaluation prompt. If you ask "what is the capital of France" you'll always get Paris, with any (non-extreme) temperature
Plenty of setups default to lower values than 1.0.
Willing to be corrected but I believe this type of automated resume filtering is illegal. Not saying it never happens but my understanding is it is not typical.
I would expect that to depend on jurisdiction. I don't know for sure, but I would be surprised if it was illegal in my particular US state. You might be able to argue the AI has inherent biases that introduce illegal discrimination in the hiring process, but my understanding is that winning a case like that would be very difficult, especially since most employers are very cagey about their hiring process and why they made a decision.
They don't need to actually filter/blackhole to have virtually the same effect. Show someone a list of resumes with an "applicant score*" and they'll naturally ignore the ones with a low ranking. *Scores are generated with AI; mistakes may be made; use only as a guide and verify results.
In situations where you get hundreds of applications for one open position (the real market now), whatever reduces your pool to a size a human can handle works. You can preserve some diversity metrics in the process. This particular filtering is rather primitive, but an LLM as a first filter can definitely do the job. You may spend less on tokens than the hourly rate of your HR, and it will be fairer than just dumping 50% of unread CVs in the trash.
Great until someone realises you’ve filtered out minority groups from the application process (most developers are men so maybe the LLM decided they’re the best fit, but you’ll never know exactly why it screwed you over) and you suddenly have an expensive lawsuit
Under GDPR, you have the right to request manual processing whenever personal data is processed automatically to make a decision about you that has "significant impact". Not being hired seems like it would qualify.
Illegal where?
And this, plus the tendency for AI to "prefer" AI-produced code, plus some other AI biases, is why this is most likely highly illegal to use in the EU: it violates anti-discrimination laws in multiple ways. To be clear:
- randomly filtering "too many" resumes is pretty much allowed (I think)
- but it must be actually random, independent of the resume (and can be in multiple layers, i.e. random filter > pre-select > random filter > select)
- this isn't the case for AI, as its random aspect is not independent of the actual resume evaluation
- in general you can't make sure the AI doesn't apply systematic biases, and there are strong indications that it does
- for humans, you can train them and order them to ignore their biases. This won't work reliably either, _but now you have delegated the responsibility for illegal biases to the hiring personnel violating the order_. For AI usage, you are responsible no matter what you tell it.
Lastly, you can technically show/prove that a specific AI is highly biased in specific contexts, which for human employees is technically possible but not really practical. So this moves "specific, mostly deniable" cases into "systematic, proven bias" territory. Or in other words, legal risk goes from "limited/no issue" to "people can systematically f-you over if they know you use AI for hiring".
Everything is correlated to everything [1]. Which means there's a good chance this is somehow correlated in one way or another to race/gender/other protected classes in the US, just by the math of everything being correlated to everything. Which means this is one good lawsuit away from being illegal in the US as well. It doesn't even necessarily have to "win", just do well enough in court to scare away anyone else from using this. And boy oh boy would I hate to be on the receiving end of this lawsuit, trying to prove that my AI screener is completely in compliance with all hiring laws. That sounds like a nightmare. [1]: https://gwern.net/everything
Already happening with Workday in California: https://news.bloomberglaw.com/litigation/workday-loses-bid-t...
Would the accused party have to prove compliance? Or would non compliance have to be proved by the accuser? Honest question, I'm not American.
"Innocent until proven guilty" is a criminal court concept. This would be a civil suit. Those use different standards, like "preponderance of the evidence". I agree that if the claimant had to prove the AI system is violating employment law that that would be a hard bar to clear, but showing on the preponderance of the evidence is something that would have me a lot more nervous if I was on the receiving end of the lawsuit. This is a highly general answer to a complicated topic; my main point is more that this is not going to be held to the standard of "beyond reasonable doubt", which would be hard to meet. [1]: https://www.law.cornell.edu/wex/preponderance_of_the_evidenc...
>Which means there's a good chance this is somehow correlated in one way or another to race/gender/other protected classes in the US, just by the math of everything being correlated to everything. >Which means this is one good lawsuit away from being illegal in the US as well. Uhh.. what? No, that doesn't follow at all. Screening resumes in a way that correlates to race, gender, etc. is not illegal. This is a fundamental distinction. The law is that you cannot use those as filters. But the outcomes likely will be correlated. In fact, to ensure they are not correlated you'd have to break the law and control for race, gender, etc. Which is racism. The models don't even get race as an input. If they did and they used it to select then yeah, that lawsuit sounds like it has merit. But a mere correlation in outcomes? In no way illegal whatsoever.
The US has a notion of "disparate impact"[1] that means you can be liable for discriminating based on a protected characteristic on the basis of correlation. This is why HR departments are very hesitant to use things like IQ tests for screening candidates, for example. [1]: https://www.congress.gov/crs-product/IF13057
I wouldn't doubt that lawsuits for employment discrimination for any company (and I suppose it was most of them) that used LLMs in hiring processes will become a very lucrative business. They are all open to civil suits at this point.
And, if there aren't enough lawyers to do all that work, you could use AI to file the suits. I'll let you decide whether that's a dream or a nightmare...
> randomly filtering "too many" resumes is pretty much allowed (I think) It's totally fine to filter out resumes in a completely random, content-independent way. Grabbing the fourth resume down in the pile and offering them the job is a perfectly fair albeit stupid way to make a hiring decision. However, AIs are very, very good at capturing biases, and it would not at all surprise me if an AI told to filter resumes ended up filtering with some biases for things that you definitely do not want to filter on, like the name of the candidate. And it might be that every resume that claims it fixed a typo in a major open source project gets a pass, but resumes that only list their own projects get rejected 60% of the time, so you're losing more good candidates than bad.
I'm not sure it is very easy to show this is a breach of non-discrimination requirements, like under Council Directive 2000/78/EC for employment. Because it acts like an irrational gambling machine, I agree it can have unwanted indirect discrimination effects in general. But it will probably not differentiate "on the grounds of religion or belief, disability, age or sexual orientation". It is possible, but that would take a lot of work for the lawyers to prove to the court. I believe the more interesting part is the EU AI Act (still not in force in this regard until 2 December 2027). This will clearly be a high-risk AI system: "AI systems intended to be used for the recruitment or selection of natural persons, in particular to place targeted job advertisements, to analyse and filter job applications, and to evaluate candidates". Which does not mean prohibited, but it could later turn out that LLMs will be excluded from being used in high-risk AI use cases (falling under article 6 with no exemptions). Considering that none of the standards are published yet, I have absolutely no idea how they will ensure compliance with the following parts of Article 10 when using LLMs for such tasks: "(f) examination in view of possible biases that are likely to affect the health and safety of persons, have a negative impact on fundamental rights or lead to discrimination prohibited under Union law, especially where data outputs influence inputs for future operations; (g) appropriate measures to detect, prevent and mitigate possible biases identified according to point (f)" I don't think it's technically possible to do so with LLMs in general at the moment, even with the full cooperation of the model providers. Maybe you can do some meaningful audits for smaller models. But the EU AI Act may end up excluding all the generic "using-LLM-but-not-entirely-sure-why" vibe coded approaches from high-risk use cases (in Annex III). Which would make sense.
https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng
The EU AI Act got hijacked by huge corporations with last-minute changes which moved it from "could probably work" to "catastrophe". Even on 2 December 2027 it might intentionally not be enforced at all for a while because of that, though I think the current goal is to amend it before then. > that LLMs will be excluded from being used in high-risk AI use cases No, they won't, I can guarantee you this. At best they will get additional restrictions over time, as things go wrong. Anyone who could make this happen has way too much interest in not making it happen. (Most/all? EU countries' legal systems are overloaded to the point of not working correctly anymore, and were so before AI-generated lawsuits and other AI nonsense started. I won't go into detail, but many believe AI assistance (for certain tasks, always with a human making any final decisions) is the only way to get out of this mess.) > standards are published yet Or exist. Seriously, this isn't a case of there being non-public WIP standards which will pin all the nitty-gritty details down, but a case of state agencies (and, in the last instance, judges) having to decide whether a specific standard (or implementation) is sufficient or not. To some degree it also shouldn't be tightly coupled to tech standards, as there are often many ways to implement the things the law requires, and accepting only one is undesirable (and likely wouldn't hold up legally). But having tech standards which are "guaranteed to be enough if you comply with them" (but not the only valid way) would have been preferable, bringing us to the next point. > have absolutely no idea how they will ensure compliance Nor do they know. The original version, before the big-corporation hijacking, had exceptions for most of the companies affected now. So it would only have affected a handful of huge companies, which already have many of the required things in place in some form or another.
Most likely this would have played out as these companies presenting how their measures are "sufficient" and the agencies then evaluating them and potentially requiring some changes, going back and forth over a longer duration, leading to documented cases of rough technical standards about "what is sufficient" that they could then pass on to other organizations in the future. But now the law affects not just a handful of companies but thousands, if not tens of thousands, many not staffed in a way where such a process could work, or where they could even do the necessary documentation to show "compliance"... So from a practicability POV, if enforced starting 2027, it currently excludes close to _any_ (meaningful) use of AI, down to a trivial linear regression or similar, including any "old school ML/AI" any bank uses for risk assessment. Banking ceasing to run in December and there being no (meaningful) AI startups or adoption at all is not something anyone (in power in any state organ) wants to see, so guess how much it will be enforced ;) And as mentioned, the chance of AI as a technology being excluded "in general" is close to none. Maybe specific usages could be excluded (and/or are already excluded), but that's it. Oh, and as a bonus, a malicious reading of (f)+(g) removes any proper privacy protections for any AI usage in high-risk contexts, where they are often most relevant... (a more sane reading allows them, with ... tricks).
> this is most likely highly illegal to use in the EU due to violating anti discrimination laws in multiple ways. It's generally illegal under GDPR Article 22. > The data subject shall have the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her. Exceptions in 22(2) are unlikely to apply. It's hard to argue that it's truly necessary (a) and consent (c) is almost always unavailable in employment context. (b) might apply, but it requires specific law in EU or Member State to authorize it.
For C: I'm not sure how EU laws work, but ethics says that someone who needs a job cannot give consent since the possibility of a job if they give consent could be a bribe. See a lawyer for how it works in your country.
Also not fully sure, but AFAIK there are limits to how far you can waive this right in the context of things like TOS, simple opt-in fields on forms, etc. YT would have loved to make you opt out of it (and probably has it in their TOS), but as far as I remember there were multiple cases of courts forcing them to handle it properly in the past. My _guess_ is that, at least if you don't sign a proper contract, you can always force a human reevaluation. But also only that (so only semi-useful). Also, even with a proper contract it's unclear whether it would be possible in this specific case, due to the contract being fundamentally one-sided/unfair and semi-forced on you if it were widespread on the market for the specific job you are trying to get.
That's why I said consent usually cannot be used in employment context. I wouldn't rule it out 100% for everything employment related, but application screening is unlikely to qualify for those rare cases.
This isn't quite how GDPR Article 22 works. There is a difference between having a right you can't waive (which is very similar to something being forbidden) and having a right you can fully or partially waive. Furthermore, to some degree you are only "subject to a decision based on ..." if the decision has an effect on you. In practice, wrt. Article 22, this means companies can make a "decision solely based on automated processing[..]" iff they give you a (realistic) chance to object to it, in which case they will do a human review where a human confirms or changes the decision based on reviewing the information involved. There is a lot of gray area around what a "chance to object" means and when a human review makes a decision no longer "solely based on automated processing" (a human just saying the AI was right clearly doesn't count, but a human constructing a case for why they would have decided the same way, based on why the AI made the decision, can count, iff it's reasonable to assume a human might have come to that decision had it only been reviewed by a human). Or in other words, GDPR Article 22 is just "so-so" meaningful in the context of hiring. If the AI made a mistake they have to reevaluate it, but as long as there are other similarly qualified competitors (whom they did hire or are in the process of hiring) it's quite easy to come up with a reason why those are a better choice for them. Or they go through the motions of you being in round 2 or 3 of hiring and then find an excuse not to hire you.
Mostly yes. Note that the chance to object must be given before the decision is made; offering the option of a human review after the fact is not enough. The human must also have a meaningful chance to affect the decision. If the decision is based on purely objective facts that are actually necessary (like you must have a certain license), then the human and the computer always coming to the same decision is likely correct and compliant, but as soon as you start putting in subjective criteria and the human agrees with 100% of the computer's denials, it becomes a lot harder to demonstrate that the human is actually able to affect the decision, as required by Article 5. Note that the burden of demonstration is on the controller, not on the data subject/DPA. Objective criteria also aren't always enough by themselves. If both human and computer calculate the same credit score and you must score X points to get a loan, then the human isn't actually able to affect the decision. Essentially the credit score calculation itself ends up being the automated decision, rather than the formal rejection that is later given to the data subject.
[deleted]
[deleted]
At this point we might as well adopt that joke where you blindly throw away half the resumes because you don't want to hire unlucky people.
At one point in the past a major UK medical school adopted random selection for qualified candidates (Barts and The London School of Medicine and Dentistry, part of Queen Mary University of London). The approach benefitted qualified students from less well-off backgrounds over those who could afford to win at the ever more elaborate (manual at the time) hurdles of resume assessment criteria and effectively game the system. There was an orchestrated campaign against the lottery around "Why gamble with would-be doctors?". Random selection was quietly dropped.
That's probably a good litmus test for political capture by elites. The Netherlands introduced a weighted lottery for medical schools in 1972, abolished it in 2017 for basically the same reasons, studied the (worse) outcomes for a bit, then put it back in 2024.
A person's total luck is constant over a lifetime. The remaining half of the candidates already spent some of their luck in this selection, so they'll be on average less lucky than the discarded half.
Normally we'd reject the first 37% [0] of candidates and then pick the next one that is above the average, but if all the unluckiest candidates show up first, then we need to sample even more in order to get an accurate baseline. This may be compounded by the "Teela Brown" problem [1], where some candidates may be too lucky to end up with our company, causing them to appear later in the stream or not at all. [0] https://en.wikipedia.org/wiki/Secretary_problem [1] https://en.wikipedia.org/wiki/Ringworld
No, luck would be some expression of the difference between the average and the individual outcomes - it only exists relative to a population at the point in time when it is measured.
But, however you structure the selection process the people who get picked are the ones who’ve expended some luck (like, if you throw away half the resumes, but then pick the resumes out of the trashcan, the ones you plucked out are still the lucky ones). I see two possible solutions. 1) Most people won’t be using up most of their luck on this one thing. I mean they’ve got their whole lifetime worth of luck, so you just need to make sure to pick people who still have plenty left. In other words, ageism and/or picking people who’ve never accomplished much are the solutions! 2) We assume working for the company is a lucky outcome. If you make the company a really unpleasant place to work, people will have to use their luck to dodge it. However, luck can only be evaluated against other possible outcomes. The plan, then, should be to set up a competitor (possibly a front) that is a really nice place to work. They’ll act as the “lucky outcome expenditure dump.”
> A person's total luck is constant over a lifetime Ah yes, the much revered cosmological fairness constraint.
Everyone knows luck is tied to the wealth-gravity and increases as the inverse distance to the density of matter. But because it's relative, everyone thinks they have the same luck when not observing others.
Even assuming that was genuinely how luck works, the conclusion does not follow from the premise because it’s obvious not everyone “starts with” the same amount of luck to spend.
But assuming a random draw, you're more likely to select people with higher luck.
assuming luck is spendable
This is not at all how probability works. Luck is not a resource one spends. If you flip heads 500 times in a row with a fair coin, the next coin flip is still 50/50.
Presupposing that the same coin is used for every flip (which is implicit in the example), it would be fair to question whether the coin could possibly be a fair coin after 500 heads in a row, even (and especially) if the flipping process were ideally fair. I’m not a whiz with the math involved, but I am of the opinion that 500 consecutive same-side flips is a large enough sample size to calculate that the coin in question is biased, so it would be unreasonable to assume that the next flip is 50/50. https://en.wikipedia.org/wiki/Checking_whether_a_coin_is_fai...
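The two comments above are answering different questions, and a rough Bayesian sketch shows how both can hold. Everything numeric here is an assumption chosen for illustration: the one-in-a-billion prior that the coin is rigged, and the 0.99 heads rate of the hypothetical rigged coin.

```python
import math

def log10_prob_all_heads(n, p_heads=0.5):
    """log10 of the probability of n heads in a row for a coin with P(heads) = p_heads."""
    return n * math.log10(p_heads)

def posterior_fair(n, prior_biased, p_heads_biased):
    """Posterior probability that the coin is fair after observing n heads in a row.

    Compares two hypotheses only: a fair coin, and a biased coin with the given
    heads probability. Computed in log space to avoid underflow for large n.
    """
    log_fair = math.log1p(-prior_biased) + n * math.log(0.5)
    log_biased = math.log(prior_biased) + n * math.log(p_heads_biased)
    m = max(log_fair, log_biased)
    w_fair = math.exp(log_fair - m)
    w_biased = math.exp(log_biased - m)
    return w_fair / (w_fair + w_biased)

print(log10_prob_all_heads(500))          # order of magnitude for a truly fair coin
print(posterior_fair(500, 1e-9, 0.99))    # belief in fairness after 500 heads
print(posterior_fair(0, 1e-9, 0.99))      # belief in fairness before any flips
```

If the coin is fair by assumption, the next flip is 50/50 regardless of history. If fairness is only a belief, then even an extremely skeptical prior about rigging is overwhelmed by 500 straight heads, and the sensible prediction for the next flip is no longer 50/50.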
Donald Trump disproves the fixed luck hypothesis (and the Karma hypothesis!)
Or more to the point: there are generally far more qualified applicants than job roles. That is, training and education have greatly expanded over the last couple of decades to produce more and more job seekers, whilst job creation hasn't really kept pace.
This hurts more than it should.
Maybe LLM resume screening is a symptom of a bigger problem: with tens of candidates per vacancy, employers can screen resumes badly and even throw half of the resumes away and still hire someone qualified.
That's really what it is, or at least what I've noticed. Any position you have these days is inundated with applications. Most don't meet the qualifications (because in a lot of places, say in the US, you must apply to jobs to keep benefits, regardless of what you are applying for), and among the remaining ones you'll find that there will always be some that are all similarly qualified. Who do you hire for one position? It sometimes just comes down to luck. I can't imagine AI doing the filtering making the process easier, and more applications are just going to get tossed because of it.
The author made this exact joke in TFA.
> I fail 65% of the time. Same exact resume, different luck. As someone who’s run hiring pipelines for technical roles in the past few years, that’s actually a fantastic number. I objectively hate saying that, but it’s true. 35% chance of elevating a technical individual to the next stage with no effort? I’ve seen as many as 100+ applicants an hour even when including a domain specific screener question. That’s 35 “screened” applicants in an hour. Were valid candidates screened out? Yes. Do you still have a candidate pool 35x larger than you need? Unfortunately, also yes. The volume of applicants is SO HIGH that your chances of getting moved to the next stage are actually markedly worse if AI isn’t involved. If you didn’t apply immediately (using an AI bot) there are 50+ people ahead of you, and an exhausted technical leader if they ever make it to your resume. Referral bonuses exist for a reason.
In that case, I have a pre-screening system to sell you. Through state of the art technology, it only lets through the best* 1% of applications. *According to our proprietary, undisclosed, non-deterministic metric, which may or may not be Math.random
I worked at a startup that judged their hiring pipeline quality using rejection rate criteria.
Reminds me of this https://stackoverflow.com/questions/16833100/why-does-the-mo...
Is it? Or is it a 65% chance of a resume getting ignored before a single human sees it, reducing your pipeline's likelihood of catching qualified candidates by the same? Gates that reduce resume flow-through are only useful if their reduction is correlated with quality. Otherwise they're just dragging out your hiring process or unnecessarily causing you to ultimately lower your hiring bars.
> Gates that reduce resume flow-through are only useful if their reduction is correlated with quality. The volume is infeasible to review everyone for quality, even at an hour scale. The conclusion and solution is inevitable, though I wish it were different. 35% is actually really good if you’re not coming in through a referral. The current reality is <1% and the person reviewing you is exhausted.
You may as well just randomly pick 65 to discard, if your only goal is to reduce the number for review.
What an inhumane way of looking at this. Hiring is deeply flawed, you know it, and yet you keep job postings open for weeks/months in case "the one" magically appears on your doorstep instead of just interviewing 10-20 people and picking one... Corpo bullshittery at its finest.
This reasoning isn't.
The goal for the interviewer is to have a much higher ratio of good/bad candidates after the first screening. This means the more costly time you spend on the second step has a better return.
So the question is: is the score given by this system correlated with candidate quality? I don't think this post gives enough data to know.
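To put numbers on the good/bad ratio argument, here is a minimal sketch. The base rate and pass rates are made up for illustration and are not taken from the article; `good_fraction_after_gate` is a hypothetical helper.

```python
def good_fraction_after_gate(base_rate, pass_if_good, pass_if_bad):
    """Fraction of good candidates among those who pass a screening gate.

    base_rate    -- fraction of good candidates in the incoming pool
    pass_if_good -- probability that a good candidate passes the gate
    pass_if_bad  -- probability that a bad candidate passes the gate
    """
    good = base_rate * pass_if_good
    bad = (1.0 - base_rate) * pass_if_bad
    return good / (good + bad)


# Illustrative assumption: 10% of applicants are good.
# A gate that passes everyone with the same 35% chance (uncorrelated with quality):
random_gate = good_fraction_after_gate(0.10, 0.35, 0.35)

# A gate with roughly the same overall pass rate, but correlated with quality:
correlated_gate = good_fraction_after_gate(0.10, 0.70, 0.31)

print(f"uncorrelated gate: {random_gate:.2f} good among survivors")
print(f"correlated gate:   {correlated_gate:.2f} good among survivors")
```

With these assumed numbers the uncorrelated gate leaves the pool at 10% good, while the correlated one roughly doubles it. Which case the ATS falls into is exactly what the post doesn't give enough data to say.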
So the logical solution is for candidates to submit multiple applications with slight variations to their contact info, "John Schmidt", "John J. Schmidt", "John J. J. Schmidt", "John Jacob J. Schmidt", "J. J. Jingleheimer Schmidt", etc.
Hey, that's my name too!
Whenever I send them out The filters always route: "Spammer: John Jacob Jingleheimer Schmidt" [N/A] [N/A] [N/A] [N/A]
It's a good day to have 3 middle names.
[deleted]
If you have no requirements for accuracy, you can just advance 35% of applicants at random. If the first 50 people who apply are all bots, why are you reading resumes in order of submission?
One of the first things you do when hiring is set an application period and randomize the order of resumes when reviewing, because early application is not a strong signal.
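A minimal sketch of that randomization, assuming the applications are collected into a list once the period closes. `review_order` and the seed value are hypothetical; the seed is only there so the order can be reproduced later if someone asks how it was chosen.

```python
import random


def review_order(applications, seed):
    """Return the applications in a shuffled order, leaving the input untouched."""
    rng = random.Random(seed)      # dedicated RNG, independent of global state
    shuffled = list(applications)  # copy so submission order is preserved elsewhere
    rng.shuffle(shuffled)
    return shuffled


applications = ["app-001", "app-002", "app-003", "app-004", "app-005"]
print(review_order(applications, seed=2024))
```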
Sounds like you're pretty bad at hiring pipelines.
There have got to be better ways to optimize pipelines. Maybe set a limit on the number of applications for a role based on how many you/your team can reliably go through. If more are needed, then open the role for another wave of applications.
Except the bit about ranking a decades long S3 engineer lower than an intern with GitHub repo.
I wonder if you could solve this for programming specifically as follows:
1. Give them some easy leetcode questions. Nothing that a competent programmer would have any problem with.
2. If they pass, ask for a deposit of like $20. Shouldn't be an issue for people who are actually serious.
3. Do more simple leetcode questions but this time on zoom so you can tell if they are using AI. If they pass that they get the deposit back. (Yeah I know there are real-time interview cheat AI programs but based on what I've seen on demos of them it's super obvious when they're being used.)
Probably not practical but just a thought!
I'm not going to do any of those 3 things for a would-be employer.
They don't seem like unreasonable things to me so I guess it also helps filter out unreasonable people!
This selects for desperation.
[deleted]
[deleted]
I think what's more worrying to me (if other systems work like this ATS) is that it seems to judge based on a bunch of factors that will probably disqualify a ton of decent to good participants. For example, 65 points are given for a mix of personal projects and open source contributions. Which is great if your one and only interest is in tech, and you don't have a family, dependents or a second/third job. If you have any of those other things, well the odds seem like they're incredibly stacked against you. And it makes me wonder how many of these systems are stacked in favour of wealthy people with a near special interest level of obsession with tech and no worries outside of going to college/working a single job in their industry of choice.
Yeah, the overvaluing of personal/open source projects is worrying and kind of sucks. I can use myself as an example, I don't do personal projects really, outside of work. My only actual programming work experience is during work hours for my employer. My hobbies are tech-adjacent (3D printing, some hardware/arduino stuff, photography) but they aren't "make a bunch of projects and put them on github" type hobbies. I'm certainly not going to make some BS fake CRUD or SaaS apps just to show off for potential employers, what a waste of time. I, intentionally, have zero online presence in that regard. You won't find any public repos on my github, I don't blog, etc. It's even infected the ops/sysadmin side of the field (where I work), and that's somehow even worse. Like of course I don't have a bunch of environment specific scripts on my GH, why would I? It's irrelevant to anyone that doesn't work in my department at my current employer.
In my experience personal projects are the greatest indicator of IC competence, especially for young people. You may not like it, but turns out that when you do a thing in your free time because you like it, you get better at the thing than the people that only do it because they have to.
[deleted]
I know that some think this is just some cold hard straight talk but this style of individualistic thinking lacks empathy. And more practically, it’s a trap. In context, the “doing things” and “opportunities” that we’re talking about are jobs, careers. So by promoting the idea that one must work harder or longer to get or keep a career that they’ve already built sounds like a path to opt-in servitude.
In hiring, we pass laws to prevent abuses. In many countries and soon a few states, being asked to work outside of work hours is considered an abuse. Expecting that someone does work related activity outside of work hours is something I would actually consider regulating out of the application process!
Of course life isn't fair. But here the result is that companies will ignore potentially great candidates who dedicate all their programming time to their job and instead consider candidates who may be not just worse programmers, but also more interested in their hobbies (or padding their CV) than in doing their job. I'm saying this as somebody who most of the time has some side project going on.
[deleted]
“Fair” is one thing, “systemically impossible to even approach fair” is another. For example, you can’t “conscious long-term effort” your way out of being stop and frisked by cops because you were walking while black. This setup isn’t even good for employers. Having your job as your hobby doesn’t automatically make you better at your job.
> The default model is gemma3:4b That’s a tiny model. No LLM is going to be a perfect and repeatable judge, but a tiny 4B model is like plugging an RNG into this system. This whole exercise feels like someone vibe coded an ATS and got it to the point where the tests were passing because they decided they should have an open source ATS project.
This sort of model is fine for small problems, when used in the right way. I think there's probably a version of Resume analysis that would work well with this model, but "hey clanker, what projects has this person done" is not the way. You need extraction, cleanup, probably OCR to compare and further clean up, multiple analysis passes per signal with LLMs, judges, etc. None of that needs to be large models, you'll get marginally better performance, but there's very little context, these models will perform well when used correctly.
[deleted]
This word (determinism) has a magical effect of warping any online posts it touches. Once you hear it you can almost guarantee it's going to be misguided. At least this time it's actual determinism (same input = same output), not arbitrary unrelated things. Determinism matters for reproducibility, but do you really want these outputs to be reproducible in this particular case? Making LLM outputs deterministic is relatively trivial: you have to use batch-invariant kernels (if you use batching) and either set the temperature to 0 (don't do that, randomized sampling is here for a reason) or fix the seed (better). It's readily available in a few systems. But this won't make the result more useful, it will just obscure the fact that the agent is genuinely not sure about it - look at the range of the scores it gives! It still won't predict anything but the score will stay the same each time. Do you really want that? What happens here is they're supplying too little information (just a resume, which is almost at the noise level) and expecting a reply with too broad implications. This is a basic design mistake regardless of whether it uses LLMs. All surveys, tests, laws, and voting systems are extremely sensitive to framing because they work off too little information. But they also don't exist in a vacuum, unlike this thing.
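For anyone who wants to see the two options side by side, here is a toy sampler in plain Python. It is a sketch, not how any particular inference engine does it: `sample_index` is a hypothetical name, the logits are made up, and it ignores the batching and floating-point effects that make real systems drift from run to run.

```python
import math
import random


def sample_index(logits, temperature, rng):
    """Pick an index from logits using temperature-scaled softmax sampling."""
    if temperature == 0:
        # Special-cased greedy pick: dividing by zero is not an option.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    top = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0]

# Option 1: temperature 0, always the argmax.
greedy = sample_index(logits, 0, random.Random())

# Option 2: keep sampling, but fix the seed so the draws repeat.
run_a = [sample_index(logits, 1.0, random.Random(7)) for _ in range(5)]
run_b = [sample_index(logits, 1.0, random.Random(7)) for _ in range(5)]
print(greedy, run_a == run_b)
```

Either way the output repeats, but the spread of the underlying distribution is still there. It is just hidden.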
This. Human judges and examiners are famously not deterministic even though we would wish it were so - we've probably all heard the thing of harsher sentences being given in the hour before lunch.
>we've probably all heard the thing of harsher sentences being given in the hour before lunch That suggests determinism though. I mean I agree with you overall. Either human decision making is a system so complex it appears non-deterministic, or it is deterministic. Practically speaking, we are non-deterministic. Let's not conflate non-deterministic with inaccurate though. Non-deterministic systems can be 100% accurate. https://en.wikipedia.org/wiki/Las_Vegas_algorithm
> harsher sentences being given in the hour before lunch. Implicit bias theory sparked a massive number of studies that suggested everything influenced you from the color of the room, to what the person said to you before entering. It’s been really hard to replicate and the conclusions that have been drawn are contradictory.
I made a similar comment on a different post. Non-determinism does not necessarily mean it cannot reliably reach the correct output (although sometimes it does mean that). Las Vegas algorithms are non-deterministic and 100% accurate. The tradeoff is that the time it takes to reach the correct answer is highly variable. To contextualize this insight in your post and basically just repeat what you are saying: The mistake is not using a non-deterministic system. The mistake could be, in some sense, using it too little. Re-evaluating the same resume 5 times and seeing a high variance in scores is a more useful signal than evaluating it once.
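A rough sketch of what "evaluate it 5 times and look at the variance" could look like. `score_fn` is a stand-in for whatever call produces a score (an assumption, not the ATS's actual API), and `score_spread` is a hypothetical helper.

```python
import statistics


def score_spread(score_fn, resume_text, runs=5):
    """Score the same resume several times and summarize the spread.

    score_fn -- hypothetical callable returning a numeric score for a resume
    runs     -- number of repeated evaluations (needs at least 2 for stdev)
    """
    scores = [score_fn(resume_text) for _ in range(runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
    }
```

A wide min/max range on one resume says the score is mostly noise for that input, which a single run would never reveal.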
Nondeterminism is also a feature, not a bug. If you don't want people to optimize against your filtering process, you have to make it somewhat nondeterministic. For example, better candidates are exponentially more likely to pass the filter, instead of a hard cut-off at the top-100. Then it becomes no longer worthwhile to Goodhart the filtering process, because it barely increases your chances and there are so many more places you can use your time better.
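One way such a filter could be sketched, assuming each candidate already has a numeric score: draw the shortlist at random without replacement, with weights that grow exponentially in the score. This uses the Gumbel top-k trick; `stochastic_shortlist`, the `temperature` knob and the example scores are all illustrative assumptions.

```python
import math
import random


def stochastic_shortlist(scores, k, temperature, rng):
    """Sample k candidates without replacement.

    At each draw a candidate's chance is proportional to
    exp(score / temperature), so higher scores are exponentially favored
    but nobody has a hard guarantee either way.
    """
    tiny = 1e-12
    keyed = []
    for name, score in scores.items():
        # Clamp away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), tiny), 1.0 - tiny)
        gumbel = -math.log(-math.log(u))
        keyed.append((score / temperature + gumbel, name))
    keyed.sort(reverse=True)
    return [name for _, name in keyed[:k]]


scores = {"a": 90, "b": 70, "c": 50, "d": 30}
print(stochastic_shortlist(scores, k=2, temperature=10.0, rng=random.Random(1)))
```

As `temperature` shrinks this approaches a hard top-k cutoff; as it grows it approaches a uniform lottery.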
> If you don't want people to optimize against your filtering process, you have to make it somewhat nondeterministic. I'm sorry, I'm not following this at all. When you say "better candidates are exponentially more likely to pass the filter", we're still talking about a metric, yes? A metric that can be optimized? Why would switching from a hard cutoff to some sort of stochastic filter weighted by this metric discourage optimization?
Optimizing for the metric involves:
1. Optimizing for generally applicable skills that the metric is trying to measure.
2. Optimizing adversarially to hill-climb the metric.
You want candidates to do (1) and not (2). You can make them agnostic to the second by setting
d(expected gain)/d(opportunity cost) = 0 ==> expected gain \propto opportunity cost
It is the case that most metrics are logarithmic: it takes just as much effort to decrease one bit of error as the next bit. So
log(score) \propto (opportunity cost) \propto expected gain
Thus, for them to be agnostic, you should filter candidates proportional to their log-score on the metric (where 0 is a perfect score). Because generally applicable skills are generally applicable, they will still benefit from improving those, they just no longer benefit from adversarial optimization, unless your score function looks very similar to others who have not adopted this filtering process. The issue with a hard cutoff is that people near the boundary are extremely incentivized to adversarially optimize, as it is usually cheaper than working on generally applicable skills and actually pays off for them. You see this phenomenon on AoPS where (esp. Californian) students talk about grinding for MATHCOUNTS instead of learning calculus.
It's always amazed me that a tech company will pay $300,000+ for a good engineer, because talent is so hard to find... meanwhile their recruiter operates unsupported and has a very different idea about what good looks like. Their ATS black-holes >50% of the resumes because its filtering heuristics are garbage, because recruiting selected the ATS because it has a Google Gmail integration or something, and the ATS's filtering technology was not reviewed by anyone in the engineering or data teams.
I ran the ATS myself and had a similarly quirky experience. I was in the 70s because it couldn't find my GitHub profile, and then it didn't like some of the popular Ruby libraries I'm the author of. After a few runs it picked things up appropriately. I always got dinged on formal education though. This stuff is gross.
Similar to my experience. Put me around 65 in some runs, because it didn't like that I don't have contributions to OSS. Also, it doesn't pick up certifications or awards. I tried some PRs people are suggesting with enhancements (https://github.com/Zem-0/hiring-agent), it helps, but overall their ATS is hugely biased towards people with large GitHub contributions to OSS.
I tried this with my CV, and it somehow scored me bonus points for GSoC!
BONUS POINTS: 5.0
------------------------------
Google Summer of Code (GSoC) participation: +5
Even though I've never done this, and don't claim to have done it in my CV.
Happened to me as well. It is a known hallucination https://github.com/interviewstreet/hiring-agent/issues/240
Thanks - interesting. Very odd, though.
This insanity only exists because the tech industry is standard-less. No formal education needed, no formal training requirement, no apprenticeship, no software building code, no professional organization. Resumes have never been a good predictor of success - and why would they be?? Even if they're truthful and it's "impressive looking", that doesn't give you any assurance of knowledge, of who they learned under, what they learned, that they passed some minimum criteria. We might as well be rolling dice. So why not an LLM that randomly assigns scores?
I have no data to lean on other than my experience and intuition but I’d say that’s not the case. My domain is corporate finance, which encompasses a lot of structured roles and certifications, yet I consistently feel the Resume is just a poor device for making any judgement calls. Having people summarize their career into 1-2 pages of bullet points just doesn’t mean much. Especially now that keyword packing is a thing. It’s just meant as an introduction/sniff test to open the door for a conversation. Then it allows for deeper more probing questions to be asked. This is where you’ll assess how impactful their contribution to a project actually was. Were they really living up to your definition of a manager, or were they more so an IC that had a lot of responsibility. Stuff like that. > Resumes have never been a good predictor of success Applies broadly to the world, it’s not unique to tech
The problem is we have too many applicants to phone screen them all. For a lot of jobs today you end up with 10,000 applications, which is why these automated resume-skimming systems exist, but unfortunately this page shows how they basically don't work
People seem to hit a wall when flooded by resumes. They feel like there’s some needle in the haystack they need to find and it’s overwhelming. But you don’t have to read all of them. Or talk to all of them. Or use a system like this to filter. If you know what you’re looking for, you just start skimming them and maybe ranking them based on your own rubric. If it’s an obvious “no” you can usually tell within a 5 second skim. Once you have a handful of high ranking ones, stop, and talk to them. Repeat as necessary until you have a short list of people you’d want to hire. There might be 9900/10000 resumes you never even looked at and maybe one of them would have been slightly better but you can’t let perfection be the enemy of progress. Stand by your convictions of feeling the candidate is qualified and capable and meets what you expect and hire them, get back to business. Having been in “talent shortage” mode for a long while I’d rather have 10000 resumes than 3. Having to pick one from a suboptimal selection is an awful position to be in, but sometimes a necessity.
Do you think fields that have formal criteria don’t use resumes with keywords? I bet Lawyers look for school names and big law firms all the time. Credentialing helps maintain a quality floor. Does this person have basic employable skill? Nothing more. It actually doesn’t help you identify levels of talent and skill which is a universal hiring problem. We do have a credential - a CS degree. And you can see it is a mixed signal. Employers can choose of their own free will to take risks on employees that do have this credential, or not. Mandating by law that you must have a CS degree doesn’t seem to help our field as we famously have high performers across the spectrum of formal education.
Feels like "I Don't Hire Unlucky People" all over again, but with extra tokenmaxxing steps. https://neonrocket.com/2014/05/rescued-from-the-ashes-i-dont...
This is the new AI reality everyone around is wanting: a nondeterministic computing. There is another name for it: a waste of electricity. But wait, not waste! Consumers paid for it fully, with nice profit margins. You and me, paid. Try using google flights, or booking.com: the prices shown in search results list are frequently significantly different from those in a single result. It's a nondeterministic compute when it's easy to spot it. But it's not always that easy. It's all sad, to be honest.
There should be laws against displaying wrong prices or different prices for who you are…
I'm a little confused, is this an ATS system that anyone actually uses? If not, I'm not sure how it's better than just asking ChatGPT to score your resume out of 100. Why would you want to optimize your resume for a system no one is using to score it?
I would assume at least hackerrank is? I don’t think the point of a lot of this is to optimize your resume. It’s to show how arbitrary these systems are.
From my understanding this one is used for hiring tech workers only. The (very) widely used Workday application system for ex seems to have its own built-in ATS.
(Almost) everyone’s using some kind of ATS, every ATS is adding AI auto-ranking (and has been trying to for 15 years), and almost all HR people feel like they have too many obviously bad CVs to read. Whether or not someone is using this ATS specifically, if you submit several CVs to several places, your CV is going into at least one magical 8-ball.
“I'm a little confused, is this an ATS system that anyone actually uses?” You read my mind. If the answer is “no”, then we can ignore this.
For one, if you go on to Hacker Rank's "Screen" page, they mention the product is used by Stripe/AirBnB/LinkedIn/Atlassian/IBM etc etc. I imagine that there's plenty more companies using it too. But I'd also assume that their competitors are doing something similar so I don't think we as an industry can just ignore that it's happening.
Interesting, thanks. I admittedly spent zero time looking into it :) I’m surprised open source contributions count for so much. First I thought “is that something people actually list in a resume?”. But it looks like it pulls your GitHub account and appends that information. That’s kind of unfortunate for anyone who doesn’t use GitHub
> HackerRank Screen compresses the top of the hiring funnel by replacing manual resume reviews and unstructured phone screens with structured, auto-scored assessments That seems to be a different type of product.
The takeaway from this for me is that, using an LLM to score anything takes multiple (maybe even many) runs and the result you’ll get is, at best, a sane-ish distribution. Which sort of sounds workable until you scale it up to larger datasets, where at some point compute/time/energy costs will render it non-viable. I am sure there’s some reasonable rule of thumb estimation on distribution that could be applied based off fewer runs per data artifact, but you’re always going to be trading off against confidence by doing this. Beyond this, I’d bet that almost no implemented systems that use LLMs for scoring, ranking, or decision making use such a multi-run approach. Partly because people don’t understand their behaviour is stochastic, perhaps because a lot of people without a background in statistics don’t understand what stochastic actually means, and no doubt partly because of budget concerns: if you have to ask an LLM to do the same thing 10, 50, 100 times to get a sufficiently good result, then the cost saving argument is either weakened or completely destroyed. There is at least one more aspect worth considering in the specific case of resumes/CVs: is the inconsistency of scoring by LLM worse than the inconsistency of scoring by a human following a similar process? Because the reality is that, even for an experienced recruiter, reviewing hundreds or thousands of resumes or CVs gets pretty fatiguing. People get hungry, bored, tired, restless, irritable, etc. That inevitably leads to inconsistencies creeping in, so there’s always an element of “luck” (or, perhaps better, uncertainty) as to whether your resume/CV passes screening. So is that inconsistency better or worse with LLM screening? I don’t know. But, at least, if it’s not worse maybe it doesn’t matter for this specific use case. And if it’s notably better then maybe it’s raised the bar on what “good enough” screening looks like? 
(And I’m sure other use cases warrant similar, “does it matter?”, questions, with the answers no doubt landing differently.)
My experience with benchmarks and evals is that it can take ~20 runs of a problem for the distribution of answers to start to converge. Ideally you'd know the convergence properties of your algorithm ahead of time and make a Bayesian solution that makes the uncertainty explicit.
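As one possible reading of "make the uncertainty explicit", here is a small sketch that treats each run as a pass/fail outcome and keeps a Beta posterior over the pass rate. The pass/fail framing, the uniform Beta(1, 1) prior and the name `beta_posterior` are assumptions for illustration, not something from the article.

```python
def beta_posterior(passes, runs, prior_a=1.0, prior_b=1.0):
    """Posterior mean and standard deviation of a pass rate.

    Uses a Beta(prior_a, prior_b) prior with a binomial likelihood,
    so the posterior is Beta(prior_a + passes, prior_b + failures).
    """
    a = prior_a + passes
    b = prior_b + (runs - passes)
    mean = a / (a + b)
    variance = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, variance ** 0.5


mean_20, sd_20 = beta_posterior(passes=7, runs=20)
mean_200, sd_200 = beta_posterior(passes=70, runs=200)
print(f"20 runs:  {mean_20:.3f} +/- {sd_20:.3f}")
print(f"200 runs: {mean_200:.3f} +/- {sd_200:.3f}")
```

The posterior standard deviation gives a direct handle on how much more a further batch of runs would actually buy you.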
Hiring and job search has been so hard and AI has amplified the existing problems instead of solving any.
Wdym, can't you just litter your applications with buzzwords and other bs to automatically get a high score in these systems?
HR market is basically an early google rigging era, where you can place hundreds of keywords at the footer (white text on white background) to start popping up on random searches.
I have been on both sides of the market. And it sucks so bad at both ends. Companies which deeply care about their next hire are struggling to hire, and actual great people looking out are outcompeted by AI slop and AI bulk applying. It is actually a very hard problem to solve.
From `resume_evaluation_system_message.jinja` > *SCORES MUST NEVER DEPEND ON THE FOLLOWING FACTORS:* > - College, university, or educational institution name > - CGPA, GPA, or academic grades I don't understand why they would omit these factors from the evaluation.
> I don't understand why they would omit these factors from the evaluation. Only hiring MIT graduates sounds great to a lot of tech folks! Automatically rejecting applicants from HBCUs, however, sounds like a lawsuit. As to the GPA thing, I think it's just to stop the LLM glomming onto an obvious numerical grade? LLMs like to rank things by obvious dimensions, and whether someone had a 4.0 or a 3.8 in grad school makes very little difference to their performance 10 years down the line.
https://qz.com/1427621/companies-are-on-the-hook-if-their-hi... > But it didn’t. After the company trained the algorithm on 10 years of its own hiring data, the algorithm reportedly became biased against female applicants. The word “women,” like in women’s sports, would cause the algorithm to specifically rank applicants lower. After Amazon engineers attempted to fix that problem, the algorithm still wasn’t up to snuff and the project was ended. And in another org: > After an audit of the algorithm, the resume screening company found that the algorithm found two factors to be most indicative of job performance: their name was Jared, and whether they played high school lacrosse. Girouard’s client did not use the tool. https://www.npr.org/2024/04/11/1243713272/resume-bias-study-... > Their working paper, published this month and titled "A Discrimination Report Card," found that the typical employer called back the presumably white applicants around 9% more than Black ones. That number rose to roughly 24% for the worst offenders. It'll discriminate by proxy, basically.
I don't understand why they'd hand those data points over to the model in the first place. If it's in the context window, it's impacting the output. To ensure that no weight is placed on those factors, they should be sanitizing them out before handing the data over to the model.
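A minimal sketch of that kind of sanitizing, assuming the resume has already been parsed into a JSON-like structure. The field names in `BLOCKED_FIELDS` and the `redact` helper are hypothetical; I have not checked what the project's parsed schema actually looks like.

```python
# Hypothetical field names; adjust to whatever the parser actually emits.
BLOCKED_FIELDS = {"institution", "university", "college", "gpa", "cgpa", "grades"}


def redact(record, blocked=BLOCKED_FIELDS):
    """Return a copy of a JSON-like record with blocked keys removed at any depth."""
    if isinstance(record, dict):
        return {
            key: redact(value, blocked)
            for key, value in record.items()
            if key.lower() not in blocked
        }
    if isinstance(record, list):
        return [redact(item, blocked) for item in record]
    return record  # strings, numbers, etc. pass through unchanged


resume = {
    "name": "Jane Doe",
    "education": [{"degree": "BSc", "institution": "Example University", "GPA": 3.9}],
    "skills": ["python", "sql"],
}
print(redact(resume))
```

This only strips structured fields. A school name mentioned in free text would still reach the model, so it reduces the leak rather than closing it.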
Hopefully so that people like me, that dropped out of high school yet have had a successful career as a self-taught engineer, have a chance. [1] Just kidding, my resumes are sent to /dev/null like everybody else’s. —— 1: In fact, I will be controversial and say that self-taught engineers tend to be the strongest in their own particular niche, because they are powered by sheer desire to learn and improve. I am routinely appalled by how many people go on forums to ask how to learn a new thing, completely unable to self-direct their learning. I blame the modern school system.
I'm a self-taught programmer as well, who dropped out of university, and these factors being omitted would benefit me as well, but I feel like good grades and a good university are still indicators of someone being, or being capable of becoming, a good programmer. This system would drop a Harvard top graduate for someone having a year of experience in some outsourcing firm.
At my company someone has introduced an internal tool that should help understand and give a "score" to design documents from teams. Needless to say, this tool gives scores exactly like the article mentions. Same document, same LLM, same prompt, and different results. It becomes even more ridiculous once you switch to other models, or if you ask a model to review the work of another model. I am not sure why we insist on making LLMs do the work they are not supposed to do and/or in a way they are not supposed to do. The worst part is that people are aware of the problem but they just ignore it and consider it as "a reference number, just to have an understanding". If it were like that, it would be less of a problem. The issue comes from the fact that eventually someone without enough knowledge will trust the output (so X points out of Y is how it is), or someone will stop challenging the output and consider it for their process - like in this unfortunate case of hiring. At a certain point, people who don't know what they are doing give a tool that doesn't know what it's doing to people who don't know what they are doing. A pure mess. And everyone has to comply and applaud. If you go against it, you are against AI. This is what I hate the most about AI. Not the tool, but the shortcuts we're willing to take to justify its existence.
> Sometimes my projects “lack architectural complexity” Well done you! It is difficult to avoid architectural complexity, but imho well worth it.
Count to three, no more, no less. Four shalt thou not count, neither count thou two—excepting that thou then proceed to three. Five is right out.