59.
the GRPO reward was the probability the classifier assigned to the attack being not malicious, plus a bonus if the argmax was not malicious (meaning the attacker had tricked the classifier). in early rounds the attacker does pretty well, but th