Origin Part 2: Nobody Told It Harm Was Bad
Josh T · Dev.to · 1 min read

OLT-1 was never trained to refuse harmful requests. It refused anyway. Most AI safety works like this: train a massive model on everything the internet has to offer, then fine-tune it to refuse harmful requests. The model doesn't understand why it's refusing. It just learned that certain patterns of words trigger certain patterns of rejection. That's alignment through obedience. It works, until so…