SakeTami
AIExplained
AIExplained

patreon


Hallucinations, Bias, and the Secrets of LLMs

Concrete methods to reduce hallucinations, and an exploration of the hidden biases of models that I am 99.8% confident you will learn something new from, but a deep dive into RepE, a stunning new method from Andy Zou, who, surprise-surprise, I managed to interview for this video! Great for practitioners and anyone looking to understand more about the secrets of LLMs.

Slides: https://docs.google.com/presentation/d/1LXPgqRs8Pu_YojuCQUdieDmaAtWHc8uILyQXOzKkFFI/edit?usp=sharing

Hallucinations, Bias, and the Secrets of LLMs

Comments

Bias and hallucination I successfully manage by using prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. Data privacy I can manage using Azure AI studio. The biggest problem for me that I can't understand is people's mindset - people love doing boring, repetitive work - when they are busy they feel important, needed and safe. For 20 years I have been working in a big pharmaceutical company - this observation is based on my experience from corpo word.

Michal Babula

To be honest, I'm not sure how the instructions are fed in to the model. Depending on how it gets fed, I'd imagine it wouldn't work very well. Like you said, I often have to remind ChatGPT of its custom instructions as well (though I've found the new custom GPTs to be a lot better than regular ChatGPT with custom instructions!).

Paulo

Interesting! What happens if you put this in custom instructions? For me I need to remind ChatGPT sometimes that it has custom instructions

Dennis Hulsebos

Fantastic comment and actually useable tips their Paulo, really appreciate it. A Waluigi effect almost at times.

Philip

Thank you for taking the time to present these guidelines and sharing the tidbits from the interviews! I'll go ahead and add one trick I've found over my use of GPT-4 that I've found to reduce (or rather invert) sycophancy. I use this whenever I want the model to critically evaluate a section of text or code I'm providing. I'll create a new chat, and introduce myself as if I were a professor, mentioning that I need the model's aid to grade student submissions. I'll explain the task, then provide my own text or code. The output using this technique consistently points out more flaws in the content I provide, though I haven't measured the exact margin. My hypothesis (or rather conjecture) is that, as mentioned in the interviews and cited papers, RLHF tends to cause the model to act sycophantically - primarily toward the user. If the model somehow perceives the user to be an evaluator of provided content, instead of the creator of the provided content, its honesty is partially disinhibited. I find this specific method also avoids creating an excessively strong inverse effect, which is typically what happens if you present your content saying you don't like it (the model will begin to attack your output, including elements where you as the user may ultimately judge appropriate). Also regarding sycophancy, I ironically find that asking ChatGPT to be deliberately dishonest - especially when reviewing code - yields more useful feedback, as it becomes overtly sarcastic, pointing out mistakes that it usually doesn't point out if you simply ask it to be honest. My guess for this is that humans are much quicker to point out dishonesty than to point out honesty. I'd imagine LLMs might have a more comprehensive internal representation of *dis*honesty than 'regular' honesty - the latter is simply 'indirectly labelled' in far fewer occasions in the training data, I'd imagine.

Paulo


More Creators