LLM Alignment - RLHF & DPO

RLHF (Reinforcement Learning from Human Feedback) is a powerful family of methods that steers an LLM's outputs toward desired preferences by generalizing from a comparatively small set of human-annotated preference judgements. DPO (Direct Preference Optimization) is a more recent technique that reaches comparable alignment quality at a fraction of the cost, by optimizing the policy directly on preference pairs instead of training a separate reward model and running a reinforcement learning loop.
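
As a rough illustration of why DPO is cheaper, below is a minimal sketch of its loss on a batch of preference pairs. It assumes per-sequence log-probabilities from the policy being trained and a frozen reference model are already computed; the function and argument names are illustrative, not any specific library's API.

```python
# Minimal sketch of the DPO loss for a batch of preference pairs.
# Inputs are per-sequence log-probabilities under the trained policy
# and a frozen reference policy; names here are hypothetical.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_chosen | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_rejected | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_chosen | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_rejected | x)
    beta: float = 0.1,                    # strength of the implicit KL constraint
) -> torch.Tensor:
    # Log-ratios of policy to reference for preferred and dispreferred responses.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # DPO pushes the chosen log-ratio above the rejected one,
    # with no separate reward model and no RL rollout stage.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()
```

Because the loss only needs log-probabilities from the policy and a frozen reference model, a single supervised-style training loop replaces the reward-model training and PPO stages of RLHF.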
Read More