<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>IPO on Answer</title>
    <link>https://answer.freetools.me/tags/ipo/</link>
    <description>Recent content in IPO on Answer</description>
    <generator>Hugo -- 0.152.2</generator>
    <language>zh-cn</language>
    <lastBuildDate>Mon, 09 Mar 2026 05:13:58 +0800</lastBuildDate>
    <atom:link href="https://answer.freetools.me/tags/ipo/index.xml" rel="self" type="application/rss+xml" />
    <item>
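      <title>Why DPO Replaced RLHF as the Mainstream Method for LLM Alignment: A Mathematical Revolution from Reward Function Reparameterization to Preference Optimization</title>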
      <link>https://answer.freetools.me/dpo%E4%B8%BA%E4%BD%95%E8%83%BD%E5%8F%96%E4%BB%A3rlhf%E6%88%90%E4%B8%BA%E5%A4%A7%E6%A8%A1%E5%9E%8B%E5%AF%B9%E9%BD%90%E7%9A%84%E4%B8%BB%E6%B5%81%E6%96%B9%E6%B3%95%E4%BB%8E%E5%A5%96%E5%8A%B1%E5%87%BD%E6%95%B0%E9%87%8D%E5%8F%82%E6%95%B0%E5%8C%96%E5%88%B0%E5%81%8F%E5%A5%BD%E4%BC%98%E5%8C%96%E7%9A%84%E6%95%B0%E5%AD%A6%E9%9D%A9%E5%91%BD/</link>
      <pubDate>Mon, 09 Mar 2026 05:13:58 +0800</pubDate>
      <guid>https://answer.freetools.me/dpo%E4%B8%BA%E4%BD%95%E8%83%BD%E5%8F%96%E4%BB%A3rlhf%E6%88%90%E4%B8%BA%E5%A4%A7%E6%A8%A1%E5%9E%8B%E5%AF%B9%E9%BD%90%E7%9A%84%E4%B8%BB%E6%B5%81%E6%96%B9%E6%B3%95%E4%BB%8E%E5%A5%96%E5%8A%B1%E5%87%BD%E6%95%B0%E9%87%8D%E5%8F%82%E6%95%B0%E5%8C%96%E5%88%B0%E5%81%8F%E5%A5%BD%E4%BC%98%E5%8C%96%E7%9A%84%E6%95%B0%E5%AD%A6%E9%9D%A9%E5%91%BD/</guid>
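      <description>An in-depth look at the mathematical principles and engineering practice of Direct Preference Optimization (DPO). Starting from the Bradley-Terry preference model and the core insight of reward function reparameterization, it systematically explains how DPO avoids the complexity of PPO-based RLHF, namely a separately trained reward model and an online reinforcement-learning loop. It covers performance comparisons between DPO and PPO, the evolution of variants such as IPO, KTO, and ORPO, and best practices including tuning the β hyperparameter and avoiding overfitting. It also includes case studies of real models such as Zephyr and a complete mathematical derivation.</description>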
    </item>
  </channel>
</rss>
