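Cross-validation

Cross-validation, sometimes called rotation estimation^[1] ^[2] ^[3], is a practical statistical method that partitions a data sample into smaller subsets. The analysis is first carried out on one subset, and the remaining subsets are then used to confirm and validate that analysis. The initial subset is called the training set; the other subsets are called validation sets or test sets. The goal of cross-validation is to test a model's performance on new data that was not used to train it, in order to reduce problems such as overfitting and selection bias, and to indicate how well the model will generalize to an independent data set (that is, an unseen data set, such as the data encountered in a real problem).

The theory of cross-validation was initiated by Seymour Geisser. It is important for guarding against testing hypotheses suggested by the data, especially when collecting further samples is hazardous, too costly, or scientifically inappropriate.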

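Use of cross-validation

Suppose a model has one or more unknown parameters, and a data set is available that reflects the characteristics the model is meant to capture (the training set). Fitting adjusts the model's parameters so that the model matches the training set as closely as possible. If an independent sample drawn from the same population as the training data is set aside as a validation set, then overfitting caused by a training set that is too small or by unsuitable parameters will show up as poor performance on the validation set. Cross-validation is a way to estimate how well a fitted model will predict.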
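As a rough illustration of how a validation set exposes overfitting, the sketch below fits a deliberately overfit model and compares its error on the training set with its error on held-out data. The `fit_memorizer` model, the toy data, and the even/odd split are all assumptions made for this example; they do not come from the cited sources.

```python
from statistics import mean

def fit_memorizer(xs, ys):
    """Hypothetical overfit model: stores every training pair verbatim."""
    table = dict(zip(xs, ys))
    fallback = mean(ys)  # used for inputs never seen during training
    return lambda x: table.get(x, fallback)

def mean_squared_error(model, xs, ys):
    """Average squared difference between model output and target."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

# Toy data, assumed for illustration: y = 2x on the integers 0..9.
xs = list(range(10))
ys = [2 * x for x in xs]

# Even positions form the training set, odd positions the validation set.
train_x, train_y = xs[0::2], ys[0::2]
valid_x, valid_y = xs[1::2], ys[1::2]

model = fit_memorizer(train_x, train_y)
train_error = mean_squared_error(model, train_x, train_y)
valid_error = mean_squared_error(model, valid_x, valid_y)
print("training MSE:", train_error)
print("validation MSE:", valid_error)
```

The memorizer reproduces the training set perfectly, so its training error says nothing about how it behaves on new inputs; only the validation error reveals that it has not learned the underlying relationship.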

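Common forms of cross-validation

Holdout validation

Strictly speaking, holdout validation is not a form of cross-validation, because the data are never crossed over. A portion of the original sample is selected at random to form the validation data, and the remainder is used as training data. Typically, less than one third of the original sample is used for validation. ^[4]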
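The holdout split described above can be sketched in a few lines. This is a minimal version using only the standard library; the function name `holdout_split`, the default fraction of 0.25, and the fixed seed are assumptions chosen for the example.

```python
import random

def holdout_split(samples, validation_fraction=0.25, seed=0):
    """Randomly split samples into (training, validation) lists."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    # The first n_validation shuffled items are held out; the rest train the model.
    return shuffled[n_validation:], shuffled[:n_validation]

training, validation = holdout_split(range(20), validation_fraction=0.25)
print(len(training), "training samples,", len(validation), "validation samples")
```

Each sample lands in exactly one of the two lists, and the split is made only once, which is why the data are never crossed over.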

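k-fold cross-validation

In k-fold cross-validation, the sample is partitioned into k subsamples. A single subsample is retained as the data for validating the model, and the other k − 1 subsamples are used for training. The process is repeated k times, with each subsample used exactly once for validation; the k results are then averaged, or combined in some other way, to produce a single estimate. The advantage of this method over repeated random subsampling is that every observation is used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is the most common choice.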
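The procedure can be sketched as follows. This is an illustrative implementation, not a reference one: the helper names `k_fold_splits` and `cross_validate`, the mean-only predictor used as the model, and the sample values are all assumptions made for the example.

```python
import random
from statistics import mean

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) once per fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives k disjoint folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

def cross_validate(ys, k, seed=0):
    """Average validation MSE of a mean-only predictor over k folds."""
    fold_errors = []
    for train, validation in k_fold_splits(len(ys), k, seed):
        # "Training" here just means taking the mean of the training targets.
        prediction = mean(ys[i] for i in train)
        fold_errors.append(mean((ys[i] - prediction) ** 2 for i in validation))
    # Combine the k per-fold results into a single estimate by averaging.
    return mean(fold_errors)

ys = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 9.0, 6.0, 7.0]
score = cross_validate(ys, k=5)
print("5-fold estimate of MSE:", score)
```

Any model with a fit step and a predict step could replace the mean-only predictor; the splitting logic stays the same.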

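Leave-one-out cross-validation

As the name suggests, leave-one-out cross-validation (LOOCV) uses a single observation from the original sample as the validation data and the remaining observations as the training data. This is repeated until every observation has served once as the validation data. It is equivalent to k-fold cross-validation with k equal to the number of observations in the original sample.^[5] In some cases efficient algorithms exist, for example for kernel regression and Tikhonov regularization.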
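To make the idea of an efficient algorithm concrete, the sketch below compares brute-force LOOCV with a closed-form shortcut for the simplest possible model, a predictor that returns the sample mean. This model is an assumption chosen for illustration; it is not the kernel regression or Tikhonov case mentioned above, although those rely on a similar identity. For the sample mean over n observations, the leave-one-out residual for observation i equals (y_i − mean) / (1 − 1/n), so the model never has to be refit.

```python
from statistics import mean

def loocv_mse_brute_force(ys):
    """Refit the mean predictor n times, leaving out one observation each time."""
    errors = []
    for i in range(len(ys)):
        training = ys[:i] + ys[i + 1:]  # every observation except the i-th
        prediction = mean(training)
        errors.append((ys[i] - prediction) ** 2)
    return mean(errors)

def loocv_mse_shortcut(ys):
    """Same quantity from a single fit, using the leave-one-out residual identity."""
    n = len(ys)
    full_mean = mean(ys)
    # Each full-sample residual is rescaled by 1 / (1 - 1/n).
    return mean(((y - full_mean) / (1 - 1 / n)) ** 2 for y in ys)

ys = [3.0, 5.0, 4.0, 6.0, 8.0]
brute = loocv_mse_brute_force(ys)
fast = loocv_mse_shortcut(ys)
print("brute force:", brute)
print("shortcut:", fast)
```

The brute-force version does n fits, while the shortcut does one; both need at least two observations.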

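Error estimation

The estimation error can be computed from the validation results. Common measures of error are the mean squared error (MSE) and the root mean squared error (RMSE), which correspond to the variance and the standard deviation of the validation errors when those errors average to zero.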
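Both measures are straightforward to compute from paired targets and predictions. The following is a minimal sketch; the function names and the sample values are assumptions made for the example.

```python
import math

def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    if not targets or len(targets) != len(predictions):
        raise ValueError("targets and predictions must be non-empty and equal in length")
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)

def root_mean_squared_error(targets, predictions):
    """Square root of the MSE, expressed in the same units as the targets."""
    return math.sqrt(mean_squared_error(targets, predictions))

# Hypothetical validation targets and model predictions.
targets = [1.0, 2.0, 3.0, 4.0]
predictions = [3.0, 0.0, 5.0, 2.0]

mse = mean_squared_error(targets, predictions)
rmse = root_mean_squared_error(targets, predictions)
print("MSE:", mse, "RMSE:", rmse)
```

RMSE is often easier to interpret than MSE because it is on the same scale as the quantity being predicted.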

References

[1] Kohavi, Ron. A study of cross-validation and bootstrap for accuracy estimation and model selection. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. 1995, 2 (12): 1137–1143. Morgan Kaufmann, San Mateo. Retrieved 2008-07-14; archived from the original on 2008-03-25.
[2] Chang, J., Luo, Y., and Su, K. 1992. GPSM: a Generalized Probabilistic Semantic Model for ambiguity resolution. In Proceedings of the 30th Annual Meeting on Association For Computational Linguistics (Newark, Delaware, June 28 - July 02, 1992). Annual Meeting of the ACL. Association for Computational Linguistics, Morristown, NJ, 177-184.
[3] Devijver, P. A., and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice-Hall, London, 1982.
[4] Tutorial 12. Decision Trees Interactive Tutorial and Resources. Retrieved 2006-06-21; archived from the original on 2006-06-23.
[5] Elements of Statistical Learning: data mining, inference, and prediction. 2nd Edition. web.stanford.edu. Retrieved 2019-04-04; archived from the original on 2021-01-22.

External links

Naive Bayes implementation with cross-validation in Visual Basic (includes executable and source code)
A generic k-fold cross-validation implementation (free open source; includes a distributed version that can utilize multiple computers and in principle can speed up the running time by several orders of magnitude.)
