Abstract:
Purpose To evaluate the diagnostic efficacy of generative artificial intelligence (GenAI) in detecting errors in statistical results reported in biomedical journals.
Methods A convenience sampling method was used to include 21 positive samples (articles containing statistical errors in their results) and 21 negative samples (articles without such errors). Kimi and DeepSeek (the latter in both deep thinking and non-deep thinking modes) were employed in combination with three prompting strategies [direct questioning, literature reference (single paper or two papers jointly), and terminology prompting (full-text review or table-by-table review)], forming 10 combined strategies. The diagnostic performance of these 10 strategies was compared.
Findings DeepSeek deep thinking demonstrated the highest sensitivity (47.6%-100.0%) and accuracy (71.4%-90.5%) across all prompting methods, followed by DeepSeek non-deep thinking, with Kimi performing worst. Within the same GenAI model, literature reference and terminology prompting yielded higher sensitivity and accuracy than direct questioning. "DeepSeek deep thinking + literature reference (two papers jointly)" achieved the highest accuracy (90.5%), while "DeepSeek deep thinking + terminology prompting (table-by-table review)" yielded the highest sensitivity (100.0%) and the most comprehensive error detection. Except for "DeepSeek deep thinking + terminology prompting" (specificity: 85.7% for full-text review, 71.4% for table-by-table review), all other combined strategies demonstrated specificity above 90%.
Conclusions DeepSeek deep thinking demonstrated superior performance in detecting statistical errors in biomedical journal articles. Literature reference and terminology prompting were more effective than direct questioning, although false positives occurred. Editorial offices are advised to adopt "DeepSeek deep thinking + literature reference (two papers jointly)" for initial manuscript screening, and "DeepSeek deep thinking + terminology prompting (table-by-table review)" for detailed statistical verification of flagged manuscripts and unpublished submissions, followed by manual verification.
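The reported metrics follow the standard definitions of sensitivity, specificity, and accuracy for a binary screening task. As a minimal illustrative sketch (not the authors' code), the snippet below computes these metrics from a 2x2 confusion matrix under the stated design of 21 positive and 21 negative samples; the counts for the example strategy are inferred from the reported percentages, not quoted from the paper.

```python
# Illustrative sketch: diagnostic metrics for a binary error-screening task,
# assuming the abstract's design of 21 positive and 21 negative samples.

def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: error-containing articles correctly flagged."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: error-free articles correctly passed."""
    return tn / (tn + fp)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Overall proportion of correct judgements across all 42 samples."""
    return (tp + tn) / (tp + tn + fp + fn)

# Example: "DeepSeek deep thinking + terminology prompting (table-by-table review)"
# reports 100.0% sensitivity and 71.4% specificity; with 21/21 samples this
# corresponds to TP=21, FN=0, TN=15, FP=6 (counts inferred, not from the paper).
tp, fn, tn, fp = 21, 0, 15, 6
print(f"sensitivity = {sensitivity(tp, fn):.1%}")       # 100.0%
print(f"specificity = {specificity(tn, fp):.1%}")       # 71.4%
print(f"accuracy    = {accuracy(tp, tn, fp, fn):.1%}")  # 85.7%
```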
Key words: Biomedical journals; Generative artificial intelligence; Statistical review; Sensitivity; Specificity; Accuracy
ZHENG Qiaolan, JIANG Yuxia, WANG Jingzhou. Diagnostic efficacy of Kimi and DeepSeek in detecting statistical results errors in biomedical journals and suggestions for the use[J]. Chinese Journal of Scientific and Technical Periodicals, 2025, 36(11): 1470-1477.