[Purposes] This study evaluates how reliable the similarity indexes in plagiarism check reports are as a criterion for judging whether a manuscript is a duplicate, and identifies the causes of misjudgments. [Methods] The plagiarism check reports of 642 papers generated by CrossCheck/iThenticate were manually verified. Evaluation metrics for classification algorithms were used to assess the reliability of the similarity indexes, and the causes of misjudgments were analyzed. [Findings] Both the overall similarity index methods [total similarity (TS) and main-body similarity (MS)] and the single similarity (SS) method had an accuracy below 75%. The SS method balanced recall (85%) and precision (47%) comparatively well (F1 = 0.61). Ranked by reliability, the three methods were SS, MS, and TS in descending order, yet a large number of manuscripts were still misjudged. [Conclusions] With appropriate thresholds, MS and SS can serve as references for judging whether a manuscript is a duplicate, but error-prone items still need manual verification, and the detection results of the plagiarism check system should not be relied on excessively.
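The accuracy, recall, precision, and F1 values reported in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the "flag when similarity is at or above the threshold" rule, and the five toy manuscripts are assumptions made purely for illustration. Only the final check uses figures from the abstract, confirming that a precision of 47% and a recall of 85% give an F1 of about 0.61.

```python
def confusion_counts(similarities, is_duplicate, threshold):
    """Count TP, FP, FN, TN for a threshold rule.

    Assumption: a manuscript is flagged as a duplicate when its
    similarity percentage is >= threshold. `is_duplicate` holds the
    manually verified labels (True = really a duplicate).
    """
    tp = fp = fn = tn = 0
    for similarity, truth in zip(similarities, is_duplicate):
        flagged = similarity >= threshold
        if flagged and truth:
            tp += 1
        elif flagged and not truth:
            fp += 1
        elif not flagged and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


# Hypothetical toy data, NOT taken from the study's 642 reports.
toy_similarities = [40, 10, 35, 5, 20]             # similarity percentages
toy_labels = [True, False, False, False, True]     # manual verification
toy_threshold = 30                                  # hypothetical threshold

toy_counts = confusion_counts(toy_similarities, toy_labels, toy_threshold)
toy_metrics = metrics(*toy_counts)

# Consistency check on the figures stated in the abstract.
reported_f1 = f1_score(0.47, 0.85)
print(toy_counts, toy_metrics, round(reported_f1, 2))
```

Raising or lowering `toy_threshold` trades recall against precision, which is why the abstract treats the choice of threshold as a precondition for using MS or SS as a reference.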