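Abstract:
Purposes To explore the boundaries of rights and responsibilities between suppliers and users of published data in AI model training scenarios, providing a reference for improving copyright protection of data published in China's scientific journals and for building fair and reasonable circulation rules. Methods Stakeholders in scientific journal publishing, including publishing institutions, industry associations, and government departments, were taken as the research objects. Their publicly released copyright agreements, technical terms, position statements, policies, and regulatory texts were retrieved, and the characteristics of the copyright agreements, technical restrictions, and obligation clauses were analyzed. Using case analysis, the practices of five publishing institutions (Elsevier, Springer Nature, Sage, Wiley, and Taylor & Francis) were selected to compare the copyright infringement risks of different modes of using published data for AI model training. Results Only a few copyright agreements explicitly stipulate the various scenarios involving AI training. Publishing institutions either set technical restriction clauses, take a neutral stance, or adopt data management models better suited to AI training scenarios. Publishing institutions with a higher share of open-access papers are more inclined to proactively provide data access. Publishing institutions that use published data for AI model training adopt two strategies: in-house development under “internal fair use” and licensing content to third parties. The two differ in the scope of data circulation, and the copyright boundary disputes involved are, respectively, the definition of “fair use” and the completeness of the authorization chain. Conclusions To meet the demand for AI model training, copyright agreements should be supplemented with the applicable circumstances of “fair use” or with “sublicensing” clauses; publishing institutions should formulate copyright management schemes oriented to AI model training needs; holders of published data and AI model developers should clarify data permissions and responsibility boundaries, providing a safeguard for building market-oriented mechanisms for distributing data circulation revenue and for resolving disputes; and learned societies, associations, and government departments should issue dedicated copyright guidelines for the use of published data in AI training, regulate an environment for compliant and efficient circulation of published data, explore fair revenue distribution mechanisms, and guide the orderly development of the industry.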
Key words:
Scientific journals,
Published data,
Web crawler,
Text and data mining,
Artificial intelligence,
Model,
Copyright
Abstract:
Purposes To explore the boundaries of rights and responsibilities between suppliers and users of published data in AI model training scenarios, and to provide a reference for improving copyright protection of data published in scientific journals in China and for establishing fair and reasonable circulation rules. Methods This study examines stakeholders in scientific journal publishing, including publishing institutions, industry associations, and government bodies, by analyzing publicly available copyright agreements, technical terms, position statements, policies, and regulatory documents to characterize their copyright clauses, technical restrictions, and obligation terms. A case study approach is adopted, focusing on the practices of five leading publishing institutions: Elsevier, Springer Nature, Sage, Wiley, and Taylor & Francis. The study compares the copyright infringement risks associated with different models of using published data for AI model training. Findings At the scientific journal level, copyright transfer agreements have yet to be substantially updated to address AI model training. Even among leading publishers, which are key players in protecting the copyright of published data, only a small number of copyright agreements explicitly address the various scenarios related to AI training. Publishing institutions variously adopt technical restriction clauses, maintain a neutral stance, or implement data management models better suited to AI training scenarios. Publishers with a higher proportion of open-access papers are more inclined to proactively provide data access. Publishing institutions that use published data for AI model training adopt one of two strategies: in-house development under “internal fair use”, or licensing content to third parties. The two strategies differ in the scope of data circulation, and the copyright boundary disputes they raise concern, respectively, the definition of “fair use” and the completeness of the authorization chain.
Conclusions To address the needs of AI model training, copyright agreements should be supplemented with provisions clarifying when “fair use” applies or with “sublicensing” clauses. Publishing institutions should develop copyright management frameworks tailored to AI model training requirements. Holders of published data and AI model developers should clearly define data permissions and responsibility boundaries, thereby providing a foundation for market-oriented mechanisms for distributing data circulation revenue and resolving disputes. Academic societies, associations, and government departments should issue dedicated copyright guidelines for the use of published data in AI training, regulate the compliant and efficient circulation of such data, explore equitable revenue-sharing mechanisms, and guide the orderly development of the industry.
Key words:
Scientific journals,
Published data,
Web crawler,
Text and data mining,
Artificial intelligence,
Model,
Copyright
倪婧, 郝秀原, 任胜利, 张久珍. 出版数据用于人工智能模型训练的期刊版权保护问题研究[J]. 中国科技期刊研究, 2026, 37(2): 173-180.
NI Jing, HAO Xiuyuan, REN Shengli, ZHANG Jiuzhen. Copyright protection of journal-published research data used for artificial intelligence model training[J]. Chinese Journal of Scientific and Technical Periodicals, 2026, 37(2): 173-180.