Chinese Journal of Scientific and Technical Periodicals (中国科技期刊研究), 2016, Vol. 27, Issue (2): 202-206. doi: 10.11946/cjstp.201509280939

• Digital Publishing •

An exploration of extracting general metadata from Founder typesetting files and generating full-text HTML

YANG Hailiang, XU Yongji

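  1. Editorial Office of Journal of Northeastern University, No. 11, Lane 3, Wenhua Road, Heping District, Shenyang, Liaoning Province 110819
  • Received: 2015-09-28 Revised: 2015-12-24 Online: 2016-02-15 Published: 2016-02-15
  • About the authors: YANG Hailiang (ORCID: 0000-0003-3605-584X), editor, E-mail: yhl@mail.neu.edu.cn | XU Yongji, senior editor, director
  • Funding:
    Liaoning Provincial Social Science Planning Fund Project (L12DXW011)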

Study on extracting general metadata from Founder typesetting files and generating full-text HTML

YANG Hailiang, XU Yongji

  1. Editorial Office of Journal of Northeastern University, 11 Wenhua Sanxiang Road, Heping District, Shenyang 110819, China
  • Received: 2015-09-28 Revised: 2015-12-24 Online: 2016-02-15 Published: 2016-02-15

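Abstract:

[Purposes] To automatically extract full-text metadata from science and technology journals and generate HTML files. [Methods] Taking Founder typesetting files as the object, and building on the ability to extract metadata such as an article's title and abstract, the body of the article is also converted into metadata. The concept of general metadata, which covers figures, tables, formulas and similar elements, is proposed; extraction rules for figure and table metadata are established; and mathematical formulas in Founder typesetting are converted into LaTeX expressions. A program written with VB programming software then extracts the general metadata automatically and recombines it into an HTML file. [Findings] The extraction rules, built on the characteristics of the Founder BD typesetting language, effectively extract the full text and convert it into metadata, from which an HTML file can be generated directly. [Conclusions] Practical application demonstrates the effectiveness and feasibility of generating HTML files from general metadata.

Key words: general metadata, Founder BD typesetting language, VB programming software, automatic full-text extraction, HTML file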

Abstract:

[Purposes] This paper aims to automatically extract full-text metadata from science and technology journals and generate HTML files. [Methods] Taking Founder typesetting files as the object, and building on the extraction of metadata such as titles and abstracts, we convert the body text into metadata. The concept of general metadata (GM) is proposed, which includes figure, table and formula metadata. Extraction rules for figure and table metadata are established, and a conversion from Founder formulas to LaTeX is proposed. A program written with VB programming software then extracts the GM, and the GM is combined to generate the full-text HTML file. [Findings] Based on the characteristics of the BD typesetting language, the extraction rules extract the full-text metadata effectively, and the HTML file can be generated directly. [Conclusions] Practical application shows the effectiveness and feasibility of using general metadata to generate HTML files.

Key words: General metadata (GM), Founder BD typesetting language, VB programming software, Automatic full-text extraction, HTML file
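
The pipeline summarized in the abstract (rule-based extraction of title, abstract and figure metadata, followed by recombination into HTML) can be illustrated with a minimal Python sketch. It is an illustration only, not the authors' VB program: the `[[TITLE]]`, `[[ABSTRACT]]` and `[[FIG src=...]]` markers are hypothetical placeholders and do not reproduce Founder BD commands, and the function names `extract_metadata` and `to_html` are assumptions made for this example.

```python
import html
import re

# Hypothetical markers standing in for typesetting annotations.
# They are NOT real Founder BD commands; a real rule set would target
# the actual annotations found in the typesetting file.
RULES = {
    "title": re.compile(r"\[\[TITLE\]\](.*?)\[\[/TITLE\]\]", re.S),
    "abstract": re.compile(r"\[\[ABSTRACT\]\](.*?)\[\[/ABSTRACT\]\]", re.S),
}
FIGURE_RULE = re.compile(r"\[\[FIG src=(\S+?)\]\](.*?)\[\[/FIG\]\]", re.S)


def extract_metadata(source: str) -> dict:
    """Apply each extraction rule and collect the matches as metadata."""
    meta = {}
    for key, rule in RULES.items():
        match = rule.search(source)
        meta[key] = match.group(1).strip() if match else ""
    # Figures are treated as metadata too: one record per figure.
    meta["figures"] = [
        {"src": src, "caption": caption.strip()}
        for src, caption in FIGURE_RULE.findall(source)
    ]
    return meta


def to_html(meta: dict) -> str:
    """Recombine the extracted metadata into a small HTML fragment."""
    parts = [
        "<article>",
        "<h1>" + html.escape(meta["title"]) + "</h1>",
        '<p class="abstract">' + html.escape(meta["abstract"]) + "</p>",
    ]
    for fig in meta["figures"]:
        parts.append(
            '<figure><img src="' + html.escape(fig["src"]) + '">'
            "<figcaption>" + html.escape(fig["caption"]) + "</figcaption></figure>"
        )
    parts.append("</article>")
    return "\n".join(parts)
```

Keeping extraction and HTML assembly as separate steps mirrors the abstract's description: the metadata dictionary is the intermediate form, so further rules (for tables or formulas, for instance) could be added without touching the HTML generator.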