PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format

Miao Zhu

† Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.

Jacqueline M. Cole

† Cavendish Laboratory, Department of Physics, University of Cambridge, J. J. Thomson Avenue, Cambridge CB3 0HE, U.K.

‡ ISIS Neutron and Muon Source, STFC Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxfordshire OX11 0QX, U.K.

§ Department of Chemical Engineering and Biotechnology, University of Cambridge, West Cambridge Site, Philippa Fawcett Drive, Cambridge CB3 0AS, U.K.


Received: October 1, 2021. Copyright © 2022 American Chemical Society

Permits the broadest form of re-use including for commercial purposes, provided that author attribution and integrity are maintained (https://creativecommons.org/licenses/by/4.0/).

Associated Data

ci1c01198_si_001.zip

Abstract


The layout of portable document format (PDF) files is constant across all screens, and the metadata therein are latent compared to those in mark-up languages such as HTML and XML. No semantic tags are usually provided, and a PDF file is not designed to be edited or its data interpreted by software. However, data held in PDF files need to be extracted in order to comply with open-source data requirements that are now government-regulated. In the chemical domain, related chemical and property data also need to be found, and their correlations need to be exploited to enable data science in areas such as data-driven materials discovery. Such relationships may be realized using text-mining software such as the “chemistry-aware” natural-language-processing tool, ChemDataExtractor; however, this tool has limited data-extraction capabilities from PDF files. This study presents the PDFDataExtractor tool, which can act as a plug-in to ChemDataExtractor. It outperforms other PDF-extraction tools for the chemical literature by coupling its functionalities to the chemical named-entity-recognition capabilities of ChemDataExtractor. The intrinsic PDF-reading abilities of ChemDataExtractor are thereby much improved. The system features a template-based architecture, which enables semantic information to be extracted from the PDF files of scientific articles in order to reconstruct the logical structure of each article. While other existing PDF-extraction tools focus on the quantity of mined output, this template-based system focuses on the quality of mining across different layouts. PDFDataExtractor outputs information in JSON and plain text, including the metadata of a PDF file, such as paper title, authors, affiliation, email, abstract, keywords, journal, year, digital object identifier (DOI), reference, and issue number. With a self-created evaluation article set, PDFDataExtractor achieved promising precision for all key assessed metadata areas of the document text.

Introduction

The number of publications has grown ever more rapidly since the digitalization of publishing, 1 providing a more efficient platform for scientific communities to share research results. This large number of publications has led to the literature becoming a form of “Big Data.” Such data are useful to data science, which has evolved into a research field owing to the stepwise increase in data capacity for high-performance computing and the increasing availability of open-source scientific data and software code. The field of “Big Data” has thus emerged to produce exciting opportunities for discovering new science from patterns found in large arrays of data. Such patterns are best found when data are mined from a structured assembly of information (a database) that contains the most relevant information about the problem at hand.

However, related data are difficult to collate. This is because researchers typically share scientific results through many distinct reports, which can take a variety of forms such as academic papers, technical reports, books, patents, dissertations, or theses. Data are thus strewn across scientific documents in a highly fragmented form. A document may feature unstructured data (e.g., in-line text) or semistructured data (e.g., a table of information), while related data may span many documents. To become useful, related data need to be structured and collated in a fashion that auto-builds a database. Text-mining tools that employ natural-language processing (NLP) have enabled the structuring and collation of related data. Open-source software packages, such as CoreNLP 2 and Spacy, 3 can mine general-language text. However, such tools perform poorly when applied to the scientific domain, owing to its highly specialized language and writing style. The “chemistry-aware” NLP-based text-mining tool, ChemDataExtractor, 4 was created to overcome this limitation.

ChemDataExtractor 4 uses an NLP-enabled workflow that is geared specifically to mine chemistry-related information from publications. ChemDataExtractor 4 performs best on scientific literature that is imported as a mark-up language, for example, HTML or XML. This is because literature provided in the HTML or XML format is suitable for parsing into sections that are semantically marked. 5 For example, “PDFDataExtractor: A Tool for Reading Scientific Text and Interpreting Metadata from the Typeset Literature in the Portable Document Format” in this document would be tagged as “title” in a mark-up language. Other sections such as headings, paragraphs, captions, and tables are also tagged in the literature. Therefore, once combined with such auxiliary semantic information, it is possible to perform analysis on one or more user-defined sections. Thus, textual noise from document features such as headers, page numbers, and author affiliations can be prevented from being fed into the extraction pipeline.
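To illustrate the kind of semantic mark-up referred to above, the minimal sketch below parses a JATS-style XML article and reads the title and abstract directly from their tags. The element names follow the JATS convention and may differ between publishers, and the file name "article.xml" is a placeholder.

```python
# Minimal sketch: reading semantically tagged sections from a JATS-style
# XML article. Element names are JATS conventions and may differ by publisher.
from xml.etree import ElementTree as ET

tree = ET.parse("article.xml")  # placeholder file name
root = tree.getroot()

# The mark-up makes the logical role of each section explicit.
title = root.findtext(".//article-title")
abstract = root.find(".//abstract")

print("Title:", title)
if abstract is not None:
    # Gather all text nested inside the <abstract> element.
    abstract_text = " ".join("".join(abstract.itertext()).split())
    print("Abstract:", abstract_text)
```

Because every section is explicitly labelled, selecting only the abstract (or only the references) for downstream NLP is a one-line query rather than a layout-analysis problem.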

Unlike HTML and XML, the layout of the literature provided in the portable document format (PDF) stays the same across all different viewing devices. 5 No semantic tags are usually provided in PDF files 6 as the text within this format was not originally designed to be read or interpreted by software programs. Nevertheless, many NLP applications rely on semantic information about the text fed into the pipeline. 7 For example, if one wants to perform NLP analysis solely on abstracts, or to find author affiliations from the references of a set of scientific documents provided as PDF files, it is essential to identify the semantic roles of their text blocks correctly; that way, only text from the abstract or reference section of each document is fed to a software program for extraction and analysis. 8 Although most articles can be accessed through HTML or XML, a large number of articles can only be accessed in the PDF format. 5 Services such as literature mining and database creation rely on accurate metadata from articles. However, metadata are sometimes missing. 9

ChemDataExtractor 4 has only very limited proficiency in extracting and interpreting data from PDF files. This limited functionality relies on a PDF-layout-analysis tool called PDFMiner 10 to process input files. PDFMiner 10 is a PDF-file extraction tool that outputs excellent results in terms of PDF-layout analysis, that is, representing a PDF file as a set of text blocks with correct reading sequences, together with the positions and fonts of the individual characters. 6 However, the extraction ability of PDFMiner 10 is primitive, in that only limited semantic information about text blocks is extracted. This is because PDFMiner 10 is essentially a structural-analysis package, and no identification of the logical role of text blocks is performed. 11
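For illustration, the minimal sketch below uses pdfminer.six (the maintained fork of PDFMiner) to iterate over the text blocks of a PDF file in reading order and to print each block's bounding box, character fonts, and the start of its text; the file name "article.pdf" is a placeholder. Note that no logical role (title, abstract, caption, and so on) is attached to any block, which is precisely the limitation described above.

```python
# Minimal sketch of PDFMiner-style layout analysis using pdfminer.six.
# Text blocks come with geometry and fonts, but no semantic labels.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar

for page_number, page_layout in enumerate(extract_pages("article.pdf"), start=1):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            x0, y0, x1, y1 = element.bbox  # block position on the page
            fonts = {ch.fontname
                     for line in element
                     for ch in line
                     if isinstance(ch, LTChar)}
            print(page_number,
                  (round(x0), round(y0), round(x1), round(y1)),
                  sorted(fonts),
                  repr(element.get_text()[:40]))
```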

Extracting information from PDF files is well studied, and notable results have been achieved in previous studies. Available data-extraction solutions usually tackle the problem by utilizing either rule-based or machine-learning-based approaches. PDFMiner, 10 PDFX, 1 pdftotext, 12 and PDFExtract 13 represent rule-based approaches to converting PDF files. These tools generally use a combination of visual and text/content information to reconstruct the logical representation of a PDF file. Machine learning has drawn increasing attention in almost every research field. For example, Cermine, 9 ParsCit, 14 and GROBID 15 are excellent information-retrieval tools that use techniques such as support vector machines (SVMs) and conditional random fields (CRFs) to classify text. Overall, these solutions tend to use generalized methods to cover different layouts. They offer fair extraction results with different emphases. However, the driving force behind the creation of PDFDataExtractor presented herein is to (1) create and populate databases and (2) serve as a new PDF-extraction plug-in for ChemDataExtractor, 4 one that delivers metadata rather than simply performing PDF-layout analysis. To this end, we present PDFDataExtractor, a tool that extracts metadata from scientific articles using PDFMiner 10 to build text blocks.
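By way of illustration only, and not as a description of PDFDataExtractor's actual implementation, the sketch below shows how simple metadata heuristics can be layered on top of PDFMiner text blocks: it guesses the paper title as the first-page text block with the largest average font size. The file name is again a placeholder.

```python
# Illustrative heuristic only (not PDFDataExtractor's implementation):
# guess the title as the largest-font text block on the first page.
from statistics import mean

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar


def guess_title(path: str) -> str:
    first_page = next(extract_pages(path))
    best_text, best_size = "", 0.0
    for element in first_page:
        if not isinstance(element, LTTextContainer):
            continue
        sizes = [ch.size
                 for line in element
                 for ch in line
                 if isinstance(ch, LTChar)]
        if sizes and mean(sizes) > best_size:
            best_size = mean(sizes)
            best_text = " ".join(element.get_text().split())
    return best_text


print(guess_title("article.pdf"))  # placeholder input file
```

Heuristics of this kind work well only when they are tuned to a particular layout, which motivates the template-based design described next.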

This study shows how this is possible via a template-based approach. Templates can push the limits of PDF extraction further, at the expense of the number of layouts that can be handled. When building a database, precision is considered to be a more important factor than recall, which makes a template-based approach suitable. The modular system of PDFDataExtractor also facilitates future customization. Overall, five large publishers account for more than 70% of publications in chemistry and nearly 40% of those in physics. 16
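A hedged sketch of what such a template-based, modular design could look like is given below; the class and method names are illustrative assumptions rather than PDFDataExtractor's actual API. Each publisher layout is handled by its own template, and a document is only processed by a template that recognises it, which favours precision over recall and keeps the system easy to extend with new templates.

```python
# Illustrative sketch of a template-based design; names are assumptions,
# not PDFDataExtractor's real API.
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class PublisherTemplate(ABC):
    """One template per journal/publisher layout."""

    @abstractmethod
    def matches(self, text_blocks: List[str]) -> bool:
        """Return True if this template recognises the layout."""

    @abstractmethod
    def extract_metadata(self, text_blocks: List[str]) -> Dict[str, Optional[str]]:
        """Map text blocks to fields such as title, authors, and DOI."""


class ExampleTemplate(PublisherTemplate):
    """Hypothetical template keyed on a publisher-specific footer string."""

    def matches(self, text_blocks):
        return any("pubs.acs.org" in block for block in text_blocks)

    def extract_metadata(self, text_blocks):
        doi = next((b.split("DOI:", 1)[1].strip()
                    for b in text_blocks if "DOI:" in b), None)
        return {"doi": doi}


def extract(text_blocks: List[str], templates: List[PublisherTemplate]) -> Dict:
    # Only a template that recognises the layout is applied: precision over recall.
    for template in templates:
        if template.matches(text_blocks):
            return template.extract_metadata(text_blocks)
    return {}
```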