This repository contains a data set for figure retrieval experiments. It consists of a collection of 42,530 figures extracted from papers in the natural language processing domain, a set of 16,829 queries, and relevance judgments. The data set was used for the experiments in the following paper: “Figure Retrieval from Collections of Research Articles. Saar Kuzi and ChengXiang Zhai. In Proceedings of ECIR 2019.” For more details regarding the data set, please refer to the paper or contact Saar Kuzi (skuzi2@illinois.edu).

Dataset

  • Figures Collection (34M)
    Contains an XML file for each research article. The file name corresponds to the paper identifier according to the ACL Anthology format. The XML file of a paper contains the paper title, abstract, and introduction, as well as the figures that were successfully identified and extracted from the paper. Each figure element is identified by a number that is unique within that paper and is specified in the “id” attribute. For each figure, we provide its caption and the text in the paper that describes or discusses it. More specifically, a “mentionX” element of a figure contains X words before and X words after the place in the paper's full text where the figure was explicitly mentioned. The “absSentence” element contains a sentence from the abstract of the paper that is related to the figure. Finally, for some figures we also provide the corresponding image file (found under “Image Files”); the name of the file is given in the “fileName” element of the figure.
  • Queries (640K)
    Contains the set of queries. A query is essentially a figure caption with a single relevant figure, which is the figure that the caption corresponds to. The file contains two columns (separated by a tab) where the first column is the figure identifier, and the second column is the caption text. A figure identifier has the following format: “paperId_figureId” (e.g., “P11-1069_2”). The queries were pre-processed, including stopword removal and Krovetz stemming (the original caption text, if needed, can be easily located in the paper files).
  • Image Files (1.1G)
    Contains the image files for some of the figures. To obtain the file name for a figure, use the “fileName” element of that figure in the figure collection files (not available for all figures).
  • Learning to Rank Data (176M)
    Contains the data set used for training and testing the LTR algorithm proposed in the paper.
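
    The formats described above can be read with a few lines of Python. The following is a minimal sketch, not code from the paper. It assumes that the queries file is plain text with one query per line, and that figure elements in the paper XML files use a tag named “figure”; the tag name is a guess and is exposed as a parameter. The function names, the sample caption, the sample XML, and the sample image file name are all made up for illustration.

```python
import xml.etree.ElementTree as ET


def split_figure_id(figure_id):
    """Split a 'paperId_figureId' identifier into (paper_id, figure_no).

    The split is done on the last underscore, so the paper identifier
    is kept whole even if it were to contain an underscore itself.
    """
    paper_id, figure_no = figure_id.rsplit("_", 1)
    return paper_id, figure_no


def read_queries(lines):
    """Yield (paper_id, figure_no, caption) from tab-separated query lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        figure_id, caption = line.split("\t", 1)
        paper_id, figure_no = split_figure_id(figure_id)
        yield paper_id, figure_no, caption


def find_figure(xml_text, figure_no, figure_tag="figure"):
    """Return the element whose 'id' attribute equals figure_no, or None.

    ASSUMPTION: 'figure_tag' is a hypothetical tag name; check the XML
    files and pass the actual tag name if it differs.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter(figure_tag):
        if elem.get("id") == figure_no:
            return elem
    return None


if __name__ == "__main__":
    # Made-up query line in the documented "identifier<TAB>caption" layout.
    sample_queries = ["P11-1069_2\texample caption text\n"]
    # Made-up XML that only mimics the documented "id" and "fileName" parts.
    sample_xml = (
        '<paper><figure id="2">'
        "<fileName>example.png</fileName>"
        "</figure></paper>"
    )
    for paper_id, figure_no, caption in read_queries(sample_queries):
        figure = find_figure(sample_xml, figure_no)
        # The "fileName" element is not provided for every figure.
        file_name = figure.findtext("fileName") if figure is not None else None
        print(paper_id, figure_no, caption, file_name)
```

    In practice, the paper identifier returned by split_figure_id would be used to locate the XML file of the paper, since the file names follow the paper identifiers.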

Auxiliary Files

  • Full Text (239M)
    Contains the full text of the papers, extracted from the PDF files using the Grobid toolkit.
  • Stopwords List (4K)
    A list of stopwords that were used in our experiments for pre-processing of figures and queries.