Bibliographic Data

Open Academic Graph based data:

Open Academic Graph contains data from Microsoft Academic Graph (MAG) and AMiner. We have preprocessed the MAG data to make them faster to use with MySQL (or any other SQL or NoSQL system), especially when you mainly need the citation graph. The ".sql" files contain the MySQL CREATE TABLE commands. The ".txt.bz2" files contain bzip2-compressed, tab-separated data files. The following data were produced from MAG papers (2017-06-09). You can download the data from here.

  • paper_idshex: Contains the mapping from Microsoft paper IDs, which need 256 bits of storage each, to 64-bit long integer IDs.
  • ref: Contains the citation graph. Each line contains "From PaperID" "To PaperID".
  • fos: The fields of study. Each line contains "fos_id" and "fos_name".
  • venue: The venue collection. Line format: "name" "vid" "venue_type".
  • paper_fos: The connection between paper and fos. Line format: "Paper ID" "fos ID (foreign key)".
  • paper: The publications. Line format: "Paper ID" "Pub year" "venue_id (foreign key)" "Pub Type"
  • paper_txt: The publications' titles and doi.
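
As an illustration of working with these files without a database, the following minimal Python sketch streams a bzip2-compressed, tab-separated citation file and counts incoming citations per paper. The file name "ref.txt.bz2", the absence of a header line, and the use of integer IDs in both columns are assumptions made for this example; check the accompanying ".sql" file for the actual column definitions.

```python
import bz2
from collections import Counter


def count_citations(path):
    """Count incoming citations per paper from a tab-separated citation file.

    Assumes (not confirmed by the dataset description) that each line holds
    "From PaperID<TAB>To PaperID" as integers and that there is no header line.
    """
    counts = Counter()
    # bz2.open in text mode decompresses on the fly, so the file is never
    # fully loaded into memory.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            cited_id = int(fields[1])  # second column: the cited paper
            counts[cited_id] += 1
    return counts
```

Calling count_citations("ref.txt.bz2").most_common(10) would then list the ten most cited paper IDs, assuming the file has been downloaded under that (hypothetical) name.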

Data last update: 08-2018


Older Bibliographic Data

Microsoft Academic Search (API v1) based data:

Microsoft discontinued the Microsoft Academic Search API v1 in 2016. However, we had downloaded these data and used them for various experiments, including those performed in [1]. We publish these data so that other researchers can reproduce our experiments and compare against our results.

These data were generated by querying the MAS (Microsoft Academic Search) API v1 for Computer Science authors with h-index >= 5. After receiving the author_ids, we gathered their publications as well as the papers that cite them or are cited by them (in/out references).

If you use these data, please cite publication [1] and credit Microsoft for offering their collected data and API. You can get these data from here (please contact us for access).

  • ms_author_sample: Contains the author_ids for our sample. Note that these IDs are no longer valid in MAG, since Microsoft has fully rebuilt its database.
  • ms_pub_sample: Contains the paper_ids for our sample. The publications of authors within the base sample are marked as 'direct'. Publications marked as 'have-refs_to_me' are included in the dataset because they reference a publication in the base sample. Publications marked as 'have-refs_from_me' are included in the dataset because they are cited by publications in the base sample.
  • ms_pub: Info about a publication. The field year may have been produced in one of three ways.
    • It is equal to the original year info from MAS.
    • If the original year in MAS is null, we search for the publication in dblp. In this way we have filled in some missing years (marked as 'dblp' in the year_estim_by field).
    • Otherwise, we may estimate the year of publication. If all the references of the publication are in the years x-n, ..., x-1, x and all the references TO the publication come from publications of the years x, x+1, x+2, ..., x+k, then we can assume that the publication year is x (marked as 'estim' in the year_estim_by field).
  • ms_author: Info about authors. Authors not belonging to the base sample may be included in this table.
  • ms_pub_authors: The authors for each publication.
  • ms_citation: The citations. Each citation records the number of common authors between the citing and the cited publication. If the author lists of the citing and cited publications are not both fully known, the number of common authors is null.
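
The year estimation rule described for ms_pub can be sketched in Python as follows. This is one possible reading of the rule, not the code used to build the dataset: it returns a year only when the latest year among the publication's references equals the earliest year among the publications citing it, so that x is determined uniquely. The function name and the choice to return None in every other case are assumptions made for this example.

```python
from typing import Iterable, Optional


def estimate_year(reference_years: Iterable[int],
                  citing_years: Iterable[int]) -> Optional[int]:
    """Estimate a missing publication year from its citation neighbourhood.

    reference_years: known years of the papers this publication cites.
    citing_years: known years of the papers that cite this publication.
    Returns x when max(reference_years) == min(citing_years) == x,
    otherwise None (an assumption: the source does not say how gaps
    between the two bounds are handled).
    """
    refs = list(reference_years)
    cites = list(citing_years)
    if not refs or not cites:
        return None  # both bounds are needed to pin down a single year
    latest_reference = max(refs)   # publication year cannot be earlier
    earliest_citing = min(cites)   # publication year cannot be later
    if latest_reference == earliest_citing:
        return latest_reference
    return None
```

For example, a publication whose references date from 2001, 2003 and 2005 and which is cited by papers from 2005 and 2007 would be assigned the year 2005 under this reading.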

[1] Stoupas, Georgios and Sidiropoulos, Antonis and Gogoglou, Antonia and Katsaros, Dimitrios and Manolopoulos, Yannis: Rainbow Ranking: An adaptable, multidimensional ranking method for publication sets, Scientometrics, Vol. 116, No. 1, pp. 147-160, 2018 [BibTex]

Data last update: 07-2016