For each citation path, one script generates a list of PubMed IDs (PMIDs), and two further scripts add the iCite and OpenAlex metadata for those PMIDs. Each source-specific Python script uses a different set of APIs to generate the initial set of PMIDs; its output is then processed by icite.py and openalex.py, which use their respective APIs to add the necessary keys. The results from openalex.py are then ready to be analyzed alone or in combination with the other sources.
Each "source" has a separate program to generate the initial set of citations:
- keyword.py for the keyword search
- cfde.py for scraping the CFDE citations page
- flagship.py for searching flagship paper citations
For example, if we want to generate the keywords dataset:
keyword.py --keyword-key path-to-keyword-key.csv
This uses path-to-keyword-key.csv to build a JSON file with a list of entries and the PMIDs for each item.
This program writes a JSON file containing a list of DCCs or competitors, where each entry has the following form:
[
{"competes_with": null,
"type": "cfde_dcc",
"program": "HuBMAP",
"pmid_list": [1232145, 1231245, ...]
},
// ...
]
The JSON is written to a file named keyword_results.json. The other programs write their output to cfde_results.json and flagship_results.json, respectively.
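As a quick sanity check on this first-stage output, the entries can be summarized per program. The following is a minimal sketch that assumes only the entry structure shown above; the helper name `count_pmids` and the sample values are illustrative and are not part of the repository.

```python
import json
from collections import Counter


def count_pmids(entries):
    """Return a Counter mapping program name -> number of PMIDs found."""
    counts = Counter()
    for entry in entries:
        # Deduplicate within an entry in case a PMID was returned twice.
        counts[entry["program"]] += len(set(entry["pmid_list"]))
    return counts


# Illustrative data in the same shape as keyword_results.json.
sample = json.loads("""
[
  {"competes_with": null, "type": "cfde_dcc", "program": "HuBMAP",
   "pmid_list": [1232145, 1231245, 1232145]}
]
""")

print(count_pmids(sample))
```

To run this on real output, load the file with `json.load` in place of the inline sample.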
The result files from this initial step are processed (separately) by icite.py, which requires the path to one of these files. The file name is prefixed by "keyword", "cfde", or "flagship".
The three input metadata files and flags for usage are:
- python keyword.py --keyword-key input/keyword-key.csv (produces the keyword results, keyword_results.json)
- python commonfund/flagship.py --flagship-key input/flagships.csv (produces the "papers citing flagships" results, flagship_results.json)
- python commonfund/cfde.py --cfde-key input/cfde_programs_key.json (produces the CFDE citations page results, cfde_results.json)
For adding iCite metadata to the keywords results:
icite.py --pmid-key keyword_results.json
This script takes each entry's list of PMIDs from {keyword/cfde/flagship}_results.json, and creates a new set of entries with the following keys, one for each publication:
- competes_with
- type
- program
- source
- pmid
- icite_rcr
- icite_apt
- icite_nih_percentile
- icite_is_clinical
- icite_is_research_article
The results will look like the following:
[
{
"pmid": 1235325,
"competes_with": null,
"type": "cfde_dcc",
"program": "LINCS",
"source": "keyword_search",
"icite_rcr": null,
"icite_apt": 0.0,
"icite_nih_percentile": null,
"icite_is_clinical": "No",
"icite_is_research_article": "Yes"
},
// ...
]
The file is written to {keyword/cfde/flagship}_icite_results.json, and can then be passed to OpenAlex.
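The reshaping from one entry per program to one record per publication can be sketched as follows. This is an assumption about the shape of the transformation, not the actual code in icite.py: the function name `expand_entries` and the sample PMIDs are hypothetical, and the iCite fields are left as placeholders because no API call is made here.

```python
ICITE_KEYS = (
    "icite_rcr",
    "icite_apt",
    "icite_nih_percentile",
    "icite_is_clinical",
    "icite_is_research_article",
)


def expand_entries(entries, source):
    """Turn per-program entries into one record per (program, PMID) pair."""
    records = []
    for entry in entries:
        for pmid in entry["pmid_list"]:
            record = {
                "pmid": pmid,
                "competes_with": entry["competes_with"],
                "type": entry["type"],
                "program": entry["program"],
                "source": source,
            }
            # Placeholders only: the real values come from the iCite API.
            record.update({key: None for key in ICITE_KEYS})
            records.append(record)
    return records


# Illustrative input in the shape of keyword_results.json.
sample = [
    {"competes_with": None, "type": "cfde_dcc", "program": "LINCS",
     "pmid_list": [1235325, 1235326]},
]
records = expand_entries(sample, source="keyword_search")
print(len(records), records[0]["pmid"])
```

Carrying the `source` value on every record is what lets the three result sets be combined later without losing track of where each publication came from.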
Note that OpenAlex is the slowest API, since all requests must be made serially. The dataset at the end of this step is considered "final" and ready for analysis.
python openalex.py --icite-key keyword_icite_results.json
This adds the following keys to each entry:
- oa_author_name
- oa_institute_name
- oa_institute_url
- oa_geo_country
- oa_geo_city
- oa_geo_region
- oa_affil_string
- oa_openalex_id
- oa_publication_title
- oa_publication_date
- oa_publication_year
- oa_publisher_name
- oa_journal_name
- oa_total_citations
- oa_is_open_access
- oa_mesh
oa_mesh is a list of objects where each object has the following keys:
- oa_mesh_id
- oa_mesh_name
- oa_mesh_category
The keys listed above are added to the keys from icite.py, creating the final dataset for each category. The results will look like the following:
[
{
"competes_with": null,
"pmid": "35551182",
"type": "cfde_dcc",
"program": "4D Nucleome",
"source": "cites_a_flagship",
"icite_rcr": null,
"icite_apt": 0.05,
"icite_nih_percentile": null,
"icite_is_clinical": "No",
"icite_is_research_article": "Yes",
"oa_publication_title": "Reconstruct high-resolution 3D genome structures for diverse cell-types using FLAMINGO",
"oa_publication_year": 2022,
"oa_publication_date": "2022-05-12",
"oa_total_citations": 0,
"oa_is_open_access": true,
"oa_openalex_id": "https://api.openalex.org/W4280582334",
"oa_publisher_name": "Springer Nature",
"oa_journal_name": "Nature Communications",
"oa_author_name": "Jianrong Wang",
"oa_affil_string": "Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, MI, 48824, USA. wangj164@msu.edu.",
"oa_institute_name": "Michigan State University",
"oa_institute_url": "https://api.openalex.org/I87216513",
"oa_geo_region": "Michigan",
"oa_geo_country": "US",
"oa_geo_city": "East Lansing",
"oa_mesh": [
{
"oa_mesh_id": "D002843",
"oa_mesh_name": "Chromatin",
"oa_mesh_category": true
},
{
"oa_mesh_id": "D002875",
"oa_mesh_name": "Chromosomes",
"oa_mesh_category": true
},
{
"oa_mesh_id": "D002843",
"oa_mesh_name": "Chromatin",
"oa_mesh_category": false
},
// ...
]
},
// ...
]
The output file will be named {keyword/cfde/flagship}_icite_oa_results.json.
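Because oa_mesh is a nested list, it usually needs a small amount of unpacking before analysis. The sketch below tallies how many publications carry each MeSH term; the function name `mesh_term_counts` is hypothetical, and the second sample record is made up for illustration.

```python
from collections import Counter


def mesh_term_counts(records):
    """Count, for each MeSH term name, how many records mention it."""
    counts = Counter()
    for record in records:
        # A term can appear more than once in a record (as "Chromatin"
        # does in the example above), so count each name once per record.
        names = {mesh["oa_mesh_name"] for mesh in record.get("oa_mesh") or []}
        counts.update(names)
    return counts


# Illustrative records holding only the keys this sketch needs.
sample = [
    {"pmid": "35551182", "oa_mesh": [
        {"oa_mesh_id": "D002843", "oa_mesh_name": "Chromatin", "oa_mesh_category": True},
        {"oa_mesh_id": "D002875", "oa_mesh_name": "Chromosomes", "oa_mesh_category": True},
        {"oa_mesh_id": "D002843", "oa_mesh_name": "Chromatin", "oa_mesh_category": False},
    ]},
    {"pmid": "0", "oa_mesh": [
        {"oa_mesh_id": "D002843", "oa_mesh_name": "Chromatin", "oa_mesh_category": True},
    ]},
    {"pmid": "1", "oa_mesh": None},
]

print(mesh_term_counts(sample).most_common())
```

Records with a missing or empty oa_mesh value are skipped rather than treated as errors.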
To read a single result file into R, use the following:
# one file
keyword_dataset <- jsonlite::fromJSON("keyword_icite_oa_results.json")
To read all files with the ending *_icite_oa_results.json in the directory into one data frame, use:
library(dplyr)
library(jsonlite)
# match only file names that end in _icite_oa_results.json
files <- list.files(path = ".", pattern = "_icite_oa_results\\.json$", full.names = TRUE)
data <- bind_rows(purrr::map(files, fromJSON))
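For readers working in Python rather than R, a rough equivalent using only the standard library might look like the sketch below. The function name `load_all` is hypothetical, and the demo writes made-up throwaway files to a temporary directory so that it runs without real pipeline output.

```python
import glob
import json
import os
import tempfile


def load_all(directory="."):
    """Concatenate the records from every *_icite_oa_results.json file."""
    pattern = os.path.join(directory, "*_icite_oa_results.json")
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            records.extend(json.load(handle))
    return records


# Demo on throwaway files with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    demo = {
        "keyword_icite_oa_results.json": [{"pmid": 1, "source": "keyword_search"}],
        "flagship_icite_oa_results.json": [{"pmid": 2, "source": "cites_a_flagship"}],
        "keyword_icite_results.json": [{"pmid": 3}],  # earlier stage, not matched
    }
    for name, rows in demo.items():
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            json.dump(rows, handle)
    combined = load_all(tmp)

print([row["pmid"] for row in combined])
```

On real data, calling `load_all(".")` from the output directory returns one flat list of records, which can be grouped by the `source` key.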