lucinamay/fairfetched

fairfetched

data APIs for reproducible data fetching in cheminformatics in line with FAIR principles

installation

you can install this package with uv add fairfetched (recommended)

or, if you do not use the uv package manager, with pip install fairfetched

examples

you can download ChEMBL or Papyrus through:

from fairfetched.get import Chembl, Papyrus
# downloads the raw ChEMBL files and extracts parquet files into the directory set by the
# environment variable FAIRFETCHED_HOME or PYSTOW_HOME, or into <HOME>/.data if neither is set.
# within that directory, fairfetched saves everything to a folder chembl/<version>
mychembl = Chembl.from_latest()

mychembl.lfs                  # a dictionary of all chembl files in polars LazyFrame format, scanned directly from the extracted .parquet files


mychembl.consolidated_paths   # the paths to the parquet files converted from the tables in the ChEMBL .db file

mychembl.raw_paths            # the paths to the raw files as downloaded from ChEMBL. currently this includes an uncompressed .db file

mychembl.compounds            # NOT YET IMPLEMENTED !! convenience alias for mychembl.compose()["compounds"], which uses mychembl.lfs LazyFrame joins to obtain an intuitive join of the data.
                              # from there, you can 
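
the storage location described in the comments above can be sketched in plain Python. this is only an illustration of the described behaviour, not fairfetched's actual code: the helper names resolve_data_home and chembl_dir are hypothetical, and the precedence of FAIRFETCHED_HOME over PYSTOW_HOME is an assumption based on the order in which they are listed.

```python
import os
from pathlib import Path


def resolve_data_home(env=None):
    """Hypothetical helper (not part of fairfetched): pick the base data directory."""
    env = os.environ if env is None else env
    # Assumption: FAIRFETCHED_HOME is checked before PYSTOW_HOME.
    for name in ("FAIRFETCHED_HOME", "PYSTOW_HOME"):
        value = env.get(name)
        if value:
            return Path(value).expanduser()
    # Neither variable is set: fall back to <HOME>/.data
    return Path.home() / ".data"


def chembl_dir(version, env=None):
    """Hypothetical helper: the chembl/<version> folder under the data home."""
    return resolve_data_home(env) / "chembl" / str(version)


# "36" is an arbitrary version label used for illustration only.
print(chembl_dir("36", env={"FAIRFETCHED_HOME": "/tmp/ff"}))
```

passing a plain dict as env makes it possible to try the lookup without touching the real environment.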

examples of how to use the LazyFrames:

checking which columns+datatypes are in the file, so that you can choose to join them:

>>> mychembl.lfs["activities"].collect_schema()
Schema({'activity_id': Int64, 'assay_id': Int64, 'doc_id': Int64, 'record_id': Int64, 'molregno': Int64, 'standard_relation': String, 'standard_value': Float64, 'standard_units': String, 'standard_flag': Int64, 'standard_type': String, 'activity_comment': String, 'data_validity_comment': String, 'potential_duplicate': Int64, 'pchembl_value': Float64, 'bao_endpoint': String, 'uo_units': String, 'qudt_units': String, 'toid': Int64, 'upper_value': Float64, 'standard_upper_value': Null, 'src_id': Int64, 'type': String, 'relation': String, 'value': Float64, 'units': String, 'text_value': String, 'standard_text_value': String, 'action_type': String})
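
to pick join keys from two schemas, one option is to look for columns that appear in both with the same datatype. the sketch below uses plain dicts as stand-ins for polars Schema objects so that it runs without the downloaded data. shared_columns is a hypothetical helper, the activities dict lists only a few of the 28 columns, and the Int64 dtype of molregno in compound_structures is an assumption.

```python
# Column name -> dtype name. Plain dicts stand in for polars Schema objects.
activities = {
    "activity_id": "Int64",
    "assay_id": "Int64",
    "doc_id": "Int64",
    "record_id": "Int64",
    "molregno": "Int64",
}  # only a few of the 28 columns shown in the schema above

compound_structures = {
    "molregno": "Int64",  # assumption: same dtype as in activities
    "molfile": "String",
    "standard_inchi": "String",
    "standard_inchi_key": "String",
    "canonical_smiles": "String",
}


def shared_columns(left, right):
    """Hypothetical helper: columns present in both schemas with equal dtypes."""
    return [name for name, dtype in left.items() if right.get(name) == dtype]


print(shared_columns(activities, compound_structures))  # ['molregno']
```

with the real data, the same helper should work on dict(mychembl.lfs["activities"].collect_schema()), assuming the returned Schema behaves like a mapping of column names to dtypes.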

selecting all entries based on doc_id:

>>> mychembl.lfs["activities"].filter(doc_id=89530).drop_nulls("units").collect()
shape: (107, 28)
┌─────────────┬──────────┬────────┬───────────┬───┬───────┬────────────┬─────────────────────┬─────────────┐
│ activity_id ┆ assay_id ┆ doc_id ┆ record_id ┆ … ┆ units ┆ text_value ┆ standard_text_value ┆ action_type │
│ ---         ┆ ---      ┆ ---    ┆ ---       ┆   ┆ ---   ┆ ---        ┆ ---                 ┆ ---         │
│ i64         ┆ i64      ┆ i64    ┆ i64       ┆   ┆ str   ┆ str        ┆ str                 ┆ str         │
╞═════════════╪══════════╪════════╪═══════════╪═══╪═══════╪════════════╪═════════════════════╪═════════════╡
│ 15120638    ┆ 1431503  ┆ 89530  ┆ 2256150   ┆ … ┆ uM    ┆ null       ┆ null                ┆ null        │
│ 15120639    ┆ 1431503  ┆ 89530  ┆ 2256151   ┆ … ┆ uM    ┆ null       ┆ null                ┆ null        │
│ 15120640    ┆ 1431503  ┆ 89530  ┆ 2256152   ┆ … ┆ uM    ┆ null       ┆ null                ┆ null        │
│ 15120641    ┆ 1431503  ┆ 89530  ┆ 2256153   ┆ … ┆ uM    ┆ null       ┆ null                ┆ null        │
│ 15120642    ┆ 1431503  ┆ 89530  ┆ 2256154   ┆ … ┆ uM    ┆ null       ┆ null                ┆ null        │
│ …           ┆ …        ┆ …      ┆ …         ┆ … ┆ …     ┆ …          ┆ …                   ┆ …           │
│ 15125200    ┆ 1431507  ┆ 89530  ┆ 2256167   ┆ … ┆ uM    ┆ null       ┆ null                ┆ null        │
│ 15125201    ┆ 1431507  ┆ 89530  ┆ 2256168   ┆ … ┆ uM    ┆ null       ┆ null                ┆ null        │
│ 15125202    ┆ 1431507  ┆ 89530  ┆ 2256169   ┆ … ┆ uM    ┆ null       ┆ null                ┆ null        │
│ 15125203    ┆ 1431507  ┆ 89530  ┆ 2256170   ┆ … ┆ uM    ┆ null       ┆ null                ┆ null        │
│ 15125204    ┆ 1431507  ┆ 89530  ┆ 2256171   ┆ … ┆ uM    ┆ null       ┆ null                ┆ null        │
└─────────────┴──────────┴────────┴───────────┴───┴───────┴────────────┴─────────────────────┴─────────────┘

adding compound structure info to the activities by joining on molregno:

>>> mychembl.lfs["activities"].join(mychembl.lfs["compound_structures"],on="molregno",how="left",validate="m:1").head().collect()
shape: (5, 32)
┌─────────────┬──────────┬────────┬───────────┬───┬────────────────────────┬─────────────────────────────────┬─────────────────────────────┬─────────────────────────────────┐
│ activity_id ┆ assay_id ┆ doc_id ┆ record_id ┆ … ┆ molfile                ┆ standard_inchi                  ┆ standard_inchi_key          ┆ canonical_smiles                │
│ ---         ┆ ---      ┆ ---    ┆ ---       ┆   ┆ ---                    ┆ ---                             ┆ ---                         ┆ ---                             │
│ i64         ┆ i64      ┆ i64    ┆ i64       ┆   ┆ str                    ┆ str                             ┆ str                         ┆ str                             │
╞═════════════╪══════════╪════════╪═══════════╪═══╪════════════════════════╪═════════════════════════════════╪═════════════════════════════╪═════════════════════════════════╡
│ 31863       ┆ 54505    ┆ 6424   ┆ 206172    ┆ … ┆                        ┆ InChI=1S/C20H12N2O2/c1-2-7-13(… ┆ BEBACPIIZGRKGG-UHFFFAOYSA-N ┆ c1ccc(-c2nc3c(-c4nc5ccccc5o4)c… │
│             ┆          ┆        ┆           ┆   ┆      RDKit          2D ┆                                 ┆                             ┆                                 │
│             ┆          ┆        ┆           ┆   ┆                        ┆                                 ┆                             ┆                                 │
│             ┆          ┆        ┆           ┆   ┆  24 2…                 ┆                                 ┆                             ┆                                 │
│ 31864       ┆ 83907    ┆ 6432   ┆ 208970    ┆ … ┆                        ┆ InChI=1S/C23H14N2O5/c1-12-5-8-… ┆ SUKVIELCKKEBOJ-UHFFFAOYSA-N ┆ Cc1ccc2oc(-c3cccc(N4C(=O)c5ccc… │
│             ┆          ┆        ┆           ┆   ┆      RDKit          2D ┆                                 ┆                             ┆                                 │
│             ┆          ┆        ┆           ┆   ┆                        ┆                                 ┆                             ┆                                 │
│             ┆          ┆        ┆           ┆   ┆  30 3…                 ┆                                 ┆                             ┆                                 │
│ 31865       ┆ 88152    ┆ 6432   ┆ 208970    ┆ … ┆                        ┆ InChI=1S/C23H14N2O5/c1-12-5-8-… ┆ SUKVIELCKKEBOJ-UHFFFAOYSA-N ┆ Cc1ccc2oc(-c3cccc(N4C(=O)c5ccc… │
│             ┆          ┆        ┆           ┆   ┆      RDKit          2D ┆                                 ┆                             ┆                                 │
│             ┆          ┆        ┆           ┆   ┆                        ┆                                 ┆                             ┆                                 │
│             ┆          ┆        ┆           ┆   ┆  30 3…                 ┆                                 ┆                             ┆                                 │
│ 31866       ┆ 83907    ┆ 6432   ┆ 208987    ┆ … ┆                        ┆ InChI=1S/C30H20N2O7/c1-37-24-6… ┆ ZFJHZUAZBGPPQK-UHFFFAOYSA-N ┆ COc1ccccc1-c1ccc2oc(-c3ccc(OC)… │
│             ┆          ┆        ┆           ┆   ┆      RDKit          2D ┆                                 ┆                             ┆                                 │
│             ┆          ┆        ┆           ┆   ┆                        ┆                                 ┆                             ┆                                 │
│             ┆          ┆        ┆           ┆   ┆  39 4…                 ┆                                 ┆                             ┆                                 │
│ 31867       ┆ 88153    ┆ 6432   ┆ 208987    ┆ … ┆                        ┆ InChI=1S/C30H20N2O7/c1-37-24-6… ┆ ZFJHZUAZBGPPQK-UHFFFAOYSA-N ┆ COc1ccccc1-c1ccc2oc(-c3ccc(OC)… │
│             ┆          ┆        ┆           ┆   ┆      RDKit          2D ┆                                 ┆                             ┆                                 │
│             ┆          ┆        ┆           ┆   ┆                        ┆                                 ┆                             ┆                                 │
│             ┆          ┆        ┆           ┆   ┆  39 4…                 ┆                                 ┆                             ┆                                 │
└─────────────┴──────────┴────────┴───────────┴───┴────────────────────────┴─────────────────────────────────┴─────────────────────────────┴─────────────────────────────────┘
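
the validate="m:1" argument in the join above asks polars to check that the join keys are unique in the right-hand frame, so that each activity matches at most one structure. the condition itself can be sketched in plain Python. duplicated_keys is a hypothetical helper and the molregno values are toy numbers invented for illustration, not ChEMBL data.

```python
from collections import Counter


def duplicated_keys(right_keys):
    """Hypothetical helper: keys that occur more than once on the right-hand side.

    A many-to-one (m:1) join is only valid when this list is empty.
    """
    counts = Counter(right_keys)
    return sorted(key for key, n in counts.items() if n > 1)


# Toy molregno values, invented for illustration only.
structures_ok = [1, 2, 3]       # one row per molregno: m:1 holds
structures_bad = [1, 2, 2, 3]   # molregno 2 appears twice: m:1 is violated

print(duplicated_keys(structures_ok))   # []
print(duplicated_keys(structures_bad))  # [2]
```

the left-hand side (activities) may repeat a molregno as often as it likes, which is the "many" part of many-to-one.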

move it to pandas for direct drop-in use (if you really want pandas...)

ideally, call .collect().to_pandas() as late as possible, after you have completed all filtering (see the polars documentation for more info):

mychembl.lfs["activities"].collect().to_pandas()

roadmap

  • papyrus database support
    • papyrus latest version download
    • simple nested filtering
    • efficient nested filtering
    • all-version support
    • built-in pivots
  • chembl database support
    • database to tables (parquet)
    • intuitive pre-merged flat files
    • database visualisation
    • remove the need for storing uncompressed .db
  • reproduction from downloaded raw file
  • reproducible molecular (and protein?) standardisation
  • automated time-url logging and manifest files
  • well-organised logging
  • dependency minimisation
  • other database support
  • preservation of api and parsing logic per major version
