# anc2vec **Repository Path**: northpoleforce/anc2vec ## Basic Information - **Project Name**: anc2vec - **Description**: folk - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2024-01-20 - **Last Updated**: 2024-01-20 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Anc2vec Anc2vec is a novel method based on neural networks to construct embeddings of terms from the [Gene Ontology](http://geneontology.org/) (GO) exclusively using three structural features of it: the ontological uniqueness of terms, their ancestor relationships and the sub-ontology to which they belong. This repository offers a Python package containing the source code of anc2vec, as well as instructions for reproducibility of the main results of the study where this method was proposed: *[Anc2vec: embedding Gene Ontology terms by preserving ancestors relationships](ms.pdf)*, by A. A. Edera, D. H. Milone, and G. Stegmayer. Research Institute for Signals, Systems and Computational Intelligence, [sinc(i)](https://sinc.unl.edu.ar).

Anc2vec

Fig. 1. Panel A) The GO structure is composed by hierarchical relationships between terms arranged in three sub-ontologies: BP, CC, and MF. Panel B) Anc2vec architecture. A GO term is encoded as a one-hot vector x that is transformed into an embedding h, which is used to predict three structural features of the GO that are used for weight optimization.

Anc2Vec

Fig. 2. Anc2vec embeddings of GO terms in the three sub-ontologies. Points depict embeddings of GO terms whose colors encode the sub-ontologies: BP (Biological Process), CC (Cellular Component), and MF (Molecular Function). There is available a video showing how 2-dimensional embeddings are adjusted during weight optimization.
## Requirements Anc2vec requires [Python](https://www.python.org/) 3.6 and [TensorFlow](https://www.tensorflow.org/) 2.3.1. ## Installation It is recommendable to have installed [Conda](https://docs.conda.io/en/latest/), to avoid Python package conflicts. If Conda is installed, first create and activate a conda environment, for example, named anc2vec: ```bash conda create --name anc2vec python=3.6 conda activate anc2vec ``` Next, install the `anc2vec` package via the [pip package manager](https://pip.pypa.io/en/stable/installing/): ```bash pip install -U "anc2vec @ git+https://github.com/aedera/anc2vec.git" ``` ## Anc2vec functionalities ### Access pre-trained embeddings The `anc2vec` package has already available the same embedding of GO terms used in the study. These embeddings were built using the Gene Ontology release [2020-10-06](./anc2vec/data/go.obo). The embeddings can be easily accessed on Python with this command: ```python import anc2vec es = anc2vec.get_embeddings() ``` Here, `es` is a python dictionary that maps GO terms with their corresponding 200-dimensional embeddings. For example, this command uses this dictionary to retrieve the embedding corresponding to the term `GO:0001780`: ```python e = es['GO:0001780'] ``` The variable `e` is a [Numpy](https://numpy.org/) array containing the embedding ```python array([ 0.55203265, -0.23133564, 0.1983797 , -0.3251996 , 0.20564775, -0.32133245, -0.25364587, -0.16675541, -0.46832997, -0.40702957, ... -0.29757708, -0.33143485, -0.31099185, 0.24465033, -0.25458524, -0.24525951, -0.366758 , -0.04628978, 0.29378492, 0.31249675], dtype=float32) ``` These `anc2vec` embeddings are ready to be used for semantic similarity tasks. Below there are examples showing how to use them for calculating [cosine distances](https://en.wikipedia.org/wiki/Cosine_similarity). ### Build your own embeddings The `anc2vec` package also contains a function to build embeddings from scratch using a specific [OBO file](http://owlcollab.github.io/oboformat/doc/obo-syntax.html), a human-readable file usually used to describe the GO. Building embeddings can be particularly useful for experimental scenarios where a specific version of the GO is required, such as those available in the [GO data archive](http://release.geneontology.org/). The following code shows how to build the embedding for a given OBO file named `go.obo`. ```python import anc2vec import anc2vec.train as builder es = builder.fit('go.obo', embedding_sz=200, batch_sz=64, num_epochs=100) ``` The object `builder` uses the input `go.obo` file to extract structural features used to build the embeddings of GO terms. Note that `builder` is called with additional parameters indicating the dimensionality of the embeddings (`embedding_sz`) and the number of optimization steps used for embedding building (`num_epochs`). The embeddings built by `builder` are stored in `es`, which is a Python dictionary mapping GO terms to their corresponding embeddings. Please check the examples below for more information about this functionality. ## Notebooks: examples on how to use the `anc2vec` package To try anc2vec, below there are links to [Jupyter notebooks](https://jupyter.org) that use [Google Colab](https://research.google.com/colaboratory/) which offers free computing on the Google cloud. * [Using `anc2vec` pre-trained embeddings](https://colab.research.google.com/github/aedera/anc2vec/blob/main/examples/pretrained_anc2vec_embeddings.ipynb) * [Projecting `anc2vec` pre-trained embeddings](https://colab.research.google.com/github/aedera/anc2vec/blob/main/examples/project_embeddings.ipynb) * [Building `anc2vec` embeddings for a desired obo file](https://colab.research.google.com/github/aedera/anc2vec/blob/main/examples/train_anc2vec_embeddings.ipynb) ## Datasets These are the main datasets used in the experiments of the study where anc2vec is proposed: * [Ancestors dataset](https://drive.google.com/file/d/1fgK50TNg5nrade22SwmqZYOeAxgPHIHY/view?usp=sharing) * [Protein function dataset](https://drive.google.com/file/d/1eokaKj20tbFTn9jexQXIkONqwHeiBGS-/view?usp=sharing) * [STRING dataset](https://drive.google.com/file/d/1dBZqQeBuGf35_pGT6qJWSuX1At32t9CI/view?usp=sharing) ## License The `anc2vec` package is released under the [MIT License](LICENSE).