J'essaie d'écrire du code pour analyser un fichier OBO Gene Ontology (GO) et pousser les ID de terme go (par exemple GO: 0003824) dans un dictionnaire imbriqué en forme d'arbre. La structure hiérarchique go dans un fichier OBO est indiquée par l'identifiant "is_a", qui est utilisé pour marquer chaque parent de chaque terme GO. Un terme GO peut avoir plusieurs parents, et les termes go les plus élevés de la hiérarchie n'ont pas de parents.

Un petit exemple de fichier GO OBO est présenté ci-dessous:

[Term]
id: GO:0003674
name: molecular_function
namespace: molecular_function
alt_id: GO:0005554
def: "A molecular process that can be carried out by the action of a single macromolecular machine, usually via direct physical interactions with other molecular entities. Function in this sense denotes an action, or activity, that a gene product (or a complex) performs. These actions are described from two distinct but related perspectives: (1) biochemical activity, and (2) role as a component in a larger system/process." [GOC:pdt]
comment: Note that, in addition to forming the root of the molecular function ontology, this term is recommended for use for the annotation of gene products whose molecular function is unknown. When this term is used for annotation, it indicates that no information was available about the molecular function of the gene product annotated as of the date the annotation was made; the evidence code "no data" (ND), is used to indicate this. Despite its name, this is not a type of 'function' in the sense typically defined by upper ontologies such as Basic Formal Ontology (BFO). It is instead a BFO:process carried out by a single gene product or complex.
subset: goslim_aspergillus
subset: goslim_candida
subset: goslim_chembl
subset: goslim_generic
subset: goslim_metagenomics
subset: goslim_pir
subset: goslim_plant
subset: goslim_yeast
synonym: "molecular function" EXACT []

[Term]
id: GO:0003824
name: catalytic activity
namespace: molecular_function
def: "Catalysis of a biochemical reaction at physiological temperatures. In biologically catalyzed reactions, the reactants are known as substrates, and the catalysts are naturally occurring macromolecular substances known as enzymes. Enzymes possess specific binding sites for substrates, and are usually composed wholly or largely of protein, but RNA that has catalytic activity (ribozyme) is often also regarded as enzymatic." [GOC:vw, ISBN:0198506732]
subset: goslim_chembl
subset: goslim_flybase_ribbon
subset: goslim_metagenomics
subset: goslim_pir
subset: goslim_plant
synonym: "enzyme activity" EXACT [GOC:dph, GOC:tb]
xref: Wikipedia:Enzyme
is_a: GO:0003674 ! molecular_function

[Term]
id: GO:0005198
name: structural molecule activity
namespace: molecular_function
def: "The action of a molecule that contributes to the structural integrity of a complex or its assembly within or outside a cell." [GOC:mah, GOC:vw]
subset: goslim_agr
subset: goslim_aspergillus
subset: goslim_candida
subset: goslim_chembl
subset: goslim_flybase_ribbon
subset: goslim_generic
subset: goslim_pir
subset: goslim_plant
subset: goslim_yeast
is_a: GO:0003674 ! molecular_function

[Term]
id: GO:0005488
name: binding
namespace: molecular_function
def: "The selective, non-covalent, often stoichiometric, interaction of a molecule with one or more specific sites on another molecule." [GOC:ceb, GOC:mah, ISBN:0198506732]
comment: Note that this term is in the subset of terms that should not be used for direct, manual gene product annotation. Please choose a more specific child term, or request a new one if no suitable term is available. For ligands that bind to signal transducing receptors, consider the molecular function term 'receptor binding ; GO:0005102' and its children.
subset: gocheck_do_not_manually_annotate
subset: goslim_pir
subset: goslim_plant
synonym: "ligand" NARROW []
xref: Wikipedia:Binding_(molecular)
is_a: GO:0003674 ! molecular_function

[Term]
id: GO:0005515
name: protein binding
namespace: molecular_function
alt_id: GO:0001948
alt_id: GO:0045308
def: "Interacting selectively and non-covalently with any protein or protein complex (a complex of two or more proteins that may include other nonprotein molecules)." [GOC:go_curators]
subset: goslim_aspergillus
subset: goslim_candida
subset: goslim_chembl
subset: goslim_metagenomics
subset: goslim_pir
subset: goslim_plant
synonym: "glycoprotein binding" NARROW []
synonym: "protein amino acid binding" EXACT []
xref: reactome:R-HSA-170835 "An anchoring protein, ZFYVE9 (SARA), recruits SMAD2/3"
xref: reactome:R-HSA-170846 "TGFBR2 recruits TGFBR1"
xref: reactome:R-HSA-3645786 "TGFBR2 mutant dimers recruit TGFBR1"
xref: reactome:R-HSA-3656484 "TGFBR2 recruits TGFBR1 KD Mutants"
xref: reactome:R-HSA-3702153 "An anchoring protein, ZFYVE9 (SARA), recruits SMAD2/3 MH2 domain mutants"
xref: reactome:R-HSA-3713560 "An anchoring protein, ZFYVE9 (SARA), recruits SMAD2/3 phosphorylation motif mutants"
is_a: GO:0005488 ! binding

[Term]
id: GO:0005549
name: odorant binding
namespace: molecular_function
def: "Interacting selectively and non-covalently with an odorant, any substance capable of stimulating the sense of smell." [GOC:jl, ISBN:0721662544]
subset: goslim_pir
is_a: GO:0005488 ! binding

[Term]
id: GO:0005550
name: pheromone binding
namespace: molecular_function
def: "Interacting selectively and non-covalently with a pheromone, a substance, or characteristic mixture of substances, that is secreted and released by an organism and detected by a second organism of the same or a closely related species, in which it causes a specific reaction, such as a definite behavioral reaction or a developmental process." [GOC:ai]
is_a: GO:0005549 ! odorant binding

Vous trouverez ci-dessous une tentative de fonction récursive (et du code de support) pour stocker les ID de termes GO dans un dictionnaire arborescent:

import pandas as pd
import re

with open("tiny_go.obo", 'rt') as f:
    content = f.read()    

# Clean GO terms list
def clean_go_terms(terms):
    l = []
    for term in terms:
        if (len(re.findall('is_obsolete: true', term))==0) and (len(re.findall('id: GO:\d+', term)) > 0):
            l.append(term)
    return l

def get_top_nodes(terms):
    l = []
    for term in terms: 
        if len(re.findall('is_a: GO:\d+', term)) == 0:
            l.append(term)
    return l

split_terms = content.split('\n\n')
split_terms_clean = clean_go_terms(split_terms)
top_nodes = get_top_nodes(split_terms_clean)
len(top_nodes)

# Find every term that has the top node as a parent; apply recursively to entire list of terms
# * Keys with empty lists will be leaves
def generate_go_tree(parent_nodes, all_go_terms, switch=True):
    go_dict = {}
    for node in parent_nodes:
        parent_go_id = re.findall('id: (GO:\d+)', node)[0]
        go_dict[parent_go_id] = {}
        for go_term in all_go_terms:
            go_id = re.findall('id: (GO:\d+)', go_term)[0]
            parent_list = re.findall('is_a: (GO:\d+)', go_term)
            if (parent_go_id in parent_list):
                go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)
    return go_dict

go_tree = generate_go_tree(top_nodes, split_terms_clean)

De toute évidence, je n'ai pas construit correctement la fonction récursive car je constate une duplication des clés dans la sortie:

{'GO:0003674': {'GO:0003824': {'GO:0003824': {}},
  'GO:0005198': {'GO:0005198': {}},
  'GO:0005488': {'GO:0005488': {'GO:0005515': {'GO:0005515': {}},
    'GO:0005549': {'GO:0005549': {'GO:0005550': {'GO:0005550': {}}}}}}}}

Des suggestions sur la façon de corriger la fonction récursive seront grandement appréciées! Je vous remercie!

3
tubuliferous 17 mars 2019 à 13:13

2 réponses

Meilleure réponse

Tu as écrit

if (parent_go_id in parent_list):
    go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)

Correct serait

if (parent_go_id in parent_list):
    go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)[go_id]

Après ce changement, il produit:

{
    'GO:0003674': {
        'GO:0003824': {}, 
        'GO:0005198': {}, 
        'GO:0005488': {
            'GO:0005515': {},
            'GO:0005549': {
                'GO:0005550': {}
            }
        }
    }
}

Mais je suggérerais une approche complètement différente. Créez une classe qui analyse les termes et crée l'arborescence de dépendances en procédant ainsi.

Pour plus de commodité, je l'ai dérivé de dict, vous pouvez donc écrire term.id au lieu de term['id']:

class Term(dict):
    __getattr__ = dict.__getitem__
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

    registry = {}
    single_valued = 'id name namespace alt_id def comment synonym is_a'.split()
    multi_valued = 'subset xref'.split()

    def __init__(self, text):
        self.children = []
        self.parent = None

        for line in text.splitlines():
            if not ': ' in line:
                continue
            key, val = line.split(': ', 1)
            if key in Term.single_valued:
                self[key] = val
            elif key in Term.multi_valued:
                if not key in self:
                    self[key] = [val]
                else:
                    self[key].append(val)
            else:
                print('unclear property: %s' % line)

        if 'id' in self:
            Term.registry[self.id] = self

        if 'alt_id' in self:
            Term.registry[self.alt_id] = self

        if 'is_a' in self:
            key = self.is_a.split(' ! ', 1)[0]
            if key in Term.registry:
                Term.registry[key].children.append(self)
                self.parent = Term.registry[key]

    def is_top(self):
        return self.parent == None

    def is_valid(self):
        return self.get('is_obsolete') != 'true' and self.id != None

Vous pouvez désormais récupérer le fichier en une seule fois:

with open('tiny_go.obo', 'rt') as f:
    contents = f.read()

terms = [Term(text) for text in contents.split('\n\n')]

Et récurer l'arbre devient facile. Par exemple, une simple fonction "print" qui ne sort que des nœuds non obsolètes:

def print_tree(terms, indent=''):
    valid_terms = [term for term in terms if term.is_valid()]
    for term in valid_terms:
        print(indent + 'Term %s - %s' % (term.id, term.name))
        print_tree(term.children, indent + '  ')

top_terms = [term for term in terms if term.is_top()]

print_tree(top_terms)

Cela imprime:

Term GO:0003674 - molecular_function
  Term GO:0003824 - catalytic activity
  Term GO:0005198 - structural molecule activity
  Term GO:0005488 - binding
    Term GO:0005515 - protein binding
    Term GO:0005549 - odorant binding
      Term GO:0005550 - pheromone binding

Vous pouvez également faire des choses comme Term.registry['GO:0005549'].parent.name, qui obtiendraient "binding".

Je laisse la production de dicts imbriqués de GO-ID (comme dans votre propre exemple) comme exercice, mais vous n'en aurez peut-être même pas besoin, car Term.registry est déjà très similaire à cela.

4
Tomalak 17 mars 2019 à 14:21

Vous pouvez utiliser la récursivité pour une solution plus courte:

import itertools, re, json
content = list(filter(None, [i.strip('\n') for i in open('filename.txt')]))
entries = [[a, list(b)] for a, b in itertools.groupby(content, key=lambda x:x== '[Term]')]
terms = [(lambda x:x if 'is_a' not in x else {**x, 'is_a':re.findall('^GO:\d+', x['is_a'])[0]})(dict(i.split(': ', 1) for i in b)) for a, b in entries if not a]
terms = sorted(terms, key=lambda x:'is_a' in x)
def tree(d, _start):
  t = [i for i in d if i.get('is_a') == _start]
  return {} if not t else {i['id']:tree(d, i['id']) for i in t}

print(json.dumps({terms[0]['id']:tree(terms, terms[0]['id'])}, indent=4))

Production:

{
  "GO:0003674": {
    "GO:0003824": {},
    "GO:0005198": {},
    "GO:0005488": {
        "GO:0005515": {},
        "GO:0005549": {
            "GO:0005550": {}
        }
      }
   }
}

Cela fonctionnera également si les ensembles de données parents ne sont pas définis avant leurs enfants. Par exemple, lorsque le parent est positionné trois places en dessous de sa place d'origine, le même résultat est toujours généré (voir fichier):

print(json.dumps({terms[0]['id']:tree(terms, terms[0]['id'])}, indent=4))

Production:

{
"GO:0003674": {
    "GO:0003824": {},
    "GO:0005198": {},
    "GO:0005488": {
        "GO:0005515": {},
        "GO:0005549": {
            "GO:0005550": {}
        }
      }
   }
}
3
Ajax1234 18 mars 2019 à 14:39