Demo_DiseaseDiscovery_byHumanUniprot_scaffold

Created: 2007-12-10 23:10:00


This workflow finds diseases relevant to the query string via the following steps:

A user query: a list of terms or a Boolean query (see the Apache Lucene project for the full query syntax). E.g.: (EZH2 OR "Enhancer of Zeste" +(mutation chromatin) -clinical). Consider adding the 'ProteinSynonymsToQuery' workflow in front of the input if your query is a protein.
Retrieve documents: finds up to 'maximumNumberOfHits' relevant documents (abstract + title) based on the query (the AIDA service inside is based on Apache Lucene).
Discover proteins: extracts proteins from the set of relevant abstracts with a 'named entity recognizer' trained on genomic terms using a Bayesian approach; the AIDA service inside is based on LingPipe. This subworkflow also filters false positives from the discovered proteins by requiring that each discovery has a valid UniProt ID. Martijn Schuemie's service for this contains only human UniProt IDs, which is why this workflow only works for human proteins.
Link proteins to diseases contained in the OMIM disease database (with a service from Japan that interrogates OMIM).

Workflow by Marco Roos (AID = Adaptive Information Disclosure, University of Amsterdam; http://adaptivedisclosure.org)

Text mining services by Sophia Katrenko and Edgar Meij (AID), and Martijn Schuemie (BioSemantics, Erasmus University Rotterdam). OMIM service from the Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, director Hideaki Sugawara (see http://xml.nig.ac.jp)

Changes to our original BioAID_DiseaseDiscovery workflow:

* Use of Martijn Schuemie's synsets service to
  * provide UniProt IDs for discovered proteins
  * filter false positive discoveries: only proteins with a UniProt ID go through; this introduces some false negatives (e.g. discovered proteins with a name shorter than 3 characters)
  * solve a major issue with the original workflow where some false positives could contribute disproportionately to the number of discovered diseases
* Counting of results in various ways.
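The filtering idea in the list above can be sketched as follows. This is a minimal illustration, not the synsets service itself: the in-memory lookup table, the function name filter_by_uniprot, and the length threshold are all assumptions made for the example, and the real service is called remotely from the workflow.

```python
# Hypothetical stand-in for the synonym/UniProt lookup service.
# The table is a toy example, not output of the real service.
TOY_UNIPROT_LOOKUP = {
    "ezh2": "Q15910",
    "brca1": "P38398",
}

# Assumption: names shorter than this are never matched, which mirrors the
# false negatives mentioned in the change list.
MIN_NAME_LENGTH = 3


def filter_by_uniprot(discovered, lookup=TOY_UNIPROT_LOOKUP,
                      min_len=MIN_NAME_LENGTH):
    """Return (name, uniprot_id) pairs for names that resolve to an ID."""
    kept = []
    for name in discovered:
        cleaned = name.strip()
        if len(cleaned) < min_len:
            continue  # too short to match: a possible false negative
        uniprot_id = lookup.get(cleaned.lower())
        if uniprot_id is None:
            continue  # no ID found: treated as a false positive
        kept.append((cleaned, uniprot_id))
    return kept


if __name__ == "__main__":
    print(filter_by_uniprot(["EZH2", "chromatin", "T", "BRCA1"]))
```

Only names that resolve to an ID reach the disease-linking step, so a spurious entity such as "chromatin" can no longer add diseases to the result.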


Run

Run this Workflow in the Taverna Workbench...

Option 1:

Copy and paste this link into File > 'Open workflow location...'
http://www.myexperiment.org/workflows/119/download?version=1

Workflow Components

Inputs (2)

Name	Description
query_string	Query for retrieving documents from an indexed corpus. The query is assumed to be used by a search service based on Lucene. In short, the query should be a string of terms with logical operators or +/- signs to mark terms as wanted or unwanted. Entities are discovered in the documents that match this query. If you have a single protein as query, consider adding the 'ProteinSynonymsToQuery' workflow in front of this input.
maxNumberOfDocsToRetrieve	Limits the number of relevant documents retrieved from our MEDLINE index. A maximum of 10 is good for testing; up to 100 works well for us (but takes some time); well above 100, Taverna 1 may run into its memory and data-handling limits.
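As an illustration of the query_string input, the sketch below assembles a Lucene-style query from three groups of terms. The helper name build_query and its parameters are hypothetical and not part of the workflow. It does not escape Lucene special characters, and it closes the OR group before the +/- clauses, which is one possible reading of the example query given in the description.

```python
def build_query(any_of=(), required=(), excluded=()):
    """Assemble a Lucene-style query string (illustrative helper only)."""

    def quote(term):
        # Multi-word terms become quoted phrases.
        return f'"{term}"' if " " in term else term

    parts = []
    if any_of:
        parts.append("(" + " OR ".join(quote(t) for t in any_of) + ")")
    if required:
        # '+' marks the group as wanted.
        parts.append("+(" + " ".join(quote(t) for t in required) + ")")
    # '-' marks each term as unwanted.
    parts.extend("-" + quote(t) for t in excluded)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_query(
        any_of=["EZH2", "Enhancer of Zeste"],
        required=["mutation", "chromatin"],
        excluded=["clinical"],
    ))
```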

Processors (5)

Name	Type	Description
Document_index	stringconstant
search_field	stringconstant
CountDiseasesPerProtein	beanshell
CountDiseases	beanshell
CountProteins	beanshell

Beanshells (3)

Name	Inputs	Outputs
CountDiseasesPerProtein	list	count
CountDiseases	list	count
CountProteins	list	count
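The page does not show the Beanshell scripts, so the following is only a guess at what the three counting steps might compute, written in Python rather than Beanshell. The function names, and the choice to count distinct entries instead of raw list length, are assumptions.

```python
def count_unique(items):
    """Count distinct entries in a flat list (assumed behaviour of
    CountProteins and CountDiseases)."""
    return len(set(items))


def count_per_protein(diseases_per_protein):
    """Count distinct diseases for each protein, given one list of diseases
    per protein (assumed behaviour of CountDiseasesPerProtein)."""
    return [len(set(diseases)) for diseases in diseases_per_protein]


if __name__ == "__main__":
    proteins = ["EZH2", "BRCA1", "EZH2"]
    diseases = [["disease A", "disease B"], ["disease B"], []]
    print(count_unique(proteins))
    print(count_per_protein(diseases))
```

Counting per protein and counting overall give different views: the per-protein counts show whether one protein dominates the list of diseases, while the overall count treats a disease linked to several proteins as a single result.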

Outputs (8)

Name	Description
relevant_documents
discovered_proteins
discovered_diseases
diseases_per_protein
protein_count
disease_count_per_protein
discovered_uniprot_ids
disease_count