BioAID_DiseaseDiscovery_RatHumanMouseUniprotFilter

Created: 2008-12-15 20:46:09 Last updated: 2011-08-11 09:22:23

This workflow finds diseases relevant to the query string via the following steps:

1. A user query: a list of terms or a boolean query (see the Apache Lucene project for full details of the query syntax). E.g.: (EZH2 OR "Enhancer of Zeste" +(mutation chromatin) -clinical). Consider adding 'ProteinSynonymsToQuery' in front of the input if your query is a protein.
2. Retrieve documents: finds 'maximumNumberOfHits' relevant documents (abstract+title) based on the query (the AIDA service inside is based on Apache's Lucene).
3. Discover proteins: extracts proteins from the set of relevant abstracts with a 'named entity recognizer' trained on genomic terms using a Bayesian approach; the AIDA service inside is based on LingPipe. This subworkflow also filters false positives from the discovered proteins by requiring that each discovery has a valid UniProt ID. Martijn Schuemie's service to do that contains only human UniProt IDs, which is why this workflow only works for human proteins.
4. Link proteins to diseases contained in the OMIM disease database (with a service from Japan that interrogates OMIM).

Workflow by Marco Roos (AID = Adaptive Information Disclosure, University of Amsterdam; http://adaptivedisclosure.org). Text mining services by Sophia Katrenko and Edgar Meij (AID), and Martijn Schuemie (BioSemantics, Erasmus University Rotterdam). OMIM service from the Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, director Hideaki Sugawara (see http://xml.nig.ac.jp).

Changes to our original BioAID_DiseaseDiscovery workflow:

* Use of Martijn Schuemie's synsets service to
  * provide UniProt IDs for discovered proteins
  * filter false positive discoveries: only proteins with a UniProt ID go through; this introduces some false negatives (e.g. discovered proteins with a name shorter than 3 characters)
  * solve a major issue with the original workflow where some false positives could contribute disproportionately to the number of discovered diseases
* Counting of results in various ways.
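
The UniProt filter described above can be sketched roughly as follows. This is a minimal illustration in Python, not the workflow's actual Beanshell code: the function name `filter_true_proteins` is hypothetical, and it assumes that the lookup step yields either an accession string or `False` for each discovered protein name.

```python
def filter_true_proteins(proteins, uniprot_ids):
    """Keep only the proteins for which a UniProt ID was found.

    `proteins` and `uniprot_ids` are parallel lists; an entry in
    `uniprot_ids` is assumed to be falsy (e.g. False) when the lookup failed.
    """
    kept = [(name, uid) for name, uid in zip(proteins, uniprot_ids) if uid]
    true_proteins = [name for name, _ in kept]
    true_uniprot = [uid for _, uid in kept]
    return true_proteins, true_uniprot


# Illustrative input: 'tumor' is a false positive without a UniProt ID.
names = ["EZH2", "tumor"]
ids = ["Q15910", False]
print(filter_true_proteins(names, ids))  # (['EZH2'], ['Q15910'])
```

A filter of this shape also drops any genuine protein for which the lookup returns nothing, which is the false-negative trade-off mentioned in the list of changes.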

Run

Run this Workflow in the Taverna Workbench...

Option 1:

Copy and paste this link into File > 'Open workflow location...'
http://www.myexperiment.org/workflows/72/download?version=4

Workflow Components

Inputs (2)

Name	Description
query_string	Query for retrieving documents from an indexed corpus. It is assumed the query will be used for a search service based on Lucene. In short, that means the query should be a string of terms with logical operators or +/- signs to denote whether terms are wanted or unwanted. Entities will be discovered in the documents that match this query. If you have a single protein as query, consider adding the 'ProteinSynonymsToQuery' workflow in front of this input.
maxNumberOfDocsToRetrieve	This limits the number of relevant documents retrieved from our Medline index. A maximum of 10 is good for testing; up to 100 works well for us (though it takes some time); much above 100, you may find Taverna 1 choking on its memory/data handling limitations.
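
As an illustration of the kind of value `query_string` expects, the sketch below assembles a query in the classic Lucene syntax (`+` for wanted terms, `-` for unwanted terms, double quotes around phrases). The helper `build_query` and its parameter names are hypothetical and not part of the workflow.

```python
def build_query(optional=(), required=(), excluded=()):
    """Assemble a Lucene-style query string from three groups of terms."""

    def quote(term):
        # Multi-word terms must be quoted to be treated as a phrase.
        return f'"{term}"' if " " in term else term

    parts = [quote(t) for t in optional]
    parts += ["+" + quote(t) for t in required]
    parts += ["-" + quote(t) for t in excluded]
    return " ".join(parts)


query_string = build_query(
    optional=["EZH2", "Enhancer of Zeste"],
    required=["chromatin"],
    excluded=["clinical"],
)
print(query_string)  # EZH2 "Enhancer of Zeste" +chromatin -clinical
```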

Processors (9)

Name	Type	Description
search_field	stringconstant
Document_index	stringconstant
CountProteins	beanshell
CountDiseases	beanshell
CountDiseasesPerProtein	beanshell
Flatten_and_make_unique	workflow
Link_proteins_to_diseases	workflow
Retrieve_documents	workflow	This workflow retrieves relevant documents, based on a query optimized by adding a string to the original query that will rank the search output according to the most recent years. The added string adds years with priorities (most recent is highest); it starts at 2007.
Discover_RatHumanMouseUniProt_proteins	workflow	This workflow applies the discovery workflow built around the AIDA 'Named Entity Recognize' web service by Sophia Katrenko. It uses the pre-learned genomics model, named 'MedLine', to find genomics concepts in a set of documents in lucene output format.
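
The year-based ranking described for Retrieve_documents could look roughly like the sketch below. It assumes Lucene's `term^boost` syntax and makes the original query mandatory while leaving the year terms optional, so that they only influence ranking. The actual priority string used by the workflow is not shown on this page; the function names, the boost values, and the number of years are assumptions.

```python
def year_priority_string(latest_year=2007, n_years=5):
    """Build boosted year terms, the most recent year getting the highest boost."""
    return " ".join(
        f"{latest_year - i}^{n_years - i}" for i in range(n_years)
    )


def prioritise_lucene_query(query_string, priority_string):
    """Require the original query and append optional, boosted year terms."""
    return f"+({query_string}) {priority_string}"


priority = year_priority_string(2007, 3)
print(priority)  # 2007^3 2006^2 2005^1
print(prioritise_lucene_query('EZH2 OR "Enhancer of Zeste"', priority))
```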

Beanshells (8)

Name	Inputs	Outputs
label_OMIM_disease	OMIM_disease_string	OMIM_disease_label
Prioritise_lucene_query	query_string priority_string	lucene_query
FilterTrueProteinByUniProtID	protein uniprot	true_protein true_uniprot
Strip_xml	tagged_term	term
UniProtOrNot	uniprotIDlist	uniprotID_or_False
CountProteins	list	count
CountDiseases	list	count
CountDiseasesPerProtein	list	count
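
For orientation, a step such as Strip_xml (tagged_term in, term out) might amount to something like the following Python sketch. The tag format shown is hypothetical, since the page does not show what the named entity recognizer emits, and the real Beanshell script may work differently.

```python
import re


def strip_xml(tagged_term):
    """Remove any XML-style tags and surrounding whitespace from a term."""
    return re.sub(r"<[^>]+>", "", tagged_term).strip()


# Hypothetical tagged output from the entity recognizer.
print(strip_xml("<protein>EZH2</protein>"))  # EZH2
```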

Outputs (8)

Name	Description
relevant_documents
discovered_proteins
discovered_diseases
diseases_per_protein
protein_count
disease_count_per_protein
discovered_uniprot_ids
disease_count

Links (18)

Source	Sink
maxNumberOfDocsToRetrieve	Retrieve_documents:maxHits
query_string	Retrieve_documents:query_string
Discover_RatHumanMouseUniProt_proteins:discovered_proteins	CountProteins:list
Discover_RatHumanMouseUniProt_proteins:discovered_proteins	Link_proteins_to_diseases:keyword
Document_index:value	Retrieve_documents:document_index
Flatten_and_make_unique:flattened_unique_output	CountDiseases:list
Link_proteins_to_diseases:OMIM_disease_label	CountDiseasesPerProtein:list
Link_proteins_to_diseases:OMIM_disease_label	Flatten_and_make_unique:input
CountDiseasesPerProtein:count	disease_count_per_protein
CountProteins:count	protein_count
Discover_RatHumanMouseUniProt_proteins:discovered_proteins	discovered_proteins
Discover_RatHumanMouseUniProt_proteins:discovered_uniprot_ids	discovered_uniprot_ids
Flatten_and_make_unique:flattened_unique_output	discovered_diseases
Link_proteins_to_diseases:OMIM_disease_label	diseases_per_protein
Retrieve_documents:relevant_documents	Discover_RatHumanMouseUniProt_proteins:documents_from_lucene
Retrieve_documents:relevant_documents	relevant_documents
search_field:value	Retrieve_documents:search_field
CountDiseases:count	disease_count
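
The counting part of the links above can be read as follows: the per-protein disease lists are counted as they are, and separately flattened into one list of unique diseases that is then counted. The Python sketch below illustrates that data flow under the assumption that Link_proteins_to_diseases yields one list of disease labels per protein; the function names and sample labels are illustrative only.

```python
def flatten_and_make_unique(nested):
    """Flatten a list of lists, keeping the first occurrence of each item."""
    flat = [item for sublist in nested for item in sublist]
    return list(dict.fromkeys(flat))  # dict preserves insertion order


def count_per_protein(diseases_per_protein):
    """Number of linked diseases for each protein, in input order."""
    return [len(diseases) for diseases in diseases_per_protein]


# Hypothetical output of Link_proteins_to_diseases for three proteins.
diseases_per_protein = [["disease A", "disease B"], [], ["disease B"]]

discovered_diseases = flatten_and_make_unique(diseases_per_protein)
print(discovered_diseases)                      # ['disease A', 'disease B']
print(len(discovered_diseases))                 # 2
print(count_per_protein(diseases_per_protein))  # [2, 0, 1]
```

Listing the counts per protein makes it easier to spot a single extracted 'protein' that accounts for an unusually large share of the diseases.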

Coordinations (0)

Workflow Type

Taverna 1

Uploader

Marco Roos

License

All versions of this Workflow are licensed under:

Version 4 (latest) (of 4)

Credits (2)

(People/Groups)

Attributions (0)

(Workflows/Files)

None

Tags (9)

Uploader tags

Shared with Groups (1)

Featured In Packs (1)

AIDA demo pack

Attributed By (3)

(Workflows/Files)

BioAID_ProteinDiscovery_filterOnHumanUniprot_perDoc_html
Private item
BioAID_ProteinToDiseases

Favourited By (6)

Statistics

19288 viewings

8893 downloads

Citations (0)

None

Version History

In chronological order:

BioAID_DiseaseDiscovery

Created by Marco Roos on Monday 12 November 2007 22:39:04 (UTC)

Last edited by Marco Roos on Tuesday 21 October 2008 10:44:19 (UTC)

BioAID_DiseaseDiscovery

Created by Marco Roos on Sunday 14 December 2008 22:40:51 (UTC)

Last edited by Marco Roos on Sunday 14 December 2008 23:59:58 (UTC)

Revision comment:

Added uniprot filter subworkflow.

BioAID_DiseaseDiscovery

Created by Marco Roos on Monday 15 December 2008 20:46:09 (UTC)

Last edited by Marco Roos on Monday 15 December 2008 20:47:51 (UTC)

Revision comment:

SearcherWS should not be on 8080

BioAID_DiseaseDiscovery_RatHumanMouseUniprotFilter

Created by Marco Roos on Wednesday 26 January 2011 14:43:26 (UTC)

Last edited by Marco Roos on Thursday 11 August 2011 09:22:23 (UTC)

Revision comment:

Updated reference to one of the Web Services (SynSets), but unfortunately the OMIM service has been terminated by DDBJ, so the workflow cannot run completely. I will need to find a replacement for the OMIM service.

Reviews (0)

No reviews yet

Comments (3)

Marco Roos

This is our original disease discovery workflow. Please note that some false positives among 'proteins' extracted from abstracts can contribute disproportionately to the number of diseases retrieved from OMIM (e.g. a protein called 'tumor').

If you are mainly interested in human proteins, please use BioAID_DiseaseDiscovery_byHumanUniProt. This workflow filters false positives by a check against human UniProt IDs (using a service provided by Martijn Schuemie).

In other cases you may want to try the BioAID_DiseaseDisvcovery_count version, as with this you can check manually for false positives. Diseases are listed and counted per extracted protein. We discovered the weakness in our original workflow with this workflow.

Marco Roos	Sunday 14 December 2008 22:45:00 (UTC)
	I updated the original with a version that both filters using uniprot (v2: rat, human, mouse), and counts. The original workflow can still be found as version 1. I will delete the separate uniprot and count versions from myExperiment.

Marco Roos	Thursday 11 August 2011 09:25:54 (UTC)
	Unfortunately, the OMIM service by DDBJ was discontinued. Therefore, you will find that this workflow does not run completely unless you replace the OMIM service with a service that has a similar function. The workflow up to that service, i.e. doing only protein extraction, has been updated to Taverna 2.

Other workflows that use similar services (14)

Only the first 2 workflows that use similar services are shown.

Taverna 1

Uploader

Marco Roos

BioAID_ProteinToDiseases (1)

This workflow was based on BioAID_DiseaseDiscovery (changes: expects only one protein name, adds protein synonyms). This workflow finds diseases relevant to the query string via the following steps: 1. A user query: a single protein name 2. Add synonyms (service courtesy of Martijn Schuemie, Erasmus University Rotterdam) 3. Retrieve documents: finds relevant documents (abstract+title) based on query 4. Discover proteins: extract proteins discovered in the set of relevant abstracts 5. Link proteins ...

Created: 2007-11-14 | Last updated: 2007-11-15

Credits: Marco Roos Martijn Schuemie AID

Attributions: BioAID_DiseaseDiscovery_RatHumanMouseUniprotFilter

Taverna 1

Uploader

Marco Roos

BioAID_ProteinDiscovery_filterOnHumanUnipr... (11)

This workflow finds proteins relevant to the query string via the following steps: A user query: a single gene/protein name. E.g.: (EZH2 OR "Enhancer of Zeste"). Retrieve documents: finds 'maximumNumberOfHits' relevant documents (abstract+title) based on query (the AIDA service inside is based on Apache's Lucene) Discover proteins: extract proteins discovered in the set of relevant abstracts with a 'named entity recognizer' trained on genomic terms using a Bayesian approach; the AIDA serv...

Created: 2009-05-28

Credits: Marco Roos Martijn Schuemie AID AID_myGrid_collaboration

Attributions: BioAID_DiseaseDiscovery_RatHumanMouseUniprotFilter