This workflow finds diseases relevant to the query string via the following steps:
1. A user query: a list of terms or a boolean query; see the Apache Lucene project for the full syntax. E.g.: (EZH2 OR "Enhancer of Zeste" +(mutation chromatin) -clinical). If your query is a protein, consider adding the 'ProteinSynonymsToQuery' workflow in front of the input.
2. Retrieve documents: finds 'maximumNumberOfHits' relevant documents (abstract+title) based on the query (the underlying AIDA service is based on Apache Lucene).
3. Discover proteins: extracts proteins from the set of relevant abstracts with a 'named entity recognizer' trained on genomic terms using a Bayesian approach; the underlying AIDA service is based on LingPipe. This subworkflow also filters false positives from the discovered proteins by requiring that a discovery has a valid UniProt ID. Martijn Schuemie's service for this contains only human UniProt IDs, which is why this workflow only works for human proteins.
4. Link proteins to diseases contained in the OMIM disease database (with a service from Japan that interrogates OMIM).
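The boolean query syntax from step 1 combines plain terms, quoted phrases, and +/- operators. A minimal sketch of composing such a query string (class and method names are illustrative, not part of the workflow):

```java
public class QueryExample {
    // Compose a Lucene-style boolean query: a base clause, a group of
    // required terms (+), and excluded terms (-).
    public static String buildQuery(String base, String[] required, String[] excluded) {
        StringBuilder q = new StringBuilder(base);
        q.append(" +(");
        for (int i = 0; i < required.length; i++) {
            if (i > 0) q.append(' ');
            q.append(required[i]);
        }
        q.append(')');
        for (String e : excluded) {
            q.append(" -").append(e);
        }
        return q.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("(EZH2 OR \"Enhancer of Zeste\")",
                new String[] {"mutation", "chromatin"},
                new String[] {"clinical"}));
        // (EZH2 OR "Enhancer of Zeste") +(mutation chromatin) -clinical
    }
}
```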
Workflow by Marco Roos (AID = Adaptive Information Disclosure, University of Amsterdam; http://adaptivedisclosure.org)
Text mining services by Sophia Katrenko and Edgar Meij (AID), and Martijn Schuemie (BioSemantics, Erasmus University Rotterdam).
OMIM service from the Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, director Hideaki Sugawara (see http://xml.nig.ac.jp)
Changes to our original BioAID_DiseaseDiscovery workflow:
* Use of Martijn Schuemie's synsets service to:
  * provide UniProt IDs for discovered proteins
  * filter false positive discoveries: only proteins with a UniProt ID pass through; this introduces some false negatives (e.g. discovered proteins with names shorter than 3 characters)
  * solve a major issue with the original workflow, where some false positives could contribute disproportionately to the number of discovered diseases
* Counting of results in various ways.
content
MedLine
count = list.size();
list
count
count = list.size();
list
count
count = list.size();
list
count
org.embl.ebi.escience.scuflworkers.java.FlattenList
org.embl.ebi.escience.scuflworkers.java.StringStripDuplicates
get Keyword
http://xml.nig.ac.jp/wsdl/OMIM.wsdl
search
// Wrap the OMIM disease string in a label element.
StringBuffer temp = new StringBuffer();
temp.append("<OMIM_disease_label>");
temp.append(OMIM_disease_string);
temp.append("</OMIM_disease_label>");
String OMIM_disease_label = temp.toString();
OMIM_disease_string
OMIM_disease_label
org.embl.ebi.escience.scuflworkers.java.FilterStringList
(#\d+ .+)|(%\d+ .+)
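The FilterStringList pattern above keeps only lines that start with '#' or '%' followed by a MIM number, i.e. OMIM phenotype entries. A sketch of what it accepts (class name illustrative; the test strings are made-up examples, not real OMIM records):

```java
public class OmimFilter {
    // Keep a line only if it begins with '#' or '%' plus a MIM number,
    // followed by a space and the disease label.
    public static boolean keep(String line) {
        return line.matches("(#\\d+ .+)|(%\\d+ .+)");
    }
}
```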
org.embl.ebi.escience.scuflworkers.java.StringStripDuplicates
\n
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
org.embl.ebi.escience.scuflworkers.java.FlattenList
text/xml
This workflow retrieves relevant documents, based on a query optimized by appending a clause to the original query that ranks the search output by recency. The appended clause lists years with boosts (most recent is highest), starting at 2007.
10
This workflow retrieves relevant documents, based on a query optimized by appending a clause to the original query that ranks the search output by recency. The appended clause lists years with boosts (most recent is highest), starting at 2007.
This workflow does four things:
1. it retrieves documents relevant to the query string
2. it discovers entities in those documents; these are considered relevant entities
3. it filters proteins from those entities (on the tag protein_molecule)
4. it removes all query terms from the list produced by step 3 (query terms temporarily considered proteins)
ToDo:
* Replace step 4 by the following procedure:
1. remove the query terms from the output of NER (probably by a regexp matching on what is inside the tag, possibly case-insensitive)
2. remove tag_as_protein_molecule (obsolete)
* Add synonym service/workflow
Note that Remove_inputquery has an alternative iteration strategy (dot product instead of cross product). The same holds for 'Join' in 'SplitQuery'.
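The case-insensitive removal proposed in the ToDo could be sketched as follows (class and method names are illustrative; the workflow itself does not contain this code):

```java
import java.util.regex.Pattern;

public class RemoveQueryTerms {
    // Remove a tagged entity when the text inside the tag equals the
    // original query term, ignoring case. Pattern.quote() treats the
    // query term as a literal, not as a regex.
    public static String remove(String taggedOutput, String queryTerm) {
        Pattern p = Pattern.compile(
                "<protein_molecule>" + Pattern.quote(queryTerm) + "</protein_molecule>",
                Pattern.CASE_INSENSITIVE);
        return p.matcher(taggedOutput).replaceAll("");
    }
}
```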
// Make the user query mandatory and append the year-priority clause.
StringBuffer temp = new StringBuffer();
temp.append("+(");
temp.append(query_string);
temp.append(") +");
temp.append(priority_string);
String lucene_query = temp.toString();
query_string
priority_string
lucene_query
year:(2007^10 2006^9 2005^8 2004^7 2003^6 2002^5 2001^4 2000^3 1999^2 1998^1)
Lucene query string
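The Beanshell fragment above combines the user query with the year-boost string; a sketch of what it produces (class name illustrative, boost string abbreviated):

```java
public class ExpandQuery {
    // Mirror the workflow's query-expansion script: wrap the user query
    // in a mandatory group and append the priority clause as a second
    // mandatory clause.
    public static String expand(String queryString, String priorityString) {
        StringBuffer temp = new StringBuffer();
        temp.append("+(");
        temp.append(queryString);
        temp.append(") +");
        temp.append(priorityString);
        return temp.toString();
    }
}
```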
This workflow applies the search web service from the AIDA toolbox.
Comments:
This search service uses Lucene defaults; it may be necessary to optimize the query string to adapt the behaviour to what is most relevant in a particular domain (e.g. for MEDLINE, prioritizing by publication date is useful). Lucene favours shorter sentences, which may be bad for subsequent information extraction.
http://ws.adaptivedisclosure.org/axis/services/SearcherWS?wsdl
search
text/xml
text/xml
This workflow applies the discovery workflow built around the AIDA 'Named Entity Recognize' web service by Sophia Katrenko. It uses the pre-learned genomics model named 'MedLine' to find genomics concepts in a set of documents in Lucene output format.
MedLine
This workflow contains the 'Named Entity Recognize' web service from the AIDA toolbox, created by Sophia Katrenko. It can be used to discover entities of a certain type (determined by 'learned_model') in documents provided in Lucene output format.
Known issues:
The output of NErecognize can contain concepts with '/' characters, which breaks the XML. For post-processing its results it is therefore better to use string manipulation than XML manipulation.
The output is per document, which means entities will be redundant if they occur in more than one document.
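String-based post-processing, as recommended above, can strip the entity tags without ever parsing the (possibly broken) XML; a minimal sketch using the same tag-matching regex that appears later in this workflow (class name illustrative):

```java
public class StripTags {
    // Remove any opening or closing tag, keeping the inner text intact,
    // even when that text contains characters (such as '/') that would
    // make an XML parser fail.
    public static String strip(String tagged) {
        return tagged.replaceAll("</?[\\w\\d-]+>", "");
    }
}
```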
http://ws.adaptivedisclosure.org/axis/services/NERecognizerService?wsdl
NErecognize
lucene
NElist
Model to discover a set of specific concepts; e.g. the pre-learned model named 'MedLine' will make the service discover genomics concepts.
text/rdf
text/xml
Entities discovered in documents provided in Lucene output format.
This workflow filters protein_molecule-labelled terms from an input string (list). The result is a tagged list of proteins (disregarding false positives in the input).
Internal information:
This workflow is a copy of 'filter_protein_molecule_MR3' as used for the NBIC poster (now in Archive).
// Use equals(): '!=' would compare object references, not string content.
if (!"False".equals(uniprot)) {
	true_protein = uniprot == null ? protein : protein;
	true_protein = protein;
	true_uniprot = uniprot;
}
protein
uniprot
true_protein
true_uniprot
.+
org.embl.ebi.escience.scuflworkers.java.FilterStringList
import java.util.regex.*;
// Strip any opening or closing XML tag, keeping only the inner term.
Pattern pattern = Pattern.compile("</?[\\w\\d-]+>");
Matcher matcher = pattern.matcher(tagged_term);
String term = matcher.replaceAll("");
tagged_term
term
if (uniprotIDlist.isEmpty()) {
	uniprotID_or_False = "False";
} else {
	// Take the first UniProt ID from the list.
	uniprotID_or_False = uniprotIDlist.iterator().next().toString();
}
uniprotIDlist
uniprotID_or_False
org.embl.ebi.escience.scuflworkers.java.StringStripDuplicates
http://bubbles.biosemantics.org:8080/axis/services/SynsetServer/SynsetServer.jws?wsdl
getUniprotID
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
<protein_molecule>\w*</protein_molecule>
(?=<protein_molecule>)|(?<=</protein_molecule>)
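The lookaround expression above splits a concatenated string of tagged proteins into one element per tag without consuming the tags themselves (the lookahead matches before each opening tag, the lookbehind after each closing tag). A sketch of its behaviour (class name illustrative):

```java
public class SplitTagged {
    // Split on zero-width positions before '<protein_molecule>' and
    // after '</protein_molecule>', so each element keeps its tags.
    public static String[] split(String s) {
        return s.split("(?=<protein_molecule>)|(?<=</protein_molecule>)");
    }
}
```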
.+
org.embl.ebi.escience.scuflworkers.java.FilterStringList
org.embl.ebi.escience.scuflworkers.java.FilterStringList
text/rdf
text/xml
Query for retrieving documents from an indexed corpus. It is assumed the query will be used by a search service based on Lucene; in short, that means the query should be a string of terms with logical operators or +/- signs to denote whether terms are wanted or unwanted. Documents that match this query are used for entity discovery.
If you have a single protein as query, consider adding the 'ProteinSynonymsToQuery' workflow in front of this input.
This limits the number of relevant documents retrieved from our MEDLINE index. A maximum of 10 is good for testing; up to 100 works well for us (it takes some time), but much above 100 you may find Taverna 1 choking on its memory/data-handling limitations.
text/rdf
text/xml