This workflow finds proteins relevant to the query string via the following steps:
1. A user query: a single gene/protein name. E.g.: (EZH2 OR "Enhancer of Zeste").
2. Retrieve documents: finds 'maximumNumberOfHits' relevant documents (abstract+title) based on the query (the AIDA service inside is based on Apache's Lucene)
3. Discover proteins: extracts proteins from the set of relevant abstracts with a 'named entity recognizer' trained on genomic terms using a Bayesian approach; the AIDA service inside is based on LingPipe. This subworkflow also filters false positives from the discovered proteins by requiring that each discovery has a valid UniProt ID. Martijn Schuemie's service for this lookup contains only human UniProt IDs, which is why this workflow only works for human proteins.
Workflow by Marco Roos (AID = Adaptive Information Disclosure, University of Amsterdam; http://adaptivedisclosure.org)
Text mining services by Sophia Katrenko and Edgar Meij (AID), and Martijn Schuemie (BioSemantics, Erasmus University Rotterdam).
Changes to our original BioAID_DiseaseDiscovery workflow:
* Stops at protein discovery
* Use of Martijn Schuemie's synsets service to
  * add synonyms to the query
  * provide UniProt IDs for discovered proteins
  * filter false positive discoveries: only proteins with a UniProt ID go through; this introduces some false negatives (e.g. discovered proteins with a name shorter than 3 characters)
* Counting of results in various ways, but no outputs defined in this simplified workflow.
* Output into simple html table.
<html> <head> <title>Results of text mining workflow</title> <link href='http://www.adaptivedisclosure.org/workflows/AIDA_workflows.css' rel='stylesheet' type='text/css'/> </head> <body> <div id='wrapper'> <div id='header'> <span id='aida_logo' title='Adaptive Information Disclosure Application'><a href='http://adaptivedisclosure.org'>Adaptive Information Disclosure Application</a></span> <span id='vle_logo' title='Virtual Laboratory for e-Science'><a href='http://www.vl-e.nl/'>Virtual Laboratory for e-Science</a></span> </div><!-- header --> <div id='page_title'> <h1><span title='Adaptive Information Disclosure Application'>AIDA</span> workflow results</h1> </div><!-- page_title --> <h3 align="center"> Results of workflow pending, please check back later... </h3> <div class='workflow' align='center'><img src='http://ws.adaptivedisclosure.org:8080/aida_public/aida_graphics/ProteinDiscoveryWorkflowGraphic.png'></div><div id='footer'> <div id='AIDA'/> </div><!-- footer --> </div><!-- wrapper --></body> </html>
http://aida.science.uva.nl:8888/axis/AidaFiler.jws?wsdl
save_html
/* Beanshell to turn result lists into a html table */
/* input variables:
String query_protein, String discovered_protein_list[][], String discovered_uniprot_id_list[][], String pubmed_id_list[] ,String ranking_score_list[][]
output variable: html_table
*/
import java.lang.String;
import java.util.*;
import java.lang.Integer;
private String pubmed_url_stub = "http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=PubMed&list_uids=";
private String iHop_url_stub_before = "http://www.ihop-net.org/UniPub/iHOP/?search=";
private String iHop_url_stub_after = "&field=UNIPROT__AC&ncbi_tax_id=9606&organism_syn=";
/* switch commenting below for single html page mode */
private String html_top="";
/*
private String html_top = "<html>\n<head>\n<title>Results of text mining workflow</title>\n<link href='http://www.adaptivedisclosure.org/workflows/AIDA_workflows.css' rel='stylesheet' type='text/css'/>\n</head>\n<body>\n<div id='wrapper'>\n<div id='header'>\n<span id='aida_logo' title='Adaptive Information Disclosure Application'><a href='http://adaptivedisclosure.org'>Adaptive Information Disclosure Application</a></span>\n<span id='vle_logo' title='Virtual Laboratory for e-Science'><a href='http://www.vl-e.nl/'>Virtual Laboratory for e-Science</a></span>\n</div><!-- header -->\n<div id='page_title'>\n<h1><span title='Adaptive Information Disclosure Application'>AIDA</span> workflow results</h1>\n</div><!-- page_title -->\n"+"<table align='center' summary='this table gives the results of the text mining workflow'>\n<caption><em>Results of text mining workflow</em></caption>\n<tr>\n<th>Query<br/>protein</th>\n<th>Associated<br/>with</th>\n<th>Published in<br/><small>(PubMed ID)</small></th>\n</tr>\n";
*/
private String html_bottom="";
/*
private String html_bottom="</table>\n\n<div id='footer'>\n<div id='AIDA'/>\n</div><!-- footer -->\n</div><!-- wrapper --></body>\n</html>\n";
*/
String key;
String qry;
String prot;
String uniprot;
String pub_id;
String rank;
String iHopString = new String();
String prev_qry="";
String prev_prot="";
String prev_uniprot="";
String prev_pub_id="";
String tablebody="";
Iterator prot_iterator;
Iterator uniprot_iterator;
Iterator rank_iterator;
Iterator protlist_iterator = (Iterator) discovered_protein_list.iterator();
Iterator uniprotlist_iterator = (Iterator) discovered_uniprot_id_list.iterator();
Iterator ranklist_iterator = (Iterator) ranking_score_list.iterator();
Iterator doc_iterator = (Iterator) pubmed_id_list.iterator();
qry =(String) query_protein;
while ( doc_iterator.hasNext() )
{
pub_id=(String) doc_iterator.next().toString();
prot_iterator = (Iterator) protlist_iterator.next().iterator();
uniprot_iterator = (Iterator) uniprotlist_iterator.next().iterator();
rank_iterator = (Iterator) ranklist_iterator.next().iterator();
while ( prot_iterator.hasNext() )
{
prot=(String) prot_iterator.next().toString();
uniprot=(String) uniprot_iterator.next().toString();
rank = (String) rank_iterator.next().toString();
key=(String) qry + rank + prot + pub_id;
if (!qry.equals(prot)) {
if (qry.equals(prev_qry)) { qry=",,"; } else { prev_qry=qry; }
if (prot.equals(prev_prot)) { prot=",,"; } else { prev_prot = prot; }
if (uniprot.equals(prev_uniprot)) { uniprot=",,"; iHopString=""; } else { prev_uniprot = uniprot; iHopString="<small><sup><a href='"+iHop_url_stub_before+uniprot+iHop_url_stub_after+"'>iHop</a></sup></small>"; }
if (pub_id.equals(prev_pub_id)) { pub_id=",,"; } else {prev_pub_id = pub_id; }
tablebody=tablebody+"<tr>\n<td align='center'>"+qry+"</td>\n<td align='center'><a href='http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=protein&id="+uniprot+"' title='uniprot id: "+uniprot+"'>"+prot+"</a>"+iHopString+"</td><td align='center' ><a href='"+pubmed_url_stub+pub_id+"'>"+pub_id+"</a></td>\n</tr>\n";
}
}
}
html_table = (String)(html_top+tablebody+html_bottom);
query_protein
discovered_protein_list
discovered_uniprot_id_list
pubmed_id_list
ranking_score_list
html_table
html_ref = "<a href='" + url + "'>" + url + "</a>";
url
html_ref
MedLine
count = list.size();
list
count
Yes
Default maximum number of documents to retrieve from MEDLINE for the query; proteins are extracted from these documents.
25
http://aida.science.uva.nl:8888/axis/AidaFiler.jws?wsdl
save_html
http://aida.science.uva.nl:8888/axis/AidaFiler.jws?wsdl
save_html
http://aida.science.uva.nl:8888/axis/AidaFiler.jws?wsdl
save_html
import java.util.*;
List newlist = new ArrayList();
for (int i=0; i<((int) Integer.parseInt(copy_number.toString())); i++) {
newlist.add(input);
}
clones=newlist;
copy_number
input
clones
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=Retrieve&db=PubMed&list_uids=
0
count = list.size();
list
count
org.embl.ebi.escience.scuflworkers.java.StringConcat
No
private String html_top = "<html>\n<head>\n<title>Results of text mining workflow</title>\n<link href='http://www.adaptivedisclosure.org/workflows/AIDA_workflows.css' rel='stylesheet' type='text/css'/>\n</head>\n<body>\n<div id='wrapper'>\n<div id='header'>\n<span id='aida_logo' title='Adaptive Information Disclosure Application'><a href='http://adaptivedisclosure.org'>Adaptive Information Disclosure Application</a></span>\n<span id='vle_logo' title='Virtual Laboratory for e-Science'><a href='http://www.vl-e.nl/'>Virtual Laboratory for e-Science</a></span>\n</div><!-- header -->\n<div id='page_title'>\n<h1><span title='Adaptive Information Disclosure Application'>AIDA</span> workflow results</h1>\n</div><!-- page_title -->\n"+"<table align='center' summary='this table gives the results of the text mining workflow'>\n<caption><em>Results of text mining workflow</em></caption>\n<tr>\n<th>Query<br/>protein</th>\n<th>Associated<br/>with</th>\n<th>Published in<br/><small>(PubMed ID)</small></th>\n</tr>\n";
private String html_bottom="</table>\n\n<div id='footer'>\n<div id='AIDA'/>\n</div><!-- footer -->\n</div><!-- wrapper --></body>\n</html>\n";
html_top
html_bottom
content
This workflow applies the discovery workflow built around the AIDA 'Named Entity Recognize' web service by Sophia Katrenko. It uses the pre-learned genomics model, named 'MedLine', to find genomics concepts in a set of documents in lucene output format.
MedLine
This workflow contains the 'Named Entity Recognize' web service from the AIDA toolbox, created by Sophia Katrenko. It can be used to discover entities of a certain type (determined by 'learned_model') in documents provided in a lucene output format.
Known issues:
The output of NErecognize contains concepts with / characters, breaking the xml. For post-processing its results it is better to use string manipulation than xml manipulations.
The output is per document, which means entities will be redundant if they occur in more than one document.
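The two known issues above suggest a post-processing step along the following lines. This is a minimal sketch, not part of the workflow: the class and method names are hypothetical, and it assumes the service output contains flat <protein_molecule>...</protein_molecule> elements inside <doc> elements. It extracts protein names with a regular expression rather than an XML parser, and collects them in a set so that a protein found in several documents is reported once.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the workflow.
final class NerPostProcessSketch {
    // Captures the text between protein_molecule tags. [^<]* stops at the
    // next tag, so '/' characters inside a name are harmless here.
    private static final Pattern PROTEIN =
        Pattern.compile("<protein_molecule>([^<]*)</protein_molecule>");

    // Returns each distinct protein name once, in first-seen order.
    static List<String> uniqueProteins(String nerOutput) {
        Set<String> seen = new LinkedHashSet<String>();
        Matcher m = PROTEIN.matcher(nerOutput);
        while (m.find()) {
            String name = m.group(1).trim();
            if (!name.isEmpty()) {
                seen.add(name);
            }
        }
        return new ArrayList<String>(seen);
    }

    public static void main(String[] args) {
        String sample = "<result_final>"
            + "<doc id=\"1\"><protein_molecule>p53</protein_molecule>"
            + "<other_name>EZH2</other_name></doc>"
            + "<doc id=\"2\"><protein_molecule>p53</protein_molecule>"
            + "<protein_molecule>HDAC1</protein_molecule></doc>"
            + "</result_final>";
        System.out.println(uniqueProteins(sample));
    }
}
```

The comparison is case-sensitive, so 'Ezh2' and 'EZH2' would be kept as separate entries; whether they should be merged is a design choice left open here.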
NElist
lucene
http://ws.adaptivedisclosure.org/axis/services/NERecognizerService?wsdl
NErecognize
Model to discover a set of specific concepts; e.g. the prelearned model named 'MedLine' will make the service discover genomics concepts.
text/rdf
text/xml
Entities discovered in documents provided in lucene output format.
This workflow filters protein_molecule-labeled terms from an input string(list). The result is a tagged list of proteins (disregarding false positives in the input).
Internal information:
This workflow is a copy of 'filter_protein_molecule_MR3' used for the NBIC poster (now in Archive).
//protein_molecule
N/A
NA
net.sourceforge.taverna.scuflworkers.xml.XPathTextWorker
net.sourceforge.taverna.scuflworkers.xml.XPathTextWorker
import java.util.regex.*;
Pattern p = Pattern.compile(findstring);
Matcher m = p.matcher(input);
output = (String) m.replaceAll(replacestring);
input
findstring
replacestring
output
Iterator i;
if (uniprotIDlist.isEmpty()) {
uniprotID_or_False = "False";
} else {
uniprotID_or_False = (String) uniprotIDlist.iterator().next().toString();
}
uniprotIDlist
uniprotID_or_False
/* compare string contents; != would compare object references */
if (!"False".equals(uniprot)) {
true_protein=protein;
true_uniprot=uniprot;
}
protein
uniprot
true_protein
true_uniprot
org.embl.ebi.escience.scuflworkers.java.StringStripDuplicates
http://bubbles.biosemantics.org:8080/axis/services/SynsetServer/SynsetServer.jws?wsdl
getUniprotID
//doc
net.sourceforge.taverna.scuflworkers.xml.XPathTextWorker
.+
org.embl.ebi.escience.scuflworkers.java.FilterStringList
//doc/@id
.+
org.embl.ebi.escience.scuflworkers.java.FilterStringList
text/xml
Example:
<result_final><doc id="15208672"><other_name>Replicative</other_name><other_name>cell</other_name><cell_type>damaged</cell_type><other_name>tumor</other_name><protein_molecule>tumor</protein_molecule><protein_molecule>p53</protein_molecule><other_name>EZH2</other_name><protein_molecule>p53</protein_molecule><other_name>epigenetic</other_name><other_name>genetic</other_name><other_name>EZH2</other_name><tissue>tumors</tissue><protein_molecule>p53</protein_molecule><other_name>cancer</other_name></doc><doc id="15520282"><protein_molecule>Ezh2</protein_molecule><protein_molecule>Polycomb</protein_molecule><protein_complex>PRC3</protein_complex><other_organic_compound>histone</other_organic_compound><other_name>HKMT</other_name><protein_molecule>Ezh2</protein_molecule><protein_molecule>HDAC1</protein_molecule><protein_family_or_group>YY1</protein_family_or_group><other_organic_compound>H3</other_organic_compound><protein_molecule>MyoD</protein_molecule><protein_molecule>SRF</protein_molecule><DNA_family_or_group>chromatin</DNA_family_or_group><other_organic_compound>H3</other_organic_compound><protein_complex>Ezh2</protein_complex><protein_family_or_group>positive</protein_family_or_group><DNA_domain_or_region>genomic</DNA_domain_or_region><other_name>muscle</other_name><other_name>cell</other_name></doc></result_final>
text/rdf
text/xml
This workflow creates a query string from the query term using Martijn Schuemie's synonym service. The service is limited to proteins, enzymes and genes. An input query that is a boolean string will be split and processed, but the boolean logic of the input query will be lost.
org.embl.ebi.escience.scuflworkers.java.FlattenList
org.embl.ebi.escience.scuflworkers.java.FlattenList
import java.util.*;
String synstring="\"" + query_term + "\"";
String syn;
Iterator iterator = synonymlist.iterator();
while ( iterator.hasNext() )
{
synstring = synstring + " OR ";
syn = ((String) iterator.next());
synstring = synstring + "\"" + syn + "\"";
}
new_query = synstring;
synonymlist
query_term
new_query
Protein synonym service by Martijn Schuemie, Erasmus Medical Centre, University of Rotterdam, The Netherlands.
http://bubbles.biosemantics.org:8080/axis/services/SynsetServer/SynsetServer.jws?wsdl
getSynsets
http://aida.science.uva.nl:8888/axis/SynsetServer.jws?wsdl
getSynsets
Splits an input query string into its parts. Works for queries that contain search terms, search phrases between double quotes, connected by AND or OR. Behaviour undetermined when other characters such as +, -, or brackets are used. Should work now for well formed patterns with bracketed substrings separated by AND/OR/AND NOT/OR NOT, e.g. (Topic1) AND NOT (Topic2), but not extensively tested.
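As an illustration of the operator-splitting step only, here is a minimal Java sketch. The class and method names are hypothetical. The pattern is the operator/bracket regular expression that appears among this workflow's string constants; the sketch does not reproduce the further splitting of unquoted terms and quoted phrases that the workflow performs with its other processors.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the workflow.
final class SplitQuerySketch {
    // Boolean operators (longest alternatives first) and brackets.
    static final String OPERATORS =
        "( +AND +NOT +)|( +OR +NOT +)|( +AND +)|( +OR +)|\\(|\\)";

    // Splits on operators and brackets, dropping the empty pieces that
    // arise when two separators are adjacent, e.g. ") AND NOT (".
    static List<String> split(String query) {
        List<String> parts = new ArrayList<String>();
        for (String piece : query.split(OPERATORS)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("(EZH2) AND NOT (\"Enhancer of Zeste\")"));
    }
}
```

As the description says, the boolean logic is lost: the result is just the list of terms and quoted phrases.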
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
Changed iteration strategy!
org.embl.ebi.escience.scuflworkers.java.StringSetUnion
(((?<=") (?=\w))|((?<=\w) (?=")))|((?<=") (?="))
org.embl.ebi.escience.scuflworkers.java.FilterStringList
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
"
[^"]+
org.embl.ebi.escience.scuflworkers.java.FilterStringList
org.embl.ebi.escience.scuflworkers.java.FlattenList
org.embl.ebi.escience.scuflworkers.java.FilterStringList
s1 s2 s3 AND s4 s5 OR "s6 s7" s8 s9 AND s10 OR "s11" "s12 s13" s14 s15 "s16"
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
\w.*
".+"
( +AND +NOT +)|( +OR +NOT +)|( +AND +)|( +OR +)|\(|\)
org.embl.ebi.escience.scuflworkers.java.FlattenList
Queries that contain search terms, search phrases between double quotes, possibly connected by AND or OR. Behaviour undetermined when other characters such as +, -, or brackets are used.
Query term without quotes; only synonyms of proteins, enzymes and genes will be returned. Boolean queries will be processed, but the input boolean logic will be lost.
E.g. 'EZH2'
This workflow retrieves relevant documents using a query that is optimized by appending a priority string to the original query, so that results from more recent years rank higher. The appended string lists years with priorities (the most recent year gets the highest priority); it starts at 2007.
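A minimal sketch of how such a year-priority string could be generated and combined with a query. The class, the method names and the parameters are hypothetical; the workflow itself uses a fixed string constant. The query is assembled in the same shape the workflow's script uses, "+(" + query + ") +" + priority.

```java
// Hypothetical helper for illustration; not part of the workflow.
final class YearPrioritySketch {
    // Builds e.g. "year:(2007^3 2006^2 2005^1)": the newest year gets the
    // highest boost, and each earlier year gets one less.
    static String priorityString(int newestYear, int count) {
        StringBuilder sb = new StringBuilder("year:(");
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(newestYear - i).append('^').append(count - i);
        }
        return sb.append(')').toString();
    }

    // Marks both the original query and the year clause as required ('+').
    static String luceneQuery(String query, String priority) {
        return "+(" + query + ") +" + priority;
    }

    public static void main(String[] args) {
        String priority = priorityString(2007, 3);
        System.out.println(luceneQuery("EZH2", priority));
    }
}
```

Because the year clause carries a '+', documents from years that are not listed would presumably be excluded rather than merely ranked lower; that is worth keeping in mind when choosing how many years to include.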
10
This workflow applies the search web service from the AIDA toolbox.
Comments:
This search service is based on Lucene defaults; it may be necessary to optimize the query string to adapt the behaviour to what is most relevant in a particular domain (e.g. for MEDLINE, prioritizing by publication date is useful). Lucene favours shorter sentences, which may be bad for subsequent information extraction.
http://ws.adaptivedisclosure.org/axis/services/SearcherWS?wsdl
search
text/xml
This workflow does four things:
1. it retrieves documents relevant for the query string
2. it discovers entities in those documents; these are considered relevant entities
3. it filters proteins from those entities (on the tag protein_molecule)
4. it removes all terms from the list produced by 3 (query terms temporarily considered proteins)
ToDo
* Replace step 4 by the following procedure:
1. remove the query terms from the output of NER (probably by a regexp matching on what is inside the tag, possibly case-insensitive)
2. remove tag_as_protein_molecule (obsolete)
* Add synonym service/workflow
Note that Remove_inputquery has an alternative iteration strategy (dot product instead of cross product). Idem for 'Join' in 'SplitQuery'.
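The difference between the two iteration strategies can be illustrated with a small Java sketch. The names are hypothetical and this is not workflow-engine code; in particular, truncating to the shorter list in the dot product is an assumption of this sketch. A cross product combines every element of one input list with every element of the other, while a dot product pairs elements by position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of the workflow or its engine.
final class IterationStrategySketch {
    // Cross product: every element of a is combined with every element of b.
    static List<String> cross(List<String> a, List<String> b) {
        List<String> out = new ArrayList<String>();
        for (String x : a) {
            for (String y : b) {
                out.add(x + "|" + y);
            }
        }
        return out;
    }

    // Dot product: elements are paired by index (assumption: extra
    // elements of the longer list are ignored).
    static List<String> dot(List<String> a, List<String> b) {
        List<String> out = new ArrayList<String>();
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            out.add(a.get(i) + "|" + b.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("doc1", "doc2");
        List<String> queries = Arrays.asList("q1", "q2");
        System.out.println(cross(docs, queries)); // four combinations
        System.out.println(dot(docs, queries));   // two pairs
    }
}
```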
StringBuffer temp=new StringBuffer();
temp.append("+(");
temp.append(query_string);
temp.append(") +");
temp.append(priority_string);
String lucene_query = temp.toString();
query_string
priority_string
lucene_query
year:(2007^10 2006^9 2005^8 2004^7 2004^6 2003^5 2002^4 2001^3 2000^2 1999^1)
Lucene query string
text/xml
A protein name to query. A single gene/protein name is expected, because the 'ProteinSynonymsToQuery' workflow is used on the query.
text/uri-list
Completed
save_html_init
SynonymsToQuery
Scheduled
Running
Completed
Proteins_to_html_table
save_html_top
Scheduled
Running
Completed
save_html_top
save_html
Scheduled
Running
Completed
save_html
save_html_bottom
Scheduled
Running