This workflow is based on BioAID_DiseaseDiscovery, with two changes: it expects only one protein name, and it adds protein synonyms.
This workflow finds diseases relevant to the query string via the following steps:
1. A user query: a single protein name
2. Add synonyms (service courtesy of Martijn Scheumie, Erasmus University Rotterdam)
3. Retrieve documents: find relevant documents (abstract + title) based on the query
4. Discover proteins: extract the proteins found in the set of relevant abstracts
5. Link proteins to diseases contained in the OMIM disease database.
content
MedLine
100
org.embl.ebi.escience.scuflworkers.java.StringStripDuplicates
org.embl.ebi.escience.scuflworkers.java.FlattenList
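import java.util.regex.*;
// Strip every opening or closing tag (e.g. <protein_molecule>, </protein_molecule>) from the term.
// \w already covers digits, so a separate \d in the character class is not needed.
Pattern pattern = Pattern.compile("</?[\\w-]+>");
Matcher matcher = pattern.matcher(tagged_term);
String term = matcher.replaceAll("");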
tagged_term
term
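The tag-stripping script above can be tried outside the workflow. The following is a minimal standalone sketch; the class and method names (TagStripper, stripTags) are hypothetical and only wrap the same regular expression.

```java
import java.util.regex.Pattern;

class TagStripper {
    // Same expression as the workflow script: '<', an optional '/',
    // one or more word characters, digits or hyphens, then '>'.
    private static final Pattern TAG = Pattern.compile("</?[\\w\\d-]+>");

    // Returns the term with all opening and closing tags removed.
    static String stripTags(String taggedTerm) {
        return TAG.matcher(taggedTerm).replaceAll("");
    }

    public static void main(String[] args) {
        // prints: EZH2
        System.out.println(stripTags("<protein_molecule>EZH2</protein_molecule>"));
    }
}
```

Only the tags are removed; the text between them, including spaces and punctuation, is left untouched.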
This workflow retrieves relevant documents using an optimized query: a string is appended to the original query so that the search output is ranked with the most recent years first. The appended string lists years with priorities (the most recent year has the highest priority), starting at 2007.
This workflow does four things:
1. it retrieves documents relevant for the query string
2. it discovers entities in those documents; these are considered relevant entities
3. it filters proteins from those entities (on the tag protein_molecule)
4. it removes the query terms from the list produced by step 3 (the query terms are temporarily considered proteins)
ToDo
* Replace step 4 by the following procedure:
1. remove the query terms from the output of NER (probably by a regexp matching on what is inside the tag, possibly case-insensitive)
2. remove tag_as_protein_molecule (obsolete)
* Add synonym service/workflow
Note that Remove_inputquery uses an alternative iteration strategy (dot product instead of cross product). The same applies to 'Join' in 'SplitQuery'.
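To illustrate the difference between the two iteration strategies mentioned in the note: a cross product combines every element of one input list with every element of the other, while a dot product pairs elements by position. The sketch below is not Taverna code; the class name, method names and sample values are hypothetical, and bounding the dot product by the shorter list is an assumption made only to keep the example safe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IterationStrategies {
    // Cross product: every element of a is combined with every element of b.
    static List<String> cross(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        for (String x : a) {
            for (String y : b) {
                out.add(x + "|" + y);
            }
        }
        return out;
    }

    // Dot product: elements are paired by position.
    // Assumption for this sketch: the shorter list bounds the pairing.
    static List<String> dot(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int n = Math.min(a.size(), b.size());
        for (int i = 0; i < n; i++) {
            out.add(a.get(i) + "|" + b.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> left = Arrays.asList("a1", "a2");
        List<String> right = Arrays.asList("b1", "b2");
        System.out.println(cross(left, right)); // [a1|b1, a1|b2, a2|b1, a2|b2]
        System.out.println(dot(left, right));   // [a1|b1, a2|b2]
    }
}
```

With two inputs of length n, the cross product yields n*n invocations and the dot product yields n, which is why the choice matters for processors that should handle matching items one-to-one.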
year:(2007^10 2006^9 2005^8 2004^7 2003^5 2002^4 2001^3 2000^2 1999^1)
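// Combine the user query and the year priorities into one Lucene query.
// The '+' prefix marks each clause as required.
StringBuffer temp = new StringBuffer();
temp.append("+(");
temp.append(query_string);
temp.append(") +");
temp.append(priority_string);
String lucene_query = temp.toString();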
query_string
priority_string
lucene_query
Lucene query string
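As an illustration of what the script above produces, the following sketch builds a year priority string programmatically and combines it with a query. The class and method names are hypothetical, and the scheme of strictly decreasing boosts (one step per year) is an assumption; the field name year and the "+(...) +..." layout are taken from the values above.

```java
class LuceneQueryBuilder {
    // Builds e.g. "year:(2007^3 2006^2 2005^1)":
    // the newest year gets the highest boost, each older year one less.
    static String yearPriority(int newestYear, int count) {
        StringBuilder sb = new StringBuilder("year:(");
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(newestYear - i).append('^').append(count - i);
        }
        return sb.append(')').toString();
    }

    // Same concatenation as the workflow script: both clauses are required ('+').
    static String luceneQuery(String queryString, String priorityString) {
        return "+(" + queryString + ") +" + priorityString;
    }

    public static void main(String[] args) {
        // prints: +(EZH2) +year:(2007^3 2006^2 2005^1)
        System.out.println(luceneQuery("EZH2", yearPriority(2007, 3)));
    }
}
```

Because the year clause is itself required, documents whose year is not in the list are excluded rather than merely ranked lower; in Lucene's classic query syntax the boosts (^n) only affect the ordering among the documents that match.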
This workflow applies the search web service from the AIDA toolbox.
Comments:
This search service is based on Lucene defaults; it may be necessary to optimize the query string to adapt the behaviour to what is most relevant in a particular domain (e.g. for MedLine, prioritizing by publication date is useful). Lucene's default scoring favours shorter texts, which may be bad for subsequent information extraction.
http://ws.adaptivedisclosure.org/axis/services/SearcherWS?wsdl
search
text/xml
text/xml
This workflow applies the discovery workflow built around the AIDA 'Named Entity Recognize' web service by Sophia Katrenko. It uses the pre-learned genomics model, named 'MedLine', to find genomics concepts in a set of documents in Lucene output format.
MedLine
This workflow filters protein_molecule-labeled terms from an input string (or list of strings). The result is a tagged list of proteins (disregarding false positives in the input).
Internal information:
This workflow is a copy of 'filter_protein_molecule_MR3' used for the NBIC poster (now in Archive).
(?=<protein_molecule>)|(?<=</protein_molecule>)
<protein_molecule>\w*</protein_molecule>
org.embl.ebi.escience.scuflworkers.java.StringStripDuplicates
org.embl.ebi.escience.scuflworkers.java.FilterStringList
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
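The two regular expressions above can be tried in plain Java. This is a sketch under assumptions: the class and method names are hypothetical, the sample text is made up, and the filter is assumed to require that the whole string matches the expression (as String.matches does), which may differ from the actual local worker.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

class ProteinFilter {
    // Split before each opening tag and after each closing tag (zero-width matches).
    static final String SPLIT_REGEX = "(?=<protein_molecule>)|(?<=</protein_molecule>)";
    // Keep only pieces that consist of exactly one tagged term.
    static final String FILTER_REGEX = "<protein_molecule>\\w*</protein_molecule>";

    static List<String> filterProteins(String tagged) {
        // LinkedHashSet drops duplicates while keeping first-seen order.
        LinkedHashSet<String> unique = new LinkedHashSet<>();
        for (String piece : tagged.split(SPLIT_REGEX)) {
            if (piece.matches(FILTER_REGEX)) {
                unique.add(piece);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        String text = "<protein_molecule>EZH2</protein_molecule> binds "
                + "<protein_molecule>EED</protein_molecule>; "
                + "<protein_molecule>EZH2</protein_molecule> again";
        // prints: [<protein_molecule>EZH2</protein_molecule>, <protein_molecule>EED</protein_molecule>]
        System.out.println(filterProteins(text));
    }
}
```

Under the whole-string assumption, \w* only accepts letters, digits and underscores, so a tagged name that contains a space or a hyphen would be dropped by the filter.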
text/xml
This workflow contains the 'Named Entity Recognize' web service from the AIDA toolbox, created by Sophia Katrenko. It can be used to discover entities of a certain type (determined by 'learned_model') in documents provided in Lucene output format.
Known issues:
The output of NErecognize contains concepts with / characters, which break the XML. When post-processing its results, string manipulation is therefore more reliable than XML manipulation.
The output is per document, which means entities will be redundant if they occur in more than one document.
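Both known issues can be handled with plain string processing. The sketch below assumes, purely for illustration, that each document's output is a flat sequence of <tag>term</tag> pairs; the class name, method name, the cell_type tag and the sample terms are hypothetical. It extracts terms with a regular expression instead of an XML parser, so a '/' inside a term is harmless, and it collects terms into a set so that entities occurring in several documents are reported once.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityExtractor {
    // <tag>term</tag>, where the closing tag must repeat the opening tag name (\1).
    private static final Pattern ENTITY = Pattern.compile("<([\\w-]+)>(.*?)</\\1>");

    // Returns the distinct terms carrying the given tag, in first-seen order.
    static Set<String> uniqueEntities(List<String> perDocumentOutput, String tag) {
        Set<String> unique = new LinkedHashSet<>();
        for (String doc : perDocumentOutput) {
            Matcher m = ENTITY.matcher(doc);
            while (m.find()) {
                if (m.group(1).equals(tag)) {
                    unique.add(m.group(2));
                }
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "<protein_molecule>IL-2/IL-4</protein_molecule> <cell_type>T cell</cell_type>",
                "<protein_molecule>IL-2/IL-4</protein_molecule>");
        // prints: [IL-2/IL-4]
        System.out.println(uniqueEntities(docs, "protein_molecule"));
    }
}
```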
NElist
lucene
http://ws.adaptivedisclosure.org/axis/services/NERecognizerService?wsdl
NErecognize
Model to discover a set of specific concepts; e.g. the prelearned model named 'MedLine' will make the service discover genomics concepts.
text/rdf
text/xml
Entities discovered in documents provided in Lucene output format.
text/rdf
text/xml
\n
(#\d+ .+)|(%\d+ .+)
org.embl.ebi.escience.scuflworkers.java.StringStripDuplicates
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
org.embl.ebi.escience.scuflworkers.java.FilterStringList
org.embl.ebi.escience.scuflworkers.java.FlattenList
// Wrap the OMIM disease string in an <OMIM_disease_label> tag.
StringBuffer temp = new StringBuffer();
temp.append("<OMIM_disease_label>");
temp.append(OMIM_disease_string);
temp.append("</OMIM_disease_label>");
String OMIM_disease_label = temp.toString();
OMIM_disease_string
OMIM_disease_label
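The OMIM post-processing values above (the split expression \n, the filter expression, and the label wrapper) can be combined into one standalone sketch. The class and method names are hypothetical, the sample response is made up and does not show real OMIM entries, the order of the steps is assumed, and the filter is assumed to match whole lines. In OMIM, a leading '#' marks a phenotype with a known molecular basis and a leading '%' marks a Mendelian phenotype whose molecular basis is unknown, so the filter keeps phenotype entries and drops the rest.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

class OmimLabels {
    static final String SPLIT_REGEX = "\\n";
    static final String FILTER_REGEX = "(#\\d+ .+)|(%\\d+ .+)";

    // Splits the response into lines, keeps '#' and '%' entries,
    // removes duplicates, and wraps each entry in a label tag.
    static List<String> diseaseLabels(String omimResponse) {
        LinkedHashSet<String> unique = new LinkedHashSet<>();
        for (String line : omimResponse.split(SPLIT_REGEX)) {
            if (line.matches(FILTER_REGEX)) {
                unique.add("<OMIM_disease_label>" + line + "</OMIM_disease_label>");
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        // Made-up entries, for illustration only.
        String response = "*100001 HYPOTHETICAL GENE\n"
                + "#100002 HYPOTHETICAL SYNDROME\n"
                + "%100003 HYPOTHETICAL DISORDER\n"
                + "#100002 HYPOTHETICAL SYNDROME";
        for (String label : diseaseLabels(response)) {
            System.out.println(label);
        }
    }
}
```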
get Keyword
http://xml.nig.ac.jp/wsdl/OMIM.wsdl
search
text/xml
http://rdf.adaptivedisclosure.org/~marco/BioAID/Public/Workflows/BioAID/ProteinSynonymsToQuery.xml
org.embl.ebi.escience.scuflworkers.java.FlattenList
One protein or enzyme name or acronym. A protein synonym service will provide synonyms (courtesy of Martijn Scheumie, Erasmus University Rotterdam).
Example:
EZH2
Advanced information:
Internally, the input query is passed to an AIDA search service based on the open source information retrieval software Lucene. See the Apache Lucene documentation for details on Lucene queries. This workflow expects one protein or enzyme name; the behaviour for more elaborate queries is undefined.
text/plain