Rank Phenotype Terms

Created: 2011-02-01 11:22:14 | Last updated: 2011-02-01 11:24:42

This workflow calculates the cosine vector space between two sets of corpora, then removes any null values from the output.

It also counts the number of articles in the PubMed database in which each term occurs, and identifies the total number of articles in the entire PubMed database, so that a term enrichment score may be calculated.

The workflow takes in a document containing abstracts that are related to a particular phenotype. Scientific terms are extracted from this text and each term is given a weighting according to how often it occurs in the phenotype corpus relative to PubMed as a whole. The higher the value, the better the score. This is given as:

    X = log((a / b) / (c / d))

where:

    a = number of occurrences of the individual term in the phenotype corpus
    b = number of abstracts in the entire phenotype corpus
    c = number of occurrences of the individual term in the whole of PubMed
    d = number of articles in the whole of PubMed

Once this has been created, the pathways obtained from the QTL and microarray pathway analysis workflows are analysed. The documents from a PubMed search for each pathway are merged into a single document of pathway abstracts. The (unweighted) phenotype terms are then searched for in the pathway corpus, which determines whether the phenotype term is mentioned together with the given pathway. The higher the value, the better the score. Each term is then assigned a weight as:

    Y = log((e / f) / (c / d))

where:

    e = number of occurrences of the individual term in the pathway corpus
    f = number of abstracts in the pathway corpus (per pathway)
    c = number of occurrences of the individual term in the whole of PubMed
    d = number of articles in the whole of PubMed

The weighted terms are then given a link score, which is the total X + Y. This gives the link between the pathway and the phenotype a score / significance value. The higher the score, the more "appropriate/interesting" the link between the pathway and the phenotype.

The terms are also ranked across pathways. This is calculated as W = Sum(X + Y), summed over the pathways in which the term has been given a weight. The higher the value, the better the score.
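
The enrichment weights and the link score can be illustrated with a short Java sketch. This is not the code behind the enriched_phenotype, enriched_pathway or calculate_links services, which is not shown on this page. The class and method names are hypothetical, the counts in main are made up, and the natural logarithm is an assumption, since the description does not state the base of the log.

```java
/** Hypothetical helper; names are illustrative and do not come from the workflow. */
final class EnrichmentSketch {

    /**
     * Computes log((termHits / corpusSize) / (pubmedHits / pubmedSize)).
     * Natural log is assumed; the workflow description does not give the base.
     */
    static double enrichment(long termHits, long corpusSize, long pubmedHits, long pubmedSize) {
        if (termHits <= 0 || corpusSize <= 0 || pubmedHits <= 0 || pubmedSize <= 0) {
            // log of zero or a division by zero would make the score meaningless
            throw new IllegalArgumentException("all counts must be positive");
        }
        double corpusRate = (double) termHits / corpusSize;
        double backgroundRate = (double) pubmedHits / pubmedSize;
        return Math.log(corpusRate / backgroundRate);
    }

    /** Link score for one term and one pathway: X + Y. */
    static double linkScore(double x, double y) {
        return x + y;
    }

    public static void main(String[] args) {
        // Made-up counts, for illustration only.
        double x = enrichment(20, 100, 2_000, 1_000_000); // phenotype corpus (a, b, c, d)
        double y = enrichment(5, 50, 2_000, 1_000_000);   // one pathway corpus (e, f, c, d)
        System.out.println("X=" + x + " Y=" + y + " link=" + linkScore(x, y));
    }
}
```

A term that is no more frequent in the corpus than in PubMed as a whole scores 0 under this formula, and a term that is rarer in the corpus scores below 0.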

Run

Run this Workflow in the Taverna Workbench...

Option 1:

Copy and paste this link into File > 'Open workflow location...'
http://www.myexperiment.org/workflows/1806/download?version=1

Workflow Components

Authors (1)

Titles (1)

Descriptions (2)

Dependencies (0)

Inputs (3)

Name	Description
query_abstracts
phenotype_abstracts
phenotype_terms

Processors (28)

Name	Type	Description
pubmed_database	stringconstant	Value pubmed
xpath	stringconstant	Value /*[local-name(.)='eSearchResult']/*[local-name(.)='Count']
count	stringconstant	Value count
xpath_count	stringconstant	Value /*[local-name(.)='eInfoResult']/*[local-name(.)='DbInfo']/*[local-name(.)='Count']
eSearch_database	stringconstant	Value pubmed
regular_expression	stringconstant	Value \n
two_newlines	stringconstant	Value \n\n
extract_terms	beanshell	Script String[] split = input.split("\n"); Vector nonEmpty = new Vector(); for (int i = 1; i < split.length; i++) { String trimmed = split[i].trim(); // if((trimmed.contains("=")) || (trimmed.contains("-"))) // { // next; // } // else // { // String[] trimmed_array = trimmed.split("\t"); // String term = trimmed_array[0]; nonEmpty.add(trimmed); // } } String output = ""; for (int i = 0; i < nonEmpty.size(); i++) { output = output + (String) (nonEmpty.elementAt(i) + "\n"); }
format_rankings	beanshell	Script String[] split = ranked_terms.split("\n"); Vector nonEmpty = new Vector(); String pathway_name = ""; for (int i = 0; i < split.length; i++) { if (!(split[i].equals(""))) { String[] split_array = split[i].split("\t"); pathway_name = split_array[1].trim(); String term_rank = ""; term_rank =split_array[0].trim() + "\t" + split_array[2].trim(); nonEmpty.add(term_rank); System.out.println(pathway_name + "\n"); System.out.println(term_rank + "\n"); } } String[] non_empty = new String[nonEmpty.size()]; for (int i = 0; i < non_empty.length; i ++) { non_empty[i] = nonEmpty.elementAt(i); System.out.println("added: " + nonEmpty.elementAt(i) + "\n"); } String title_term_rankings = ""; title_term_rankings = ">> " + pathway_name + "\n"; for (int i = 0; i < non_empty.length; i++) { title_term_rankings = title_term_rankings + (String) (non_empty[i] + "\n"); System.out.println(title_term_rankings + "\n"); }
merge_term_count	beanshell	Script String term_input = term.trim(); String count_input = count.trim(); String output = ""; output = term_input + "\t" + count_input;
calculate_links	soaplab	Endpoint http://phoebus.cs.man.ac.uk:1977/axis/services/text_mining.calculate_links
enriched_phenotype	soaplab	Endpoint http://phoebus.cs.man.ac.uk:1977/axis/services/text_mining.enriched_phenotype
enriched_pathway	soaplab	Endpoint http://phoebus.cs.man.ac.uk:1977/axis/services/text_mining.enriched_pathway
split_extracted_terms	localworker	Script List split = new ArrayList(); if (!string.equals("")) { String regexString = ","; if (regex != void) { regexString = regex; } String[] result = string.split(regexString); for (int i = 0; i < result.length; i++) { split.add(result[i]); } }
merge_pubmed_count	localworker	Script String seperatorString = "\n"; if (seperator != void) { seperatorString = seperator; } StringBuffer sb = new StringBuffer(); for (Iterator i = stringlist.iterator(); i.hasNext();) { String item = (String) i.next(); sb.append(item); if (i.hasNext()) { sb.append(seperatorString); } } concatenated = sb.toString();
merge_extracted	localworker	Script String seperatorString = "\n"; if (seperator != void) { seperatorString = seperator; } StringBuffer sb = new StringBuffer(); for (Iterator i = stringlist.iterator(); i.hasNext();) { String item = (String) i.next(); sb.append(item); if (i.hasNext()) { sb.append(seperatorString); } } concatenated = sb.toString();
merge_format_rankings	localworker	Script String seperatorString = "\n"; if (seperator != void) { seperatorString = seperator; } StringBuffer sb = new StringBuffer(); for (Iterator i = stringlist.iterator(); i.hasNext();) { String item = (String) i.next(); sb.append(item); if (i.hasNext()) { sb.append(seperatorString); } } concatenated = sb.toString();
merge_list	localworker	Script String seperatorString = "\n"; if (seperator != void) { seperatorString = seperator; } StringBuffer sb = new StringBuffer(); for (Iterator i = stringlist.iterator(); i.hasNext();) { String item = (String) i.next(); sb.append(item); if (i.hasNext()) { sb.append(seperatorString); } } concatenated = sb.toString();
split_abstracts_by_regex	localworker	Script List split = new ArrayList(); if (!string.equals("")) { String regexString = ","; if (regex != void) { regexString = regex; } String[] result = string.split(regexString); for (int i = 0; i < result.length; i++) { split.add(result[i]); } }
extractCount_2	localworker	Script import org.dom4j.Document; import org.dom4j.Node; import org.dom4j.io.SAXReader; SAXReader reader = new SAXReader(false); reader.setIncludeInternalDTDDeclarations(false); reader.setIncludeExternalDTDDeclarations(false); Document document = reader.read(new StringReader(xmltext)); List nodelist = document.selectNodes(xpath); // Process the elements in the nodelist ArrayList outputList = new ArrayList(); ArrayList outputXmlList = new ArrayList(); String val = null; String xmlVal = null; for (Iterator iter = nodelist.iterator(); iter.hasNext();) { Node element = (Node) iter.next(); xmlVal = element.asXML(); val = element.getStringValue(); if (val != null && !val.equals("")) { outputList.add(val); outputXmlList.add(xmlVal); } } List nodelist=outputList; List nodelistAsXML=outputXmlList;
extractCount	localworker	Script import org.dom4j.Document; import org.dom4j.Node; import org.dom4j.io.SAXReader; SAXReader reader = new SAXReader(false); reader.setIncludeInternalDTDDeclarations(false); reader.setIncludeExternalDTDDeclarations(false); Document document = reader.read(new StringReader(xmltext)); List nodelist = document.selectNodes(xpath); // Process the elements in the nodelist ArrayList outputList = new ArrayList(); ArrayList outputXmlList = new ArrayList(); String val = null; String xmlVal = null; for (Iterator iter = nodelist.iterator(); iter.hasNext();) { Node element = (Node) iter.next(); xmlVal = element.asXML(); val = element.getStringValue(); if (val != null && !val.equals("")) { outputList.add(val); outputXmlList.add(xmlVal); } } List nodelist=outputList; List nodelistAsXML=outputXmlList;
parametersXML_1	xmlsplitter
run_eInfo	wsdl	Wsdl http://eutils.ncbi.nlm.nih.gov/soap/v2.0/eutils.wsdl Wsdl Operation run_eInfo
run_eSearch	wsdl	Wsdl http://eutils.ncbi.nlm.nih.gov/soap/v2.0/eutils.wsdl Wsdl Operation run_eSearch
run_eSearch_request	xmlsplitter
cosine_vector_space	soaplab	Endpoint http://phoebus.cs.man.ac.uk:1977/axis/services/text_mining.cosine_vector_space
remove_Nulls	beanshell	Script String[] split = input.split("\n"); Vector nonEmpty = new Vector(); for (int i = 0; i < split.length; i++){ if (!(split[i].equals(""))) { nonEmpty.add(split[i].trim()); } } String[] non_empty = new String[nonEmpty.size()]; for (int i = 0; i < non_empty.length; i ++) { non_empty[i] = nonEmpty.elementAt(i); } String output = ""; for (int i = 0; i < non_empty.length; i++) { output = output + (String) (non_empty[i] + "\n"); }
merge_cosine_scores	localworker	Script String seperatorString = "\n"; if (seperator != void) { seperatorString = seperator; } StringBuffer sb = new StringBuffer(); for (Iterator i = stringlist.iterator(); i.hasNext();) { String item = (String) i.next(); sb.append(item); if (i.hasNext()) { sb.append(seperatorString); } } concatenated = sb.toString();
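
The extractCount and extractCount_2 processors apply an XPath expression to the XML returned by the eutils services. The sketch below shows the same idea using only the JDK's javax.xml.xpath API in place of the dom4j calls in the table, so it is a substitute technique rather than the workflow's own script. The expression is the xpath constant from the table, written with explicit * node tests. The class name, the sample XML and its namespace are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical stand-in for the extractCount_2 step, using JDK classes only. */
final class CountExtractionSketch {

    // local-name(.) matches on the element name and ignores its namespace.
    static final String COUNT_XPATH =
            "/*[local-name(.)='eSearchResult']/*[local-name(.)='Count']";

    /** Returns the text of the top-level Count element, or "" if there is none. */
    static String extractCount(String xmlText) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmlText)));
            return XPathFactory.newInstance().newXPath().evaluate(COUNT_XPATH, doc).trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read a count from the XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up response; the namespace URI is a placeholder, not the real eutils one.
        String sample = "<eSearchResult xmlns=\"urn:example:esearch\">"
                + "<Count>42</Count><RetMax>0</RetMax></eSearchResult>";
        System.out.println(extractCount(sample));
    }
}
```

Matching on local-name(.) means the expression does not need to know the namespace of the response, which is presumably why the workflow's constants are written this way.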

Beanshells (4)

Name	Inputs	Outputs
extract_terms	input	output
format_rankings	ranked_terms	title_term_rankings
merge_term_count	term count	output
remove_Nulls	input	output
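
For readers who want to try the remove_Nulls logic outside Taverna, the following is a minimal plain-Java sketch of the same behaviour: split on newlines, drop empty lines, trim the rest, and end every kept line with a newline. The class and method names are hypothetical. In a BeanShell processor the result would instead be assigned to the output variable.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

/** Hypothetical plain-Java equivalent of the remove_Nulls BeanShell script. */
final class RemoveBlankLinesSketch {

    static String removeBlankLines(String input) {
        return Arrays.stream(input.split("\n"))
                .filter(line -> !line.isEmpty()) // same test as !split[i].equals("")
                .map(String::trim)               // trimming happens after the filter, as in the script
                .map(line -> line + "\n")        // every kept line ends with a newline
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        System.out.print(removeBlankLines("pathway A\t0.12\n\n pathway B\t0.30 \n"));
    }
}
```

As in the original script, a line that contains only spaces passes the filter and is emitted as an empty line after trimming.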

Outputs (4)

Name	Description
concept_rankings
phenotype_term_counts
pubmed_abstract_number
cosine_vector_scores

Datalinks (41)

Source	Sink
phenotype_terms	extract_terms:input
calculate_links:output	format_rankings:ranked_terms
split_extracted_terms:split	merge_term_count:term
merge_extracted:concatenated	merge_term_count:count
enriched_phenotype:output	calculate_links:enriched_phenotype_direct_data
enriched_pathway:output	calculate_links:enriched_pathway_direct_data
phenotype_terms	enriched_phenotype:phenotype_terms_direct_data
phenotype_abstracts	enriched_phenotype:phenotype_abstract_direct_data
merge_pubmed_count:concatenated	enriched_phenotype:pubmed_count_direct_data
merge_list:concatenated	enriched_phenotype:term_count_direct_data
merge_pubmed_count:concatenated	enriched_pathway:pubmed_count_direct_data
split_abstracts_by_regex:split	enriched_pathway:pathway_abstracts_direct_data
phenotype_terms	enriched_pathway:phenotype_terms_direct_data
merge_list:concatenated	enriched_pathway:term_count_direct_data
extract_terms:output	split_extracted_terms:string
regular_expression:value	split_extracted_terms:regex
extractCount:nodelist	merge_pubmed_count:stringlist
extractCount_2:nodelist	merge_extracted:stringlist
format_rankings:title_term_rankings	merge_format_rankings:stringlist
merge_term_count:output	merge_list:stringlist
query_abstracts	split_abstracts_by_regex:string
two_newlines:value	split_abstracts_by_regex:regex
xpath:value	extractCount_2:xpath
run_eSearch:result	extractCount_2:xml-text
xpath_count:value	extractCount:xpath
run_eInfo:result	extractCount:xml-text
pubmed_database:value	parametersXML_1:db
parametersXML_1:output	run_eInfo:request
run_eSearch_request:output	run_eSearch:request
eSearch_database:value	run_eSearch_request:db
count:value	run_eSearch_request:rettype
split_extracted_terms:split	run_eSearch_request:term
phenotype_terms	cosine_vector_space:phenotype_terms_direct_data
split_abstracts_by_regex:split	cosine_vector_space:pathway_abstracts_direct_data
merge_list:concatenated	cosine_vector_space:phenotype_term_count_direct_data
merge_cosine_scores:concatenated	remove_Nulls:input
cosine_vector_space:output	merge_cosine_scores:stringlist
merge_format_rankings:concatenated	concept_rankings
merge_list:concatenated	phenotype_term_counts
merge_pubmed_count:concatenated	pubmed_abstract_number
remove_Nulls:output	cosine_vector_scores

Coordinations (1)

Controller	Target
extract_terms	pubmed_database

Workflow Type

Taverna 2

Uploader

Paul Fisher

License

All versions of this Workflow are licensed under:

Version 1 (of 1)

Credits (1)

(People/Groups)

Paul Fisher

Attributions (2)

(Workflows/Files)

Tags (26)

Uploader tags

concept | concept profile | cosine vector space | data-driven | enrichment | entity recognition | eutils | evidence | getconcepts | literature | mining | pathway | pathway-driven | pathways | phenotype | pubmed | qtl | quanitative | ranking | significance | term | term extraction | terms | text | text mining | text mining; term extraction; entity recognition

Shared with Groups (0)

None

Featured In Packs (1)

Text Mining Workflows

Attributed By (0)

(Workflows/Files)

None

Favourited By (0)

No one

Statistics

2312 viewings

2277 downloads

Citations (0)

None

Version History

In chronological order:

Rank Phenotype Terms

Created by Paul Fisher on Tuesday 01 February 2011 11:22:14 (UTC)

Last edited by Paul Fisher on Tuesday 01 February 2011 11:24:42 (UTC)

Reviews (0)

No reviews yet

Comments (0)

No comments yet

Other workflows that use similar services (4)

Only the first 2 workflows that use similar services are shown.

Taverna 2

Uploader

Paul Fisher

Rank Phenotype Terms (2)

Created: 2010-12-08 | Last updated: 2011-01-11

Credits: Paul Fisher

Attributions: Rank Phenotype Terms

Taverna 2

Uploader

Paul Fisher

Gene to Pubmed (4)

This workflow takes in a list of gene names and searches the PubMed database for corresponding articles. Any matches to the genes are then retrieved (abstracts only). These abstracts are then returned to the user.

Created: 2011-02-08 | Last updated: 2011-02-10

Credits: Paul Fisher

Attributions: Cosine vector space, Extract Scientific Terms, Rank Phenotype Terms, Cosine vector space, Rank Phenotype Terms, Pathway to Pubmed, Extract Scientific Terms
