This workflow performs a generic protein sequence analysis. It takes a novel protein sequence together with a list of known protein identifiers chosen by the biologist, runs a homology search, then a multiple sequence alignment, and finally a phylogenetic analysis.
org.embl.ebi.escience.scuflworkers.java.FailIfFalse
org.embl.ebi.escience.scuflworkers.java.FailIfFalse
This is not a protein sequence: it contains characters other than the 20 amino acid letters!
net.sourceforge.taverna.scuflworkers.ncbi.ProteinFastaWorker
It is a DNA or RNA sequence! This program does not accept it. Please enter a Protein sequence.
org.embl.ebi.escience.scuflworkers.java.FailIfTrue
org.embl.ebi.escience.scuflworkers.java.FlattenList
org.embl.ebi.escience.scuflworkers.java.FailIfTrue
\n
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
org.embl.ebi.escience.scuflworkers.java.StringListMerge
org.embl.ebi.escience.scuflworkers.java.StringListMerge
/* Checks whether the sequence entered by the user is a protein, i.e. contains only the 20 amino acid letters.
Input: sequence
Output: condition ("true" or "false")
*/
// Assume a protein sequence until an invalid character is found
boolean isProtein = true;
// Put the sequence on one line without whitespace and convert it to upper case
String seq = sequence.replaceAll("\\s+", "").toUpperCase();
// One-letter codes of the 20 amino acids
String aminoAcids = "ACDEFGHIKLMNPQRSTVWY";
for (int i = 0; i < seq.length(); i++) {
    if (aminoAcids.indexOf(seq.charAt(i)) < 0) {
        isProtein = false;
        break;
    }
}
/* The output is "true" for a protein sequence and "false" otherwise;
the condition is verified afterwards.
*/
String condition = isProtein ? "true" : "false";
sequence
condition
/* The clustering method must be passed in lower case, so the user input is converted to lower case.
Input: user input (letter)
Output: output in lower case
*/
String output = input.toLowerCase();
input
output
/* Accepts all the sequences (the query, those from the GI identifiers and finally those from BLAST) and keeps the first 35 in that order; if fewer are available, all are kept.
Input: insertSeq string with all sequences in FASTA format, in a specific order
Output: result string with 35 sequences or fewer
*/
StringBuffer temp = new StringBuffer();
// Splitting on ">" yields an empty first element when the input starts with ">"
String[] lines = insertSeq.split(">");
int count = 0;
for (int i = 0; i < lines.length && count < 35; i++) {
    // Skip empty fragments so that no stray ">" is written
    if (lines[i].trim().length() == 0) {
        continue;
    }
    temp.append(">" + lines[i]);
    count++;
}
// Output with 35 sequences or fewer
String result = temp.toString();
insertSeq
result
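The truncation step can be sketched as a standalone Java method. This is only an illustration under assumptions: the class name `FastaCap`, the method `firstRecords` and the sample records are hypothetical and not part of the workflow, and the limit is passed as a parameter instead of being fixed at 35.

```java
final class FastaCap {
    // Hypothetical helper: keeps at most `max` FASTA records, preserving their order.
    static String firstRecords(String fasta, int max) {
        StringBuilder out = new StringBuilder();
        int kept = 0;
        for (String record : fasta.split(">")) {
            // The fragment before the first ">" is empty and is not a record.
            if (record.trim().isEmpty()) {
                continue;
            }
            if (kept == max) {
                break;
            }
            out.append('>').append(record);
            kept++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up sample input with three records; only the first two are kept.
        String input = ">a\nAAA\n>b\nCCC\n>c\nDDD\n";
        System.out.print(firstRecords(input, 2));
    }
}
```

Counting records instead of array elements avoids having to account for the empty element that `split(">")` produces in front of the first header.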
/* The GI identifiers from the user list and from the previous process are merged and the duplicates are eliminated.
Inputs: BlastList as string and UserList as string
Output: result with the GI identifiers in a certain order: first the GI identifiers from the user list, then those from BLAST according to E-values
*/
StringBuffer temp = new StringBuffer();
// Split each input string into an array, one identifier per line
String[] lines = BlastList.split("\n");
String[] elements = UserList.split("\n");
// A LinkedHashSet eliminates duplicates while keeping insertion order: user list first, then BLAST list
Set mySet = new LinkedHashSet(Arrays.asList(elements));
mySet.addAll(Arrays.asList(lines));
// Convert the set back to a newline-separated string
for (Iterator it = mySet.iterator(); it.hasNext();) {
    temp.append(it.next() + "\n");
}
// Output without duplicates
String result = temp.toString();
BlastList
UserList
result
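The order-preserving de-duplication can be sketched in plain Java as follows. The class `IdMerge` and method `mergeUnique` are hypothetical names used for illustration; trimming entries and skipping blank lines are assumptions added here and are not stated in the workflow.

```java
import java.util.LinkedHashSet;
import java.util.Set;

final class IdMerge {
    // Hypothetical helper: merges two newline-separated identifier lists,
    // user list first, dropping duplicates while keeping first-seen order.
    static String mergeUnique(String userList, String blastList) {
        Set<String> ids = new LinkedHashSet<>();
        for (String list : new String[] { userList, blastList }) {
            for (String id : list.split("\n")) {
                String trimmed = id.trim();
                // Assumption: blank lines carry no identifier and are ignored.
                if (!trimmed.isEmpty()) {
                    ids.add(trimmed);
                }
            }
        }
        StringBuilder out = new StringBuilder();
        for (String id : ids) {
            out.append(id).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Q12575 appears in both lists and is kept only once, in the user-list position.
        System.out.print(mergeUnique("Q96TS5\nQ12575\n", "Q12575\nAAA33739\n"));
    }
}
```

A `LinkedHashSet` is used because a plain `HashSet` would not guarantee that the user-supplied identifiers stay ahead of the BLAST hits.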
/* The query sequence, previously set in FASTA format, has to enter the multiple sequence alignment together with the other sequences.
Inputs: fasta and MergeString
Output: result
*/
StringBuffer temp= new StringBuffer();
// Concatenate the query (fasta) and the other sequences (MergeString), separated by a blank line
temp.append(fasta + "\n\n" + MergeString + "\n");
// Both inputs combined into one output
String result = temp.toString();
fasta
MergeString
result
/* Checks whether each GI number from the BLAST report has an E-value less than or equal to 0.02 and collects those that do in a string. Regular expressions are used to extract the values.
Input: BlastReport in a single string with the GI identifiers and the corresponding E-values
Output: result with the GI identifiers selected according to the E-value
*/
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// Regular expressions to extract the GI numbers and the corresponding E-values from the string
Pattern pGI = Pattern.compile("(^.*?$)");
Pattern pEvalue = Pattern.compile("is: (.*)$");
Matcher mGI;
Matcher mEvalue;
StringBuffer temp = new StringBuffer();
// The report is newline-separated: each GI line is followed by its E-value line
String [] line = BlastReport.split("\n");
int arraysize = line.length;
// Process the lines in pairs; the bound i+1 avoids reading past the end on an odd number of lines
for (int i=0; i+1<arraysize; i+=2){
    String sGI = line[i];
    String sEvalue = line[i+1];
    mGI = pGI.matcher(sGI);
    mEvalue = pEvalue.matcher(sEvalue);
    String gi="";
    if (mGI.find()){
        gi = mGI.group(1);
    }
    if (mEvalue.find()){
        String eval = mEvalue.group(1).trim();
        // A value starting with "e" (e.g. e-100) is prefixed with "1" so that it parses as a number
        if(eval.startsWith("e")){
            eval = "1".concat(eval);
        }
        double Evalue = Double.parseDouble(eval);
        // Keep the GI numbers whose E-value is <= 0.02
        if (Evalue<=0.02){
            temp.append(gi + "\n");
        }
    }
}
// Output with the selected GI numbers in a single string
String result = temp.toString();
BlastReport
result
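The E-value filter can be sketched as a standalone Java method. The class `EvalueFilter`, the method `filter` and the sample report lines are hypothetical; the only format assumptions are the ones visible in the script (identifier and E-value on alternating lines, the value following the text `is: `). Skipping values that do not parse is an addition made here for robustness.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EvalueFilter {
    private static final Pattern EVALUE = Pattern.compile("is: (.*)$");

    // Hypothetical helper: returns the identifiers whose E-value is <= threshold.
    static String filter(String report, double threshold) {
        String[] lines = report.split("\n");
        StringBuilder out = new StringBuilder();
        // Lines are read in pairs: identifier line, then E-value line.
        for (int i = 0; i + 1 < lines.length; i += 2) {
            Matcher m = EVALUE.matcher(lines[i + 1]);
            if (!m.find()) {
                continue;
            }
            String text = m.group(1).trim();
            // "e-50" is shorthand for "1e-50"; add the missing mantissa.
            if (text.startsWith("e")) {
                text = "1" + text;
            }
            try {
                if (Double.parseDouble(text) <= threshold) {
                    out.append(lines[i].trim()).append('\n');
                }
            } catch (NumberFormatException ex) {
                // Assumption: an unparsable value means the hit is skipped.
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up report text; the real service output may be worded differently.
        String report = "gi1\nThe E-value is: e-50\ngi2\nThe E-value is: 0.5\n";
        System.out.print(filter(report, 0.02));
    }
}
```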
/* The novel sequence is entered by the user, but not in FASTA format. This script puts the query sequence into FASTA format so that it can be passed to the BLAST processor and to the multiple sequence alignment process.
Input: sequence
Output: fasta
*/
StringBuffer temp= new StringBuffer();
// Prepend the FASTA header line ">|query|"
temp.append(">"+"|query|"+"\n"+sequence);
// Sequence returned in FASTA format
String fasta = temp.toString();
sequence
fasta
/* Accepts the 35 or fewer sequences in FASTA format and extracts their description lines, which make the multiple sequence alignment and tree plots easier to interpret. Regular expressions are used for the extraction.
Input: Sequences; string with 35 sequences or fewer
Output: result; string with the description of each sequence, one per line
*/
// Extract the FASTA description from each sequence
import java.util.regex.Pattern;
import java.util.regex.Matcher;
StringBuffer temp = new StringBuffer();
String information="";
// regular expression to extract only the sequence description
Pattern pattern = Pattern.compile (">(\\w+.*)\\s");
Matcher matcher = pattern.matcher(Sequences);
while(matcher.find()){
information=matcher.group(1);
temp.append(information + "\n");
}
// Output sequence description
String result = temp.toString();
Sequences
result
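The behaviour of the description pattern can be checked with a small standalone sketch. It uses the same regular expression as the script; the class `FastaHeaders`, the method `descriptions` and the sample headers are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FastaHeaders {
    // Same pattern as in the script: ">" followed by a word character,
    // the rest of the line, and one whitespace character.
    private static final Pattern HEADER = Pattern.compile(">(\\w+.*)\\s");

    // Hypothetical helper: returns one description per line.
    static String descriptions(String sequences) {
        StringBuilder out = new StringBuilder();
        Matcher matcher = HEADER.matcher(sequences);
        while (matcher.find()) {
            out.append(matcher.group(1)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Made-up FASTA text with two records.
        String fasta = ">gi|123|desc one\nAAA\n>sp|Q96TS5|name\nCCC\n";
        System.out.print(descriptions(fasta));
    }
}
```

Because the pattern requires a word character directly after `>`, a header such as `>|query|` is not matched by it and would not appear in the output.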
/* Checks whether the sequence entered by the user is DNA or RNA.
Input: sequence
Output: condition ("true" or "false")
*/
// Assume a DNA or RNA sequence until an invalid character is found
boolean isDNARNA = true;
// Put the sequence on one line without whitespace and convert it to upper case
String seq = sequence.replaceAll("\\s+", "").toUpperCase();
// Nucleotide letters for DNA (A, C, G, T) and RNA (A, C, G, U)
String nucleotides = "ACGTU";
for (int i = 0; i < seq.length(); i++) {
    if (nucleotides.indexOf(seq.charAt(i)) < 0) {
        isDNARNA = false;
        break;
    }
}
/* The output is "true" for a DNA or RNA sequence and "false" otherwise;
the condition is verified afterwards.
*/
String condition = isDNARNA ? "true" : "false";
sequence
condition
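Taken together, the two alphabet checks amount to a three-way classification of the input. A minimal sketch follows, assuming the nucleotide check takes precedence, since a sequence made only of A, C, G and T also passes the amino acid check. The class `SequenceKind`, its enum and the method `classify` are hypothetical, and stripping all whitespace is an assumption.

```java
final class SequenceKind {
    enum Kind { NUCLEOTIDE, PROTEIN, INVALID }

    // Hypothetical helper: classifies a raw sequence by its alphabet.
    static Kind classify(String sequence) {
        // Assumption: whitespace and letter case are not significant.
        String seq = sequence.replaceAll("\\s+", "").toUpperCase();
        // Only A, C, G, T, U: treated as DNA or RNA, which the workflow rejects.
        if (seq.matches("[ACGTU]*")) {
            return Kind.NUCLEOTIDE;
        }
        // Only the 20 amino acid letters: accepted as a protein.
        if (seq.matches("[ACDEFGHIKLMNPQRSTVWY]*")) {
            return Kind.PROTEIN;
        }
        return Kind.INVALID;
    }

    public static void main(String[] args) {
        System.out.println(classify("ACGU"));
        System.out.println(classify("AITRRVACLDGVNTATNAACCALFAVRDDIQQNL"));
        System.out.println(classify("AXZ"));
    }
}
```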
Simplifies BLAST output for later use
gi
exp
http://phoebus.cs.man.ac.uk:1977/axis/services/seq_analysis.blastsimplifier
Plots a cladogram- or phenogram-like rooted tree diagram
http://www.ebi.ac.uk/soaplab/emboss4/services/phylogeny_tree_drawing.fdrawgram
Protein distance algorithm
http://www.ebi.ac.uk/soaplab/emboss4/services/phylogeny_molecular_sequence.fprotdist
Phylogenies from distance matrix by N-J or UPGMA method
http://www.ebi.ac.uk/soaplab/emboss4/services/phylogeny_distance_matrix.fneighbor
Plots an unrooted tree diagram
http://www.ebi.ac.uk/soaplab/emboss4/services/phylogeny_tree_drawing.fdrawtree
Displays aligned sequences, with colouring and boxing
10
No
Yes
http://www.ebi.ac.uk/soaplab/emboss4/services/alignment_multiple.prettyplot
Multiple alignment program - interface to ClustalW program
http://www.ebi.ac.uk/soaplab/emboss4/services/alignment_multiple.emma
Execute Blast
blastp
PROTEIN
http://xml.nig.ac.jp/wsdl/Blast.wsdl
searchSimple
Enter a novel protein sequence, e.g.
AITRRVACLDGVNTATNAACCALFAVRDDIQQNL
FDGGECGEEVHESLRLTFHDAIGISPSLAATGKFGG
GGADGSIMIFDDIEPNFHANNGVDEIINAQKPFVAK
HNMTAGDFIQFAGAVGVSNCPGAPQLSFFLGRPA
Enter a list of protein IDs, e.g.
Q96TS5
Q12575
AAA33739
Q96TS6
AAA33741
For the cluster method, enter "n" for the Neighbor-Joining algorithm or "u" for the UPGMA algorithm
image/png
Completed
Fail_if_true_Protein
Not_Protein_Sequence
Scheduled
Running
Completed
Fail_if_false_Protein
Setting_fasta
Scheduled
Running
Completed
Fail_if_true_DNA
Condition_Protein
Scheduled
Running
Completed
Fail_if_false_DNA
Is_DNA_RNA
Scheduled
Running
Completed
Fail_if_false_Protein
MergeUserList
Scheduled
Running
Completed
Fail_if_false_Protein
toLowerCase
Scheduled
Running