Run InterProScan using a nucleotide sequence as input.
The InterProScan tool (http://www.ebi.ac.uk/Tools/InterProScan/) searches a protein sequence against a selection of protein domain, feature and family signature databases, and integrates the results to give potential assignments to InterPro entries and Gene Ontology terms. Because InterProScan is a protein search tool, a nucleotide sequence must first be translated into protein sequences. There are a number of ways of doing this, depending on the properties of the nucleotide sequence; in this case a simple open reading frame (ORF) model is used to obtain the candidate translations. These translations are filtered for length (>80aa), and a search against UniProtKB (http://www.uniprot.org/) is performed so that only sequences with some relationship to known protein space, on which the signatures are based, are passed to InterProScan. Once the set of translations has been filtered, the remaining sequences are passed on to InterProScan for analysis.
Note: the coordinates in the InterProScan output are protein coordinates relative to the translated input sequence. To map these onto the input nucleotide sequence, see the fasta header of the corresponding translated ORF, where the nucleotide coordinates are shown.
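The mapping described in the note can be sketched in Java as follows. This is only an illustration: it assumes the translated ORF header carries the nucleotide range as `[start - end]`, with start greater than end for reverse-strand ORFs (the layout getorf typically writes), and the class and method names are hypothetical, not part of the workflow.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OrfCoordinateMapper {
    // Assumed header layout: ">ID_1 [2 - 361] description"
    private static final Pattern RANGE = Pattern.compile("\\[(\\d+) - (\\d+)\\]");

    /** Returns {nucStart, nucEnd}, in the order they occur along the ORF. */
    static int[] toNucleotide(String fastaHeader, int protStart, int protEnd) {
        Matcher m = RANGE.matcher(fastaHeader);
        if (!m.find()) {
            throw new IllegalArgumentException("No [start - end] range in header: " + fastaHeader);
        }
        int orfStart = Integer.parseInt(m.group(1));
        int orfEnd = Integer.parseInt(m.group(2));
        // Reverse-strand ORFs are assumed to be written with start > end
        int direction = orfStart <= orfEnd ? 1 : -1;
        // Each residue covers three bases; protein coordinates are 1-based
        int nucStart = orfStart + direction * (protStart - 1) * 3;
        int nucEnd = orfStart + direction * (protEnd * 3 - 1);
        return new int[] { nucStart, nucEnd };
    }
}
```

For a reverse-strand ORF the returned start is numerically larger than the end, which matches the orientation of the ORF itself.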
This implementation uses:
1. EBI's WSDbfetch web service (http://www.ebi.ac.uk/Tools/webservices/services/dbfetch) to retrieve entries specified by database identifier.
2. EMBOSS seqret tool (http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/seqret.html) via Soaplab (http://www.ebi.ac.uk/Tools/webservices/soaplab/overview) to ensure input sequences are in an appropriate format (i.e. fasta format).
3. EMBOSS getorf tool (http://emboss.sourceforge.net/apps/release/5.0/emboss/apps/getorf.html) via Soaplab (http://www.ebi.ac.uk/Tools/webservices/soaplab/overview) to find the ORFs, perform the translation and filter the translations for length.
4. EBI's WSNCBIBlast web service (http://www.ebi.ac.uk/Tools/webservices/services/ncbiblast) to perform the filtering BLAST search against UniProtKB.
5. EBI's WSInterProScan web service (http://www.ebi.ac.uk/Tools/webservices/services/interproscan) to access InterProScan for the final search.
and is based on the procedure for nucleotide InterProScan searches described on the WSInterProScan web pages (see http://www.ebi.ac.uk/Tools/webservices/services/interproscan).
Perform an InterProScan analysis of a protein sequence using the EBI’s WSInterProScan service (see http://www.ebi.ac.uk/Tools/webservices/services/interproscan). The input sequence and the user e-mail address are inputs; the other parameters for the analysis (see Job_params) are left at their defaults.
InterProScan searches a protein sequence against the protein family and domain signature databases integrated into InterPro (see http://www.ebi.ac.uk/interpro/). InterProScan returns a set of InterPro and InterPro member matches with your sequence, along with GO term assignments.
Unpack byte[] version of result into a string.
org.embl.ebi.escience.scuflworkers.java.ByteArrayToString
Unpack byte[] version of result into a string.
org.embl.ebi.escience.scuflworkers.java.ByteArrayToString
Wrap input data in a list.
org.embl.ebi.escience.scuflworkers.java.XMLInputSplitter
Populate input data structure with input sequence and data type.
sequence
org.embl.ebi.escience.scuflworkers.java.XMLInputSplitter
InterProScan job parameters.
1
p
1
1
org.embl.ebi.escience.scuflworkers.java.XMLInputSplitter
Generate GFF format output (see http://www.sanger.ac.uk/Software/formats/GFF/) from the text output of InterProScan.
import java.util.StringTokenizer;
interproscan_gff = "";
// Split into lines
StringTokenizer tok1 = new StringTokenizer(interproscan_text, "\n");
while(tok1.hasMoreElements()) {
    feat1 = tok1.nextElement();
    // Split into fields
    StringTokenizer tok2 = new StringTokenizer(feat1, "\t");
    fieldCount = 0;
    attributeStr = "";
    while(tok2.hasMoreElements()) {
        fieldCount++;
        fieldStr = tok2.nextElement();
        if(fieldCount < 2) { // First field is the ID
            interproscan_gff += fieldStr;
        }
        // The tool, feature, start and stop
        else if(fieldCount == 4 || (fieldCount > 5 && fieldCount < 9)) {
            interproscan_gff += "\t" + fieldStr;
        }
        // Score
        else if(fieldCount == 9) {
            if(fieldStr.equals("NA")) {
                interproscan_gff += "\t.";
            } else {
                interproscan_gff += "\t" + fieldStr;
            }
        }
        // Matching InterPro entry
        else if(fieldCount == 12 && !fieldStr.equals("NULL")) {
            attributeStr += fieldStr;
        }
        // Matching InterPro entry name
        else if(fieldCount == 13 && !fieldStr.equals("NULL")) {
            attributeStr += " " + fieldStr;
        }
    }
    interproscan_gff += "\t.\t.\tInterProScan";
    if(attributeStr.length() > 0) {
        interproscan_gff += " ; " + attributeStr;
    }
    interproscan_gff += "\n";
}
interproscan_text
interproscan_gff
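For comparison, the same per-line conversion can be sketched in plain Java. The column positions are taken from the comments in the script above (field 1 ID, 4 tool, 6 feature, 7 start, 8 stop, 9 score, 12 InterPro entry, 13 entry name). The class and method names are hypothetical. Unlike `StringTokenizer`, which skips empty tokens, this sketch uses `String.split` with a negative limit so that empty fields keep their column position; that is a deliberate difference from the workflow script, not a description of it.

```java
final class InterProScanGffSketch {
    /** Converts one tab-delimited InterProScan result line to a GFF-style line. */
    static String toGffLine(String rawLine) {
        // A limit of -1 keeps empty fields, so column positions stay stable
        String[] f = rawLine.split("\t", -1);
        if (f.length < 9) {
            throw new IllegalArgumentException("Expected at least 9 fields, got " + f.length);
        }
        StringBuilder gff = new StringBuilder();
        gff.append(f[0]);                                         // sequence ID
        gff.append('\t').append(f[3]);                            // tool
        gff.append('\t').append(f[5]);                            // feature
        gff.append('\t').append(f[6]);                            // start
        gff.append('\t').append(f[7]);                            // stop
        gff.append('\t').append("NA".equals(f[8]) ? "." : f[8]);  // score, "." when absent
        gff.append("\t.\t.\tInterProScan");                       // strand, frame, attribute prefix
        StringBuilder attributes = new StringBuilder();
        if (f.length > 11 && !"NULL".equals(f[11])) {
            attributes.append(f[11]);                             // InterPro entry
        }
        if (f.length > 12 && !"NULL".equals(f[12])) {
            attributes.append(' ').append(f[12]);                 // InterPro entry name
        }
        if (attributes.length() > 0) {
            gff.append(" ; ").append(attributes);
        }
        return gff.toString();
    }
}
```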
Get the XML format result.
toolxml
http://www.ebi.ac.uk/Tools/webservices/wsdl/WSInterProScan.wsdl
poll
Submit the InterProScan job.
http://www.ebi.ac.uk/Tools/webservices/wsdl/WSInterProScan.wsdl
runInterProScan
Get the plain text format result.
toolraw
http://www.ebi.ac.uk/Tools/webservices/wsdl/WSInterProScan.wsdl
poll
Wait for the job to complete.
If the job has not finished, fail the workflow.
org.embl.ebi.escience.scuflworkers.java.FailIfFalse
Map the job status code to a true/false "is done" flag.
if(job_status.equals("DONE")) {
    is_done = "true";
} else {
    is_done = "false";
}
job_status
is_done
Get the status of a submitted job (see http://www.ebi.ac.uk/Tools/webservices/services/interproscan#checkstatus_jobid)
http://www.ebi.ac.uk/Tools/webservices/wsdl/WSInterProScan.wsdl
checkStatus
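The status check and the "fail if not done" step together amount to a polling loop. A minimal Java sketch of that idea follows. The `checkStatus` supplier stands in for the web service call, and the class name, attempt limit and delay are all hypothetical. Only the "DONE" status string comes from the workflow script; any other value is treated as "not finished yet".

```java
import java.util.function.Supplier;

final class JobPoller {
    /**
     * Polls until the status is "DONE" or the attempts run out.
     * Returns true if the job finished, false otherwise.
     */
    static boolean waitUntilDone(Supplier<String> checkStatus, int maxAttempts, long delayMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if ("DONE".equals(checkStatus.get())) {
                return true;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delayMillis);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and give up waiting
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }
}
```

A caller would fetch the results only when `waitUntilDone` returns true, which mirrors the way the result-fetching steps wait for the poll to complete.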
EBI job identifier
Status of job
User e-mail address
Input protein sequence for analysis. This can either be the actual sequence (fasta format recommended) or a database identifier in database:identifier format (e.g. uniprot:wap_rat).
InterProScan result in tab delimited plain text format.
application/xml
InterProScan result in XML format.
EBI job identifier.
Completed
EBI_InterProScan_poll_job
Get_text_result
Scheduled
Running
Completed
EBI_InterProScan_poll_job
Get_XML_result
Scheduled
Running
Perform a BLAST search using the EBI’s WSNCBIBlast service (see http://www.ebi.ac.uk/Tools/webservices/services/ncbiblast). The query sequence, database to search and BLAST program to use are inputs; the other parameters for the search (see Job_params) are left at their defaults.
For use with InterProScan the expectation threshold (exp) has been set to 0.00001 and the maximum number of hits to report has been set to 10. The input sequences which find hits are returned via the "Sequences" output.
uniprot
blastp
Perform a BLAST search using the EBI’s WSNCBIBlast service (see http://www.ebi.ac.uk/Tools/webservices/services/ncbiblast). The query sequence, database to search and BLAST program to use are inputs; the other parameters for the search (see Job_params) are left at their defaults.
Modified for use as a prefiltering step for InterProScan:
1. Default expectation threshold lowered to 0.00001.
2. Maximum number of hits reported decreased to 10.
3. Input sequences which find hits are passed through to the "Sequence" output.
Convert byte[] to string for XML BLAST output.
org.embl.ebi.escience.scuflworkers.java.ByteArrayToString
Convert byte[] to string for plain text BLAST output.
org.embl.ebi.escience.scuflworkers.java.ByteArrayToString
Fail if no hits are found, so that the input sequence is not passed to the output.
org.embl.ebi.escience.scuflworkers.java.FailIfFalse
Collapse the list of hits found to a string.
\n
org.embl.ebi.escience.scuflworkers.java.StringListMerge
Parameters for the NCBI BLAST job.
0.00001
10
10
1
org.embl.ebi.escience.scuflworkers.java.XMLInputSplitter
Wrap the input data in a list.
org.embl.ebi.escience.scuflworkers.java.XMLInputSplitter
Add a type to the input sequence/identifier.
sequence
org.embl.ebi.escience.scuflworkers.java.XMLInputSplitter
Return true if hits were found.
if(hit_list_str.length() > 2) {
    found_hits = "true";
} else {
    found_hits = "false";
}
hit_list_str
found_hits
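The prefiltering idea, keeping only the input sequences whose BLAST search found hits, can be sketched in Java as below. The `foundHits` method mirrors the script's check that the merged hit list is longer than two characters. The class name and the map from sequence to hit identifiers are hypothetical constructs for illustration; the workflow itself handles one sequence per iteration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class BlastPrefilter {
    /** Mirrors the Found_hits check: a merged hit list longer than 2 characters counts as a hit. */
    static boolean foundHits(List<String> hitIds) {
        String merged = String.join("\n", hitIds);
        return merged.length() > 2;
    }

    /** Keeps only the sequences whose BLAST search returned hits, in map iteration order. */
    static List<String> keepSequencesWithHits(Map<String, List<String>> hitsBySequence) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : hitsBySequence.entrySet()) {
            if (foundHits(entry.getValue())) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }
}
```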
Simple pass-through used to coordinate with "Found_hits".
output = input;
input
output
Get the results of a job (see http://www.ebi.ac.uk/Tools/webservices/services/ncbiblast#poll_jobid_type)
toolxml
http://www.ebi.ac.uk/Tools/webservices/wsdl/WSNCBIBlast.wsdl
poll
Submit an NCBI BLAST analysis job (see http://www.ebi.ac.uk/Tools/webservices/services/ncbiblast#runncbiblast_params_content)
http://www.ebi.ac.uk/Tools/webservices/wsdl/WSNCBIBlast.wsdl
runNCBIBlast
Get the results of a job (see http://www.ebi.ac.uk/Tools/webservices/services/ncbiblast#poll_jobid_type)
tooloutput
http://www.ebi.ac.uk/Tools/webservices/wsdl/WSNCBIBlast.wsdl
poll
Get the hit identifiers from the analysis result (see http://www.ebi.ac.uk/Tools/webservices/services/ncbiblast#getids_jobid)
http://www.ebi.ac.uk/Tools/webservices/wsdl/WSNCBIBlast.wsdl
getIds
Check if job has completed.
Check the status of an EBI WSNCBIBlast job, and fail if not completed.
Fail workflow if job not complete.
org.embl.ebi.escience.scuflworkers.java.FailIfFalse
Convert job status to true/false.
if(job_status.equals("DONE")) {
    is_done = "true";
} else {
    is_done = "false";
}
job_status
is_done
Get the status of a submitted job (see http://www.ebi.ac.uk/Tools/webservices/services/ncbiblast#checkstatus_jobid)
http://www.ebi.ac.uk/Tools/webservices/wsdl/WSNCBIBlast.wsdl
checkStatus
Identifier for the job to check.
Status of the job checked.
Query sequence. Either the actual sequence (fasta format recommended) or a database identifier in database:identifier format (e.g. uniprot:wap_rat).
The database to search (e.g. uniprot).
The BLAST program to use for the search (e.g. blastn, blastp or blastx).
Your e-mail address.
Identifier for the job at EBI.
The BLAST report output in plain text.
The BLAST report output in XML.
List of identifiers of the hits.
Input sequence if it finds hits.
Completed
EBI_NCBI_BLAST_job_poll
getIds
Scheduled
Running
Completed
EBI_NCBI_BLAST_job_poll
Get_text_result
Scheduled
Running
Completed
EBI_NCBI_BLAST_job_poll
Get_XML_result
Scheduled
Running
Completed
Fail_if_false
Coordinate
Scheduled
Running
From a nucleotide sequence, get the protein translations of the open reading frames (stop to stop) that are longer than a specified minimum length.
EMBOSS getorf is used to find the ORFs and perform the translations. The getorf tool is accessed via Soaplab (see http://www.ebi.ac.uk/Tools/webservices/soaplab/overview).
1
240
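The minimum ORF size is given to getorf in bases, so the 240 above corresponds to the 80 residue length filter (three bases per codon). A small Java sketch of that arithmetic, plus a residue count for checking a translated fasta record, is shown here. The class and method names are hypothetical and are not part of the workflow.

```java
final class OrfLengthFilter {
    /** Bases needed to encode the given number of residues (3 per codon). */
    static int minSizeInBases(int minResidues) {
        return minResidues * 3;
    }

    /** Counts residues in one fasta record, ignoring the header line and whitespace. */
    static int residueCount(String fastaRecord) {
        int count = 0;
        for (String line : fastaRecord.split("\n")) {
            if (line.startsWith(">")) {
                continue; // header line carries no residues
            }
            count += line.trim().length();
        }
        return count;
    }
}
```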
Split a string containing a set of sequences in fasta format into a list of fasta formatted sequences.
Split string using a regular expression, to get the individual sequences.
\n>
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
For sequences where the angle bracket (>), denoting the start of the fasta formatted sequence, was removed during the split, prepend it.
if(!stripped_fasta.startsWith(">")) {
    full_fasta = ">" + stripped_fasta;
} else {
    full_fasta = stripped_fasta;
}
stripped_fasta
full_fasta
String containing one or more fasta sequences.
List of fasta sequences.
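The split-then-repair approach above can be sketched as a single Java method. It uses the same `\n>` regular expression as the workflow and then restores the angle bracket that the split consumed. The class name is hypothetical, and skipping empty fragments is an addition for robustness rather than something the workflow does.

```java
import java.util.ArrayList;
import java.util.List;

final class FastaSplitter {
    /** Splits a multi-sequence fasta string into one string per sequence. */
    static List<String> split(String multiFasta) {
        List<String> records = new ArrayList<>();
        for (String part : multiFasta.split("\n>")) {
            if (part.trim().isEmpty()) {
                continue; // nothing between two separators
            }
            // Every record after the first lost its ">" to the split
            records.add(part.startsWith(">") ? part : ">" + part);
        }
        return records;
    }
}
```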
Ensure the sequence is in fasta format.
Given a sequence or sequence entry identifier (e.g. uniprot:wap_rat), return the sequence in fasta format.
If a sequence identifier in database:identifier format is input, the EBI's WSDbfetch web service (see http://www.ebi.ac.uk/Tools/webservices/services/dbfetch) is used to retrieve the sequence in fasta format. Otherwise the input is assumed to be a sequence and is passed through the Soaplab EMBOSS seqret service to force the sequence into fasta format.
Fails if the workflow input is a sequence (i.e. is not an identifier).
org.embl.ebi.escience.scuflworkers.java.FailIfTrue
Fails if the workflow input is an identifier (i.e. is not an actual sequence).
org.embl.ebi.escience.scuflworkers.java.FailIfFalse
Return true if the input is a sequence or false if the input is a sequence identifier (e.g. uniprot:wap_rat).
lineLen = sequence.indexOf("\n");
if(lineLen < 1) {
    lineLen = sequence.length();
}
if(!sequence.startsWith(">") &&
   sequence.indexOf(":") > 0 &&
   sequence.indexOf(":") < lineLen) {
    is_sequence = "false";
} else {
    is_sequence = "true";
}
sequence
is_sequence
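The heuristic in the script above treats the input as an identifier when it does not start with ">" and has a colon on its first line. A typed Java sketch of the same test follows, with a hypothetical class and method name. It is a restatement for illustration, not the workflow's own code.

```java
final class InputClassifier {
    /** Returns true for an actual sequence, false for a database:identifier style input. */
    static boolean isSequence(String input) {
        // Length of the first line, or of the whole input when it has no line break
        int lineLen = input.indexOf('\n');
        if (lineLen < 1) {
            lineLen = input.length();
        }
        int colon = input.indexOf(':');
        boolean looksLikeIdentifier =
                !input.startsWith(">") && colon > 0 && colon < lineLen;
        return !looksLikeIdentifier;
    }
}
```

With this test, `uniprot:wap_rat` is classed as an identifier, while a fasta record is classed as a sequence even if its header line contains a colon, because the leading ">" is checked first.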
Fetch the sequence in fasta format from the identifier using EBI's WSDbfetch service (see http://www.ebi.ac.uk/Tools/webservices/services/dbfetch).
fasta
raw
http://www.ebi.ac.uk/Tools/webservices/wsdl/WSDbfetch.wsdl
fetchData
Format sequence into fasta format.
http://www.ebi.ac.uk/soaplab/emboss4/services/edit.seqret
Either an actual sequence or an entry identifier in database:identifier format (e.g. uniprot:wap_rat).
Sequence in fasta format.
Completed
Fail_if_sequence
fetchData
Scheduled
Running
Completed
Fail_if_identifer
seqret
Scheduled
Running
Finds and extracts open reading frames (ORFs)
0
http://www.ebi.ac.uk/soaplab/emboss4/services/nucleic_gene_finding.getorf
Input nucleotide sequence. Either the actual sequence (fasta format) or an entry identifier in database:identifier format (e.g. embl:x01153).
The ID of the codon translation table to be used (e.g. 1).
Minimum ORF length to report in base pairs (e.g. 240).
Translations of the ORFs found.
Input nucleotide sequence. Either the actual sequence (fasta format) or an entry identifier in database:identifier format (e.g. embl:x01153).
User e-mail address.
InterProScan result in tab-delimited format.
text/xml
InterProScan result in XML format.
List of the translated open reading frame (ORF) sequences longer than 80aa that were passed to the BLAST search.
EBI job identifier for the InterProScan job.
EBI job identifier for the NCBI BLAST job.
The NCBI BLAST output for each of the translated open reading frame (ORF) input sequences.
InterProScan result in GFF format.