PDF to plain text
This workflow extracts the plain text content of PDF files supplied to its input port. You can connect the Load PDF from directory workflow to this workflow's input. We recommend sending the output to the Clean plain text workflow, because the PDF-to-text process can introduce characters that are invalid in XML, so the text cannot be sent to most services as plain text. An alternative is to encode the text as Base64 using the local service ("Encode Byte Array to Base 64") included with Taverna, although this only works when the receiving service knows to decode the Base64 back to text, which is uncommon. The PDF to text service uses the "pdftotext" executable from Xpdf.
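To make the last point concrete, here is a minimal sketch of how a Java program might call the "pdftotext" executable directly. It is not the code used by the PDF to text service. The class name PdfToTextRunner and its methods are hypothetical, and the sketch assumes pdftotext is installed and on the PATH. The "-enc UTF-8" option and the "-" output argument (write to standard output) are standard Xpdf pdftotext options.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper, not part of the workflow: runs Xpdf's pdftotext
// and returns the extracted text as a String.
final class PdfToTextRunner {

    // Builds the command line: UTF-8 output, text written to stdout ("-").
    static List<String> command(String pdfPath) {
        return List.of("pdftotext", "-enc", "UTF-8", pdfPath, "-");
    }

    // Assumes pdftotext is on the PATH; throws if it exits with an error.
    static String extract(String pdfPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(pdfPath))
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show pdftotext errors on our stderr
                .start();
        byte[] output = process.getInputStream().readAllBytes();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with status " + exitCode);
        }
        return new String(output, StandardCharsets.UTF_8);
    }
}
```

The text returned this way can still contain characters such as the form feeds pdftotext places between pages, which is why a cleaning step is recommended before passing it on.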
This is a workflow component, designed to be used as a nested workflow inside a larger text mining or text processing workflow.
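The XML-invalid characters mentioned in the description can be illustrated with a short sketch. The code below is an assumption about what a cleaning step could do, not the implementation of the Clean plain text workflow. The class name XmlTextCleaner is hypothetical. It keeps only the code points allowed by the XML 1.0 Char production and drops everything else, including the form feed that pdftotext emits between pages.

```java
// Hypothetical helper, not part of the workflow: removes characters
// that are not allowed in XML 1.0 documents.
final class XmlTextCleaner {

    // XML 1.0 Char production: #x9 | #xA | #xD | [#x20-#xD7FF]
    // | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlValid(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Returns the input with every XML-invalid code point removed.
    static String stripInvalid(String text) {
        StringBuilder cleaned = new StringBuilder(text.length());
        text.codePoints()
                .filter(XmlTextCleaner::isXmlValid)
                .forEach(cleaned::appendCodePoint);
        return cleaned.toString();
    }

    public static void main(String[] args) {
        // The form feed (\f) and NUL (\u0000) are both invalid in XML 1.0.
        String raw = "page one\f page two\u0000";
        System.out.println(stripInvalid(raw));
    }
}
```

Dropping the characters is the simplest policy. Replacing a form feed with a newline instead would preserve page boundaries, if the downstream service needs them.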
Taverna is available from http://taverna.sourceforge.net/
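// Beanshell script for the "Encode Byte Array to Base 64" local service:
// "bytes" is the input port, "base64" is the output port.
import org.apache.commons.codec.binary.Base64;
base64 = new String(Base64.encodeBase64(bytes));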
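A service that receives Base64 from this script has to decode it before it can use the text. The sketch below shows the full round trip in plain Java. It swaps in the JDK's java.util.Base64 (Java 8 and later) for the Apache Commons Codec class used above, and it assumes the text is encoded as UTF-8. The class name Base64RoundTrip is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the workflow: encodes text to Base64
// and decodes it back, assuming UTF-8 on both sides.
final class Base64RoundTrip {

    // Sender side: text -> UTF-8 bytes -> Base64 string.
    static String encode(String text) {
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Receiver side: Base64 string -> UTF-8 bytes -> text.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "page one\fpage two"; // contains an XML-invalid form feed
        String encoded = encode(text);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(text));
    }
}
```

Because the Base64 alphabet uses only letters, digits, "+", "/" and "=", the encoded string is safe to carry in XML even when the original text is not.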
Version 1 (of 1)
Created: 2010-02-19 | Last updated: 2011-12-13
Credits: James Eales
Created: 2010-09-16 | Last updated: 2012-01-18