PDF to plain text

Created: 2010-02-19 09:07:41 | Last updated: 2011-12-13 15:53:29

This workflow extracts the plain text content of PDF files supplied to its input port. You can connect the Load PDF from directory workflow to this workflow's input. We recommend sending the output of this workflow to the Clean plain text workflow, because the PDF to text process can introduce characters that are not valid in XML, and text containing them cannot be sent to most services as plain text. Another way round this problem is to encode the text as Base64 using the local service ("Encode Byte Array to Base 64") included with Taverna, although this requires a downstream service that knows to decode the Base64 back to text, which is not common. The PDF to text service makes use of the "pdftotext" executable from Xpdf.
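To illustrate the XML-invalid character problem, here is a minimal Java sketch that drops every code point not permitted by the XML 1.0 Char production. It is not the implementation of the Clean plain text workflow; the class and method names are hypothetical, and the sample string only assumes that extracted text may contain control characters such as a form feed between pages.

```java
public class XmlTextCleaner {
    /** Returns true if the code point is allowed by the XML 1.0 Char production. */
    public static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Drops every code point that XML 1.0 does not allow (hypothetical helper). */
    public static String stripInvalidXmlChars(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .filter(XmlTextCleaner::isXmlChar)
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed sample: a form feed (U+000C) and a NUL, both invalid in XML 1.0.
        String extracted = "Page 1\fPage 2\u0000";
        System.out.println(stripInvalidXmlChars(extracted));
    }
}
```

Tab, line feed and carriage return are kept because XML 1.0 allows them; other control characters are removed rather than replaced, which is one possible design choice among several.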

This is a workflow component, designed to be used as a nested workflow inside a larger text mining or text processing workflow.

Run

To run this workflow in the Taverna Workbench, copy and paste this link into File > 'Open workflow location...':
http://www.myexperiment.org/workflows/1058/download?version=1

Workflow Components

Authors (0)

Titles (0)

Descriptions (0)

Dependencies (0)

Inputs (1)

Name	Description
pdfFileContentsIn

Processors (4)

Name	Type	Description
pdfToText	wsdl	WSDL: http://gnode1.mib.man.ac.uk:8080/FullTextWebServices/PdfToTextService?wsdl; WSDL operation: pdfToText
pdfToText_input	xmlsplitter
pdfToText_output	xmlsplitter
Encode_Byte_Array_to_Base_64	localworker	Script: import org.apache.commons.codec.binary.Base64; base64 = new String(Base64.encodeBase64(bytes));
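
The Encode_Byte_Array_to_Base_64 script can be illustrated with a small round-trip sketch. Note that it uses java.util.Base64 from the Java standard library as a stand-in for the Apache Commons Codec call in the script above, so that it runs without extra dependencies; the class name and sample bytes are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64RoundTrip {
    /** Encodes raw bytes (for example, PDF file contents) as a Base64 string. */
    public static String encode(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    /** Decodes a Base64 string back to the original bytes. */
    public static byte[] decode(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    public static void main(String[] args) {
        // Assumed sample input: the first bytes of a typical PDF header.
        byte[] pdfHeader = "%PDF-1.4".getBytes(StandardCharsets.US_ASCII);
        String encoded = encode(pdfHeader);
        System.out.println(encoded); // prints "JVBERi0xLjQ="
        System.out.println(new String(decode(encoded), StandardCharsets.US_ASCII));
    }
}
```

The decode step is what a receiving service would need to perform; as the description notes, few services do this.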

Beanshells (0)

Outputs (1)

Name	Description
textFileContentsOut

Datalinks (5)

Source	Sink
pdfToText_input:output	pdfToText:parameters
Encode_Byte_Array_to_Base_64:base64	pdfToText_input:pdfFile
pdfToText:parameters	pdfToText_output:input
pdfFileContentsIn	Encode_Byte_Array_to_Base_64:bytes
pdfToText_output:extractedText	textFileContentsOut
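
Read in flow order, the datalinks above describe a simple chain: input bytes, Base64 encoding, service call, extracted text. The following Java sketch mirrors that chain under stated assumptions: the web service is represented by a plain function argument because the real WSDL client is not shown here, and the stub in main is a stand-in that does not extract any text.

```java
import java.util.Base64;
import java.util.function.Function;

public class PdfToTextPipeline {
    /**
     * Mirrors the datalinks: pdfFileContentsIn -> Base64 -> pdfToText -> textFileContentsOut.
     * The pdfToText argument is a hypothetical stand-in for the remote service call.
     */
    public static String run(byte[] pdfFileContentsIn, Function<String, String> pdfToText) {
        // Corresponds to Encode_Byte_Array_to_Base_64 (bytes in, base64 out).
        String base64 = Base64.getEncoder().encodeToString(pdfFileContentsIn);
        // Corresponds to pdfToText_input -> pdfToText -> pdfToText_output:extractedText.
        return pdfToText.apply(base64);
    }

    public static void main(String[] args) {
        // Stub service that only reports the payload size; not the real web service.
        Function<String, String> stub = b64 -> "received " + b64.length() + " Base64 characters";
        System.out.println(run(new byte[] {0x25, 0x50, 0x44, 0x46}, stub));
    }
}
```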

Coordinations (0)

Workflow Type

Taverna 2

Uploader

James Eales

License

All versions of this Workflow are licensed under:

Version 1 (of 1)

Credits (1)

(People/Groups)

James Eales

Attributions (0)

(Workflows/Files)

None

Tags (5)

Uploader tags

Shared with Groups (1)

e-LICO

Featured In Packs (2)

Core text mining workflows
Private pack

Attributed By (2)

(Workflows/Files)

Private item
From PDF to lemmatized text

Favourited By (0)

No one

Statistics

3995 viewings

2574 downloads

Citations (0)

None

Version History

In chronological order:

PDF to plain text

Created by James Eales on Friday 19 February 2010 09:07:41 (UTC)

Last edited by James Eales on Friday 19 February 2010 10:30:16 (UTC)

Reviews (0)

No reviews yet

Comments (0)

No comments yet

Other workflows that use similar services (2)

Taverna 2

Uploader

James Eales

Terms from collection of PDF files (2)

This workflow will give you a set of candidate terms for each PDF document in a user-specified directory. You can also specify a c-value threshold that will restrict the terms to those with higher scores. This workflow was created using only nested workflows. These workflow components work on their own and can be linked together to form more complex workflows such as this. You can view the text mining workflow components in this pack. If you receive errors when running this workflow t...

Created: 2010-02-19 | Last updated: 2011-12-13

Credits: James Eales

Taverna 2

Uploader

Netr

From PDF to lemmatized text (1)

This workflow uses the web service hosted at JSI (IJS, Slovenia), which is based on Matjaž Juršič's LemmaGen lemmatization engine. The workflow accepts a PDF file as input and uses James Eales's workflows to preprocess the data. The workflow interactively asks the user which language the text is in, since the lemmatization process is language-specific. The output is a string in Taverna Workbench.

Created: 2010-09-16 | Last updated: 2012-01-18

Credits: Netr, James Eales

Attributions: PDF to plain text, Clean plain text
