Workflow Entry: PDF to plain text

Created at: 19/02/10 @ 09:07:41      Last updated: 13/12/11 @ 15:53:29
Version 1 (of 1)

Version created on: 19/02/10 @ 09:07:41 by: James Eales

Last edited on: 19/02/10 @ 10:30:16 by: James Eales

Title: PDF to plain text

Type: Taverna 2



Description

This workflow extracts the plain text content of PDF files supplied to its input port. You can connect the Load PDF from directory workflow to this workflow's input. We recommend sending the output of this workflow to the Clean plain text workflow, because the PDF to text process can introduce characters that are invalid in XML, and text containing them cannot be sent to most services as plain text. Another way around this problem is to encode the text as Base64 using the local service "Encode Byte Array to Base 64" included with Taverna, although this requires a receiving service that knows to decode the Base64 back to text, which is uncommon. The PDF to text service uses the "pdftotext" executable from Xpdf.

This is a workflow component, designed to be used as a nested workflow inside a larger text mining or text processing workflow.
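
For readers who want to see the same steps outside Taverna, the following is a minimal Python sketch, not the workflow's actual implementation. It assumes the "pdftotext" executable is installed and on the PATH. The function names (pdf_to_text, strip_xml_invalid, encode_base64, decode_base64) are hypothetical. The cleaning step is an illustrative stand-in for the Clean plain text workflow, whose internals are not described here; it simply drops every character outside the range XML 1.0 allows.

```python
import base64
import re
import subprocess

# XML 1.0 allows only tab, line feed, carriage return and the ranges below.
# Anything else (for example NUL or form feed) is invalid in an XML document.
_XML_INVALID = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def pdf_to_text(pdf_path):
    """Extract text with pdftotext (assumed to be on PATH); '-' means stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def strip_xml_invalid(text):
    """Remove characters that XML 1.0 does not permit (illustrative cleaning)."""
    return _XML_INVALID.sub("", text)


def encode_base64(text):
    """Alternative to cleaning: wrap the raw text as Base64 (ASCII only)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_base64(encoded):
    """What a receiving service would need to do to recover the text."""
    return base64.b64decode(encoded).decode("utf-8")
```

A typical call chain would be strip_xml_invalid(pdf_to_text("paper.pdf")), where "paper.pdf" is a placeholder file name. One concrete source of trouble is the form feed character that pdftotext writes at page breaks by default, which XML 1.0 does not allow. The Base64 route preserves such characters, but only helps if the receiver calls something like decode_base64.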


Run

Run this Workflow in the Taverna Workbench...

Option 1:

Copy and paste this link into File > 'Open workflow location...'
http://www.myexperiment.org/workflows/1058/download?version=1


Workflow Components

Authors (0)
Titles (0)
Descriptions (0)
Inputs (1)
Processors (4)
Beanshells (0)
Outputs (1)
Datalinks (5)
Coordinations (0)

Original Uploader

License

All versions of this Workflow are licensed under:

Credits (1)

(People/Groups)

Attributions (0)

(Workflows/Files)

None

Tags (5)

Shared with Groups (1)

Featured In Packs (2)

Ratings (1)

Current:

5.0 / 5

(1 rating)

Attributed By (2)

(Workflows/Files)

Favourited By (0)

No one

 

Citations (0)

None


Version History

Earliest Version:
[1] - PDF to plain text

Created on: Friday 19 February 2010 @ 09:07:41 (GMT)

Created by: James Eales

Last edited on: Friday 19 February 2010 @ 10:30:16 (GMT)

Last edited by: James Eales

Revision comments:

None

This Workflow only has one version.



Other workflows that use similar services (2)

Terms from collection of PDF files (v2)

Created: 19/02/10 @ 10:52:29 | Last updated: 13/12/11 @ 15:56:08

Credits: James Eales

License: Creative Commons Attribution-Share Alike 3.0 Unported License

This workflow will give you a set of candidate terms for each PDF document in a user-specified directory. You can also specify a c-value threshold that will restrict the terms to those with higher scores. This workflow was created using only nested workflows.  These workflow components work on their own and can be linked together to form more complex workflows such as this. You can view the text mining workflow components in this pack. If you receive errors when running this workflow t...

Rating: 0.0 / 5 (0 ratings) | Versions: 2 | Reviews: 0 | Comments: 0 | Citations: 0

Viewed: 71 times | Downloaded: 34 times

From PDF to lemmatized text (v1)

Created: 16/09/10 @ 10:09:58 | Last updated: 18/01/12 @ 10:27:27

Credits: Netr, James Eales

Attributions: PDF to plain text, Clean plain text

License: Creative Commons Attribution-Share Alike 3.0 Unported License

This workflow uses the web service hosted at JSI (IJS, Slovenia), which is based on Matjaž Juršič's LemmaGen lemmatization engine. The workflow accepts a PDF file as input and uses James Eales's workflows to preprocess the data. Because lemmatization is language-specific, the workflow interactively asks the user which language the text is in. The output is a string in the Taverna Workbench.

Rating: 0.0 / 5 (0 ratings) | Versions: 1 | Reviews: 0 | Comments: 0 | Citations: 0

Viewed: 6 times | Downloaded: 4 times

Linked Data

Non-Information Resource URI: http://www.myexperiment.org/workflows/1058

