Terms from collection of PDF files

Created: 2010-02-19 10:52:29 | Last updated: 2011-12-13 15:56:08

This workflow produces a set of candidate terms for each PDF document in a user-specified directory. You can also specify a c-value threshold; terms scoring below it are left out of a second, filtered output.

This workflow was created using only nested workflows. These workflow components work on their own and can be linked together to form more complex workflows such as this. You can view the text mining workflow components in this pack.

If you receive errors when running this workflow, check whether you have access to the NaCTeM web services here. If you do not, you can request access from the same page.

Run

Run this Workflow in the Taverna Workbench...

Option 1:

Copy and paste this link into File > 'Open workflow location...'
http://www.myexperiment.org/workflows/1061/download?version=2

Workflow Components

Authors (1)

Titles (2)

Descriptions (2)

Dependencies (0)

Inputs (2)

Name	Description
Workflow6_pdfDirectoryPathIn	Single input value: the absolute path to a directory containing one or more PDF files
Workflow10_cValueThreshold	c-value threshold; terms scoring below this value are excluded from the "above threshold" output
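
As a rough illustration of what the directory input implies, the sketch below collects the raw bytes of every PDF found directly inside an absolute directory path. It is not the workflow's own code: the function name `read_pdf_directory` is hypothetical, and matching files on a `.pdf` extension is an assumption about how PDFs are identified.

```python
from pathlib import Path


def read_pdf_directory(directory_path: str) -> dict[str, bytes]:
    """Return {file name: raw bytes} for each PDF directly inside the directory.

    Hypothetical helper; the real workflow reads files with its own components.
    """
    directory = Path(directory_path)
    if not directory.is_absolute():
        raise ValueError("expected an absolute directory path")
    if not directory.is_dir():
        raise NotADirectoryError(directory_path)
    # Assumption: a PDF is any regular file whose extension is .pdf (any case).
    return {
        p.name: p.read_bytes()
        for p in sorted(directory.iterdir())
        if p.is_file() and p.suffix.lower() == ".pdf"
    }
```

Reading the files as bytes rather than text keeps the PDF contents intact for whatever conversion step follows.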

Processors (5)

Name	Type	Description
Workflow6	workflow
Workflow7	workflow
Workflow18	workflow
Workflow9	workflow
Workflow10	workflow

Beanshells (4)

Name	Inputs	Outputs
binaryFileReader	absoluteFilePath	fileContents
sentenceListNormaliser	sentences	sentenceListString
jamesXPath	xml xPathString	resultValues
jamesXPath_2	xml xPathString	resultValues
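
The two `jamesXPath` components take an XML document and an XPath string and return a list of values. Their Beanshell source is not shown here, so the following is only a sketch of that input/output shape using Python's standard library. The function name and the element names in the usage example are hypothetical, and `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET


def xpath_values(xml: str, xpath_string: str) -> list[str]:
    """Return the text of every element matched by xpath_string.

    Illustrative only: mirrors the (xml, xPathString) -> resultValues
    signature, not the actual Beanshell implementation.
    """
    root = ET.fromstring(xml)
    return [(element.text or "").strip() for element in root.findall(xpath_string)]


# Hypothetical XML; the real service response format is not documented here.
sample = "<terms><term score='3.2'>gene expression</term><term score='1.0'>cell</term></terms>"
values = xpath_values(sample, "./term")
```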

Outputs (2)

Name	Description
Workflow10_allTermCandidates	All terms found by TerMine
Workflow10_termCandidatesAboveThreshold	Terms with c-value scores above the threshold
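
The relationship between the two outputs can be sketched as a simple filter. This is a minimal illustration, not the workflow's implementation: the function name and the sample scores are made up, and keeping terms whose score equals the threshold is an assumption based on the input description ("terms with score below this value are excluded").

```python
def split_by_threshold(term_scores: dict[str, float], c_value_threshold: float):
    """Return (all terms, terms at or above the threshold), highest score first."""
    all_terms = sorted(term_scores.items(), key=lambda item: item[1], reverse=True)
    # Assumption: a score exactly equal to the threshold is kept.
    above = [(term, score) for term, score in all_terms if score >= c_value_threshold]
    return all_terms, above


# Hypothetical scores for illustration only.
scores = {"gene expression": 4.5, "cell": 1.0, "binding site": 2.0}
all_terms, above_threshold = split_by_threshold(scores, 2.0)
```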

Datalinks (8)

Source	Sink
Workflow6_pdfDirectoryPathIn	Workflow6:pdfDirectoryPathIn
Workflow6:pdfFileContentsOut	Workflow7:pdfFileContentsIn
Workflow7:textFileContentsOut	Workflow18:plainTextForCleaning
Workflow18:cleanedTextASCII	Workflow9:plainText
Workflow9:sentencesList	Workflow10:sentencesList
Workflow10_cValueThreshold	Workflow10:cValueThreshold
Workflow10:allTermCandidates	Workflow10_allTermCandidates
Workflow10:termCandidatesAboveThreshold	Workflow10_termCandidatesAboveThreshold
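
The datalinks above define a linear chain of nested workflows. As a sketch, the order in which the processors can run may be derived from the table with a topological sort. The helper names are hypothetical, and treating a name without a colon as a workflow-level port is an assumption read off the naming pattern in the table. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# (source, sink) pairs copied from the Datalinks table.
DATALINKS = [
    ("Workflow6_pdfDirectoryPathIn", "Workflow6:pdfDirectoryPathIn"),
    ("Workflow6:pdfFileContentsOut", "Workflow7:pdfFileContentsIn"),
    ("Workflow7:textFileContentsOut", "Workflow18:plainTextForCleaning"),
    ("Workflow18:cleanedTextASCII", "Workflow9:plainText"),
    ("Workflow9:sentencesList", "Workflow10:sentencesList"),
    ("Workflow10_cValueThreshold", "Workflow10:cValueThreshold"),
    ("Workflow10:allTermCandidates", "Workflow10_allTermCandidates"),
    ("Workflow10:termCandidatesAboveThreshold", "Workflow10_termCandidatesAboveThreshold"),
]


def processor_of(port: str):
    """Return the processor for 'Processor:port', or None for a workflow-level port."""
    return port.split(":", 1)[0] if ":" in port else None


def execution_order(datalinks):
    """Order processors so that each runs after the processors feeding it."""
    predecessors = {}
    for source, sink in datalinks:
        src, dst = processor_of(source), processor_of(sink)
        for name in (src, dst):
            if name is not None:
                predecessors.setdefault(name, set())
        if src is not None and dst is not None and src != dst:
            predecessors[dst].add(src)
    return list(TopologicalSorter(predecessors).static_order())


print(execution_order(DATALINKS))
# ['Workflow6', 'Workflow7', 'Workflow18', 'Workflow9', 'Workflow10']
```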

Coordinations (0)

Workflow Type

Taverna 2

Uploader

James Eales

License

All versions of this Workflow are licensed under:

Version 2 (latest) (of 2)

Credits (1)

(People/Groups)

James Eales

Attributions (0)

(Workflows/Files)

None

Tags (4)

Uploader tags

Shared with Groups (1)

e-LICO

Featured In Packs (0)

None

Attributed By (0)

(Workflows/Files)

None

Favourited By (3)

Statistics

3444 viewings

1986 downloads

Citations (0)

None

Version History

In chronological order:

Terms from collection of PDF files

Created by James Eales on Friday 19 February 2010 10:48:52 (UTC)

Last edited by James Eales on Friday 19 February 2010 10:51:57 (UTC)

Terms from collection of PDF files

Created by James Eales on Friday 19 February 2010 10:52:29 (UTC)

Last edited by James Eales on Saturday 20 March 2010 15:41:53 (UTC)

Reviews (0)

No reviews yet

Comments (0)

No comments yet

Other workflows that use similar services (10)

Only the first 2 workflows that use similar services are shown.

Taverna 2

Uploader

James Eales

Terms from collection of text files (1)

This workflow will give you a set of candidate terms for each text file in a user-specified directory. You can also specify a c-value threshold that will restrict the terms to those with higher scores. This workflow was created using only nested workflows. These workflow components work on their own and can be linked together to form more complex workflows such as this. You can view the text mining workflow components in this pack. If you receive errors when running this workflow then...

Created: 2010-02-22 | Last updated: 2011-12-13

Credits: James Eales

Taverna 2

Uploader

James Eales

One sentence per line (1)

This workflow accepts a plain text input and, for each input, produces a single text document containing one sentence per line. Newline characters are removed from the original input. The text is split with the OpenNLP sentence splitter, which is provided by University of Manchester Web Services.
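
The one-sentence-per-line behaviour described above can be sketched as follows. Note that this uses a naive regular-expression splitter as a stand-in; it is not the OpenNLP sentence splitter and will mis-split abbreviations such as "e.g." that a trained model handles better. The function name is hypothetical.

```python
import re


def one_sentence_per_line(text: str) -> str:
    """Remove original newlines, then emit one sentence per line.

    Stand-in for a real sentence splitter: breaks after '.', '!' or '?'
    when followed by whitespace.
    """
    flattened = " ".join(text.split())  # drops newlines, collapses whitespace
    sentences = re.split(r"(?<=[.!?])\s+", flattened)
    return "\n".join(sentence for sentence in sentences if sentence)


result = one_sentence_per_line("Terms are extracted.\nScores are\ncomputed! Is it done?")
```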

Created: 2011-05-06 | Last updated: 2011-12-13

Credits: James Eales