Hadoop Large Document Collection Data Preparation

Created: 2012-08-17 12:19:39      Last updated: 2012-08-18 18:39:26

Workflow for preparing large document collections for data analysis. Different types of hadoop jobs (Hadoop-Streaming-API, Hadoop Map/Reduce, and Hive) are used for specific purposes.

The *PathCreator components create text files with absolute file paths using the unix command 'find'. The workflow then uses 1) a Hadoop Streaming API component (HadoopStreamingExiftoolRead) based on a bash script for reading image metadata using Exiftool, 2) the Map/Reduce component (HadoopHocrAvBlockWidthMapReduce) presented above, and 3) Hive components for creating data tables (HiveLoad*Data) and performing queries on the result files (HiveSelect).

The code for the two hadoop jobs is available on Github: tb-lsdr-seqfilecreator and  tb-lsdr-hocrparser.

Information Preview

Information Run

Run this Workflow in the Taverna Workbench...

Option 1:

Copy and paste this link into File > 'Open workflow location...'
[ More InfoExpand ]

Information Workflow Components

Information Authors (1)
Information Titles (1)
Information Descriptions (0)
Information Dependencies (0)
Inputs (2)
Processors (10)
Beanshells (0)
Outputs (1)
Datalinks (13)
Coordinations (7)

Information Workflow Type

Taverna 2

Information Uploader

Information License

All versions of this Workflow are licensed under:

Information Version 1 (of 1)

Information Credits (1)


Information Attributions (0)



Information Tags (5)

Log in to add Tags

Information Shared with Groups (1)

Information Featured In Packs (0)


Log in to add to one of your Packs

Information Attributed By (0)



Information Favourited By (1)

Information Statistics


Citations (0)


Version History

In chronological order:

Reviews Reviews (0)

No reviews yet

Be the first to review!

Comments Comments (0)

No comments yet

Log in to make a comment

Workflow Other workflows that use similar services (0)

There are no workflows in myExperiment that use similar services to this Workflow.