Hadoop Large Document Collection Data Preparation

Created: 2012-08-17 12:19:39 Last updated: 2012-08-18 18:39:26

Workflow for preparing large document collections for data analysis. Three types of Hadoop jobs are used, each for a specific purpose: Hadoop Streaming API, Hadoop Map/Reduce, and Hive.

The *PathCreator components (HtmlPathCreator and Jp2PathCreator) create text files containing absolute file paths, using the Unix command 'find'. The workflow then uses 1) a Hadoop Streaming API component (HadoopStreamingExiftoolRead), based on a bash script, to read image metadata with Exiftool, 2) a Map/Reduce component (HadoopHocrAvBlockWidthMapReduce), and 3) Hive components to create data tables (HiveLoad*Data) and to run queries on the result files (HiveSelect).
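The path-list step can be illustrated with a short Python sketch. This is not the workflow's own implementation, which calls 'find'; the function name create_path_list is hypothetical, and the parameter names rootpath and extfilter are borrowed from the component's input ports. The exact 'find' options used by the workflow are not given here, so the case-sensitive suffix match below is an assumption.

```python
import os
import sys


def create_path_list(rootpath, extfilter):
    """Return the sorted absolute paths of all files below rootpath
    whose names end in '.<extfilter>' (case-sensitive match)."""
    suffix = "." + extfilter
    paths = []
    for dirpath, _dirnames, filenames in os.walk(rootpath):
        for name in filenames:
            if name.endswith(suffix):
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(paths)


# Command-line use (hypothetical script name):
#   python pathcreator.py /data/collection html > html_paths.txt
if __name__ == "__main__" and len(sys.argv) == 3:
    for path in create_path_list(sys.argv[1], sys.argv[2]):
        print(path)
```

Writing one absolute path per line gives a plain text file that a later Hadoop job can read line by line.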

The code for the two Hadoop jobs is available on GitHub: tb-lsdr-seqfilecreator and tb-lsdr-hocrparser.
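The Hadoop Streaming step (HadoopStreamingExiftoolRead) is based on a bash script that is not reproduced on this page. As a rough Python analogue, a streaming mapper could read one file path per input line and emit a tab-separated record with the Exiftool metadata. The names exiftool_metadata and map_lines, the output record layout, and the assumption that each input line holds a path readable from the worker node are all illustrative; only the 'exiftool -j' (JSON output) option is taken from Exiftool itself.

```python
import json
import subprocess
import sys


def exiftool_metadata(path):
    """Run 'exiftool -j' on one file and return its metadata as a dict.
    Assumes the exiftool binary is installed on the worker node."""
    result = subprocess.run(
        ["exiftool", "-j", path],
        capture_output=True, text=True, check=True,
    )
    # With -j, exiftool prints a JSON array holding one object per file.
    return json.loads(result.stdout)[0]


def map_lines(lines, read_metadata=exiftool_metadata):
    """Yield one 'path<TAB>json' record per non-empty input line.
    Failures are recorded in the output instead of aborting the task."""
    for line in lines:
        path = line.strip()
        if not path:
            continue
        try:
            meta = read_metadata(path)
        except Exception as exc:  # keep the job running on unreadable files
            meta = {"error": str(exc)}
        yield path + "\t" + json.dumps(meta, sort_keys=True)


def main():
    """Entry point when used as a Hadoop Streaming mapper."""
    for record in map_lines(sys.stdin):
        print(record)
```

To use the sketch as a mapper script, call main() at the end of the file. Passing read_metadata as a parameter keeps the record-building logic testable without Exiftool installed.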

Run

To run this workflow in the Taverna Workbench, copy and paste this link into File > 'Open workflow location...':
http://www.myexperiment.org/workflows/3105/download?version=1

Workflow Components

Authors (1)

Titles (1)

Descriptions (0)

Dependencies (0)

Inputs (2)

Name	Description
hadoop_job_name_prefix	Hadoop job name prefix for
rootpath

Processors (10)

Name	Type	Description
HadoopHocrAvBlockWidthMapReduce	externaltool
HadoopSequenceFileCreator	externaltool
HtmlPathCreator	externaltool
Jp2PathCreator	externaltool
html_extension	stringconstant	Value html
jp2_extension	stringconstant	Value jp2
HadoopStreamingExiftoolRead	externaltool
HiveLoadExifData	externaltool
HiveLoadHocrData	externaltool
HiveSelect	externaltool

Beanshells (0)

Outputs (1)

Name	Description
Out

Datalinks (13)

Source	Sink
HadoopSequenceFileCreator:STDOUT	HadoopHocrAvBlockWidthMapReduce:hdfs_input_dir
hadoop_job_name_prefix	HadoopHocrAvBlockWidthMapReduce:hadoop_job_name_prefix
hadoop_job_name_prefix	HadoopSequenceFileCreator:hadoop_job_name_prefix
HtmlPathCreator:STDOUT	HadoopSequenceFileCreator:hdfs_input_path
rootpath	HtmlPathCreator:rootpath
html_extension:value	HtmlPathCreator:extfilter
rootpath	Jp2PathCreator:rootpath
jp2_extension:value	Jp2PathCreator:extfilter
Jp2PathCreator:STDOUT	HadoopStreamingExiftoolRead:hdfs_input_dir
hadoop_job_name_prefix	HadoopStreamingExiftoolRead:hadoop_job_name_prefix
HadoopStreamingExiftoolRead:STDOUT	HiveLoadExifData:hdfs_result_file
HadoopHocrAvBlockWidthMapReduce:STDOUT	HiveLoadHocrData:hdfs_result_file
HiveSelect:STDOUT	Out

Coordinations (7)

Controller	Target
HadoopSequenceFileCreator	HadoopHocrAvBlockWidthMapReduce
HtmlPathCreator	HadoopSequenceFileCreator
HiveLoadExifData	HiveSelect
HadoopStreamingExiftoolRead	HiveLoadExifData
HiveLoadHocrData	HiveSelect
Jp2PathCreator	HadoopStreamingExiftoolRead
HadoopHocrAvBlockWidthMapReduce	HiveLoadHocrData
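
The coordinations above constrain the order in which the processors run: in Taverna 2, a target starts only after its controller has finished. As a small sketch of how one valid sequential order can be derived from the table, the pairs can be fed into a topological sort. The function name execution_order is hypothetical, and the result is only one of several valid orders, since the HTML and JP2 branches are independent of each other.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# (controller, target) pairs copied from the Coordinations table.
COORDINATIONS = [
    ("HadoopSequenceFileCreator", "HadoopHocrAvBlockWidthMapReduce"),
    ("HtmlPathCreator", "HadoopSequenceFileCreator"),
    ("HiveLoadExifData", "HiveSelect"),
    ("HadoopStreamingExiftoolRead", "HiveLoadExifData"),
    ("HiveLoadHocrData", "HiveSelect"),
    ("Jp2PathCreator", "HadoopStreamingExiftoolRead"),
    ("HadoopHocrAvBlockWidthMapReduce", "HiveLoadHocrData"),
]


def execution_order(pairs):
    """Return one processor order in which every controller
    comes before each of its targets."""
    sorter = TopologicalSorter()
    for controller, target in pairs:
        # add(node, *predecessors): the target depends on its controller.
        sorter.add(target, controller)
    return list(sorter.static_order())


if __name__ == "__main__":
    for step, name in enumerate(execution_order(COORDINATIONS), start=1):
        print(step, name)
```

Because every other processor is a direct or indirect controller of HiveSelect, HiveSelect always comes last in any order produced this way.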

Workflow Type

Taverna 2

Uploader

Sven

License

All versions of this Workflow are licensed under:

Version 1 (of 1)

Credits (1)

(People/Groups)

Sven

Attributions (0)

(Workflows/Files)

None

Tags (5)

Uploader tags

hadoop | hive | jp2 | jpeg2000 | scape

Shared with Groups (1)

SCAPE

Featured In Packs (0)

None

Attributed By (0)

(Workflows/Files)

None

Favourited By (1)

Asger Askov Blekinge

Statistics

2504 viewings

1471 downloads

Citations (0)

None

Version History

In chronological order:

Hadoop Large Document Collection Data Preparation

Created by Sven on Friday 17 August 2012 12:19:38 (UTC)

Last edited by Sven on Saturday 18 August 2012 18:39:25 (UTC)