Hadoop Large Document Collection Data Prep...
Workflow for preparing large document collections for data analysis. Different types of hadoop jobs (Hadoop-Streaming-API, Hadoop Map/Reduce, and Hive) are used for specific purposes.
The *PathCreator components create text files with absolute file paths using the unix command 'find'. The workflow then uses 1) a Hadoop Streaming API component (HadoopStreamingExiftoolRead) based on a bash script for reading image metadata using Exiftool, 2) the Map/Reduce component (HadoopHocrAvBlockWidthMapR...
| Last updated: 2012-08-18