Workflow name: PDF2TXT2Solr_databas

Workflow input port:
  userSuppliedPDFCorpus (depth 1): A list of the PDF files that should have their text extracted and stored in the Solr database.

Workflow output ports:
  pdftotext_STDOUT: The standard output of the pdftotext process.
  pdftotext_STDERR: The standard error of the pdftotext process.
  SOLRInport_STDOUT: The standard output of the SolrImport process.
  SolrInport_STDERR: The standard error of the SolrImport process.

Processor: pdftotext
  Input ports: PDFFileLocation (depth 0), TXTFileLocation (depth 0)
  Output ports: STDOUT (depth 0), STDERR (depth 0)

  Input: the path to a PDF file (input) and the path to a text file (output).
  Output: a text file with the text of the PDF file.

  This processor calls the pdftotext program from the terminal (command line). The tool extracts the text of a PDF file and stores it in a .txt file. Make sure that pdftotext is installed on the system and can be called from the terminal. To test whether pdftotext is installed, run the following command in your terminal:
    $ pdftotext -h
  Copyright of pdftotext: 1996-2004 Glyph & Cog, LLC.
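The installation check described above can also be scripted. The following is a minimal sketch, not part of the workflow itself: it looks the program up on the PATH with the POSIX `command -v` builtin rather than relying on the exit status of `pdftotext -h`, which may differ between pdftotext builds. The helper names `have_tool` and `report_tool` are illustrative assumptions.

```shell
#!/bin/sh
# Return 0 if the named program can be found on PATH, non-zero otherwise.
# `command -v` looks a command up without running it.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Print a one-line status for a required tool (hypothetical helper).
report_tool() {
    if have_tool "$1"; then
        echo "$1: found"
    else
        echo "$1: missing"
    fi
}
```

Calling `report_tool pdftotext` before starting the workflow gives an early hint when the program is not installed.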
  Activity: net.sf.taverna.t2.activities.externaltool.ExternalToolActivity (external-tool-activity 1.4)
  Invocation: default local
    <?xml version="1.0" encoding="UTF-8"?> <localInvocation><shellPrefix>/bin/sh -c</shellPrefix><linkCommand>/bin/ln -s %%PATH_TO_ORIGINAL%% %%TARGET_NAME%%</linkCommand></localInvocation>
  Command:
    pdftotext "%%PDFFileLocation%%" "%%TXTFileLocation%%"
  Dispatch stack (the same for every processor in this workflow): Parallelize (1), ErrorBounce, Failover, Retry (1.0, 1000, 5000, 0), Invoke

Processor: SolrImport
  Input ports: inputFile (depth 0), pathToPostJar (depth 0)
  Output ports: STDOUT (depth 0), STDERR (depth 0)

  SolrImport takes the path of the .txt file and stores the file in a Solr database. Make sure that Solr is running and that the pathToPostJar variable holds the correct path.
  If Solr is running locally, you can check whether the files have been stored by browsing to: http://localhost:8983/solr/#/
  Solr can be downloaded at: http://lucene.apache.org/solr/

  Activity: net.sf.taverna.t2.activities.externaltool.ExternalToolActivity (external-tool-activity 1.4)
  Invocation: default local
    <?xml version="1.0" encoding="UTF-8"?> <localInvocation><shellPrefix>/bin/sh -c</shellPrefix><linkCommand>/bin/ln -s %%PATH_TO_ORIGINAL%% %%TARGET_NAME%%</linkCommand></localInvocation>
  Command:
    # Note the -Dauto argument in the command. It makes the Solr post tool
    # determine the file type from the file extension automatically, so the
    # same process can handle a wide range of file extensions.
    java -Dauto -jar "%%pathToPostJar%%" "%%inputFile%%"

Processor: createFileLocation
  Input port: PDFLocation (depth 0)
  Output port: TXTLocation (depth 0)

  createFileLocation takes the path of the PDF file and appends the string ".txt" to create the output location for pdftotext.
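To make the naming rule concrete: the workflow performs this step in a Beanshell processor whose script is not shown here, so the following shell function is only a stand-in under that assumption. The function name `txt_location` is hypothetical. Note that the ".pdf" extension is kept, so the text file ends in ".pdf.txt" and sits in the same directory as the PDF.

```shell
#!/bin/sh
# Hypothetical stand-in for createFileLocation:
# append ".txt" to the PDF path and print the result.
txt_location() {
    printf '%s.txt\n' "$1"
}
```

For example, `txt_location /data/report.pdf` prints `/data/report.pdf.txt`.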
  Activity: net.sf.taverna.t2.activities.beanshell.BeanshellActivity (beanshell-activity 1.4)
  Input port PDFLocation: depth 0, text/plain, java.lang.String
  Output port TXTLocation: depth 0
  Class loader sharing: workflow

Processor: pathToPostJar
  Output port: value (depth 0)

  This variable holds the path to the post.jar file that Solr uses to add files to the database.

  Activity: net.sf.taverna.t2.activities.stringconstant.StringConstantActivity (stringconstant-activity 1.4)
  Value: /home/sander/Downloads/solr-4.3.1/example/exampledocs/post.jar
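Because the post.jar path above is specific to the author's machine, it usually has to be changed before the workflow runs elsewhere. The sketch below shows one way to validate such a path up front. It is an illustration only: the `POST_JAR` override variable and the `resolve_post_jar` function are assumptions, not part of the workflow.

```shell
#!/bin/sh
# Default taken from the workflow's string constant; it will most likely
# not exist on another machine.
DEFAULT_POST_JAR=/home/sander/Downloads/solr-4.3.1/example/exampledocs/post.jar

# Print the post.jar path to use. POST_JAR is a hypothetical override variable.
# Returns 1 and writes a message to stderr when the file does not exist.
resolve_post_jar() {
    jar=${POST_JAR:-$DEFAULT_POST_JAR}
    if [ -f "$jar" ]; then
        printf '%s\n' "$jar"
    else
        echo "post.jar not found at: $jar" >&2
        return 1
    fi
}
```

Failing early here avoids running the text extraction for a whole corpus only to have every import step fail on a wrong path.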
Data links (sink <- source):
  pdftotext.PDFFileLocation <- userSuppliedPDFCorpus
  pdftotext.TXTFileLocation <- createFileLocation.TXTLocation
  SolrImport.inputFile <- createFileLocation.TXTLocation
  SolrImport.pathToPostJar <- pathToPostJar.value
  createFileLocation.PDFLocation <- userSuppliedPDFCorpus
  pdftotext_STDOUT <- pdftotext.STDOUT
  pdftotext_STDERR <- pdftotext.STDERR
  SOLRInport_STDOUT <- SolrImport.STDOUT
  SolrInport_STDERR <- SolrImport.STDERR
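The wiring between the processors can be read as a loop over the list of PDF paths. The sketch below is a rough shell equivalent, not a substitute for the workflow engine. It assumes that text extraction finishes before the import starts for each file, as the workflow description implies, and it stops at the first failing command, which is a choice of this sketch. The names `run`, `process_corpus` and `DRY_RUN` are hypothetical.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1 (hypothetical switch).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: process_corpus POST_JAR PDF...
# For each PDF: derive the text path, extract the text, then post it to Solr.
process_corpus() {
    post_jar=$1
    shift
    for pdf in "$@"; do
        txt="$pdf.txt"                                        # createFileLocation
        run pdftotext "$pdf" "$txt" || return 1               # pdftotext
        run java -Dauto -jar "$post_jar" "$txt" || return 1   # SolrImport
    done
}
```

Setting `DRY_RUN=1` before calling `process_corpus` prints the two commands per file without executing them, which is a cheap way to review the generated paths.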
Workflow description:
  This workflow extracts the text of a PDF file and saves it locally (in the same directory as the PDF file). It then stores the newly created .txt file in a Solr database.

  Before running this workflow, make sure that Solr is running and that the pathToPostJar variable points to the correct path of post.jar in your Solr installation. Also make sure that pdftotext is installed and can be called from the command line or terminal.
  To test, run:
    $ pdftotext -h
  If Solr is running locally, you can check whether the files have been stored by browsing to: http://localhost:8983/solr/#/

  Dependencies:
  - pdftotext
  - Solr

  Developed and tested on Fedora.

Description date: 2013-07-24 12:49:22.827 UTC
Author: Sander van Boom (2013-07-24 11:23:37.728 UTC)
Title: PDF2TXT2Solr_database (2013-07-24 11:25:29.943 UTC)
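The browser check for a local Solr instance can be approximated from the command line. This is a minimal sketch that assumes curl is installed and that the Solr base URL (the address above without the "#/" fragment) answers plain HTTP requests. The function names are hypothetical, and a successful response only shows that the server is reachable, not that documents were indexed.

```shell
#!/bin/sh
# Build the Solr base URL; host and port default to the values in the text.
solr_base_url() {
    printf 'http://%s:%s/solr/\n' "${1:-localhost}" "${2:-8983}"
}

# Return 0 if an HTTP request to the Solr base URL succeeds.
# -f makes curl fail on HTTP errors, -sS hides progress but keeps error
# messages, and --max-time bounds the wait to 5 seconds.
solr_is_up() {
    curl -fsS -o /dev/null --max-time 5 "$(solr_base_url "$@")"
}
```

A typical use would be `solr_is_up || echo "Solr does not seem to be running"` before starting the workflow.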