This workflow generates DOI record files for deposit, using data set metadata from the FLOSSmole project. It reads an input file produced by a SQL query against an EPrints database and transforms fields of the source file as needed to create a complete DOI deposit record. It also generates DOIs for the data sets. The metadata are inserted into an XML record template (based on the std-doi.xsd schema), and the individual resource records are aggregated into a single file.
Location of the source CSV with data for creating records.
/file_location/eprint.csv
Takes a flat CSV input file and splits it into a list.
\n
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
Takes a single string output and converts it to a list.
\n
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
Takes the list input and creates a 2-deep list.
";"
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
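The two-stage split described above (rows first, then fields) can be sketched in plain Java. This is a minimal illustration, not the workflow's own code: the class name `CsvSplitSketch`, the method `split`, and the sample rows are hypothetical, and it assumes the field delimiter is the literal three-character sequence `";"` shown above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CsvSplitSketch {
    // Splits the file text into rows on newlines, then splits each row into
    // fields on the quote-semicolon-quote delimiter, giving a 2-deep list.
    static List<List<String>> split(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String row : text.split("\n")) {
            rows.add(Arrays.asList(row.split("\";\"")));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical two-row sample; the outermost quotes of each row
        // are not part of any delimiter, so they stay on the edge fields.
        String sample = "\"101\";\"Title A\"\n\"102\";\"Title B\"";
        System.out.println(split(sample));
    }
}
```

Note that with this delimiter the first and last field of each row keep their outer quotation mark unless they are stripped separately.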
Shim that reads in the file whose location is provided by a string constant.
net.sourceforge.taverna.scuflworkers.io.TextFileReader
Takes a single string input and splits it into a list of string inputs.
\n
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
Splits the filesize string into two pieces, for size and units, so that the units can be replaced and appended to the size.
\s
org.embl.ebi.escience.scuflworkers.java.SplitByRegex
Creates a sequential integer series for assignment to DOIs.
// A. Wiggins 4/29/2009
count = trigger.size();
delim = "\n";
out = new String();
for(i=0; i < count; i++){
out = out + (i + Integer.valueOf(seed));
out = out + delim;
}
number_sequence = out;
seed
trigger
number_sequence
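As a rough standalone equivalent of the script above, the same series can be built with a `StringBuilder`. The class and method names are hypothetical, and the `count` parameter stands in for `trigger.size()`.

```java
final class SequenceSketch {
    // Builds `count` consecutive integers starting at `seed`, one per line,
    // matching the newline-delimited string the script emits.
    static String numberSequence(int seed, int count) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < count; i++) {
            out.append(seed + i).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical seed and record count.
        System.out.print(numberSequence(100, 3));
    }
}
```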
Extracts the name of the dataset file host from the download location URL.
// A. Wiggins 4/29/2009
import java.util.regex.Pattern;
String sf_regex = "^http://downloads.*";
String gc_regex = "^http://flossmole.*";
if (Pattern.matches(sf_regex, url)) {
publicationPlace = "SourceForge";
} else {
if (Pattern.matches(gc_regex, url)) {
publicationPlace = "GoogleCode";
}
}
url
publicationPlace
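The script above leaves `publicationPlace` unset when the URL matches neither pattern. A minimal sketch of the same classification with an explicit fallback might look like the following; the class name, the `"Unknown"` label, and the sample URLs are assumptions for illustration only.

```java
final class HostSketch {
    // Classifies a download URL by its prefix, mirroring the two patterns
    // used in the script. The "Unknown" fallback is an assumption; the
    // original script assigns nothing in that case.
    static String publicationPlace(String url) {
        if (url.startsWith("http://downloads")) {
            return "SourceForge";
        }
        if (url.startsWith("http://flossmole")) {
            return "GoogleCode";
        }
        return "Unknown";
    }

    public static void main(String[] args) {
        // Hypothetical URL, not taken from the source data.
        System.out.println(publicationPlace("http://downloads.example/file.txt"));
    }
}
```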
Aggregates the individual records into a single XML file.
// A. Wiggins 4/29/2009
delim = "\n";
count = doi.size();
out = "<resources>" + delim;
for(i = 0; i < count; i++) {
out = out + doi.get(i);
out = out + delim;
}
out = out + "</resources>";
import_file = out;
doi
import_file
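The aggregation step above can be sketched outside the workflow as follows. The class and method names are hypothetical; the wrapper element and newline delimiter are taken from the script.

```java
import java.util.Arrays;
import java.util.List;

final class AggregateSketch {
    // Wraps the individual record strings in a single <resources> element,
    // one record per line, as the script does.
    static String aggregate(List<String> records) {
        StringBuilder out = new StringBuilder("<resources>\n");
        for (String record : records) {
            out.append(record).append('\n');
        }
        return out.append("</resources>").toString();
    }

    public static void main(String[] args) {
        // Placeholder records stand in for the generated DOI records.
        System.out.println(aggregate(Arrays.asList("<resource/>", "<resource/>")));
    }
}
```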
Constructs the metadata URL for a FLOSSmole data set from its eprintid.
// A. Wiggins 4/29/2009
url = "http://flosspapers.org/"+eprintid;
eprintid
url
Reads the 2-deep input list and splits out the values into separate variables.
// A. Wiggins 4/29/2009
eprintid = file.get(0);
title = file.get(1);
abstracts = file.get(2);
year = file.get(3);
month = file.get(4);
day = file.get(5);
url = file.get(6);
media = file.get(7);
data_type = file.get(8);
file_type = file.get(9);
file_size = file.get(10);
source = file.get(11);
file
title
eprintid
year
month
day
url
media
data_type
file_type
file_size
source
abstracts
Creates a unique DOI string following the format <DOI.base>/FLOSSmole.<source>.<year>-<month>.<sequential number>.
// A. Wiggins 4/29/2009
//replace any spaces in the name of the source repository
formatted_source = source.replaceAll("\\s","");
//assemble metadata elements and numeric list into DOI
doi = "doi.base/FLOSSmole." + formatted_source + "." + year + "-" + month + "." + seed;
source
year
month
seed
doi
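To make the resulting DOI format concrete, here is a small standalone sketch of the same string assembly. The class name, method name, and sample values are hypothetical; `doi.base` is the placeholder prefix used in the script.

```java
final class DoiSketch {
    // Removes whitespace from the source name, then joins the parts into
    // doi.base/FLOSSmole.<source>.<year>-<month>.<sequential number>.
    static String buildDoi(String source, String year, String month, String seed) {
        String formattedSource = source.replaceAll("\\s", "");
        return "doi.base/FLOSSmole." + formattedSource + "." + year + "-" + month + "." + seed;
    }

    public static void main(String[] args) {
        // Hypothetical sample values.
        System.out.println(buildDoi("Source Forge", "2009", "04", "7"));
    }
}
```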
Creates a general description of the data set contents to include additional metadata.
// A. Wiggins 4/29/2009
description = abstracts+" A "+file_type+" file of "+data_type+" data available as a "+media+" from "+url+". Data collected from "+source+" as raw HTML and parsed into data files by FLOSSmole. Metadata record available at "+metadata_url+".";
media
data_type
file_type
source
metadata_url
url
abstracts
description
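Formats each file size for the Bytes unit by appending the metric prefix letter (g, m, or k) that corresponds to its units (GB, MB, or KB); sizes with any other units are passed through unchanged.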
// A. Wiggins 4/29/2009
import java.util.regex.Pattern;
String gb_regex = "GB";
String mb_regex = "MB";
String kb_regex = "KB";
delim = "\n";
count = filesize.size();
out = new String();
for(i=0; i < count; i++){
String sizes = filesize.get(i).get(0);
String units = filesize.get(i).get(1);
if (Pattern.matches(gb_regex, units)) {
out = out + sizes + "g";
out = out + delim;
} else {
if (Pattern.matches(mb_regex, units)) {
out = out + sizes + "m";
out = out + delim;
} else {
if (Pattern.matches(kb_regex, units)) {
out = out + sizes + "k";
out = out + delim;
} else {
out = out + sizes;
out = out + delim;
}
}
}
}
formatted_filesize = out;
filesize
formatted_filesize
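The unit mapping in the script above can be expressed more compactly with a `switch` on the units string. This is a sketch for a single size/units pair rather than the full list loop; the class and method names are hypothetical.

```java
final class FileSizeSketch {
    // Appends the metric prefix letter for GB, MB, or KB; any other units
    // value leaves the size unchanged, as in the script's final else branch.
    static String format(String size, String units) {
        switch (units) {
            case "GB": return size + "g";
            case "MB": return size + "m";
            case "KB": return size + "k";
            default:   return size;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample value.
        System.out.println(format("12", "MB"));
    }
}
```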
Generates an XML DOI record according to the std-doi.xsd metadata standard schema.
// A. Wiggins 4/29/2009
doi_records = "<resource xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"std-doi.xsd\"><DOI>"+doi+"</DOI><creator>Megan@Squire</creator><creator>Kevin@Crowston</creator><creator>James@Howison</creator><publisher>FLOSSmole</publisher><title>"+title+"</title><language>en</language><structuralType>Digital</structuralType><mode>Abstract</mode><resourceType>Dataset</resourceType><registrationAgency>doi.base</registrationAgency><issueNumber>1</issueNumber><publicationDate>"+publicationDate+"</publicationDate><description>"+description+"</description><publicationPlace>"+publicationPlace+"</publicationPlace><size><value>"+filesize+"</value><unit>Bytes</unit></size><format>text/plain</format><edition>1</edition><discipline>softwareEngineering</discipline></resource>";
filesize
description
publicationPlace
publicationDate
title
doi
doi_records
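The record script above concatenates field values directly into the XML, so a title or description containing `&`, `<`, or `>` would yield malformed output. One possible guard, offered as an assumption rather than as part of the original workflow, is to escape each value before insertion. The class and method names here are hypothetical.

```java
final class XmlEscapeSketch {
    // Escapes the characters that are not allowed in XML element text.
    // The ampersand must be replaced first so that the entities introduced
    // by the later replacements are not escaped a second time.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    // Wraps an escaped value in an element with the given name.
    static String element(String name, String value) {
        return "<" + name + ">" + escape(value) + "</" + name + ">";
    }

    public static void main(String[] args) {
        // Hypothetical title containing characters that need escaping.
        System.out.println(element("title", "A & B <v2>"));
    }
}
```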
Combines the year, month, and day fields into a single year-month-day date string, in the format required for deposit.
// A. Wiggins 4/29/2009
publicationDate = year + "-" + month + "-" + day;
year
month
day
publicationDate
Ensures two-digit month values.
// A. Wiggins 4/29/2009
import java.util.regex.Pattern;
String jan_regex = "1";
String feb_regex = "2";
String mar_regex = "3";
String apr_regex = "4";
String may_regex = "5";
String jun_regex = "6";
String jul_regex = "7";
String aug_regex = "8";
String sep_regex = "9";
if (Pattern.matches(jan_regex, month)) {
formatted_month = "01";
} else {
if (Pattern.matches(feb_regex, month)) {
formatted_month = "02";
} else {
if (Pattern.matches(mar_regex, month)) {
formatted_month = "03";
} else {
if (Pattern.matches(apr_regex, month)) {
formatted_month = "04";
} else {
if (Pattern.matches(may_regex, month)) {
formatted_month = "05";
} else {
if (Pattern.matches(jun_regex, month)) {
formatted_month = "06";
} else {
if (Pattern.matches(jul_regex, month)) {
formatted_month = "07";
} else {
if (Pattern.matches(aug_regex, month)) {
formatted_month = "08";
} else {
if (Pattern.matches(sep_regex, month)) {
formatted_month = "09";
} else {
formatted_month = month;
}
}
}
}
}
}
}
}
}
month
formatted_month
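The nested chain above pads the single-digit months 1 through 9 and passes every other value through unchanged. Assuming that is the full intent, the same behavior can be sketched with one character-class match; the class and method names are hypothetical.

```java
final class MonthSketch {
    // Prefixes a zero when the month is exactly one digit from 1 to 9;
    // String.matches tests the whole string, so "10" to "12" and any
    // already padded value are returned unchanged.
    static String formatMonth(String month) {
        return month.matches("[1-9]") ? "0" + month : month;
    }

    public static void main(String[] args) {
        System.out.println(formatMonth("4"));
        System.out.println(formatMonth("11"));
    }
}
```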
Text output: the XML import file assembled from the ePrints metadata records.