Retrieves data from FLOSSmole and the Notre Dame SourceForge repository to compute project statistics based on releases, downloads, and project lifespan. These statistics are then used to classify projects according to the criteria established in English & Schweik; the comparison criteria are parameterized so that a different set of criterion thresholds can be used to evaluate project characteristics.
Length of time after which a project is considered abandoned if no releases have been made by then. Unit: days, integer values only.
365
Desired minimum time between releases, so that releases are not made "too fast" for a sustainable rate of growth. Unit: days, integer values only.
183
11
The threshold for how long a project may remain in the "initiation" stage without having produced a release and still be considered not abandoned. Unit: days, integer values only.
365
Switches between three methods of deriving the release rate value that is compared to a threshold to determine whether releases are too frequent for sustainable growth. Integer values only: 1 for first_last, 2 for recent_density, or 3 for average_rate.
2
Minimum number of releases to be considered a success.
3
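The thresholds above are expressed in days, but the criterion scripts below compare them against durations computed in seconds from release timestamps, multiplying each threshold by 86400 first. A minimal sketch of that convention (plain Java; the variable values are hypothetical):

```java
public class ThresholdDemo {
    // The criterion scripts store thresholds in days but compare them
    // against durations in seconds, so each threshold is multiplied by 86400.
    static double daysToSeconds(double days) {
        return days * 86400.0;
    }

    public static void main(String[] args) {
        double mortalityCutoff = daysToSeconds(365); // abandonment threshold
        long timeSinceLastRelease = 40000000L;       // hypothetical value, in seconds
        String mortality = (timeSinceLastRelease <= mortalityCutoff) ? "active" : "inactive";
        System.out.println(mortality); // prints "inactive" (40,000,000 s > 365 d)
    }
}
```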
Converts the classtype output of the iterated procedure from list format to CSV.
,
org.embl.ebi.escience.scuflworkers.java.StringListMerge
Converts the stage output of the iterated procedure from list format to CSV.
,
org.embl.ebi.escience.scuflworkers.java.StringListMerge
Author: Andrea Wiggins
Provides simple proportions for the stages of projects as output from the classification.
#A. Wiggins 6/16/08
suppressPackageStartupMessages(library(Design))
col.classes <- c("character")
col.names <- c("stages")
stage <- read.csv(textConnection(stages), header=FALSE, row.names=NULL, col.names=col.names, colClasses=col.classes)
# tabulate the stages and convert counts to proportions
analysis <- table(stage$stages)
proportion <- analysis / sum(analysis)
analysis_output <- paste(capture.output(proportion), collapse="\n")
stages
analysis_output
Author: Andrea Wiggins
Uses the output of several criterion tests to determine the classification for a given SourceForge project. The if/else statements form a truth table of possible values based on the classification scheme in the English & Schweik article.
// Truth table of classification outcomes from English & Schweik;
// strings are compared with equals() rather than ==.
if (releases.equals("no.releases") && downloads.equals("0") && web_site.equals("true")) {
    classtype = "unclassifiable";
} else if (releases.equals("no.releases") && stage.equals("growth")) {
    classtype = "TI";
} else if (releases.equals("no.releases") && stage.equals("initiation")) {
    classtype = "II";
} else if (releases.equals("not.enough.releases") && release_mortality.equals("active") && usage.equals("downloaded")) {
    classtype = "IG";
} else if (releases.equals("not.enough.releases") && release_mortality.equals("active") && usage.equals("not.downloaded")) {
    classtype = "TG";
} else if (releases.equals("not.enough.releases") && release_mortality.equals("inactive") && usage.equals("downloaded")) {
    classtype = "TG";
} else if (releases.equals("not.enough.releases") && release_mortality.equals("inactive") && usage.equals("not.downloaded")) {
    classtype = "TG";
} else if (releases.equals("enough.releases") && release_mortality.equals("active") && release_rate.equals("ok.release.rate") && usage.equals("downloaded")) {
    classtype = "SG";
} else if (releases.equals("enough.releases") && release_mortality.equals("active") && release_rate.equals("ok.release.rate") && usage.equals("not.downloaded")) {
    classtype = "TG";
} else if (releases.equals("enough.releases") && release_mortality.equals("active") && release_rate.equals("fast.release.rate") && usage.equals("downloaded")) {
    classtype = "IG";
} else if (releases.equals("enough.releases") && release_mortality.equals("active") && release_rate.equals("fast.release.rate") && usage.equals("not.downloaded")) {
    classtype = "TG";
} else if (releases.equals("enough.releases") && release_mortality.equals("inactive") && release_rate.equals("ok.release.rate") && usage.equals("downloaded")) {
    classtype = "SG";
} else if (releases.equals("enough.releases") && release_mortality.equals("inactive") && release_rate.equals("ok.release.rate") && usage.equals("not.downloaded")) {
    classtype = "TG";
} else if (releases.equals("enough.releases") && release_mortality.equals("inactive") && release_rate.equals("fast.release.rate") && usage.equals("downloaded")) {
    classtype = "TG";
} else if (releases.equals("enough.releases") && release_mortality.equals("inactive") && release_rate.equals("fast.release.rate") && usage.equals("not.downloaded")) {
    classtype = "TG";
} else {
    classtype = "other";
}
stage
usage
releases
release_mortality
release_rate
web_site
downloads
classtype
Author: Andrea Wiggins
Assembles the outputs of the analysis procedures together with the raw criterion data and writes all of the relevant variables to a single CSV file.
// Sanity check: all input lists should be the same length. This check
// caused errors in this workflow (though it works in the GetData workflow),
// so it is left disabled.
delim = ",";
firstCount = downloads_list.size();
//if (lifespan_list.size() != firstCount ||
//    release_count_list.size() != firstCount ||
//    time_last_current_list.size() != firstCount ||
//    release_density_list.size() != firstCount) {
//    throw new Exception("Input lists must be of same length");
//}
out = "sf_unixname,downloads,downloads_test,lifespan,stages,release_count,release_count_test,time_last_current,mortality_test,recent_release_density,time_first_last,release_lag_test,has_sf_url,classification\n";
for (i = 0; i < firstCount; i++) {
    out = out + sf_unixname_list.get(i) + delim
        + downloads_list.get(i) + delim
        + downloads_test_list.get(i) + delim
        + lifespan_list.get(i) + delim
        + stages_list.get(i) + delim
        + release_count_list.get(i) + delim
        + release_count_test_list.get(i) + delim
        + time_last_current_list.get(i) + delim
        + mortality_test_list.get(i) + delim
        + release_density_list.get(i) + delim
        + time_last_first_list.get(i) + delim
        + release_lag_test_list.get(i) + delim
        + has_sf_url_list.get(i) + delim
        + classification_list.get(i) + "\n";
}
out_csv = out;
downloads_list
lifespan_list
sf_unixname_list
release_count_list
time_last_first_list
release_density_list
time_last_current_list
has_sf_url_list
stages_list
classification_list
release_lag_test_list
release_count_test_list
downloads_test_list
mortality_test_list
out_csv
Author: Andrea Wiggins
Provides simple proportions for the classes of projects as output from the classification.
#A. Wiggins 6/16/08
suppressPackageStartupMessages(library(Design))
col.classes <- c("character")
col.names <- c("class")
classtype <- read.csv(textConnection(classtypes), header=FALSE, row.names=NULL, col.names=col.names, colClasses=col.classes)
# tabulate the classes and convert counts to proportions
analysis <- table(classtype$class)
proportion <- analysis / sum(analysis)
analysis_output <- paste(capture.output(proportion), collapse="\n")
classtypes
analysis_output
Author: Andrea Wiggins
For each project, determines whether the number of releases meets the threshold value for minimum number of releases.
release_count_thresholdD = Double.valueOf(release_count_threshold);
// convert input; an empty string means missing data
if (num_releases.equals("")) {
    num_releasesD = null;
} else {
    num_releasesD = Double.valueOf(num_releases);
}
// release count test
if (num_releasesD == null) {
    releases = "null";
} else if (num_releasesD == 0.0) {
    releases = "no.releases";
} else if (num_releasesD >= release_count_thresholdD) {
    releases = "enough.releases";
} else {
    releases = "not.enough.releases";
}
System.out.println(releases);
release_count_threshold
num_releases
releases
Author: Andrea Wiggins
For each project, determines whether the lifespan of the project (aggregate data from FLOSSmole: data collection date minus founding date) meets the threshold between initiation phase and growth phase.
initiation_thresholdD = Double.valueOf(initiation_threshold);
// convert input; an empty string means missing data
if (lifespan.equals("")) {
    lifespanD = null;
} else {
    lifespanD = Double.valueOf(lifespan);
}
if (lifespanD == null) {
    stage = "null";
} else if (lifespanD >= initiation_thresholdD) {
    stage = "growth";
} else {
    stage = "initiation";
}
System.out.println(stage);
lifespan
initiation_threshold
stage
Author: Andrea Wiggins
For each project, determines whether the amount of time over which a given number of releases has occurred exceeds a threshold, which is intended to indicate an appropriate amount of time between releases for sustainable project activity, i.e. not too fast. There are three different methods to compare release rate and the lag threshold, based on: 1) "first_last" time elapsed between first and most recent release, 2) "recent_density" time elapsed between last X releases (where X is the workflow variable to indicate minimum number of releases for success), and 3) "average_releases" average time between each release since the first one. Note that method 3 will have a significantly different appropriate value for the release_lag_threshold variable, as it is based on average time between individual releases rather than aggregate time between several releases.
release_lag_thresholdD = Double.valueOf(release_lag_threshold);
threshold = release_lag_thresholdD * 86400; // convert days to seconds
release_rate_typeI = Integer.valueOf(release_rate_type);
// convert inputs; empty strings mean missing data
if (first_last_release.equals("")) {
    first_last_releaseD = null;
} else {
    first_last_releaseD = Double.valueOf(first_last_release);
}
if (recent_release_density.equals("")) {
    recent_release_densityD = null;
} else {
    recent_release_densityD = Double.valueOf(recent_release_density);
}
if (num_releases.equals("")) {
    num_releasesD = null;
} else {
    num_releasesD = Double.valueOf(num_releases);
}
// average time between consecutive releases
if (first_last_releaseD != null && num_releasesD != null) {
    average_rate = first_last_releaseD / (num_releasesD - 1);
} else {
    average_rate = null;
}
// derive the rate label for each method, guarding against missing inputs
recent_density = null;
if (recent_release_densityD != null) {
    recent_density = (recent_release_densityD < threshold) ? "ok.release.rate" : "fast.release.rate";
}
first_last = null;
if (first_last_releaseD != null) {
    first_last = (first_last_releaseD < threshold) ? "fast.release.rate" : "ok.release.rate";
}
average_releases = null;
if (average_rate != null) {
    average_releases = (average_rate < threshold) ? "fast.release.rate" : "ok.release.rate";
}
// select the label for the requested rate type
if (release_rate_typeI == 2 && recent_release_densityD != null) {
    release_rate = recent_density;
} else if (release_rate_typeI == 1 && first_last_releaseD != null) {
    release_rate = first_last;
} else if (release_rate_typeI == 3 && average_rate != null) {
    release_rate = average_releases;
} else {
    release_rate = "other";
}
System.out.println(release_rate);
release_rate_type
first_last_release
recent_release_density
num_releases
release_lag_threshold
release_rate
Author: Andrea Wiggins
For each project, determines whether the number of aggregate downloads for the project exceeds a minimum threshold for usefulness. This would be an interesting place to substitute a scaling function option for the download_threshold value, perhaps adjusting the threshold according to the project's lifespan or number of releases.
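One hedged sketch of such a scaling substitute (the class, method names, and linear-in-lifespan scaling are assumptions for illustration, not part of the workflow): the download threshold could grow with project lifespan, so that long-lived projects must show proportionally more uptake to count as "downloaded".

```java
public class ScaledDownloadThreshold {
    // Hypothetical scaling: the base threshold applies to a project up to one
    // year old, and grows linearly with lifespan thereafter.
    static double scaledThreshold(double baseThreshold, double lifespanDays) {
        double years = Math.max(lifespanDays / 365.0, 1.0); // never below the base
        return baseThreshold * years;
    }

    public static void main(String[] args) {
        double base = 11; // assumed base download_threshold value
        System.out.println(scaledThreshold(base, 365)); // one-year project: 11.0
        System.out.println(scaledThreshold(base, 730)); // two-year project: 22.0
    }
}
```

The scaled value would simply replace download_thresholdD in the comparison below; a similar function could scale by number of releases instead of lifespan.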
download_thresholdD = Double.valueOf(download_threshold);
// convert input; an empty string means missing data
if (downloads.equals("")) {
    downloadsD = null;
} else {
    downloadsD = Double.valueOf(downloads);
}
if (downloadsD == null) {
    usage = "null";
} else if (downloadsD >= download_thresholdD) {
    usage = "downloaded";
} else {
    usage = "not.downloaded";
}
System.out.println(usage);
downloads
download_threshold
usage
Author: Andrea Wiggins
For each project, determines whether the time between the last release and the date of data collection is within a threshold limit that indicates whether the project is active or inactive.
mortality_thresholdD = Double.valueOf(mortality_threshold);
mortality = mortality_thresholdD * 86400; // convert days to seconds
// convert input; an empty string means missing data
if (time_since_last_release.equals("")) {
    time_since_last_releaseD = null;
} else {
    time_since_last_releaseD = Double.valueOf(time_since_last_release);
}
if (time_since_last_releaseD == null) {
    release_mortality = "null";
} else if (time_since_last_releaseD <= mortality) {
    release_mortality = "active";
} else {
    release_mortality = "inactive";
}
System.out.println(release_mortality);
mortality_threshold
time_since_last_release
release_mortality
Author: James Howison
Procedure to fetch data from FLOSSmole and the Notre Dame SourceForge dumps based on the SourceForge unixname for a given project. Input is a list of project SourceForge unixnames, and a threshold value for the "recent release density" value. Not currently suited for running large batches of projects.
This workflow gets a release list for a project. It uses a wsdl defined proxy to access the Notre Dame Sourceforge repository.
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;
dateAndTime = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
dateAndTime.setTimeZone(TimeZone.getTimeZone("UTC")); // the literal 'Z' implies UTC
long milliSecs = Long.valueOf(epoch) * 1000; // Date wants milliseconds
epochDate = new Date(milliSecs);
xsdDateTime = dateAndTime.format(epochDate);
epoch
xsdDateTime
releases_xml = "<releases>\n";
for (String datetime : datetimes) {
releases_xml = releases_xml + "<release>"+datetime+"</release>\n";
}
releases_xml = releases_xml + "</releases>\n";
datetimes
releases_xml
where_clause = "schema.groups.unix_group_name = '"+sf_unixname+"' AND schema.frs_package.group_id = schema.groups.group_id AND schema.frs_package.package_id = schema.frs_release.package_id";
sf_unixname
where_clause
/records/record/release-date-epoch
net.sourceforge.taverna.scuflworkers.xml.XPathTextWorker
temp_username
temp_password
sf1105
schema.frs_release.release_date AS release_date_epoch
schema.groups, schema.frs_package, schema.frs_release
true
http://rails-test.floss.syr.edu/notre_dame_sf/wsdl
MakeQuery
This workflow takes a sf_unixname and looks up available data for classification from Notre Dame and FLOSSmole. It returns three items from the project_statistics table: downloads, lifespan, and data_for_date, which indicates the datetime for which the download aggregate and the lifespan are relevant.
queryString = "SELECT s.downloads, s.lifespan, s.data_for_date FROM ossmole_merged.project_statistics AS s WHERE datasource_id = 4 AND s.is_all_time = 1 AND s.proj_unixname = '"+sf_unixname+"'";
sf_unixname
queryString
jdbc:mysql://floss.syr.edu:3366/ossmole_merged
com.mysql.jdbc.Driver
public_access
digging!
SELECT s.downloads, s.lifespan, s.data_for_date FROM ossmole_merged.project_statistics AS s WHERE datasource_id = 4 AND s.is_all_time = 1 AND s.proj_unixname = 'gaim'
false
net.sourceforge.taverna.scuflworkers.jdbc.SQLQueryWorker
aggregate_downloads = result_row.get(0).get(0);
lifespan_days = result_row.get(0).get(1);
data_for_date = result_row.get(0).get(2);
result_row
aggregate_downloads
lifespan_days
data_for_date
import java.text.SimpleDateFormat;
import java.util.TimeZone;
xsd = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
sql = new SimpleDateFormat("yyyy-MM-dd");
// parse and format both in UTC, since the literal 'Z' implies UTC
xsd.setTimeZone(TimeZone.getTimeZone("UTC"));
sql.setTimeZone(TimeZone.getTimeZone("UTC"));
sqlDate = sql.parse(sql_date);
xsd_datetime = xsd.format(sqlDate);
sql_date
xsd_datetime
url_query_string = "SELECT p.real_url FROM projects AS p WHERE p.datasource_id = 28 AND proj_unixname = '"+sf_unixname+"'";
sf_unixname
url_query_string
jdbc:mysql://floss.syr.edu:3366/ossmole_merged
com.mysql.jdbc.Driver
public_access
digging!
false
net.sourceforge.taverna.scuflworkers.jdbc.SQLQueryWorker
import java.util.regex.Pattern;
real_url = result_row.get(0).get(0);
String regex = "http:\\/\\/"+Pattern.quote(sf_unixname)+"\\.sourceforge\\.net\\/?";
if (Pattern.matches(regex, real_url)) {
has_sf_url = "true";
} else {
has_sf_url = "false";
}
result_row
sf_unixname
has_sf_url
This should be a simple workflow to calculate a few summary stats from a list of releases. The releases are pulled out of an XML document (string) that describes elements of the project, e.g.:
<project_info>
<sf_unixname />
<downloads_by_date />
<lifespan_by_date />
<releases>
<release datetime="2004-01-08T12:12:50Z" />
<release datetime="2004-01-08T12:12:40Z" />
<release datetime="2004-01-08T12:12:30Z" />
<release datetime="2004-01-08T12:12:20Z" />
<release datetime="2004-01-08T12:12:10Z" />
<release datetime="2004-01-08T12:12:00Z" />
</releases>
</project_info>
This workflow calculates the summary of release history needed for the English and Schweik workflow. The outputs are a simple count of releases (before a cutoff date), the seconds elapsed between the last release and the cutoff date, and the recent release density, which is the aggregate time elapsed between the Nth-last release and the last release (prior to the cutoff date). N for the density is defined by the num_releases_for_density input.
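The scripts below compute these values piecewise across several processors. As a consolidated sketch (plain Java with java.time rather than BeanShell; the class and method names are hypothetical), the three summary statistics can be derived from a list of release instants and a cutoff:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ReleaseSummary {
    // Given release instants and a cutoff, compute: the release count before
    // the cutoff, seconds from the last release to the cutoff, and the
    // "recent release density" -- seconds between the Nth-last and last
    // release. Assumes at least n releases fall before the cutoff.
    static long[] summarize(List<Instant> releases, Instant cutoff, int n) {
        List<Instant> kept = new ArrayList<>();
        for (Instant r : releases) {
            if (r.isBefore(cutoff)) kept.add(r);
        }
        Collections.sort(kept);
        long count = kept.size();
        Instant last = kept.get(kept.size() - 1);
        long sinceLast = Duration.between(last, cutoff).getSeconds();
        Instant nthLast = kept.get(kept.size() - n);
        long density = Duration.between(nthLast, last).getSeconds();
        return new long[] { count, sinceLast, density };
    }

    public static void main(String[] args) {
        List<Instant> rel = List.of(
                Instant.parse("2004-01-08T12:12:00Z"),
                Instant.parse("2004-01-08T12:12:30Z"),
                Instant.parse("2004-01-08T12:12:50Z"));
        long[] s = summarize(rel, Instant.parse("2004-01-08T12:13:00Z"), 3);
        System.out.println(s[0] + "," + s[1] + "," + s[2]); // prints "3,10,50"
    }
}
```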
release_count = releases.size();
releases
release_count
import java.text.SimpleDateFormat;
dateAndTime = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
dates = new TreeSet(); // TreeSet keeps the dates sorted ascending
for (dateString : datetimes) {
    dates.add(dateAndTime.parse(dateString));
}
chosenDate = dates.first(); // earliest release
chosen_datetime = dateAndTime.format(chosenDate);
datetimes
chosen_datetime
import java.text.SimpleDateFormat;
dateAndTime = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
yingDate = dateAndTime.parse(datetime_1);
yangDate = dateAndTime.parse(datetime_2);
distance = ( yingDate.getTime() - yangDate.getTime() ) / 1000;
// Need absolute value
seconds_between = Math.abs(distance);
datetime_1
datetime_2
seconds_between
1
import java.text.SimpleDateFormat;
dateAndTime = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
dates = new ArrayList();
for (dateString : datetimes) {
    dates.add(dateAndTime.parse(dateString));
}
// sort newest first; index_wanted = 1 selects the most recent release
Collections.sort(dates);
Collections.reverse(dates);
chosenDate = dates.get(Integer.valueOf(index_wanted) - 1);
chosen_datetime = dateAndTime.format(chosenDate);
index_wanted
datetimes
chosen_datetime
import java.text.SimpleDateFormat;
dateAndTime = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
dates = new ArrayList();
cutoffDate = dateAndTime.parse(cutoff_date);
for (dateString : release_datetimes) {
currDate = dateAndTime.parse(dateString);
if (currDate.before(cutoffDate)) {
dates.add(currDate);
}
}
trunc_release_datetimes = new ArrayList();
for (Date myDate : dates ) {
trunc_release_datetimes.add(dateAndTime.format(myDate));
}
release_datetimes
cutoff_date
trunc_release_datetimes
import java.text.SimpleDateFormat;
dateAndTime = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
yingDate = dateAndTime.parse(datetime_1);
yangDate = dateAndTime.parse(datetime_2);
distance = ( yingDate.getTime() - yangDate.getTime() ) / 1000;
// Need absolute value
seconds_between = Math.abs(distance);
datetime_1
datetime_2
seconds_between
import java.text.SimpleDateFormat;
dateAndTime = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
dates = new ArrayList();
for (dateString : datetimes) {
    dates.add(dateAndTime.parse(dateString));
}
// sort newest first; index_wanted = 1 selects the most recent release
Collections.sort(dates);
Collections.reverse(dates);
chosenDate = dates.get(Integer.valueOf(index_wanted) - 1);
chosen_datetime = dateAndTime.format(chosenDate);
index_wanted
datetimes
chosen_datetime
/releases/release
net.sourceforge.taverna.scuflworkers.xml.XPathTextWorker
import java.text.SimpleDateFormat;
dateAndTime = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
yingDate = dateAndTime.parse(datetime_1);
yangDate = dateAndTime.parse(datetime_2);
distance = ( yingDate.getTime() - yangDate.getTime() ) / 1000;
// Need absolute value
seconds_between = Math.abs(distance);
datetime_1
datetime_2
seconds_between
l(text/plain)
List of SourceForge unixnames for the projects of interest; accesses data for time periods through mid-2005.
Proportions of each class of project in the sample. Potential values include: SG (success-growth), TG (tragedy-growth), IG (indeterminate-growth), TI (tragedy-initiation), II (indeterminate-initiation), unclassifiable (has 0 downloads, 0 releases, and a non-SourceForge web site), and other (cannot be classified). Any project producing an "other" value should be closely scrutinized to determine why it does not fit into the classification scheme; the most likely reason would be the presence of null values due to missing data from one of the repositories.
Proportions of projects classified as being in the growth or initiation phase.
CSV output of all the data used for classification, the classification criterion values, and the final class assigned to each project. Suitable for use with R or Excel for later ad-hoc analysis.