Evaluation
Gold standards for matching are simple CSV files that list two record IDs and a boolean flag specifying whether the match is correct or not:
Dataset1_record1,Dataset2_record2,true
Dataset1_record2,Dataset5_record5,true
Dataset1_record1,Dataset2_record1,false
Dataset1_record2,Dataset2_record7,false
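To make the file layout concrete, the following is a minimal sketch of how such lines could be read into (ID, ID, flag) triples. The class `GoldStandardLines` and its `Entry` type are hypothetical names used only for illustration and are not part of the framework, which loads the file through `MatchingGoldStandard.loadFromCSVFile`. The sketch assumes plain comma separation without quoting.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the framework.
class GoldStandardLines {

    // One parsed row: two record IDs and the match flag.
    static final class Entry {
        final String id1;
        final String id2;
        final boolean isMatch;

        Entry(String id1, String id2, boolean isMatch) {
            this.id1 = id1;
            this.id2 = id2;
            this.isMatch = isMatch;
        }
    }

    // Parses lines of the form "id1,id2,true|false"; blank lines are skipped.
    // Assumption: IDs contain no commas and no quoting is used.
    static List<Entry> parse(List<String> lines) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            String[] parts = line.split(",");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected 3 columns: " + line);
            }
            entries.add(new Entry(parts[0].trim(), parts[1].trim(),
                    Boolean.parseBoolean(parts[2].trim())));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<Entry> entries = parse(Arrays.asList(
                "Dataset1_record1,Dataset2_record2,true",
                "Dataset1_record1,Dataset2_record1,false"));
        for (Entry e : entries) {
            System.out.println(e.id1 + " <-> " + e.id2 + " : " + e.isMatch);
        }
    }
}
```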
The gold standards can either be complete or partial:
Complete Gold Standard: Contains every correct match between the data sets. All listed correspondences must be marked as correct (the third value is "true"); any correspondence that is not listed is treated as incorrect.
MatchingGoldStandard gs = new MatchingGoldStandard();
gs.loadFromCSVFile(new File("complete.csv"));
gs.setComplete(true);
Partial Gold Standard: Contains positive and negative examples for matches, indicated by the flag "true" or "false" as the third value. Only correspondences that are included in the partial gold standard are evaluated.
MatchingGoldStandard gs = new MatchingGoldStandard();
gs.loadFromCSVFile(new File("partial.csv"));
The evaluation of a matching result is performed by the matching evaluator:
MatchingEvaluator<Record, Attribute> evaluator =
    new MatchingEvaluator<Record, Attribute>(true);
Performance perf = evaluator.evaluateMatching(correspondences.get(), gs);
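The difference between complete and partial gold standards can be illustrated with a small, framework-independent sketch. It is not the implementation of `MatchingEvaluator`; the class `MatchingScoreSketch`, the `"id1|id2"` pair encoding, and the exact counting rules are assumptions made for illustration. Under a partial gold standard, predicted pairs that are not listed are ignored; under a complete one, they count as incorrect.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; names and counting rules are assumptions.
class MatchingScoreSketch {

    // Returns {precision, recall, f1}. Pairs are encoded as "id1|id2" (assumed encoding).
    static double[] score(List<String> predicted, Set<String> positives,
                          Set<String> negatives, boolean complete) {
        int correct = 0;    // predicted pairs that are positive examples
        int evaluated = 0;  // predicted pairs that the gold standard can judge
        for (String pair : new HashSet<>(predicted)) { // de-duplicate predictions
            if (positives.contains(pair)) {
                correct++;
                evaluated++;
            } else if (complete || negatives.contains(pair)) {
                evaluated++; // judged, but wrong
            }
            // otherwise: unknown pair under a partial gold standard, ignored
        }
        double precision = evaluated == 0 ? 0.0 : (double) correct / evaluated;
        double recall = positives.isEmpty() ? 0.0 : (double) correct / positives.size();
        double f1 = (precision + recall) == 0.0
                ? 0.0
                : 2 * precision * recall / (precision + recall);
        return new double[] {precision, recall, f1};
    }

    public static void main(String[] args) {
        Set<String> positives = new HashSet<>(Arrays.asList("a1|b2", "a2|b5"));
        Set<String> negatives = new HashSet<>(Arrays.asList("a1|b1", "a2|b7"));
        List<String> predicted = Arrays.asList("a1|b2", "a1|b1", "a3|b9");

        // "a3|b9" is ignored when partial, but counts as wrong when complete.
        System.out.println(Arrays.toString(score(predicted, positives, negatives, false)));
        System.out.println(Arrays.toString(score(predicted, positives, negatives, true)));
    }
}
```

In this sketch the unlisted pair `a3|b9` lowers precision only when the gold standard is declared complete, which is why `setComplete(true)` should be used only if the file really lists every correct match.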
For data fusion, a gold standard is just another dataset. If the fused values are the same as the values in this dataset, they are evaluated as correct. The connection is made via the record IDs in the datasets.
// load the gold standard
DataSet<Movie, Attribute> gs = new FusableDataSet<>();
new MovieXMLReader().loadFromXML(new File("fused.xml"), "/movies/movie", gs);
// evaluate
DataFusionEvaluator<Movie, Attribute> evaluator = new DataFusionEvaluator<>(
    strategy,
    new RecordGroupFactory<Movie, Attribute>());
double accuracy = evaluator.evaluate(fusedDataSet, gs, null);
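The idea behind this accuracy value can be sketched without the framework: records are joined on their IDs, and each attribute value in the gold standard is compared with the fused value. The class `FusionAccuracySketch`, the map-based record representation, and the use of exact string equality are assumptions for illustration; the actual comparison used by `DataFusionEvaluator` is determined by the fusion strategy passed to it and may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Illustrative sketch only; not the framework's implementation.
class FusionAccuracySketch {

    // Records are modelled as recordId -> (attributeName -> value).
    // Accuracy = matching gold attribute values / compared gold attribute values.
    static double accuracy(Map<String, Map<String, String>> fused,
                           Map<String, Map<String, String>> gold) {
        int correct = 0;
        int total = 0;
        for (Map.Entry<String, Map<String, String>> goldRecord : gold.entrySet()) {
            Map<String, String> fusedRecord = fused.get(goldRecord.getKey());
            if (fusedRecord == null) {
                continue; // no fused record with this ID, nothing to compare
            }
            for (Map.Entry<String, String> attribute : goldRecord.getValue().entrySet()) {
                total++;
                // Assumption: exact equality decides correctness.
                if (Objects.equals(attribute.getValue(), fusedRecord.get(attribute.getKey()))) {
                    correct++;
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        Map<String, String> goldMovie = new HashMap<>();
        goldMovie.put("title", "Alien");
        goldMovie.put("year", "1979");
        Map<String, Map<String, String>> gold = new HashMap<>();
        gold.put("movie_1", goldMovie);

        Map<String, String> fusedMovie = new HashMap<>();
        fusedMovie.put("title", "Alien");
        fusedMovie.put("year", "1980"); // differs from the gold standard
        Map<String, Map<String, String>> fused = new HashMap<>();
        fused.put("movie_1", fusedMovie);

        System.out.println(accuracy(fused, gold)); // one of two values matches
    }
}
```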