gec_metrics.meta_eval.gjg module

class gec_metrics.meta_eval.gjg.MetaEvalGJG(config: Config = None)[source]

Bases: MetaEvalBase

class GJGSentenceCorrOutput(corr: Corr = None)[source]

Bases: Output

The dataclass to store the meta-evaluation results.

Parameters:

corr (MetaEvalBase.Corr) – The sentence-level correlation with human evaluation.

corr: Corr = None

class GJGSystemCorrOutput(ew: Corr = None, ts: Corr = None)[source]

Bases: Output

The dataclass to store the meta-evaluation results.

Parameters:

ew (MetaEvalBase.Corr) – The correlation using Expected Wins-based human evaluation.
ts (MetaEvalBase.Corr) – The correlation using TrueSkill-based human evaluation.

ew: Corr = None

ts: Corr = None

class GJGWindowAnalysisSystemCorrOutput(ew: dict = None, ts: dict = None)[source]

Bases: Output

The dataclass to store the meta-evaluation results.

Parameters:

ew (dict) – The correlations using Expected Wins-based human evaluation, keyed by rank window.
ts (dict) – The correlations using TrueSkill-based human evaluation, keyed by rank window.

ew: dict = None

ts: dict = None

MODELS = ['AMU', 'RAC', 'CAMB', 'CUUI', 'POST', 'UFC', 'PKU', 'UMC', 'IITB', 'SJTU', 'INPUT', 'NTHU', 'IPN']

SCORE_ID = ['ew', 'ts']

corr_sentence(metric: MetricBase) → GJGSentenceCorrOutput[source]

Compute sentence-level correlations.

Parameters: metric (MetricBase) – The metric to be evaluated.
Returns: The correlations.
Return type: GJGSentenceCorrOutput

corr_system(metric: MetricBase, aggregation='default') → GJGSystemCorrOutput[source]

Compute system-level correlations.

Parameters: metric (MetricBase) – The metric to be evaluated.
Returns: The correlations.
Return type: GJGSystemCorrOutput
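
As a rough illustration of what a system-level correlation measures, the sketch below compares one metric score per system against one human score per system. It assumes Pearson and Spearman coefficients, which are a common choice for this kind of meta-evaluation; the fields that Corr actually stores are not described here. The helpers pearson, ranks and spearman and all score values are hypothetical and are not part of gec_metrics.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks in ascending order; ties share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed scores."""
    return pearson(ranks(x), ranks(y))


# Invented toy scores for four systems (not real GJG values).
human_ts = [0.27, 0.09, -0.05, -0.31]
metric_scores = [0.61, 0.52, 0.55, 0.40]

print("pearson :", round(pearson(human_ts, metric_scores), 3))
print("spearman:", round(spearman(human_ts, metric_scores), 3))
```

The metric swaps the second and third systems relative to the human ranking, so the rank correlation is high but below 1.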

load_sentence_data() → dict[str, list][source]

Load sentence-level meta-evaluation data.

Returns:

The meta-evaluation data containing the following keys:

"sources": Source sentences. The shape is (num_sentences,).
"hypotheses": Hypothesis sentences. The shape is (num_systems, num_sentences).
"references": Reference sentences. The shape is (num_references, num_sentences).
"models": The model names. The index corresponds to the first dimension of "hypotheses".
"human_scores": Human scores for the systems.
- "ew" is the human Expected Wins scores.
  The shape is (num_sentences, num_systems, num_systems).
- "ts" is the human TrueSkill scores.
  The shape is (num_sentences, num_systems, num_systems).

Return type:

dict[str, list]

load_system_data() → dict[str, list][source]

Load system-level meta-evaluation data.

Returns:

The meta-evaluation data containing the following keys:

"sources": Source sentences. The shape is (num_sentences,).
"hypotheses": Hypothesis sentences. The shape is (num_systems, num_sentences).
"references": Reference sentences. The shape is (num_references, num_sentences).
"models": The model names. The index corresponds to the first dimension of "hypotheses".
"human_scores": Human scores for the systems. The shape is (num_systems,).
- "ew" is the human Expected Wins scores.
- "ts" is the human TrueSkill scores.

Return type:

dict[str, list]
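
The index alignment described above can be sketched with a toy dictionary. This is a minimal illustration, assuming that "human_scores" is a nested dictionary keyed by "ew" and "ts"; the sentences, model names and scores are invented, and check_layout, outputs_of and ranking are hypothetical helpers rather than gec_metrics functions.

```python
# Toy stand-in for the system-level data; every value here is invented.
data = {
    "sources": ["He go to school .", "I has a pen ."],
    "hypotheses": [
        ["He goes to school .", "I have a pen ."],  # outputs of system "A"
        ["He go to school .", "I has a pen ."],     # outputs of system "B"
        ["He went to school .", "I have a pen ."],  # outputs of system "C"
    ],
    "references": [
        ["He goes to school .", "I have a pen ."],
    ],
    "models": ["A", "B", "C"],
    "human_scores": {"ew": [0.62, 0.31, 0.57], "ts": [0.21, -0.35, 0.14]},
}


def check_layout(d):
    """Verify the documented shapes and return (num_systems, num_sentences)."""
    n_sent = len(d["sources"])
    n_sys = len(d["models"])
    assert len(d["hypotheses"]) == n_sys
    assert all(len(h) == n_sent for h in d["hypotheses"])
    assert all(len(r) == n_sent for r in d["references"])
    assert all(len(s) == n_sys for s in d["human_scores"].values())
    return n_sys, n_sent


def outputs_of(d, model):
    """Look up a system's hypotheses through its position in "models"."""
    return d["hypotheses"][d["models"].index(model)]


def ranking(d, score_id):
    """Model names sorted from best to worst by the chosen human score."""
    pairs = zip(d["models"], d["human_scores"][score_id])
    return [m for m, _ in sorted(pairs, key=lambda p: p[1], reverse=True)]


print(check_layout(data))
print(outputs_of(data, "B"))
print(ranking(data, "ts"))
```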

load_xml(xml_path: str, target_models: list[str]) → dict[int, list[list[int]]][source]

Load an XML file.

Parameters:

xml_path (str) – Path to an XML file.
target_models (list[str]) – Model names to be evaluated.

Returns:

A dictionary of sentence-level human evaluation rankings, stored per source sentence and annotator. A ranking can be looked up as dict[src_id][annotator_id][system_id] = -rank. Note that each element is the negated rank, so a higher value means higher quality.

Return type:

dict[int, list[list[int]]]
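
Because the stored values are negated ranks, a plain "greater than" comparison tells which of two systems an annotator preferred. The sketch below shows one way to turn such a structure into pairwise preferences and to measure how often a metric agrees with them. It assumes every inner list covers all systems, which may not hold for the real data; the rankings, the metric scores and both helper functions are hypothetical.

```python
from itertools import combinations

# Toy stand-in: rankings[src_id][annotator_id][system_id] = -rank (invented values).
rankings = {
    0: [[-1, -2, -3], [-2, -1, -3]],  # two annotators judged source 0
    1: [[-1, -1, -2]],                # one annotator; systems 0 and 1 are tied
}


def pairwise_preferences(rankings):
    """Yield (src_id, better_system, worse_system) for every untied pair."""
    for src_id, annotators in rankings.items():
        for neg_ranks in annotators:
            for a, b in combinations(range(len(neg_ranks)), 2):
                if neg_ranks[a] > neg_ranks[b]:
                    yield src_id, a, b
                elif neg_ranks[b] > neg_ranks[a]:
                    yield src_id, b, a
                # Ties carry no preference and are skipped.


def pairwise_accuracy(rankings, metric_scores):
    """Fraction of human preferences that the metric scores agree with.

    metric_scores[src_id][system_id] is the metric's sentence-level score.
    """
    prefs = list(pairwise_preferences(rankings))
    agree = sum(metric_scores[s][w] > metric_scores[s][l] for s, w, l in prefs)
    return agree / len(prefs)


# Invented sentence-level metric scores for the same sources and systems.
metric_scores = {0: [0.9, 0.7, 0.2], 1: [0.5, 0.6, 0.1]}

print(len(list(pairwise_preferences(rankings))))   # 8
print(pairwise_accuracy(rankings, metric_scores))  # 0.875
```

The only disagreement comes from the second annotator of source 0, who preferred system 1 over system 0 while the metric scores system 0 higher.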

window_analysis_system(metric: MetricBase, window: int = 4, aggregation='default') → GJGWindowAnalysisSystemCorrOutput[source]

System-level window analysis.

Parameters:

metric (MetricBase) – The metric to be evaluated.
window (int) – The window size.

Returns:

The correlations.

Contains .ew and .ts.
Each is a dictionary: {(start_rank, end_rank): Corr}.

Return type:

GJGWindowAnalysisSystemCorrOutput
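
The idea of a window analysis can be sketched as follows: sort the systems by human score, slide a fixed-size window over the ranking, and compute a correlation inside each window. This is a minimal sketch under assumptions, not the library's implementation. It uses Pearson correlation only, treats ranks as 0-based and inclusive, and works on invented scores; whether the real keys follow the same convention is not stated here. The function names are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def window_analysis(human, metric, window=4):
    """Return {(start_rank, end_rank): correlation} for each sliding window.

    Systems are ordered by human score, best first. Ranks are 0-based and
    end_rank is inclusive (an assumption made for this sketch).
    """
    order = sorted(range(len(human)), key=lambda i: human[i], reverse=True)
    result = {}
    for start in range(len(order) - window + 1):
        idx = order[start:start + window]
        result[(start, start + window - 1)] = pearson(
            [human[i] for i in idx], [metric[i] for i in idx]
        )
    return result


# Invented scores for six systems (not real GJG values).
human = [0.30, 0.12, 0.05, -0.02, -0.15, -0.30]
metric = [0.70, 0.66, 0.60, 0.61, 0.50, 0.52]

for span, corr in window_analysis(human, metric, window=4).items():
    print(span, round(corr, 3))
```

With six systems and a window of 4 this yields three windows. Comparing the values across windows shows whether a metric separates closely ranked systems as well as it separates the whole set.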