gec_metrics.meta_eval.seeda module

class gec_metrics.meta_eval.seeda.MetaEvalSEEDA(config: Config = None)

Bases: MetaEvalBase

class Config(system: str = 'base')

Bases: object

system: str = 'base'
MODELS = ['BART', 'BERT-fuse', 'GECToR-BERT', 'GECToR-ens', 'GPT-3.5', 'INPUT', 'LM-Critic', 'PIE', 'REF-F', 'REF-M', 'Riken-Tohoku', 'T5', 'TemplateGEC', 'TransGEC', 'UEDIN-MS']
SCORE_ID = ['EW_edit', 'EW_sent', 'TS_edit', 'TS_sent']
class SEEDASentenceCorrOutput(sent: Corr = None, edit: Corr = None)

Bases: Output

The dataclass to store sentence-level correlations.

Parameters:
edit: Corr = None
sent: Corr = None
class SEEDASystemCorrOutput(ew_edit: Corr = None, ew_sent: Corr = None, ts_edit: Corr = None, ts_sent: Corr = None)

Bases: Output

The dataclass to store system-level correlations.

Parameters:
  • ew_sent (MetaEvalBase.Corr) – SEEDA-S correlation based on Expected Wins-based human evaluation.

  • ew_edit (MetaEvalBase.Corr) – SEEDA-E correlation based on Expected Wins-based human evaluation.

  • ts_sent (MetaEvalBase.Corr) – SEEDA-S correlation based on TrueSkill-based human evaluation.

  • ts_edit (MetaEvalBase.Corr) – SEEDA-E correlation based on TrueSkill-based human evaluation.

ew_edit: Corr = None
ew_sent: Corr = None
ts_edit: Corr = None
ts_sent: Corr = None
class SEEDAWindowAnalysisSystemCorrOutput(ew_edit: dict = None, ew_sent: dict = None, ts_edit: dict = None, ts_sent: dict = None)

Bases: Output

The dataclass to store system-level correlations from the window analysis.

Parameters:
  • ew_sent (dict) – SEEDA-S correlations based on Expected Wins-based human evaluation, one Corr per window.

  • ew_edit (dict) – SEEDA-E correlations based on Expected Wins-based human evaluation, one Corr per window.

  • ts_sent (dict) – SEEDA-S correlations based on TrueSkill-based human evaluation, one Corr per window.

  • ts_edit (dict) – SEEDA-E correlations based on TrueSkill-based human evaluation, one Corr per window.

ew_edit: dict = None
ew_sent: dict = None
ts_edit: dict = None
ts_sent: dict = None
corr_sentence(metric: MetricBase) → SEEDASentenceCorrOutput

Compute sentence-level correlations.

Parameters:

metric (MetricBase) – The metric to be evaluated.

Returns:

The correlations.

Return type:

SEEDASentenceCorrOutput

corr_system(metric: MetricBase, aggregation='default') → SEEDASystemCorrOutput

Compute system-level correlations.

Parameters:

metric (MetricBase) – The metric to be evaluated.

Returns:

The correlations.

Return type:

SEEDASystemCorrOutput
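
As a rough illustration of what a system-level correlation measures, the sketch below aggregates sentence-level metric scores into one score per system and correlates them with human scores. It does not call gec_metrics: the toy data, the mean aggregation, and the helper names (pearson, ranks, spearman) are all assumptions made for illustration, and the library's 'default' aggregation may differ.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def ranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

# Hypothetical toy data: 4 systems x 3 sentences of metric scores.
metric_sentence_scores = [
    [0.9, 0.8, 0.7],  # system A
    [0.5, 0.6, 0.4],  # system B
    [0.2, 0.3, 0.1],  # system C
    [0.6, 0.7, 0.8],  # system D
]
# Hypothetical human scores, one per system (e.g. one SCORE_ID such as "TS_edit").
human_scores = [0.8, 0.1, -0.5, 0.3]

# Assumed aggregation: average the sentence scores of each system.
metric_system_scores = [mean(scores) for scores in metric_sentence_scores]

system_pearson = pearson(metric_system_scores, human_scores)
system_spearman = spearman(metric_system_scores, human_scores)
print(system_pearson, system_spearman)
```

In this toy case the metric orders the four systems exactly as the human scores do, so the rank correlation is perfect while the Pearson correlation stays slightly below 1.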

load_sentence_data() → dict[str, int]

Load sentence-level meta-evaluation data.

Returns:

The meta-evaluation data containing the following keys:
  • "sources": Source sentences. The shape is (num_sentences, ).

  • "hypotheses": Hypothesis sentences. The shape is (num_systems, num_sentences).

  • "references": Reference sentences. The shape is (num_references, num_sentences).

  • "models": The model names. The index corresponds to the first dimension of "hypotheses".

  • "human_scores": Dictionary of human scores for the systems. The shape is (num_sentences, num_systems, num_systems).
    • "EW_edit": Expected Wins scores using edit-based human evaluation.

    • "EW_sent": Expected Wins scores using sentence-based human evaluation.

    • "TS_edit": TrueSkill scores using edit-based human evaluation.

    • "TS_sent": TrueSkill scores using sentence-based human evaluation.

Return type:

dict[str, list]

load_system_data() → dict[str, list]

Load system-level meta-evaluation data.

Returns:

The meta-evaluation data containing the following keys:
  • "sources": Source sentences. The shape is (num_sentences, ).

  • "hypotheses": Hypothesis sentences. The shape is (num_systems, num_sentences).

  • "references": Reference sentences. The shape is (num_references, num_sentences).

  • "models": The model names. The index corresponds to the first dimension of "hypotheses".

  • "human_scores": Dictionary of human scores. The shape is (num_systems, ).
    • "EW_edit": Expected Wins scores using edit-based human evaluation.

    • "EW_sent": Expected Wins scores using sentence-based human evaluation.

    • "TS_edit": TrueSkill scores using edit-based human evaluation.

    • "TS_sent": TrueSkill scores using sentence-based human evaluation.

Return type:

dict[str, list]
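
The layout described above can be sketched with a small hand-written dictionary. The sentences, model names, and scores below are invented for illustration and are not taken from SEEDA, and check_system_data is a hypothetical helper, not part of gec_metrics. It only verifies that the documented shapes agree with each other.

```python
# Hypothetical data that mimics the documented layout (values are made up).
data = {
    "sources": ["He go to school .", "I has a pen ."],
    "hypotheses": [
        ["He goes to school .", "I have a pen ."],  # outputs of models[0]
        ["He go to school .", "I has a pen ."],     # outputs of models[1]
    ],
    "references": [
        ["He goes to school .", "I have a pen ."],  # reference set 0
    ],
    "models": ["SYSTEM-A", "INPUT"],
    "human_scores": {
        "EW_edit": [0.7, 0.3],
        "EW_sent": [0.8, 0.2],
        "TS_edit": [0.5, -0.5],
        "TS_sent": [0.6, -0.6],
    },
}

def check_system_data(data):
    """Check the documented shapes and return (num_systems, num_sentences)."""
    num_sentences = len(data["sources"])
    num_systems = len(data["models"])
    # "hypotheses" is (num_systems, num_sentences).
    assert len(data["hypotheses"]) == num_systems
    assert all(len(hyps) == num_sentences for hyps in data["hypotheses"])
    # "references" is (num_references, num_sentences).
    assert all(len(refs) == num_sentences for refs in data["references"])
    # Each system-level human score list is (num_systems, ).
    for score_id, scores in data["human_scores"].items():
        assert len(scores) == num_systems, score_id
    return num_systems, num_sentences

shape = check_system_data(data)
print(shape)
```

A check of this kind is handy before pairing data["models"][i] with data["hypotheses"][i] and with the i-th entry of each human score list.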

load_xml(xml_path: str, target_models: list[str]) → dict[str, list[list[int]]]

Load an XML file.

Parameters:
  • xml_path (str) – Path to an XML file.

  • target_models (list[str]) – Model names to be evaluated.

Returns:

Dictionary containing sentence-level human evaluation rankings, stored per source and annotator. A ranking is accessed as dict[src_id][annotator_id][system_id] = -rank. Each element is the negated rank, so a higher value means higher quality.

Return type:

dict[int, list[list[int]]]
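
To make the negated-rank convention concrete, the sketch below builds a small dictionary in the documented shape and derives pairwise preferences from it. The rankings are invented, and pairwise_preferences is a hypothetical helper rather than a gec_metrics function.

```python
# Hypothetical rankings in the documented shape:
# rankings[src_id][annotator_id][system_id] = -rank
rankings = {
    0: [
        [-1, -3, -2],  # annotator 0: system 0 is 1st, system 2 is 2nd, system 1 is 3rd
        [-1, -2, -2],  # annotator 1: system 0 is 1st, systems 1 and 2 tie for 2nd
    ],
}

def pairwise_preferences(neg_ranks):
    """Return (better, worse) system index pairs; tied pairs are skipped."""
    prefs = []
    for i in range(len(neg_ranks)):
        for j in range(len(neg_ranks)):
            # A higher negated rank means higher quality.
            if neg_ranks[i] > neg_ranks[j]:
                prefs.append((i, j))
    return prefs

prefs_annotator0 = pairwise_preferences(rankings[0][0])
prefs_annotator1 = pairwise_preferences(rankings[0][1])
print(prefs_annotator0)
print(prefs_annotator1)
```

Because the values are negated ranks, a plain greater-than comparison picks the better system without any special handling of the rank direction.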

window_analysis_system(metric: MetricBase, window: int = 4, aggregation='default') → SEEDAWindowAnalysisSystemCorrOutput

System-level window analysis.

Parameters:
  • metric (MetricBase) – The metric to be evaluated.

  • window (int) – The window size.

Returns:

The correlations.
  • Contains .ew_edit, .ew_sent, .ts_edit, .ts_sent.

  • Each is a dictionary: {(start_rank, end_rank): Corr}.

Return type:

SEEDAWindowAnalysisSystemCorrOutput
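
The idea behind a window analysis can be sketched as follows: sort the systems by human score, slide a window of fixed size over the ranking, and compute a correlation inside each window. This is not the library's implementation. The toy scores, the use of Pearson correlation only, the descending sort, and the 0-based inclusive (start_rank, end_rank) keys are all assumptions.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def window_correlations(metric_scores, human_scores, window=4):
    """Correlate metric and human scores over consecutive human-rank windows."""
    # Order system indices from best to worst human score (assumed direction).
    order = sorted(range(len(human_scores)),
                   key=lambda i: human_scores[i], reverse=True)
    results = {}
    for start in range(len(order) - window + 1):
        idx = order[start:start + window]
        xs = [metric_scores[i] for i in idx]
        ys = [human_scores[i] for i in idx]
        # Key: 0-based, inclusive rank range (an assumption of this sketch).
        results[(start, start + window - 1)] = pearson(xs, ys)
    return results

# Hypothetical system-level scores for 6 systems.
human = [0.9, 0.6, 0.4, 0.1, -0.3, -0.8]
metric = [0.70, 0.72, 0.55, 0.50, 0.52, 0.30]

corrs = window_correlations(metric, human, window=4)
for rank_range, value in corrs.items():
    print(rank_range, round(value, 3))
```

With 6 systems and a window of 4 there are three windows. Comparing their correlations shows whether a metric separates similarly ranked systems as well as it separates the full set.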