gec_metrics.meta_eval.base module

class gec_metrics.meta_eval.base.MetaEvalBase(config: Config = None)[source]

Bases: ABC

class Config[source]: Bases: object

class Corr(pearson: float = None, spearman: float = None, accuracy: float = None, kendall: float = None, human_scores: list[float] = None, metric_scores: list[float] = None)[source]

Bases: object

accuracy: float = None

human_scores: list[float] = None

kendall: float = None

metric_scores: list[float] = None

pearson: float = None

spearman: float = None

class Output[source]: Bases: object

abstractmethod corr_sentence(metric: MetricBase)[source]

Compute sentence-level correlations.

Parameters: metric (MetricBase) – The metric to be evaluated.
Returns: The sentence-level correlations output.
Return type: SentenceCorrOutput

abstractmethod corr_system(metric: MetricBase, aggregation: str = 'default')[source]

Compute system-level correlations.

Parameters:

metric (MetricBase) – The metric to be evaluated.
aggregation (str) – How to aggregate sentence-level scores into system rankings.
- 'default': Default aggregation, e.g., average or accumulation.
- 'trueskill': TrueSkill aggregation.

Returns:

The system-level correlations output.

Return type:

SystemCorrOutput
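To make the idea of aggregating sentence-level scores into a system ranking concrete, here is a minimal sketch of averaging, one of the possibilities the 'default' option mentions. The function names and the [system][sentence] layout are assumptions made for illustration, not the library's API. TrueSkill aggregation is not shown.

```python
from statistics import mean


def aggregate_by_average(sentence_scores):
    """Collapse sentence_scores[system][sentence] into one score per system."""
    return [mean(scores) for scores in sentence_scores]


def rank_systems(system_scores):
    """Return system indices ordered from highest to lowest score."""
    return sorted(range(len(system_scores)),
                  key=lambda i: system_scores[i],
                  reverse=True)


# Three hypothetical systems scored on three sentences each.
sentence_scores = [
    [0.5, 0.7, 0.6],
    [0.9, 0.8, 1.0],
    [0.2, 0.4, 0.3],
]
system_scores = aggregate_by_average(sentence_scores)
ranking = rank_systems(system_scores)
```

A system-level correlation would then compare these per-system scores (or the ranking derived from them) against the human system scores.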

abstractmethod load_sentence_data() → dict[str, list][source]

abstractmethod load_system_data() → dict[str, list][source]

pairwise_analysis(metric: MetricBase)[source]

Compute sentence-level correlations.

Parameters: metric (MetricBase) – The metric to be evaluated.
Returns: The sentence-level correlations output.
Return type: SentenceCorrOutput

pairwise_analysis_plot(results: list[tuple, float])[source]

rearange_sent_data(data)[source]

Rearrange the format and content of sentence-level evaluation results. This is intended for use with the LLMKobayashi24 metric.

The LLMKobayashi24 sentence-level meta-evaluation requires as input the same set of corrected sentences that was used in the SEEDA manual evaluation. However, when the number of corrected sentences is larger than 5, the same sentences as in SEEDA are not necessarily sampled and evaluated. To ensure that the same sentences as in SEEDA are used in the evaluation, this function replaces the unused sentences with the used ones, so that only the sentences used in SEEDA are present in the hypothesis set.

Also, different annotators evaluate different subsets of the same hypothesis set, so the data is flattened to make it easier to handle. Specifically, the human scores are changed from [num_sentences][num_annotations][num_systems] to [num_sentences * num_annotations][1][num_systems].

The hypotheses are also expanded: [num_systems][num_sentences] -> [num_systems][num_sentences * num_annotations].
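The shape change described above can be sketched with nested lists. This covers only the flattening step, not the replacement of unused sentences, and it assumes a sentence-major ordering of the flattened entries. flatten_annotations is a hypothetical helper, not the library's implementation.

```python
def flatten_annotations(human_scores, hypotheses):
    """Flatten annotations so that each one becomes its own pseudo-sentence.

    human_scores: [num_sentences][num_annotations][num_systems]
    hypotheses:   [num_systems][num_sentences]
    """
    # Each annotation is wrapped in a list of length 1:
    # [num_sentences * num_annotations][1][num_systems]
    flat_scores = [
        [annotation]
        for sentence in human_scores
        for annotation in sentence
    ]
    # Repeat each hypothesis once per annotation of its sentence:
    # [num_systems][num_sentences * num_annotations]
    flat_hyps = [
        [
            system_hyps[sent_id]
            for sent_id, sentence in enumerate(human_scores)
            for _ in sentence
        ]
        for system_hyps in hypotheses
    ]
    return flat_scores, flat_hyps


# Toy data: 2 sentences, 2 annotations each, 3 systems.
human_scores = [
    [[1, 2, 3], [2, 1, 3]],
    [[3, 1, 2], [1, 3, 2]],
]
hypotheses = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
flat_scores, flat_hyps = flatten_annotations(human_scores, hypotheses)
```

After flattening, entry i of flat_scores lines up with position i of every system's hypothesis list, so downstream code can treat each annotation as an independent sentence.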

window_analysis_plot(results: dict[tuple, Corr])[source]

window_analysis_system(metric: MetricBase, window: int = 4, aggregation='default') → SEEDAWindowAnalysisSystemCorrOutput[source]

System-level window analysis.

Parameters:

metric (MetricBase) – The metric to be evaluated.
window (int) – The window size.

Returns:

The correlations.

Contains .ew_edit, .ew_sent, .ts_edit, .ts_sent.
Each is a dictionary: {(start_rank, end_rank): Corr}.

Return type:

SEEDAWindowAnalysisSystemCorrOutput