# Meta-evaluation Introduction

gec-metrics also supports meta-evaluation. 

## Preparation
Please run the following command beforehand to download the relevant datasets:

```bash
gecmetrics-prepare-meta-eval
```

## Import

Each meta-evaluation benchmark is implemented as a single class. These classes can be imported from `gec_metrics.meta_eval`.


```python
from gec_metrics.meta_eval import MetaEvalSEEDA
```

Alternatively, you can use `get_meta_eval()`. Available IDs can be obtained using `get_meta_eval_ids()`.

```python
from gec_metrics import get_meta_eval, get_meta_eval_ids
metric_cls = get_meta_eval('seeda')
print(get_meta_eval_ids())  # ['gjg', 'seeda']
```

gec-metrics currently supports the following meta-evaluation frameworks:
|Framework|Class (`gec_metrics.meta_eval.<HERE>`)|References|
|:--|:--|:--|
|GJG15|`MetaEvalGJG`|[[Grundkiewicz+ 15]](https://aclanthology.org/D15-1052/)|
|SEEDA|`MetaEvalSEEDA`|[[Kobayashi+ 24]](https://aclanthology.org/2024.tacl-1.47/)|


## Initialize

Similar to metrics, initialize them by passing a Config object as needed.

```python
from gec_metrics.meta_eval import MetaEvalSEEDA
meta = MetaEvalSEEDA(MetaEvalSEEDA.Config(
    system='base'
))

# When using the default configuration.
meta = MetaEvalSEEDA()
```

## Meta-evaluation

Meta-evaluation is often discussed at two levels: corpus-level and sentence-level, and gec-metrics supports both.

### System-Level Meta-Evaluation

System-level meta-evaluation is performed using `corr_system()`.
This function compares system scores obtained from human evaluation with the evaluation results from `metric.rank_systems()`. You can control how the metric scores are calculated by specifying the `aggregation=` argument. [[Goto+ 25]](https://aclanthology.org/2025.acl-short.92/)'s experiments can be reproduced by comparing `aggregation="default"` and `aggregation="trueskill"`.

The return value is a dedicated Output class. If the human evaluation includes multiple aspects, the correlations with all of them are calculated and returned.

```python
from gec_metrics.meta_eval import MetaEvalSEEDA
from gec_metrics.metrics import GREEN
metric = GREEN()
meta = MetaEvalSEEDA()
sys_results = meta.corr_system(metric, aggregation='default')
print(sys_results)
```

### Sentence-Level Meta-Evaluation

Sentence-level meta-evaluation is performed using `corr_sentence()`. The return value is a dedicated Output class. If the human evaluation includes multiple aspects, the correlations with all of them are calculated and returned.

```python
from gec_metrics.meta_eval import MetaEvalSEEDA
from gec_metrics.metrics import GREEN
metric = GREEN()
meta = MetaEvalSEEDA()
sent_results = meta.corr_sentence(metric)
print(sent_results)
```

## Analysis

### Window analysis

The window analysis proposed in SEEDA [[Kobayashi+ 24]](https://aclanthology.org/2024.tacl-1.47/) calculates correlations over subsets of systems extracted using a fixed-width window from systems sorted by human rank. This allows for analyzing the evaluation performance on sets of top-ranked or bottom-ranked systems according to human evaluation, as well as assessing the robustness of the evaluation across diverse system sets.

```python
from gec_metrics.meta_eval import MetaEvalSEEDA
from gec_metrics.metrics import ERRANT
import matplotlib.pyplot as plt
metric = ERRANT()
meta = MetaEvalSEEDA()
window_results = meta.window_analysis_system(metric, window=4, aggregation='default')
fig = meta.window_analysis_plot(window_results.ts_edit)
plt.savefig('window-analysis.png')
```

The result will be like this:

![window-fig](../figs/window-analysis.png)


### Pairwise analysis

Pairwise analysis provides a more detailed examination of sentence-level meta-evaluation results [[Sakai+ 25]](https://aclanthology.org/2025.findings-acl.1315/). It groups the sentence-level meta-evaluation results based on human evaluation rank pairs and calculates the agreement rate within each group. This facilitates the analysis of trends, such as whether the metric demonstrates better agreement for pairs with larger differences in human evaluation ranks.

```python
from gec_metrics.meta_eval import MetaEvalSEEDA
from gec_metrics.metrics import ERRANT
import matplotlib.pyplot as plt
metric = ERRANT()
meta = MetaEvalSEEDA()
pairwise_results = meta.pairwise_analysis(metric)
fig = meta.pairwise_analysis_plot(pairwise_results['edit'])
plt.savefig('pairwise-analysis.png')
```

The result will be like this:

![pairwise-fig](../figs/pairwise-analysis.png)