gec_metrics.metrics.impara module
- class gec_metrics.metrics.impara.IMPARA(config: Config = None)
Bases:
MetricBaseForReferenceFree
- class Config(model_qe: str = 'gotutiyan/IMPARA-QE', model_se: str = 'google-bert/bert-base-cased', pooling: str = 'cls', max_length: int = 128, threshold: float = 0.9, no_cuda: bool = False, batch_size: int = 32)
Bases:
Config
IMPARA configuration.
- Parameters:
model_qe (str) – Quality estimation model.
model_se (str) – Similarity estimation model.
pooling (str) – Pooling method. ‘cls’ or ‘mean’.
max_length (int) – Maximum length of inputs.
threshold (float) – Threshold for the similarity score.
no_cuda (bool) – If True, run on CPU.
batch_size (int) – Batch size for the inference.
- batch_size: int = 32
- max_length: int = 128
- model_qe: str = 'gotutiyan/IMPARA-QE'
- model_se: str = 'google-bert/bert-base-cased'
- no_cuda: bool = False
- pooling: str = 'cls'
- threshold: float = 0.9
- score_sentence(sources: list[str], hypotheses: list[str]) -> list[float]
Calculate sentence-level scores.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The sentence-level scores.
- Return type:
list[float]
- score_sentence_qe(hypotheses: list[str]) -> list[float]
Compute quality scores.
- Parameters:
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The quality scores.
- Return type:
list[float]
- score_sentence_se(sources: list[str], hypotheses: list[str]) -> list[float]
Compute similarity scores.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The similarity scores.
- Return type:
list[float]
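How the threshold relates the quality and similarity scores is not spelled out in the entries above. The following is a minimal sketch, assuming (as in the original IMPARA formulation) that a hypothesis keeps its quality score only when its similarity score strictly exceeds threshold and otherwise receives 0. The function name combine_scores and the sample values are hypothetical and are not part of gec_metrics.

```python
def combine_scores(qe_scores, se_scores, threshold=0.9):
    """Gate each quality score by its similarity score (assumed behaviour)."""
    if len(qe_scores) != len(se_scores):
        raise ValueError("qe_scores and se_scores must have the same length")
    # Keep the quality score only when the hypothesis stays close enough
    # to the source; otherwise fall back to 0.0.
    return [
        qe if se > threshold else 0.0
        for qe, se in zip(qe_scores, se_scores)
    ]


# Hypothetical values standing in for the outputs of
# score_sentence_qe() and score_sentence_se().
qe = [0.82, 0.95, 0.40]
se = [0.97, 0.85, 0.93]
combined = combine_scores(qe, se, threshold=0.9)
```

In this sketch the second hypothesis is zeroed out despite its high quality score, because its similarity to the source falls below the threshold.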
- class gec_metrics.metrics.impara.SimilarityEstimator(config: Config = None)
Bases:
MetricBaseForReferenceFree
- class Config(model: str = 'google-bert/bert-base-cased', batch_size: int = 32, max_length: int = 128, no_cuda: bool = False)
Bases:
Config
Similarity Estimator configuration.
- Parameters:
model (str) – Model name to compute similarity.
batch_size (int) – Batch size during inference.
max_length (int) – Maximum length in tokenization. Longer inputs are truncated.
no_cuda (bool) – If True, run on CPU.
- batch_size: int = 32
- max_length: int = 128
- model: str = 'google-bert/bert-base-cased'
- no_cuda: bool = False
- property device
- forward(src_input_ids: Tensor, src_attention_mask: Tensor, pred_input_ids: Tensor, pred_attention_mask: Tensor) -> Tensor
Compute the cosine similarity given source and corrected sentences.
- Parameters:
src_input_ids (torch.Tensor) – Tokenized source sentences. The shape is (num_batch, sequence_length)
src_attention_mask (torch.Tensor) – The attention mask to handle padding. The shape is (num_batch, sequence_length)
pred_input_ids (torch.Tensor) – Tokenized corrected sentences. The shape is (num_batch, sequence_length)
pred_attention_mask (torch.Tensor) – The attention mask to handle padding. The shape is (num_batch, sequence_length)
- Returns:
The cosine similarity. The shape is (num_batch, )
- Return type:
torch.Tensor
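For reference, the cosine similarity returned by forward() can be illustrated without any model. The sketch below uses plain Python lists in place of pooled sentence vectors; the helper names cosine_similarity and batch_cosine_similarity are hypothetical, and returning 0.0 for a zero vector is an assumption made here to avoid division by zero, not documented behaviour of the library.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption for this sketch: treat a zero vector as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)


def batch_cosine_similarity(src_vecs, pred_vecs):
    """Pairwise similarity: one score per (source, prediction) pair."""
    return [cosine_similarity(s, p) for s, p in zip(src_vecs, pred_vecs)]


# Toy sentence vectors of shape (num_batch, hidden_size) = (2, 2).
src_vecs = [[1.0, 0.0], [1.0, 1.0]]
pred_vecs = [[2.0, 0.0], [1.0, -1.0]]
sims = batch_cosine_similarity(src_vecs, pred_vecs)
```

The result has one entry per sentence pair, matching the documented output shape (num_batch, ).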
- mean_pooling(states: Tensor, mask: Tensor) -> Tensor
Compute mean pooling. Only the representations with mask==1 are used.
- Parameters:
states (torch.Tensor) – The token-level representation. The shape is (num_batch, sequence_length, hidden_size)
mask (torch.Tensor) – The mask indicating whether each position is padding. The shape is (num_batch, sequence_length)
- Returns:
The mean pooled representation. The shape is (num_batch, hidden_size)
- Return type:
torch.Tensor
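The masked averaging described for mean_pooling() can be sketched in plain Python as follows. This is an illustration of the technique under stated assumptions, not the library's implementation: nested lists stand in for tensors, and clamping the token count to at least 1 for an all-padding row is a choice made for this sketch.

```python
def mean_pooling(states, mask):
    """Average token vectors per sentence, skipping positions with mask == 0.

    states: nested list of shape (num_batch, sequence_length, hidden_size)
    mask:   nested list of shape (num_batch, sequence_length), 1 = real token
    """
    pooled = []
    for token_vectors, token_mask in zip(states, mask):
        hidden_size = len(token_vectors[0])
        total = [0.0] * hidden_size
        count = 0
        for vector, keep in zip(token_vectors, token_mask):
            if keep == 1:
                for i, value in enumerate(vector):
                    total[i] += value
                count += 1
        # Assumption for this sketch: avoid dividing by zero on all-padding rows.
        count = max(count, 1)
        pooled.append([t / count for t in total])
    return pooled


# Two sentences, three positions each, hidden_size = 2.
# The last position of the first sentence is padding and must be ignored.
states = [
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
    [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]
pooled = mean_pooling(states, mask)
```

The padded vector [100.0, 100.0] does not affect the first sentence's average, which is the point of applying the mask before averaging.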
- score_sentence(sources: list[str], hypotheses: list[str]) -> list[float]
Compute similarity scores.
- Parameters:
sources (list[str]) – Source sentences. The shape is (num_sentences, )
hypotheses (list[str]) – Corrected sentences. The shape is (num_sentences, )
- Returns:
The similarity scores.
- Return type:
list[float]