Evaluate

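1. Basic Concepts

Evaluate is a library for easily evaluating machine learning models and datasets. With a single line of code you get access to dozens of evaluation methods for different domains (NLP, computer vision, reinforcement learning, and more), whether on a local machine or in a distributed training setup.

Installation: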


pip install evaluate

Verify that the installation succeeded:


python -c "import evaluate; print(evaluate.load('exact_match').compute(references=['hello'], predictions=['hello']))"

This should print {'exact_match': 1.0}.

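A typical machine learning pipeline has several aspects that can be evaluated, and Evaluate provides a tool for each of them:
- Metric: a metric measures the performance of a model, usually from the model's predictions and some ground-truth labels. Examples: accuracy, exact match, IoU.
  All integrated metrics are listed under evaluate-metric.
- Comparison: a comparison compares two models. For example, their predictions can be compared against the ground-truth labels to compute their agreement. The McNemar test is a paired non-parametric statistical hypothesis test that compares the predictions of two models in order to measure whether they diverge. It outputs a p-value between 0.0 and 1.0 that reflects the difference between the two models' predictions; the lower the p-value, the more significant the difference.
  All integrated comparisons are listed under evaluate-comparison.
- Measurement: datasets are as important as models. Measurements let you investigate the properties of a dataset, for example its average word length.
  All integrated measurements are listed under evaluate-measurement.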
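To make the McNemar comparison concrete, here is a minimal pure-Python sketch of the test statistic and p-value. It assumes the variant without continuity correction and uses the identity that the survival function of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)). The function name and the toy data are illustrative and are not taken from the library.

```python
import math

def mcnemar(predictions1, predictions2, references):
    """Return (statistic, p_value) for McNemar's test (no continuity correction)."""
    triples = list(zip(predictions1, predictions2, references))
    # b: model 1 is right and model 2 is wrong; c: the opposite case
    b = sum(p1 == r and p2 != r for p1, p2, r in triples)
    c = sum(p1 != r and p2 == r for p1, p2, r in triples)
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree on correctness
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

references   = [1, 0, 1, 1, 0, 1, 0, 0]
predictions1 = [1, 0, 1, 1, 0, 0, 0, 0]
predictions2 = [1, 1, 0, 1, 1, 1, 0, 0]
stat, p_value = mcnemar(predictions1, predictions2, references)
print(stat, round(p_value, 4))
```

Here model 1 alone is correct on three examples and model 2 alone on one, so the statistic is (3 - 1)^2 / (3 + 1) = 1.0 and the p-value is roughly 0.32, which would not indicate a significant difference.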
Each of these evaluation modules lives on the Hugging Face Hub as a Space. Every metric, comparison, and measurement is a separate Python module, but they share a common entry point, evaluate.load():
```
import evaluate
accuracy = evaluate.load("accuracy")
```
You can also specify the module type explicitly:
```
word_length = evaluate.load("word_length", module_type="measurement")
```
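Metrics fall into three high-level categories:
- Generic metrics: can be applied to a wide variety of situations and datasets, such as precision, accuracy, and perplexity.
  To see the input structure of a given metric, look at its metric card.
- Task-specific metrics: limited to a given task. For example, machine translation is often evaluated with BLEU or ROUGE, and named entity recognition is often evaluated with seqeval.
- Dataset-specific metrics: measure model performance on a specific benchmark only. For example, the GLUE benchmark has a dedicated evaluation metric.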

Use evaluate.list_evaluation_modules() to list all available evaluation modules:

evaluate.list_evaluation_modules(
  module_type="comparison",
  include_community=False, 
  with_details=True)
# [{'name': 'mcnemar', 'type': 'comparison', 'community': False, 'likes': 1},
#  {'name': 'exact_match', 'type': 'comparison', 'community': False, 'likes': 0}]

All evaluation modules have a set of attributes, which are stored in an EvaluationModuleInfo object:


Attribute           Description
description         A short description of the evaluation module.
citation            A BibTeX string for citation when available.
features            A Features object defining the input format.
inputs_description  This is equivalent to the module's docstring.
homepage            The homepage of the module.
license             The license of the module.
codebase_urls       Link to the code behind the module.
reference_urls      Additional reference URLs.

There are two main ways to compute the actual score:

All-in-one: the inputs are passed to compute() in a single call, which returns the score as a dictionary.

accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])
# {'accuracy': 0.5}

Incremental: the inputs are added to the module with EvaluationModule.add() or EvaluationModule.add_batch(), and the score is computed at the end with EvaluationModule.compute().


# Add one (reference, prediction) pair at a time
for ref, pred in zip([0,1,0,1], [1,0,0,1]):
    accuracy.add(references=ref, predictions=pred)
accuracy.compute()
# {'accuracy': 0.5}

# Add one batch at a time
for refs, preds in zip([[0,1],[0,1]], [[1,0],[0,1]]):
    accuracy.add_batch(references=refs, predictions=preds)
accuracy.compute()
# {'accuracy': 0.5}

# Typical evaluation loop; evaluation_dataset, model, and metric are placeholders for your own objects
for model_inputs, ground_truth in evaluation_dataset:
    predictions = model(model_inputs)
    metric.add_batch(references=ground_truth, predictions=predictions)
metric.compute()
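The incremental pattern can be pictured as a buffer that is filled by add() and add_batch() and consumed by compute(). The following is a minimal sketch with a hypothetical ToyAccuracy class. It is not the library's implementation (which stores the inputs in Arrow tables), and resetting the buffers in compute() is a design choice of this sketch.

```python
class ToyAccuracy:
    """Hypothetical stand-in for an evaluation module; for illustration only."""

    def __init__(self):
        self._predictions = []
        self._references = []

    def add(self, *, prediction, reference):
        # Buffer a single (prediction, reference) pair
        self._predictions.append(prediction)
        self._references.append(reference)

    def add_batch(self, *, predictions, references):
        # Buffer a whole batch of pairs
        self._predictions.extend(predictions)
        self._references.extend(references)

    def compute(self):
        correct = sum(p == r for p, r in zip(self._predictions, self._references))
        score = correct / len(self._references)
        # Reset the buffers so the object can be reused for another run
        self._predictions, self._references = [], []
        return {"accuracy": score}

metric = ToyAccuracy()
metric.add(prediction=1, reference=0)
metric.add_batch(predictions=[0, 0, 1], references=[1, 0, 1])
result = metric.compute()
print(result)  # {'accuracy': 0.5}
```

Because the score is only computed at the end, the result is the same as passing all four pairs to a single all-in-one call.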

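Distributed evaluation: computing metrics in a distributed environment can be tricky. Metric evaluation runs in separate Python processes, or nodes, on different subsets of the dataset. When a metric score is additive ($f(A\cup B) = f(A) + f(B)$), a distributed reduce operation can gather the scores of the individual subsets. When a metric is non-additive ($f(A\cup B) \ne f(A) + f(B)$), it is not that simple. For example, you cannot take the sum of the F1 scores of each data subset as your final metric.
A common workaround is to fall back to evaluating in a single process. The metrics are then evaluated on a single GPU, which is inefficient.
Evaluate solves this by computing the final metric on the first node only. The predictions and references are computed independently on each node and handed to the metric. They are stored temporarily in Apache Arrow tables, which avoids cluttering GPU or CPU memory. When you are ready to compute() the final metric, the first node can access the predictions and references stored on all other nodes. Once it has gathered all predictions and references, compute() performs the final metric evaluation.
This approach lets Evaluate run predictions in a distributed fashion, which matters for evaluation speed in a distributed setting. At the same time, you can use complex non-additive metrics without wasting valuable GPU or CPU memory.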
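The non-additivity of F1 is easy to check by hand. The sketch below splits a small binary dataset into two shards, as if two processes had produced them, and compares the per-shard F1 scores with the F1 score of the full dataset. The data and helper names are made up for illustration.

```python
def counts(predictions, references, positive=1):
    """Return (true positives, false positives, false negatives)."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    return tp, fp, fn

def f1(predictions, references):
    tp, fp, fn = counts(predictions, references)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Two shards, as if produced by two processes
preds_a, refs_a = [1, 1, 1, 1], [1, 1, 1, 0]
preds_b, refs_b = [1, 0, 0, 0], [0, 1, 1, 0]

f1_a = f1(preds_a, refs_a)                        # 6/7
f1_b = f1(preds_b, refs_b)                        # 0.0
f1_all = f1(preds_a + preds_b, refs_a + refs_b)   # 0.6
mean_of_shards = (f1_a + f1_b) / 2                # 3/7
print(f1_a, f1_b, f1_all, mean_of_shards)
```

Neither the sum (about 0.86) nor the mean (about 0.43) of the shard scores equals the F1 score on the whole dataset (0.6), which is why the raw predictions and references have to be gathered before the final computation.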

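Combining several evaluations: sometimes you need more than one metric. You could load a bunch of metrics and call them one after another, but a more convenient way is to bundle them with the combine() function:
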
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])
# {
#   'accuracy': 0.667,
#   'f1': 0.667,
#   'precision': 1.0,
#   'recall': 0.5
# }
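The four scores in this output can be reproduced by hand from the confusion counts. This short pure-Python check treats label 1 as the positive class, which is an assumption that matches the usual binary default.

```python
predictions = [0, 1, 0]
references = [0, 1, 1]
pairs = list(zip(predictions, references))

tp = sum(p == 1 and r == 1 for p, r in pairs)   # 1
fp = sum(p == 1 and r == 0 for p, r in pairs)   # 0
fn = sum(p == 0 and r == 1 for p, r in pairs)   # 1
correct = sum(p == r for p, r in pairs)         # 2

accuracy = correct / len(references)            # 2/3
precision = tp / (tp + fp)                      # 1.0
recall = tp / (tp + fn)                         # 0.5
f1_score = 2 * precision * recall / (precision + recall)  # 2/3
print(round(accuracy, 3), round(f1_score, 3), precision, recall)
```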

Saving and pushing to the Hub: the evaluate.save() function makes it easy to save metric results:


result = accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])
hyperparams = {"model": "bert-base-uncased"}
evaluate.save("./results/", experiment="run 42", **result, **hyperparams)

The saved JSON file looks like this:

{
    "experiment": "run 42",
    "accuracy": 0.5,
    "model": "bert-base-uncased",
    "_timestamp": "2022-05-30T22:09:11.959469",
    "_git_commit_hash": "123456789abcdefghijkl",
    "_evaluate_version": "0.1.0",
    "_python_version": "3.9.12 (main, Mar 26 2022, 15:51:15) \n[Clang 13.1.6 (clang-1316.0.21.2)]",
    "_interpreter_path": "/Users/leandro/git/evaluate/env/bin/python"
}
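To show how such a file could be produced, here is a minimal sketch using only the standard library. The helper save_results is hypothetical and only mimics the shape of the file above: it omits the git commit hash and library version, and the timestamped file name pattern is an assumption of this sketch.

```python
import json
import sys
import tempfile
from datetime import datetime
from pathlib import Path

def save_results(directory, **data):
    """Hypothetical helper: write the results plus some system info to a JSON file."""
    now = datetime.now()
    record = dict(data)
    record["_timestamp"] = now.isoformat()
    record["_python_version"] = sys.version
    record["_interpreter_path"] = sys.executable
    # Assumed naming scheme: one timestamped file per call
    path = Path(directory) / now.strftime("result-%Y_%m_%d-%H_%M_%S.json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=4))
    return path

with tempfile.TemporaryDirectory() as tmp:
    saved = save_results(tmp, experiment="run 42", accuracy=0.5, model="bert-base-uncased")
    loaded = json.loads(saved.read_text())
print(sorted(loaded))
```

Storing the hyperparameters and system information next to the score makes it possible to trace a result back to the run that produced it.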

The evaluate.push_to_hub() function pushes evaluation results to the Hub:

evaluate.push_to_hub(
  model_id="huggingface/gpt2-wikitext2",  # model repository on hub
  metric_value=0.5,                       # metric value
  metric_type="bleu",                     # metric name, e.g. accuracy.name
  metric_name="BLEU",                     # pretty name which is displayed
  dataset_type="wikitext",                # dataset name on the hub
  dataset_name="WikiText",                # pretty name
  dataset_split="test",                   # dataset split used
  task_type="text-generation",            # task id, see https://github.com/huggingface/datasets/blob/master/src/datasets/utils/resources/tasks.json
  task_name="Text Generation"             # pretty name for task
)

Evaluator: evaluate.evaluator() provides automated evaluation. It only needs a model, a dataset, and a metric; you do not have to supply the model's predictions, because model inference is run internally.

The currently supported tasks are:

- "text-classification": uses TextClassificationEvaluator
- "token-classification": uses TokenClassificationEvaluator
- "question-answering": uses QuestionAnsweringEvaluator
- "image-classification": uses ImageClassificationEvaluator
- "text2text-generation": uses Text2TextGenerationEvaluator
- "summarization": uses SummarizationEvaluator
- "translation": uses TranslationEvaluator

Each task has its own requirements for the dataset format and the pipeline output.

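text classification: the text classification evaluator can be used to evaluate text models on classification datasets. Besides the model, dataset, and metric inputs, it takes the following optional inputs:

- input_column="text": specifies the column that holds the data for the pipeline.
- label_column="label": specifies the column that holds the labels for the evaluation.
- label_mapping=None: aligns the labels in the pipeline output with the labels needed for the evaluation. For example, the labels in label_column can be integers (0/1), while the pipeline can produce label names such as "positive"/"negative".

The evaluator expects the data to have a "text" column and a "label" column. If your data is laid out differently, pass the input_column and label_column keyword arguments with the actual column names.

By default, the "accuracy" metric is computed.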
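The role of label_mapping can be illustrated without running a model. In the sketch below, pipeline_outputs is a hand-written stand-in for what a text classification pipeline might return; the mapping turns the label names into the integers used by the reference column before accuracy is computed.

```python
# Hand-written stand-in for pipeline output (label names plus scores)
pipeline_outputs = [
    {"label": "NEGATIVE", "score": 0.98},
    {"label": "POSITIVE", "score": 0.91},
    {"label": "POSITIVE", "score": 0.67},
]
label_mapping = {"NEGATIVE": 0, "POSITIVE": 1}
references = [0, 1, 0]  # integer labels as stored in the dataset

# Align the label names with the integer labels, then score
predictions = [label_mapping[out["label"]] for out in pipeline_outputs]
accuracy = sum(p == r for p, r in zip(predictions, references)) / len(references)
print(predictions, round(accuracy, 3))  # [0, 1, 1] 0.667
```

Without the mapping, the strings "NEGATIVE"/"POSITIVE" would never equal the integers 0/1 and every prediction would count as wrong.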

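If no device is specified, model inference defaults to the first GPU on the machine if one is available, and to the CPU otherwise. To use a specific device, pass device to compute: -1 uses the CPU, and a non-negative integer (starting from 0) uses the corresponding CUDA device.

There are several ways to pass a model to the evaluator: the name of a model on the Hub, a directly loaded transformers model, or an initialized transformers.Pipeline. Any callable that behaves like a pipeline can be passed as well.

For example:
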
from datasets import load_dataset
from evaluate import evaluator
from transformers import AutoModelForSequenceClassification, pipeline

data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(1000))
task_evaluator = evaluator("text-classification")

model = AutoModelForSequenceClassification.from_pretrained("lvwerra/distilbert-imdb")
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb")

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb", # Pass a model name or path
    # model_or_pipeline=model,  # Pass an instantiated model
    # model_or_pipeline=pipe,   # Pass an instantiated pipeline 
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
# {
#     'accuracy': 0.918,
#     'latency_in_seconds': 0.013,
#     'samples_per_second': 78.887,
#     'total_time_in_seconds': 12.676
# }

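Note that the evaluation results contain both the requested metric and timing information for obtaining the predictions through the pipeline. Treat the timing information with care:

- It covers all processing done in the pipeline. This may include tokenization and post-processing, which can differ from model to model.
- It depends heavily on the hardware the evaluation runs on.
- Speed may be improved by tuning settings such as the batch size.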
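The three timing fields are related by simple arithmetic. The following sketch times a dummy pipeline-like callable end to end and derives the same three quantities; the helper names are hypothetical and the way the library measures time internally may differ.

```python
import time

def timed_predictions(pipe, inputs):
    """Run pipe on inputs and derive timing figures from the wall-clock time."""
    start = time.perf_counter()
    outputs = pipe(inputs)
    total = time.perf_counter() - start
    n = len(inputs)
    return outputs, {
        "total_time_in_seconds": total,
        "samples_per_second": n / total,
        "latency_in_seconds": total / n,
    }

def dummy_pipe(texts):
    time.sleep(0.01)  # stands in for tokenization, inference, and post-processing
    return [{"label": "POSITIVE"} for _ in texts]

outputs, timing = timed_predictions(dummy_pipe, ["a", "b", "c", "d"])
print(timing)
```

Applied to the figures above, 12.676 seconds for 1000 samples gives a latency of about 0.013 seconds per sample and about 78.9 samples per second.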

Several metrics can be evaluated at once with combine():

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",
    data=data,
    metric=evaluate.combine(["accuracy", "recall", "precision", "f1"]),
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1}
)
print(eval_results)
# {
#     'accuracy': 0.918,
#     'f1': 0.916,
#     'precision': 0.9147,
#     'recall': 0.9187,
#     'latency_in_seconds': 0.013,
#     'samples_per_second': 78.887,
#     'total_time_in_seconds': 12.676
# }

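Computing the value of a metric alone is often not enough to tell whether one model performs significantly better than another. Bootstrapping the evaluation yields confidence intervals and a standard error, which help estimate how stable a score is:


results = task_evaluator.compute(model_or_pipeline=pipe, data=data, metric="accuracy",
                                 label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
                                 strategy="bootstrap", n_resamples=200)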
print(results)
# {'accuracy': 
#     {
#       'confidence_interval': (0.906, 0.9406749892841922),
#       'standard_error': 0.00865213251082787,
#       'score': 0.923
#     }
# }
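The idea behind the bootstrap strategy can be sketched in a few lines of pure Python. Note that this is a plain percentile bootstrap written for illustration, not the scipy routine the evaluator relies on, so the interval it produces will generally differ from the library's. All names are hypothetical.

```python
import random
import statistics

def bootstrap_accuracy(predictions, references, n_resamples=200,
                       confidence_level=0.95, seed=0):
    """Percentile bootstrap of accuracy: resample examples with replacement."""
    rng = random.Random(seed)
    n = len(references)
    hits = [int(p == r) for p, r in zip(predictions, references)]
    score = sum(hits) / n
    resampled = []
    for _ in range(n_resamples):
        sample = [hits[rng.randrange(n)] for _ in range(n)]
        resampled.append(sum(sample) / n)
    resampled.sort()
    alpha = (1 - confidence_level) / 2
    low = resampled[int(alpha * (n_resamples - 1))]
    high = resampled[int((1 - alpha) * (n_resamples - 1))]
    return {
        "confidence_interval": (low, high),
        "standard_error": statistics.stdev(resampled),
        "score": score,
    }

predictions = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
references  = [1, 0, 1, 0, 0, 1, 1, 1, 1, 1]
result = bootstrap_accuracy(predictions, references)
print(result)
```

A wide interval or a large standard error signals that the score would change noticeably on a different sample of the data, so small differences between models should not be over-interpreted.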

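token classification: the token classification evaluator evaluates models for tasks such as NER or POS tagging. It takes the following parameters:

- input_column/label_column/label_mapping: see text classification.
- join_by = " ": most datasets are already tokenized, whereas the pipeline expects a string. The tokens therefore have to be joined before they are passed to the pipeline. By default they are joined with a single space.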
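As a small illustration of what join_by does, the sketch below joins a pre-tokenized sentence into the single string a pipeline expects and records where each token starts in that string. Keeping such offsets is one way to line up string-level predictions with the original tokens; how the evaluator does this internally is not shown in the source, so treat that part as an assumption.

```python
tokens = ["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]
join_by = " "
text = join_by.join(tokens)

# Character offset at which each token starts inside the joined string
offsets = []
position = 0
for token in tokens:
    offsets.append(position)
    position += len(token) + len(join_by)

print(text)        # New York is a city and Felix a person .
print(offsets[6])  # 23, the start of "Felix"
```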

Example:

import pandas as pd
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

models = [
    "xlm-roberta-large-finetuned-conll03-english",
    "dbmdz/bert-large-cased-finetuned-conll03-english",
    "elastic/distilbert-base-uncased-finetuned-conll03-english",
    "dbmdz/electra-large-discriminator-finetuned-conll03-english",
    "gunghio/distilbert-base-multilingual-cased-finetuned-conll2003-ner",
    "philschmid/distilroberta-base-ner-conll2003",
    "Jorgeutd/albert-base-v2-finetuned-ner",
]

data = load_dataset("conll2003", split="validation").shuffle().select(range(1000))  # select() expects indices
task_evaluator = evaluator("token-classification")

results = []
for model in models:
    results.append(
        task_evaluator.compute(
            model_or_pipeline=model, data=data, metric="seqeval"
            )
        )
df = pd.DataFrame(results, index=models)
# Print only the columns of interest
print(df[["overall_f1", "overall_accuracy", "total_time_in_seconds", "samples_per_second", "latency_in_seconds"]])

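question answering: the question-answering evaluator evaluates question answering models. It takes the following parameters:

- question_column="question": name of the dataset column that contains the question.
- context_column="context": name of the dataset column that contains the context.
- id_column="id": name of the column that holds the id field of the (question, answer) pair.
- label_column="answers": name of the column that contains the answers.
- squad_v2_format=None: whether the dataset follows the format of the squad_v2 dataset, in which a question may have no answer in the context. If this parameter is not provided, the format is inferred automatically.

Example (with confidence intervals: strategy="bootstrap", and n_resamples sets the number of resamples):
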
from datasets import load_dataset
from evaluate import evaluator

task_evaluator = evaluator("question-answering")

data = load_dataset("squad", split="validation[:1000]")
eval_results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-distilled-squad",
    data=data,
    metric="squad",
    strategy="bootstrap",
    n_resamples=30
)

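image classification: the image classification evaluator evaluates image classification models. It takes the following parameters:

- input_column="image": name of the column that contains the PIL image files.
- label_column="label": name of the column that contains the labels.
- label_mapping=None: see text classification.

Example:


from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

data = load_dataset("imagenet-1k", split="validation", use_auth_token=True)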

pipe = pipeline(
    task="image-classification",
    model="facebook/deit-small-distilled-patch16-224"
)

task_evaluator = evaluator("image-classification")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    label_mapping=pipe.model.config.label2id
)

The evaluator also works with third-party pipelines such as Scikit-Learn pipelines and spaCy pipelines. Following the convention of TextClassificationPipeline, the pipeline should be callable and return a list of dictionaries.

from datasets import load_dataset
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

ds = load_dataset("imdb")
text_clf = Pipeline([
        ('vect', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB()),
])

text_clf.fit(ds["train"]["text"], ds["train"]["label"])

class ScikitEvalPipeline:
    def __init__(self, pipeline):
        self.pipeline = pipeline
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        return [{"label": p} for p in self.pipeline.predict(input_texts)]

pipe = ScikitEvalPipeline(text_clf)

from evaluate import evaluator

# Named task_evaluator to avoid shadowing the built-in eval()
task_evaluator = evaluator("text-classification")
task_evaluator.compute(pipe, ds["test"], "accuracy")
# {'accuracy': 0.82956}
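The calling convention itself needs no third-party library: any object that is callable on a list of texts and returns one dictionary per input follows the same shape. Here is a minimal sketch with a hypothetical keyword-based classifier; the keyword rule is purely illustrative and the task attribute mirrors the wrapper above.

```python
class KeywordSentimentPipeline:
    """Toy pipeline-like callable; the keyword rule is purely illustrative."""

    def __init__(self, positive_words):
        self.positive_words = set(positive_words)
        self.task = "text-classification"

    def __call__(self, input_texts, **kwargs):
        results = []
        for text in input_texts:
            words = set(text.lower().split())
            # Label 1 if any positive keyword occurs in the text, else 0
            label = 1 if words & self.positive_words else 0
            results.append({"label": label})
        return results

pipe = KeywordSentimentPipeline(["great", "good", "wonderful"])
outputs = pipe(["A great movie", "Dull and slow", "Good fun"])
predictions = [out["label"] for out in outputs]
print(predictions)  # [1, 0, 1]
```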

2. API

2.1 EvaluationModuleInfo/EvaluationModule

class evaluate.EvaluationModuleInfo: the base class of MetricInfo, ComparisonInfo, and MeasurementInfo.

class evaluate.EvaluationModuleInfo( 
  description: str,
  citation: str,
  features: typing.Union[datasets.features.features.Features, typing.List[datasets.features.features.Features]],
  inputs_description: str = <factory>,
  homepage: str = <factory>,
  license: str = <factory>,
  codebase_urls: typing.List[str] = <factory>,
  reference_urls: typing.List[str] = <factory>,
  streamable: bool = False,
  format: typing.Optional[str] = None,
  module_type: str = 'metric',
  module_name: typing.Optional[str] = None,
  config_name: typing.Optional[str] = None,
  experiment_id: typing.Optional[str] = None )

Methods:

- from_directory(metric_info_dir): creates an EvaluationModuleInfo from the JSON file in the metric_info_dir directory.
- write_to_directory(metric_info_dir): writes the JSON file to the metric_info_dir directory.

class evaluate.MetricInfo/ComparisonInfo/MeasurementInfo: information about a metric, comparison, or measurement.

class evaluate.MetricInfo/ComparisonInfo/MeasurementInfo( 
  description: str,
  citation: str,
  features: typing.Union[datasets.features.features.Features, typing.List[datasets.features.features.Features]],
  inputs_description: str = <factory>,
  homepage: str = <factory>,
  license: str = <factory>,
  codebase_urls: typing.List[str] = <factory>,
  reference_urls: typing.List[str] = <factory>,
  streamable: bool = False,
  format: typing.Optional[str] = None,
  module_type: str = 'metric',
  module_name: typing.Optional[str] = None,
  config_name: typing.Optional[str] = None,
  experiment_id: typing.Optional[str] = None )

class evaluate.EvaluationModule: the base class of Metric, Comparison, and Measurement, containing the common API.
```
class evaluate.EvaluationModule(
  config_name: typing.Optional[str] = None, 
  keep_in_memory: bool = False, 
  cache_dir: typing.Optional[str] = None, 
  num_process: int = 1, 
  process_id: int = 0, 
  seed: typing.Optional[int] = None, 
  experiment_id: typing.Optional[str] = None,
  hash: str = None,
  max_concurrent_cache_files: int = 10000,
  timeout: typing.Union[int, float] = 100,
  **kwargs 
)
```
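Parameters:
- config_name: a string used to define a hash specific to the module's computation script.
- keep_in_memory: a boolean specifying whether to keep all predictions and references in memory. Not available in a distributed setting.
- cache_dir: a string specifying the directory in which temporary prediction/reference data is stored. In a distributed setting the data directory should be on a shared file system. Defaults to ~/.cache/huggingface/evaluate/.
- num_process: an integer specifying the total number of nodes in a distributed setting. Useful for distributed settings.
- process_id: an integer specifying the id of the current process in a distributed setting (between 0 and num_process-1). Useful for distributed settings.
- seed: an integer specifying the random seed.
- experiment_id: a string specifying the experiment id. Useful for distributed settings.
- hash: a string used to identify the evaluation module from the hashed content of its files.
- max_concurrent_cache_files: an integer specifying the maximum number of concurrent module cache files. Defaults to 10000.
- timeout: an integer or float specifying the timeout, in seconds, for synchronization in a distributed setting.
Methods:
- add(prediction = None, reference = None, **kwargs): adds one (prediction, reference) pair for evaluation.
  Parameters:
  - prediction: the prediction.
  - reference: the ground truth.
- add_batch(predictions = None, references = None, **kwargs): adds a batch of (prediction, reference) pairs for evaluation.
  Parameters:
  - predictions: a collection of predictions.
  - references: a collection of ground truths.
- compute(predictions = None, references = None, **kwargs): computes the evaluation result.
  Parameters:
  - predictions: a collection of predictions.
  - references: a collection of ground truths.
  - kwargs: keyword arguments passed on to the _compute method of the evaluation module.
- download_and_prepare(download_config: typing.Optional[evaluate.utils.file_utils.DownloadConfig] = None, dl_manager: typing.Optional[datasets.download.download_manager.DownloadManager] = None ): downloads and prepares the dataset.
  Parameters:
  - download_config: a DownloadConfig object specifying the download configuration.
  - dl_manager: a DownloadManager specifying the download manager.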

class evaluate.Metric/Comparison/Measurement: the base class of all metrics, comparisons, and measurements respectively.

class evaluate.Metric/Comparison/Measurement(
  config_name: typing.Optional[str] = None, 
  keep_in_memory: bool = False, 
  cache_dir: typing.Optional[str] = None, 
  num_process: int = 1, 
  process_id: int = 0, 
  seed: typing.Optional[int] = None, 
  experiment_id: typing.Optional[str] = None,
  hash: str = None,
  max_concurrent_cache_files: int = 10000,
  timeout: typing.Union[int, float] = 100,
  **kwargs 
)

2.2 combine/load/save

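evaluate.combine(evaluations, force_prefix = False): combines several metrics, comparisons, or measurements into a single CombinedEvaluations object that can be used like a single evaluation module.
If two scores have the same name, they are prefixed with their module names. If two modules have the same name, give them distinct names with a dictionary; otherwise an integer id is appended to the prefix.
Parameters:
- evaluations: a dictionary or list specifying a set of evaluation modules.
- force_prefix: a boolean specifying whether the scores of all modules are prefixed with the module name.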
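The prefixing rule can be pictured with a small merge function. This is a hypothetical sketch of the behaviour described above, not the library's code: it covers the case of colliding score names and the force_prefix switch, but leaves out the integer id used when two modules share a name. The module names toy_a and toy_b are invented.

```python
from collections import Counter

def merge_results(named_results, force_prefix=False):
    """Hypothetical merge of per-module score dicts into one flat dict."""
    # Count how many modules report each score name
    key_counts = Counter(key for _, scores in named_results for key in scores)
    merged = {}
    for name, scores in named_results:
        for key, value in scores.items():
            if force_prefix or key_counts[key] > 1:
                merged[f"{name}_{key}"] = value  # disambiguate with the module name
            else:
                merged[key] = value
    return merged

named_results = [
    ("accuracy", {"accuracy": 0.5}),
    ("toy_a", {"score": 0.1}),
    ("toy_b", {"score": 0.9}),
]
merged = merge_results(named_results)
forced = merge_results(named_results, force_prefix=True)
print(merged)  # {'accuracy': 0.5, 'toy_a_score': 0.1, 'toy_b_score': 0.9}
print(forced)  # {'accuracy_accuracy': 0.5, 'toy_a_score': 0.1, 'toy_b_score': 0.9}
```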
class evaluate.CombinedEvaluations(evaluation_modules, force_prefix = False ): the CombinedEvaluations object.
Parameters: see evaluate.combine().
Methods: add/add_batch/compute: see evaluate.EvaluationModule.
evaluate.list_evaluation_modules(module_type = None, include_community = True, with_details = False): lists all evaluation modules available on the Hugging Face Hub.
Parameters:
- module_type: a string specifying the type of evaluation module. One of 'metric', 'comparison', or 'measurement'. If None, all types are listed.
- include_community: a boolean specifying whether community modules are included. Defaults to True.
- with_details: a boolean specifying whether to return the full details of the modules instead of only their ids. Defaults to False.

evaluate.load(): loads an evaluate.EvaluationModule.


evaluate.load(
  path: str, config_name: typing.Optional[str] = None, module_type: typing.Optional[str] = None,
  process_id: int = 0, num_process: int = 1, cache_dir: typing.Optional[str] = None, 
  experiment_id: typing.Optional[str] = None, keep_in_memory: bool = False, 
  download_config: typing.Optional[evaluate.utils.file_utils.DownloadConfig] = None,
  download_mode: typing.Optional[datasets.download.download_manager.DownloadMode] = None,
  revision: typing.Union[str, datasets.utils.version.Version, NoneType] = None, **init_kwargs
)

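Parameters:

- path: a string specifying the path to the evaluation processing script that contains the evaluation builder. It can be:
  - a local path to the processing script, or to a local directory containing the processing script (where the directory and the script have the same name).
  - an evaluation module identifier on the HuggingFace evaluate repo.
- config_name: a string selecting a configuration for the metric.
- module_type/process_id/cache_dir/experiment_id/keep_in_memory: see evaluate.EvaluationModule.
- download_config: a DownloadConfig object specifying the download configuration.
- download_mode: a DownloadMode object specifying the download mode. Defaults to REUSE_DATASET_IF_EXISTS.
- revision: a string or a Version object. If specified, the module is loaded from the repository at this version. Specifying a version that differs from your local version may cause compatibility issues.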

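evaluate.save(path_or_file, **data): saves results to a JSON file. It also saves some system information, such as the current time, the current commit hash (if inside a repo), and Python system information.
Parameters: path_or_file: the path of the file to store the results in. If only a directory is given instead of a file name, the generated file is named in the format result-%Y_%m_%d-%H_%M_%S.json.

evaluate.push_to_hub(): pushes the result of a metric to the metadata of a model repo on the Hub.
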
evaluate.push_to_hub(
  model_id: str, task_type: str, dataset_type: str, dataset_name: str, metric_type: str,
  metric_name: str, metric_value: float, task_name: str = None, dataset_config: str = None, 
  dataset_split: str = None, dataset_revision: str = None, 
  dataset_args: typing.Dict[str, int] = None, metric_config: str = None, 
  metric_args: typing.Dict[str, int] = None, overwrite: bool = False
)

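Parameters:

- model_id: a string specifying the model id.
- task_type: a string specifying the task id.
- dataset_type: a string specifying the dataset id.
- dataset_name: a string specifying the name of the dataset.
- metric_type: a string specifying the metric id.
- metric_name: a string specifying the name of the metric.
- metric_value: a float specifying the computed metric value.
- task_name: a string specifying the name of the task.
- dataset_config: a string specifying the dataset configuration (as used in datasets.load_dataset()).
- dataset_split: a string specifying the name of the split used in the metric computation.
- dataset_revision: a string specifying the git hash of a particular version of the dataset.
- dataset_args: a dictionary of additional arguments passed to datasets.load_dataset().
- metric_config: a string specifying the configuration of the metric.
- metric_args: a dictionary of arguments passed to Metric.compute().
- overwrite: a boolean specifying whether an existing metric field may be overwritten. If False, overwriting is not allowed. Defaults to False.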

2.3 logging

logging: see transformers.utils.logging; in the evaluate package, the logging module is located at evaluate.utils.logging.

For example:

evaluate.utils.logging.set_verbosity_warning()
evaluate.utils.logging.set_verbosity_info()
evaluate.utils.logging.set_verbosity_debug()
evaluate.utils.logging.get_verbosity() -> int
evaluate.utils.logging.set_verbosity(verbosity: int )
evaluate.utils.logging.disable_propagation()
evaluate.utils.logging.enable_propagation()

2.4 evaluator

evaluate.evaluator(task: str = None) -> Evaluator: returns the evaluator for the given task. It is a factory method that creates an evaluator; the evaluator encapsulates a task and a default metric name.
Parameters:
- task: a string specifying the task. Currently one of:
  - "image-classification": returns an ImageClassificationEvaluator.
  - "question-answering": returns a QuestionAnsweringEvaluator.
  - "text-classification" (alias "sentiment-analysis"): returns a TextClassificationEvaluator.
  - "token-classification": returns a TokenClassificationEvaluator.
Returns: an Evaluator object.
Example:
```
from evaluate import evaluator
evaluator("sentiment-analysis")
```
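class evaluate.Evaluator( task: str, default_metric_name: str = None): Evaluator is the base class of all evaluators.
Parameters:
- task: a string specifying the task this Evaluator is associated with.
- default_metric_name: a string specifying the name of the default metric.
Methods:
- check_required_columns(data: typing.Union[str, datasets.arrow_dataset.Dataset], columns_names: typing.Dict[str, str] ): checks the dataset to make sure the columns required for the evaluation are present.
  Parameters:
  - data: a string or a Dataset specifying the data to run the evaluation on.
  - columns_names: a dictionary of the column names to check. The keys are arguments of the compute() method, and the values are the column names to check for.
- compute_metric(): computes and returns the metrics.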
```
compute_metric(
  metric: EvaluationModule, 
  metric_inputs: typing.Dict, 
  strategy: typing.Literal['simple', 'bootstrap'] = 'simple', 
  confidence_level: float = 0.95, 
  n_resamples: int = 9999, 
  random_state: typing.Optional[int] = None
)
```
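  Parameters:
  - metric: an EvaluationModule specifying the metric used by the evaluator.
  - metric_inputs: a dictionary specifying the inputs of the metric.
  - strategy: a string, either "simple" or "bootstrap", specifying the evaluation strategy. Defaults to "simple".
    - "simple": evaluates the metric and returns the score.
    - "bootstrap": in addition to the score, computes a confidence interval for each metric (using scipy's bootstrap method).
  - confidence_level: a float, 0.95 by default; the confidence level passed to bootstrap when strategy = "bootstrap".
  - n_resamples: an integer, 9999 by default; the n_resamples value passed to bootstrap when strategy = "bootstrap".
  - random_state: an integer; the random_state value passed to bootstrap when strategy = "bootstrap".
- get_dataset_split(data, subset = None, split = None) -> split: returns the split of the dataset.
  Parameters:
  - data: a string specifying the name of the dataset.
  - subset: a string specifying the finer-grained configuration name of the dataset, such as "glue/cola".
  - split: a string specifying which split to use. If None, it is inferred automatically. A format such as test[:40] is also accepted.
  Returns: the split of the dataset.
- load_data(data: typing.Union[str, datasets.arrow_dataset.Dataset], subset: str = None, split: str = None) -> data(Dataset): loads the dataset to be used for the evaluation.
  Parameters:
  - data: a string or a Dataset specifying the dataset.
  - subset/split: see the get_dataset_split() method.
  Returns: the data to be used for the evaluation.
- predictions_processor( *args, **kwargs ): a core method of Evaluator that processes the pipeline output so that it is compatible with the metric.
- prepare_data(data: Dataset, input_column: str, label_column: str) -> dict: prepares the data.
  Parameters:
  - data: a Dataset object specifying the data to evaluate on.
  - input_column: a string, "text" by default, specifying which column of the dataset contains the text feature.
  - label_column: a string, "label" by default, specifying which column of the dataset contains the labels.
  Returns: a dictionary.
- prepare_metric( metric: typing.Union[str, evaluate.module.EvaluationModule]): prepares the metric.
  Parameters:
  - metric: a string or an EvaluationModule specifying which metric the evaluator uses.
- prepare_pipeline(): prepares the pipeline.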
```
prepare_pipeline( 
  model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')],
  tokenizer: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None,
  feature_extractor: typing.Union[ForwardRef('PreTrainedTokenizerBase'), ForwardRef('FeatureExtractionMixin')] = None,
  device: int = None )
```
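  Parameters:
  - model_or_pipeline: a string, a Pipeline, a callable, a PreTrainedModel, or a TFPreTrainedModel specifying the pipeline.
    - If not specified, the default pipeline for the task is used.
    - If it is a string (a model name) or a model object, it is used to initialize a new Pipeline.
    - Otherwise the argument is assumed to be a pre-initialized pipeline.
  - tokenizer: a tokenizer object. If model_or_pipeline is a model, this argument specifies the tokenizer; otherwise it is ignored.
  - feature_extractor: a feature_extractor object. If model_or_pipeline is a model, this argument specifies the feature extractor; otherwise it is ignored.
  - device: an integer specifying the CPU/GPU for the pipeline. -1 means CPU, and a non-negative integer refers to the corresponding GPU. If None, the device is inferred automatically: CUDA:0 is used if available, and the CPU otherwise.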

class evaluate.ImageClassificationEvaluator( task = 'image-classification', default_metric_name = None): the image classification evaluator. It is the default evaluator for the image-classification task and requires a data format compatible with ImageClassificationPipeline.

Parameters: see Evaluator.

Methods:

compute(): computes and returns the metrics.

compute(
  model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None,
  data: typing.Union[str, datasets.arrow_dataset.Dataset] = None,
  subset: typing.Optional[str] = None,
  split: typing.Optional[str] = None,
  metric: typing.Union[str, evaluate.module.EvaluationModule] = None,
  tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None,
  feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None,
  strategy: typing.Literal['simple', 'bootstrap'] = 'simple',
  confidence_level: float = 0.95,
  n_resamples: int = 9999,
  device: int = None,
  random_state: typing.Optional[int] = None,
  input_column: str = 'image',
  label_column: str = 'label',
  label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None 
)

Parameters:

- model_or_pipeline/tokenizer/feature_extractor/device: see Evaluator.prepare_pipeline().
- data/subset/split: see Evaluator.load_data().
- metric/strategy/confidence_level/n_resamples/random_state: see Evaluator.compute_metric().
- input_column/label_column: see Evaluator.prepare_data().
- label_mapping: a dictionary that aligns the labels in the pipeline output with the labels needed for the evaluation. For example, the labels in label_column can be integers (0/1), while the pipeline can produce label names such as "positive"/"negative".

Example:

from evaluate import evaluator
from datasets import load_dataset
task_evaluator = evaluator("image-classification")
data = load_dataset("beans", split="test[:40]")
results = task_evaluator.compute(
    model_or_pipeline="nateraw/vit-base-beans",
    data=data,
    label_column="labels",
    metric="accuracy",
    label_mapping={'angular_leaf_spot': 0, 'bean_rust': 1, 'healthy': 2},
    strategy="bootstrap"
)

class evaluate.QuestionAnsweringEvaluator(task = 'question-answering', default_metric_name = None): the question answering evaluator. It is the default evaluator for the question-answering task and requires a data format compatible with QuestionAnsweringPipeline.

Parameters: see Evaluator.

Methods:

compute(): computes and returns the metrics.

compute(
  model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, 
  data: typing.Union[str, datasets.arrow_dataset.Dataset] = None,
  subset: typing.Optional[str] = None,
  split: typing.Optional[str] = None,
  metric: typing.Union[str, evaluate.module.EvaluationModule] = None,
  tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None,
  strategy: typing.Literal['simple', 'bootstrap'] = 'simple',
  confidence_level: float = 0.95,
  n_resamples: int = 9999,
  device: int = None,
  random_state: typing.Optional[int] = None,
  question_column: str = 'question',
  context_column: str = 'context',
  id_column: str = 'id',
  label_column: str = 'answers',
  squad_v2_format: typing.Optional[bool] = None
)

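Parameters:

- model_or_pipeline/data/subset/split/metric/tokenizer/strategy/confidence_level/n_resamples/device/random_state: see ImageClassificationEvaluator.compute().
- question_column: a string specifying the name of the question column in the dataset.
- context_column: a string specifying the name of the context column in the dataset.
- id_column: a string specifying the name of the column that holds the id field of the (question, answer) pair.
- label_column: a string specifying the name of the label column in the dataset.
- squad_v2_format: a boolean specifying whether the dataset follows the format of the squad_v2 dataset, in which a question may have no answer in the context. If this parameter is not provided, the format is inferred automatically.

Example:
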
from evaluate import evaluator
from datasets import load_dataset
task_evaluator = evaluator("question-answering")
data = load_dataset("squad", split="validation[:2]")
results = task_evaluator.compute(
    model_or_pipeline="sshleifer/tiny-distilbert-base-cased-distilled-squad",
    data=data,
    metric="squad",
)

class evaluate.TextClassificationEvaluator(task = 'text-classification', default_metric_name = None): the text classification evaluator. It is the default evaluator for the text-classification task (also known as sentiment-analysis) and requires a data format compatible with TextClassificationPipeline.

Parameters: see Evaluator.

Methods:

compute(): computes and returns the metrics.

compute(
  model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None, 
  data: typing.Union[str, datasets.arrow_dataset.Dataset] = None,
  subset: typing.Optional[str] = None,
  split: typing.Optional[str] = None,
  metric: typing.Union[str, evaluate.module.EvaluationModule] = None,
  tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None,
  feature_extractor: typing.Union[str, ForwardRef('FeatureExtractionMixin'), NoneType] = None,
  strategy: typing.Literal['simple', 'bootstrap'] = 'simple',
  confidence_level: float = 0.95,
  n_resamples: int = 9999,
  device: int = None,
  random_state: typing.Optional[int] = None,
  input_column: str = 'text',
  second_input_column: typing.Optional[str] = None,
  label_column: str = 'label',
  label_mapping: typing.Union[typing.Dict[str, numbers.Number], NoneType] = None
)

Parameters:

- model_or_pipeline/data/subset/split/metric/tokenizer/strategy/confidence_level/n_resamples/device/random_state/label_mapping: see ImageClassificationEvaluator.compute().
- input_column/label_column: see Evaluator.prepare_data().
- second_input_column: a string specifying the name of the column that holds the second input.

Example:

from evaluate import evaluator
from datasets import load_dataset
task_evaluator = evaluator("text-classification")
data = load_dataset("imdb", split="test[:2]")
results = task_evaluator.compute(
    model_or_pipeline="huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli",
    data=data,
    metric="accuracy",
    label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
    strategy="bootstrap",
    n_resamples=10,
    random_state=0
)

class evaluate.TokenClassificationEvaluator(task = 'token-classification', default_metric_name = None): the token classification evaluator. It is the default evaluator for the token-classification task and requires a data format compatible with TokenClassificationPipeline.

Parameters: see Evaluator.

Methods:

compute(): computes and returns the metrics.

compute(
  model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None,
  data: typing.Union[str, datasets.arrow_dataset.Dataset] = None,
  subset: typing.Optional[str] = None,
  split: str = None,
  metric: typing.Union[str, evaluate.module.EvaluationModule] = None,
  tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None,
  strategy: typing.Literal['simple', 'bootstrap'] = 'simple',
  confidence_level: float = 0.95,
  n_resamples: int = 9999,
  device: typing.Optional[int] = None,
  random_state: typing.Optional[int] = None,
  input_column: str = 'tokens',
  label_column: str = 'ner_tags',
  join_by: typing.Optional[str] = ' ' 
)

Parameters:

- model_or_pipeline/data/subset/split/metric/tokenizer/strategy/confidence_level/n_resamples/device/random_state/input_column/label_column: see TextClassificationEvaluator.compute().
- join_by: a string. Most datasets are already tokenized, whereas the pipeline expects a string, so the tokens have to be joined before they are passed to the pipeline. join_by specifies the string used to join them and defaults to a space.

Example:

from evaluate import evaluator
from datasets import load_dataset
task_evaluator = evaluator("token-classification")
data = load_dataset("conll2003", split="validation[:2]")
results = task_evaluator.compute(
    model_or_pipeline="elastic/distilbert-base-uncased-finetuned-conll03-english",
    data=data,
    metric="seqeval",
)

The following dataset is compatible with TokenClassificationEvaluator:


from datasets import ClassLabel, Dataset, Features, Sequence, Value

dataset = Dataset.from_dict(
    mapping={
        "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
        "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
    },
    features=Features({
        "tokens": Sequence(feature=Value(dtype="string")),
        "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
        }),
)

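The following dataset is not compatible with TokenClassificationEvaluator, because the text is a single string annotated with character spans instead of one tag per token:


from datasets import Dataset, Features, Sequence, Value

dataset = Dataset.from_dict(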
    mapping={
        "tokens": [["New York is a city and Felix a person."]],
        "starts": [[0, 23]],
        "ends": [[7, 27]],
        "ner_tags": [["LOC", "PER"]],
    },
    features=Features({
        "tokens": Value(dtype="string"),
        "starts": Sequence(feature=Value(dtype="int32")),
        "ends": Sequence(feature=Value(dtype="int32")),
        "ner_tags": Sequence(feature=Value(dtype="string")),
    }),
)

class evaluate.Text2TextGenerationEvaluator(task = 'text2text-generation', default_metric_name = None ): the Text2Text generation evaluator. It is the default evaluator for the text2text-generation task and requires a data format compatible with Text2TextGenerationPipeline.

Parameters: see Evaluator.

Methods:

compute(): computes and returns the metrics.

compute(
  model_or_pipeline: typing.Union[str, ForwardRef('Pipeline'), typing.Callable, ForwardRef('PreTrainedModel'), ForwardRef('TFPreTrainedModel')] = None,
  data: typing.Union[str, datasets.arrow_dataset.Dataset] = None,
  subset: typing.Optional[str] = None,
  split: typing.Optional[str] = None,
  metric: typing.Union[str, evaluate.module.EvaluationModule] = None,
  tokenizer: typing.Union[str, ForwardRef('PreTrainedTokenizer'), NoneType] = None,
  strategy: typing.Literal['simple', 'bootstrap'] = 'simple',
  confidence_level: float = 0.95,
  n_resamples: int = 9999,
  device: int = None,
  random_state: typing.Optional[int] = None,
  input_column: str = 'text',
  label_column: str = 'label',
  generation_kwargs: dict = None
)

Parameters:

- model_or_pipeline/data/subset/split/metric/tokenizer/strategy/confidence_level/n_resamples/device/random_state/input_column/label_column: see TextClassificationEvaluator.compute().
- generation_kwargs: keyword arguments passed on to the pipeline.

class evaluate.SummarizationEvaluator(task = 'summarization', default_metric_name = None ): the text summarization evaluator. It is the default evaluator for the summarization task and requires a data format compatible with SummarizationPipeline.
Parameters: see Evaluator.
Methods:
- compute(): computes and returns the metrics.
  For the parameters, see Text2TextGenerationEvaluator.compute().
class evaluate.TranslationEvaluator(task = 'translation', default_metric_name = None): the translation evaluator. It is the default evaluator for the translation task and requires a data format compatible with TranslationPipeline.
Parameters: see Evaluator.
Methods:
- compute(): computes and returns the metrics.
  For the parameters, see Text2TextGenerationEvaluator.compute().