Rapid uses a special pipeline for subjective or generative tasks.
The traditional evaluation task system does not translate well to generative text tasks, because task accuracy is computed by comparing labelers' responses to answer keys. For generative data, there is no single right answer to each task, so it is not feasible to grade against a fixed answer key.
We instead use a separate pipeline called the 'Generative Pipeline', which splits the work into two roles: attempters and reviewers. Attempters focus on text generation (e.g. writing a sentence based on an image), whereas reviewers look at each response and pass a 'yes' or 'no' judgment. In this pipeline, reviewers cannot change the response.
Reviewers are still evaluated based on evaluation tasks, where they will be graded solely on whether or not they correctly accepted or rejected a given response.
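As a rough illustration of this grading rule, the sketch below scores a reviewer by the fraction of review evaluation tasks where their accept/reject judgment matches the expected one. The names `ReviewEvalResult` and `reviewer_accuracy` are hypothetical and are not part of Rapid's API; how Rapid actually aggregates these results is not specified here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ReviewEvalResult:
    """One review evaluation task outcome (hypothetical structure)."""
    expected_accept: bool   # judgment configured on the evaluation task
    reviewer_accept: bool   # judgment the reviewer actually gave


def reviewer_accuracy(results: Sequence[ReviewEvalResult]) -> Optional[float]:
    """Fraction of evaluation tasks where the reviewer matched the expected judgment."""
    if not results:
        return None  # no evaluation tasks seen yet, so no score
    correct = sum(r.expected_accept == r.reviewer_accept for r in results)
    return correct / len(results)
```

Only the accept/reject decision is compared; the text of the response plays no part in the reviewer's grade.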
An attempter's performance is then measured by the reviewers' responses. This gives us a high-quality signal on attempters without restricting their creativity via evaluation tasks.
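One simple way to picture this is an acceptance rate per attempter. The sketch below assumes each review is recorded as an `(attempter_id, accepted)` pair and that the score is just the share of accepted responses; both the data shape and the function name `attempter_acceptance_rates` are assumptions for illustration, and Rapid's actual scoring may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def attempter_acceptance_rates(
    reviews: Iterable[Tuple[str, bool]]
) -> Dict[str, float]:
    """Map each attempter to the fraction of their responses that reviewers accepted.

    `reviews` is an iterable of (attempter_id, accepted) pairs (hypothetical format).
    """
    accepted = defaultdict(int)
    total = defaultdict(int)
    for attempter_id, was_accepted in reviews:
        total[attempter_id] += 1
        if was_accepted:
            accepted[attempter_id] += 1
    # Every attempter in `total` has at least one review, so no division by zero.
    return {a: accepted[a] / total[a] for a in total}
```

For example, an attempter with one accepted and one rejected response would score 0.5 under this assumption.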
The Quality Lab will look slightly different for a project using the generative pipeline. There will no longer be a tab for attempter phase evaluation tasks, since they do not exist in this pipeline. When viewing a review phase evaluation task, you can adjust both the initial response a reviewer will see and the judgment expected of them for that response: "Accept" or "Reject".