Skip to content

Commit

Permalink
Update README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
ftgreat authored Sep 29, 2024
1 parent ab01bf5 commit c7f3df1
Showing 1 changed file with 2 additions and 4 deletions.
6 changes: 2 additions & 4 deletions examples/CCI3-HQ/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,12 +8,10 @@ To improve the quality of Chinese corpora, we followed [Fineweb-edu's](https://h
## Annotation
We used Qwen2-72B-Instruct to score 145,000 pairs of web samples and their scores from 0 to 5, generated by Qwen2. The samples were annotated based on their educational quality with 0 being not educational and 5 being highly educational.

The prompt used for annotation mostly reuses [FineWeb-edu prompt](./prompt.txt).

The prompt used for annotation mostly reuses [FineWeb-edu prompt](./prompt.txt). You can use [qwen2_api](./qwen2_api.py) to request an already deployed API (such as one using vLLM) for annotation.

## Classifier training
The classifier was trained on We added a classification head with a single regression output to [BGE-M3](https://huggingface.co/BAAI/bge-m3) and trained the model for 20 epochs with a learning rate of 3e-4. During training, the embedding and encoder layers were frozen to focus on the classification head and dropout was not used. The model achieved an F1 score of 73% when converted to a binary classifier using a score threshold of 3.

The classifier was trained on We added a classification head with a single regression output to [BGE-M3](https://huggingface.co/BAAI/bge-m3) and trained the model for 20 epochs with a learning rate of 3e-4. During training, the embedding and encoder layers were frozen to focus on the classification head and dropout was not used. The model achieved an F1 score of 73% when converted to a binary classifier using a score threshold of 3. [Training script](./run_classification_trainval.sh) is provided here.

The classifier is available at: https://huggingface.co/BAAI/cci3-hq-classifier

Expand Down

0 comments on commit c7f3df1

Please sign in to comment.