YOLOv8 is the latest version of YOLO by Ultralytics. As a cutting-edge, state-of-the-art (SOTA) model, YOLOv8 builds on the success of previous versions, introducing new features and improvements for enhanced performance, flexibility, and efficiency. YOLOv8 supports a full range of vision AI tasks, including detection, segmentation, pose estimation, tracking, and classification. This versatility allows users to leverage YOLOv8's capabilities across diverse applications and domains.
To adapt the model to the layout analysis task, we made the following improvements to YOLOv8:

- Increased the input resolution to 800 × 800;
- Used the P3-P6 detection heads (see the grid-size sketch below);
- Removed unnecessary data augmentations (the Mosaic, MixUp, and HSV augmentation methods).
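As a rough illustration of what the P3-P6 heads imply at this resolution, the per-level grid sizes and the number of candidate boxes can be estimated as follows. This is a sketch only: the strides of 8/16/32/64 and the ceiling-based downsampling are assumptions about the usual YOLOv8 design, not values taken from the MindOCR source.

```python
import math

# Assumed P3-P6 strides for a YOLOv8-style detector (not from the MindOCR code).
input_size = 800
strides = {"P3": 8, "P4": 16, "P5": 32, "P6": 64}

total_cells = 0
for level, stride in strides.items():
    grid = math.ceil(input_size / stride)   # approximate feature-map size at this level
    total_cells += grid * grid
    print(f"{level}: stride {stride:<2d} -> {grid} x {grid} grid")

# In an anchor-free head, each grid cell proposes one box per level.
print(f"~{total_cells} candidate boxes per image")
```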
In our experiments, the evaluation results on the public benchmark dataset PubLayNet are as follows:
| Model  | Context        | Mean Average Precision (mAP) | Train T.       | FPS         | Recipe | Download       |
|--------|----------------|------------------------------|----------------|-------------|--------|----------------|
| YOLOv8 | D910x4-MS2.2-G | 94.4%                        | 335.31 ms/step | 47.01 img/s | yaml   | ckpt \| mindir |
Notes:
- Context: the training context, denoted as {device}x{pieces}-{MS mode}, where the MindSpore mode can be G (graph mode) or F (PyNative mode with ms function). For example, D910x4-MS2.2-G means training on 4 Ascend 910 NPUs in graph mode based on MindSpore version 2.2.
- To reproduce the result in other contexts, please ensure the global batch size is the same.
- The models are trained from scratch without any pre-training. For more details on the training and evaluation datasets, please refer to the [PubLayNet Dataset Preparation](#312-publaynet-dataset-preparation) section.
- The input shape of the MindIR of YOLOv8 is (1, 3, 800, 800).
Please refer to the installation instructions in MindOCR.
PubLayNet is a dataset for document layout analysis. It contains images of research papers and articles, together with annotations for various page elements such as "text", "list", and "figure". The dataset was obtained by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central.
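The annotations follow the standard COCO format. As a quick sanity check of a prepared dataset, you can list the categories and count images and annotations; this minimal sketch only assumes the usual COCO JSON layout and the `publaynet/val.json` path referenced in the config below:

```python
import json

# Annotation file as referenced in the config below; adjust to your local layout.
ann_path = "publaynet/val.json"

with open(ann_path, "r") as f:
    coco = json.load(f)

# PubLayNet defines five layout categories: text, title, list, table, figure.
print("categories :", {c["id"]: c["name"] for c in coco["categories"]})
print("images     :", len(coco["images"]))
print("annotations:", len(coco["annotations"]))
```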
Apart from the dataset setting, please also check the following important args: `system.distribute`, `system.val_while_train`, `common.batch_size`, `train.ckpt_save_dir`, `train.dataset.dataset_path`, `eval.ckpt_load_path`, `eval.dataset.dataset_path`, `eval.loader.batch_size`. Explanations of these important args:
```yaml
system:
  distribute: &distribute True  # `True` for distributed training, `False` for standalone training
  amp_level: 'O0'
  amp_level_infer: "O0"
  seed: 42
  val_while_train: False        # Validate while training
  drop_overflow_update: False
common:
  ...
  batch_size: 16                # Batch size for training
  annotations_path: publaynet/val.json
  ...
train:
  ckpt_save_dir: './tmp_layout' # The training result (including checkpoints, per-epoch performance and curves) saving directory
  dataset_sink_mode: False
  dataset:
    type: PublayNetDataset
    dataset_path: publaynet/train.txt       # Path of training dataset
    ...
eval:
  ckpt_load_path: './tmp_layout/best.ckpt'  # Checkpoint file path
  dataset_sink_mode: False
  dataset:
    type: PublayNetDataset
    dataset_path: publaynet/val.txt         # Path of validation dataset
    ...
  loader:
    shuffle: False
    batch_size: 16                          # Batch size for validation
    ...
```
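To double-check these values before launching a run, you can load the config with PyYAML and print the relevant entries. This is a minimal sketch, assuming PyYAML is installed and the config path used by the commands below:

```python
import yaml

# Config file used by the training/evaluation commands below.
cfg_path = "configs/layout/yolov8/yolov8n.yaml"

with open(cfg_path, "r") as f:
    cfg = yaml.safe_load(f)

print("system.distribute         :", cfg["system"]["distribute"])
print("system.val_while_train    :", cfg["system"]["val_while_train"])
print("common.batch_size         :", cfg["common"]["batch_size"])
print("train.ckpt_save_dir       :", cfg["train"]["ckpt_save_dir"])
print("train.dataset.dataset_path:", cfg["train"]["dataset"]["dataset_path"])
print("eval.ckpt_load_path       :", cfg["eval"]["ckpt_load_path"])
print("eval.dataset.dataset_path :", cfg["eval"]["dataset"]["dataset_path"])
print("eval.loader.batch_size    :", cfg["eval"]["loader"]["batch_size"])
```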
Notes:
- As the global batch size (batch_size x num_devices) is important for reproducing the result, please adjust `batch_size` accordingly to keep the global batch size unchanged for a different number of NPUs, or adjust the learning rate linearly to the new global batch size.
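For example, the reported recipe corresponds to a global batch size of 4 × 16 = 64. A minimal sketch of the adjustment arithmetic follows; the linear-scaling rule is a common heuristic, and the base learning rate below is a placeholder rather than the value from the recipe:

```python
# Reference setting from the recipe: 4 devices x batch_size 16 = global batch size 64.
ref_devices, ref_batch_size = 4, 16
ref_global = ref_devices * ref_batch_size          # 64

# Hypothetical new setup, e.g. training on a single NPU.
new_devices, new_batch_size = 1, 16
new_global = new_devices * new_batch_size          # 16

# Option 1: keep the global batch size by raising the per-device batch size.
print("per-device batch size to match:", ref_global // new_devices)   # 64

# Option 2: keep batch_size and scale the learning rate linearly instead.
base_lr = 1e-3                                     # placeholder, not the recipe value
scaled_lr = base_lr * new_global / ref_global
print("linearly scaled lr:", scaled_lr)            # 0.00025
```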
- Distributed Training

  It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please set the configuration parameter `distribute` to True and run:

  ```shell
  # distributed training on multiple Ascend devices
  mpirun --allow-run-as-root -n 4 python tools/train.py --config configs/layout/yolov8/yolov8n.yaml
  ```
- Standalone Training

  If you want to train or finetune the model on a smaller dataset without distributed training, please set the configuration parameter `distribute` to False and run:

  ```shell
  # standalone training on a CPU/Ascend device
  python tools/train.py --config configs/layout/yolov8/yolov8n.yaml
  ```
The training result (including checkpoints, per-epoch performance and curves) will be saved in the directory parsed by the arg `ckpt_save_dir`. The default directory is `./tmp_layout`.
To evaluate the accuracy of the trained model, you can use `eval.py`. Please set the checkpoint path in the arg `ckpt_load_path` in the `eval` section of the yaml config file, set `distribute` to False, and then run:

```shell
python tools/eval.py --config configs/layout/yolov8/yolov8n.yaml
```
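The mAP in the results table is the standard COCO-style metric on the PubLayNet validation set. If you want to score a prediction file yourself, a minimal sketch with pycocotools looks like the following; it assumes the COCO-format ground truth `publaynet/val.json` and a hypothetical `predictions.json` in COCO detection-result format, and it is not the code path used by `tools/eval.py`:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground-truth annotations (COCO format) and detections to score.
coco_gt = COCO("publaynet/val.json")
coco_dt = coco_gt.loadRes("predictions.json")   # hypothetical prediction file

# Standard COCO bbox evaluation: reports AP@[.5:.95], AP50, AP75, etc.
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
```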
To run inference with MindSpore Lite on Ascend 310, please refer to the MindOCR Inference tutorial. In short, the whole process consists of the following steps:
1. Model Export
Please download the exported MindIR file first, or refer to the Model Export tutorial and use the following command to export the trained ckpt model to a MindIR file:

```shell
python tools/export.py --model_name_or_config configs/layout/yolov8/yolov8n.yaml --data_shape 800 800 --local_ckpt_path /path/to/local_ckpt.ckpt
```
The `data_shape` is the model input shape (height and width) for the MindIR file. The shape value of the MindIR in the download link can be found in the Notes under the results table. `distribute` in the yaml config shall be set to False.
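As a rough illustration of what this input shape means for the exported model, the sketch below resizes an image to a (1, 3, 800, 800) float array. It is a simplified stand-in only: the actual preprocessing (e.g. aspect-ratio-preserving letterbox padding and normalization constants) is defined by the MindOCR inference code.

```python
import numpy as np
from PIL import Image

def to_model_input(img_path, size=800):
    """Resize an image to size x size and convert it to a (1, 3, H, W) float array.

    Simplified stand-in for the real preprocessing, which may use letterbox
    padding instead of a plain resize.
    """
    img = Image.open(img_path).convert("RGB").resize((size, size))
    arr = np.asarray(img, dtype=np.float32) / 255.0   # HWC, values in [0, 1]
    arr = arr.transpose(2, 0, 1)[None, ...]           # -> NCHW: (1, 3, 800, 800)
    return arr

print(to_model_input("publaynet/val/PMC4958442_00003.jpg").shape)  # (1, 3, 800, 800)
```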
2. Environment Installation
Please refer to the Environment Installation tutorial to configure the MindSpore Lite inference environment.
3. Model Conversion
Please refer to the Model Conversion tutorial and use the `converter_lite` tool for offline conversion of the MindIR file.
4. Inference
Assuming that you obtain `output.mindir` after model conversion, go to the `deploy/py_infer` directory and use the following command for inference:
```shell
python infer.py \
    --input_images_dir=/your_path_to/val \
    --layout_model_path=your_path_to/output.mindir \
    --layout_model_name_or_config=../../configs/layout/yolov8/yolov8n.yaml \
    --res_save_dir=results_dir
```
The inference results can be visualized using the following code:
```python
from matplotlib import pyplot as plt
import matplotlib.patches as patches
from PIL import Image

img_path = 'publaynet/val/PMC4958442_00003.jpg'
img = Image.open(img_path)
fig, ax = plt.subplots()
ax.imshow(img)

# PubLayNet category ids and the colors used to draw them
category_dict = {1: 'text', 2: 'title', 3: 'list', 4: 'table', 5: 'figure'}
color_dict = {1: 'r', 2: 'b', 3: 'g', 4: 'c', 5: 'm'}

# Inference results; bbox is in COCO format [x_min, y_min, width, height]
results = [{"category_id": 1, "bbox": [308.25, 559.25, 240.5, 81.5], "score": 0.98438},
           {"category_id": 1, "bbox": [50.5, 672.75, 240.5, 70.5], "score": 0.9834},
           {"category_id": 3, "bbox": [322.875, 349.0, 226.25, 203.0], "score": 0.97949},
           {"category_id": 1, "bbox": [308.25, 638.75, 240.5, 70.5], "score": 0.97949},
           {"category_id": 1, "bbox": [50.688, 605.0, 240.125, 70.0], "score": 0.97949},
           {"category_id": 1, "bbox": [50.5, 423.125, 240.0, 183.75], "score": 0.97754},
           {"category_id": 1, "bbox": [308.25, 707.0, 240.5, 36.0], "score": 0.97461},
           {"category_id": 1, "bbox": [308.875, 294.0, 240.25, 47.5], "score": 0.97461},
           {"category_id": 1, "bbox": [308.625, 230.5, 239.75, 43.75], "score": 0.96875},
           {"category_id": 4, "bbox": [51.875, 100.5, 240.25, 273.5], "score": 0.96875},
           {"category_id": 5, "bbox": [308.625, 74.375, 237.75, 149.25], "score": 0.9668},
           {"category_id": 1, "bbox": [50.688, 70.625, 240.125, 22.0], "score": 0.94141},
           {"category_id": 2, "bbox": [50.562, 403.625, 67.375, 12.75], "score": 0.92578},
           {"category_id": 1, "bbox": [51.312, 374.625, 171.875, 10.75], "score": 0.7666},
           {"category_id": 4, "bbox": [53.625, 80.25, 493.75, 144.0], "score": 0.00247},
           {"category_id": 1, "bbox": [51.812, 144.625, 27.875, 12.25], "score": 0.00241},
           {"category_id": 1, "bbox": [52.625, 159.125, 14.0, 11.75], "score": 0.00184},
           {"category_id": 4, "bbox": [52.0, 207.5, 497.0, 164.5], "score": 0.00173},
           {"category_id": 3, "bbox": [326.25, 349.75, 222.5, 64.5], "score": 0.00133},
           {"category_id": 2, "bbox": [52.25, 144.938, 27.25, 12.125], "score": 0.00107}]

for item in results:
    category_id = item['category_id']
    bbox = item['bbox']
    score = item['score']
    if score < 0.8:  # skip low-confidence detections
        continue
    left, top, w, h = bbox
    rect = patches.Rectangle((left, top), w, h, linewidth=1,
                             edgecolor=color_dict[category_id], facecolor='none')
    ax.add_patch(rect)
    ax.text(left, top, '{} {}'.format(category_dict[category_id], score), fontsize=8, color='w',
            bbox=dict(facecolor=color_dict[category_id], edgecolor='none', boxstyle='round'))

plt.axis('off')
plt.show()
```
The visualization results are as follows: