# Tiny-Darknet

**Repository Path**: Zipei-Chen/tiny-darknet

## Basic Information

- **Project Name**: Tiny-Darknet
- **Description**: No description available
- **Primary Language**: Python
- **License**: AGPL-3.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2020-10-23
- **Last Updated**: 2020-12-22

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# Contents

- [Tiny-DarkNet Description](#tiny-darknet-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
    - [Distributed](#distributed)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
        - [Training](#training)
        - [Distributed Training](#distributed-training)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Evaluation Performance](#evaluation-performance)
        - [Inference Performance](#inference-performance)
    - [How to use](#how-to-use)
        - [Inference](#inference)
        - [Continue Training on the Pretrained Model](#continue-training-on-the-pretrained-model)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)

# [Tiny-DarkNet Description](#contents)

Tiny-DarkNet is a 16-layer image classification network proposed by Joseph Chet Redmon et al. for the classic ImageNet image classification dataset. The authors designed Tiny-DarkNet as a simplified version of Darknet, keeping the model as small as possible to meet the need for compact models: it outperforms AlexNet and SqueezeNet at image classification while using fewer parameters than either. To keep the model small, Tiny-DarkNet uses no fully connected layers and consists only of convolution, max-pooling, and average-pooling layers.
[Link](https://pjreddie.com/darknet/tiny-darknet/): official description of the Tiny-DarkNet network

# [Model Architecture](#contents)

Specifically, the Tiny-DarkNet network consists of **1×1 conv**, **3×3 conv**, **2×2 max-pooling**, and global average-pooling blocks. Stacked together, these modules transform the input image into a **1×1000** vector.

# [Dataset](#contents)

The dataset used by the model and its source are described below.

The ImageNet dataset is described in the paper [ImageNet: A large-scale hierarchical image database]().

- Dataset size: 125G, 1.25 million color images in 1000 classes
    - Training set: 120G, 1.2 million images
    - Test set: 5G, 50 thousand images
- Data format: RGB images
- Note: the data is processed by the functions in src/dataset.py

# [Features](#contents)

## [Distributed](#contents)

In deep learning, as datasets and parameter counts grow, training time and hardware requirements increase until they become the bottleneck. [Distributed parallel training]() reduces the demand on memory and compute, and is an important optimization for training. This model uses MindSpore's automatic parallel mode AUTO_PARALLEL, a distributed parallel mode that combines data parallelism, model parallelism, and hybrid parallelism: it automatically builds a cost model, finds a parallel strategy with a short training time, and selects a parallel mode for the user.

# [Environment Requirements](#contents)

- Hardware (Ascend/GPU)
    - Prepare a hardware environment with Ascend or GPU processors. To use Ascend resources, send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can use the Ascend resources.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, see:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)

# [Quick Start](#contents)

After installing MindSpore from the official website, you can train and evaluate the model as follows.

- Running on Ascend:

```bash
# run training example
python train.py > train.log 2>&1 &

# run distributed training example
bash scripts/run_train.sh rank_table.json

# run evaluation example
python eval.py > eval.log 2>&1 &
OR
bash run_eval.sh
```

For parallel training, an hccl configuration file in JSON format needs to be created in advance. Please follow the instructions at https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
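As an aside before the GPU instructions, the block composition described under Model Architecture can be summarized with a pure-Python shape trace. The layer sizes below follow the layer table on the linked official Tiny Darknet page and are illustrative only; the authoritative MindSpore definition is in src/tinydarknet.py.

```python
# Illustrative shape trace of the Tiny-DarkNet layer stack (layer sizes taken
# from the reference Tiny Darknet description, not from src/tinydarknet.py).
# "c3"/"c1" are 3x3 (pad 1) and 1x1 convolutions, "max" is 2x2 max-pooling
# with stride 2, and "avg" is global average pooling.
LAYERS = [
    ("c3", 16), ("max", None),
    ("c3", 32), ("max", None),
    ("c1", 16), ("c3", 128), ("c1", 16), ("c3", 128), ("max", None),
    ("c1", 32), ("c3", 256), ("c1", 32), ("c3", 256), ("max", None),
    ("c1", 64), ("c3", 512), ("c1", 64), ("c3", 512),
    ("c1", 128), ("c1", 1000),
    ("avg", None),
]

def trace(h=224, w=224, c=3):
    """Propagate an input shape through the layer list, printing each step."""
    for kind, out_c in LAYERS:
        if kind in ("c3", "c1"):   # padded convolutions keep the spatial size
            c = out_c
        elif kind == "max":        # 2x2/stride-2 pooling halves H and W
            h, w = h // 2, w // 2
        elif kind == "avg":        # global average pooling collapses H and W
            h = w = 1
        print(f"{kind:>3} -> {h}x{w}x{c}")
    return h, w, c

trace()  # ends at 1x1x1000, i.e. the 1x1000 output vector
```

Note how the 16 convolutions do all the classification work and the four max-pooling steps shrink 224×224 down to 14×14 before global average pooling, which is what lets the network drop fully connected layers entirely.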
- Running on GPU:

To train on GPU, first change `device_target` in src/config.py from `Ascend` to `GPU`.

```bash
# run training example
export CUDA_VISIBLE_DEVICES=0
python train.py > train.log 2>&1 &

# run distributed training example
bash scripts/run_train_gpu.sh 8 0,1,2,3,4,5,6,7

# run evaluation example
python eval.py --checkpoint_path=[CHECKPOINT_PATH] > eval.log 2>&1 &
OR
bash run_eval_gpu.sh [CHECKPOINT_PATH]
```

For more details, please refer to the script files themselves.

# [Script Description](#contents)

## [Script and Sample Code](#contents)

```
├── Tiny-DarkNet
    ├── README.md              // descriptions about tinydarknet
    ├── scripts
    │   ├── run_train.sh       // shell script for distributed training on Ascend
    │   ├── run_train_gpu.sh   // shell script for distributed training on GPU
    │   ├── run_eval.sh        // shell script for evaluation on Ascend
    │   ├── run_eval_gpu.sh    // shell script for evaluation on GPU
    ├── src
    │   ├── dataset.py         // creating dataset
    │   ├── tinydarknet.py     // tinydarknet architecture
    │   ├── config.py          // parameter configuration
    ├── train.py               // training script
    ├── eval.py                // evaluation script
    ├── export.py              // export checkpoint files into air/onnx
```

## [Script Parameters](#contents)

Training and evaluation parameters can be set in config.py.

Configuration for Tiny-DarkNet on the ImageNet dataset:

```python
'pre_trained': 'False'    # whether training based on the pre-trained model
'num_classes': 1000       # the number of classes in the dataset
'lr_init': 0.1            # initial learning rate
'batch_size': 128         # training batch size
'epoch_size': 500         # total training epochs
'momentum': 0.9           # momentum
'weight_decay': 1e-4      # weight decay value
'image_height': 224       # image height used as input to the model
'image_width': 224        # image width used as input to the model
'data_path': './ImageNet_Original/train/'    # absolute full path to the train datasets
'val_data_path': './ImageNet_Original/val/'  # absolute full path to the evaluation datasets
'device_target': 'Ascend' # device running the program
'device_id': 0            # device ID used to train or evaluate the dataset.
                          # Ignore it when you use run_train.sh for distributed training
'keep_checkpoint_max': 10 # only keep the last keep_checkpoint_max checkpoints
'checkpoint_path': './train_tinydarknet_imagenet-125_390.ckpt'  # the absolute full path of the saved checkpoint file
'onnx_filename': 'tinydarknet.onnx'  # file name of the onnx model used in export.py
'air_filename': 'tinydarknet.air'    # file name of the air model used in export.py
'lr_scheduler': 'exponential'        # learning rate scheduler
'lr_epochs': [70, 140, 210, 280]     # epochs at which the lr changes
'lr_gamma': 0.3             # decrease lr by a factor of lr_gamma in the exponential lr_scheduler
'eta_min': 0.0              # eta_min in the cosine_annealing scheduler
'T_max': 150                # T_max in the cosine_annealing scheduler
'warmup_epochs': 0          # warmup epochs
'is_dynamic_loss_scale': 0  # dynamic loss scale
'loss_scale': 1024          # loss scale
'label_smooth_factor': 0.1  # label smooth factor
'use_label_smooth': True    # label smooth
```

For more details, please refer to `config.py`.

## [Training Process](#contents)

### Training

- Running on Ascend:

```bash
python train.py > train.log 2>&1 &
```

The python command above runs in the background; you can check the results in the `train.log` file. After training, by default, some checkpoint files can be found in the script folder. The training loss is displayed as follows:

```
# grep "loss is " train.log
epoch: 1 step: 390, loss is 1.4842823
epoch: 2 step: 390, loss is 1.0897788
...
```

The model checkpoint files are saved in the current folder.

- Running on GPU:

```bash
export CUDA_VISIBLE_DEVICES=0
python train.py > train.log 2>&1 &
```

The python command above runs in the background; you can check the results in the `train.log` file. After training, by default, some checkpoint files can be found in the `./ckpt_0/` folder.

### Distributed Training

- Running on Ascend:

```bash
sh scripts/run_train.sh rank_table.json
```

The script above runs distributed training in the background; you can check the results in the `train_parallel[X]/log` files. The training loss is displayed as follows:

```
# grep "result: " train_parallel*/log
train_parallel0/log:epoch: 1 step: 48, loss is 1.4302931
train_parallel0/log:epoch: 2 step: 48, loss is 1.4023874
...
train_parallel1/log:epoch: 1 step: 48, loss is 1.3458025
train_parallel1/log:epoch: 2 step: 48, loss is 1.3729336
...
...
```

- Running on GPU:

```bash
sh scripts/run_train_gpu.sh 8 0,1,2,3,4,5,6,7
```

The script above runs distributed training in the background; you can check the results in the `train/train.log` file.

## [Evaluation Process](#contents)

### Evaluation

- Evaluating on Ascend with the ImageNet dataset:

Before running the commands below, check the checkpoint path used for evaluation. Please set the checkpoint path to an absolute path, e.g. "username/imagenet/train_tiny-darknet_imagenet-125_390.ckpt".

```bash
python eval.py > eval.log 2>&1 &
OR
sh scripts/run_eval.sh
```

The python command above runs in the background; you can check the results in the "eval.log" file. The accuracy on the test set is reported as follows:

```
# grep "accuracy: " eval.log
accuracy: {'acc': 0.934}
```

Note that after parallel training, set checkpoint_path to the last saved checkpoint file before evaluating; the accuracy is then reported as follows:

```
# grep "accuracy: " eval.log
accuracy: {'acc': 0.9217}
```

- Evaluating on GPU with the ImageNet dataset:

Before running the commands below, check the checkpoint path used for evaluation. Please set the checkpoint path to an absolute path, e.g. "username/imagenet/train_tiny-darknet_imagenet-125_390.ckpt".

```bash
python eval.py --checkpoint_path=[CHECKPOINT_PATH] > eval.log 2>&1 &
```

The python command above runs in the background; you can check the results in the "eval.log" file. The accuracy on the test set is reported as follows:

```
# grep "accuracy: " eval.log
accuracy: {'acc': 0.930}
```

OR,

```bash
bash scripts/run_eval_gpu.sh [CHECKPOINT_PATH]
```

The script above runs in the background; you can check the results in the "eval/eval.log" file.
The accuracy on the test set is reported as follows:

```
# grep "accuracy: " eval/eval.log
accuracy: {'acc': 0.930}
```

# [Model Description](#contents)

## [Performance](#contents)

### Evaluation Performance

#### Tiny-DarkNet on 1200k images

| Parameters                 | Ascend                                                      |
| -------------------------- | ----------------------------------------------------------- |
| Model Version              | Tiny-DarkNet                                                |
| Resource                   | Ascend 910, CPU 2.60GHz, 56 cores, Memory 314G              |
| Uploaded Date              | 10/28/2020 (month/day/year)                                 |
| MindSpore Version          | 1.0.0                                                       |
| Dataset                    | 1200k images                                                |
| Training Parameters        | epoch=500, steps=5000, batch_size=128, lr=0.1               |
| Optimizer                  | Momentum                                                    |
| Loss Function              | Softmax Cross Entropy                                       |
| Outputs                    | probability                                                 |
| Loss                       | 2.0                                                         |
| Speed                      | 1 pc: 152 ms/step; 8 pcs: 171 ms/step                       |
| Total time                 | 8 pcs: 8.8 hours                                            |
| Parameters (M)             | 13.0                                                        |
| Checkpoint for Fine tuning | 52M (.ckpt file)                                            |
| Scripts                    | [googlenet script](https://gitee.com/mindspore/mindspore/tree/r0.7/model_zoo/official/cv/googlenet) |

### Inference Performance

#### Tiny-DarkNet on 1200k images

| Parameters        | Ascend                      |
| ----------------- | --------------------------- |
| Model Version     | Tiny-DarkNet                |
| Resource          | Ascend 910                  |
| Uploaded Date     | 10/28/2020 (month/day/year) |
| MindSpore Version | 1.0.0                       |
| Dataset           | 1200k images                |
| batch_size        | 128                         |
| Outputs           | probability                 |
| Accuracy          | 8 pcs: 81.7%                |

## [How to use](#contents)

### Inference

If you need to use the trained model for inference on multiple hardware platforms, e.g. GPU, Ascend 910, or Ascend 310, you can refer to this [link](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/migrate_3rd_scripts.html).
Here is a simple example:

- Running on Ascend:

```python
# Set context
context.set_context(mode=context.GRAPH_MODE, device_target=cfg.device_target)
context.set_context(device_id=cfg.device_id)

# Load unseen dataset for inference
dataset = dataset.create_dataset(cfg.data_path, 1, False)

# Define model
net = TinyDarkNet(num_classes=cfg.num_classes)
opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01,
               cfg.momentum, weight_decay=cfg.weight_decay)
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})

# Load pre-trained model
param_dict = load_checkpoint(cfg.checkpoint_path)
load_param_into_net(net, param_dict)
net.set_train(False)

# Make predictions on the unseen dataset
acc = model.eval(dataset)
print("accuracy: ", acc)
```

- Running on GPU:

```python
# Set context
context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

# Load unseen dataset for inference
dataset = dataset.create_dataset(cfg.data_path, 1, False)

# Define model
net = TinyDarkNet(num_classes=cfg.num_classes)
opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01,
               cfg.momentum, weight_decay=cfg.weight_decay)
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})

# Load pre-trained model
param_dict = load_checkpoint(args_opt.checkpoint_path)
load_param_into_net(net, param_dict)
net.set_train(False)

# Make predictions on the unseen dataset
acc = model.eval(dataset)
print("accuracy: ", acc)
```

### Continue Training on the Pretrained Model

- Running on Ascend:

```python
# Load dataset
dataset = create_dataset(cfg.data_path, 1)
batch_num = dataset.get_dataset_size()

# Define model
net = TinyDarkNet(num_classes=cfg.num_classes)
# Continue training if pre_trained is set to True
if cfg.pre_trained:
    param_dict = load_checkpoint(cfg.checkpoint_path)
    load_param_into_net(net, param_dict)
lr = lr_steps(0, lr_max=cfg.lr_init,
              total_epochs=cfg.epoch_size, steps_per_epoch=batch_num)
opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), Tensor(lr),
               cfg.momentum, weight_decay=cfg.weight_decay)
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'},
              amp_level="O2", keep_batchnorm_fp32=False, loss_scale_manager=None)

# Set callbacks
config_ck = CheckpointConfig(save_checkpoint_steps=batch_num * 5,
                             keep_checkpoint_max=cfg.keep_checkpoint_max)
time_cb = TimeMonitor(data_size=batch_num)
ckpoint_cb = ModelCheckpoint(prefix="train_tinydarknet_imagenet", directory="./",
                             config=config_ck)
loss_cb = LossMonitor()

# Start training
model.train(cfg.epoch_size, dataset, callbacks=[time_cb, ckpoint_cb, loss_cb])
print("train success")
```

- Running on GPU:

```python
# Load dataset
dataset = create_dataset(cfg.data_path, 1)
batch_num = dataset.get_dataset_size()

# Define model
net = TinyDarkNet(num_classes=cfg.num_classes)
# Continue training if pre_trained is set to True
if cfg.pre_trained:
    param_dict = load_checkpoint(cfg.checkpoint_path)
    load_param_into_net(net, param_dict)
lr = lr_steps(0, lr_max=cfg.lr_init, total_epochs=cfg.epoch_size,
              steps_per_epoch=batch_num)
opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), Tensor(lr),
               cfg.momentum, weight_decay=cfg.weight_decay)
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'},
              amp_level="O2", keep_batchnorm_fp32=False, loss_scale_manager=None)

# Set callbacks
config_ck = CheckpointConfig(save_checkpoint_steps=batch_num * 5,
                             keep_checkpoint_max=cfg.keep_checkpoint_max)
time_cb = TimeMonitor(data_size=batch_num)
ckpoint_cb = ModelCheckpoint(prefix="train_tinydarknet_imagenet",
                             directory="./ckpt_" + str(get_rank()) + "/",
                             config=config_ck)
loss_cb = LossMonitor()

# Start training
model.train(cfg.epoch_size, dataset, callbacks=[time_cb, ckpoint_cb, loss_cb])
print("train success")
```

# [Description of Random Situation](#contents)

Random seeds are set in the "create_dataset" function in dataset.py and in train.py.

# [ModelZoo Homepage](#contents)

Please refer to the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
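As a closing note on the Script Parameters above: with `'lr_scheduler': 'exponential'`, `'lr_init': 0.1`, `'lr_gamma': 0.3`, and `'lr_epochs': [70, 140, 210, 280]`, the comments in config.py suggest milestone-style decay, i.e. the learning rate is multiplied by lr_gamma each time a milestone epoch is passed. The sketch below illustrates that reading; it is an assumption for illustration, not the project's actual scheduler code.

```python
# Hypothetical sketch of the exponential schedule configured in config.py,
# assuming the lr is multiplied by lr_gamma at each milestone in lr_epochs.
def lr_at_epoch(epoch, lr_init=0.1, lr_epochs=(70, 140, 210, 280), lr_gamma=0.3):
    """Return the learning rate in effect at a given (0-based) epoch."""
    lr = lr_init
    for milestone in lr_epochs:
        if epoch >= milestone:
            lr *= lr_gamma  # decay once the milestone epoch is reached
    return lr

for e in (0, 70, 140, 210, 280):
    # lr steps through 0.1 -> 0.03 -> 0.009 -> 0.0027 -> 0.00081
    print(e, round(lr_at_epoch(e), 6))
```

Under this reading, the learning rate falls by roughly two orders of magnitude over the 500-epoch run while staying constant within each interval between milestones.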