神经网络加速器应用:释放嵌入式AI的硬件潜力¶
学习目标¶
完成本教程后,你将能够:
- 理解神经网络加速器的工作原理和架构
- 掌握NPU/TPU的编程接口和使用方法
- 了解不同加速器的特点和适用场景
- 实现基于硬件加速的AI推理应用
- 掌握算子优化和性能调优技术
- 能够评估和对比加速效果
- 理解加速器的局限性和最佳实践
- 完成一个完整的硬件加速项目
前置要求¶
在开始学习之前,建议你具备:
知识要求:

- 了解TensorFlow Lite基础知识
- 熟悉神经网络的基本概念
- 理解模型量化和优化技术
- 掌握C/C++编程基础
- 了解嵌入式系统开发

技能要求:

- 能够训练和转换TFLite模型
- 会使用嵌入式开发工具
- 熟悉性能分析和调试
- 了解硬件接口编程

开发环境:

- 支持硬件加速的开发板(见硬件准备章节)
- 相应的SDK和开发工具
- TensorFlow Lite库
- 性能分析工具
神经网络加速器概述¶
什么是神经网络加速器¶
神经网络加速器是专门为加速AI推理而设计的硬件单元,通过并行计算和专用电路大幅提升推理性能。
核心特点:

- **专用硬件**
  - 针对神经网络运算优化
  - 高效的矩阵乘法单元
  - 专用的激活函数电路
  - 低功耗设计
- **并行计算**
  - 大规模并行处理
  - SIMD(单指令多数据)架构
  - 流水线处理
  - 多核心协同
- **高性能**
  - 相对CPU通常可达10-50倍的加速比(视模型与平台而定)
  - 低延迟推理
  - 高吞吐量
  - 能效比优异
- **易于集成**
  - 标准API接口
  - 与CPU协同工作
  - 透明加速
  - 软件兼容性好
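上面"并行计算"的收益可以在CPU上用"标量循环 vs 向量化"做一个直观类比:向量化把大量乘加交给底层并行执行,原理与NPU的MAC阵列一致(纯numpy示意,耗时数字因机器而异):

```python
import time
import numpy as np

A = np.random.rand(64, 64).astype(np.float32)
B = np.random.rand(64, 64).astype(np.float32)

def matmul_scalar(A, B):
    """逐元素乘加(模拟无并行的标量计算)"""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i, k] * B[k, j]  # 一次MAC运算
            C[i, j] = acc
    return C

t0 = time.perf_counter()
C_scalar = matmul_scalar(A, B)
t_scalar = time.perf_counter() - t0

t0 = time.perf_counter()
C_vec = A @ B  # 向量化:乘加被批量并行执行
t_vec = time.perf_counter() - t0

print(f"标量循环: {t_scalar*1000:.1f} ms, 向量化: {t_vec*1000:.3f} ms")
print(f"结果一致: {np.allclose(C_scalar, C_vec, atol=1e-3)}")
```

同样的乘加总量,向量化版本通常快几个数量级——NPU把这种并行做到了硬件层面。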
加速器类型对比¶
| 类型 | 代表产品 | 算力 | 功耗 | 适用场景 |
|---|---|---|---|---|
| NPU | 海思NPU、瑞芯微NPU | 1-4 TOPS | 1-3W | 移动设备、边缘计算 |
| TPU | Google Edge TPU | 4 TOPS | 2W | 边缘服务器、工业设备 |
| GPU | Mali GPU、Adreno GPU | 0.5-2 TFLOPS | 2-5W | 移动设备、图形处理 |
| DSP | Hexagon DSP、CEVA DSP | 0.5-1 TOPS | 0.5-1W | 低功耗设备、音频处理 |
| FPGA | Xilinx、Intel FPGA | 可配置 | 5-20W | 定制化、高性能应用 |
选择建议:

- 移动/嵌入式:NPU(集成度高、功耗低)
- 边缘服务器:TPU(性能强、易用性好)
- 图形应用:GPU(通用性强、生态完善)
- 超低功耗:DSP(功耗最低、适合MCU)
- 定制需求:FPGA(灵活性高、可定制)
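上述选型建议可以整理成一个简单的查表函数(仅为示意,场景名称为本文假设):

```python
# 示意:把选型建议整理成查表逻辑,便于在脚本中按部署场景自动选型
ACCELERATOR_GUIDE = {
    "mobile": ("NPU", "集成度高、功耗低"),
    "edge_server": ("TPU", "性能强、易用性好"),
    "graphics": ("GPU", "通用性强、生态完善"),
    "ultra_low_power": ("DSP", "功耗最低、适合MCU"),
    "custom": ("FPGA", "灵活性高、可定制"),
}

def recommend_accelerator(scenario: str) -> str:
    """根据部署场景返回推荐的加速器类型;无匹配时回退到CPU"""
    acc, reason = ACCELERATOR_GUIDE.get(scenario, ("CPU", "无匹配场景,回退到CPU"))
    return f"{acc}({reason})"

print(recommend_accelerator("mobile"))       # NPU(集成度高、功耗低)
print(recommend_accelerator("edge_server"))  # TPU(性能强、易用性好)
```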
加速器架构¶
典型NPU架构:
┌─────────────────────────────────────────────────┐
│ 主机CPU │
│ ┌──────────────┐ │
│ │ 应用程序 │ │
│ └──────────────┘ │
│ ↓ │
│ ┌──────────────┐ │
│ │ NPU驱动 │ │
│ └──────────────┘ │
└─────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────┐
│ NPU硬件 │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ 指令解码器 │ │ DMA控制器 │ │
│ └──────────────┘ └──────────────┘ │
│ ↓ ↓ │
│ ┌──────────────────────────────┐ │
│ │ 计算核心阵列 │ │
│ │ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │ │
│ │ │MAC │ │MAC │ │MAC │ │MAC │ │ │
│ │ │单元│ │单元│ │单元│ │单元│ │ │
│ │ └────┘ └────┘ └────┘ └────┘ │ │
│ │ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │ │
│ │ │MAC │ │MAC │ │MAC │ │MAC │ │ │
│ │ │单元│ │单元│ │单元│ │单元│ │ │
│ │ └────┘ └────┘ └────┘ └────┘ │ │
│ └──────────────────────────────┘ │
│ ↓ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ 激活函数单元 │ │ 池化单元 │ │
│ └──────────────┘ └──────────────┘ │
│ ↓ │
│ ┌──────────────────────────────┐ │
│ │ 片上缓存(SRAM) │ │
│ └──────────────────────────────┘ │
└─────────────────────────────────────────────────┘
关键组件:

- **MAC单元(乘加单元)**
  - 执行 y = a × b + c 运算
  - 神经网络的核心运算
  - 大规模并行阵列
  - 支持Int8/Int16/Float16
- **激活函数单元**
  - 硬件实现ReLU、Sigmoid等
  - 查找表(LUT)加速
  - 流水线处理
  - 低延迟
- **DMA控制器**
  - 高速数据传输
  - 减少CPU干预
  - 支持多通道
  - 自动搬运数据
- **片上缓存**
  - 减少外部内存访问
  - 提高数据复用
  - 降低功耗
  - 典型大小:512KB-2MB
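MAC阵列与重量化单元的配合可以用numpy做一个最小示意:int8输入与权重相乘后在int32中累加,再按scale重新量化回int8(scale取值为假设,真实值由量化工具根据数据分布确定):

```python
import numpy as np

# 简化示意:int8输入/权重,int32累加,再重新量化回int8
np.random.seed(0)
x = np.random.randint(-128, 128, size=(1, 16), dtype=np.int8)
w = np.random.randint(-128, 128, size=(16, 8), dtype=np.int8)

# 乘加在int32中进行,避免int8溢出——这正是MAC阵列的累加器位宽设计
acc = x.astype(np.int32) @ w.astype(np.int32)

# 重新量化:用浮点scale把int32累加结果映射回int8范围
# (scale=1/512为示意值;真实NPU还可能使用逐通道scale)
scale = 1.0 / 512.0
y = np.clip(np.round(acc * scale), -128, 127).astype(np.int8)

print("累加结果(int32):", acc[0, :4])
print("重量化输出(int8):", y[0, :4])
```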
准备工作¶
硬件准备¶
| 名称 | 数量 | 说明 | 推荐型号 |
|---|---|---|---|
| 开发板 | 1 | 带NPU/TPU的开发板 | 见下方推荐 |
| 摄像头 | 1 | 用于视觉应用(可选) | OV5640, IMX219 |
| 显示屏 | 1 | 显示结果(可选) | HDMI显示器 |
| USB线 | 1 | 供电和调试 | - |
| SD卡 | 1 | 存储系统和模型 | 32GB Class 10 |
推荐开发板:

- **Raspberry Pi 4 + Coral USB Accelerator**
  - CPU: Cortex-A72 1.5GHz
  - RAM: 4GB
  - NPU: Google Edge TPU (4 TOPS)
  - 价格: ~$100
  - 优点: 生态完善,易于上手
- **RK3588开发板**
  - CPU: Cortex-A76 + A55
  - RAM: 8GB
  - NPU: 6 TOPS
  - 价格: ~$200
  - 优点: 性能强,国产方案
- **Jetson Nano**
  - CPU: Cortex-A57 1.43GHz
  - RAM: 4GB
  - GPU: 128-core Maxwell
  - 价格: ~$100
  - 优点: CUDA支持,NVIDIA生态
- **STM32MP157 + X-CUBE-AI**
  - CPU: Cortex-A7 + M4
  - RAM: 512MB
  - AI加速: X-CUBE-AI软件推理(无专用NPU)
  - 价格: ~$50
  - 优点: 低成本,STM32生态
软件准备¶
通用工具:
# 安装Python和TensorFlow
pip install tensorflow==2.13.0
pip install numpy pillow matplotlib
# 安装TFLite工具
pip install tflite-runtime
# 验证安装
python -c "import tensorflow as tf; print(tf.__version__)"
针对不同平台:
1. Google Edge TPU:
# 添加软件源
echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" | \
sudo tee /etc/apt/sources.list.d/coral-edgetpu.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
# 安装Edge TPU运行时
sudo apt-get update
sudo apt-get install libedgetpu1-std
sudo apt-get install python3-pycoral
# 验证安装
python3 -c "from pycoral.utils import edgetpu; print(edgetpu.list_edge_tpus())"
2. RK3588 NPU:
# 安装RKNN工具包
pip install rknn-toolkit2
# 下载SDK
git clone https://github.com/rockchip-linux/rknn-toolkit2.git
cd rknn-toolkit2
# 安装依赖
pip install -r requirements.txt
3. Jetson Nano:
# 安装JetPack SDK(已预装在官方镜像)
# 或手动安装
sudo apt-get install nvidia-jetpack
# 安装TensorRT
sudo apt-get install python3-libnvinfer-dev
# 验证CUDA
nvcc --version
4. STM32 X-CUBE-AI:
- 下载并安装STM32CubeIDE
- 安装X-CUBE-AI扩展包
- 配置工具链
环境配置¶
设置开发环境:
# 创建工作目录
mkdir -p ~/npu-tutorial
cd ~/npu-tutorial
# 创建Python虚拟环境
python3 -m venv venv
source venv/bin/activate
# 安装依赖
pip install tensorflow pillow numpy matplotlib
# 创建项目结构
mkdir -p models data scripts results
测试硬件连接:
# test_hardware.py
import sys
def test_edge_tpu():
"""测试Edge TPU连接"""
try:
from pycoral.utils import edgetpu
devices = edgetpu.list_edge_tpus()
if devices:
print(f"✓ Edge TPU detected: {devices}")
return True
else:
print("✗ No Edge TPU found")
return False
except ImportError:
print("✗ PyCoral not installed")
return False
def test_cuda():
"""测试CUDA/GPU"""
try:
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
if gpus:
print(f"✓ GPU detected: {gpus}")
return True
else:
print("✗ No GPU found")
return False
except Exception as e:
print(f"✗ GPU test failed: {e}")
return False
def test_rknn():
"""测试RKNN NPU"""
try:
from rknnlite.api import RKNNLite
rknn = RKNNLite()
print("✓ RKNN runtime available")
return True
except ImportError:
print("✗ RKNN runtime not installed")
return False
if __name__ == "__main__":
print("=== Hardware Detection ===\n")
results = {
"Edge TPU": test_edge_tpu(),
"CUDA/GPU": test_cuda(),
"RKNN NPU": test_rknn()
}
print("\n=== Summary ===")
available = [k for k, v in results.items() if v]
if available:
print(f"Available accelerators: {', '.join(available)}")
else:
print("No hardware accelerators detected")
print("Will use CPU for inference")
运行测试:
python3 test_hardware.py
预期输出:
=== Hardware Detection ===
✓ Edge TPU detected: [{'type': 'usb', 'path': '/dev/bus/usb/001/002'}]
✗ No GPU found
✗ RKNN runtime not installed
=== Summary ===
Available accelerators: Edge TPU
步骤1:准备和转换模型¶
1.1 训练基础模型¶
我们将使用MobileNetV2作为示例,训练一个图像分类模型:
# train_model.py
import tensorflow as tf
from tensorflow import keras
import numpy as np
# 加载数据集(使用CIFAR-10作为示例)
(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
# 数据预处理
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
# 数据增强
data_augmentation = keras.Sequential([
keras.layers.RandomFlip("horizontal"),
keras.layers.RandomRotation(0.1),
keras.layers.RandomZoom(0.1),
])
# 创建模型(使用MobileNetV2)
base_model = keras.applications.MobileNetV2(
input_shape=(32, 32, 3),
include_top=False,
weights=None # 从头训练
)
model = keras.Sequential([
data_augmentation,
base_model,
keras.layers.GlobalAveragePooling2D(),
keras.layers.Dense(10, activation='softmax')
])
# 编译模型
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
# 训练模型
history = model.fit(
x_train, y_train,
batch_size=128,
epochs=50,
validation_data=(x_test, y_test),
callbacks=[
keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3)
]
)
# 评估模型
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f"\nTest accuracy: {test_acc*100:.2f}%")
# 保存模型
model.save('cifar10_mobilenet.h5')
print("Model saved!")
1.2 转换为加速器专用格式¶
不同的加速器需要不同的模型格式:
1. Edge TPU模型转换:
# convert_to_edgetpu.py
import tensorflow as tf
import numpy as np
# 加载模型
model = tf.keras.models.load_model('cifar10_mobilenet.h5')
# 转换为TFLite(全整数量化)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# 代表性数据集(此处用随机数据仅作演示;实际应使用真实样本,否则量化精度会明显下降)
def representative_dataset():
for _ in range(100):
data = np.random.rand(1, 32, 32, 3).astype(np.float32)
yield [data]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8
# 转换
tflite_model = converter.convert()
# 保存TFLite模型
with open('cifar10_mobilenet_quant.tflite', 'wb') as f:
f.write(tflite_model)
print(f"TFLite model size: {len(tflite_model) / 1024:.2f} KB")
# 使用Edge TPU编译器编译
# 需要在命令行执行:
# edgetpu_compiler cifar10_mobilenet_quant.tflite
在命令行编译Edge TPU模型:
# 安装Edge TPU编译器
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo "deb https://packages.cloud.google.com/apt coral-edgetpu-stable main" | \
sudo tee /etc/apt/sources.list.d/coral-edgetpu.list
sudo apt-get update
sudo apt-get install edgetpu-compiler
# 编译模型
edgetpu_compiler cifar10_mobilenet_quant.tflite
# 输出:cifar10_mobilenet_quant_edgetpu.tflite
2. RKNN模型转换:
# convert_to_rknn.py
from rknn.api import RKNN
# 创建RKNN对象
rknn = RKNN(verbose=True)
# 配置
print('--> Config model')
rknn.config(
mean_values=[[127.5, 127.5, 127.5]],
std_values=[[127.5, 127.5, 127.5]],
target_platform='rk3588'
)
# 加载TensorFlow模型
print('--> Loading model')
ret = rknn.load_tensorflow(
tf_pb='cifar10_mobilenet.pb',
inputs=['input'],
outputs=['output'],
input_size_list=[[32, 32, 3]]
)
if ret != 0:
print('Load model failed!')
exit(ret)
# 构建RKNN模型
print('--> Building model')
ret = rknn.build(do_quantization=True, dataset='./dataset.txt')
if ret != 0:
print('Build model failed!')
exit(ret)
# 导出RKNN模型
print('--> Export RKNN model')
ret = rknn.export_rknn('cifar10_mobilenet.rknn')
if ret != 0:
print('Export model failed!')
exit(ret)
print('Done!')
rknn.release()
3. TensorRT模型转换(Jetson):
# convert_to_tensorrt.py
import numpy as np
import tensorflow as tf
from tensorflow.python.compiler.tensorrt import trt_convert as trt
# 加载SavedModel
saved_model_dir = 'cifar10_mobilenet_saved'
model = tf.keras.models.load_model('cifar10_mobilenet.h5')
model.save(saved_model_dir)
# 转换为TensorRT
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
precision_mode=trt.TrtPrecisionMode.INT8,
max_workspace_size_bytes=1 << 30 # 1GB
)
converter = trt.TrtGraphConverterV2(
input_saved_model_dir=saved_model_dir,
conversion_params=conversion_params
)
# 转换(INT8校准;示例用随机数据,实际应使用真实样本)
def calibration_input_fn():
for _ in range(100):
yield (np.random.rand(1, 32, 32, 3).astype(np.float32),)
converter.convert(calibration_input_fn=calibration_input_fn)
# 保存
output_saved_model_dir = 'cifar10_mobilenet_trt'
converter.save(output_saved_model_dir)
print(f"TensorRT model saved to {output_saved_model_dir}")
1.3 验证转换后的模型¶
在转换后验证模型的准确性:
# verify_model.py
import numpy as np
import tensorflow as tf
from tensorflow import keras
# 加载测试数据
(_, _), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_test = x_test.astype('float32') / 255.0
# 加载原始模型
original_model = keras.models.load_model('cifar10_mobilenet.h5')
# 加载TFLite模型
interpreter = tf.lite.Interpreter(model_path='cifar10_mobilenet_quant.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# 测试样本数
num_samples = 1000
# 原始模型预测
original_predictions = original_model.predict(x_test[:num_samples])
original_accuracy = np.mean(
np.argmax(original_predictions, axis=1) == y_test[:num_samples].flatten()
)
# TFLite模型预测
tflite_predictions = []
for i in range(num_samples):
# 量化输入
input_scale, input_zero_point = input_details[0]['quantization']
test_image = x_test[i:i+1]
test_image_quantized = (test_image / input_scale + input_zero_point).astype(np.uint8)
# 推理
interpreter.set_tensor(input_details[0]['index'], test_image_quantized)
interpreter.invoke()
# 获取输出
output = interpreter.get_tensor(output_details[0]['index'])
# 反量化
output_scale, output_zero_point = output_details[0]['quantization']
output_dequantized = (output.astype(np.float32) - output_zero_point) * output_scale
tflite_predictions.append(output_dequantized[0])
tflite_predictions = np.array(tflite_predictions)
tflite_accuracy = np.mean(
np.argmax(tflite_predictions, axis=1) == y_test[:num_samples].flatten()
)
# 打印结果
print("\n=== Model Verification ===")
print(f"Original model accuracy: {original_accuracy*100:.2f}%")
print(f"TFLite model accuracy: {tflite_accuracy*100:.2f}%")
print(f"Accuracy difference: {abs(original_accuracy - tflite_accuracy)*100:.2f}%")
if abs(original_accuracy - tflite_accuracy) < 0.03:
print("✓ Model conversion successful!")
else:
print("⚠ Significant accuracy loss detected")
步骤2:使用Edge TPU加速¶
2.1 基本推理示例¶
# edgetpu_inference.py
import time
import numpy as np
from PIL import Image
from pycoral.adapters import common
from pycoral.adapters import classify
from pycoral.utils.edgetpu import make_interpreter
# 加载模型
model_path = 'cifar10_mobilenet_quant_edgetpu.tflite'
interpreter = make_interpreter(model_path)
interpreter.allocate_tensors()
# 获取输入输出信息
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]
print("=== Model Information ===")
print(f"Input shape: {input_details['shape']}")
print(f"Input type: {input_details['dtype']}")
print(f"Output shape: {output_details['shape']}")
print(f"Output type: {output_details['dtype']}")
# 准备测试图像
def preprocess_image(image_path):
"""预处理图像"""
img = Image.open(image_path).convert('RGB')
img = img.resize((32, 32), Image.LANCZOS)
img_array = np.array(img, dtype=np.uint8)
return np.expand_dims(img_array, axis=0)
# 加载测试图像
test_image = preprocess_image('test_image.jpg')
# 执行推理
print("\n=== Running Inference ===")
start_time = time.perf_counter()
# 设置输入
common.set_input(interpreter, test_image)
# 执行推理
interpreter.invoke()
# 获取输出并反量化(模型输出为uint8,需要转换回概率)
raw_output = common.output_tensor(interpreter, 0).copy()
out_scale, out_zero_point = output_details['quantization']
output = (raw_output.astype(np.float32) - out_zero_point) * out_scale
inference_time = (time.perf_counter() - start_time) * 1000
# 打印结果
print(f"Inference time: {inference_time:.2f} ms")
print(f"Predicted class: {np.argmax(output)}")
print(f"Confidence: {np.max(output)*100:.2f}%")
# CIFAR-10类别名称
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck']
print(f"\nTop 3 predictions:")
top_3 = np.argsort(output[0])[-3:][::-1]
for idx in top_3:
print(f" {class_names[idx]}: {output[0][idx]*100:.2f}%")
2.2 性能基准测试¶
对比CPU和Edge TPU的性能:
# benchmark_edgetpu.py
import time
import numpy as np
import tensorflow as tf
from pycoral.adapters import common
from pycoral.utils.edgetpu import make_interpreter
def benchmark_cpu(model_path, num_iterations=100):
"""CPU基准测试"""
interpreter = tf.lite.Interpreter(model_path=model_path)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
test_input = np.random.randint(
0, 256,
size=input_details['shape'],
dtype=np.uint8
)
# 预热
for _ in range(10):
interpreter.set_tensor(input_details['index'], test_input)
interpreter.invoke()
# 测试
times = []
for _ in range(num_iterations):
start = time.perf_counter()
interpreter.set_tensor(input_details['index'], test_input)
interpreter.invoke()
times.append((time.perf_counter() - start) * 1000)
return {
'mean': np.mean(times),
'std': np.std(times),
'min': np.min(times),
'max': np.max(times)
}
def benchmark_edgetpu(model_path, num_iterations=100):
"""Edge TPU基准测试"""
interpreter = make_interpreter(model_path)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
test_input = np.random.randint(
0, 256,
size=input_details['shape'],
dtype=np.uint8
)
# 预热
for _ in range(10):
common.set_input(interpreter, test_input)
interpreter.invoke()
# 测试
times = []
for _ in range(num_iterations):
start = time.perf_counter()
common.set_input(interpreter, test_input)
interpreter.invoke()
times.append((time.perf_counter() - start) * 1000)
return {
'mean': np.mean(times),
'std': np.std(times),
'min': np.min(times),
'max': np.max(times)
}
# 运行基准测试
print("=== Performance Benchmark ===\n")
print("Testing CPU inference...")
cpu_stats = benchmark_cpu('cifar10_mobilenet_quant.tflite', num_iterations=100)
print("Testing Edge TPU inference...")
tpu_stats = benchmark_edgetpu('cifar10_mobilenet_quant_edgetpu.tflite', num_iterations=100)
# 打印结果
print("\n=== Results ===")
print(f"\nCPU Performance:")
print(f" Mean: {cpu_stats['mean']:.2f} ms")
print(f" Std: {cpu_stats['std']:.2f} ms")
print(f" Min: {cpu_stats['min']:.2f} ms")
print(f" Max: {cpu_stats['max']:.2f} ms")
print(f"\nEdge TPU Performance:")
print(f" Mean: {tpu_stats['mean']:.2f} ms")
print(f" Std: {tpu_stats['std']:.2f} ms")
print(f" Min: {tpu_stats['min']:.2f} ms")
print(f" Max: {tpu_stats['max']:.2f} ms")
speedup = cpu_stats['mean'] / tpu_stats['mean']
print(f"\nSpeedup: {speedup:.2f}x")
print(f"Throughput (CPU): {1000/cpu_stats['mean']:.2f} FPS")
print(f"Throughput (TPU): {1000/tpu_stats['mean']:.2f} FPS")
预期输出:
=== Performance Benchmark ===
Testing CPU inference...
Testing Edge TPU inference...
=== Results ===
CPU Performance:
Mean: 45.23 ms
Std: 2.15 ms
Min: 42.10 ms
Max: 52.30 ms
Edge TPU Performance:
Mean: 3.12 ms
Std: 0.18 ms
Min: 2.95 ms
Max: 3.45 ms
Speedup: 14.50x
Throughput (CPU): 22.11 FPS
Throughput (TPU): 320.51 FPS
2.3 实时视频处理¶
使用Edge TPU进行实时视频分类:
# realtime_video.py
import cv2
import time
import numpy as np
from pycoral.adapters import common
from pycoral.utils.edgetpu import make_interpreter
# 加载模型
interpreter = make_interpreter('cifar10_mobilenet_quant_edgetpu.tflite')
interpreter.allocate_tensors()
# CIFAR-10类别
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck']
# 打开摄像头
cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
# 性能统计
fps_history = []
inference_times = []
print("Press 'q' to quit")
while True:
# 读取帧
ret, frame = cap.read()
if not ret:
break
frame_start = time.perf_counter()
# 预处理
# 从中心裁剪32x32区域
h, w = frame.shape[:2]
center_crop = frame[h//2-16:h//2+16, w//2-16:w//2+16]
input_image = cv2.resize(center_crop, (32, 32))
input_image = cv2.cvtColor(input_image, cv2.COLOR_BGR2RGB)
input_tensor = np.expand_dims(input_image, axis=0).astype(np.uint8)
# 推理
inference_start = time.perf_counter()
common.set_input(interpreter, input_tensor)
interpreter.invoke()
    output = common.output_tensor(interpreter, 0).copy()
    inference_time = (time.perf_counter() - inference_start) * 1000
    # 获取预测结果(输出为uint8,先反量化为概率)
    out_scale, out_zero_point = interpreter.get_output_details()[0]['quantization']
    probs = (output[0].astype(np.float32) - out_zero_point) * out_scale
    predicted_class = int(np.argmax(probs))
    confidence = probs[predicted_class]
# 计算FPS
frame_time = time.perf_counter() - frame_start
fps = 1.0 / frame_time
# 更新统计
fps_history.append(fps)
inference_times.append(inference_time)
if len(fps_history) > 30:
fps_history.pop(0)
inference_times.pop(0)
avg_fps = np.mean(fps_history)
avg_inference = np.mean(inference_times)
# 绘制结果
cv2.rectangle(frame, (w//2-16, h//2-16), (w//2+16, h//2+16), (0, 255, 0), 2)
# 显示信息
info_text = [
f"Class: {class_names[predicted_class]}",
f"Confidence: {confidence*100:.1f}%",
f"FPS: {avg_fps:.1f}",
f"Inference: {avg_inference:.1f}ms"
]
y_offset = 30
for text in info_text:
cv2.putText(frame, text, (10, y_offset),
cv2.FONT_HERSHEY_SIMPLEX, 0.7, (0, 255, 0), 2)
y_offset += 30
# 显示帧
cv2.imshow('Edge TPU Real-time Classification', frame)
# 退出检查
if cv2.waitKey(1) & 0xFF == ord('q'):
break
# 清理
cap.release()
cv2.destroyAllWindows()
print(f"\nAverage FPS: {np.mean(fps_history):.2f}")
print(f"Average inference time: {np.mean(inference_times):.2f} ms")
步骤3:使用RKNN NPU加速¶
3.1 RKNN基本推理¶
# rknn_inference.py
import numpy as np
from rknnlite.api import RKNNLite
import cv2
# 创建RKNN对象
rknn_lite = RKNNLite()
# 加载RKNN模型
print('--> Load RKNN model')
ret = rknn_lite.load_rknn('cifar10_mobilenet.rknn')
if ret != 0:
print('Load RKNN model failed!')
exit(ret)
# 初始化运行时环境
print('--> Init runtime environment')
ret = rknn_lite.init_runtime(core_mask=RKNNLite.NPU_CORE_0)
if ret != 0:
print('Init runtime environment failed!')
exit(ret)
# 准备输入数据
img = cv2.imread('test_image.jpg')
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (32, 32))
img = np.expand_dims(img, axis=0)
# 执行推理
print('--> Running inference')
outputs = rknn_lite.inference(inputs=[img])
# 处理输出
output = outputs[0][0]
predicted_class = np.argmax(output)
confidence = output[predicted_class]
# 类别名称
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
'dog', 'frog', 'horse', 'ship', 'truck']
print(f'\nPredicted class: {class_names[predicted_class]}')
print(f'Confidence: {confidence*100:.2f}%')
# 释放资源
rknn_lite.release()
3.2 RKNN性能优化¶
# rknn_optimized.py
from rknnlite.api import RKNNLite
import numpy as np
import time
class RKNNInference:
"""优化的RKNN推理类"""
def __init__(self, model_path, core_mask=RKNNLite.NPU_CORE_0):
self.rknn = RKNNLite()
# 加载模型
ret = self.rknn.load_rknn(model_path)
if ret != 0:
raise RuntimeError('Failed to load RKNN model')
# 初始化运行时
ret = self.rknn.init_runtime(core_mask=core_mask)
if ret != 0:
raise RuntimeError('Failed to init runtime')
print(f"RKNN model loaded successfully")
print(f"Using NPU core: {core_mask}")
def inference(self, input_data):
"""执行推理"""
outputs = self.rknn.inference(inputs=[input_data])
return outputs[0]
def benchmark(self, input_shape, num_iterations=100):
"""性能基准测试"""
# 生成随机输入
test_input = np.random.randint(
0, 256,
size=input_shape,
dtype=np.uint8
)
# 预热
for _ in range(10):
self.rknn.inference(inputs=[test_input])
# 测试
times = []
for _ in range(num_iterations):
start = time.perf_counter()
self.rknn.inference(inputs=[test_input])
times.append((time.perf_counter() - start) * 1000)
return {
'mean': np.mean(times),
'std': np.std(times),
'min': np.min(times),
'max': np.max(times),
'fps': 1000 / np.mean(times)
}
def release(self):
"""释放资源"""
self.rknn.release()
# 使用示例
if __name__ == "__main__":
# 创建推理对象
model = RKNNInference('cifar10_mobilenet.rknn')
# 运行基准测试
print("\n=== Performance Benchmark ===")
stats = model.benchmark(input_shape=(1, 32, 32, 3), num_iterations=100)
print(f"Mean inference time: {stats['mean']:.2f} ms")
print(f"Std: {stats['std']:.2f} ms")
print(f"Min: {stats['min']:.2f} ms")
print(f"Max: {stats['max']:.2f} ms")
print(f"Throughput: {stats['fps']:.2f} FPS")
# 释放资源
model.release()
3.3 多核心并行推理¶
RKNN支持多核心并行,可以进一步提升性能:
# rknn_multicore.py
from rknnlite.api import RKNNLite
import numpy as np
import time
import threading
class MultiCoreRKNN:
"""多核心RKNN推理"""
def __init__(self, model_path, num_cores=3):
"""
初始化多核心推理
num_cores: 使用的核心数(1-3)
"""
self.num_cores = num_cores
self.rknns = []
# 为每个核心创建RKNN实例
core_masks = [
RKNNLite.NPU_CORE_0,
RKNNLite.NPU_CORE_1,
RKNNLite.NPU_CORE_2
]
for i in range(num_cores):
rknn = RKNNLite()
ret = rknn.load_rknn(model_path)
if ret != 0:
raise RuntimeError(f'Failed to load model on core {i}')
ret = rknn.init_runtime(core_mask=core_masks[i])
if ret != 0:
raise RuntimeError(f'Failed to init runtime on core {i}')
self.rknns.append(rknn)
print(f"Core {i} initialized")
def inference_single(self, core_id, input_data, results, index):
"""单核心推理(线程函数)"""
output = self.rknns[core_id].inference(inputs=[input_data])
results[index] = output[0]
def inference_batch(self, input_batch):
"""批量推理"""
batch_size = len(input_batch)
results = [None] * batch_size
threads = []
# 创建线程
for i in range(batch_size):
core_id = i % self.num_cores
thread = threading.Thread(
target=self.inference_single,
args=(core_id, input_batch[i], results, i)
)
threads.append(thread)
thread.start()
# 等待所有线程完成
for thread in threads:
thread.join()
return results
def benchmark_multicore(self, input_shape, batch_size=10, num_iterations=10):
"""多核心性能测试"""
# 生成测试数据
test_batch = [
np.random.randint(0, 256, size=input_shape, dtype=np.uint8)
for _ in range(batch_size)
]
# 预热
for _ in range(3):
self.inference_batch(test_batch)
# 测试
times = []
for _ in range(num_iterations):
start = time.perf_counter()
self.inference_batch(test_batch)
elapsed = (time.perf_counter() - start) * 1000
times.append(elapsed)
avg_time = np.mean(times)
throughput = (batch_size * 1000) / avg_time
return {
'batch_size': batch_size,
'avg_time': avg_time,
'throughput': throughput,
'time_per_image': avg_time / batch_size
}
def release(self):
"""释放所有资源"""
for rknn in self.rknns:
rknn.release()
# 使用示例
if __name__ == "__main__":
print("=== Multi-Core RKNN Benchmark ===\n")
# 测试不同核心数
for num_cores in [1, 2, 3]:
print(f"\nTesting with {num_cores} core(s)...")
model = MultiCoreRKNN('cifar10_mobilenet.rknn', num_cores=num_cores)
stats = model.benchmark_multicore(
input_shape=(1, 32, 32, 3),
batch_size=12,
num_iterations=10
)
print(f"Batch size: {stats['batch_size']}")
print(f"Total time: {stats['avg_time']:.2f} ms")
print(f"Time per image: {stats['time_per_image']:.2f} ms")
print(f"Throughput: {stats['throughput']:.2f} images/sec")
model.release()
步骤4:算子优化¶
4.1 分析算子性能¶
了解哪些算子在加速器上运行,哪些在CPU上运行:
# analyze_operators.py
import numpy as np
import tensorflow as tf
def analyze_tflite_model(model_path):
"""分析TFLite模型的算子"""
interpreter = tf.lite.Interpreter(model_path=model_path)
interpreter.allocate_tensors()
    # 获取所有算子(注意:_get_ops_details 是TFLite的私有接口,不同TF版本可能有差异)
    ops_details = interpreter._get_ops_details()
# 统计算子类型
op_types = {}
for op in ops_details:
op_name = op['op_name']
op_types[op_name] = op_types.get(op_name, 0) + 1
print("=== Operator Analysis ===\n")
print(f"Total operators: {len(ops_details)}")
print(f"\nOperator types:")
for op_name, count in sorted(op_types.items()):
print(f" {op_name}: {count}")
# 分析张量大小
tensor_details = interpreter.get_tensor_details()
total_size = 0
print(f"\n=== Tensor Analysis ===\n")
print(f"Total tensors: {len(tensor_details)}")
    for tensor in tensor_details:
        size = np.prod(tensor['shape']) * np.dtype(tensor['dtype']).itemsize
total_size += size
print(f"Total tensor memory: {total_size / 1024:.2f} KB")
return op_types
# 分析模型
analyze_tflite_model('cifar10_mobilenet_quant.tflite')
4.2 优化模型架构¶
某些算子可能不被加速器支持,需要调整模型架构:
# optimize_model_architecture.py
import tensorflow as tf
from tensorflow import keras
def create_accelerator_friendly_model(input_shape, num_classes):
"""创建加速器友好的模型"""
# 使用加速器支持良好的层
model = keras.Sequential([
# 输入层
keras.layers.Input(shape=input_shape),
# 使用标准卷积(避免深度可分离卷积的某些变体)
keras.layers.Conv2D(32, 3, strides=2, padding='same', activation='relu'),
keras.layers.Conv2D(64, 3, strides=2, padding='same', activation='relu'),
keras.layers.Conv2D(128, 3, strides=2, padding='same', activation='relu'),
# 全局平均池化(比Flatten更高效)
keras.layers.GlobalAveragePooling2D(),
# 分类层
keras.layers.Dense(num_classes, activation='softmax')
])
return model
# 创建模型
model = create_accelerator_friendly_model((32, 32, 3), 10)
model.summary()
# 编译和训练
model.compile(
optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy']
)
# 训练模型(使用你的数据)
# model.fit(x_train, y_train, epochs=50, validation_data=(x_test, y_test))
# 保存模型
model.save('optimized_model.h5')
4.3 算子融合¶
手动融合某些算子以提高性能:
# operator_fusion.py
import numpy as np
import tensorflow as tf
from tensorflow import keras
class FusedConvBNReLU(keras.layers.Layer):
"""融合的Conv+BN+ReLU层"""
def __init__(self, filters, kernel_size, strides=1, padding='same', **kwargs):
super().__init__(**kwargs)
self.conv = keras.layers.Conv2D(
filters, kernel_size, strides=strides,
padding=padding, use_bias=False
)
self.bn = keras.layers.BatchNormalization()
self.relu = keras.layers.ReLU()
def call(self, inputs, training=None):
x = self.conv(inputs)
x = self.bn(x, training=training)
x = self.relu(x)
return x
def fuse_for_inference(self):
"""融合BN到Conv权重中(推理时)"""
# 获取Conv权重
conv_weights = self.conv.get_weights()[0]
# 获取BN参数
gamma, beta, moving_mean, moving_var = self.bn.get_weights()
epsilon = self.bn.epsilon
        # 计算融合后的权重(全程用numpy计算,避免tf张量与numpy数组混用)
        std = np.sqrt(moving_var + epsilon)
        fused_weights = conv_weights * (gamma / std).reshape(1, 1, 1, -1)
        fused_bias = beta - (gamma * moving_mean) / std
# 创建新的Conv层
fused_conv = keras.layers.Conv2D(
self.conv.filters,
self.conv.kernel_size,
strides=self.conv.strides,
padding=self.conv.padding,
activation='relu',
use_bias=True
)
# 设置融合后的权重
fused_conv.build(self.conv.input_shape)
fused_conv.set_weights([fused_weights, fused_bias])
return fused_conv
# 使用融合层构建模型
def create_fused_model(input_shape, num_classes):
inputs = keras.Input(shape=input_shape)
x = FusedConvBNReLU(32, 3, strides=2)(inputs)
x = FusedConvBNReLU(64, 3, strides=2)(x)
x = FusedConvBNReLU(128, 3, strides=2)(x)
x = keras.layers.GlobalAveragePooling2D()(x)
outputs = keras.layers.Dense(num_classes, activation='softmax')(x)
model = keras.Model(inputs, outputs)
return model
model = create_fused_model((32, 32, 3), 10)
model.summary()
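BN折叠公式是否正确,可以用numpy在标量(单通道)情形下快速验证——`gamma*(w*x - mean)/std + beta` 应与折叠后的 `w'*x + b'` 逐点一致(仅为数值示意):

```python
import numpy as np

# 验证BN折叠:y = gamma*(conv(x)-mean)/std + beta = conv'(x) + b'
# 其中 conv' = conv * gamma/std, b' = beta - gamma*mean/std
np.random.seed(0)
x = np.random.rand(8).astype(np.float32)   # 简化为逐通道标量情形
w, gamma, beta = 0.5, 1.2, 0.1             # 示意参数
mean, var, eps = 0.3, 0.04, 1e-5

std = np.sqrt(var + eps)
y_ref = gamma * (w * x - mean) / std + beta   # Conv -> BN 逐步计算

w_fused = w * gamma / std                      # 折叠后的权重
b_fused = beta - gamma * mean / std            # 折叠后的偏置
y_fused = w_fused * x + b_fused                # 单个Conv完成同样计算

print("最大误差:", np.abs(y_ref - y_fused).max())  # 应接近0
```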
步骤5:性能调优¶
5.1 批处理优化¶
通过连续送入多张图像提高吞吐量(注意:Edge TPU编译的模型通常固定batch=1,下面的"批处理"实际是逐张循环推理,用于观察不同批量规模下的平均开销):
# batch_inference.py
import numpy as np
import time
from pycoral.adapters import common
from pycoral.utils.edgetpu import make_interpreter
def benchmark_batch_inference(model_path, batch_sizes=[1, 2, 4, 8]):
"""测试不同批大小的性能"""
results = {}
for batch_size in batch_sizes:
print(f"\nTesting batch size: {batch_size}")
# 加载模型
interpreter = make_interpreter(model_path)
interpreter.allocate_tensors()
# 准备批量输入
input_details = interpreter.get_input_details()[0]
input_shape = input_details['shape']
# 生成测试数据
test_batch = np.random.randint(
0, 256,
size=(batch_size, *input_shape[1:]),
dtype=np.uint8
)
# 预热
for _ in range(10):
for i in range(batch_size):
common.set_input(interpreter, test_batch[i:i+1])
interpreter.invoke()
# 测试
num_iterations = 100
start_time = time.perf_counter()
for _ in range(num_iterations):
for i in range(batch_size):
common.set_input(interpreter, test_batch[i:i+1])
interpreter.invoke()
total_time = (time.perf_counter() - start_time) * 1000
avg_time_per_batch = total_time / num_iterations
avg_time_per_image = avg_time_per_batch / batch_size
throughput = (batch_size * num_iterations * 1000) / total_time
results[batch_size] = {
'time_per_batch': avg_time_per_batch,
'time_per_image': avg_time_per_image,
'throughput': throughput
}
print(f" Time per batch: {avg_time_per_batch:.2f} ms")
print(f" Time per image: {avg_time_per_image:.2f} ms")
print(f" Throughput: {throughput:.2f} images/sec")
return results
# 运行测试
results = benchmark_batch_inference(
'cifar10_mobilenet_quant_edgetpu.tflite',
batch_sizes=[1, 2, 4, 8, 16]
)
# 找出最优批大小
best_batch_size = max(results.keys(), key=lambda k: results[k]['throughput'])
print(f"\nOptimal batch size: {best_batch_size}")
print(f"Max throughput: {results[best_batch_size]['throughput']:.2f} images/sec")
5.2 内存优化¶
优化内存使用以支持更大的模型或批大小:
# memory_optimization.py
import numpy as np
import psutil
import os
class MemoryMonitor:
"""内存监控工具"""
def __init__(self):
self.process = psutil.Process(os.getpid())
self.baseline = self.get_memory_usage()
def get_memory_usage(self):
"""获取当前内存使用(MB)"""
return self.process.memory_info().rss / 1024 / 1024
def get_memory_increase(self):
"""获取相对于基线的内存增长"""
return self.get_memory_usage() - self.baseline
def print_memory_stats(self, label=""):
"""打印内存统计"""
current = self.get_memory_usage()
increase = self.get_memory_increase()
print(f"{label}")
print(f" Current: {current:.2f} MB")
print(f" Increase: {increase:.2f} MB")
# 使用示例
monitor = MemoryMonitor()
print("=== Memory Usage Analysis ===\n")
# 加载模型前
monitor.print_memory_stats("Before loading model:")
# 加载模型
from pycoral.adapters import common
from pycoral.utils.edgetpu import make_interpreter
interpreter = make_interpreter('cifar10_mobilenet_quant_edgetpu.tflite')
interpreter.allocate_tensors()
monitor.print_memory_stats("\nAfter loading model:")
# 执行推理
input_data = np.random.randint(0, 256, size=(1, 32, 32, 3), dtype=np.uint8)
for _ in range(100):
common.set_input(interpreter, input_data)
interpreter.invoke()
monitor.print_memory_stats("\nAfter 100 inferences:")
5.3 功耗优化¶
监控和优化功耗:
# power_optimization.py
import time
import subprocess
import numpy as np
class PowerMonitor:
"""功耗监控(需要硬件支持)"""
def __init__(self):
self.measurements = []
def measure_power(self):
"""测量当前功耗(示例,需要根据硬件调整)"""
try:
# 对于Jetson设备
result = subprocess.run(
['tegrastats', '--interval', '100'],
capture_output=True,
text=True,
timeout=0.2
)
# 解析功耗数据
# 这里需要根据实际硬件的输出格式解析
            return 0.0  # 返回功耗值(瓦特),需按实际硬件输出格式解析
        except Exception:
            return 0.0
def benchmark_power(self, inference_func, duration=10):
"""测量推理过程的功耗"""
start_time = time.time()
power_samples = []
inference_count = 0
while time.time() - start_time < duration:
# 执行推理
inference_func()
inference_count += 1
# 测量功耗
power = self.measure_power()
if power > 0:
power_samples.append(power)
if power_samples:
avg_power = np.mean(power_samples)
energy = avg_power * duration # 能量(焦耳)
energy_per_inference = energy / inference_count
return {
'avg_power': avg_power,
'total_energy': energy,
'energy_per_inference': energy_per_inference,
'inference_count': inference_count
}
else:
return None
# 使用示例(需要硬件支持)
# monitor = PowerMonitor()
# stats = monitor.benchmark_power(lambda: model.inference(test_input), duration=10)
# print(f"Average power: {stats['avg_power']:.2f} W")
# print(f"Energy per inference: {stats['energy_per_inference']:.4f} J")
故障排除¶
问题1:模型无法在加速器上运行¶
现象:

- 模型加载或编译时报错
- 推理全部回退到CPU,速度与纯CPU相当

可能原因:

- 模型包含加速器不支持的算子
- 模型未正确量化
- 模型格式不正确
解决方法:
# 1. Inspect the operators in the model
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='model.tflite')
ops = set()
for op in interpreter._get_ops_details():  # note: private API
    ops.add(op['op_name'])
print("Model operators:")
for op in sorted(ops):
    print(f"  {op}")

# 2. Check whether each operator is supported
# For the Edge TPU, see:
# https://coral.ai/docs/edgetpu/models-intro/#supported-operations

# 3. Redesign the model to avoid unsupported operators
Problem 2: Performance below expectations¶
Symptoms: - speedup lower than expected - unstable inference times - low throughput
Solution:
# 1. Verify that the accelerator is actually being used
from pycoral.utils import edgetpu

# List the available Edge TPUs
devices = edgetpu.list_edge_tpus()
print(f"Available Edge TPUs: {devices}")

# 2. Check whether the model compiled completely
# The Edge TPU compiler prints a compilation report
# showing how many operators run on the TPU

# 3. Optimize data transfers
# Reduce CPU-NPU data copies;
# use DMA or zero-copy techniques

# 4. Use profiling tools
# For the Edge TPU:
# edgetpu_compiler model.tflite --show_operations

# 5. Tune the batch size
# Find the batch size that maximizes throughput
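Point 5 above can be automated with a small sweep. A minimal sketch, using a stand-in `fake_inference` function in place of a real accelerator call (in practice you would pass a closure that invokes your interpreter on a batch):

```python
import time

def sweep_batch_sizes(run_batch, batch_sizes=(1, 2, 4, 8, 16), repeats=20):
    """Time run_batch(batch_size) for each candidate batch size and
    report throughput (samples/s); returns the best batch size."""
    results = {}
    for bs in batch_sizes:
        start = time.perf_counter()
        for _ in range(repeats):
            run_batch(bs)
        elapsed = time.perf_counter() - start
        results[bs] = bs * repeats / elapsed  # samples per second
    best = max(results, key=results.get)
    return best, results

# Dummy workload: a fixed per-call overhead plus a per-sample cost,
# so larger batches amortize the overhead.
def fake_inference(batch_size):
    time.sleep(0.001 + 0.0002 * batch_size)

best, results = sweep_batch_sizes(fake_inference)
for bs, tput in sorted(results.items()):
    print(f"batch={bs:2d}: {tput:7.1f} samples/s")
print(f"best batch size: {best}")
```

With a fixed per-call overhead the largest batch wins; on real hardware the curve usually flattens or turns over once on-chip memory is exhausted, which is exactly what the sweep reveals.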
Problem 3: Out of memory¶
Symptoms:
Solution:
# 1. Shrink the model
#    - use more aggressive quantization
#    - reduce the layer count or channel width
#    - apply model pruning

# 2. Optimize memory allocation
#    - use in-place operations
#    - reuse buffers
#    - reduce the batch size

# 3. Use memory mapping
import mmap

def load_model_mmap(model_path):
    """Load a model via memory mapping"""
    with open(model_path, 'rb') as f:
        model_data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return model_data

# 4. Process in segments
# Split a large image into smaller tiles
def process_large_image(image, model, tile_size=256):
    """Process a large image tile by tile"""
    h, w = image.shape[:2]
    results = []
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            tile = image[y:y+tile_size, x:x+tile_size]
            result = model.inference(tile)
            results.append(result)
    return results
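The tiling sketch above returns a flat list of per-tile results. For pixel-wise models such as segmentation, the outputs can be stitched back into a full-size result; `infer_tile` below is a hypothetical callable assumed to preserve its input's spatial shape:

```python
import numpy as np

def process_large_image_stitched(image, infer_tile, tile_size=256):
    """Variant of the tiling idea that stitches per-tile outputs back
    into a full-size array; assumes infer_tile(tile) returns an array
    with the same spatial shape as its input."""
    h, w = image.shape[:2]
    output = np.zeros_like(image, dtype=np.float32)
    for y in range(0, h, tile_size):
        for x in range(0, w, tile_size):
            # Edge tiles are simply smaller; no padding needed here
            tile = image[y:y+tile_size, x:x+tile_size]
            output[y:y+tile.shape[0], x:x+tile.shape[1]] = infer_tile(tile)
    return output

# Toy "model": the identity, so stitching must reproduce the input.
image = np.random.rand(300, 500).astype(np.float32)
result = process_large_image_stitched(image, lambda t: t, tile_size=256)
print(np.allclose(result, image))  # True
```

Note that hard tile boundaries can produce visible seams; production pipelines often overlap tiles and blend, which is omitted here for brevity.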
Problem 4: Accuracy degradation¶
Symptoms: - accelerator results differ from the CPU's - accuracy drops noticeably
Solution:
# 1. Verify the quantization parameters
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='model.tflite')
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]
output_details = interpreter.get_output_details()[0]
print("Input quantization:")
print(f"  Scale: {input_details['quantization'][0]}")
print(f"  Zero point: {input_details['quantization'][1]}")
print("Output quantization:")
print(f"  Scale: {output_details['quantization'][0]}")
print(f"  Zero point: {output_details['quantization'][1]}")

# 2. Use a better representative dataset
# Make sure it covers the real data distribution

# 3. Use quantization-aware training
# rather than post-training quantization

# 4. Compare CPU and accelerator outputs
def compare_outputs(cpu_output, npu_output, tolerance=0.1):
    """Compare CPU and NPU outputs"""
    diff = np.abs(cpu_output - npu_output)
    max_diff = np.max(diff)
    mean_diff = np.mean(diff)
    print(f"Max difference: {max_diff:.4f}")
    print(f"Mean difference: {mean_diff:.4f}")
    if max_diff > tolerance:
        print("⚠ Significant difference detected!")
    else:
        print("✓ Outputs are similar")
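One pitfall when comparing: accelerator outputs are typically raw uint8/int8 tensors, so dequantize them with the scale and zero point read from the output details before comparing against float CPU outputs. A small sketch, with assumed example values for the scale and zero point:

```python
import numpy as np

def dequantize(q, scale, zero_point):
    """Map quantized values back to float: real = scale * (q - zero_point).
    scale and zero_point come from output_details['quantization']."""
    return scale * (q.astype(np.float32) - zero_point)

# Assumed example: a uint8 output with scale 1/256 and zero point 0.
npu_raw = np.array([0, 128, 255], dtype=np.uint8)
npu_float = dequantize(npu_raw, scale=1/256, zero_point=0)
cpu_float = np.array([0.0, 0.5, 0.996], dtype=np.float32)
print(np.max(np.abs(cpu_float - npu_float)))
```

The dequantized array is what you would then pass to a comparison routine; comparing raw integer codes against floats makes every output look wildly wrong.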
Problem 5: Device not detected¶
Symptoms:
Solution:
# 1. Check the hardware connection
lsusb  # list USB devices
# You should see "Global Unichip Corp."

# 2. Check the driver installation
dpkg -l | grep edgetpu

# 3. Reinstall the driver
sudo apt-get remove libedgetpu1-std
sudo apt-get install libedgetpu1-std

# 4. Check permissions
sudo usermod -aG plugdev $USER
# log out and log back in

# 5. Reboot the device
sudo reboot

# 6. Test the connection
python3 -c "from pycoral.utils import edgetpu; print(edgetpu.list_edge_tpus())"
Best Practices¶
Model Design¶
- Use accelerator-friendly operators
  - prefer standard convolutions
  - avoid complex custom layers
  - use ReLU rather than other activations
  - use global average pooling
- Keep the model a reasonable size
  - parameters: 1M-10M
  - model file: 1MB-20MB
  - compute: 100M-1G MACs
- Choose a sound quantization strategy
  - prefer Int8 quantization
  - provide a high-quality representative dataset
  - verify accuracy after quantization
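The parameter and MAC budgets above can be checked at design time with simple arithmetic. A sketch for standard Conv2D layers, using a hypothetical layer list (bias terms, depthwise layers, and padding details are ignored):

```python
def conv2d_cost(h_out, w_out, c_in, c_out, k=3):
    """Parameters and multiply-accumulates (MACs) for one standard
    k x k Conv2D layer, ignoring the bias term."""
    params = k * k * c_in * c_out
    macs = params * h_out * w_out  # the kernel is reused at every output pixel
    return params, macs

# Rough budget check for a small hypothetical backbone:
layers = [
    (112, 112, 3, 32),    # stem
    (56, 56, 32, 64),
    (28, 28, 64, 128),
    (14, 14, 128, 256),
]
total_params = total_macs = 0
for h, w, ci, co in layers:
    p, m = conv2d_cost(h, w, ci, co)
    total_params += p
    total_macs += m
print(f"params: {total_params/1e6:.2f} M, MACs: {total_macs/1e6:.1f} M")
# → params: 0.39 M, MACs: 184.2 M
```

This toy backbone lands inside the 100M-1G MAC guideline; a quick check like this before training saves a conversion round-trip later.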
Performance Optimization¶
- Batching
  - test different batch sizes
  - find the batch size that maximizes throughput
  - balance latency against throughput
- Memory management
  - reuse buffers
  - use in-place operations
  - avoid unnecessary data copies
- Parallel processing
  - use multiple cores (if supported)
  - pipeline the processing stages
  - run inference asynchronously
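The pipelining and asynchronous-inference points above can be sketched with a feeder thread and a bounded queue, so frame preparation overlaps with inference instead of running strictly in sequence; `infer` here is a stand-in for a real accelerator call:

```python
import queue
import threading

def pipeline(frames, infer, depth=4):
    """Two-stage pipeline: a feeder thread pushes frames into a bounded
    queue while the main thread runs inference, overlapping the stages."""
    q = queue.Queue(maxsize=depth)
    SENTINEL = object()

    def feeder():
        for frame in frames:
            q.put(frame)  # blocks when the queue is full (backpressure)
        q.put(SENTINEL)

    threading.Thread(target=feeder, daemon=True).start()
    results = []
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(infer(item))
    return results

# Dummy stages: "inference" just doubles the frame value.
out = pipeline(range(5), infer=lambda x: x * 2)
print(out)  # [0, 2, 4, 6, 8]
```

The bounded queue caps memory growth when the camera outpaces the accelerator; in a real application the feeder would do capture and preprocessing.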
Development Workflow¶
- Develop iteratively
- Benchmark performance
  - establish a performance baseline
  - test after every optimization
  - record the effect of each optimization
- Validate continuously
  - verify accuracy
  - test edge cases
  - run long-duration tests
Summary¶
In this tutorial you learned:
- ✅ How neural network accelerators work and how they are organized
- ✅ How to use different accelerators (Edge TPU, RKNN, TensorRT)
- ✅ Model conversion and optimization techniques
- ✅ Benchmarking and comparison methods
- ✅ Operator optimization and performance tuning strategies
- ✅ Real-time video processing applications
- ✅ Troubleshooting and best practices
Key takeaways:
- Hardware acceleration pays off
  - 10-50x speedups
  - significantly lower latency
  - better energy efficiency
- Model adaptation matters
  - use operators the accelerator supports
  - quantize correctly
  - choose a sensible model architecture
- Performance optimization is systems engineering
  - model optimization
  - operator optimization
  - memory optimization
  - batching optimization
- There is no substitute for real testing
  - test on the target hardware
  - validate with real data
  - run long-duration tests
Performance comparison summary:
| Platform | Inference time | Throughput | Power | Cost |
|---|---|---|---|---|
| CPU | 45ms | 22 FPS | 5W | low |
| Edge TPU | 3ms | 320 FPS | 2W | medium |
| RKNN NPU | 5ms | 200 FPS | 3W | medium |
| Jetson GPU | 8ms | 125 FPS | 10W | high |
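The table's power and latency figures also imply an energy cost per inference, E = P × t. A quick check on those illustrative numbers:

```python
# Energy per inference E = power (W) * latency (s), using the
# illustrative figures from the comparison table above.
platforms = {
    "CPU":        (5.0,  0.045),
    "Edge TPU":   (2.0,  0.003),
    "RKNN NPU":   (3.0,  0.005),
    "Jetson GPU": (10.0, 0.008),
}
energy_mj = {name: p * t * 1000 for name, (p, t) in platforms.items()}
for name, e in energy_mj.items():
    print(f"{name:11s}: {e:6.1f} mJ/inference")
```

On these figures the Edge TPU spends roughly 37x less energy per inference than the CPU (about 6 mJ versus 225 mJ), which is why battery-powered deployments favor dedicated accelerators even when raw latency is already acceptable.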
Advanced Challenges¶
Try the following challenges to deepen your understanding:
Challenge 1: Multi-model inference - run several models at once - implement a model-switching mechanism - optimize memory usage
Challenge 2: Real-time object detection - use a YOLO or SSD model - run detection on live video - reach 30 FPS or more
Challenge 3: An edge AI application - build a complete edge AI application - integrate a camera and a display - implement a user interface
Challenge 4: Pushing the performance limit - optimize the model as far as it will go - apply every optimization technique - reach at least 80% of the hardware's theoretical performance
Next Steps¶
Suggested follow-up reading:
- Smart vision recognition project - a complete vision AI project
- Edge AI inference optimization techniques - optimization in depth
- Getting started with TensorFlow Lite - a refresher on the basics
References¶
Official documentation: 1. Google Coral Documentation 2. RKNN Toolkit Documentation 3. NVIDIA TensorRT Documentation 4. STM32 X-CUBE-AI Documentation
Open-source projects: 1. PyCoral Examples 2. RKNN Model Zoo 3. TensorRT Examples
Learning resources: 1. Edge AI and Vision Alliance 2. TinyML Foundation 3. Embedded Vision Summit
Tools and libraries: 1. Netron - model visualization 2. ONNX Runtime - cross-platform inference 3. OpenVINO - Intel's acceleration toolkit
Feedback: if you run into problems while using hardware accelerators, leave a comment below!
Version history: - v1.0 (2024-01-15): initial release