Performance Optimization
========================

This comprehensive guide covers performance analysis, optimization techniques, and best practices for Nexus embedded applications.

.. contents:: Table of Contents
   :local:
   :depth: 3

Overview
--------

Performance optimization is critical for embedded systems with limited resources. This guide provides strategies for analyzing and improving performance.

**Optimization Goals:**

* Minimize CPU usage
* Reduce memory footprint
* Decrease power consumption
* Improve response time
* Maximize throughput

**Optimization Process:**

1. **Measure** - Profile to find bottlenecks
2. **Analyze** - Understand performance issues
3. **Optimize** - Apply targeted improvements
4. **Verify** - Measure improvements
5. **Iterate** - Repeat until goals met

.. warning::

   Premature optimization is the root of all evil. Always measure before optimizing!

Performance Measurement
-----------------------

Timing Measurements
~~~~~~~~~~~~~~~~~~~

**Microsecond Timing:**

.. code-block:: c

    #include "hal/nx_timer.h"

    void measure_function_time(void) {
        /* Start high-resolution timer (assumed to count at SystemCoreClock) */
        uint32_t start = hal_timer_get_counter(TIMER_0);

        /* Function to measure */
        perform_operation();

        /* Calculate elapsed time; unsigned subtraction stays correct
         * across a single counter wrap */
        uint32_t end = hal_timer_get_counter(TIMER_0);
        uint32_t cycles = end - start;
        uint32_t us = cycles / (SystemCoreClock / 1000000);

        LOG_INFO("Operation took %lu us (%lu cycles)", us, cycles);
    }

**Millisecond Timing:**

.. code-block:: c

    #include "osal/osal.h"

    void measure_task_time(void) {
        uint32_t start = osal_get_time_ms();

        /* Task work */
        process_data();

        uint32_t elapsed = osal_get_time_ms() - start;
        LOG_INFO("Task took %lu ms", elapsed);
    }

**Profiling Macros:**

.. code-block:: c

    /* Identifiers starting with a double underscore are reserved for the
     * implementation, so the macro-local names use a profile_ prefix. */
    #ifdef PROFILE
    #define PROFILE_START(name) \
        uint32_t profile_##name##_start = hal_timer_get_counter(TIMER_0)

    #define PROFILE_END(name) \
        do { \
            uint32_t profile_end_ = hal_timer_get_counter(TIMER_0); \
            uint32_t profile_cycles_ = profile_end_ - profile_##name##_start; \
            uint32_t profile_us_ = profile_cycles_ / (SystemCoreClock / 1000000); \
            LOG_INFO("PROFILE %s: %lu us (%lu cycles)", \
                     #name, profile_us_, profile_cycles_); \
        } while (0)
    #else
    #define PROFILE_START(name)
    #define PROFILE_END(name)
    #endif

    void my_function(void) {
        PROFILE_START(my_function);

        /* Function code */

        PROFILE_END(my_function);
    }

CPU Usage Monitoring
~~~~~~~~~~~~~~~~~~~~

**FreeRTOS Runtime Stats:**

.. code-block:: c

    #include <string.h>

    #include "FreeRTOS.h"
    #include "task.h"

    #if (configGENERATE_RUN_TIME_STATS == 1)
    void print_cpu_usage(void) {
        /* The buffer must be large enough for one line per task */
        char stats_buffer[512];

        vTaskGetRunTimeStats(stats_buffer);

        LOG_INFO("Task Statistics:");
        LOG_INFO("%s", stats_buffer);
    }

    uint32_t get_cpu_usage_percent(void) {
        TaskStatus_t* task_array;
        uint32_t total_runtime;
        uint32_t num_tasks;

        /* Get task count */
        num_tasks = uxTaskGetNumberOfTasks();

        /* Allocate array */
        task_array = pvPortMalloc(num_tasks * sizeof(TaskStatus_t));
        if (!task_array) {
            return 0;
        }

        /* Get task stats */
        num_tasks = uxTaskGetSystemState(task_array, num_tasks, &total_runtime);

        /* Find the idle task's accumulated run time */
        uint32_t idle_runtime = 0;
        for (uint32_t i = 0; i < num_tasks; i++) {
            if (strcmp(task_array[i].pcTaskName, "IDLE") == 0) {
                idle_runtime = task_array[i].ulRunTimeCounter;
                break;
            }
        }

        vPortFree(task_array);

        /* Divide first: (idle_runtime * 100) would overflow 32 bits
         * once the counter exceeds about 42.9 million ticks */
        uint32_t runtime_per_percent = total_runtime / 100;
        if (runtime_per_percent == 0) {
            return 0;
        }

        uint32_t idle_percent = idle_runtime / runtime_per_percent;
        return (idle_percent >= 100) ? 0 : (100 - idle_percent);
    }
    #endif

**Idle Task Hook:**

.. code-block:: c

    static uint32_t last_check_time = 0;

    /* The idle hook must never block, so the logging call used here is
     * assumed to be non-blocking. */
    void vApplicationIdleHook(void) {
        /* Check CPU usage every second */
        uint32_t now = osal_get_time_ms();
        if (now - last_check_time >= 1000) {
            uint32_t cpu_usage = get_cpu_usage_percent();
            LOG_DEBUG("CPU usage: %lu%%", cpu_usage);

            last_check_time = now;
        }
    }

Memory Profiling
~~~~~~~~~~~~~~~~

**Stack Usage:**

.. code-block:: c

    void check_stack_usage(void) {
        osal_task_handle_t current = osal_task_get_current();

        uint32_t high_water = osal_task_get_stack_high_water(current);
        uint32_t stack_size = osal_task_get_stack_size(current);
        uint32_t used = stack_size - high_water;
        uint32_t percent = (used * 100) / stack_size;

        LOG_INFO("Task: %s", osal_task_get_name(current));
        LOG_INFO("Stack: %lu/%lu bytes (%lu%%)", used, stack_size, percent);

        if (percent > 80) {
            LOG_WARN("Stack usage high!");
        }
    }

    void check_all_tasks_stack(void) {
        TaskStatus_t* task_array;
        uint32_t num_tasks = uxTaskGetNumberOfTasks();

        task_array = pvPortMalloc(num_tasks * sizeof(TaskStatus_t));
        if (!task_array) {
            return;
        }

        num_tasks = uxTaskGetSystemState(task_array, num_tasks, NULL);

        LOG_INFO("Task Stack Usage:");
        for (uint32_t i = 0; i < num_tasks; i++) {
            /* FreeRTOS reports the high-water mark in stack words,
             * so convert to bytes before logging */
            uint32_t free_bytes =
                task_array[i].usStackHighWaterMark * sizeof(StackType_t);
            LOG_INFO("  %s: %lu bytes free",
                     task_array[i].pcTaskName, free_bytes);
        }

        vPortFree(task_array);
    }

**Heap Usage:**

.. code-block:: c

    void check_heap_usage(void) {
        size_t free_heap = osal_get_free_heap_size();
        size_t min_free = osal_get_minimum_ever_free_heap_size();
        size_t total_heap = configTOTAL_HEAP_SIZE;
        size_t used = total_heap - free_heap;
        uint32_t percent = (used * 100) / total_heap;

        LOG_INFO("Heap Usage:");
        LOG_INFO("  Total: %zu bytes", total_heap);
        LOG_INFO("  Used: %zu bytes (%lu%%)", used, percent);
        LOG_INFO("  Free: %zu bytes", free_heap);
        LOG_INFO("  Min Free: %zu bytes", min_free);

        if (percent > 90) {
            LOG_WARN("Heap usage critical!");
        }
    }

Interrupt Latency
~~~~~~~~~~~~~~~~~

**Measure Interrupt Response:**

The timestamps below bracket the handler body, so the result is the handler's execution time. Measuring the latency from the hardware event to handler entry needs an external reference, such as a timer capture or a GPIO toggle observed on an oscilloscope.

.. code-block:: c

    static volatile uint32_t irq_entry_time = 0;
    static volatile uint32_t irq_exit_time = 0;

    void EXTI0_IRQHandler(void) {
        /* Record entry time */
        irq_entry_time = hal_timer_get_counter(TIMER_0);

        /* Handle interrupt */
        handle_button_press();

        /* Record exit time */
        irq_exit_time = hal_timer_get_counter(TIMER_0);

        /* Clear interrupt flag */
        EXTI->PR = EXTI_PR_PR0;
    }

    void check_irq_latency(void) {
        /* Unsigned subtraction stays correct across a single counter wrap,
         * so compare for inequality rather than "greater than" */
        if (irq_exit_time != irq_entry_time) {
            uint32_t cycles = irq_exit_time - irq_entry_time;
            uint32_t us = cycles / (SystemCoreClock / 1000000);
            LOG_INFO("IRQ handler time: %lu us", us);
        }
    }

Compiler Optimizations
----------------------

Optimization Levels
~~~~~~~~~~~~~~~~~~~

**GCC/Clang Optimization Flags:**

+-------------+------------------+---------------------------+
| Flag        | Optimization     | Typical use               |
+=============+==================+===========================+
| ``-O0``     | No optimization  | Debug builds              |
+-------------+------------------+---------------------------+
| ``-Og``     | Debug optimize   | Debuggable optimization   |
+-------------+------------------+---------------------------+
| ``-O1``     | Basic optimize   | Moderate optimization     |
+-------------+------------------+---------------------------+
| ``-O2``     | Full optimize    | Recommended for release   |
+-------------+------------------+---------------------------+
| ``-O3``     | Aggressive       | Maximum speed             |
+-------------+------------------+---------------------------+
| ``-Os``     | Size optimize    | Minimum code size         |
+-------------+------------------+---------------------------+
| ``-Ofast``  | Fast math        | Not standards-compliant   |
+-------------+------------------+---------------------------+

**CMake Configuration:**

.. code-block:: cmake

    # Release build (CMake's GCC/Clang default is -O3 -DNDEBUG)
    set(CMAKE_BUILD_TYPE Release)

    # Size optimization
    set(CMAKE_BUILD_TYPE MinSizeRel)

    # Custom optimization
    add_compile_options(-O3 -flto)

Link-Time Optimization (LTO)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Enable LTO:**

.. code-block:: cmake

    # CMakeLists.txt
    if(CMAKE_BUILD_TYPE STREQUAL "Release")
        set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE)
        add_compile_options(-flto)
        add_link_options(-flto)
    endif()

**Benefits:**

* Cross-module inlining
* Dead code elimination
* Better optimization opportunities
* Smaller binary size

**Trade-offs:**

* Longer build times
* Higher memory usage during linking
* May complicate debugging

Function Inlining
~~~~~~~~~~~~~~~~~

**Inline Functions:**

.. code-block:: c

    /* Force inline */
    static inline __attribute__((always_inline))
    uint32_t fast_multiply(uint32_t a, uint32_t b) {
        return a * b;
    }

    /* Suggest inline */
    static inline uint32_t calculate_checksum(const uint8_t* data, size_t len) {
        uint32_t sum = 0;
        for (size_t i = 0; i < len; i++) {
            sum += data[i];
        }
        return sum;
    }

    /* Never inline (for debugging) */
    __attribute__((noinline))
    void debug_function(void) {
        /* ... */
    }

**When to Inline:**

* Small functions (<10 lines)
* Functions called frequently
* Functions in hot paths
* Simple calculations

**When NOT to Inline:**

* Large functions
* Rarely called functions
* Functions with loops
* Recursive functions

Code Optimization Techniques
-----------------------------

Algorithm Optimization
~~~~~~~~~~~~~~~~~~~~~~

**Choose Efficient Algorithms:**

.. code-block:: c

    /* Bad: O(n²) bubble sort */
    void bubble_sort(int* arr, size_t n) {
        /* Guard: with n == 0 the unsigned expression n - 1 would wrap */
        if (n < 2) {
            return;
        }
        for (size_t i = 0; i < n - 1; i++) {
            for (size_t j = 0; j < n - i - 1; j++) {
                if (arr[j] > arr[j + 1]) {
                    int temp = arr[j];
                    arr[j] = arr[j + 1];
                    arr[j + 1] = temp;
                }
            }
        }
    }

    /* Better: quicksort, O(n log n) on average (O(n²) worst case).
     * partition() is assumed to be defined elsewhere. Recursion depth
     * consumes stack, which matters on small task stacks. */
    void quicksort(int* arr, int low, int high) {
        if (low < high) {
            int pivot = partition(arr, low, high);
            quicksort(arr, low, pivot - 1);
            quicksort(arr, pivot + 1, high);
        }
    }

**Use Lookup Tables:**

.. code-block:: c

    /* Bad: Calculate every time (CRC-8, reflected polynomial 0x8C) */
    uint8_t calculate_crc(uint8_t data) {
        uint8_t crc = 0;
        for (int i = 0; i < 8; i++) {
            if ((crc ^ data) & 0x01) {
                crc = (crc >> 1) ^ 0x8C;
            } else {
                crc >>= 1;
            }
            data >>= 1;
        }
        return crc;
    }

    /* Good: Use lookup table; entries must be generated with the same
     * polynomial as calculate_crc() above */
    static const uint8_t crc_table[256] = {
        0x00, 0x5E, 0xBC, 0xE2, /* ... */
    };

    uint8_t calculate_crc_fast(uint8_t data) {
        return crc_table[data];
    }

Loop Optimization
~~~~~~~~~~~~~~~~~

**Loop Unrolling:**

.. code-block:: c

    /* Original loop */
    void copy_data(uint8_t* dst, const uint8_t* src, size_t len) {
        for (size_t i = 0; i < len; i++) {
            dst[i] = src[i];
        }
    }

    /* Unrolled loop (4x) */
    void copy_data_fast(uint8_t* dst, const uint8_t* src, size_t len) {
        size_t i = 0;

        /* Process 4 bytes at a time */
        for (; i + 4 <= len; i += 4) {
            dst[i + 0] = src[i + 0];
            dst[i + 1] = src[i + 1];
            dst[i + 2] = src[i + 2];
            dst[i + 3] = src[i + 3];
        }

        /* Handle remaining bytes */
        for (; i < len; i++) {
            dst[i] = src[i];
        }
    }

**Loop Invariant Code Motion:**

.. code-block:: c

    /* Bad: Recalculate every iteration */
    void process_array(int* arr, size_t len, int factor) {
        for (size_t i = 0; i < len; i++) {
            arr[i] = arr[i] * (factor + 10);  /* factor + 10 is invariant */
        }
    }

    /* Good: Calculate once */
    void process_array_fast(int* arr, size_t len, int factor) {
        int multiplier = factor + 10;  /* Move out of loop */
        for (size_t i = 0; i < len; i++) {
            arr[i] = arr[i] * multiplier;
        }
    }

**Strength Reduction:**

.. code-block:: c

    #include <math.h>

    /* Bad: Use expensive operations */
    void calculate_powers(int* result, int base, size_t n) {
        for (size_t i = 0; i < n; i++) {
            result[i] = pow(base, i);  /* Expensive floating-point call */
        }
    }

    /* Good: Use cheaper operations */
    void calculate_powers_fast(int* result, int base, size_t n) {
        int power = 1;
        for (size_t i = 0; i < n; i++) {
            result[i] = power;
            power *= base;  /* Cheaper than pow() */
        }
    }

Data Structure Optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Use Appropriate Data Structures:**

.. code-block:: c

    /* Bad: Linear search in array */
    typedef struct {
        int id;
        char name[32];
    } device_t;

    device_t devices[100];

    device_t* find_device(int id) {
        for (int i = 0; i < 100; i++) {
            if (devices[i].id == id) {
                return &devices[i];
            }
        }
        return NULL;
    }

    /* Good: Use hash table */
    #define HASH_SIZE 16

    typedef struct device_node {
        device_t device;
        struct device_node* next;
    } device_node_t;

    device_node_t* hash_table[HASH_SIZE];

    uint32_t hash(int id) {
        /* Convert to unsigned first: a negative id would otherwise give
         * a negative remainder and an out-of-range bucket index */
        return (uint32_t)id % HASH_SIZE;
    }

    device_t* find_device_fast(int id) {
        uint32_t index = hash(id);
        device_node_t* node = hash_table[index];

        while (node) {
            if (node->device.id == id) {
                return &node->device;
            }
            node = node->next;
        }
        return NULL;
    }

**Pack Structures:**

.. code-block:: c

    /* Bad: Unpacked structure (12 bytes on 32-bit) */
    typedef struct {
        uint8_t flag;    /* 1 byte + 3 padding */
        uint32_t value;  /* 4 bytes */
        uint8_t status;  /* 1 byte + 3 padding */
    } unpacked_t;

    /* Good: Packed structure (6 bytes); value sits at offset 2,
     * so access to it is unaligned and may be slower */
    typedef struct __attribute__((packed)) {
        uint8_t flag;    /* 1 byte */
        uint8_t status;  /* 1 byte */
        uint32_t value;  /* 4 bytes */
    } packed_t;

    /* Better: Reordered for natural alignment (8 bytes, but faster access) */
    typedef struct {
        uint32_t value;  /* 4 bytes */
        uint8_t flag;    /* 1 byte */
        uint8_t status;  /* 1 byte */
        uint16_t padding; /* 2 bytes explicit padding */
    } aligned_t;

Memory Access Optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Cache-Friendly Access:**

.. code-block:: c

    /* Bad: Column-major access (cache unfriendly) */
    void process_matrix_bad(int matrix[100][100]) {
        for (int col = 0; col < 100; col++) {
            for (int row = 0; row < 100; row++) {
                matrix[row][col] *= 2;
            }
        }
    }

    /* Good: Row-major access (cache friendly) */
    void process_matrix_good(int matrix[100][100]) {
        for (int row = 0; row < 100; row++) {
            for (int col = 0; col < 100; col++) {
                matrix[row][col] *= 2;
            }
        }
    }

**Alignment:**

.. code-block:: c

    /* Ensure proper alignment for DMA */
    __attribute__((aligned(32)))
    uint8_t dma_buffer[1024];

    /* Align structure to cache line */
    typedef struct __attribute__((aligned(64))) {
        uint32_t data[16];
    } cache_aligned_t;

Hardware Acceleration
---------------------

DMA Usage
~~~~~~~~~

**Use DMA for Large Transfers:**

.. code-block:: c

    /* Bad: CPU copy */
    void copy_large_buffer(uint8_t* dst, const uint8_t* src, size_t len) {
        for (size_t i = 0; i < len; i++) {
            dst[i] = src[i];
        }
    }

    /* Good: DMA copy */
    void copy_large_buffer_dma(uint8_t* dst, const uint8_t* src, size_t len) {
        nx_dma_config_t config = {
            .direction = DMA_MEMORY_TO_MEMORY,
            .src_inc = DMA_INC_ENABLE,
            .dst_inc = DMA_INC_ENABLE,
            .data_width = DMA_WIDTH_BYTE,
        };

        nx_dma_t* dma = nx_factory_dma(0);
        if (!dma) {
            /* No DMA channel available: fall back to the CPU copy */
            copy_large_buffer(dst, src, len);
            return;
        }

        dma->configure(dma, &config);
        dma->start(dma, src, dst, len);
        dma->wait(dma, 1000);
        nx_factory_dma_release(dma);
    }

**DMA for Peripheral I/O:**

.. code-block:: c

    /* Use DMA for UART transmission */
    void uart_send_dma(nx_uart_t* uart, const uint8_t* data, size_t len) {
        nx_tx_dma_t* tx_dma = uart->get_tx_dma(uart);
        if (tx_dma) {
            /* data must stay valid until the transfer completes */
            tx_dma->send(tx_dma, data, len);
            /* CPU is free to do other work */
        }
    }

Hardware Crypto
~~~~~~~~~~~~~~~

**Use Hardware Acceleration:**

.. code-block:: c

    /* Software AES (slow) */
    void aes_encrypt_sw(const uint8_t* key, const uint8_t* input,
                        uint8_t* output) {
        /* Software AES implementation */
        sw_aes_encrypt(key, input, output);
    }

    /* Hardware AES (fast) */
    void aes_encrypt_hw(const uint8_t* key, const uint8_t* input,
                        uint8_t* output) {
        nx_crypto_t* crypto = nx_factory_crypto(0);
        if (!crypto) {
            /* No crypto engine available: fall back to software */
            aes_encrypt_sw(key, input, output);
            return;
        }

        crypto->aes_encrypt(crypto, key, input, output);
        nx_factory_crypto_release(crypto);
    }

RTOS Optimization
-----------------

Task Priority
~~~~~~~~~~~~~

**Set Appropriate Priorities:**

.. code-block:: c

    /* High priority for time-critical tasks */
    osal_task_create(isr_handler_task, "isr", 512, NULL,
                     OSAL_PRIORITY_REALTIME, &isr_task);

    /* Normal priority for regular tasks */
    osal_task_create(processing_task, "proc", 1024, NULL,
                     OSAL_PRIORITY_NORMAL, &proc_task);

    /* Low priority for background tasks */
    osal_task_create(logging_task, "log", 512, NULL,
                     OSAL_PRIORITY_LOW, &log_task);

**Priority Inversion:**

.. code-block:: c

    /* Use priority inheritance mutexes */
    osal_mutex_config_t config = {
        .type = OSAL_MUTEX_RECURSIVE,
        .priority_inherit = true,  /* Enable priority inheritance */
    };

    osal_mutex_handle_t mutex;
    osal_mutex_create_ex(&config, &mutex);

Task Stack Size
~~~~~~~~~~~~~~~

**Optimize Stack Sizes:**

.. code-block:: c

    /* Measure actual stack usage */
    void optimize_stack_sizes(void) {
        TaskStatus_t* tasks;
        uint32_t num_tasks = uxTaskGetNumberOfTasks();

        tasks = pvPortMalloc(num_tasks * sizeof(TaskStatus_t));
        if (!tasks) {
            return;
        }

        num_tasks = uxTaskGetSystemState(tasks, num_tasks, NULL);

        for (uint32_t i = 0; i < num_tasks; i++) {
            /* The high-water mark is reported in stack words, not bytes */
            uint32_t free_bytes =
                tasks[i].usStackHighWaterMark * sizeof(StackType_t);

            LOG_INFO("Task %s: %lu bytes free (reduce stack?)",
                     tasks[i].pcTaskName, free_bytes);
        }

        vPortFree(tasks);
    }

Synchronization Overhead
~~~~~~~~~~~~~~~~~~~~~~~~~

**Minimize Lock Contention:**

.. code-block:: c

    /* Bad: Hold lock during slow operation */
    void process_data_bad(void) {
        osal_mutex_lock(data_mutex, OSAL_WAIT_FOREVER);

        /* Long operation while holding lock */
        for (int i = 0; i < 1000; i++) {
            process_item(i);
        }

        osal_mutex_unlock(data_mutex);
    }

    /* Good: Minimize critical section */
    void process_data_good(void) {
        /* Copy data while holding lock */
        osal_mutex_lock(data_mutex, OSAL_WAIT_FOREVER);
        memcpy(local_buffer, shared_buffer, sizeof(local_buffer));
        osal_mutex_unlock(data_mutex);

        /* Process local copy without lock */
        for (int i = 0; i < 1000; i++) {
            process_item(local_buffer[i]);
        }
    }

**Use Lock-Free Algorithms:**

.. code-block:: c

    /* Lock-free ring buffer: safe only for one producer and one consumer
     * on a single core. Multi-core targets need C11 atomics with
     * release/acquire ordering instead of volatile. */
    typedef struct {
        volatile uint32_t head;
        volatile uint32_t tail;
        /* volatile keeps the compiler from moving the data write
         * past the head update */
        volatile uint8_t buffer[256];
    } lockfree_ringbuf_t;

    bool ringbuf_push(lockfree_ringbuf_t* rb, uint8_t data) {
        uint32_t next_head = (rb->head + 1) % 256;

        if (next_head == rb->tail) {
            return false;  /* Full */
        }

        rb->buffer[rb->head] = data;
        rb->head = next_head;  /* Aligned 32-bit store is atomic on Cortex-M */

        return true;
    }

Power Optimization
------------------

See :doc:`power_management` for detailed power optimization techniques.

**Quick Tips:**

* Use sleep modes when idle
* Reduce clock frequency when possible
* Disable unused peripherals
* Use DMA to allow CPU sleep
* Optimize interrupt handlers

Code Size Optimization
----------------------

Compiler Flags
~~~~~~~~~~~~~~

**Size Optimization:**

.. code-block:: cmake

    # Optimize for size
    set(CMAKE_BUILD_TYPE MinSizeRel)

    # Additional size flags
    add_compile_options(
        -Os                    # Optimize for size
        -ffunction-sections    # Each function in own section
        -fdata-sections        # Each data in own section
    )

    add_link_options(
        -Wl,--gc-sections      # Remove unused sections
        -Wl,--print-gc-sections # Print removed sections
    )

Remove Unused Code
~~~~~~~~~~~~~~~~~~

**Conditional Compilation:**

.. code-block:: c

    /* Remove debug code in release builds */
    #ifdef DEBUG
    void debug_print_state(void) {
        /* Debug code */
    }
    #endif

    /* Use Kconfig to remove features */
    #ifdef CONFIG_FEATURE_ADVANCED
    void advanced_feature(void) {
        /* Advanced feature code */
    }
    #endif

**Link-Time Garbage Collection:**

.. code-block:: cmake

    # Remove unused functions at link time
    add_compile_options(-ffunction-sections -fdata-sections)
    add_link_options(-Wl,--gc-sections)

Reduce Library Size
~~~~~~~~~~~~~~~~~~~

**Use Minimal Libraries:**

.. code-block:: cmake

    # Use newlib-nano for smaller C library
    add_link_options(--specs=nano.specs)

    # Remove floating point printf support (this macro only takes effect
    # if the printf implementation in use honors it)
    add_compile_definitions(PRINTF_DISABLE_SUPPORT_FLOAT)

Best Practices
--------------

1. **Measure First**

   * Profile before optimizing
   * Identify real bottlenecks
   * Set performance goals
   * Measure improvements

2. **Optimize Hot Paths**

   * Focus on frequently executed code
   * Optimize inner loops
   * Optimize interrupt handlers
   * Optimize critical sections

3. **Choose Right Algorithms**

   * Use appropriate data structures
   * Consider time/space trade-offs
   * Use standard library when possible
   * Benchmark alternatives

4. **Minimize Memory Access**

   * Use registers when possible
   * Reduce cache misses
   * Align data properly
   * Use DMA for large transfers

5. **Reduce Overhead**

   * Minimize function calls
   * Reduce context switches
   * Minimize lock contention
   * Use efficient synchronization

6. **Balance Optimization**

   * Don't sacrifice readability
   * Don't sacrifice maintainability
   * Don't sacrifice correctness
   * Document optimizations

7. **Test Thoroughly**

   * Verify correctness after optimization
   * Test edge cases
   * Test on target hardware
   * Measure actual improvements

Performance Checklist
---------------------

**Before Optimization:**

- [ ] Profile application
- [ ] Identify bottlenecks
- [ ] Set performance goals
- [ ] Establish baseline measurements

**During Optimization:**

- [ ] Focus on hot paths
- [ ] One optimization at a time
- [ ] Measure each change
- [ ] Document optimizations

**After Optimization:**

- [ ] Verify correctness
- [ ] Measure improvements
- [ ] Update documentation
- [ ] Review code quality

See Also
--------

* :doc:`profiling` - Performance Profiling
* :doc:`memory_management` - Memory Management
* :doc:`power_management` - Power Management
* :doc:`../development/performance_optimization` - Development Performance Guide
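
**Baseline Bookkeeping Sketch:**

The checklist steps "Establish baseline measurements" and "Measure each change" can be supported by a small accumulator that keeps minimum, maximum, and average cycle counts over repeated runs. The sketch below is only an illustration: ``perf_stats_t`` and the ``perf_*`` functions are hypothetical names, not part of the Nexus API, and the raw counter readings are assumed to come from a free-running 32-bit up-counter.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical helper type, not part of the Nexus API */
    typedef struct {
        uint32_t count;
        uint32_t min_cycles;
        uint32_t max_cycles;
        uint64_t total_cycles;
    } perf_stats_t;

    static void perf_stats_init(perf_stats_t* s) {
        s->count = 0;
        s->min_cycles = UINT32_MAX;
        s->max_cycles = 0;
        s->total_cycles = 0;
    }

    /* Record one sample from two raw counter readings. Unsigned
     * subtraction stays correct across a single counter wrap. */
    static void perf_stats_add(perf_stats_t* s, uint32_t start, uint32_t end) {
        uint32_t cycles = end - start;

        if (cycles < s->min_cycles) {
            s->min_cycles = cycles;
        }
        if (cycles > s->max_cycles) {
            s->max_cycles = cycles;
        }
        s->total_cycles += cycles;
        s->count++;
    }

    /* Average cycles per sample, or 0 if nothing was recorded */
    static uint32_t perf_stats_avg(const perf_stats_t* s) {
        return s->count ? (uint32_t)(s->total_cycles / s->count) : 0;
    }

    /* Percent improvement of 'after' relative to 'before';
     * positive means faster, negative means a regression */
    static int32_t perf_improvement_percent(uint32_t before, uint32_t after) {
        if (before == 0) {
            return 0;
        }
        return (int32_t)((((int64_t)before - (int64_t)after) * 100)
                         / (int64_t)before);
    }

One way to use it is to fill one ``perf_stats_t`` before a change and another after it, then compare the two averages with ``perf_improvement_percent()``. Keeping the minimum and maximum alongside the average helps show whether a change reduced the typical case, the worst case, or both.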