Recent research has extended computing-in-memory (CIM) to floating-point (FP) operations, enabling high-precision computation for complex edge tasks such as object detection and segmentation [1]–[3]. However, the ever-growing demands of edge intelligence escalate the need for higher throughput, better energy efficiency, and on-device updates, posing significant challenges for prior pre-aligning-based FP CIMs (Fig. 1). 1) A fundamental limitation lies in the INT mantissa multiply-accumulate (MAC): bit-parallel computation is fast but consumes significant area/energy due to its wide-bit-width multipliers and adder trees, so most designs adopt a bit-serial compute scheme. Bit-serial computation, however, requires multiple compute cycles, e.g., 8 cycles for a BF16 mantissa MAC, severely limiting throughput. 2) The exponent sorting and mantissa normalization required for FP/INT conversion in previous FP CIMs introduce a complex comparison tree and shifter, greatly increasing area/energy overhead. 3) Previous FP CIMs do not support on-device fine-tuning to adapt to environmental changes, resulting in accuracy loss in real-world applications.
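To make the first two overheads concrete, the following sketch (our illustration, not the circuitry of the cited designs; the function names `bf16_fields`, `prealigned_mac`, and `bitserial_mult` are hypothetical) models a pre-aligning BF16 dot product. The max-exponent search and alignment shifts correspond to the comparison tree and shifter of point 2), and the bit-serial mantissa multiply shows why point 1) costs 8 cycles for BF16's 8-bit significand (1 hidden bit + 7 stored bits):

```python
import struct

def bf16_fields(x):
    """Decompose a Python float into BF16-style (sign, biased exponent,
    8-bit significand). The significand is the 7 stored fraction bits
    with the hidden leading 1 restored (normal numbers assumed)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0] >> 16  # truncate FP32 -> BF16
    sign = (bits >> 15) & 0x1
    exp = (bits >> 7) & 0xFF
    man = (bits & 0x7F) | 0x80  # restore hidden bit
    return sign, exp, man

def bitserial_mult(m_in, m_w):
    """Bit-serial mantissa multiply: one weight bit per compute cycle,
    so BF16's 8-bit significand costs 8 cycles."""
    acc = 0
    for cycle in range(8):            # 8 cycles for 8 significand bits
        if (m_w >> cycle) & 1:
            acc += m_in << cycle      # shift-and-add partial product
    return acc

def prealigned_mac(xs, ws):
    """Pre-aligning FP MAC: find the max product exponent (comparison
    tree in hardware), right-shift the other mantissa products to align
    (shifter), then accumulate with an INT adder."""
    prods = []
    for x, w in zip(xs, ws):
        sx, ex, mx = bf16_fields(x)
        sw, ew, mw = bf16_fields(w)
        prods.append((sx ^ sw, ex + ew, bitserial_mult(mx, mw)))
    e_max = max(e for _, e, _ in prods)                         # exponent sorting
    acc = sum((-m if s else m) >> (e_max - e) for s, e, m in prods)
    return acc, e_max  # aligned INT mantissa sum + shared exponent
```

The returned pair decodes to `acc * 2**(e_max - 2*127 - 14)` (two exponent biases, two 7-bit fractions); e.g., `prealigned_mac([1.0, 2.0], [1.0, 1.0])` yields the dot product 3.0 after decoding.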