The third-generation Audio Video coding Standard (AVS3) has significantly enhanced video coding efficiency. Sample Adaptive Offset (SAO) is an essential filtering tool of AVS3, designed to reduce ringing artifacts. However, SAO involves many complex computations and data dependencies, making it difficult to implement in real-time hardware. To address this problem, a fast Coding Tree Unit (CTU)-level SAO algorithm and its hardware architecture are proposed. The SAO is optimized by decoupling the data dependencies between adjacent CTUs. Building on the proposed hardware-friendly algorithm, this paper presents a hardware architecture for the information statistics, mode decision, and filtering modules involved in SAO. Through methods such as data reuse and parallel computation, the hardware procedures are pipelined. Experimental results show that the proposed fast algorithm incurs a Bjøntegaard-Delta rate (BD-Rate) loss of only 0.08% and 0.04% under the Low Delay P (LDP) and Random Access (RA) configurations, respectively. Validation on a Xilinx® Alveo U250 Data Center Accelerator Card shows that the design supports real-time encoding of 3840×2160 (4K) video at 30 fps with a 300 MHz clock. To our knowledge, this is the first CTU-level SAO hardware architecture for AVS3. In the future, we will test our design in an ASIC flow and aim for higher performance, such as 4K@60fps.
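To put the reported operating point in perspective, the following is a minimal back-of-the-envelope sketch of the average cycle budget implied by 4K@30fps at 300 MHz. It is not part of the proposed design: the function names are hypothetical, the CTU sizes of 64 and 128 are assumptions (the abstract does not state the CTU size used), chroma samples are ignored, and a fully pipelined architecture would be bounded by throughput rather than by this per-CTU average.

```python
from fractions import Fraction

def cycles_per_sample(width, height, fps, clock_hz):
    """Average clock cycles available per luma sample (chroma ignored)."""
    return Fraction(clock_hz, width * height * fps)

def cycles_per_ctu(ctu_size, width, height, fps, clock_hz):
    """Average cycle budget per square CTU.

    Partial CTUs at the right and bottom frame edges are counted as
    whole CTUs, hence the ceiling divisions.
    """
    ctus_wide = -(-width // ctu_size)   # ceiling division
    ctus_high = -(-height // ctu_size)
    return Fraction(clock_hz, ctus_wide * ctus_high * fps)

if __name__ == "__main__":
    # Operating point taken from the abstract: 3840x2160 @ 30 fps, 300 MHz.
    w, h, fps, clk = 3840, 2160, 30, 300_000_000
    print(f"cycles per luma sample: {float(cycles_per_sample(w, h, fps, clk)):.3f}")
    for ctu in (64, 128):  # assumed CTU sizes, for illustration only
        budget = cycles_per_ctu(ctu, w, h, fps, clk)
        print(f"CTU {ctu}x{ctu}: about {int(budget)} cycles per CTU")
```

Under these assumptions the budget is only slightly more than one cycle per luma sample, which suggests why the statistics, mode decision, and filtering stages need to process several samples in parallel and overlap through pipelining.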