Using the in-network computation capabilities of network devices (also known as hardware collectives) to optimize MPI collectives has become popular in high-performance computing and offers significant performance advantages. However, hardware collectives are not flawless in practice; one problem is that they are difficult to use. To obtain their performance advantage, the network management software needs to create a dedicated aggregate tree for each MPI communicator, which is a complicated task. One workaround is to have MPI communicators share the global, imprecise aggregate trees created by the management software during network initialization, but this leads to heavy interference between MPI communicators and causes significant performance degradation. A tradeoff must therefore be made between performance and ease of use. We propose a hybrid approach that optimizes MPI collectives with both in-network computation and point-to-point messages. On the one hand, we use the pre-created aggregate trees within each super-node rather than asking the network management software to create dedicated aggregate trees. On the other hand, hardware-collective traffic is confined to the local super-node, so it cannot disturb jobs running on other super-nodes. We provide a cost model to evaluate the overhead of the hybrid collective algorithms and evaluate their performance on the new-generation Sunway supercomputer. The results show that our approach reduces median latency by 18%~74% compared with collectives implemented with point-to-point messages, while the performance decreases only slightly compared with the original hardware collectives. In addition, the tail latency of our approach is significantly lower than that of the original hardware collectives in the presence of heavy interference.
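
To make the hierarchical structure of the hybrid approach concrete, the following is a minimal sketch, not the paper's implementation: the intra-super-node step, which in the real system would run over the pre-created hardware aggregate tree, is only modeled here with MPI_Allreduce on a sub-communicator, and names such as super_node_id and the assumption of 256 ranks per super-node are illustrative, not part of the Sunway software stack.

/*
 * Hypothetical sketch of a hierarchical (super-node-aware) allreduce.
 * Step 1 stands in for the intra-super-node hardware collective;
 * steps 2 and 3 use ordinary point-to-point-backed MPI collectives
 * across super-nodes, mirroring the hybrid scheme described above.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Illustrative assumption: 256 ranks per super-node. */
    const int ranks_per_super_node = 256;
    int super_node_id = world_rank / ranks_per_super_node;

    /* Communicator of ranks inside the local super-node. */
    MPI_Comm intra_comm;
    MPI_Comm_split(MPI_COMM_WORLD, super_node_id, world_rank, &intra_comm);
    int intra_rank;
    MPI_Comm_rank(intra_comm, &intra_rank);

    /* Communicator of the super-node leaders (intra_rank == 0). */
    MPI_Comm inter_comm;
    MPI_Comm_split(MPI_COMM_WORLD, intra_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &inter_comm);

    double local_value = (double)world_rank, partial = 0.0, global = 0.0;

    /* Step 1: reduce within the super-node (stand-in for the hardware
     * collective over the pre-created aggregate tree). */
    MPI_Allreduce(&local_value, &partial, 1, MPI_DOUBLE, MPI_SUM, intra_comm);

    /* Step 2: leaders combine partial results across super-nodes with
     * point-to-point-based collectives, keeping hardware-collective
     * traffic confined to each super-node. */
    if (inter_comm != MPI_COMM_NULL) {
        MPI_Allreduce(&partial, &global, 1, MPI_DOUBLE, MPI_SUM, inter_comm);
        MPI_Comm_free(&inter_comm);
    }

    /* Step 3: broadcast the final result back inside each super-node. */
    MPI_Bcast(&global, 1, MPI_DOUBLE, 0, intra_comm);

    if (world_rank == 0)
        printf("global sum = %f\n", global);

    MPI_Comm_free(&intra_comm);
    MPI_Finalize();
    return 0;
}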