Deep learning (DL) has been widely used in bearing fault diagnosis. In particular, convolutional neural networks (CNNs) improve diagnosis accuracy by extracting effective fault features. However, CNNs lack an explicit mechanism for learning how much each fault characteristic in the input signal contributes to the diagnosis result. This article presents a new end-to-end deep framework, the multi-head self-attention convolutional neural network (MSA-CNN), for bearing fault diagnosis. First, we adopt a data pre-processing method that directly converts one-dimensional (1D) original signals into two-dimensional (2D) grayscale images, which is simple to implement and preserves the complete information of the original signal. Second, a multi-head self-attention (MSA) module aggregates global information and adaptively assigns weights to the features of the input signal. Third, a CNN with small-scale kernels extracts detailed local features. Finally, the learned high-level representations are fed into a fully connected (FC) layer for fault diagnosis. The performance of the MSA-CNN is validated on different datasets. The results show that the proposed MSA-CNN significantly improves fault diagnosis accuracy compared with other state-of-the-art methods and has excellent noise immunity.
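
The 1D-to-2D pre-processing step can be illustrated with a minimal sketch. The abstract does not specify the image size, the scaling rule, or the pixel ordering, so the choices below are assumptions made for illustration only: min-max scaling to 8-bit gray levels, row-by-row filling of a square image, and the hypothetical function name `signal_to_grayscale`.

```python
import math


def signal_to_grayscale(signal, size):
    """Map the first size*size samples of a 1D signal to a size x size image.

    Assumptions (not taken from the source): min-max scaling to the range
    0..255 and row-major filling of the image.
    """
    n = size * size
    if len(signal) < n:
        raise ValueError("signal is shorter than size * size samples")

    segment = signal[:n]
    lo, hi = min(segment), max(segment)
    span = hi - lo

    if span == 0:
        # A constant segment carries no amplitude variation; map it to black.
        pixels = [0] * n
    else:
        # Scale each sample linearly so that lo -> 0 and hi -> 255.
        pixels = [round((x - lo) / span * 255) for x in segment]

    # Cut the flat pixel list into rows of length `size`.
    return [pixels[r * size:(r + 1) * size] for r in range(size)]


# Usage: a synthetic 64-sample sine wave becomes an 8 x 8 grayscale image.
signal = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
image = signal_to_grayscale(signal, 8)
```

Each row of `image` holds consecutive samples, so the temporal order of the signal is kept in the 2D layout. Note that rounding to 8-bit gray levels is a property of this sketch and may differ from the conversion used in the proposed method.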