Why doesn't clang generate an fmla instruction for the vmlaq_f32 intrinsic on armv8-a?

I use clang 9 to build code that contains many arm64 intrinsics. I use vmlaq_f32 to perform multiply-accumulate operations on the float32x4_t data type. I expected an fmla instruction to be generated, but clang generates an fmul and an fadd instruction instead. For a simple function this is not an issue, but for a function that uses a lot of NEON registers, clang 9 generates inefficient code that frequently spills NEON registers to the stack and reloads them (presumably because each separate fmul needs an extra register to hold its product). If clang generated fmla instructions, the 32 NEON registers would be more than enough.
BTW: I have tested a function that uses vmlaq_f32 heavily. If I build it for armv7-a, clang generates very efficient code (it emits vmla instructions in this case), but if I build it for armv8-a, the generated code looks very inefficient, with many stores to and loads from the stack.
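A minimal sketch of the pattern described above, reduced to a single multiply-accumulate per iteration. The function names mla_kernel and fma_kernel are hypothetical and are not taken from my real code. For comparison, the second function uses vfmaq_f32, which as far as I understand is the ACLE intrinsic defined as a fused multiply-add (and so the one that corresponds to fmla on AArch64), whereas vmlaq_f32 is specified as a separate multiply followed by an add. The scalar fallback is only there so the file also builds on non-ARM hosts.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical kernel: acc[i] += a[i] * b[i] using vmlaq_f32.
 * Assumes n is a multiple of 4. */
void mla_kernel(float *acc, const float *a, const float *b, size_t n)
{
#if HAVE_NEON
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t vacc = vld1q_f32(acc + i);
        float32x4_t va   = vld1q_f32(a + i);
        float32x4_t vb   = vld1q_f32(b + i);
        /* Multiply-accumulate: vacc + va * vb (not required to be fused). */
        vacc = vmlaq_f32(vacc, va, vb);
        vst1q_f32(acc + i, vacc);
    }
#else
    /* Scalar fallback for non-NEON hosts. */
    for (size_t i = 0; i < n; ++i)
        acc[i] = acc[i] + a[i] * b[i];
#endif
}

/* Hypothetical kernel: same computation using vfmaq_f32 (fused variant).
 * Assumes n is a multiple of 4. */
void fma_kernel(float *acc, const float *a, const float *b, size_t n)
{
#if HAVE_NEON && (defined(__aarch64__) || defined(__ARM_FEATURE_FMA))
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t vacc = vld1q_f32(acc + i);
        float32x4_t va   = vld1q_f32(a + i);
        float32x4_t vb   = vld1q_f32(b + i);
        /* Fused multiply-add: vacc + va * vb with a single rounding. */
        vacc = vfmaq_f32(vacc, va, vb);
        vst1q_f32(acc + i, vacc);
    }
#else
    /* Scalar fallback; this plain expression is not guaranteed to be fused. */
    for (size_t i = 0; i < n; ++i)
        acc[i] = acc[i] + a[i] * b[i];
#endif
}
```

Compiling this with something like `clang --target=aarch64-linux-gnu -O2 -S` and comparing the assembly of the two functions should show whether the fmul/fadd pair appears only in the vmlaq_f32 version. Note that the two variants can differ in the last bit of the result, because the fused form rounds once instead of twice.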
Is there a way to force clang 9 to generate fmla instructions for vmlaq_f32? Thanks.