This paper presents systematic literature review that enquires into the maturity of FLTs evaluation in terms of baseline comparison, homogeneity of empirical designs and finally, the reproducibility of FLTs and their evaluation. It identifies different issues that substantially affect the ability of researchers and practitioners when trying to identify the best-of-breed FLT or, in the case of researchers, when trying to replicate existing FLT-evaluation studies. The results show that a 95% of the existing research in this field present novel FLTs, and that only just over half of the examined FLTs have been evaluated through formal empirical methods. In addition, only 8% of the studies compared the FLT to openly available, baseline techniques. Another characteristic of the reviewed literature is the 255 different subject systems, 60 metrics, 210 benchmarks and plethora of user input formats in FLT evaluations, which also negatively affects the comparability of FLTs. Finally, there is a lack of reproducible FLTs in the field, disallowing researchers from re-creating the FLT for comparison studies. Cumulatively, these conditions make it difficult to find answers to questions like which are the best FLTs?". Paper concludes by providing guidelines for empirical evaluation of FLTs that may help towards empirical standardization.