In this paper, we seek to establish a framework for the empirical comparison of pattern-classifier performance, allowing comparisons to be made consistently across different studies. A total of 106 datasets from the University of California, Irvine, Machine Learning Repository were used as benchmarks. The framework defines the experimental setup precisely so that it can be unambiguously reproduced or verified by others. Multiple runs of cross-validation and parameter tuning were employed to minimize the risk of random effects biasing the results. The metrics used to compare classifiers are based solely on simple readings obtained from classification tests, which keeps future comparisons readily adaptable to the inclusion of new metrics.
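The repeated cross-validation procedure mentioned above can be illustrated with a minimal sketch. This is not the paper's actual protocol: the estimator, dataset, fold count, and repeat count here are all illustrative assumptions, chosen only to show how averaging over several independent partitions reduces the influence of any single random split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in benchmark dataset (illustrative; not one of the 106 used in the paper).
X, y = load_iris(return_X_y=True)

# Several independent repetitions of 10-fold cross-validation; each repeat
# uses a different random partition, so the averaged score is less sensitive
# to any single split.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(len(scores))      # 50 accuracy readings: 10 folds x 5 repeats
print(scores.mean())    # averaged accuracy across all partitions
```

Each entry in `scores` is one of the "simple readings" a comparison framework can aggregate; adding a new metric amounts to collecting another such vector per classifier.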