Are Large Language Models Good Statisticians?

Yizhang Zhu1, Shiyin Du1, Boyan Li1, Yuyu Luo1,2, Nan Tang1,2
1The Hong Kong University of Science and Technology (Guangzhou), 2The Hong Kong University of Science and Technology
NeurIPS 2024 (D&B)🎉

Statistical Analysis Task and StatQA Construction

Introduction

Large Language Models (LLMs) have demonstrated impressive capabilities across a range of scientific tasks including mathematics, physics, and chemistry. Despite their successes, the effectiveness of LLMs in handling complex statistical tasks remains systematically under-explored. To bridge this gap, we introduce StatQA, a new benchmark designed for statistical analysis tasks. StatQA comprises 11,623 examples tailored to evaluate LLMs' proficiency in specialized statistical tasks and their applicability assessment capabilities, particularly for hypothesis testing methods.

We systematically experiment with representative LLMs using various prompting strategies and show that even state-of-the-art models such as GPT-4o achieve a best performance of only 64.83%, indicating significant room for improvement. Notably, while open-source LLMs (e.g., LLaMA-3) show limited capability, their fine-tuned counterparts exhibit marked improvements, outperforming all in-context learning-based methods (e.g., GPT-4o). Moreover, our comparative human experiments highlight a striking contrast in error types between LLMs and humans: LLMs primarily make applicability errors, whereas humans mostly make statistical task confusion errors. This divergence highlights distinct areas of proficiency and deficiency, suggesting that combining LLM and human expertise could yield complementary strengths and inviting further investigation into their collaborative potential.

Our contributions are summarized as follows:

- StatQA. We propose StatQA, a new benchmark for statistical analysis tasks, particularly focusing on the applicability assessment of statistical methods. We introduce an automated pipeline to construct StatQA by synthesizing statistical tasks and their corresponding answers, which also provides insights for dataset construction in other specialized domains with scarce examples.
- Systematic Evaluation. We conduct extensive evaluations on widely used LLMs to establish benchmarks for statistical tasks. We also explore several strategies, including domain-specific prompts and fine-tuning, to better harness the capabilities of LLMs for these tasks.
- Comparative Study between Humans and LLMs. We organize group-based human experiments and comparatively analyze the differences between humans and LLMs in performance and errors. Our findings highlight humans' and LLMs' distinct strengths and weaknesses and reveal their potential complementarity.
- New Empirical Findings and Research Opportunities. Based on the experiments and analysis above, we summarize six key findings and discuss research opportunities in this field.

StatQA Construction

In conventional dataset construction, researchers collect a suitable dataset D, formulate a question Q, and manually annotate answers A. While this approach ensures high data quality, it is time-consuming, costly, and limits extensibility, especially in specialized domains with scarce examples. To alleviate these limitations, our key idea is to reverse this process: we start with target answers A derived from the tabular data D and synthesize the corresponding statistical questions Q. This reversal ensures precise alignment between questions and answers and enables more efficient dataset construction.

We implement this idea as an efficient pipeline for constructing StatQA, as shown in the figure below. Unlike traditional methods, we first set target answers A based on the tabular data D and then synthesize statistical questions Q in reverse, incorporating automated prerequisite checks to ensure alignment between Q and A. To support the evaluation of statistical literacy, the target answers A include the relevant columns C and the applicable statistical methods M, from which the computational results R can be derived. The pipeline can therefore synthesize numerous examples of (D, C, M, Q, R) along with other supplementary information.
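To make the reverse construction concrete, here is a minimal Python sketch of the idea under simplifying assumptions: it covers only two illustrative correlation methods, and the names `QUESTION_TEMPLATES`, `prerequisites_hold`, and `synthesize_example` are our own illustrative placeholders, not the paper's actual pipeline code. The answer (columns C, method M) is fixed first, a prerequisite check keeps Q and A aligned, the result R is computed, and the question Q is filled in from a template.

```python
import pandas as pd
from scipy import stats

# Hypothetical question templates keyed by statistical method; the actual
# templates used to build StatQA may differ.
QUESTION_TEMPLATES = {
    "Pearson Correlation": "Is there a linear correlation between {c1} and {c2}?",
    "Spearman Correlation": "Is there a monotonic association between {c1} and {c2}?",
}

def prerequisites_hold(method: str, x: pd.Series, y: pd.Series) -> bool:
    """Illustrative prerequisite check: Pearson assumes roughly normal data."""
    if method == "Pearson Correlation":
        # Shapiro-Wilk normality test on both columns as a simple proxy.
        return stats.shapiro(x).pvalue > 0.05 and stats.shapiro(y).pvalue > 0.05
    return True  # Spearman is rank-based; no normality requirement here.

def synthesize_example(df: pd.DataFrame, c1: str, c2: str, method: str):
    """Reverse synthesis: fix the answer (columns C, method M), derive the
    result R, then generate the question Q from a template."""
    sub = df[[c1, c2]].dropna()
    x, y = sub[c1], sub[c2]
    if not prerequisites_hold(method, x, y):
        return None  # method not applicable -> no example generated
    if method == "Pearson Correlation":
        r, p = stats.pearsonr(x, y)
    else:
        r, p = stats.spearmanr(x, y)
    question = QUESTION_TEMPLATES[method].format(c1=c1, c2=c2)
    return {"D": df, "C": [c1, c2], "M": method,
            "Q": question, "R": {"statistic": r, "p_value": p}}
```

In this sketch, an example is emitted only when the prerequisite check passes, which is what guarantees that every synthesized question has a verifiably applicable method behind it.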

An Example in StatQA
The Proportion of Statistical Tasks in StatQA

Experiments

Experimental Protocols. We design the LLM experiments to mirror a human statistician's workflow in order to evaluate the abilities of LLMs on statistical tasks. Given a question, an LLM must pick the headers of the relevant data columns, assess each method's applicability, select all statistical methods that fit the usage scenario and prerequisites, as a statistician would, and then respond in a specified format. The human experiments follow the same protocol for consistency, and we develop a testing platform to facilitate participant selection.

Metrics. We use the accuracy of relevant-column and applicable-method selection, denoted Acc(C,M), as our metric to evaluate whether LLMs or participants truly understand the question and the applicability of statistical methods. Acc(C,M) is the proportion of examples whose column and method selections fully align with the ground truth, with no omissions and no incorrect selections.

Results. We evaluate the performance of LLMs and compare it with human performance. The figures below show the best experimental results for each model, as well as a stacked histogram for error analysis. For full results and detailed analysis, please refer to our paper.
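As a reading of this metric, the sketch below shows how Acc(C,M) could be computed as an exact-match score; the dictionary keys (`columns`, `methods`) and function names are illustrative assumptions, not the benchmark's actual evaluation code.

```python
from typing import Iterable

def exact_match(predicted: Iterable[str], gold: Iterable[str]) -> bool:
    """True only if the prediction matches the ground truth exactly,
    with no omissions and no extra selections (order-insensitive)."""
    return set(predicted) == set(gold)

def acc_cm(predictions: list[dict], references: list[dict]) -> float:
    """Acc(C,M): fraction of examples whose selected columns AND methods
    both exactly match the ground truth."""
    hits = sum(
        exact_match(p["columns"], r["columns"]) and exact_match(p["methods"], r["methods"])
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

# Toy usage (hypothetical example, not drawn from StatQA):
preds = [{"columns": ["Age", "Income"], "methods": ["Pearson Correlation"]}]
refs  = [{"columns": ["Age", "Income"], "methods": ["Pearson Correlation", "Spearman Correlation"]}]
print(acc_cm(preds, refs))  # 0.0 -- one applicable method was omitted
```

Note that a single omitted method or extra column makes the whole example count as incorrect, which is what makes Acc(C,M) a strict test of applicability assessment.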

Best Results of Each Model in Sub-tasks.
Distribution of Error Categories Across Experiments

Findings and Take-aways

We summarize six key findings through our systematic experiments and analysis:

- Few-shot learning and the inclusion of domain knowledge help LLMs on this task, whereas CoT is more likely to cause slight performance degradation in smaller models.
- LLMs with prompt-based approaches still lag behind humans in statistics. However, the gap can be closed, and even reversed, by fine-tuning or by introducing domain knowledge to a strong LLM.
- Humans and most LLMs are adept at descriptive statistics tasks but struggle with contingency table and variance tests. Domain knowledge significantly boosts larger proprietary LLMs' performance, notably GPT-4o, but has limited impact on smaller open-source models.
- LLaMA-3 and GPT models demonstrate a competent understanding of the tasks, and the latter can accurately select data columns, but LLaMA-2 models have difficulties in both aspects.
- LLMs are good at distinguishing different statistical tasks and selecting the associated methods, but struggle to use domain knowledge to assess method applicability effectively. Conversely, humans excel at discerning method applicability but are prone to task confusion.
- Humans and LLMs have distinct proficiencies and weaknesses in different aspects of selecting applicable statistical methods, highlighting the potential for complementary collaboration.

BibTeX

If you find our work useful or inspiring, please kindly cite:

@misc{zhu2024large,
      title={Are Large Language Models Good Statisticians?}, 
      author={Yizhang Zhu and Shiyin Du and Boyan Li and Yuyu Luo and Nan Tang},
      year={2024},
      eprint={2406.07815},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}