Retrieval Augmentation Reduces Factual Errors in Knowledge-Intensive Language Model Tasks

Authors

  • Dai Teng
  • Changhao Zhang
  • Jitong Zou

DOI:

https://doi.org/10.54097/8jvwpk07

Keywords:

Retrieval-augmented generation, Large language models, Hallucination reduction, Knowledge-intensive NLP, Dense passage retrieval, Factual accuracy, Open-domain question answering

Abstract

Large language models (LLMs) have demonstrated exceptional capabilities across natural language processing (NLP) tasks; however, they remain persistently susceptible to generating factually incorrect content—a phenomenon broadly termed hallucination. Retrieval-augmented generation (RAG) has emerged as a principled paradigm for mitigating this limitation by grounding model outputs in dynamically retrieved external evidence, thereby substantially reducing factual errors in knowledge-intensive settings. This paper presents a comprehensive review of RAG research, tracing developments from early retrieval-enhanced pretraining frameworks to adaptive and self-reflective architectures. We examine how retrieval strategies, including dense passage retrieval (DPR), sparse retrieval, and hybrid methods, interact with generative components to suppress hallucination. We analyze the Knowledge-Intensive Language Tasks (KILT) benchmark and open-domain question answering (QA) datasets as primary evaluation vehicles, synthesizing empirical evidence that RAG consistently lowers factual error rates relative to purely parametric LLMs. We further discuss open challenges, including retrieval quality, knowledge conflict resolution, multi-hop reasoning, and domain adaptation, and outline future directions essential for realizing the full potential of RAG in high-stakes natural language generation (NLG) applications.
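The retrieve-then-generate loop the abstract describes can be sketched in a few lines. The sketch below uses a toy bag-of-words cosine retriever as a stand-in for sparse methods such as BM25 (a dense retriever like DPR would swap in learned embeddings), and builds an evidence-grounded prompt rather than calling an actual LLM; the function names and toy corpus are illustrative, not from the paper.

```python
import math
from collections import Counter

def sparse_vector(text):
    # Bag-of-words term-frequency vector: a toy stand-in for BM25/TF-IDF.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    # Rank passages by similarity to the query; return the top-k as evidence.
    qv = sparse_vector(query)
    ranked = sorted(corpus, key=lambda p: cosine(qv, sparse_vector(p)), reverse=True)
    return ranked[:k]

def grounded_prompt(query, passages):
    # Ground generation by prepending retrieved evidence to the question.
    evidence = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the evidence below.\n{evidence}\nQuestion: {query}"

corpus = [
    "KILT is a benchmark for knowledge-intensive language tasks built on Wikipedia.",
    "Dense passage retrieval encodes queries and passages with dual encoders.",
    "The Eiffel Tower is located in Paris.",
]

top = retrieve("What is KILT a benchmark for?", corpus)
print(grounded_prompt("What is KILT a benchmark for?", top))
```

In a full RAG system the prompt produced here would be passed to a generator whose output is thereby conditioned on retrieved evidence rather than on parametric memory alone; hybrid retrieval would combine this sparse score with a dense-embedding score before ranking.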


References

[1] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., ... & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.

[2] Huang, L., Yu, W., Ma, W., Zhong, W., Feng, Z., Wang, H., ... & Liu, T. (2025). A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2), 1-55.

[3] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., ... & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in neural information processing systems, 33, 9459-9474.

[4] Petroni, F., Piktus, A., Fan, A., Lewis, P., Yazdani, M., De Cao, N., ... & Riedel, S. (2021, June). KILT: a benchmark for knowledge intensive language tasks. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 2523-2544).

[5] Kandpal, N., Deng, H., Roberts, A., Wallace, E., & Raffel, C. (2023, July). Large language models struggle to learn long-tail knowledge. In International conference on machine learning (pp. 15696-15707). PMLR.

[6] Long, M., Sun, D., Yang, D., Wang, J., Luo, Y., Shen, Y., ... & Gu, J. (2025). Diver: A multi-stage approach for reasoning-intensive information retrieval. arXiv preprint arXiv:2508.07995.

[7] Maynez, J., Narayan, S., Bohnet, B., & McDonald, R. (2020, July). On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 1906-1919).

[8] Shuster, K., Poff, S., Chen, M., Kiela, D., & Weston, J. (2021, November). Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021 (pp. 3784-3803).

[9] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., ... & Fung, P. (2023). Survey of hallucination in natural language generation. ACM computing surveys, 55(12), 1-38.

[10] Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2022). Efficient transformers: A survey. ACM Computing Surveys, 55(6), 1-28.

[11] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) (pp. 4171-4186).

[12] Sachan, D. S., Lewis, M., Yogatama, D., Zettlemoyer, L., Pineau, J., & Zaheer, M. (2023). Questions are all you need to train a dense passage retriever. Transactions of the Association for Computational Linguistics, 11, 600-616.

[13] Izacard, G., & Grave, E. (2021, April). Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th conference of the european chapter of the association for computational linguistics: main volume (pp. 874-880).

[14] Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., & Hajishirzi, H. (2023, July). When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: Long papers) (pp. 9802-9822).

[15] Shi, F., Chen, X., Misra, K., Scales, N., Dohan, D., Chi, E. H., ... & Zhou, D. (2023, July). Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning (pp. 31210-31227). PMLR.

[16] Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., ... & Yih, W. T. (2020, November). Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) (pp. 6769-6781).

[17] Guu, K., Lee, K., Tung, Z., Pasupat, P., & Chang, M. (2020, July). Retrieval augmented language model pre-training. In International conference on machine learning (pp. 3929-3938). PMLR.

[18] Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., ... & Sifre, L. (2022, June). Improving language models by retrieving from trillions of tokens. In International conference on machine learning (pp. 2206-2240). PMLR.

[19] Jiang, Z., Xu, F. F., Gao, L., Sun, Z., Liu, Q., Dwivedi-Yu, J., ... & Neubig, G. (2023, December). Active retrieval augmented generation. In Proceedings of the 2023 conference on empirical methods in natural language processing (pp. 7969-7992).

[20] Asai, A., Wu, Z., Wang, Y., Sil, A., & Hajishirzi, H. (2023, October). Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations.

[21] Ahmad, M. (2025). Toward a Unified Framework for Information Retrieval in Large Language Model Applications: Balancing Textual and Graph-Based Knowledge Sources.

[22] Xie, J., Zhang, K., Chen, J., Lou, R., & Su, Y. (2023, May). Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. In The Twelfth International Conference on Learning Representations.

[23] Wang, K., Duan, F., Wang, S., Li, P., Xian, Y., Yin, C., ... & Xiong, Z. (2023). Knowledge-driven cot: Exploring faithful reasoning in llms for knowledge-intensive question answering. arXiv preprint arXiv:2308.13259.

[24] Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., ... & Natarajan, V. (2023). Large language models encode clinical knowledge. Nature, 620(7972), 172-180.

[25] Niklaus, J., Matoshi, V., Rani, P., Galassi, A., Stürmer, M., & Chalkidis, I. (2023, December). Lextreme: A multi-lingual and multi-task benchmark for the legal domain. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 3016-3054).

[26] Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., ... & Schulman, J. (2021). Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.

[27] Menick, J., Trebacz, M., Mikulik, V., Aslanides, J., Song, F., Chadwick, M., ... & McAleese, N. (2022). Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147.

[28] Augenstein, I., Baldwin, T., Cha, M., Chakraborty, T., Ciampaglia, G. L., Corney, D., ... & Zagni, G. (2024). Factuality challenges in the era of large language models and opportunities for fact-checking. Nature Machine Intelligence, 6(8), 852-863.

[29] Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W. T., Koh, P., ... & Hajishirzi, H. (2023, December). Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 12076-12100).

[30] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., ... & Wang, H. (2023). Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.

[31] Asai, A., Min, S., Zhong, Z., & Chen, D. (2023, July). Retrieval-based language models and applications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts) (pp. 41-46).

[32] Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663.

[33] Shang, W., Wang, Z., & Wang, B. (2025). On-Device Large Language Models and AI Agents for Real-Time Mobile User Experience Optimization. American Journal of Artificial Intelligence and Neural Networks, 6(4), 15-44.

[34] Huang, Z., Zeng, H., Zamani, H., & Allan, J. (2023, July). Soft prompt decoding for multilingual dense retrieval. In Proceedings of the 46th international ACM SIGIR conference on research and development in information retrieval (pp. 1208-1218).

[35] Mao, Y., He, P., Liu, X., Shen, Y., Gao, J., Han, J., & Chen, W. (2021, August). Generation-augmented retrieval for open-domain question answering. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 4089-4100).

[36] Wang, L., Yang, N., & Wei, F. (2023, December). Query2doc: Query expansion with large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 9414-9423).

[37] Gao, L., Ma, X., Lin, J., & Callan, J. (2023, July). Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1762-1777).

[38] Lin, J., Nogueira, R., & Yates, A. (2022). Pretrained transformers for text ranking: BERT and beyond. Springer Nature.

[39] Nogueira, R., Yang, W., Cho, K., & Lin, J. (2019). Multi-stage document ranking with BERT. arXiv preprint arXiv:1910.14424.

[40] Beheshti, A., Hashemi, V. M., & Yakhchi, S. (2019, December). Towards context-aware social behavioral analytics. In Proceedings of the 17th International Conference on Advances in Mobile Computing & Multimedia (pp. 28-35).

[41] Yin, P., Neubig, G., Yih, W. T., & Riedel, S. (2020, July). TaBERT: Pretraining for joint understanding of textual and tabular data. In Proceedings of the 58th annual meeting of the association for computational linguistics (pp. 8413-8426).

[42] Zhao, R., Li, X., Joty, S., Qin, C., & Bing, L. (2023, July). Verify-and-edit: A knowledge-enhanced chain-of-thought framework. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 5823-5840).

[43] Ho, X., Nguyen, A. K. D., Sugawara, S., & Aizawa, A. (2020, December). Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics (pp. 6609-6625).

[44] Trivedi, H., Balasubramanian, N., Khot, T., & Sabharwal, A. (2023, July). Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers) (pp. 10014-10037).

[45] Chen, J., Lin, H., Han, X., & Sun, L. (2024, March). Benchmarking large language models in retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 38, No. 16, pp. 17754-17762).

[46] Zhang, Y., Li, Y., Cui, L., Cai, D., Liu, L., Fu, T., ... & Shi, S. (2025).

[47] Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., ... & Kaplan, J. (2022). Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.

[48] Siriwardhana, S., Weerasekera, R., Wen, E., Kaluarachchi, T., Rana, R., & Nanayakkara, S. (2023). Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering. Transactions of the Association for Computational Linguistics, 11, 1-17.

[49] Shi, W., Min, S., Lomeli, M., Zhou, C., Li, M., Szilvasy, G., ... & Lewis, M. (2023). In-context pretraining: Language modeling beyond document boundaries. arXiv preprint arXiv:2310.10638.

[50] Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., ... & Hendrycks, D. (2023). Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405.

[51] Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023, November). Not what you've signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM workshop on artificial intelligence and security (pp. 79-90).

[52] Li, Z., Guo, Q., Shao, J., Song, L., Bian, J., Zhang, J., & Wang, R. (2024). Graph neural network enhanced retrieval for question answering of llms. arXiv preprint arXiv:2406.06572.

[53] Pan, S., Luo, L., Wang, Y., Chen, C., Wang, J., & Wu, X. (2024). Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering, 36(7), 3580-3599.

[54] Panagopoulou, A., Xue, L., Yu, N., Li, J., Li, D., Joty, S., ... & Niebles, J. C. (2023). X-instructblip: A framework for aligning x-modal instruction-aware representations to llms and emergent cross-modal reasoning. arXiv preprint arXiv:2311.18799.

[55] Shi, W., Min, S., Yasunaga, M., Seo, M., James, R., Lewis, M., ... & Yih, W. T. (2024, June). Replug: Retrieval-augmented black-box language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers) (pp. 8371-8384).

[56] Ren, R., Wang, Y., Qu, Y., Zhao, W. X., Liu, J., Wu, H., ... & Wang, H. (2025, January). Investigating the factual knowledge boundary of large language models with retrieval augmentation. In Proceedings of the 31st International Conference on Computational Linguistics (pp. 3697-3715).

[57] Chen, L., Deng, Y., Bian, Y., Qin, Z., Wu, B., Chua, T. S., & Wong, K. F. (2023, December). Beyond factuality: A comprehensive evaluation of large language models as knowledge generators. In Proceedings of the 2023 conference on empirical methods in natural language processing (pp. 6325-6341).

[58] Ding, G., Yang, S., Lin, H., Chen, Z., & Yang, J. S. (2026). LLM-Driven Adaptive Cloud Resource Scheduling: Bridging Reasoning Intelligence with Optimization Guarantees. IEEE Open Journal of the Computer Society.

[59] Su, W., Tang, Y., Ai, Q., Wu, Z., & Liu, Y. (2024, August). Dragin: Dynamic retrieval augmented generation based on the real-time information needs of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 12991-13013).

[60] Yu, W., Zhang, Z., Liang, Z., Jiang, M., & Sabharwal, A. (2023). Improving language models via plug-and-play retrieval feedback. arXiv preprint arXiv:2305.14002.

[61] Lin, D. (2024). Revolutionizing retrieval-augmented generation with enhanced PDF structure recognition. arXiv preprint arXiv:2401.12599.

[62] Cuconasu, F., Trappolini, G., Siciliano, F., Filice, S., Campagnano, C., Maarek, Y., ... & Silvestri, F. (2024, July). The power of noise: Redefining retrieval for rag systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 719-729).

[63] Wang, X., Wang, Z., Gao, X., Zhang, F., Wu, Y., Xu, Z., ... & Huang, X. J. (2024, November). Searching for best practices in retrieval-augmented generation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (pp. 17716-17736).

[64] Zhang, N., Yao, Y., Tian, B., Wang, P., Deng, S., Wang, M., ... & Chen, H. (2024). A comprehensive study of knowledge editing for large language models. arXiv preprint arXiv:2401.01286.

[65] Huang, Y., & Huang, J. (2024). A survey on retrieval-augmented text generation for large language models. arXiv preprint arXiv:2404.10981.

[66] Shao, Z., Gong, Y., Shen, Y., Huang, M., Duan, N., & Chen, W. (2023, December). Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 9248-9274).

Published

08-04-2026

Issue

Section

Articles

How to Cite

Teng, D., Zhang, C., & Zou, J. (2026). Retrieval Augmentation Reduces Factual Errors in Knowledge-Intensive Language Model Tasks. Computer Life, 14(1), 42-49. https://doi.org/10.54097/8jvwpk07