A Comprehensive Study of StaQC for Deep Code Summarization
Jayavardhan Reddy Peddamail, Ziyu Yao, Zhen Wang, Huan Sun
Abstract
Learning the mapping between natural language (NL) and programming language, such as retrieving or generating code snippets from NL queries and annotating code snippets with NL, has been explored in many research works. At the core of these works are machine learning and deep learning models, which usually demand large datasets of NL-code pairs for training. This paper describes an experimental study of StaQC, a large-scale, high-quality dataset of NL-code pairs in the Python and SQL domains, systematically mined from the Stack Overflow forum (SO). We compare StaQC with two other popular datasets mined from SO on the code summarization task, showing that StaQC helps achieve substantially better results, improving the current state-of-the-art model by 8% to 9% in the BLEU metric.
Publication
In Deep Learning Day, KDD 2018