Abstract: |
Automated program repair (APR) has gained increasing attention over the years, from both academic and industrial perspectives. The overall goal of APR is to reduce the cost of development and maintenance by automatically finding and fixing common bugs, typos, or errors in code. A successful and widely researched approach is to use deep learning (DL) techniques to accomplish this task. DL methods are known to be very data-hungry, yet readily available data is hard to find online, which poses a challenge to the development of such solutions. In this paper, we address this issue by providing a new dataset consisting of 371,483 bug-fixing code examples, while also introducing a method that other researchers could use as a feature in their mining software. We extracted code from 5,273 different repositories and 250,090 different commits. Our work contributes to related research by providing a publicly accessible dataset on which DL models can be trained or fine-tuned, and a method that integrates easily with almost any code mining tool as a language-independent feature, giving more granular choices when extracting code parts from a specific bug-fix commit. The dataset also includes the summary and message of each commit, and the training data covers multiple programming languages, including C, C++, Java, JavaScript, and Python. |