If you have ever written a computer program in your life and you are not the Khal Drogo of coding (Khal Drogo never lost a fight), you have certainly made a bunch of coding mistakes. A missing semicolon, a forgotten null check before using a variable, or a loop without a boundary condition: we’ve all been there. For class projects or assignments, these mistakes are innocent; but in production-level software development, they can be a disaster. Your users will shower you with swear words in the app store when their app crashes because of your sloppy code. Or worse, a sunglasses-and-trench-coat-wearing hacker might find that flaw and steal some important data.
For this reason, nearly every professional software development team nowadays has something called “code review”. This means that when you are a coder on such a team and you submit a commit, before it is merged into the code base, a second (usually senior) team member reviews the code and comments on whether there are defects in it and how to fix them. These comments then go back to you, the coder, and your job is to fix the code, submit it, and start the same process again. Now, dear coder, would you like an AI to read those comments for you and perform those fixes automatically? Of course you would. “Lazy” is the unofficial job title of every programmer. That’s what we tried to do in our work, Review4Repair. So buckle up and keep reading to find out how we did it, and I will sneak in a few things about Neural Machine Translation (NMT) as well.
Ever wondered how Google Translate turns your “Bonjour étranger” into “Hello stranger”? It uses something called a sequence to sequence neural network. It has two parts: an encoder and a decoder. The job of the encoder is to “encode” (duh!) a sentence in the source language (in this example, French), and the decoder’s job is to “decode” that encoded version into the target language (here, English). If we train such an Encoder-Decoder model on lots and lots of French-English parallel sentences, the encoder will learn how to represent French in a way that the decoder can best understand, and the decoder will learn how to decode correctly from the encoded version. Once training is over, the model can translate a French sentence it has never seen before! I am not going into too much detail here; this is one of the visionary papers that proposed this idea, highly recommended for the interested.
The fun part is, the model has no idea whether it is translating from French to English or Telugu to Marathi. It just translates one sequence to another. Whatever parallel sequences it is trained on, the model has the capacity to learn that translation. Hence the name: sequence to sequence (also affectionately called seq2seq). The seq2seq model has been rocking the translation world for a long time. Recently, some researchers (Chen et al. 2019, Tufano et al. 2019) started thinking: source code is a sequence as well, so is it possible to train a seq2seq model to take buggy code as input and produce fixed code as output? It turns out, yes. These two papers collected large datasets of <buggy code, fixed code> parallel data and showed that a seq2seq model can, in fact, learn to fix buggy code. Chen et al. 2019 proposed a model named SequenceR, a pointer generator network (a fancier seq2seq model) trained on buggy and fixed code. Their results were very promising.
Sounds too easy, right? Not so fast. Source code and natural language can both be treated as sequences, but they are not the same. In human languages, the number of words in common use is limited. For example, the most frequent 10,000 words or so are usually sufficient to cover most English text, and any word outside that list can be treated as less important. This is good for neural networks, because a neural network requires you to give it a fixed vocabulary. Why? Well, a neural network is basically a bunch of matrices, and training it means tweaking those matrices. Before defining a neural net, we must decide the size of each matrix. The following figure shows a very rough visualization of how words are represented as vectors in any neural model, not only seq2seq. Because the size of the middle matrix has to be defined up front, the number of words in our vocabulary (the left vector) must be fixed from the beginning.
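To make the fixed-size matrix concrete, here is a minimal Python sketch of the word-to-vector lookup described above. The six-token vocabulary, the `<unk>` placeholder, the embedding size of 4, and the random numbers are all made up for illustration; they are not taken from Review4Repair or any real model.

```python
import random

# Hypothetical toy vocabulary; a real model would have thousands of entries.
VOCAB = ["<unk>", "int", "x", "=", "0", ";"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}
EMBEDDING_DIM = 4  # assumed size, chosen only to keep the example small

random.seed(0)
# One row per vocabulary entry: this shape is fixed before training starts.
EMBEDDING_MATRIX = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in VOCAB
]


def one_hot(token):
    """Return a vector with a single 1 at the token's vocabulary index."""
    index = TOKEN_TO_ID.get(token, TOKEN_TO_ID["<unk>"])
    return [1.0 if i == index else 0.0 for i in range(len(VOCAB))]


def embed(token):
    """Multiply the one-hot vector by the matrix, which selects one row."""
    vector = one_hot(token)
    return [
        sum(vector[row] * EMBEDDING_MATRIX[row][col] for row in range(len(VOCAB)))
        for col in range(EMBEDDING_DIM)
    ]


known = embed("int")
unknown = embed("numberOfPacketSentPerSecond")  # falls back to the <unk> row
```

Any token outside the list collapses onto the same `<unk>` row, so the model cannot tell two unseen identifiers apart. That is exactly the problem source code runs into.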
But this brings us to the next challenge in designing a neural model for source code: the vocabulary of source code is effectively infinite, because variable, function, class, and other identifier names can be arbitrary, and our model has to understand all of them equally. That means our model should understand “int x = 0;” just as well as “int numberOfPacketSentPerSecond = 0;”. So how do we make a neural network, which must have a fixed-size vocabulary, handle an infinite number of words? Tufano et al. and SequenceR proposed two different ways of solving this problem.
Tufano et al. 2019 solved the infinite variable problem by, well, how should I put it, not solving it. They used a clever trick: replacing all variable names with generic identifier tokens. For example, the first variable in the code becomes VAR0, the second becomes VAR1, and so on. Thus, whether the variable name is x or numberOfPacketSentPerSecond, the model only sees it as VAR0, VAR1, etc. And since there is usually a finite number of variables in a piece of code, the model never has to deal with the messy world of identifier naming.
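The renaming trick can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Tufano et al.'s actual pipeline: the regex tokenizer, the tiny keyword list, and the function name `abstract_identifiers` are all assumptions made for the example, and a real tool would use a proper Java lexer.

```python
import re

# Hypothetical, tiny keyword list; a real tool would use a full Java lexer.
JAVA_KEYWORDS = {"int", "return", "if", "else", "for", "while", "null", "new"}
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")
# Identifiers, integer literals, or any other single non-space character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def abstract_identifiers(code):
    """Replace each distinct identifier with VAR0, VAR1, ... in order of first use."""
    mapping = {}
    abstracted = []
    for token in TOKEN_PATTERN.findall(code):
        if IDENTIFIER.fullmatch(token) and token not in JAVA_KEYWORDS:
            if token not in mapping:
                mapping[token] = f"VAR{len(mapping)}"
            token = mapping[token]
        abstracted.append(token)
    return " ".join(abstracted), mapping


short_code, short_map = abstract_identifiers("int x = 0;")
long_code, long_map = abstract_identifiers("int numberOfPacketSentPerSecond = 0;")
# Both statements become the same token sequence: "int VAR0 = 0 ;"
```

Keeping the mapping around matters: after the model produces a fix in terms of VAR0 and VAR1, the original names have to be substituted back in.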
SequenceR used another clever idea. The unknown-word problem exists in natural language as well. For instance, names of places and people in natural language data are also kind of arbitrary (a model trained on “New York is a beautiful city” also has to work well for “Rajshahi is a beautiful city”). So this problem is already somewhat solved in Natural Language Processing with a technique called Copy Mechanism. When the model sees an unknown token in the input, it learns to copy that token directly to the output, even though it is not in the vocabulary. More on this in this paper.
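One common way to realize a copy mechanism is the pointer generator formulation, where the output distribution mixes "generate a word from the vocabulary" with "copy a token from the input" using a gate usually called p_gen. The sketch below only shows that final mixing step. In a real network, p_gen, the vocabulary probabilities, and the attention weights are all computed by the model; here they are hand-picked, hypothetical numbers.

```python
def final_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Mix generating from the vocabulary with copying from the input.

    p_gen: probability of generating from the vocabulary (1 - p_gen is for copying).
    vocab_probs: dict mapping each vocabulary token to its generation probability.
    attention: one weight per source token, summing to 1.
    """
    mixed = {token: p_gen * prob for token, prob in vocab_probs.items()}
    for token, weight in zip(source_tokens, attention):
        # Copy probability is added even for tokens outside the vocabulary.
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * weight
    return mixed


# Hypothetical values for a single decoding step.
source_tokens = ["int", "numberOfPacketSentPerSecond", "=", "0", ";"]
vocab_probs = {"<unk>": 0.4, "int": 0.3, "=": 0.1, "0": 0.1, ";": 0.1}
attention = [0.05, 0.8, 0.05, 0.05, 0.05]

distribution = final_distribution(0.2, vocab_probs, attention, source_tokens)
best_token = max(distribution, key=distribution.get)
```

With these numbers, the long identifier wins even though it has no vocabulary entry, because almost all of its probability comes from the copy side of the mixture.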
Alright, enough about other people’s work! In Review4Repair, we took SequenceR’s idea of using Copy Mechanism to solve the unknown vocabulary problem, used a pointer generator network (i.e. the fancier seq2seq model), and trained it on data with the buggy code and a code review comment as input and the code fix as output (<comment + defective code, fixed code>). If you are an expert in seq2seq models, you might be saying in your head, “But Masum, comment and code are totally separate modalities; one is natural language, the other is a programming language. If you add them together, how would your model know which is which?” Don’t worry, expert person, we took care of that by adding 4 special tokens marking the beginning and end of both the comment and the code. So, here is what the input looks like in practice:
By watching lots and lots of data, the model internally figures out that the content between the two special tokens <|startcomment|> and <|endcomment|> is a comment, and the content between <|startcode|> and <|endcode|> is code. If you are wondering what the <|startfocus|> and <|endfocus|> tokens mean, they indicate the portion of the code that needs modifying. Code review platforms allow reviewers to select the portion of code that needs to be changed. In our input, we mark it as the focus, and the model generates an alternative only for this part. By watching these special tokens, the model also knows what to fix.
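As a rough illustration of how such an input could be assembled, here is a small Python sketch. The special token strings are the ones named above, but the ordering (comment first, then code), the whitespace joining, the helper name `build_model_input`, and the sample comment and code are all assumptions made for this example, not the exact Review4Repair preprocessing.

```python
def build_model_input(comment, code_before, focus, code_after):
    """Wrap a review comment and the surrounding code in special marker tokens."""
    parts = [
        "<|startcomment|>", comment, "<|endcomment|>",
        "<|startcode|>", code_before,
        "<|startfocus|>", focus, "<|endfocus|>",
        code_after, "<|endcode|>",
    ]
    # Skip empty pieces so the focus may sit at the very start or end of the code.
    return " ".join(part for part in parts if part)


# Hypothetical review comment and code, invented for illustration.
model_input = build_model_input(
    comment="Check for null before calling length()",
    code_before="public int size(String name) {",
    focus="return name.length();",
    code_after="}",
)
expected_output = "return name == null ? 0 : name.length();"
```

Only the focus region has a counterpart on the output side: the target sequence is the replacement for that span, not the whole method.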
Now here is the fun part. Inside the ML model, we never told it what these special tokens mean. We just put the special tokens in the data and hope that our model figures them out by itself. And beautifully, it does. With training, the model learns that it only has to predict an alternative for the portion inside the <|focus|> tokens. If it doesn’t, the results are bad. The model is never explicitly taught to treat these tokens differently from the others, but during training it sees over and over that modifying the content inside the <|focus|> tokens gives better scores. And the model only cares about getting better scores. Similarly, the model also sees that using the instructions inside the <|comment|> tags helps the score. So it learns to understand the content of the comments and follow their instructions. I think this is the most important takeaway for Deep Learning research from this article:
“You cannot tell a deep learning model what to do; you can only make way for it to do the right thing and reward it for doing the right thing.”
So finally, we give the model some buggy code and a comment on how to fix it, and the model returns a fix for the bug. How well does it work, you ask? Well, not perfectly, but better than the others! This is our comparison with SequenceR.
model_cc is our model with the review comment, and model_c is our model without the review comment’s help. We can see that our model fixes code much better than SequenceR, and that the model with the review comment clearly stays ahead.
We see the same thing comparing with Tufano et al. 2019 as well.
This is how our two models, with and without the code review comment, compare with each other.
We can see that the model with the comment (model_cc) performs about 9% better at generating the correct fix than the model without the review comment (model_c) in top-10 accuracy, which indicates that the review comment is in fact helping the code fix. As the only difference between these two models is the review comment, we can say that the model is carrying out the instructions in the review comments to achieve better results.
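For readers unfamiliar with the metric, top-k accuracy counts a prediction as a success if the correct fix appears anywhere among the model's k best candidates. The Python sketch below assumes exact string match as the success criterion and uses made-up candidate lists; neither is taken from the Review4Repair evaluation code.

```python
def top_k_accuracy(predictions, references, k):
    """Fraction of examples whose reference fix is among the first k candidates."""
    hits = sum(
        1
        for candidates, reference in zip(predictions, references)
        if reference in candidates[:k]
    )
    return hits / len(references)


# Hypothetical ranked candidates for two buggy snippets (best guess first).
predictions = [
    ["return a + b;", "return a - b;", "return b;"],
    ["i <= n", "i < n", "i != n"],
]
references = ["return a + b;", "i < n"]

top_1 = top_k_accuracy(predictions, references, k=1)  # only the first example hits
top_2 = top_k_accuracy(predictions, references, k=2)  # both examples hit
```

A larger k is more forgiving, which is why top-10 numbers are higher than top-1 numbers for the same model.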
So, where do we go from here? The day is not too far off when we will write and edit most of our code using just natural language. AI will revolutionize coding as we know it; it will change people’s lives and create a tremendous amount of industrial value. But getting there will require a number of breakthroughs and a lot of effort from brilliant researchers. Our work Review4Repair is a small step towards that goal. Our dataset, source code, and trained models are all open-sourced (link), so anyone who wants to take the second step is more than welcome to try.
This work is a combined effort of me (Masum Hasan), Faria Huq Oaishi, Mahim Anzum Pantho, and Sazan Mahbub, supervised by Prof. Anindya Iqbal and Toufique Ahmed. The computing power required for this work was provided by Intelligent Machines, Dhaka.
SequenceR (2019): Z. Chen, S. Kommrusch, M. Tufano, L.-N. Pouchet, D. Poshyvanyk, M. Monperrus, SequenceR: Sequence-to-sequence learning for end-to-end program repair, IEEE Transactions on Software Engineering (2019).
Tufano et al. (2019): M. Tufano, J. Pantiuchina, C. Watson, G. Bavota, D. Poshyvanyk, On learning meaningful code changes via neural machine translation, in: Proceedings of the 41st International Conference on Software Engineering, IEEE Press, 2019, pp. 25–36.