Last week, I was at a (company internal) workshop on Question Answering (Q+A), organized by our Search Guild, of which I am a member. The word "guild" sounds vaguely medieval, but it's basically a group of employees who share a common interest in Search technologies. As so often happens in large companies, groups tend to be somewhat siloed, and one group might not know much about what another is doing, so the objective of the Search Guild is to bring groups together and promote knowledge sharing. To that end, the Search Guild organizes monthly presentations (with internal speakers as well as industry experts from outside the company) delivered via Webex (we are a distributed company with offices on at least 5 continents). It also provides forums for members to share information via blog posts, mailing lists, etc. As part of this effort, and given the importance of Q+A to Search, this year we organized our very first workshop on Q+A, held in Philadelphia on October 5 and 6.
What was unique about this workshop for me was that I was an organizer, a speaker, and an attendee. As a speaker, there is obviously significant additional work involved in building your presentation and delivering it. As an organizer, however, you truly get an appreciation of how much work goes into making an event successful. Many thanks to my fellow organizers for all the work they did, and apologies to the participants (if any of them are reading this) for any mistakes we made. We made quite a few; next time we should definitely use more checklists. Also, remote two-way participation is very hard.
The talks at the Workshop were organized into 4 primary themes. The first group of 3 talks (one of which was mine) dealt with approaches designed against external benchmarks, and was a bit more "researchy" than the others. The second group of 3 talks dealt with Question Complexity and how people are tackling it in their various projects. The third group of 4 talks looked at strategies used by engines that were already in production or QA, and the fourth group had 3 talks around different approaches to introducing Q+A into our Clinical search engine. In addition, there were several short talks and demos, mostly around Clinical. The most interest and activity in Q+A is around our Legal and Clinical search engines, followed by search engine products built around Life Sciences, Material Science and Chemistry. Attendance-wise, we had around 25 in-person participants and 15 remote. 3 of the 13 talks were delivered remotely from our Amsterdam and Frankfurt offices.
My own experience with Question Answering is fairly minimal, mainly attempts to build functionality over search without trying too hard to understand the question implicit in the query. So it was definitely a great learning experience for me, to hear from people who had thought about their respective domains at length and come up with some pretty innovative solutions. As expected, some of the approaches described were similar to what I had used before, but they were used as part of a broader array of techniques, so there was something to learn for me there as well.
In this post, I will briefly describe my presentation and point you to the slides and code. My talk was about a hobby project that my co-presenter Abhishek Sharma and I started a couple of months ago, hoping to deepen our understanding of how Deep Learning could be applied to Question Answering. We are both part of the Deep Learning Enthusiasts Meetup (he is the organizer), and he came up with the idea while we were watching Richard Socher's Deep Learning for Natural Language Processing (CS224d) lectures. The project involves implementing a collection of Deep Learning models to predict the correct choice for multiple-choice 8th grade Science questions. The data came from the Allen AI Science Challenge on Kaggle.
You can find the slides for the talk here. All the code can be found in this GitHub repository. The code is written in Python using the awesome Keras library. I also used gensim to generate and load external embeddings, and NLTK and SpaCy for some simple NLP functionality. The README.md is fairly detailed (with many illustrations originally built for the slides), so I am not going to repeat that material here.
I treated the "question with four candidate answers, one of which is correct" setup as a classification problem with 1 positive and 3 negative examples per question. All my models produce a binary (correct/incorrect) response given a (question, answer) pair. Once the best model (in terms of accuracy of correct/incorrect predictions) is identified, I run it on all four (question, answer) pairs and select the one with the best score. To do this, I needed to be able to serialize each model after training and deserialize it in the final prediction script. This is where I ran into the problems I described in Keras Issue 3927.
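The reframing from one 4-way multiple-choice question into four binary examples can be sketched as follows (a minimal illustration with a hypothetical helper, not the project's actual data-loading code):

```python
def expand_to_pairs(question, choices, correct_idx):
    """Turn one multiple-choice question into four (question, answer, label)
    examples: one positive and three negatives."""
    return [(question, choice, 1 if i == correct_idx else 0)
            for i, choice in enumerate(choices)]

pairs = expand_to_pairs(
    "Which gas do plants absorb during photosynthesis?",
    ["Oxygen", "Carbon dioxide", "Nitrogen", "Helium"],
    correct_idx=1)
# yields one positive example ("Carbon dioxide") and three negatives
```

At prediction time, the same four pairs are scored together as a batch and the highest-scoring choice wins.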
To make a long story short, if you re-use an input with the Sequential model, the weights somehow get misaligned and cannot be loaded back into the model. I noticed this after upgrading from a much older version of Keras to the latest one, which I needed for some extra layer types. The workaround on the newer version seems to be to use the Functional API. Unfortunately, I wasn't able to finish the code rewrite and rerun by my presentation deadline, although luckily I did have a usable model for one of my earlier (weaker) classifiers that I had saved using the older version.
So in the rest of this post, I will describe the architecture and code for my strongest model, an LSTM-QA model with Attention (inspired by the paper LSTM-based Deep Learning Models for Non-factoid Answer Selection by Tan, dos Santos, Xiang and Zhou), and using a custom embedding generated from approximately 500k Studystack Flashcards, followed by the code for finding the best answer. In other words, the last mile of my solution.
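The custom embedding enters the network as the initial weights of the Embedding layers. Here is a minimal numpy sketch of how such a weight matrix might be assembled from pretrained word vectors (the helper name and details here are hypothetical; the actual logic lives in kaggle.get_weights_word2vec):

```python
import numpy as np

def build_embedding_matrix(word2idx, word_vectors, embed_size, seed=42):
    """Build a (vocab_size, embed_size) weight matrix for an Embedding layer.
    Row 0 is reserved for the mask/padding token; words without a pretrained
    vector get small random values so training can adjust them."""
    rng = np.random.RandomState(seed)
    weights = rng.uniform(-0.05, 0.05, (len(word2idx) + 1, embed_size))
    weights[0] = np.zeros(embed_size)  # padding row stays all-zero
    for word, idx in word2idx.items():
        if word in word_vectors:
            weights[idx] = word_vectors[word]
    return weights

# toy usage with made-up 4-dimensional "pretrained" vectors
vecs = {"cell": np.ones(4), "energy": np.full(4, 2.0)}
W = build_embedding_matrix({"cell": 1, "energy": 2, "mitochondria": 3},
                           vecs, embed_size=4)
```

Words missing from the word2vec vocabulary (like "mitochondria" above) keep their random initialization, which is one reason the test-set vocabulary needs to be folded into the word index up front.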
This is what the network looks like:
And here is the code for the network.
```python
# Source: qa-lstm-fem-attn.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from keras.callbacks import ModelCheckpoint
from keras.layers import Input, Dense, Dropout, Reshape, Flatten, merge
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import Model
from sklearn.cross_validation import train_test_split
import os
import sys

import kaggle

DATA_DIR = "../data/comp_data"
MODEL_DIR = "../data/models"
WORD2VEC_BIN = "studystack.bin"
WORD2VEC_EMBED_SIZE = 300

QA_TRAIN_FILE = "8thGr-NDMC-Train.csv"
QA_TEST_FILE = "8thGr-NDMC-Test.csv"

QA_EMBED_SIZE = 64
BATCH_SIZE = 128
NBR_EPOCHS = 20

## extract data
print("Loading and formatting data...")
qapairs = kaggle.get_question_answer_pairs(
    os.path.join(DATA_DIR, QA_TRAIN_FILE))
question_maxlen = max([len(qapair[0]) for qapair in qapairs])
answer_maxlen = max([len(qapair[1]) for qapair in qapairs])

# Even though we don't use the test set for classification, we still need
# to consider any additional vocabulary words from it for when we use the
# model for prediction (against the test set).
tqapairs = kaggle.get_question_answer_pairs(
    os.path.join(DATA_DIR, QA_TEST_FILE), is_test=True)
tq_maxlen = max([len(qapair[0]) for qapair in tqapairs])
ta_maxlen = max([len(qapair[1]) for qapair in tqapairs])

seq_maxlen = max([question_maxlen, answer_maxlen, tq_maxlen, ta_maxlen])

word2idx = kaggle.build_vocab([], qapairs, tqapairs)
vocab_size = len(word2idx) + 1  # include mask character 0

Xq, Xa, Y = kaggle.vectorize_qapairs(qapairs, word2idx, seq_maxlen)
Xqtrain, Xqtest, Xatrain, Xatest, Ytrain, Ytest = \
    train_test_split(Xq, Xa, Y, test_size=0.3, random_state=42)
print(Xqtrain.shape, Xqtest.shape, Xatrain.shape, Xatest.shape,
      Ytrain.shape, Ytest.shape)

# get embeddings from word2vec
print("Loading Word2Vec model and generating embedding matrix...")
embedding_weights = kaggle.get_weights_word2vec(word2idx,
    os.path.join(DATA_DIR, WORD2VEC_BIN), is_custom=True)

print("Building model...")

# output: (None, seq_maxlen, QA_EMBED_SIZE)
qin = Input(shape=(seq_maxlen,), dtype="int32")
qenc = Embedding(input_dim=vocab_size, output_dim=WORD2VEC_EMBED_SIZE,
                 input_length=seq_maxlen,
                 weights=[embedding_weights])(qin)
qenc = LSTM(QA_EMBED_SIZE, return_sequences=True)(qenc)
qenc = Dropout(0.3)(qenc)

# output: (None, seq_maxlen, QA_EMBED_SIZE)
ain = Input(shape=(seq_maxlen,), dtype="int32")
aenc = Embedding(input_dim=vocab_size, output_dim=WORD2VEC_EMBED_SIZE,
                 input_length=seq_maxlen,
                 weights=[embedding_weights])(ain)
aenc = LSTM(QA_EMBED_SIZE, return_sequences=True)(aenc)
aenc = Dropout(0.3)(aenc)

# attention model
attn = merge([qenc, aenc], mode="dot", dot_axes=[1, 1])
attn = Flatten()(attn)
attn = Dense(seq_maxlen * QA_EMBED_SIZE)(attn)
attn = Reshape((seq_maxlen, QA_EMBED_SIZE))(attn)

qenc_attn = merge([qenc, attn], mode="sum")
qenc_attn = Flatten()(qenc_attn)

output = Dense(2, activation="softmax")(qenc_attn)

model = Model(input=[qin, ain], output=[output])

print("Compiling model...")
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

print("Training...")
best_model_filename = os.path.join(MODEL_DIR,
    kaggle.get_model_filename(sys.argv, "best"))
checkpoint = ModelCheckpoint(filepath=best_model_filename, verbose=1,
                             save_best_only=True)
model.fit([Xqtrain, Xatrain], [Ytrain], batch_size=BATCH_SIZE,
          nb_epoch=NBR_EPOCHS, validation_split=0.1,
          callbacks=[checkpoint])

print("Evaluation...")
loss, acc = model.evaluate([Xqtest, Xatest], [Ytest], batch_size=BATCH_SIZE)
print("Test loss/accuracy final model = %.4f, %.4f" % (loss, acc))

final_model_filename = os.path.join(MODEL_DIR,
    kaggle.get_model_filename(sys.argv, "final"))
json_model_filename = os.path.join(MODEL_DIR,
    kaggle.get_model_filename(sys.argv, "json"))
kaggle.save_model(model, json_model_filename, final_model_filename)

best_model = kaggle.load_model(json_model_filename, best_model_filename)
best_model.compile(optimizer="adam", loss="categorical_crossentropy",
                   metrics=["accuracy"])
loss, acc = best_model.evaluate([Xqtest, Xatest], [Ytest],
                                batch_size=BATCH_SIZE)
print("Test loss/accuracy best model = %.4f, %.4f" % (loss, acc))
```
The code above represents questions and answers as arrays of indexes into a word dictionary built from the words in the questions and answers. The embedding weights are initialized from running word2vec on our corpus of StudyStack flashcards. Attention is modeled as a dot product between the question and answer sequences that come out of the LSTMs. Finally, the attention output and the question vectors are merged (by element-wise sum) and sent through a Dense layer, which outputs one of two values (correct or incorrect).
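For intuition, here is a numpy sketch (with made-up toy sizes) of what the dot-product merge computes for a single example; the real model uses seq_maxlen=196 and QA_EMBED_SIZE=64:

```python
import numpy as np

seq_maxlen, embed_size = 5, 3  # toy sizes for illustration
qenc = np.random.rand(seq_maxlen, embed_size)  # question LSTM output (one sample)
aenc = np.random.rand(seq_maxlen, embed_size)  # answer LSTM output (one sample)

# merge([qenc, aenc], mode="dot", dot_axes=[1, 1]) contracts the timestep
# axis of both tensors, leaving an (embed_size, embed_size) similarity matrix
attn = np.einsum("te,tf->ef", qenc, aenc)
```

This matrix is then flattened, passed through a Dense layer, and reshaped back to (seq_maxlen, QA_EMBED_SIZE) so it can be summed with the question encoding.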
The next step takes the saved (final) model, runs each question in the test set with its four choices as a single batch, and predicts the correct answer as the one with the highest score. The output is written to a CSV file in the format required for submission to Kaggle.
```python
# src/predict_testfile.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from keras.preprocessing.sequence import pad_sequences
import nltk
import numpy as np
import os

import kaggle

DATA_DIR = "../data/comp_data"
TRAIN_FILE = "8thGr-NDMC-Train.csv"
TEST_FILE = "8thGr-NDMC-Test.csv"
SUBMIT_FILE = "submission.csv"

MODEL_DIR = "../data/models"
MODEL_JSON = "qa-lstm-fem-attn.json"
MODEL_WEIGHTS = "qa-lstm-fem-attn-final.h5"

LSTM_SEQLEN = 196  # seq_maxlen from original model

print("Loading model..")
model = kaggle.load_model(os.path.join(MODEL_DIR, MODEL_JSON),
                          os.path.join(MODEL_DIR, MODEL_WEIGHTS))
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

print("Loading vocabulary...")
qapairs = kaggle.get_question_answer_pairs(os.path.join(DATA_DIR, TRAIN_FILE))
tqapairs = kaggle.get_question_answer_pairs(os.path.join(DATA_DIR, TEST_FILE),
                                            is_test=True)
word2idx = kaggle.build_vocab([], qapairs, tqapairs)
vocab_size = len(word2idx) + 1  # include mask character 0

ftest = open(os.path.join(DATA_DIR, TEST_FILE), "rb")
fsub = open(os.path.join(DATA_DIR, SUBMIT_FILE), "wb")
fsub.write("id,correctAnswer\n")
line_nbr = 0
for line in ftest:
    line = line.strip().decode("utf8").encode("ascii", "ignore")
    if line.startswith("#"):
        continue
    if line_nbr % 10 == 0:
        print("Processed %d questions..." % (line_nbr))
    cols = line.split("\t")
    qid = cols[0]
    question = cols[1]
    answers = cols[2:]
    # create batch of question
    qword_ids = [word2idx[qword] for qword in nltk.word_tokenize(question)]
    Xq, Xa = [], []
    for answer in answers:
        Xq.append(qword_ids)
        Xa.append([word2idx[aword] for aword in nltk.word_tokenize(answer)])
    Xq = pad_sequences(Xq, maxlen=LSTM_SEQLEN)
    Xa = pad_sequences(Xa, maxlen=LSTM_SEQLEN)
    Y = model.predict([Xq, Xa])
    probs = np.exp(1.0 - (Y[:, 1] - Y[:, 0]))
    correct_answer = chr(ord('A') + np.argmax(probs))
    fsub.write("%s,%s\n" % (qid, correct_answer))
    line_nbr += 1
print("Processed %d questions..." % (line_nbr))
fsub.close()
ftest.close()
```
Here is the output for a single question that I referenced in the presentation slides. The model shows the distribution of scores across the four answers (normalized to add up to 1).
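The last-mile selection logic can be sketched like this (a simplified stand-in for the scoring step in predict_testfile.py; here I normalize the "correct" column directly rather than using the exponential transform in the script):

```python
import numpy as np

def pick_answer(Y):
    """Y is the (4, 2) softmax output for a question's four (question, choice)
    pairs; column 1 holds the 'correct' score.  Normalize across the four
    choices so the scores add up to 1, then return the winning letter."""
    probs = Y[:, 1] / Y[:, 1].sum()
    return chr(ord("A") + int(np.argmax(probs))), probs

Y = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.7, 0.3],
              [0.6, 0.4]])
letter, probs = pick_answer(Y)
# letter is "B" here, since the second choice has the highest 'correct' score
```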
I did try to run my classifier on the entire test set and produce a submission file for Kaggle, just to see where I would stand. Since the classification accuracy for the competition winner was approximately 59%, it is unlikely that the 70%+ accuracy numbers for my binary classifiers would carry over to the final answer-selection task. I had signed up for the competition intending to participate but got sidetracked, so I had the original datasets of approximately 8,000 training and 8,000 test questions; unfortunately, the final rankings were computed against another test set of approximately 200k questions that was supplied later in the competition, which I didn't have.
That's all I have for today. As someone mentioned to me after the workshop, these sorts of events are very energizing, and I certainly learned a lot. The deadline also pushed me to complete my hobby project, so I got to learn quite a bit about building more complex Keras models. Hopefully, this will enable me to build even more complex models going forward.