Chatbots are the in thing now. Every website must implement it. Every Data Scientist must know about them. Anytime we talk about AI; Chatbots must be discussed. But they look intimidating to someone very new to the field. We struggle with a lot of questions before we even begin to start working on them.
Are they hard to create? What technologies should I know before attempting to work on them? In the end, we end up discouraged reading through many posts on the internet and effectively accomplishing nothing.
Let me assure you this is not going to be “that kind of a post”.
I will try to distill some of the knowledge I acquired while working through a project in the Natural Language Processing course in the Advanced machine learning specialization.
So before I start, let me first say it for once that don’t be intimidated by the hype and the enigma surrounding Chatbots. They are pretty much using pretty simple NLP techniques which most of us already know. If you don’t, you are welcome to check out my NLP Learning Series, where I go through the problem of text classification in fair detail using Conventional, Deep Learning and Transfer Learning methods.
A Very brief Intro to Chatbots
We can logically divide of Chatbots in the following two categories.
-
Database/FAQ based – We have a database with some questions and answers, and we would like that a user can query that using Natural Language. This is the sort of Chatbots you find at most of the Banking websites for answering FAQs.
-
Chit-Chat Based – Simulate dialogue with the user. These are the kind of chatbots that bring the cool in chatbots. We can use Seq-2-Seq models to create such bots.
The Chatbot we will be creating
We will be creating a dialogue chat bot, which will be able to:
- Answer programming-related questions (using StackOverflow dataset)
- Chit-Chat and simulate dialogue on all non-programming related questions
Once we will have it up and running our final chatbot should look like this.
Seems quite fun.
We will be taking help of resources like Telegram and Chatterbot to build our Chatbot. So before we start, I think I should get you up and running with these two tools.
1. Telegram:
From the website:
Telegram is a messaging app with a focus on speed and security, it’s super-fast, simple and free. You can use Telegram on all your devices at the same time — your messages sync seamlessly across any number of your phones, tablets or computers.
For us, Telegram provides us with an easy way to create a Chatbot UI. It provides us with an access token which we will use to connect to the Telegram App backend and run our chatbot logic. Naturally, we need to have a window where we will write our questions to the chatbot, for us that is provided by Telegram. Also, telegram powers the chatbot by communicating with our chatbot logic. The above screenshot is taken from the telegram app only.
Set up Telegram:
Don’t worry if you don’t understand how it works yet; I will try to give step by step instructions as we go forward.
-
Step 1: Download and Install Telegram App on your Laptop.
-
Step 2: Talk with BotFather by opening this link in Chrome and subsequently your Telegram App.
-
Step 3: The above steps will take you to a Chatbot called Botfather which can help you create a new bot. Inception Anyone? It will look something like this.
- Set up a new bot using command “/newbot”
- Create a name for Your bot.
- Create a username for your bot.
- Step 4: You will get an access token for the bot. Copy the Token at a safe place.
- Step 5: Click on the “t.me/MLWhizbot” link to open Chat with your chatbot in a new window.
Right now if you try to communicate with the chatbot, you won’t receive any answers. And that is how it should be.
But that’s not at all fun. Is it? Let’s do some python magic to make it responsive.
Making our Telegram Chatbot responsive
Create a file main.py
and put the following code in it. Don’t worry most of the code here is Boilerplate code to make our Chatbot communicate with Telegram using the Access token. We need to worry about implementing the class SimpleDialogueManager
. This class contains a function called generate_answer
which is where we will write our bot logic.
#!/usr/bin/env python3
import requests
import time
import argparse
import os
import json
from requests.compat import urljoin
class BotHandler(object):
"""
BotHandler is a class which implements all back-end of the bot.
It has three main functions:
'get_updates' — checks for new messages
'send_message' – posts new message to user
'get_answer' — computes the most relevant on a user's question
"""
def __init__(self, token, dialogue_manager):
self.token = token
self.api_url = "https://api.telegram.org/bot{}/".format(token)
self.dialogue_manager = dialogue_manager
def get_updates(self, offset=None, timeout=30):
params = {"timeout": timeout, "offset": offset}
raw_resp = requests.get(urljoin(self.api_url, "getUpdates"), params)
try:
resp = raw_resp.json()
except json.decoder.JSONDecodeError as e:
print("Failed to parse response {}: {}.".format(raw_resp.content, e))
return []
if "result" not in resp:
return []
return resp["result"]
def send_message(self, chat_id, text):
params = {"chat_id": chat_id, "text": text}
return requests.post(urljoin(self.api_url, "sendMessage"), params)
def get_answer(self, question):
if question == '/start':
return "Hi, I am your project bot. How can I help you today?"
return self.dialogue_manager.generate_answer(question)
def is_unicode(text):
return len(text) == len(text.encode())
class SimpleDialogueManager(object):
"""
This is a simple dialogue manager to test the telegram bot.
The main part of our bot will be written here.
"""
def generate_answer(self, question):
if "Hi" in question:
return "Hello, You"
else:
return "Don't be rude. Say Hi first."
def main():
# Put your own Telegram Access token here...
token = '839585958:AAEfTDo2X6PgHb9IEdb62ueS4SmdpCkhtmc'
simple_manager = SimpleDialogueManager()
bot = BotHandler(token, simple_manager)
###############################################################
print("Ready to talk!")
offset = 0
while True:
updates = bot.get_updates(offset=offset)
for update in updates:
print("An update received.")
if "message" in update:
chat_id = update["message"]["chat"]["id"]
if "text" in update["message"]:
text = update["message"]["text"]
if is_unicode(text):
print("Update content: {}".format(update))
bot.send_message(chat_id, bot.get_answer(update["message"]["text"]))
else:
bot.send_message(chat_id, "Hmm, you are sending some weird characters to me...")
offset = max(offset, update['update_id'] + 1)
time.sleep(1)
if __name__ == "__main__":
main()
You can run the file main.py
in the terminal window to make your bot responsive.
Nice. It is following simple logic. But the good thing is that our bot now does something.
Also, take a look at the terminal window where we have run our main.py
File. Whenever a user asks a question, we get the sort of dictionary below containing Unique Chat ID, Chat Text, User Information, etc.
Update content: {'update_id': 484689748, 'message': {'message_id': 115, 'from': {'id': 844474950, 'is_bot': False, 'first_name': 'Rahul', 'last_name': 'Agarwal', 'language_code': 'en'}, 'chat': {'id': 844474950, 'first_name': 'Rahul', 'last_name': 'Agarwal', 'type': 'private'}, 'date': 1555266010, 'text': 'What is 2+2'}}
Until now whatever we had done was sort of setting up and engineering sort of work.
Only if we can write some sound Data Science logic in the generate_answer
function in our main.py
we should have a decent chatbot.
2. ChatterBot
From the Documentation:
ChatterBot is a Python library that makes it easy to generate automated responses to a user’s input. ChatterBot uses a selection of machine learning algorithms to produce different types of reactions. This makes it easy for developers to create chat bots and automate conversations with users.
Simply. It is a Blackbox system which can provide us with responses for Chitchat type questions for our Chatbot. And the best part about it is that it is pretty easy to integrate with our current flow. We could also have trained a SeqtoSeq model to do the same thing. Might be I will do it in a later post. I digress.
So, install it with:
And change the SimpleDialogueManager
Class in main.py to the following. We can have a bot that can talk to the user and answer random queries.
class SimpleDialogueManager(object):
"""
This is a simple dialogue manager to test the telegram bot.
The main part of our bot will be written here.
"""
def __init__(self):
from chatterbot import ChatBot
from chatterbot.trainers import ChatterBotCorpusTrainer
chatbot = ChatBot('MLWhizChatterbot')
trainer = ChatterBotCorpusTrainer(chatbot)
trainer.train('chatterbot.corpus.english')
self.chitchat_bot = chatbot
def generate_answer(self, question):
response = self.chitchat_bot.get_response(question)
return response
The code in init
instantiates a chatbot using chatterbot and trains it on the provided english corpus data. The data is pretty small, but you can always train it on your dataset too. Just see the documentation. We can then give our responses using the Chatterbot chatbot in the generate_answer
function.
Not too “ba a a a a a d” , I must say.
Creating our StackOverFlow ChatBot
Ok, so finally we are at a stage where we can do something we love. Use Data Science to power our Application/Chatbot. Let us start with creating a rough architecture of what we are going to do next.
We will need to create two classifiers and save them as .pkl
files.
-
Intent-Classifier: This classifier will predict if it a question is a Stack-Overflow question or not. If it is not a Stack-overflow question, we let Chatterbot handle it.
-
Programming-Language(Tag) Classifier: This classifier will predict which language a question belongs to if the question is a Stack-Overflow question. We do this so that we can search for those language questions in our database only.
To keep it simple we will create simple TFIDF models. We will need to save these TFIDF vectorizers.
We will also need to store word vectors for every question for similarity calculations later.
Let us go through the process step by step. You can get the full code in this jupyter notebook in my project repository.
Step 1. Reading and Visualizing the Data
dialogues = pd.read_csv("data/dialogues.tsv",sep="t")
posts = pd.read_csv("data/tagged_posts.tsv",sep="t")
text | tag | |
---|---|---|
0 | Okay — you’re gonna need to learn how to lie. | dialogue |
1 | I’m kidding. You know how sometimes you just … | dialogue |
2 | Like my fear of wearing pastels? | dialogue |
3 | I figured you’d get to the good stuff eventually. | dialogue |
4 | Thank God! If I had to hear one more story ab… | dialogue |
post_id | title | tag | |
---|---|---|---|
0 | 9 | Calculate age in C# | c# |
1 | 16 | Filling a DataSet or DataTable from a LINQ que… | c# |
2 | 39 | Reliable timer in a console application | c# |
3 | 42 | Best way to allow plugins for a PHP application | php |
4 | 59 | How do I get a distinct, ordered list of names… | c# |
print("Num Posts:",len(posts))
print("Num Dialogues:",len(dialogues))
Num Posts: 2171575
Num Dialogues: 218609
Step 2: Create training data for intent classifier – Chitchat/StackOverflow Question
We will be creating a TFIDF model with Logistic regression to do this. If you want to know about the TFIDF model you can read it here.
We could also have used one of the Deep Learning models or transfer learning approaches to do this, but since the main objective of this post is to get a chatbot up and running and not worry too much about the accuracy we sort of work with the TFIDF based model only.
texts = list(dialogues[:200000].text.values) + list(posts[:200000].title.values)
labels = ['dialogue']*200000 + ['stackoverflow']*200000
data = pd.DataFrame({'text':texts,'target':labels})
def text_prepare(text):
"""Performs tokenization and simple preprocessing."""
replace_by_space_re = re.compile('[/(){}[]|@,;]')
bad_symbols_re = re.compile('[^0-9a-z #+_]')
stopwords_set = set(stopwords.words('english'))
text = text.lower()
text = replace_by_space_re.sub(' ', text)
text = bad_symbols_re.sub('', text)
text = ' '.join([x for x in text.split() if x and x not in stopwords_set])
return text.strip()
# Doing some data cleaning
data['text'] = data['text'].apply(lambda x : text_prepare(x))
X_train, X_test, y_train, y_test = train_test_split(data['text'],data['target'],test_size = .1 , random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))
Train size = 360000, test size = 40000
Step 3. Create Intent classifier
Here we Create a TFIDF Vectorizer to create features and also train a Logistic regression model to create the intent_classifier. Please note how we are saving TFIDF Vectorizer to resources/tfidf.pkl
and intent_classifier to resources/intent_clf.pkl
. We will need these files once we are going to write the SimpleDialogueManager
class for our final Chatbot.
# We will keep our models and vectorizers in this folder
!mkdir resources
def tfidf_features(X_train, X_test, vectorizer_path):
"""Performs TF-IDF transformation and dumps the model."""
tfv = TfidfVectorizer(dtype=np.float32, min_df=3, max_features=None,
strip_accents='unicode', analyzer='word',token_pattern=r'w{1,}',
ngram_range=(1, 3), use_idf=1,smooth_idf=1,sublinear_tf=1,
stop_words = 'english')
X_train = tfv.fit_transform(X_train)
X_test = tfv.transform(X_test)
pickle.dump(tfv,vectorizer_path)
return X_train, X_test
X_train_tfidf, X_test_tfidf = tfidf_features(X_train, X_test, open("resources/tfidf.pkl",'wb'))
intent_recognizer = LogisticRegression(C=10,random_state=0)
intent_recognizer.fit(X_train_tfidf,y_train)
pickle.dump(intent_recognizer, open("resources/intent_clf.pkl" , 'wb'))
# Check test accuracy.
y_test_pred = intent_recognizer.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))
Test accuracy = 0.989825
The Intent Classifier has a pretty good test accuracy of 98%. TFIDF is not so bad.
Step 4: Create Programming Language classifier
Let us first create the data for Programming Language classifier and then train a Logistic Regression model using TFIDF features. We save this tag Classifier at the location resources/tag_clf.pkl
. We do this step mostly because we don’t want to do similarity calculations over the whole database of questions but only on the subset of questions by the language tag.
# creating the data for Programming Language classifier
X = posts['title'].values
y = posts['tag'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print('Train size = {}, test size = {}'.format(len(X_train), len(X_test)))
Train size = 1737260, test size = 434315
vectorizer = pickle.load(open("resources/tfidf.pkl", 'rb'))
X_train_tfidf, X_test_tfidf = vectorizer.transform(X_train), vectorizer.transform(X_test)
tag_classifier = OneVsRestClassifier(LogisticRegression(C=5,random_state=0))
tag_classifier.fit(X_train_tfidf,y_train)
pickle.dump(tag_classifier, open("resources/tag_clf.pkl", 'wb'))
# Check test accuracy.
y_test_pred = tag_classifier.predict(X_test_tfidf)
test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test accuracy = {}'.format(test_accuracy))
Test accuracy = 0.8043816124241622
Not Bad again.
Step 5: Store Question database Embeddings
One can use pre-trained word vectors from Google or get a better result by training their embeddings using their data. Since again accuracy and precision is not the primary goal of this post, we will use pretrained vectors.
# Load Google's pre-trained Word2Vec model.
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
We want to convert every question to an embedding and store them so that we don’t calculate the embeddings for the whole dataset every time. In essence, whenever the user asks a Stack Overflow question, we want to use some distance similarity measure to get the most similar question.
def question_to_vec(question, embeddings, dim=300):
"""
question: a string
embeddings: dict where the key is a word and a value is its' embedding
dim: size of the representation
result: vector representation for the question
"""
word_tokens = question.split(" ")
question_len = len(word_tokens)
question_mat = np.zeros((question_len,dim), dtype = np.float32)
for idx, word in enumerate(word_tokens):
if word in embeddings:
question_mat[idx,:] = embeddings[word]
# remove zero-rows which stand for OOV words
question_mat = question_mat[~np.all(question_mat == 0, axis = 1)]
# Compute the mean of each word along the sentence
if question_mat.shape[0] > 0:
vec = np.array(np.mean(question_mat, axis = 0), dtype = np.float32).reshape((1,dim))
else:
vec = np.zeros((1,dim), dtype = np.float32)
return vec
counts_by_tag = posts.groupby(by=['tag'])["tag"].count().reset_index(name = 'count').sort_values(['count'], ascending = False)
counts_by_tag = list(zip(counts_by_tag['tag'],counts_by_tag['count']))
print(counts_by_tag)
[('c#', 394451), ('java', 383456), ('javascript', 375867), ('php', 321752), ('c_cpp', 281300), ('python', 208607), ('ruby', 99930), ('r', 36359), ('vb', 35044), ('swift', 34809)]
We save the embeddings in a folder aptly named resources/embeddings_folder
. This folder will contain a .pkl file for every tag. For example one of the files will be python.pkl
.
! mkdir resources/embeddings_folder
for tag, count in counts_by_tag:
tag_posts = posts[posts['tag'] == tag]
tag_post_ids = tag_posts['post_id'].values
tag_vectors = np.zeros((count, 300), dtype=np.float32)
for i, title in enumerate(tag_posts['title']):
tag_vectors[i, :] = question_to_vec(title, model, 300)
# Dump post ids and vectors to a file.
filename = 'resources/embeddings_folder/'+ tag + '.pkl'
pickle.dump((tag_post_ids, tag_vectors), open(filename, 'wb'))
We are nearing the end now. We need to have a function to get most similar question’s post id in the dataset given we know the programming Language of the question. Here it is:
def get_similar_question(question,tag):
# get the path where all question embeddings are kept and load the post_ids and post_embeddings
embeddings_path = 'resources/embeddings_folder/' + tag + ".pkl"
post_ids, post_embeddings = pickle.load(open(embeddings_path, 'rb'))
# Get the embeddings for the question
question_vec = question_to_vec(question, model, 300)
# find index of most similar post
best_post_index = pairwise_distances_argmin(question_vec,
post_embeddings)
# return best post id
return post_ids[best_post_index]
get_similar_question("how to use list comprehension in python?",'python')
array([5947137])
we can use this post ID and find this question at https://stackoverflow.com/questions/5947137
The question the similarity checker suggested has the actual text: “How can I use a list comprehension to extend a list in python? [duplicate]”
Not too bad. It could have been better if we train our embeddings or use starspace embeddings.
Assemble the Puzzle – SimpleDialogueManager Class
Finally, we have reached the end of the whole exercise, and we have to fit all the pieces in the puzzle in our SimpleDialogueManager
Class. Here is the code for that. Go in the main.py
file again to paste this code and see if it works or not.
Go through the comments to understand how the pieces are fitting together to build one wholesome logic.
import gensim
import pickle
import re
import nltk
from nltk.corpus import stopwords
import numpy as np
from sklearn.metrics.pairwise import pairwise_distances_argmin
# We will need this function to prepare text at prediction time
def text_prepare(text):
"""Performs tokenization and simple preprocessing."""
replace_by_space_re = re.compile('[/(){}[]|@,;]')
bad_symbols_re = re.compile('[^0-9a-z #+_]')
stopwords_set = set(stopwords.words('english'))
text = text.lower()
text = replace_by_space_re.sub(' ', text)
text = bad_symbols_re.sub('', text)
text = ' '.join([x for x in text.split() if x and x not in stopwords_set])
return text.strip()
# need this to convert questions asked by user to vectors
def question_to_vec(question, embeddings, dim=300):
"""
question: a string
embeddings: dict where the key is a word and a value is its' embedding
dim: size of the representation
result: vector representation for the question
"""
word_tokens = question.split(" ")
question_len = len(word_tokens)
question_mat = np.zeros((question_len,dim), dtype = np.float32)
for idx, word in enumerate(word_tokens):
if word in embeddings:
question_mat[idx,:] = embeddings[word]
# remove zero-rows which stand for OOV words
question_mat = question_mat[~np.all(question_mat == 0, axis = 1)]
# Compute the mean of each word along the sentence
if question_mat.shape[0] > 0:
vec = np.array(np.mean(question_mat, axis = 0), dtype = np.float32).reshape((1,dim))
else:
vec = np.zeros((1,dim), dtype = np.float32)
return vec
class SimpleDialogueManager(object):
"""
This is a simple dialogue manager to test the telegram bot.
The main part of our bot will be written here.
"""
def __init__(self):
# Instantiate all the models and TFIDF Objects.
print("Loading resources...")
# Instantiate a Chatterbot for Chitchat type questions
from chatterbot import ChatBot
from chatterbot.trainers import ChatterBotCorpusTrainer
chatbot = ChatBot('MLWhizChatterbot')
trainer = ChatterBotCorpusTrainer(chatbot)
trainer.train('chatterbot.corpus.english')
self.chitchat_bot = chatbot
print("Loading Word2vec model...")
# Instantiate the Google's pre-trained Word2Vec model.
self.model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
print("Loading Classifier objects...")
# Load the intent classifier and tag classifier
self.intent_recognizer = pickle.load(open('resources/intent_clf.pkl', 'rb'))
self.tag_classifier = pickle.load(open('resources/tag_clf.pkl', 'rb'))
# Load the TFIDF vectorizer object
self.tfidf_vectorizer = pickle.load(open('resources/tfidf.pkl', 'rb'))
print("Finished Loading Resources")
# We created this function just above. We just need to have a function to get most similar question's *post id* in the dataset given we know the programming Language of the question. Here it is:
def get_similar_question(self,question,tag):
# get the path where all question embeddings are kept and load the post_ids and post_embeddings
embeddings_path = 'resources/embeddings_folder/' + tag + ".pkl"
post_ids, post_embeddings = pickle.load(open(embeddings_path, 'rb'))
# Get the embeddings for the question
question_vec = question_to_vec(question, self.model, 300)
# find index of most similar post
best_post_index = pairwise_distances_argmin(question_vec,
post_embeddings)
# return best post id
return post_ids[best_post_index]
def generate_answer(self, question):
prepared_question = text_prepare(question)
features = self.tfidf_vectorizer.transform([prepared_question])
# find intent
intent = self.intent_recognizer.predict(features)[0]
# Chit-chat part:
if intent == 'dialogue':
response = self.chitchat_bot.get_response(question)
# Stack Overflow Question
else:
# find programming language
tag = self.tag_classifier.predict(features)[0]
# find most similar question post id
post_id = self.get_similar_question(question,tag)[0]
# respond with
response = 'I think its about %snThis thread might help you: https://stackoverflow.com/questions/%s' % (tag, post_id)
return response
Here is the code for the whole main.py
for you to use and see. Just run the whole main.py
using
And we will have our bot up and running.
Again, here is the link to the github repository
The possibilities are really endless
This is just a small demo project of what you can do with the chatbots. You can do a whole lot more once you recognize that the backend is just python.
-
One idea is to run a chatbot script on all the servers I have to run system commands straight from telegram. We can use
os.system
to run any system command. Bye Bye SSH. -
You can make chatbots to do some daily tasks by using simple keyword-based intents. It is just simple logic. Find out the weather, find out cricket scores or maybe newly released movies. Whatever floats your boat.
-
Or maybe try to integrate Telegram based Chatbot in your website. See livechatbot
-
Or maybe just try to have fun with it.
Conclusion
Here we learned how to create a simple chatbot. And it works okayish. We can improve a whole lot on this present chatbot by increasing classifier accuracy, handling edge cases, making it respond faster or maybe adding more logic to handle more use cases. But the fact remains the same. The AI in chatbots is just simple human logic and nothing magic.
In this post, I closely followed one of the projects from this course to create this chatbot. Do check out this course if you get confused, or tell me your problems in the comments I will certainly try to help.
Follow me up at Medium or Subscribe to my blog to be informed about my next posts.
Till then Ciao!!