quote2vec uses a machine learning technique called doc2vec to let you explore and visualize the world's great quotations (at least, those available on en.wikiquote.org). Read more about how quote2vec works by clicking About, or view the code by clicking github.
Use the search field to begin
Search Person or Source to view all quotes by a certain person or in a certain book or speech.
Search Quote to view the quotes that quote2vec identifies as most similar to the chosen quote.
Search Keywords to view the quotes most similar to any keywords or piece of text.
When Person, Source, or Quote is selected, click Random to make a choice at random from the wikiquote database.
The list view shows all quotes found for the chosen person, source, quote, or keywords. Click any quote, source, or person to jump to a new list for the selected item. Click Add to graph to add the group of quotes to the graph view.
The graph view lets you visualize groups of quotes. quote2vec represents each quote as a point in a high-dimensional space, where quotes that are similar to each other have nearby points. The graph view uses Principal Component Analysis (PCA) to create a useful 2D view of the space.
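For a sense of what that projection step looks like in code, here is a minimal sketch using scikit-learn's PCA on a stand-in array of quote vectors; it is illustrative only, not quote2vec's actual plotting code.

```python
# Illustrative only: project high-dimensional quote vectors down to 2D with PCA.
# The array of vectors here is a random stand-in for a real group of quotes.
import numpy as np
from sklearn.decomposition import PCA

quote_vectors = np.random.rand(50, 300)       # 50 quotes, 300 dimensions each

pca = PCA(n_components=2)
points_2d = pca.fit_transform(quote_vectors)  # one (x, y) point per quote, ready to plot
```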
Once a group of quotes has been added to the graph, you can hover over a point to show its corresponding quote, or click a point to highlight its quote in the list view. You can also drag to pan and scroll to zoom the graph.
Any number of quote groups can be added to the graph. To pull up the list view of a group on the graph, click its name at the top of the graph. To remove a group from the graph, click the "X" next to its name.
Wikiquote is an open, community-built repository of over 170,000 quotations from writers and thinkers around the world. quote2vec uses doc2vec, a recent machine learning technique, to map the semantic similarities among these quotes. Its purpose is to provide an intuitive tool to explore the huge "space" of quotes, and encourage the discovery of beautiful ideas or surprising connections.
To build quote2vec, I scraped and parsed the quotes from all of the "person" pages on the English Wikiquote. (Wikiquote also has pages devoted to movies, books, and other categories, which I ignored.)
I coded all of quote2vec's data processing, modeling, and backend in python. To retrieve all of the person pages I used requests, relying on wikiquote's "List of people by name" pages to compile a complete list of URLs to access.
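As a rough illustration of that retrieval step (the index URL and helper function below are illustrative, not the project's actual code):

```python
# Rough sketch of the retrieval step with requests; the index URL and helper
# name are illustrative, not quote2vec's actual code.
import requests

INDEX_URL = "https://en.wikiquote.org/wiki/Wikiquote:List_of_people_by_name"  # illustrative entry point

def fetch_html(url):
    """Download one Wikiquote page and return its raw HTML."""
    response = requests.get(url)
    response.raise_for_status()
    return response.text

index_html = fetch_html(INDEX_URL)
# ...parse the per-person page URLs out of index_html, then fetch each person page.
```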
Because wikiquote's pages are created by volunteers using a markup language with loose rules, extracting quote text and source names from a page is challenging. In most cases, each quote appears as a bullet. The source of a quote may appear as a section header above the quote, or it might be included as a sub-bullet after the quote. Quotes in a foreign language are supposed to be written in italics, with the English translation in a sub-bullet underneath, but this convention is not always followed.
To extract quotes and sources as cleanly as possible, I used beautifulsoup and recursively parsed the section tree to map sources to quotes. I created a series of heuristics to deal with inconsistencies in formatting. For example, if most quotes in a section have no sub-bullets, the section head is probably the source of the quotes; if they do have sub-bullets, the head is probably a general term like "Early Life" rather than a source. Similarly, I used a combination of cues, including whether italics were used and the proportion of words that appeared in an English dictionary, to determine whether a quote was likely in a foreign language.
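As one concrete example of these heuristics, here is a hedged sketch of how a foreign-language check might combine the italics cue with the dictionary-word proportion; the word list and threshold are stand-ins, not the values used in quote2vec.

```python
# Hypothetical sketch of the foreign-language heuristic described above;
# the word list and threshold are stand-ins for the real dictionary and tuning.
import re

ENGLISH_WORDS = {"the", "a", "and", "of", "to", "in", "is", "it", "that", "not"}

def looks_foreign(quote_text, is_italic):
    """Guess whether a quote is written in a language other than English."""
    words = re.findall(r"[A-Za-z']+", quote_text.lower())
    if not words:
        return False
    english_ratio = sum(word in ENGLISH_WORDS for word in words) / len(words)
    # Italics plus few recognizable English words suggest a foreign-language quote.
    return is_italic and english_ratio < 0.3
```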
These heuristics are imperfect, but for the most part they succeed in extracting quote–source–person relationships from the raw HTML. I then stored this data in a MySQL database built using SQLAlchemy.
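A minimal sketch of what that schema could look like with SQLAlchemy's declarative ORM follows; the table and column names, and the connection string, are hypothetical rather than the project's actual layout.

```python
# Hypothetical schema sketch with SQLAlchemy's declarative ORM; names and the
# connection string are illustrative, not quote2vec's actual database layout.
from sqlalchemy import Column, ForeignKey, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Person(Base):
    __tablename__ = "person"
    id = Column(Integer, primary_key=True)
    name = Column(String(255))

class Source(Base):
    __tablename__ = "source"
    id = Column(Integer, primary_key=True)
    person_id = Column(Integer, ForeignKey("person.id"))
    title = Column(String(255))

class Quote(Base):
    __tablename__ = "quote"
    id = Column(Integer, primary_key=True)
    person_id = Column(Integer, ForeignKey("person.id"))
    source_id = Column(Integer, ForeignKey("source.id"), nullable=True)
    text = Column(Text)

engine = create_engine("mysql+pymysql://user:password@localhost/quote2vec")  # illustrative DSN
Base.metadata.create_all(engine)
```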
To prepare the text for analysis, I used spacy to split the text into individual word tokens, remove punctuation, and lemmatize the words. Lemmatization converts words into their root forms, such as changing "dogs" to "dog" and "running" to "run." This simplifies later analysis and saves the machine learning model from having to learn the relationships among these highly similar words.
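That preprocessing step looks roughly like the following sketch; the specific spacy model name is an assumption on my part.

```python
# Rough sketch of the preprocessing step; the spacy model ("en_core_web_sm")
# is an assumption, not necessarily what quote2vec uses.
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    """Tokenize a quote, drop punctuation, and lemmatize the remaining words."""
    doc = nlp(text)
    return [token.lemma_.lower() for token in doc if not token.is_punct]

print(preprocess("The dogs were running."))  # e.g. ['the', 'dog', 'be', 'run']
```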
To build a model of the similarities among quotes, I used a recent neural network technique called doc2vec. (I also tested a classic matrix-based technique called Latent Semantic Analysis, but found that doc2vec performed much better; see more on model evaluation below.) doc2vec builds on the well-known word2vec model. It represents documents (quotes, in this case) and the words that appear in them in a high-dimensional vector space, such that documents with high cosine similarity in the space are closely semantically related.
The model begins by assigning each document, and each word in the corpus vocabulary, a random vector. It then iteratively updates these vectors using stochastic gradient descent. At each iteration, the model attempts to use the document vector to predict which of a set of candidate words appears in the document. It then updates the document vector to reduce future error, essentially moving it towards the vector for the correct candidate and away from the distractors. A similar process moves word vectors towards the vectors of other words that occur nearby in the text. The eventual outcome of this process, as shown in the doc2vec paper and the original word2vec paper, is a vector space in which distances between vectors are interpretable as semantic similarity to humans.
I used gensim's implementation of doc2vec, with 300-dimensional vectors and the "distributed bag of words" approach to learning document vectors. I also made a few modifications, such as down-weighting similarities to quotes with short vectors, as these tend to correspond to short quotes that have noisy, sometimes-inaccurate representations.
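In gensim, the core of this training step can be sketched as below; only the 300-dimensional vectors and the "distributed bag of words" mode come from the description above, while the other hyperparameters, tokens, and tags are illustrative.

```python
# Sketch of the training step in gensim; vector_size=300 and dm=0 (PV-DBOW) match
# the description above, everything else here is illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each quote becomes a TaggedDocument: its lemmatized tokens plus a unique tag.
corpus = [
    TaggedDocument(words=["time", "be", "a", "great", "teacher"], tags=["quote_0"]),
    TaggedDocument(words=["the", "unexamined", "life", "be", "not", "worth", "live"], tags=["quote_1"]),
]

model = Doc2Vec(
    corpus,
    vector_size=300,  # 300-dimensional quote vectors
    dm=0,             # dm=0 selects the "distributed bag of words" variant
    min_count=1,      # keep every word in this tiny toy corpus
    epochs=20,        # illustrative
)

# Nearest neighbours of a quote, by cosine similarity of document vectors (gensim 4.x API).
print(model.dv.most_similar("quote_0"))
```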
Determining the best model structure requires some external metric of model performance. While doc2vec is trained to predict words given a document ID, my actual goal was to create a meaningful vector space of quotes. To measure this, I created an evaluation task in which the model, given a target quote, had to pick which of two quotes came from the same person or source, and which was a randomly selected distractor. Quotes from the same person or source are likely to have similarities in meaning, and the model had no access to person or source labels during training, making this a reasonable test of the model's ability to extract semantic meaning. In my model evaluation process, I conducted this test using both same-source and same-person pairs and averaged the results. Chance performance on this metric is 50%, and the final model achieves about 76.6%.
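Concretely, one trial of this evaluation can be sketched as follows; the function names are hypothetical.

```python
# Hypothetical sketch of one evaluation trial: the same-person (or same-source)
# quote should be closer to the target than a random distractor.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def trial_correct(target_vec, same_person_vec, distractor_vec):
    """Return True if the model ranks the same-person quote above the distractor."""
    return cosine(target_vec, same_person_vec) > cosine(target_vec, distractor_vec)

# Averaging trial_correct over many sampled triplets gives the accuracy reported
# above; chance is 50%, and the final model reaches about 76.6%.
```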
quote2vec uses a RESTful flask backend, hosted on Amazon Web Services. The website front end uses the backbone framework to manage data and state, along with bootstrap to create an attractive layout and interface. I used Twitter's typeahead.js to create a suggestions interface for the search box.
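For a flavor of what the REST layer looks like, here is a minimal flask sketch; the route, parameters, and response shape are hypothetical rather than quote2vec's actual API.

```python
# Hypothetical sketch of a single REST endpoint; the route and JSON shape are
# illustrative, not quote2vec's actual API.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/quotes/<int:quote_id>/similar")
def similar_quotes(quote_id):
    # A real implementation would look up the quote's doc2vec vector and return
    # its nearest neighbours from the database.
    neighbours = [{"id": 42, "text": "placeholder quote", "score": 0.87}]
    return jsonify(neighbours)

if __name__ == "__main__":
    app.run()
```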