Creating Philosopher Conversations Using a Generative LSTM Network

by Jack Bodine, Bjorn Ludwig, Grace Aronsohn, and Evan Trabitz

There exists a 3,000-year-old Western intellectual history covering most of the important problems in Western thought and high culture, and it involves a kind of thinking that most people do not spend most of their time doing.

It is called philosophy.

Comparatively, the history of modern computer science stretches back a mere 100 years — but far from being a nascent technology, the field has revolutionized the modern world to such an extent that it is difficult to imagine a sector of society yet to be touched by CPUs and lines of code. Indeed, computer science has become indispensable to every other science. Why not philosophy as well?

Many philosophers muse over what Aristotle would say to Hume, or Hegel to Kant, or Deleuze to Nietzsche, etc. These musings have never left the realm of speculation, but with the advent of novel generative neural networks it may be possible to breathe new life into these long-dead thinkers and have them debate to our heart’s content. We have attempted to blend the old with the new to satisfy our curiosity, and even perhaps discover something newer in the process.

Put concretely, we had two goals:

  1. To generate conversations between two philosopher-bots, and in doing so
  2. Possibly discover something new about the philosophers we trained.

Any good model needs data. Fortunately for us, there are many data scientists with an equal interest in philosophy who have generously created data sets. From the creator:

The dataset contains over 300,000 sentences from over 50 texts spanning 10 major schools of philosophy. The represented schools are: Plato, Aristotle, Rationalism, Empiricism, German Idealism, Communism, Capitalism, Phenomenology, Continental Philosophy, and Analytic Philosophy.

We pulled the set off Kaggle, a data science website where users can upload or pull various datasets. We trained on a selection of philosophers spanning each school and era to improve the variety of our generated conversations.

Here is what the data looks like on Kaggle. You can see the different features and individual sentences we amalgamated.

Although the data is quite organized, complete with well-defined features and pre-split sentences, we still needed to do some processing to format it in the way most convenient for training. We fed the .csv data file into a Pandas dataframe, dropped the unnecessary features (publication date, sentence length, edition, etc.), and finally amalgamated the sentences into one giant chunk, which became the input used to train each philosopher model.

import pandas as pd

# convert the data .csv to a pandas data frame
data_raw = pd.read_csv("data/philosophy_data.csv")
# drop unnecessary features
data_raw = data_raw.drop(['sentence_spacy', 'original_publication_date', 'corpus_edition_date', 'sentence_length', 'sentence_lowered', 'lemmatized_str'], axis=1)
# create list of all philosophers, dropping duplicates while preserving order
philosophers = list(dict.fromkeys(data_raw['author'].tolist()))
# 'option' is the philosopher selected for this model (one entry from 'philosophers')
# amalgamate all of that philosopher's sentences into a single chunk,
# which we then use to train the model
selected_data = data_raw.loc[data_raw['author'] == option]
text = ' '.join(selected_data['sentence_str'].tolist())

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) architecture designed to mitigate the vanishing gradient problem, wherein, through the process of back-propagation, the gradient either "vanishes" (ie, tends towards zero) or "explodes" (ie, grows without bound). LSTMs, unlike traditional feedforward neural networks, have feedback connections; this makes them well suited not just to individual data points (eg, images) but also to sequences of data such as language, which is what we use the LSTM for.

An LSTM cell with input and output gates. Image from Wikipedia under CC BY-SA 4.0 License.

We trained a new LSTM model for each philosopher in the data set and had it produce long segments of text, which we then ran through GPT-3, a large natural language processing (NLP) model developed by OpenAI, to generate sensible, clean text. To produce a conversation, we use each response as the seed for the next.
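As a rough sketch of that loop (the helper names generate_lstm_text and clean_with_gpt3 are ours for illustration, not the project's actual function names; the two steps they stand for are sketched further down):

# A minimal sketch of the conversation loop, assuming each philosopher already has
# a trained character-level model and a GPT-3 cleanup step.
def converse(models, opening_seed, turns=6):
    """Alternate between two philosopher-bots, seeding each reply with the last one."""
    seed = opening_seed
    conversation = []
    for turn in range(turns):
        speaker, model = models[turn % 2]        # switch philosopher-bot every turn
        raw = generate_lstm_text(model, seed)    # LSTM rambles on from the seed
        reply = clean_with_gpt3(raw)             # GPT-3 turns the rambling into readable prose
        conversation.append((speaker, reply))
        seed = reply                             # the cleaned reply seeds the next response
    return conversation

# e.g. converse([("Plato", plato_model), ("Hume", hume_model)], "What is virtue?")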

Realizing the limits of our hardware, we trained the models on a Bowdoin College server to increase computational power and allow us to multitask while the models were training. Even with this additional power, training still took ~15 minutes per model, so we typically trained one or two philosophers at a time to test the current model, and if satisfied we would train more.

Our generator uses code taken from Keras.

Our model had two layers: an LSTM with 128 nodes and a dense layer serving as the output. We used the softmax activation function, the categorical crossentropy loss function, and an RMSprop optimizer configured with a learning rate of 0.01.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import RMSprop

model = Sequential()
# single LSTM layer; return_sequences is left off so the dense layer only sees the final timestep
model.add(LSTM(128, input_shape=(seqlen, len(chars))))
model.add(Dense(len(chars), activation='softmax'))
model.compile(
    loss='categorical_crossentropy',
    optimizer=RMSprop(learning_rate=0.01),
    metrics=['categorical_crossentropy', 'accuracy']
)

We trained for 50 epochs with a batch size of 128.
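In code, that training step is a single call to fit; x and y here stand for the one-hot-encoded input sequences and their next-character targets (the variable names are ours, following the usual Keras character-level setup):

# x: one-hot input sequences, shape (num_sequences, seqlen, len(chars))
# y: one-hot next character for each sequence, shape (num_sequences, len(chars))
model.fit(x, y, batch_size=128, epochs=50)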

The results were … mediocre. However, this was expected. As anyone familiar with computer science knows, the chances of a project working on its first try border on miraculous.

Despite what some may have you think, incomprehensibility doesn’t automatically mean good philosophy.

We realized that our model required more complexity to capture the nuances of the language we were attempting to generate. As such, we added a second LSTM layer with dropout after each one, doubled the nodes in each LSTM to 256, and re-trained the philosopher-bots.

from tensorflow.keras.layers import Dropout

model = Sequential()
model.add(LSTM(256, input_shape=(seqlen, len(chars)), return_sequences=True))
# new stuff: a second LSTM layer, with dropout after each LSTM
model.add(Dropout(dropoutFactor))   # dropoutFactor ended up at 0.2 (see below)
model.add(LSTM(256))                # final LSTM layer, so return_sequences is left off
model.add(Dropout(dropoutFactor))
# end new stuff
model.add(Dense(len(chars), activation='softmax'))
model.compile(
    loss='categorical_crossentropy',
    optimizer=RMSprop(learning_rate=0.01),
    metrics=['categorical_crossentropy', 'accuracy']
)

This model worked better, producing (somewhat) comprehensible text. Even more interestingly, although the generated text did not always make sense, the style and themes of each philosopher shone through in their bot, giving us a fascinating window into each writer's style.

We also tried different dropout rates, ranging from 0.2 to 0.7, but since they made little difference to the output we settled on 0.2.

Despite the vast improvements in our generated text, we eventually realized that there is only so far an LSTM can go. To get consistently readable text we had to bring in the big guns. Fortunately for us, GPT-3 has an accessible API and a plethora of options regarding text generation.

It is here that we updated our strategy. Previously, the LSTM would stop generating after 100 characters or at the first period, whichever came first; now it produces a full 250 characters before stopping, no matter what. GPT-3's completion model then reads in that text and summarizes it into a clean, readable format, and it is this cleaned-up version that we display.

Our debugging mode showing the LSTM generated text and how GPT-3 converted it.
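For the curious, here is roughly what that 250-character generation step looks like, following the standard Keras character-level recipe; the function names and the diversity value are our own choices for illustration, and chars, char_to_idx, and seqlen are the character vocabulary, its index mapping, and the sequence length from the training setup:

import numpy as np

def sample(preds, diversity=1.0):
    # reweight the predicted next-character probabilities by the "diversity" (temperature)
    preds = np.log(np.asarray(preds, dtype="float64") + 1e-8) / diversity
    preds = np.exp(preds) / np.sum(np.exp(preds))
    return np.argmax(np.random.multinomial(1, preds, 1))

def generate_lstm_text(model, seed, length=250, diversity=0.5):
    # generate a fixed number of characters, feeding each prediction back in as context
    generated = ""
    window = seed[-seqlen:]
    for _ in range(length):
        x_pred = np.zeros((1, seqlen, len(chars)))
        for t, char in enumerate(window):
            x_pred[0, t, char_to_idx[char]] = 1
        preds = model.predict(x_pred, verbose=0)[0]
        next_char = chars[sample(preds, diversity)]
        generated += next_char
        window = (window + next_char)[-seqlen:]
    return generated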

The addition of GPT-3 radically improved our conversations, but various issues still cropped up. Solving them had us tweaking the (many) small knobs and buttons on GPT-3.

For instance, we ran into issues where GPT-3 would include random text that was definitely not in the corpus of the philosopher whose bot produced it. To solve this, we fiddled with the temperature (lowering it progressively from 0.5 to 0.1), eventually adding a setting to change it dynamically.

Hume is trying to sell us NerdWallet?

Another issue we fixed was the philosopher-bots producing large chunks of repeated text, which is not only incomprehensible but also skews the rest of the conversation, since the repeated portions become the sample for the next response. Changing the diversity resolved this, as GPT-3 was increasingly forced to use different text. The "frequency penalty" setting also helped, as it reduces the probability of a word being reused once it has appeared in the generated text.

Last time we checked the Internet did not exist in ancient Athens…
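As a reference point, a completion call with the settings discussed above might look something like the following under the OpenAI Python library we were using; the engine name, prompt wording, and exact values here are illustrative assumptions rather than our exact configuration:

import openai
# assumes openai.api_key has been set;
# lstm_text holds the raw 250-character output from the philosopher-bot's LSTM
response = openai.Completion.create(
    engine="text-davinci-002",
    prompt="Rewrite the following as a clear, readable statement:\n\n" + lstm_text,
    max_tokens=150,
    temperature=0.1,          # low temperature keeps GPT-3 from inventing random text
    frequency_penalty=1.0,    # penalizes words that have already appeared, curbing repetition
)
clean_text = response.choices[0].text.strip()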

Finally, we addressed the issue of the philosopher-bot responses not quite answering the prompt. They made sense, sure, but in the context of a conversation they had a long way to go. Key to this was our small sequence length of 40, meaning that the sample seeding each response was only 40 characters long. We more than doubled the sequence length to 100, which improved the relevance of the responses.
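To make the role of the sequence length concrete, here is roughly how each philosopher's text gets cut into training sequences in the character-level setup we followed (the step size and variable names are our own assumptions); these are the x and y fed to model.fit earlier:

import numpy as np

seqlen = 100   # previously 40; a longer context window made replies more relevant
step = 3       # assumed stride between consecutive training sequences

# character vocabulary and index mapping for the philosopher's amalgamated text
chars = sorted(set(text))
char_to_idx = {c: i for i, c in enumerate(chars)}

# cut the text into overlapping sequences, each paired with the character that follows it
sentences, next_chars = [], []
for i in range(0, len(text) - seqlen, step):
    sentences.append(text[i:i + seqlen])
    next_chars.append(text[i + seqlen])

# one-hot encode the sequences (x) and their next-character targets (y)
x = np.zeros((len(sentences), seqlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_to_idx[char]] = 1
    y[i, char_to_idx[next_chars[i]]] = 1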

The conversations, while far from fully coherent, are much better than we expected. Frankly, we were doubtful we could get anything on topic at all, let alone at the quality of writing we ended up with. Furthermore, each philosopher-bot maintains both the ideas and even the writing style of the philosopher on which it was trained. This is incredibly interesting and raises further questions regarding the "pragmatic hermeneutics" of philosophy, that is, the interpretation of the actual written word, insofar as each philosopher's style could be intrinsic to the ideas they are trying to explicate, or vice versa.

Here are some great conversations we got:

If you disregard Aristotle referring to himself in the first person, then this is quite a convincing talk.
Some generations were unintentionally hilarious.
Although Wittgenstein is having some trouble, Descartes actually finishes the prompt directly.

If we had more time, we would first retrain the bots with a larger sequence length and further tweaks to the settings. We could also experiment with forcing each philosopher to start with words such as "furthermore," "however," or "although" to generate dialogue that sounds more conversational. Another area of improvement could be the data itself, ie, adding more of it. Each philosopher is pretty well represented, but there are always more sources we could include to increase the variety of material available.

The idea itself, having two neural networks speak to each other in a language we can understand, is very interesting but not investigated as often as other, sexier applications of AI. As such, we did not have a lot to work off of when it came to getting the bots to talk to each other, and with more time we would investigate methods beyond simply seeding each bot with a sample of the previous response.

As a foray into psychological inquiry, our use of sequence length can be considered analogous to short-term memory capacity and its role in conversation, insofar as we had to consider where to position our sequence-length seed. A randomly positioned seed versus one intelligently placed to capture certain keywords could produce different results; further experimentation is required, but it is interesting to consider the reach of our philosopher-bots beyond the realm of philosophy.

A point of reflection could be how we measure the improvement of the models themselves. We disregarded quantitative measures of text generation given the complex nature of what makes a "good" philosophical conversation, and the measures we could use only gauge how closely a bot's text matches the philosopher on which it was trained. It would be interesting to develop a measure of conversation quality and examine how our bots stack up, and what changing the different settings does to that quality.

One possible starting point is the principles of philosophical debate. To get there, we'd have to train models to gauge grammar and on-topic-ness (word embeddings could be one way) and from there build up to the more abstract principles.

Adversarial learning could be how we implement our qualitative conception of a good conversation, in that we could train a discriminator to better train our generator.

Possibly, if all these were to work in generating somewhat-decent debate, we could have an end goal of producing a concept or definition both philosopher-bots agreed upon, ie, a new idea. Of course, this raises questions of what really constitutes a new idea and if these philosopher-bots are really practicing philosophy, but that is a topic for another blog post.

Although our results are not quite at the level of cohesive conversation, we are still surprised by how well the bots performed given how much room there is to improve. There is much that can still be done with our philosopher-bots, and plenty of research left to do on NN-generated conversations.

Now, why does any of this matter? Other than it simply being super cool, we believe that developments in this area of neural networks can yield interesting applications in convincing text generation, as well as in language and even the philosophy of ideas itself, insofar as our philosopher-bots can generate "new ideas." Developments here could also elucidate certain phenomena in the psychology of conversation.

Check out our repository here. Special thanks to Prof. Ulrich Mortensen and Kourosh Alizadeh, the creator of the History of Philosophy dataset.
