Technical details

Automatic Anthologizing Using OpenAI Embeddings

Rationale
Scraping archives
Using the OpenAI API to generate embeddings
Clustering articles
Next steps
Reflections

Rationale

I love the New York Review of Books and the London Review of Books. I read both all the time. My grandfather loved the NYRB, and I really admired his appetite for learning totally new things and developing a global, well-rounded sensibility. I have never regretted reading one of these articles - which is more than I can say for most of the internet! But at the end of each article I always want to know more.

So, I figured out a way to pull together articles on common topics from the archives of these magazines. The idea is that the NYRB, the LRB, and the TLS all cover the same general kinds of topics, so working with all three should provide an interesting corpus of related-but-different perspectives. Kind of like a college reading list.

So the tool I wanted to build is an auto-anthologizer: which articles are related to one another and would be interesting to read side by side?

What I did at a high level:

  1. Scraped recent archives of the NYRB, LRB, and TLS.

  2. Called the OpenAI API to generate text embeddings for every article.

  3. Ran simple k-means clustering to identify groups of articles that belong together.

All of this is written in Clojure. This is the only programming language I know - please don't ask why I chose this one, other than I drank the Paul Graham Kool-aid of LISP and I don't actually need to code for my career :)

Scraping

I had to write a separate scraping tool for each website, but this was straightforward with clj-http and enlive.
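Roughly, each scraper looked something like the sketch below (the selectors here are illustrative, not the actual site markup):

(require '[clj-http.client :as client]
         '[net.cgrand.enlive-html :as html]
         '[clojure.string :as str])

;; Sketch of one scraper: clj-http fetches the page, enlive parses it.
;; The selectors are placeholders, not the real NYRB/LRB/TLS markup.
(defn scrape-article [url]
  (let [page (html/html-snippet (:body (client/get url)))]
    {:url     url
     :title   (-> (html/select page [:h1]) first html/text)
     :author  (-> (html/select page [:div.author]) first html/text)
     :content (->> (html/select page [:div.article-body :p])
                   (map html/text)
                   (str/join "\n"))}))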

At the end of the day I got a map for each article that looked like:

{:review-items
 ({:name "A World Beneath the Sands: The Golden Age of Egyptology",
   :attribution "by Toby Wilkinson",
   :details "Norton, 510 pp., $30.00"}),
 :date "May 13, 2021 issue",
 :hash 1735282417,
 :content [FULL TEXT OF THE ARTICLE],
 :dek "Europeans made—and carried away—many of the most famous discoveries of the “Golden Age of Egyptology,” but Egyptians today are beginning to reclaim their own past.",
 :title "Ancient Egypt for the Egyptians",
 :author "Ursula Lindsey",
 :url "https://www.nybooks.com/articles/2021/05/13/ancient-egypt-for-the-egyptians/"}

OpenAI

The next step was generating the embeddings for each article. I did this at the time of scraping so my database also contained a vector of embeddings from OpenAI:

(defn calculate-embedding [f]
  ;; POST the article text to the OpenAI embeddings endpoint. The callback
  ;; returns the raw JSON body; the caller gets it by deref-ing the result.
  (http/post "https://api.openai.com/v1/embeddings"
             {:headers {"Content-Type"  "application/json"
                        "Authorization" (str "Bearer " api-key)}
              :timeout 60000
              :body    (json/write-str {:input f
                                        :model "text-embedding-ada-002"})}
             (fn [{:keys [status headers body error]}] ;; asynchronous response handling
               (if error
                 (println "Failed, exception is " error)
                 body))))

(defn extract-embeddings [i]
  ;; Parse the JSON response and pull out the embedding vector itself.
  (get (first (get (json/read-str (deref (calculate-embedding i))) "data")) "embedding"))

This added an :embeddings key to each article map, containing the embedding vector.
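Concretely, each scraped article picked up its embedding with something like this (a sketch; the helper name is mine):

;; Sketch: attach the OpenAI embedding of the article's full text to the map.
(defn add-embedding [article]
  (assoc article :embeddings (extract-embeddings (:content article))))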

Fortunately, all of the articles are under the 8,191-token limit of the text-embedding-ada-002 model, so I didn't need to worry about splitting up the articles. And yes, I know that I could do a better job with error handling :).

So - that left me with a database of articles, with their full content and a vector containing what they were "about" (i.e. the embedding).

I could already do some cool things like figure out the next recommended article using a cosine similarity function. Many thanks to this post (Why I sometimes like to write my own number crunching code) for helping explain how to run cosine similarity much more efficiently.
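For reference, a plain (unoptimized) cosine similarity over two embedding vectors is only a few lines of Clojure:

;; Plain cosine similarity between two embedding vectors.
(defn dot [a b] (reduce + (map * a b)))

(defn norm [a] (Math/sqrt (dot a a)))

(defn cosine-similarity [a b]
  (/ (dot a b) (* (norm a) (norm b))))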

Clustering

The next step was to run some simple k-means clustering to find the groups of articles that were similar.

To do this, I called the Weka library from within Clojure. After I did some yak-shaving to figure out how to get the Clojure data into ARFF and loaded into Weka (there's a sketch of that conversion after the clustering code below), I copied some code from clj-weka to run the clustering:

;; Requires (:import (weka.clusterers SimpleKMeans)
;;                   (weka.core.converters ArffLoader)
;;                   (java.io InputStream)) in the ns declaration.
(defn cluster
  [data]
  ;; data is an ARFF-formatted string; Weka's ArffLoader reads it from a
  ;; stream, then SimpleKMeans is built over the resulting Instances.
  (let [kmeans    (SimpleKMeans.)
        loader    (ArffLoader.)
        _         (.setSource loader ^InputStream (string->stream data))
        instances (.getDataSet loader)
        _         (.buildClusterer kmeans instances)]
    {:assignments (seq (.getAssignments kmeans))   ;; cluster index per article
     :centers     (map #(.toDoubleArray %)          ;; centroid vector per cluster
                       (.getClusterCentroids kmeans))}))
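The ARFF yak-shaving mostly meant writing the ARFF string by hand, one numeric attribute per embedding dimension and one data row per article, plus a string->stream helper along these lines (a sketch; the attribute names are arbitrary):

;; Sketch: build the ARFF string by hand from the article embeddings.
(defn articles->arff [articles]
  (let [dims (count (:embeddings (first articles)))]
    (str "@RELATION articles\n"
         (apply str (for [i (range dims)]
                      (str "@ATTRIBUTE dim" i " NUMERIC\n")))
         "@DATA\n"
         (apply str (for [a articles]
                      (str (apply str (interpose "," (:embeddings a))) "\n"))))))

;; The helper the cluster function uses to feed the ARFF string to Weka.
(defn string->stream [s]
  (java.io.ByteArrayInputStream. (.getBytes ^String s "UTF-8")))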

The cluster function returned the cluster that each article was assigned to (:assignments) and the center of each cluster (:centers). I also tried hierarchical clustering but couldn't get it to work.

Since I didn't really know how many clusters or topics I wanted, I decided to first ask for a lot of clusters (around 200) and then filter out clusters with too few articles to be interesting.

After that, I had a bunch of clusters with too many articles! So I did some recursive clustering: for every cluster with more than N articles (I chose 10), I re-clustered the data into two clusters, and repeated this until every cluster had fewer than N articles.

(defn recursive-clustering [clusters N]
  ;; Find the largest cluster; if it has fewer than N articles we're done,
  ;; otherwise split it into two sub-clusters and recurse.
  (let [f    (frequencies (map :cluster clusters))
        mx   (reduce max (vals f))
        c    (first (filter #(= (get f %) mx) (keys f)))
        data (filter #(= (:cluster %) c) clusters)]
    (if (< mx N)
      clusters
      (let [new-clusters       (weka/cluster-data data 2)
            ;; label the two sub-clusters "parent.0" and "parent.1"
            new-cluster-labels (map #(update-in % [:cluster] (fn [x] (str c "." x))) new-clusters)
            combined-data      (concat (remove #(= (:cluster %) c) clusters) new-cluster-labels)]
        (recursive-clustering combined-data N)))))
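Kicking the whole thing off then looks roughly like this (a sketch; the starting collection name and the exact shape weka/cluster-data expects are glossed over):

;; Sketch of the driver: ask for ~200 initial clusters, then recursively
;; split any cluster that still has 10 or more articles.
(def final-clusters
  (recursive-clustering (weka/cluster-data articles-with-embeddings 200) 10))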

This gave me a few interesting things:

  • Based on the centroid of the "topic", I could tell which articles were most closely related to the core embedding of the topic.

  • I could also tell which topics / clusters were most closely related to one another.

Next steps

The embeddings also unlock additional functionality that I haven't really leveraged yet:

  1. Treating a single article as a "source doc" and pulling the five or six articles most closely related to it (sketched after this list).

  2. Implementing a general semantic search engine on top of the archives.
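The first of these is mostly a sort over the cosine similarities from earlier; a sketch:

;; Rank the other articles by cosine similarity to the source article's
;; embedding and keep the top n.
(defn related-articles [source articles n]
  (->> articles
       (remove #(= (:url %) (:url source)))
       (sort-by #(cosine-similarity (:embeddings source) (:embeddings %)) >)
       (take n)))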

I also wasn't done with OpenAI's capabilities - to make this fully automatic, I wrote some additional API calls to:

  1. Summarize the articles and headlines within a general cluster (sketched after this list).

  2. Classify the Dewey Decimal category that best represents the topics of the articles.

  3. Call DALL-E to generate wallpaper art for the Substack.
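The summarization call, for instance, is just another POST against the OpenAI API, in the same style as calculate-embedding above (the model choice and prompt wording here are illustrative):

;; Sketch: summarize a cluster's headlines with a chat-completion call.
(defn summarize-cluster [titles]
  (http/post "https://api.openai.com/v1/chat/completions"
             {:headers {"Content-Type"  "application/json"
                        "Authorization" (str "Bearer " api-key)}
              :timeout 60000
              :body    (json/write-str
                        {:model    "gpt-3.5-turbo"
                         :messages [{:role    "user"
                                     :content (str "Summarize the common theme of these article titles: "
                                                   (pr-str titles))}]})}
             (fn [{:keys [body error]}]
               (if error
                 (println "Failed, exception is " error)
                 body))))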

Reflections

This was successful because I want to read EVERY article in the NYRB and LRB. There aren't really that many bad or worthless articles. Sometimes the embeddings strike out and the groupings don't make sense, but other times they make very interesting connections. That being said, the non-fiction anthologies are probably the most practical.

I find these anthologies useful for myself, and I decided to put them up on this Substack in case they are useful for you as well. I am going to try to go into depth on a topic each weekend; hopefully others will be interested in joining me. Subscribers also get access to the full catalog of topics (which I am still developing and updating). These weekly topics also allow for some more hand-curation of the set of articles.