The news quiz is a tradition at TIME that dates back to 1935. Iterations of the test were used in schools across the country to examine current-affairs knowledge, and it even came in a crossword version.
Now, the recent removal of TIME’s digital paywall has opened up a century of journalism for everyone, ripe for testing your knowledge about the people who shaped history. Since TIME’s archive contains 200 million words, it’s a task that’s well-suited for the new generation of AI technology, which is able to analyze huge amounts of human-generated text in seconds.
So what happens when you turn the power of cutting-edge AI to the task of generating news quizzes based on magazine articles?
Below, you’ll find 10 quizzes that we trained the technology behind ChatGPT to produce, based on 10 stories hand-picked from the TIME archives, which are now available to everyone free of charge. Simply click on the article headline, next to the original issue date, to jump to the story on which each quiz is based. Below the interactive, we discuss how we negotiated with artificial intelligence to teach it to do what we asked.
How It Works
Given some of the truly astonishing outputs that ChatGPT can produce—a plot for a sci-fi novel, say, or mock Biblical texts—producing a quiz may seem like a trivial assignment (so to speak). And at first blush, it is. When we asked ChatGPT to simply “make a quiz based on this article,” and provided a link to TIME’s 2014 cover story on Taylor Swift, it promptly spat out a 10-question quiz with four choices for each answer.
Some of the questions were right on. (Q: Taylor Swift’s fans are famously referred to as what? A: Swifties.) But many referred to albums and events that occurred well after the story’s publication, and one was just wrong. (“Which event led her to publicly endorse a political candidate for the first time?” ChatGPT claimed it was the 2020 election, but backtracked and apologized when we reminded it that she endorsed two Tennessee Democrats in 2018.)
In many cases, ChatGPT and its various rivals may seem indistinguishable from magic. So it’s instructive to find assignments where the bots aren’t immediately capable of near-perfection. Every failure is a clue as to what’s going on under the hood.
So let’s break down what goes into a multiple-choice quiz and what it requires a machine to do: read the text, pick out discrete facts, write trivia-style questions about those facts, and invent wrong answers plausible enough to make the right one a challenge.
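Concretely, the machine has to produce something with a simple, regular shape. A minimal sketch in Python—the class and field names are our illustration, not anything TIME or OpenAI defines:

```python
from dataclasses import dataclass

@dataclass
class QuizQuestion:
    question: str       # a trivia-style question grounded in the article
    options: list[str]  # four choices: one correct, three plausible decoys
    answer: int         # index of the correct option, ideally varied
```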
For a human being—particularly one who has seen a few news quizzes and is familiar with the exercise—that breakdown is probably more instruction than necessary. But until recently, it wouldn’t have been enough for a machine. A year ago, this exercise would have involved writing a lot of code, picking between different algorithms and pre-trained language models, and constantly tweaking the “hyperparameters,” or human-defined starting conditions for the training process.
In this new world, the task is somewhere in the middle. Instead of writing instructions in Python, where a single misplaced keystroke can derail the whole operation, you deliver the instructions to the machine in plain English, as precisely and literally as you can.
This is known as a “chain of thought” prompt, which you can deliver directly to the OpenAI API, bypassing the conversation with a chatbot and interfacing instead with ChatGPT’s brain. You still use a language like Python to make the introduction, but it’s the bot that’s doing all the hard work.
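A minimal sketch of what that introduction might look like, using the pre-1.0 OpenAI Python SDK—the model name, file name, and prompt wording here are our stand-ins, not TIME’s actual setup:

```python
import openai  # pre-1.0 OpenAI Python SDK; reads OPENAI_API_KEY from the environment

# Illustrative stand-ins: the model name, file name, and prompt wording
# are ours, not TIME's actual instructions.
article_text = open("taylor_swift_2014.txt").read()

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You write multiple-choice news quizzes."},
        {"role": "user", "content": "Using only the article below, write 10 "
                                    "multiple-choice questions, each with four "
                                    "options.\n\n" + article_text},
    ],
    temperature=0,  # zero randomness: identical input yields identical output
)

print(response["choices"][0]["message"]["content"])
```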
We sent a version of the above instructions to the API and set the “temperature”—the parameter that controls how much randomness goes into the results—to zero, meaning the model would respond the same way each time we sent it identical commands. When we fed it the same Taylor Swift story, we got back another set of 10 multiple-choice questions. Here’s one:
Who was named Billboard’s woman of the year for 2014?
a: Rihanna
b: Taylor Swift
c: Lady Gaga
d: Beyonce
Any guesses? Hint: The answer to five of the other nine questions was also “Taylor Swift.”
Our first elaboration was to ask the model to hide the ball better and to keep the answers limited to the article text, rather than fall back on whatever it knows from the enormous amount of text it has analyzed in the past. It can only handle about 2,000 words at a time, so in most cases we had to break stories into chunks of complete paragraphs—along the lines of the sketch below.
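One way to do that splitting, keeping paragraphs intact under a rough word budget—the function name and the 2,000-word default are our assumptions:

```python
def chunk_paragraphs(text: str, max_words: int = 2000) -> list[str]:
    """Split an article into chunks of whole paragraphs, each roughly under max_words."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        # Close out the current chunk rather than splitting a paragraph in two.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```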
The instructions we settled on looked something like the paraphrase below.
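Rendered as a Python prompt template, a sketch along those lines—with the step wording and the five-fact count as our guesses based on the questions discussed below, not TIME’s actual text—might read:

```python
# A hypothetical reconstruction of the prompt's shape, not TIME's actual wording.
QUIZ_PROMPT = """\
Step 1: Read the article text below. Use only facts that appear in it.
Step 2: List five discrete facts from the text.
Step 3: For each fact, write one trivia-style question. Do not use phrases
like "according to the text."
Step 4: Give four options per question: one correct answer and three
plausible but wrong alternatives. Vary which option is correct.

Article:
{article_chunk}
"""
# Fill the placeholder per chunk: QUIZ_PROMPT.format(article_chunk=chunk)
```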
In the initial trials, we found that the output often included phrases like “according to the text,” as if interrogating the user on whether they had actually read the article. It had trouble remembering that it was supposed to be writing trivia-style questions, not reading-comprehension tests. For a quiz based on a 2016 obituary for Muhammad Ali, it sometimes referred to the boxer in questions as “Cassius Clay”—and also quizzed users on Ali’s original name.
As inscrutable as artificial intelligence can often seem, the beauty of chain-of-thought prompting is that we could ask the model what it was “thinking” at each step of the process and adjust the language to tease out the best results. Should the machine retrieve all the facts? Just three facts? Five? How can we ask it to stop using the phrase “according to the text”?
All these dilemmas were natural byproducts of the fact that, while plain-language instructions are easier to construct than ones written in code, they are, at times, much more difficult to debug. At one point, we even fed the instructions back into the model to ask what it thought of their wording, and how we might revise them to get more consistent outputs. Its thoughts were helpful.
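That feedback loop is just another API call. A minimal sketch, assuming the same pre-1.0 SDK as above and a hypothetical quiz_prompt.txt file holding the instructions being debugged:

```python
import openai  # pre-1.0 OpenAI Python SDK, as in the earlier sketches

# Hypothetical file holding the quiz instructions under debugging.
instructions = open("quiz_prompt.txt").read()

critique = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": "Here are instructions we send to a language model:\n\n"
                   + instructions
                   + "\n\nHow would you reword them so the output is more "
                     "consistent, trivia-style questions?",
    }],
    temperature=0,
)
print(critique["choices"][0]["message"]["content"])
```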
The results did require a round of finessing by TIME editors, mainly to remove options that were difficult to parse or too obscure years later. Every question that got cut becomes one we can ask the model to avoid in future attempts.
This is what a lot of modern computer programming may look like in years to come: Humans and machines collaborating in the former’s language and the latter’s logic to complete tasks and solve problems. Those who herald an end to computer programming may be correct that future developers will rely less on formal computer languages to write software. But if this exercise is any guide, they will still need to think like programmers.
Update, June 7: For a question about the outcome of Muhammad Ali’s 1974 “Rumble in the Jungle” match with George Foreman, the AI-generated answer originally read: “Ali won a unanimous decision over Foreman.” While Ali did lead in all three scorecards at the conclusion, the match ended in a knockout in the eighth round when Foreman failed to get up in time and was counted out. When asked to clarify, the model acknowledged that Ali did win in a knockout. The answer has been updated—and the humans reminded of how much AI still has left to learn.
Write to Chris Wilson at [email protected]