ChatGPT can ace logic tests now. But don’t ask it to be creative.

When the new version of the artificial intelligence tool ChatGPT arrived this week, I watched it do something impressive: solve logic puzzles.

One after another, I fed the AI, known as GPT-4, questions from the logical reasoning portion of the LSAT, the test used for law school admissions. Those always leave me with a headache, yet the software aced them like a competent law student.

But as cool as that is, it doesn’t mean AI is suddenly as smart as a lawyer.

The arrival of GPT-4, an upgrade from OpenAI to the chatbot software that captured the world’s imagination, is one of the year’s most-hyped tech launches. Some feared its uncanny ability to imitate humans could be devastating for workers, be used as a chaotic “deepfake” machine or usher in an age of sentient computers.

That is not how I see GPT-4 after using it for a few days. While it has gone from a D student to a B student at answering logic questions, AI hasn’t crossed a threshold into human intelligence. For one, when I asked GPT-4 to flex its improved “creative” writing capability by crafting the opening paragraph to this column in the style of me (Geoffrey A. Fowler), it couldn’t land on one that didn’t make me cringe.

But GPT-4 does add to the challenge of unraveling how AI’s new strengths — and weaknesses — might change work, education and even human relationships. I’m less concerned that AI is getting too smart than I am about the ways AI can be dumb or biased in ways we don’t know how to explain and control, even as we rush to integrate it into our lives.

These aren’t just theoretical questions: OpenAI is so confident in GPT-4 that it introduced the model alongside commercial products that already use it, including teaching languages in Duolingo and tutoring kids in Khan Academy.

Anyone can use GPT-4, but for now it requires a $20 monthly subscription to OpenAI’s ChatGPT Plus. It turns out millions of people have already been using a version of GPT-4: Microsoft acknowledged this week that GPT-4 powers the Bing chatbot the software giant added to its search engine in February. The companies just hadn’t revealed that until now.

So what’s new? OpenAI claims that by optimizing its “deep learning,” GPT-4’s biggest leaps have been in logical reasoning and creative collaboration. GPT-4 was trained on data from the internet that goes up through September 2021, which means it’s a little more current than its predecessor GPT-3.5. And while GPT-4 still has a problem with randomly making up information, OpenAI says it is 40 percent more likely to provide factual responses.

GPT-4 also gained an eyebrow-raising ability to interpret the content of images — but OpenAI is locking that down while it undergoes a safety review.

What do these developments look like in use? Early adopters are putting GPT-4 up to all sorts of colorful tests, from asking it how to make money to asking it to code a browser plug-in that makes websites speak Pirate. (What are you doing with it? Email me.)

Let me share two of my tests that help show what this thing can — and can’t — do now.

Logical reasoning

We’ll start with the test that most impressed me: watching GPT-4 nearly ace the LSAT.

I tried 10 sample logical reasoning questions written by the Law School Admission Council on both the old and new ChatGPT. These aren’t factual or rote memorization questions — they’re multiple-choice brain teasers that present a whole bunch of different facts and then ask you to sort them out.

When I ran them through GPT-3.5, it got only 6 out of 10 correct.

GPT-4 got 9 out of 10.
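If you’re curious how a head-to-head test like this can be scripted, here’s a minimal sketch assuming the openai Python library as it worked at launch (the interface has since changed). The question text and answer key are hypothetical placeholders, since the Law School Admission Council’s material is copyrighted:

```python
# A minimal sketch of the head-to-head LSAT test. The questions and
# answer key below are hypothetical placeholders, not actual LSAC material.
import openai

openai.api_key = "sk-..."  # your API key

# Each entry pairs a full multiple-choice question with its correct letter.
QUESTIONS = [
    {"text": "<full logical reasoning question, with choices A through E>",
     "answer": "C"},
    # ... nine more questions ...
]

def score(model: str) -> int:
    correct = 0
    for q in QUESTIONS:
        response = openai.ChatCompletion.create(
            model=model,
            temperature=0,  # keep answers as deterministic as possible
            messages=[{
                "role": "user",
                "content": q["text"] + "\n\nAnswer with a single letter, A-E.",
            }],
        )
        reply = response["choices"][0]["message"]["content"].strip().upper()
        if reply.startswith(q["answer"]):
            correct += 1
    return correct

for model in ("gpt-3.5-turbo", "gpt-4"):
    print(model, "scored", score(model), "out of", len(QUESTIONS))
```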

What’s going on? In puzzles that GPT-4 alone got right, its responses show it stays focused on the link between the presented facts and the conclusion it needs to support. GPT-3.5 gets distracted by facts that aren’t relevant.

OpenAI says a number of studies show GPT-4 “exhibits human-level performance” on other professional and academic benchmarks. GPT-4 scored in the 90th percentile on the Uniform Bar Exam — up from the 10th percentile for the previous version. It scored in the 93rd percentile on the SAT reading and writing test, and even in the 88th percentile on the full LSAT.

We’re still untangling what this means. But a test like the LSAT is built from clearly organized information, the kind of thing machines excel at. Some researchers argue these sorts of tests aren’t a useful way to assess improvements in a machine’s reasoning.

But it does appear GPT-4 has improved its ability to follow complex instructions that involve lots of variables, something that can be difficult or time-consuming for human brains.

So what can we do with that? Since it nearly aced the LSAT, I called Casetext, a legal software company that has had access to GPT-4 for the past few months. It has decided it can now sell the AI to help lawyers, not replace them.

The AI’s logical reasoning “means it is ready for professional use in serious legal affairs” in a way previous generations were not, CEO Jake Heller said. Like what? He says his product, CoCounsel, has been able to use GPT-4 to process large piles of legal documents and scan them for potential sources of inconsistency.

Another example: GPT-4 can interrogate client guidelines — the rules for what a client will and won’t pay for — to answer questions like whether the client will cover the cost of a college intern. Even if the guidelines don’t use the exact word “intern,” CoCounsel’s AI can understand that an intern would also be covered by a prohibition on paying for “training.”
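Casetext hasn’t published its code, but the underlying technique is simple to illustrate: hand the model the rules and the question in a single prompt and let it reason about meaning rather than match keywords. Here’s a minimal sketch with made-up guidelines, again assuming the launch-era openai library:

```python
# A conceptual sketch, not Casetext's actual implementation: pack the
# client's billing rules and a yes-or-no question into one prompt so the
# model can reason about meaning ("intern" falls under "training")
# instead of matching exact words.
import openai

openai.api_key = "sk-..."  # your API key

guidelines = """(Hypothetical client billing guidelines)
We will not pay for: travel time, administrative tasks,
or time spent training personnel."""

question = "Will this client cover the cost of a college intern?"

response = openai.ChatCompletion.create(
    model="gpt-4",
    temperature=0,  # keep answers consistent for a compliance check
    messages=[
        {"role": "system",
         "content": "Answer strictly from the guidelines provided. "
                    "Reply yes or no, then cite the relevant rule."},
        {"role": "user", "content": guidelines + "\n\nQuestion: " + question},
    ],
)
print(response["choices"][0]["message"]["content"])
```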

But what if the AI gets it wrong, or misses an important logical conclusion? The company says it has seen GPT-4 mess up, particularly when math is involved. But Heller said human legal professionals also make mistakes, and he sees GPT-4 only as a way to augment lawyers. “You are not blindly delegating a task to it,” he said. “Your job is to be the final decision-maker.”

My concern: When human colleagues make mistakes, we know how to teach them not to do it again. Controlling an AI is at best a complicated new skill — and at worst, something we’ve seen AI chatbots like Microsoft’s Bing and Snapchat’s My AI struggle with in embarrassing and potentially dangerous ways.

Creative writing

To test GPT-4’s creative abilities, I tried something closer to home: replacing me, a columnist who has views on everything tech-related.

When ChatGPT first arrived, much of the public concern was rightly about its impact on the world of human activity that involves words, from storytelling to therapy. Students and professionals have found it capable of aiding or completing assignments.

But for many creative professionals, the AI writing just didn’t seem very good. Songwriter Nick Cave said an attempt to use ChatGPT to write in his style was a “grotesque mockery of what it is to be human.”

In GPT-4, OpenAI claims it has improved the model’s ability to generate, edit and iterate on both creative and technical writing tasks. There’s a “temperature” setting you can adjust to control how creative its responses are. It can also take detailed instructions on style and tone because it supports prompts of up to 25,000 words. In theory, you should be able to share a whole bunch of your writing and say: match it.

So that was my creative challenge for GPT-4: Write an introductory paragraph to this column that sounds like me — or something I wish I had written.


To do it, I gathered a long sample of my recent columns to teach GPT-4 what I sound like. I had to use the commercial version of GPT-4 that OpenAI sells to corporations, because the public ChatGPT can’t yet accept that much text.

The prompt on top of my writing sample read:

Write a pithy opening paragraph to a Washington Post tech column about the arrival of a new version of ChatGPT, which runs on GPT-4. It should hook readers with the fact that the artificial intelligence tool has gained the remarkable ability to solve logic puzzles from the LSAT. It should match the writing style of Geoffrey A. Fowler based on the following samples of his columns in The Washington Post.
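For those wondering how a test like this translates into code: below is a minimal sketch, once more assuming the launch-era openai library. The file of writing samples is a stand-in, and swapping the model name is how the old and new versions can be compared:

```python
# A minimal sketch of the style-matching test: the instruction plus
# thousands of words of sample columns go into a single prompt, and
# "temperature" dials the creativity of the output up or down.
import openai

openai.api_key = "sk-..."  # your API key

instruction = (
    "Write a pithy opening paragraph to a Washington Post tech column "
    "about the arrival of a new version of ChatGPT, which runs on GPT-4. "
    "..."  # the rest of the prompt quoted above
)

# A stand-in file name: ~6,000 words of past columns for the model to emulate.
writing_samples = open("columns_sample.txt").read()

response = openai.ChatCompletion.create(
    model="gpt-4",  # swap in "gpt-3.5-turbo" to compare the old version
    temperature=1.0,  # the middle of the API's 0-to-2 range
    messages=[{"role": "user",
               "content": instruction + "\n\n" + writing_samples}],
)
print(response["choices"][0]["message"]["content"])
```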

Across dozens of tests, the results offered by GPT-3.5 gave me hives, like this:

Welcome to the future, where man and machine are now embroiled in a battle of wits that would make even the likes of Sherlock Holmes quiver with trepidation. As the latest iteration of OpenAI’s ChatGPT — the prodigious offspring of the GPT-4 family — makes its grand entrance, it brings with it an astonishing new ability: solving logic puzzles ripped straight from the pages of the LSAT. Yes, you read that right, dear reader: the very exam that has left countless aspiring lawyers grappling with their own cerebral limits now faces a worthy opponent in the form of an AI that puts the ‘A’ in aptitude.

My results from GPT-4, which had access to 6,000 words of my columns to emulate my style, were better. But not much. For example, this is what it produced with the “temperature” setting in the middle:

Prepare to be awestruck by the next evolution in artificial intelligence: ChatGPT, the prodigious brainchild of OpenAI, has undergone a remarkable transformation. Harnessing the power of GPT-4, the latest iteration of ChatGPT can now tackle logic puzzles straight from the LSAT — a feat even more astounding than debating Elon Musk or concocting a delectable recipe for beef stroganoff. Tech aficionados and baffled bystanders, brace yourselves; the world is about to change.

I certainly hope that doesn’t sound like me. (I don’t even like beef stroganoff.) Across our more than 30 tries, it generated phrases including: “Hold on to your hats, logic lovers” and “Ladies and gentlemen, fasten your seat belts as we embark on a roller-coaster ride with GPT-4, where logic leaps and acrobatics abound.”

My access to GPT-4 maxed out at a 6,000-word prompt, so I asked OpenAI directly to help run the same prompt but with 25,000 words of my writing to copy my style. The result — with “temperature” set to 0.2 out of 2 — sounds more sane, but still not worthy of the top of this column:

ChatGPT has always been good at mimicking human conversation. But with the arrival of GPT-4, the artificial intelligence tool has taken a big leap forward: It can now solve logic puzzles from the LSAT. That’s right, the test used to get into law school. As AI continues to evolve at a breakneck pace, it’s worth asking: What else will these systems be able to do?

Perhaps most unnerving was how GPT-4 leaped to conclusions and facts that were not part of my prompt. Across our different trials, it wrote “Step aside, humans” (yikes!) and that GPT-4 is “adaptable, intuitive, and more humanlike than ever before.”

The experience made me wonder: What problem are we trying to solve by building out the creative capabilities of AI? I appreciate that AI writing tools can help people who aren’t professional writers get started on business documents, research reports and even personal correspondence.

But developing the ability to control and even emulate style takes it into the realm of trying to ape artists. Clearly, I don’t fear losing my job as a writer to GPT-4. Ask me again on GPT-5.

Dylan Freedman contributed to this report.
