
Unraveling AI Complexity: A Flesch-Kincaid Analysis of AI-Generated Text

[Written by ChatGPT. Main image: “Create a symbolic representation of three AI entities juggling books,” SD 2.1]

AI-generated text is permeating all corners of our digital life. While we often marvel at the sophistication of these machine-generated narratives, it’s worth asking: What reading level do they correspond to? Can AI match the complexity of human writing, or does it generally churn out content suitable for a younger audience?

To explore these questions, we at Neural Imaginarium used the Flesch-Kincaid readability test, a method widely employed in education to gauge the complexity of a text. Scores fall roughly on a 0-100 reading-ease scale, and lower scores indicate more complex text. In our study, we prompted three AI models, ChatGPT-4, ChatGPT-3.5, and Bard, to produce various types of content: a blog post, a short murder mystery, a code summary, a Shakespearean sonnet, a college-level text emulating Edgar Allan Poe’s style, and a high school level book report on The Great Gatsby.
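For readers who want to try this kind of measurement themselves, the sketch below shows one way to score a passage. It uses the open-source textstat Python package as an illustration; the post does not say which tool produced the numbers reported here, and the sample texts are placeholders.

```python
# Minimal sketch of scoring generated text, assuming the open-source
# "textstat" package (pip install textstat). This is an illustration,
# not necessarily the tool used for the scores reported in this post.
import textstat

# Hypothetical snippets standing in for the model outputs being scored.
samples = {
    "blog_post": "AI-generated text is permeating all corners of our digital life...",
    "murder_mystery": "The rain hammered the windows of the manor as the lights went out.",
}

for name, text in samples.items():
    # Flesch Reading Ease: higher scores mean easier text, lower means more complex.
    ease = textstat.flesch_reading_ease(text)
    # Flesch-Kincaid Grade Level: approximate US school grade needed to follow the text.
    grade = textstat.flesch_kincaid_grade(text)
    print(f"{name}: reading ease {ease:.1f}, grade level {grade:.1f}")
```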

The results were intriguing.

For the blog post, GPT-4 scored 40, pointing towards college-level complexity. Bard was not far behind, with a score of 36. Surprisingly, GPT-3.5 came in with the lowest score of 22, indicating the most complex text of the three, at a college graduate reading level.
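For reference, the scores throughout this post can be read against the conventional Flesch reading-ease bands; a small helper makes the mapping explicit.

```python
def reading_level(ease_score: float) -> str:
    """Map a Flesch reading-ease score to its conventional audience band."""
    bands = [
        (90, "very easy (about 5th grade)"),
        (80, "easy (about 6th grade)"),
        (70, "fairly easy (about 7th grade)"),
        (60, "plain English (8th-9th grade)"),
        (50, "fairly difficult (high school)"),
        (30, "difficult (college)"),
    ]
    for threshold, label in bands:
        if ease_score >= threshold:
            return label
    return "very difficult (college graduate)"

print(reading_level(40))  # difficult (college): GPT-4's blog post
print(reading_level(22))  # very difficult (college graduate): GPT-3.5's blog post
```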

When asked to create a short murder mystery, all three models produced text that would be comfortably understood by a middle schooler. GPT-4 scored 65, Bard 64, and GPT-3.5’s text was even simpler, with a score of 76.

The code summary task demonstrated a sharp divide between the models. GPT-4 produced a college-level text with a score of 47, while Bard was unable to fulfill the request. A score for GPT-3.5’s summary was not available.

In crafting Shakespearean sonnets, all three models remarkably produced text suitable for an elementary school audience. GPT-4 and Bard both scored 96, while GPT-3.5 was slightly more complex with a score of 89.

The task of emulating Edgar Allan Poe’s writing style at a college reading level instead yielded text suitable for a much younger audience. GPT-4 scored 79, GPT-3.5 came in at 81, and Bard, usually proficient with poetry, gave us text at a 5th-grade reading level with a score of 99.

Finally, when asked to write a high school level book report on The Great Gatsby, both versions of ChatGPT generated text at the college level, with GPT-4 scoring 36 and GPT-3.5 a bit higher at 41. Bard was unable to complete this task.

The takeaway from this experiment? AI models exhibit substantial variability in the complexity of their generated text, depending on the type of prompt. Interestingly, both versions of ChatGPT consistently produced more complex text when given non-fiction tasks. This might be due to their training data: these models are trained on large volumes of internet text, which skews towards more complex, non-fiction content.

That said, for all its prowess, AI text generation may still require careful curation and calibration when specific reading levels are targeted, particularly for fiction and more creative endeavors. This area is ripe for further exploration as we continue to decipher the capabilities and limitations of AI in text generation.

