
In the past year, the use of generative AI tools such as ChatGPT has exploded in popularity. ChatGPT can help people write emails, essays, summaries, and creative stories almost instantly. As these tools become more widespread, demand has grown for authentic text written by real humans. So an important question has come up: can we distinguish between human-written and AI-generated text? There is already a growing body of research in this area (Fraser 2025; Georgiou 2025; Fiedler et al. 2025).
This project aims to answer that question using data science techniques. It is important to define the problem carefully, because the term “AI-generated” means different things to different people. If a human uses AI to generate text (AI-assisted), does that still count as human-written? Should there be three categories (human-written, AI-assisted, and AI-generated)? And if so, how much assistance can a human receive from AI before the text is classified as AI-generated? For this project, we will keep it simple: we will compare text written entirely by humans (unassisted by AI) with text written by AI (produced by generative tools such as ChatGPT).
Collecting the Dataset
To conduct this study, we will collect written samples from English speakers of different ages and educational backgrounds. All participants will respond to the same set of prompts, and we will then have ChatGPT respond to those same prompts. This way, we get two comparable sets of responses, and we know the ground truth.
Examples of prompts might include questions such as:
- “What does a typical weekend look like for you?”
- “Describe a memorable experience from your childhood.”
- “What are your thoughts on working or studying remotely?”
These types of questions allow for variation in tone, structure, and vocabulary, while still keeping the topic consistent across participants.
For each prompt, we expect to see noticeable differences between human-written and AI-generated responses. Human responses may include personal anecdotes, informal language, minor grammatical inconsistencies, or abrupt shifts in thought.
For example, a human might write something like:
“I try to keep my weekends as relaxed as possible. First I have to finish all my MDS assignments on Saturday. But I will make sure I do my 5km Parkrun, and have a nice breakfast somewhere. Sometimes I will also meet up with friends and watch a movie.”
In contrast, a response generated by ChatGPT to the same prompt might be more structured and polished, such as:
“On weekends, many people take time to relax and recharge after a busy workweek. Common activities include completing household chores, spending time with friends, and engaging in hobbies that promote well-being.”
While both answers address the same question, they differ in sentence structure and writing style. These differences form the basis of the features that will later be used for classification.
The Dataset
Here is an example of what the dataset might look like:
| Question | Response | Target |
|------------------------------------------------|-----------------------------------------------------|--------|
| What does a typical weekend look like for you? | I try to keep my weekends as relaxed as possible... | Human |
| What does a typical weekend look like for you? | On weekends, many people take time to relax... | AI |
| Describe a memorable experience from childhood | When I was 7, I went fishing for the first time... | Human |
| Describe a memorable experience from childhood | Childhood memories often hold a special place... | AI |
| What are your thoughts on remote work? | Honestly, I love working from home. No commute... | Human |
Time to Apply Data Science Techniques! (Advanced Reading)
After collecting all the text samples (from humans and ChatGPT), we will build a data science model to identify whether a given text was written by a human or by AI. A model is simply a mathematical formula that learns from our examples (the dataset) and makes predictions on new data. To build this model, we look for patterns in the text such as sentence length, word frequency, punctuation frequency, and diversity of vocabulary. When we combine multiple clues like these, they can start to reveal whether the text was written by a human or by AI.
To create (or train) a model, we need to first transform the raw text into numerical features through a process called feature engineering. This involves extracting measurable characteristics (the patterns we mentioned earlier) from each response. Once we have these features, we can feed them into a classification model and measure how well it performs using a metric called accuracy. The higher the accuracy, the better our model is at detecting if a text was written by a human or AI.
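To make feature engineering concrete, here is a minimal sketch in Python of how a single response could be turned into numbers. The function name and the exact set of features (average sentence length, vocabulary diversity, punctuation rate) are illustrative choices, not a fixed recipe:

```python
import string

def extract_features(text):
    """Turn a raw response into a small dictionary of numerical features."""
    words = text.split()
    # Crude sentence split: treat ., !, and ? as sentence boundaries.
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    n_words = len(words)
    unique_words = {w.lower().strip(string.punctuation) for w in words}
    return {
        "avg_sentence_length": n_words / max(len(sentences), 1),
        "vocab_diversity": len(unique_words) / max(n_words, 1),
        "punctuation_rate": sum(ch in string.punctuation for ch in text) / max(len(text), 1),
    }

features = extract_features(
    "I try to keep my weekends as relaxed as possible. "
    "First I have to finish all my MDS assignments on Saturday."
)
```

Each response in the dataset would pass through a function like this, producing one row of the feature matrix that the classifier is trained on.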
The classification model we use for this task is called logistic regression. Its goal is to estimate the probability that a given piece of text belongs to one of two categories: Human or AI. The model takes the numerical features extracted from the text and combines them using learned weights, which determine how strongly each feature contributes to the final prediction. The output is a score between 0 and 1, interpreted as the probability that the text is AI-generated: a score close to 1 suggests the text was likely written by AI, while a score close to 0 suggests human writing. In practice, this can be implemented with a standard machine learning library such as scikit-learn, where the model is trained with code like `LogisticRegression().fit(X_train, y_train)`. We can then evaluate the model by comparing its predictions to the known ground-truth labels and calculating accuracy, which quantifies how effective logistic regression is at distinguishing human-written from AI-generated text based on the features we selected.
Finally, we can test our model on new, unseen data to see how well it generalizes. By holding out a portion of our dataset during training, we can simulate real-world conditions where the model encounters text it has never seen before. If the model performs well on this test set, we can be more confident that it will accurately classify new responses in practice.
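Putting these steps together, the train/test workflow can be sketched with scikit-learn. The feature matrix below is synthetic (random numbers standing in for engineered features such as sentence length and vocabulary diversity), so the clean separation between the two classes is purely illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic features: pretend AI text (label 1) tends to have longer
# sentences and higher vocabulary diversity than human text (label 0).
X_human = rng.normal(loc=[12.0, 0.6], scale=1.0, size=(50, 2))
X_ai = rng.normal(loc=[20.0, 0.8], scale=1.0, size=(50, 2))
X = np.vstack([X_human, X_ai])
y = np.array([0] * 50 + [1] * 50)

# Hold out 25% of the data to simulate unseen text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# Probability that each held-out response is AI-generated (a score in [0, 1]).
probs = model.predict_proba(X_test)[:, 1]
accuracy = accuracy_score(y_test, model.predict(X_test))
```

With real responses, the accuracy on the held-out set is what tells us whether the model generalizes beyond the examples it was trained on.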
Concluding Remarks
This project aims to find out whether machines can help us identify AI-written text. However, it is important to remember that AI systems are constantly evolving (for example, by introducing a “human touch” into artificially generated text). The limitations of using machines to detect AI-generated text need to be explored, especially as humans also evolve their writing in response to AI.
Some More Final Thoughts…
As I was writing this blog post, there were several moments where I wanted to use the em-dash. However, the em-dash has been getting a lot of negative attention lately. Because ChatGPT is known to use many em-dashes in its writing, people have been wrongly accused of using AI assistance simply because their essays contain em-dashes. This is just one small example of how humans are also changing the way they write in the age of AI. Perhaps one day the overlap between human and AI writing will become so blurry that it is impossible to tell them apart. The real question then might not be who wrote something, but whether it resonates with us as readers.