JAIR 2025 · Submitted

A Lightweight Approach to Detection
of AI-Generated Texts

Using Stylometric Features with 1D Convolutional Neural Network and Random Forest

Sergey K. Aityan, William Claster, Karthik Sai Emani, Sohni Rais, Thy Tran
Northeastern University

CNN Accuracy: 97% · CNN ROC-AUC: 99.5% · RF Accuracy: 95% · Inference on CPU: <100 ms · CNN Model Size: 25 MB

Abstract
The growing volume of AI-generated text raises serious concerns, and detecting such text has become an important task for many research groups. Most existing approaches rely on fine-tuning large transformer models or building ensembles, which are computationally expensive and often generalize poorly across domains.

We develop a lightweight approach to AI-generated text detection that does not require extensive computational power. A text is first decomposed into stylometric and readability features, which are then classified by a compact 1D Convolutional Neural Network (CNN) and a Random Forest (RF).

Evaluated on the Kaggle AI-vs-Human corpus, our models achieve 97% accuracy (F1 ≈ 0.95) for the CNN and 95% accuracy (F1 ≈ 0.94) for the Random Forest, with ROC-AUC scores of 99.5% and 95%, respectively. The CNN (~25 MB) and RF (~10.6 MB) are orders of magnitude smaller than transformer-based ensembles and run efficiently on standard CPUs. We show that simplicity, when guided by structural insights, can rival complexity in AI-generated content detection.

The NEULIF Pipeline

Rather than processing raw token sequences through heavy transformer layers, NEULIF converts each text into a fixed 68-dimensional feature vector, then feeds it into a lightweight CNN or Random Forest. This sidesteps the computational cost of sequence models while preserving rich linguistic signal.

Raw Text (variable-length input)
  → Feature Extraction (spaCy + TextDescriptives)
  → 68-dim Vector (fixed-size linguistic profile)
  → 1D CNN (25 MB, 97% acc) or Random Forest (10.6 MB, 95% acc)
  → AI / Human (binary label + probability)
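The pipeline can be sketched end to end in a few lines. This is a minimal illustration, not the authors' code: it hand-rolls three of the 68 features (token count, mean sentence length, type-token ratio) with regular expressions and zero-pads the rest of the vector, whereas the paper's extractor uses spaCy and TextDescriptives to fill all 68 slots.

```python
import re

FEATURE_DIM = 68  # fixed-size linguistic profile; unfilled slots are zero-padded here


def extract_features(text: str) -> list[float]:
    """Toy stand-in for the spaCy + TextDescriptives feature extractor."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    token_count = len(tokens)
    sent_length_mean = token_count / max(len(sentences), 1)
    type_token_ratio = len(set(tokens)) / max(token_count, 1)
    vec = [float(token_count), sent_length_mean, type_token_ratio]
    return vec + [0.0] * (FEATURE_DIM - len(vec))  # pad to the fixed width


vec = extract_features("AI text detection matters. It really matters.")
print(len(vec), vec[:3])
```

Either classifier then consumes this fixed-width vector, so the models never see raw tokens and inference stays cheap on CPU.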
68 Stylometric Features in Six Categories
Descriptive Statistics: token counts, sentence lengths, unique token ratios, and distributional measures (e.g. token_count, sent_length_mean).
Readability Indices: metrics assessing comprehension ease, such as Flesch Reading Ease, Flesch-Kincaid, and ARI (e.g. flesch_reading_ease, flesch_kincaid).
Syntactic Features: dependency distances, POS-tag proportions (nouns, verbs, adjectives), parse tree depth (e.g. dep_dist_mean, pos_noun_ratio).
Lexical Diversity: type-token ratio, token entropy, vocabulary variation and stylistic richness (e.g. type_token_ratio, entropy).
Cohesion Metrics: connective counts, coreference chains, and other indicators of coherence and discourse structure (e.g. n_connectives, coref_chains).
Complexity Heuristics: clause ratios, sentence variation, punctuation density, spelling error rates (e.g. punct_density, clause_ratio).
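As a concrete example of the readability category, the Flesch Reading Ease score can be computed directly from its published formula. The sketch below is illustrative only: the syllable counter is a crude vowel-group heuristic, while the paper's pipeline obtains these indices from TextDescriptives.

```python
import re


def count_syllables(word: str) -> int:
    # Rough heuristic: one syllable per group of consecutive vowels, at least 1.
    return max(len(re.findall(r"[aeiouy]+", word.lower())), 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / max(len(sentences), 1)
            - 84.6 * syllables / max(len(words), 1))


score = flesch_reading_ease("The cat sat on the mat. It was warm.")
print(round(score, 2))
```

Higher scores indicate easier text; short monosyllabic sentences like the example score near the top of the scale.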
Architecture
1D CNN Architecture

Total parameters: 2,205,185, orders of magnitude fewer than BERT (110M) or RoBERTa (125M). The input is a 68-dim stylometric feature vector, not a raw token sequence.

Layer     | Type    | Output Shape    | Parameters | Details
Input     | Input   | (None, 68, 1)   | 0          | 68-dim linguistic feature vector
Conv1D    | Conv    | (None, 66, 128) | 512        | 128 filters, kernel=3, ReLU
BatchNorm | Norm    | (None, 66, 128) | 512        | Stabilizes training convergence
Flatten   | Reshape | (None, 8448)    | 0          | Converts to 1D for dense layers
Dense 1   | Dense   | (None, 256)     | 2,162,944  | ReLU, Dropout 0.4
Dense 2   | Dense   | (None, 128)     | 32,896     | ReLU, Dropout 0.3
Dense 3   | Dense   | (None, 64)      | 8,256      | ReLU, Dropout 0.2
Output    | Sigmoid | (None, 1)       | 65         | P(AI-generated) ∈ [0, 1]
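The per-layer parameter counts in the table follow from the standard formulas (weights × inputs + biases for dense layers, kernel × channels × filters + biases for the convolution, four statistics per channel for batch normalization), and can be checked by hand:

```python
# Verify the layer-by-layer parameter counts from the architecture table.
conv = 3 * 1 * 128 + 128      # kernel=3, 1 input channel, 128 filters, + 128 biases = 512
bn = 4 * 128                  # gamma, beta, moving mean, moving variance = 512
flat = 66 * 128               # Conv1D output length 66 (= 68 - 3 + 1) x 128 filters = 8448
dense1 = flat * 256 + 256     # 2,162,944
dense2 = 256 * 128 + 128      # 32,896
dense3 = 128 * 64 + 64        # 8,256
out = 64 * 1 + 1              # 65

total = conv + bn + dense1 + dense2 + dense3 + out
print(total)
```

Note that the two moving statistics in the batch-normalization count are non-trainable in most frameworks, so a framework summary may report 2,204,929 trainable of the 2,205,185 total.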

Comparison with Prior Work

NEULIF matches or exceeds heavyweight transformer ensembles at a fraction of the model size and compute cost. Evaluated on the Kaggle AI-vs-Human corpus (1,997 held-out test samples).

Method                  | Type        | Accuracy | F1    | ROC-AUC | Model Size | Hardware
NEULIF CNN              | Ours        | 97%      | 0.95  | 99.5%   | ~25 MB     | Lightweight CPU
NEULIF RF               | Ours        | 95%      | 0.94  | 95.0%   | ~10.6 MB   | Lightweight CPU
BERT-base               | Transformer | ~95%     | ~0.93 | n/a     | ~440 MB    | GPU
RoBERTa                 | Transformer | ~93%     | ~0.92 | n/a     | ~480 MB    | GPU
Ghostbuster             | Ensemble    | ~91%     | ~0.90 | n/a     | Large      | GPU
Stylometry RF (Opara 2024) | RF       | ~98%     | n/a   | n/a     | Small      | CPU

Transformer baselines are sourced from Antoun et al. (2023), Guo et al. (2024), and Kuznetsov et al. (2024); since these were measured on different datasets, direct cross-dataset comparison requires caution.

Citation

If you use NEULIF in your research, please cite:

@article{aityan2025neulif,
  title   = {A Lightweight Approach to Detection of AI-Generated Texts Using Stylometric Features with 1D Convolutional Neural Network and Random Forest},
  author  = {Aityan, Sergey K. and Claster, William and Emani, Karthik Sai and Rais, Sohni and Tran, Thy},
  journal = {Journal of Artificial Intelligence Research},
  volume  = {0},
  article = {6},
  year    = {2025},
  doi     = {10.1613/jair.1.xxxxx},
  note    = {Submitted}
}