# Can You Quantify a Writing Voice?

**Author:** Aaddrick Williams
**Date:** March 20, 2026
**URL:** https://nonconvexlabs.com/blog/can-you-quantify-a-writing-voice

---

I wrote this in a Reddit comment about visiting my wife's family:

> My wife's family is a writhing knot of joy and engagement. A day or two is fine, after a week I'm toasted.

I fed nine years of my Reddit writing[^gdpr] into a pipeline I built to answer a question: can you decompose a writing voice into measurable dimensions? The corpus was 378 posts and comments, 26,769 words.

The pipeline looked at that line and reported: personal/parenting register, Flesch-Kincaid grade ~7, short declarative sentence, positive sentiment compound, family pronouns present. All correct. It captured everything about the line except the thing that makes it mine. The image. The pairing of "writhing knot" with "joy and engagement." The compression of an entire family dynamic into eleven words.

This article is about the gap between what the pipeline captures and what it misses.

## The problem with "write in my voice"

"Write in my voice" is an instruction AI can't follow. Not because the technology isn't there, but because nobody defines what "my voice" actually means. Ask someone to describe their writing voice and you'll get vibes: "conversational but technical," "friendly but direct," "I don't know, it just sounds like me."

Vibes aren't implementable. You can't validate a vibe. You can't tell a large language model "sound like me" and measure whether it succeeded.

I wanted to find out if voice could be reduced to something more concrete: a set of measurable constraints with tolerances, like an engineering spec. So I built a pipeline that runs 25 analytical skills across four phases, each grounded in published natural language processing and psycholinguistic research, and produces a numeric voice profile. The pipeline is open source and runs entirely inside Claude Code[^claudecode].
The only Python dependency is `requests`, used by one script that fetches context from Reddit. Everything else is markdown files orchestrating a language model.

The rest of this article is what I found when I pointed it at myself.

## What I learned about my own voice

It handed me a profile. Some of it confirmed things I already knew. Some of it surprised me.

### I write at a sixth-grade reading level

FK grade 6-8, with a floor of 5 and a ceiling of 10. Average word length 4.0-4.8 characters.

This was the first number that stopped me. I write about AI development tools, Linux system administration, and agentic workflows. Technical topics. But the readability analysis (five formulas run simultaneously, consensus taken from the median) says I write at a level designed for 11-year-olds.

The content isn't simple. I just use common words for complex concepts. "It's essentially a fancy build script which repackages the Windows electron app" is a sentence about cross-platform application packaging that any literate person can parse.

The pipeline uses five readability formulas (Flesch-Kincaid[^flesch], Coleman-Liau, Gunning Fog, SMOG, ARI) because no single formula captures complexity. FK and Coleman-Liau can diverge by 2-4 grade levels on the same text because they weight different proxies: syllables vs. characters, sentence length vs. word length. The consensus range across all five is the reliable signal. Mine was tight: all five agreed within two grades. That consistency says the complexity level is stable across topics, not accidental.

The replication angle here is lexical diversity. McCarthy & Jarvis (2010)[^mccarthy] showed that MTLD (Measure of Textual Lexical Diversity, the metric the pipeline uses alongside readability) has a negligible correlation with text length (r = -0.02), unlike raw type-token ratio, which mechanically decreases as text gets longer. The analysis measures both and uses MTLD as the primary length-corrected index.
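The pipeline implements MTLD as an LLM skill rather than library code, but the metric itself is small enough to sketch in plain Python. This is a minimal one-directional pass (the published measure averages a forward and a backward pass, so treat this as an illustration of the mechanism, not a reference implementation):

```python
def mtld_forward(tokens, threshold=0.72):
    """One-directional MTLD sketch (after McCarthy & Jarvis, 2010).

    Counts 'factors': stretches of text over which the running
    type-token ratio stays above the threshold. MTLD is the token
    count divided by the factor count, so higher = more diverse.
    """
    factors = 0.0
    types = set()
    count = 0
    for tok in tokens:
        count += 1
        types.add(tok.lower())
        ttr = len(types) / count
        if ttr <= threshold:
            factors += 1       # a factor is complete; reset the window
            types.clear()
            count = 0
    if count > 0:              # credit the leftover partial factor
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    return len(tokens) / factors if factors else float(len(tokens))
```

The reset is what makes the metric length-stable: each factor is a local measurement, so doubling the corpus adds factors rather than mechanically dragging the ratio down the way raw type-token ratio does.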
### My sentences alternate short and long

Average sentence length: 8-12 words. Range: 3-25 words. But the average isn't the fingerprint. The rhythm is.

The stylometric fingerprinting skill draws on Mosteller & Wallace (1963)[^mosteller], Burrows (2002)[^burrows], and Kestemont (2014)[^kestemont]. It extracts distributional features from function words (pronouns, articles, prepositions, conjunctions) and sentence structure, then tests their stability across corpus segments using the coefficient of variation[^cv]. A feature qualifies as part of the fingerprint only if its CV is below 0.30 across multiple independent text samples from different contexts.

My sentence length standard deviation and skewness passed. The short-long alternation pattern is stable whether I'm writing about parenting challenges, troubleshooting Claude Code on Linux, or reviewing a standing desk. Short sentence states the point. Longer sentence provides the context or reasoning. Then another short one. That rhythm persists across every topic in the corpus.

This is what Mosteller and Wallace found in 1963 when they attributed the disputed Federalist Papers[^federalist] using function-word frequencies. Content words change with topic. Function words and structural habits don't. A person who writes short-long-short does it whether they're discussing politics or cooking.

### I'm classified as HHL

The MDPI hypernetwork archetype classification[^ferrara] scores users on three normalized axes: Score (community engagement/visibility), Sentiment (affective valence), and Toxicity (harmful language probability). Each axis gets a High or Low label. The eight possible combinations describe distinct behavioral patterns.

I'm HHL: High Score, High Sentiment, Low Toxicity. The profile description: "a constructive, positive, knowledgeable communicator who helps through practical solutions and personal experience." Classified as Subject Matter Expert with a Lurker-turned-Leader trajectory.
The trajectory part was the surprise. The longitudinal growth curve analysis fit three models (linear, logistic, piecewise) to my activity over nine years and selected the best fit using BIC[^bic]. It found a clear transition from sparse early commenting to sustained, higher-frequency engagement. I went from lurking to participating to being someone other people replied to.

I knew this in a vague sense. I didn't know it was visible in the data as a statistically detectable change point.

### My personality shows in my pronouns

The Big Five personality inference measures linguistic markers associated with each OCEAN trait (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism). It's anchored to Koutsoumpis et al.'s 2022 meta-analysis[^koutsoumpis] of 85,724 participants across 31 samples. Their finding: 52 LIWC[^liwc] categories collectively explain about 5.1% of self-reported personality variance but 38.5% of observer-reported variance. So text-based personality inference captures how you come across to others better than how you see yourself.

My profile: very high emotional stability (remarkably calm, even when discussing frustrating topics), moderate-high conscientiousness (structured, process-oriented), moderate extraversion (friendly but task-focused), low-moderate agreeableness (helpful but direct, not effusive).

The low-moderate agreeableness was the one that caught me. I think of myself as agreeable. The analysis looked at my word choices and found: I help, but I don't soften. I give direct answers. When I disagree, I disagree with facts rather than hedging. "I wish there was official Linux support, but I get why they haven't done it yet." That "but I get why" is a concession, not deference.

My hedging rate came back at ~26% of items, which sounds high until you realize I hedge opinions ("I think the issue is probably on mobile") but state facts directly ("There's tons of inconsistencies between distros").
The pipeline measures that distinction separately.

These confidence bands are wide: +/- 10-15 points on a 0-100 scale for a corpus of this size. The output is honest about that. Every trait score carries an explicit confidence interval. A score in the 40-60 range is reported as "indeterminate" rather than "average," because those are different claims.

### I speak in three registers

The register variation analysis compares feature distributions across contexts using Kolmogorov-Smirnov tests and effect sizes (Cohen's d, Cliff's delta). It draws on Biber's (1988, 1995) multi-dimensional framework[^biber]. Statistical significance alone isn't enough with large corpora (virtually everything is "significant"), so the pipeline requires medium or large effect sizes before classifying a difference as genuine register variation.

My voice shifts between three modes:

**Technical advisory**: slightly fewer contractions, more external links, diagnostic questions ("what Linux backend are you using?"), average sentence length stretches to 10-14 words. Function: help-giving, troubleshooting, explaining.

**Personal/parenting**: shift to we/my/they pronouns, higher contraction rate, longer narrative sentences (12-16 words average), warmer emotional register. Function: sharing experiences, empathizing.

**Casual/hobby**: shortest sentences (7-10 words), highest contraction rate, most exclamation marks. Function: show-and-tell, sharing enthusiasm.

The output includes conditional replication rules from these shifts. It's an if-then structure: when the context is technical, apply register A; when personal, apply register B. The global rules (things that stay constant across all three registers) form the core fingerprint. The register-specific rules are overlays.

### The anti-marker list

The anti-marker list turned out to be one of the most useful outputs. These are patterns identified as absent from my writing.
Things a language model defaults to that I never use:

- No academic vocabulary or formal register
- No walls of text without paragraph breaks
- No aggressive, dismissive, or sarcastic language
- No emoji (extremely rare in the corpus)
- No em-dash overuse (a strong AI tell)
- No participial phrase endings ("The update shipped, revealing a deeper issue." That construction appears 2-5x more in AI text than human text)[^participial]
- No "it's not X, it's Y" reframing constructions
- No filler hedging ("it's worth noting," "it's important to note that," "from a broader perspective")
- No "delve," "underscore," "harness," "illuminate," "tapestry," "ecosystem," "leverage," "robust," "comprehensive"
- No triplet structures used as rhetorical devices (one is fine; repeated triplets across sections is AI rhythm)
- No staccato fragment pairs used as rhetorical punch ("Not a hypothetical. Kinetic military action.")
- No significance-labeling ("That's the gap between policy and reality." If the fact is strong, it lands without a label)

I didn't know half of these about myself until I saw the output. The em-dash thing in particular: I rarely use em-dashes. AI text uses them constantly. That absence is a measurable, replicable constraint. "Don't use em-dashes. Use periods or colons instead" is a clearer instruction than "sound natural."
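Most of these anti-markers don't need a model to enforce; they're string patterns. A hypothetical validator (the marker names and regexes here are illustrative, not the pipeline's actual constraint format, and the banned-word list is abbreviated) could flag violations in a draft:

```python
import re

# Illustrative negative constraints drawn from the anti-marker list.
ANTI_MARKERS = {
    "em-dash": re.compile("\u2014"),
    "banned word": re.compile(
        r"\b(delve|underscore|harness|illuminate|"
        r"tapestry|leverage|robust|comprehensive)\b", re.I),
    "filler hedge": re.compile(
        r"it'?s (worth noting|important to note)", re.I),
    "reframing": re.compile(r"\bit'?s not \w+[^.]*, it'?s\b", re.I),
}

def check_anti_markers(text):
    """Return the names of anti-markers that appear in a draft."""
    return [name for name, pat in ANTI_MARKERS.items() if pat.search(text)]
```

A check like this runs after generation, which is the point of making constraints measurable: you can validate the output instead of hoping the instruction was followed.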
## The numbers as a spec sheet

The full numeric profile the pipeline produced:

**Readability:**

| Metric | Target | Floor | Ceiling |
|--------|--------|-------|---------|
| FK Grade Level | 6-8 | 5 | 10 |
| Flesch Reading Ease | 60-72 | 55 | 80 |
| Avg Sentence Length | 8-12 words | 5 | 16 |
| Avg Word Length | 4.0-4.8 chars | 3.5 | 5.5 |

**Function words:**

| Feature | Target Rate | Stability |
|---------|-------------|-----------|
| First person singular (I/me/my) | 2.5-3.0% | High |
| Second person (you/your) | 1.0-1.5% | Context-dependent |
| Contractions | 1.5-2.5% | High |

**Speech acts:**

| Act Type | Target % | Range |
|----------|----------|-------|
| Asserting | 41% | 35-47% |
| Questioning | 19% | 14-24% |
| Advising | 11% | 7-15% |
| Thanking | 10% | 6-14% |
| Explaining | 10% | 6-14% |
| Challenging | 5% | 2-8% |
| Agreeing | 4% | 2-7% |

**Sentiment:**

| Metric | Target |
|--------|--------|
| Mean compound | +0.15 to +0.40 |
| Positive proportion | ~55-60% |
| Negative proportion | ~15-22% |
| Neutral proportion | ~20-25% |

**Rhetorical structure:**

| Feature | Target |
|---------|--------|
| Hedging | ~26% of items |
| Concessions | ~0.47 per item |
| Multi-paragraph | ~60% of responses |
| Bullet/numbered lists | ~7% of responses |

**Tone:**

| Dimension | Position |
|-----------|----------|
| Formal / Casual | Slightly casual (40/100) |
| Serious / Funny | Moderately serious (30/100) |
| Respectful / Irreverent | Respectful (20/100) |
| Enthusiastic / Matter-of-fact | Moderately enthusiastic (40/100) |

These aren't aspirational targets. They're measurements from the corpus, translated into ranges that account for natural variation. The style specification skill converts each one into a constraint verb scaled to confidence: high-confidence findings from direct measurement (readability, stylometric) become MUST constraints. Moderate-confidence findings become SHOULD. Low-confidence inferences (personality traits with wide confidence bands) become MAY.
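The confidence-to-verb mapping is simple to sketch. The numeric thresholds below (0.8 and 0.5) and the function names are my illustration, not the pipeline's actual cutoffs:

```python
def constraint_verb(confidence):
    """Map measurement confidence to a constraint tier: direct
    measurements become MUST, moderate findings SHOULD, and
    wide-interval inferences MAY. Thresholds are illustrative."""
    if confidence >= 0.8:
        return "MUST"
    if confidence >= 0.5:
        return "SHOULD"
    return "MAY"

def build_spec(findings, budget=20):
    """Emit at most `budget` constraints, highest confidence first,
    since models follow a small well-prioritized set most reliably."""
    ranked = sorted(findings, key=lambda f: f["confidence"], reverse=True)
    return [f"{constraint_verb(f['confidence'])}: {f['rule']}"
            for f in ranked[:budget]]
```

Sorting before truncating matters: if the budget forces a cut, it should drop a MAY-tier personality inference, never a MUST-tier readability measurement.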
Total constraint budget: 15-25. Models follow 5-12 well-prioritized constraints reliably. Beyond ~20, they start to interfere with each other.

## No NLP libraries

I didn't use spaCy, NLTK, HuggingFace, or any traditional NLP framework. Every analysis is implemented as a Claude Code skill definition: a markdown file that instructs the LLM to perform the analysis directly. Each skill is a structured document with a workflow checklist, code patterns for the LLM to follow or adapt, validation gates, anti-patterns, decision trees for handling insufficient data, and a report output template.

The skill definitions reference the same published research that traditional tools implement. VADER sentiment analysis[^vader] is a skill that tells Claude how to apply the VADER methodology. Stylometric fingerprinting works the same way for function-word frequency profiles. The LLM's language understanding is the implementation.

I did this on purpose. Traditional NLP pipelines require environment setup, dependency management, model downloads, and Python expertise. I wanted something that needed only Claude Code and a text corpus.

The tradeoff is deterministic reproducibility. A traditional pipeline produces identical results on reruns; an LLM-based pipeline produces consistent results that vary slightly between runs. For voice analysis, where the output is a range-based profile rather than an exact measurement, that tradeoff works. The pipeline already reports ranges, not point estimates.

The architecture also means the pipeline is portable. Each skill is a self-contained markdown file. You can read it, understand exactly what it does, and modify it without touching any code. The pipeline orchestrator coordinates four phase agents through a dependency graph, tracks completion through file existence, and resumes from interruptions. All of that coordination is itself just markdown.

## What it feels like to read your own numbers

Running this on my own writing was a strange experience.
I wrote every word in the corpus. Nothing in the output is new information, exactly. But it surfaces patterns you can't see from inside.

I didn't know I was a lurker-turned-leader. I didn't know my contraction rate was 2%. It showed me that I hedge opinions but not facts, and that this distinction is measurable. "Hey!" as a greeting? Apparently I use it at a rate high enough to be a voice marker. And my period-heavy punctuation style (frequent sentence termination, short sentences) turned out to be the dominant structural fingerprint, more distinctive than my vocabulary.

The LIWC psycholinguistic analysis[^liwc-22] categorized my vocabulary into psychological dimensions and found that social processes dominate: I address people directly ("you can try..."), share personal experience ("I had to deal with this too"), and frame technical help as peer-to-peer conversation rather than instruction. It calls this a "Social-Technical Hybrid" register: technical vocabulary delivered through personally-addressed social engagement. Advisory orientation ("you can...") with experience-based credentialing ("I had to...").

Social-Technical Hybrid is probably the single most useful label the pipeline produced. It captures something I'd been doing without thinking about it for a decade. Three words. And it's implementable. A language model can be told "use a Social-Technical Hybrid register: technical vocabulary delivered through personally-addressed social engagement."

## Where the voice profile lives now

The pipeline produced a voice agent and a voice skill. I use both.

The voice agent is baked into my company's website project. When I draft blog posts, I invoke the aaddrick-voice agent and it writes in a style that matches my measurements. The voice skill is a reference document that any Claude Code session can invoke for the detailed constraints.

The more interesting deployment is in my open source projects.
I wrote about this separately in ["OSS Maintainers Can Inject Their Standards Into Contributors' AI Tools"](https://nonconvexlabs.com/blog/oss-maintainers-can-inject-their-standards-into-contributors-ai-tools). The core idea is that CLAUDE.md and AGENTS.md files load automatically when a contributor opens a project in their AI tool. The tool reads the project's standards before any code generation begins.

I took that a step further. My open source projects include the voice agent definition in their `.claude/agents/` directory. When a contributor clones the repo and opens it in Claude Code, the voice profile is already there. If they ask the tool to write documentation, draft a PR description, or respond to an issue, it can match the maintainer's voice. The profile ships with the repo.

The contributor doesn't need to know the profile exists. Their AI tool picks it up from the project files the same way it picks up coding standards from CLAUDE.md. The voice profile becomes part of the project's infrastructure, like a linter config or an editorconfig file.

A voice profile with numeric targets and tiered constraints is portable. It's a file. Check it into a repo, version it, update it when your voice evolves. Any tool that reads markdown can consume it.

## The gap that remains

The pipeline captures the structure of a voice: rhythm, complexity, emotional register, pronoun patterns, speech acts, structural habits, register shifts, anti-markers. Enough to produce output that sounds like it could have come from the same person.

But it doesn't capture the spark.

"A writhing knot of joy and engagement" is not reproducible from a spec sheet. The pipeline can tell you I write personal content at FK grade 7 with positive sentiment and family pronouns. It cannot tell you to pair a kinetic, almost violent image ("writhing knot") with warmth ("joy and engagement") and let the tension between them carry the meaning. The unexpected image, the compression.
Those aren't in the measurements.

The few-shot examples in the voice agent are the bridge. They demonstrate by example what the constraints can only describe by rule. The pipeline selects 3-5 examples spanning different topics and registers, chosen for typicality rather than brilliance. They show the LLM the shape of the voice in practice. Examples are demonstrations, not generators. They can't teach the voice how to surprise.

I'm okay with that limitation. I designed the pipeline to produce measurable, replicable constraints. The kind you can validate after the fact and iterate on when they drift. Metaphor selection and humor timing aren't measurable with current methods. It captures enough of a voice to be useful. The rest is still yours.

## The repo

The pipeline is open source at [written-voice-replication](https://github.com/aaddrick/written-voice-replication). It includes:

- The full pipeline: 6 agents, 26 skills, a methodology guide
- A working example: the voice agent and skill generated from my Reddit corpus
- A hydration script for Reddit GDPR exports
- Documentation for adapting the data prep phase to other data sources

Clone it, open it in Claude Code, tell it to use the pipeline-orchestrator agent. It accepts any text corpus with at least 50 items. The more text you feed it, the tighter the confidence intervals.

It analyzes writing across 25 dimensions, each grounded in published research from stylometry, psycholinguistics, sentiment analysis, discourse theory, speech act theory, communication accommodation, and behavioral archetype classification. It hands you back a spec sheet for your voice. With tolerances.

You can deploy the spec in your projects, use it for drafting, share it with collaborators, or update it as your writing evolves.

---

[^gdpr]: Reddit provides a GDPR data export containing your posts, comments, votes, saved items, and account metadata as CSV files.
The pipeline's data prep phase processes these exports, but the analysis skills themselves work with any text corpus that has timestamps.

[^claudecode]: [Claude Code](https://docs.anthropic.com/en/docs/claude-code) is Anthropic's CLI tool for working with Claude. It reads project files, executes tools, and can be extended with custom agents and skills defined as markdown files.

[^flesch]: Flesch, R. (1948). A new readability yardstick. *Journal of Applied Psychology*, 32(3), 221-233. The other four formulas: Kincaid et al. (1975), Coleman & Liau (1975), Gunning (1952), McLaughlin (1969).

[^mccarthy]: McCarthy, P.M. & Jarvis, S. (2010). [MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment](https://link.springer.com/content/pdf/10.3758/BRM.42.2.381.pdf). *Behavior Research Methods*, 42(2), 381-392.

[^mosteller]: Mosteller, F. & Wallace, D.L. (1963). Inference in an authorship problem. *Journal of the American Statistical Association*, 58(302), 275-309.

[^burrows]: Burrows, J. (2002). ['Delta': A measure of stylistic difference and a guide to likely authorship](https://academic.oup.com/dsh/article/17/3/267/929277). *Literary and Linguistic Computing*, 17(3), 267-287.

[^kestemont]: Kestemont, M. (2014). [Function words in authorship attribution: From black magic to theory?](https://www.researchgate.net/publication/301404098) *Proceedings of the 3rd Workshop on Computational Linguistics for Literature*, 59-66.

[^cv]: Coefficient of variation (CV) = standard deviation / mean. A CV below 0.30 means the feature varies by less than 30% of its mean value across contexts. Features above this threshold are too context-dependent to be part of a stable fingerprint.

[^federalist]: The disputed Federalist Papers (numbers 49-58, 62, and 63) were claimed by both Alexander Hamilton and James Madison.
Mosteller and Wallace's 1963 analysis used function-word frequencies to attribute them to Madison, a result that has held up across six decades of subsequent research.

[^ferrara]: Ferrara, E., Ferrara, A., & Ferrara, M. (2025). [Characterizing User Archetypes and Discussions on Social Hypernetworks](https://www.mdpi.com/2504-2289/9/9/236). *Big Data and Cognitive Computing*, 9(9), 236.

[^bic]: Bayesian Information Criterion. Lower BIC = better model fit with a penalty for complexity. When AIC and BIC disagree on model selection, the pipeline prefers BIC because it penalizes complexity more heavily, favoring explanatory parsimony over predictive accuracy.

[^koutsoumpis]: Koutsoumpis, A., Oostrom, J.K., Holtrop, D., Van Breda, W., Ghassemi, S., & de Vries, R.E. (2022). [The kernel of truth in text-based personality assessment: A meta-analysis of the relations between the Big Five and LIWC](https://psycnet.apa.org/record/2023-55252-004). *Psychological Bulletin*, 148(11-12), 843-868. N = 85,724 across 31 samples.

[^liwc]: Linguistic Inquiry and Word Count. A dictionary-based tool that categorizes words into psychological dimensions (cognitive processes, social dynamics, affective states, etc.). Originally developed by Pennebaker and colleagues.

[^biber]: Biber, D. (1988). *Variation across Speech and Writing*. Cambridge University Press. See also Biber, D. (1995). *Dimensions of Register Variation: A Cross-Linguistic Comparison*. Cambridge University Press.

[^participial]: The claim that participial phrase endings appear 2-5x more frequently in AI-generated text than in human text is based on pattern analysis during the pipeline's development. Formal corpus studies on this specific construction are ongoing, but it's a consistent enough signal that the voice profile uses it as a negative constraint.

[^vader]: Hutto, C.J. & Gilbert, E.E. (2014).
[VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text](https://ojs.aaai.org/index.php/ICWSM/article/download/14550/14399/18068). *ICWSM*.

[^liwc-22]: Boyd, R.L., Ashokkumar, A., Seraj, S., & Pennebaker, J.W. (2022). [The development and psychometric properties of LIWC-22](https://www.liwc.app/static/documents/LIWC-22%20Manual%20-%20Development%20and%20Psychometrics.pdf). See also Tausczik, Y.R. & Pennebaker, J.W. (2010). [The psychological meaning of words](https://www.cs.cmu.edu/~ylataus/files/TausczikPennebaker2010.pdf). *Journal of Language and Social Psychology*, 29(1), 24-54.