Strings in Dart — UTF-16, Runes, and Grapheme Clusters
How Dart stores text, why string.length can lie, and what it takes to correctly count characters in a world of emoji and combining marks.
Why text is harder than numbers
Numbers are simple. An int is a fixed number of bits. A double follows IEEE 754. The rules are universal and unambiguous.
Text is a different beast entirely.
Humans have invented thousands of writing systems. A single "character" might be a Latin letter, a Chinese ideograph, an Arabic ligature, a Korean syllable block, or a family emoji with skin tones. Some characters are combined from multiple pieces. Some look identical but have different underlying codes. Some change shape depending on their neighbors.
A string library has to make choices — and those choices have trade-offs. In this episode we will see exactly what Dart chose, why it sometimes surprises us, and how to work with text correctly.
Dart strings are UTF-16
Every Dart String is a sequence of UTF-16 code units. Each code unit is 16 bits — an unsigned integer from 0 to 65,535.
For simple ASCII text like 'Hello', each letter maps to exactly one code unit. The string has 5 code units, .length is 5, and s[i] gives us the ith character. Everything works as expected.
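As a small illustration of the ASCII case, the sketch below prints the code units behind 'Hello' using the core String.codeUnits getter. The helper name codeUnitsHex is made up for this example.

```dart
/// Hypothetical helper: formats each UTF-16 code unit as 4-digit hex.
List<String> codeUnitsHex(String s) => [
      for (final unit in s.codeUnits)
        '0x${unit.toRadixString(16).toUpperCase().padLeft(4, '0')}',
    ];

void main() {
  const s = 'Hello';
  print(s.length); // 5
  print(codeUnitsHex(s)); // [0x0048, 0x0065, 0x006C, 0x006C, 0x006F]
  print(s[1]); // e
}
```

Every value is below 0x0080, so one code unit per letter is all this string needs.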
The trouble starts when we leave ASCII behind.
Surrogate pairs — when one character takes two slots
UTF-16 grew out of a fixed-width 16-bit encoding, designed when Unicode was expected to need at most 65,536 characters. That seemed like enough. It wasn't.
Today Unicode has over 150,000 characters, and the number keeps growing. To handle characters beyond the original 65,536, UTF-16 uses surrogate pairs — two code units that together represent a single character.
The grinning face emoji has Unicode code point U+1F600. That is above 65,535, so UTF-16 encodes it as two code units, 0xD83D followed by 0xDE00.
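The arithmetic behind that split can be sketched as follows. This is an illustrative sketch, not something we normally write by hand: the helper names (isHighSurrogate, isLowSurrogate, toSurrogatePair, fromSurrogatePair) are hypothetical and defined here, not taken from dart:core.

```dart
/// True if [unit] is a high (leading) surrogate, 0xD800 to 0xDBFF.
bool isHighSurrogate(int unit) => unit >= 0xD800 && unit <= 0xDBFF;

/// True if [unit] is a low (trailing) surrogate, 0xDC00 to 0xDFFF.
bool isLowSurrogate(int unit) => unit >= 0xDC00 && unit <= 0xDFFF;

/// Splits a code point above U+FFFF into its two UTF-16 code units.
List<int> toSurrogatePair(int codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw ArgumentError.value(
        codePoint, 'codePoint', 'must be in U+10000..U+10FFFF');
  }
  final offset = codePoint - 0x10000; // 20 bits remain
  final high = 0xD800 + (offset >> 10); // top 10 bits
  final low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

/// Recombines a surrogate pair into the original code point.
int fromSurrogatePair(int high, int low) =>
    0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);

void main() {
  final pair = toSurrogatePair(0x1F600);
  print(pair.map((u) => u.toRadixString(16)).toList()); // [d83d, de00]
  print(fromSurrogatePair(pair[0], pair[1]).toRadixString(16)); // 1f600
  print(isHighSurrogate(pair[0]) && isLowSurrogate(pair[1])); // true
}
```

The surrogate ranges are reserved for this purpose, which is why a lone code unit in those ranges is never a complete character.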
String emoji = '😀';
print(emoji.length); // 2
print(emoji[0]); // Broken - half of an emoji
print(emoji[1]); // Broken - the other half
This is the first lesson: .length counts code units, not characters. And s[i] gives us a code unit, which might be only half of a character.
Runes: iterating by code point
If .length and s[i] can't be trusted for characters, what can?
Dart provides .runes, which iterates over Unicode code points — the actual numbers assigned to each character in the Unicode standard.
String emoji = '😀';
print(emoji.runes.length); // 1
print(emoji.runes.first); // 128512 (decimal for U+1F600)
print(String.fromCharCode(emoji.runes.first)); // 😀
Now we get the right count. The emoji is one code point, and .runes handles the surrogate pair for us.
But wait: for a family emoji, even .runes gives 5, not 1. What's going on?
Grapheme clusters: what users actually see
Unicode allows characters to be combined. A family emoji like 👨👩👧 is actually five code points glued together with Zero Width Joiners (ZWJ, U+200D).
• 👨 (man) + ZWJ
• 👩 (woman) + ZWJ
• 👧 (girl)
The ZWJ tells the renderer to fuse them into one visual unit. Other sequences fuse in a similar way without a ZWJ: skin tone modifiers (👋🏽 = 👋 + 🏽), flag emoji (🇯🇵 = 🇯 + 🇵, a pair of regional indicator symbols), and accented characters (é can be e + combining acute accent).
What the user sees as a single "character" can be one code point, or it can be many code points combined. Unicode calls this visual unit a grapheme cluster.
So we have three levels:
1. Code units — what .length counts (UTF-16 building blocks)
2. Code points — what .runes counts (Unicode numbers)
3. Grapheme clusters — what the user perceives as characters
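To make the first two levels concrete using only dart:core, the sketch below prints code unit and code point counts for a few sequences, written with explicit escapes so that invisible joiners cannot get lost. The helper name describe is hypothetical. Each of these strings is perceived as a single character, a count that neither level reports.

```dart
/// Hypothetical helper: summarizes a string at the code-unit and
/// code-point levels.
String describe(String s) {
  final points = s.runes
      .map((r) => 'U+${r.toRadixString(16).toUpperCase().padLeft(4, '0')}')
      .join(' ');
  return '${s.length} code units, ${s.runes.length} code points: $points';
}

void main() {
  // e + combining acute accent.
  print(describe('e\u0301'));
  // 2 code units, 2 code points: U+0065 U+0301

  // Regional indicators J + P, rendered as one flag.
  print(describe('\u{1F1EF}\u{1F1F5}'));
  // 4 code units, 2 code points: U+1F1EF U+1F1F5

  // Waving hand + medium skin tone modifier.
  print(describe('\u{1F44B}\u{1F3FD}'));
  // 4 code units, 2 code points: U+1F44B U+1F3FD

  // Man, ZWJ, woman, ZWJ, girl: one family emoji.
  print(describe('\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'));
  // 8 code units, 5 code points: U+1F468 U+200D U+1F469 U+200D U+1F467

  // Precomposed and decomposed "é" look the same but are different strings.
  print('\u00E9' == 'e\u0301'); // false
}
```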
The characters package
Dart's core library doesn't have built-in grapheme cluster support. For that, we need the characters package from the Dart team.
import 'package:characters/characters.dart';
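// Written with escapes so the invisible ZWJs (U+200D) survive copy and paste.
String family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}'; // 👨‍👩‍👧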
print(family.length); // 8 (code units)
print(family.runes.length); // 5 (code points)
print(family.characters.length); // 1 (grapheme clusters)
Now we finally get 1, the number a human would count.
The .characters extension also lets us iterate, substring, and manipulate text by grapheme cluster.
String text = 'Hi 👋🏽!';
for (var char in text.characters) {
print(char); // H, i, (space), 👋🏽, !
}
// Safe substring — won't slice through emoji
print(text.characters.skip(3).take(1).string); // 👋🏽
For any user-facing text manipulation (limiting input length, truncating for display, counting characters), the characters package is essential.
How Dart stores strings in memory
Dart strings are immutable. Once created, the contents never change. Any "modification" creates a new string.
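A minimal sketch of what immutability means in practice: methods that appear to modify a string return a new one and leave the original untouched. The capitalize helper is hypothetical, written for this example, and assumes the first code unit is a complete character.

```dart
/// Hypothetical helper: returns a capitalized copy of [s].
/// The argument itself is never changed.
String capitalize(String s) =>
    s.isEmpty ? s : s[0].toUpperCase() + s.substring(1);

void main() {
  const original = 'dart';

  final upper = original.toUpperCase();
  final replaced = original.replaceAll('d', 'p');
  final capitalized = capitalize(original);

  print(original); // dart
  print(upper); // DART
  print(replaced); // part
  print(capitalized); // Dart

  // String has no index setter, so this would not compile:
  // original[0] = 'D';
}
```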
Under the hood, the Dart VM uses two different internal representations.
• One-byte strings — when all characters fit in Latin-1 (code points 0–255). Each code unit is stored in a single byte, saving memory.
• Two-byte strings — when any character requires a full 16-bit code unit. Uses 2 bytes per code unit.
This optimization means strings that contain only Latin-1 characters (most code, URLs, identifiers) use roughly half the memory they would in a naive UTF-16 implementation.
The choice happens automatically when the string is created. We don't control it, and we don't need to — it's an internal detail that keeps memory usage reasonable.
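Dart code cannot observe which representation the VM picked, but the condition described above is easy to check. The sketch below is only an approximation under that assumption: fitsInLatin1 and estimatedPayloadBytes are hypothetical helpers, and the estimate ignores object headers and padding.

```dart
/// Hypothetical helper: true if every code unit is in the Latin-1
/// range 0..255, the condition for the one-byte representation.
bool fitsInLatin1(String s) => s.codeUnits.every((unit) => unit <= 0xFF);

/// Rough payload estimate in bytes, ignoring headers and padding.
int estimatedPayloadBytes(String s) =>
    fitsInLatin1(s) ? s.length : s.length * 2;

void main() {
  print(fitsInLatin1('https://dart.dev')); // true
  print(fitsInLatin1('caf\u00E9')); // true (U+00E9 is within Latin-1)
  print(fitsInLatin1('\u4F60\u597D')); // false (needs 16-bit code units)
  print(estimatedPayloadBytes('hello')); // 5
  print(estimatedPayloadBytes('hello \u{1F600}')); // 16
}
```

Note that a single character outside Latin-1 is enough to double the estimate for the whole string.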
String interning and identity
Dart interns string literals. When we write the same string literal multiple times in our code, they all point to the same object in memory.
String a = 'hello';
String b = 'hello';
print(identical(a, b)); // true — same object
const String c = 'hel' + 'lo';
print(identical(a, c)); // true: constant expressions are canonicalized
String d = ['hel', 'lo'].join();
print(identical(a, d)); // false: built at runtime
Interning saves memory when the same string appears many times. But it only applies to compile-time constants. Strings built at runtime (from user input, parsing, concatenation with variables) are separate objects.
For equality, always use ==. The identical() function checks object identity, which depends on interning, a detail we usually don't want to rely on.
Common string pitfalls
1. Slicing through surrogate pairs.
String emoji = '😀';
String broken = emoji.substring(0, 1); // Half an emoji!
print(broken); // Garbage or replacement character
Fix: use .characters for user-visible text. When working at the code-unit level, check whether the code unit at the cut point is a high surrogate (0xD800 to 0xDBFF) or a low surrogate (0xDC00 to 0xDFFF) before slicing.
2. Comparing strings with different normalization.
The letter "é" can be represented two ways: precomposed (U+00E9, a single code point) or decomposed (U+0065 + U+0301, e + combining acute). They look identical but aren't == to each other. If comparing user input or data from different sources, normalize first.
3. Counting length for UI constraints.
import 'package:characters/characters.dart'; // at the top of the file
// Wrong: may slice through emoji
if (input.length > 50) input = input.substring(0, 50) + '…';
// Right: respects grapheme clusters
if (input.characters.length > 50) {
  input = input.characters.take(50).string + '…';
}
4. Off-by-one with newlines.
Different platforms use \n, \r\n, or \r. When splitting lines or counting them, be aware of the line-ending convention in your input.
StringBuffer for efficient concatenation
Because strings are immutable, repeated concatenation creates many temporary objects.
String result = '';
for (var i = 0; i < 1000; i++) {
result += 'item $i, '; // Creates 1000 intermediate strings
}
For building strings in a loop, use StringBuffer.
StringBuffer buffer = StringBuffer();
for (var i = 0; i < 1000; i++) {
buffer.write('item $i, ');
}
String result = buffer.toString(); // One allocation at the end
StringBuffer grows an internal buffer efficiently, then produces the final string in one allocation.
For simple cases (a few concatenations), + or string interpolation is fine. The VM optimizes many common patterns. But for loops or large builds, StringBuffer is the right tool.
That wraps up strings. The key takeaway: Dart gives us three views of text (code units, code points, and grapheme clusters) and we need to pick the right one for the job. For most user-facing work, grapheme clusters (via the characters package) are what we actually want.
Test your understanding
Seven questions covering UTF-16, surrogate pairs, runes, grapheme clusters, and string memory layout in Dart.