Navigating MySQL Data Types: Strings

Related Articles

Introduction

Strings are all over databases, from emails and usernames to messages, logs, and data. It appears easy to use strings in MySQL at first: simply choose VARCHAR or TEXT and go on. In practice, string handling conceals some of the most imperceptible and harmful dangers. Unlike numbers or dates, string behavior heavily depends on encodings and data/index storage formats.

This article is part of a series about MySQL data types. It covers string types, how they function, what may go wrong, and how to make safe choices.


Overview of String Types

Character types:

  • CHAR(n): fixed-length string
  • VARCHAR(n): variable-length string
  • TEXT family: TINYTEXT, TEXT, MEDIUMTEXT, LONGTEXT

Binary types:

  • BINARY(n): fixed-length binary
  • VARBINARY(n): variable-length binary
  • BLOB family: TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB

MySQL has two sorts of strings: character and binary. Character types are read based on a charset and collation, but binary types only store raw bytes without any constraints.

When you use fixed-width types like CHAR(n) or BINARY(n), they always use the full length and fill in the gaps with spaces or zeros. This makes storage predictable, but it can waste capacity.

Variable-width types, such as VARCHAR(n) and VARBINARY(n), only utilize as many bytes as they need plus a little extra space for length. They work better for content with varied lengths, but they take a little longer to update when the values change in size.

The (n) for variable-length types tells you how many characters (for character types) or bytes (for binary types) can be in the type. The actual storage depends on the content; shorter values take up less space, but no value can be bigger than that. VARCHAR(50) in utf8mb4 can hold up to 50 characters, but it may need up to 200 bytes of space inside because each character can occupy up to 4 bytes.


Pitfalls and Lessons

Encodings and Collations

  • utf8 is not full UTF‑8: MySQL’s utf8 stores only 3-byte characters, so emojis and some symbols fail. Always use utf8mb4.
  • Case sensitivity (_ci vs _cs) and locale rules can surprise you. The most common examples are different handling of ß vs ss, or Turkish I vs i. It’s better to choose modern collations like utf8mb4_0900_ai_ci.
  • In many collations, 'abc' = 'abc ' evaluates as true, which may affect uniqueness constraints.
  • MySQL does not normalize Unicode. é as one codepoint and e+´ as two codepoints look the same but compare differently. Normalize strings at the application layer.
-- Example: trailing spaces ignored
table> CREATE TABLE t (val VARCHAR(10) UNIQUE);
table> INSERT INTO t VALUES ('abc');
table> INSERT INTO t VALUES ('abc   ');
ERROR 1062 (23000): Duplicate entry 'abc' for key 'val'

Storage and Indexing

  • CHAR vs VARCHAR: CHAR pads values with spaces and can waste storage. Use it only for fixed-length tokens (like country codes).
  • Index length limits: In InnoDB, indexes are limited to 3072 bytes. With utf8mb4 (up to 4 bytes per char), VARCHAR(1000) cannot be fully indexed.
-- Example: index too long
table> CREATE TABLE users (
  name VARCHAR(1000),
  INDEX(name)
) CHARSET=utf8mb4;
-- ERROR 1071: Specified key was too long
  • Row size limit: A row cannot exceed 65,535 bytes. Multiple wide VARCHARs can hit this limit even if values are usually short.
  • Off-page storage: Very long VARCHAR or TEXT values live off-page, adding I/O overhead when fetched frequently.

Text vs Binary Semantics

  • Storing hashes or UUIDs in VARCHAR means collation rules apply, breaking equality. Use BINARY(16) for UUID bytes or VARBINARY for hashes.
  • LENGTH() counts bytes; CHAR_LENGTH() counts characters. Mixing them causes subtle bugs with multi-byte charsets.
SELECT LENGTH('🙂'), CHAR_LENGTH('🙂');
-- LENGTH = 4, CHAR_LENGTH = 1

Search and Sorting

  • LIKE 'abc%' can use an index, but LIKE '%abc%' cannot.
  • WHERE LOWER(name) = 'x' kills index use. Instead, choose a case-insensitive collation or use a generated lowercase column with an index.
  • FULLTEXT indexes are good for natural-language search but don’t behave like exact matches.
  • Sorting large text fields with complex collations often spills to on-disk temp tables.

Application–Database Boundary

  • If the client connection is latin1 but data is UTF‑8, characters get corrupted. Always configure connections to use utf8mb4.
  • Schema charset, column collations, and client settings must all align.
-- Safe session setup
SET NAMES utf8mb4 COLLATE utf8mb4_0900_ai_ci;

Best Practices

  • Use utf8mb4 everywhere.
  • Default to utf8mb4_0900_ai_ci (MySQL 8.0).
  • Prefer VARCHAR for text, CHAR only for fixed tokens.
  • Store hashes and UUIDs in binary columns.
  • Normalize Unicode in application code.
  • Watch index byte limits; use prefix indexes or generated columns when needed.
  • Use FULLTEXT for natural language queries.

Conclusion

String types may seem easy, but they can create the worst issues, including text that doesn’t make sense, broken uniqueness, delayed queries, or comparisons that aren’t what you intended. The problems are caused by how MySQL deals with encodings, collations, and storage restrictions. n n You may avoid problems before they happen in production by using safe defaults (utf8mb4, modern collations, and the right use of VARCHAR/BINARY) and knowing what to look out for.

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.