String Inspector - Free Online Tool | PivaBox

Analyze strings: length, word count, character categories, Unicode code points, encoding

String Inspector: Deep-Dive Text Analysis with Character-Level Breakdown, Unicode Code Points, UTF-8 Encoding, Type Distribution, and Frequency Visualization

  1. Type or paste any text into the input field, and the tool instantly begins analyzing every character. Unlike a simple character counter, the String Inspector provides forensic-level text analysis useful for developers, linguists, data scientists, and security researchers. Paste anything from a single emoji (want to know that 😊 is Unicode code point U+1F60A, encoded in UTF-8 as the 4-byte sequence F0 9F 98 8A?) to a multi-megabyte log file (want to see character type distribution and identify encoding anomalies?). The real-time analysis updates as you type, making it well suited to interactive exploration.
  2. Review the comprehensive analysis displayed in organized sections. Basic Statistics: total character count, byte length (in UTF-8 encoding), word count (Unicode-aware: correctly handles CJK text where 'words' aren't space-separated), line count, and average word length. Character Type Distribution: breaks down your text into categories: letters (further split into uppercase and lowercase), digits, whitespace characters (space, tab, newline, and Unicode whitespace), punctuation, symbols, and 'other' (control characters, private use areas). Each category shows both the absolute count and percentage, often with visual percentage bars for at-a-glance comparison.
  3. Explore the deep details. Unicode Code Points: view each character as its U+XXXX hex code point with the Unicode character name (e.g., U+4E2D CJK UNIFIED IDEOGRAPH-4E2D for 中). This is invaluable for debugging encoding issues: if you see an unexpected U+FFFD (REPLACEMENT CHARACTER), you know encoding corruption has occurred. UTF-8 Byte Representation: see exactly how your text encodes to bytes, with highlighting for multi-byte sequences (2-byte, 3-byte, 4-byte UTF-8 sequences). Character Frequency: a sorted frequency distribution shows which characters appear most often, with horizontal bar charts for visual comparison, useful for text forensics, language detection, and data profiling. All analysis happens in your browser, so your text, even if it contains sensitive data, passwords, API keys, or proprietary content, never leaves your device.
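
The per-character breakdown described in the steps above can be approximated in a few lines of Python. This is only a sketch using the standard library, not the tool's actual implementation; the function names `inspect_char` and `basic_stats` are hypothetical, and the word count here is a plain whitespace split, so it is not CJK-aware.

```python
import unicodedata

def inspect_char(ch: str) -> dict:
    """Describe one character: code point, Unicode name, category, UTF-8 bytes."""
    return {
        "char": ch,
        "code_point": f"U+{ord(ch):04X}",
        # Control characters and unassigned code points have no name.
        "name": unicodedata.name(ch, "<unnamed>"),
        # Two-letter general category, e.g. "Lu", "Nd", "So".
        "category": unicodedata.category(ch),
        # Space-separated uppercase hex bytes (bytes.hex(sep) needs Python 3.8+).
        "utf8": ch.encode("utf-8").hex(" ").upper(),
    }

def basic_stats(text: str) -> dict:
    """Basic statistics; 'words' is a naive whitespace split (an assumption)."""
    return {
        "characters": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "words": len(text.split()),
        "lines": len(text.splitlines()),
    }

if __name__ == "__main__":
    for ch in "A\u4e2d\U0001F60A":
        print(inspect_char(ch))
    print(basic_stats("h\u00e9llo w\u00f6rld\nok"))
```

Grouping the `category` values by their first letter (L for letters, N for numbers, P for punctuation, S for symbols, Z for separators, C for other) gives a rough version of the type distribution.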

Frequently Asked Questions

What practical problems does a string inspector solve that a simple character counter doesn't?

A character counter answers 'how long?', while a string inspector answers 'what is this, really?' Key use cases: (1) <strong>Unicode debugging</strong>: text that looks identical can contain different underlying characters. 'А' (Cyrillic U+0410) and 'A' (Latin U+0041) appear the same in most fonts but are completely different characters. The inspector reveals the actual code points, catching homograph attacks and encoding mix-ups. (2) <strong>Encoding validation</strong>: after copy-pasting text between applications, invisible control characters or byte-order marks (BOM, U+FEFF) may be introduced, and the inspector surfaces these. (3) <strong>Data profiling</strong>: analyzing a CSV column to check for unexpected character types (are there tabs in what should be comma-separated? Are there non-printable characters in 'clean' text?). (4) <strong>Accessibility</strong>: checking that text uses proper Unicode characters rather than ASCII approximations (curly quotes “ ” vs straight quotes " ", proper em-dashes — vs hyphens -). (5) <strong>Security research</strong>: identifying zero-width characters (U+200B ZERO WIDTH SPACE) that can be used for text steganography or watermarking. The PivaBox String Inspector performs all analysis client-side, so your strings, including passwords and API keys, never leave your browser.
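
As an illustration of the invisible-character and homograph checks mentioned above, here is a minimal Python sketch. The helper names `find_hidden` and `scripts_used` are hypothetical, and the script guess is a rough heuristic that reads the first word of each letter's Unicode name, since the standard library does not expose the Unicode Script property.

```python
import unicodedata

def find_hidden(text: str) -> list:
    """Return (index, code point, name) for format and control characters.

    Tab, newline, and carriage return are skipped as ordinary whitespace.
    """
    hits = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) in ("Cf", "Cc") and ch not in "\t\n\r":
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits

def scripts_used(word: str) -> set:
    """Heuristic: take the first word of each letter's Unicode name as its script."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in word if ch.isalpha()}

if __name__ == "__main__":
    print(find_hidden("pass\u200bword"))   # zero-width space hidden inside a word
    print(scripts_used("p\u0430ypal"))     # second letter is Cyrillic, not Latin
```

A word whose letters come from more than one script is not necessarily malicious, but it is a reasonable trigger for a closer look.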

How does the UTF-8 byte representation help with debugging encoding problems?

UTF-8 is the dominant character encoding on the web (used by ~98% of websites), but its variable-length nature causes subtle bugs. UTF-8 encodes each Unicode code point using 1–4 bytes: ASCII characters (U+0000–U+007F) use 1 byte with the high bit cleared (<code>0xxxxxxx</code>); Latin-extended, Greek, Cyrillic, Arabic, Hebrew (U+0080–U+07FF) use 2 bytes in the pattern <code>110xxxxx 10xxxxxx</code>; CJK characters and most other scripts (U+0800–U+FFFF) use 3 bytes in the pattern <code>1110xxxx 10xxxxxx 10xxxxxx</code>; and supplementary characters including emoji and rare CJK (U+10000–U+10FFFF) use 4 bytes in the pattern <code>11110xxx 10xxxxxx 10xxxxxx 10xxxxxx</code>. The byte representation view shows you exactly which bytes each character encodes to, helping debug: (1) <strong>Truncation bugs</strong>: cutting a string after N bytes may split a multi-byte character, producing invalid UTF-8; the inspector shows byte boundaries. (2) <strong>Double-encoding</strong>: a common bug where UTF-8 bytes are treated as Latin-1 and re-encoded to UTF-8, producing garbled text (mojibake); the byte view makes this visible. (3) <strong>Database column sizing</strong>: MySQL's <code>VARCHAR(255)</code> is limited to 255 characters, not bytes, so the same column can need up to 1,020 bytes in utf8mb4 (up to 4 bytes per character) or up to 765 bytes in legacy utf8 (utf8mb3, which only supports sequences of up to 3 bytes and therefore cannot store emoji); the byte count helps verify your data fits.
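
The truncation and double-encoding bugs described above are easy to reproduce. The following Python sketch uses hypothetical helper names (`truncate_utf8`, `double_encode`, `undo_double_encode`); it assumes the garbling went through Latin-1 exactly once, which will not hold for every real-world case (Windows-1252 is a common variant).

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    data = text.encode("utf-8")[:max_bytes]
    # The input was valid UTF-8, so the only possible error is an incomplete
    # sequence at the very end; errors="ignore" drops it.
    return data.decode("utf-8", errors="ignore")

def double_encode(text: str) -> str:
    """Simulate the bug: UTF-8 bytes misread as Latin-1 text."""
    return text.encode("utf-8").decode("latin-1")

def undo_double_encode(garbled: str) -> str:
    """Reverse the bug, assuming a single Latin-1 misinterpretation."""
    return garbled.encode("latin-1").decode("utf-8")

if __name__ == "__main__":
    # "a" is 1 byte and the emoji is 4, so a 3-byte budget keeps only "a".
    print(repr(truncate_utf8("a\U0001F60A", 3)))
    garbled = double_encode("caf\u00e9")
    print(garbled, "->", undo_double_encode(garbled))
```

Slicing the byte string directly and decoding strictly would raise `UnicodeDecodeError` in the first case, which is the failure the byte view helps you anticipate.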

How does the character frequency analysis help with text forensics and data analysis?

Character frequency analysis, while most famously associated with breaking classical ciphers, has modern practical applications. (1) <strong>Language detection</strong>: different languages have distinct character frequency profiles. English is dominated by 'e' (~12.7%), 't' (~9.1%), 'a' (~8.2%); German has additional umlaut frequencies; CJK text shows completely different distribution patterns with thousands of possible characters. The frequency view helps identify the primary language of mixed-language text. (2) <strong>Encoding corruption detection</strong>: if the most frequent 'character' is U+FFFD (REPLACEMENT CHARACTER), your text has undergone encoding corruption. If U+0020 (space) dominates suspiciously, you may have whitespace padding issues. (3) <strong>Data quality assessment</strong>: in a 'clean names' field, finding high frequencies of digits or punctuation suggests data quality problems. (4) <strong>Steganography detection</strong>: unusually high frequencies of zero-width characters or variation selectors may indicate hidden watermarks. (5) <strong>Text authorship analysis</strong>: stylometry uses character and word frequency patterns (among other features) to identify or verify authors. The PivaBox String Inspector provides all this analysis for free, entirely in your browser.
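
A sorted frequency table like the one described above takes only a few lines with Python's `collections.Counter`. This is a sketch rather than the tool's own code; `char_frequencies` is a hypothetical name, and counting is done per code point, so a letter followed by a combining mark is counted as two separate entries.

```python
from collections import Counter

def char_frequencies(text: str, top: int = 5) -> list:
    """Return (character, count, percent) tuples, most frequent first."""
    counts = Counter(text)
    total = len(text)
    return [(ch, n, round(100 * n / total, 1)) for ch, n in counts.most_common(top)]

if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    for ch, n, pct in char_frequencies(sample):
        # One '#' per occurrence gives a crude horizontal bar chart.
        print(f"U+{ord(ch):04X} {ch!r:>4} {n:3d} {pct:5.1f}% {'#' * n}")
```

Checking whether U+FFFD or a zero-width character appears near the top of this list is a quick first pass for the corruption and watermarking cases.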