Analyze strings: length, word count, character categories, Unicode code points, encoding
F0 9F 98 8A?) to a multi-megabyte log file (want to see character type distribution and identify encoding anomalies?). The real-time analysis updates as you type, making it perfect for interactive exploration.
U+XXXX hex code point with the Unicode character name (e.g., U+4E2D CJK UNIFIED IDEOGRAPH-4E2D for 中). This is invaluable for debugging encoding issues: if you see an unexpected U+FFFD (REPLACEMENT CHARACTER), you know encoding corruption has occurred. UTF-8 Byte Representation: see exactly how your text encodes to bytes, with highlighting for multi-byte sequences (2-byte, 3-byte, and 4-byte UTF-8 sequences). Character Frequency: a sorted frequency distribution shows which characters appear most often, with horizontal bar charts for visual comparison; useful for text forensics, language detection, and data profiling. All analysis happens in your browser: your text, even if it contains sensitive data, passwords, API keys, or proprietary content, never leaves your device.
A character counter answers 'how long?'; a string inspector answers 'what is this, really?' Key use cases: (1) <strong>Unicode debugging</strong>: text that looks identical can contain different underlying characters. 'А' (Cyrillic U+0410) and 'A' (Latin U+0041) appear the same in most fonts but are completely different characters; the inspector reveals the actual code points, catching homograph attacks and encoding mix-ups. (2) <strong>Encoding validation</strong>: after copy-pasting text between applications, invisible control characters or byte-order marks (BOM, U+FEFF) may be introduced; the inspector surfaces these. (3) <strong>Data profiling</strong>: analyzing a CSV column to check for unexpected character types (are there tabs in what should be comma-separated? Are there non-printable characters in 'clean' text?). (4) <strong>Accessibility</strong>: checking that text uses proper Unicode characters rather than ASCII approximations (proper quotes “ ” and ‘ ’ vs straight quotes " and ', proper em dashes (U+2014) vs hyphens (-)). (5) <strong>Security research</strong>: identifying zero-width characters (U+200B ZERO WIDTH SPACE) that can be used for text steganography or watermarking. The PivaBox String Inspector performs all analysis client-side: your strings, including passwords and API keys, never leave your browser.
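To make the code point use cases above concrete, here is a minimal Python sketch of the same kind of inspection, using the standard unicodedata module. The helper names inspect and invisible_positions are hypothetical and chosen for illustration; this is not the PivaBox implementation, which runs in the browser.

```python
import unicodedata

def inspect(text):
    """Return one (char, 'U+XXXX', name, category) tuple per code point."""
    rows = []
    for ch in text:
        rows.append((
            ch,
            f"U+{ord(ch):04X}",
            # Some code points (e.g. control characters) have no name.
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.category(ch),
        ))
    return rows

def invisible_positions(text):
    """Indexes of format characters (general category Cf).

    Cf covers zero-width space (U+200B) and the BOM (U+FEFF),
    among others; it is a heuristic, not a complete list of
    invisible characters.
    """
    return [i for i, ch in enumerate(text)
            if unicodedata.category(ch) == "Cf"]

if __name__ == "__main__":
    # Cyrillic A followed by Latin A: visually alike, different code points.
    for row in inspect("\u0410A"):
        print(row)
    print(invisible_positions("pass\u200bword"))
```

Running inspect on the look-alike pair reports CYRILLIC CAPITAL LETTER A for the first character and LATIN CAPITAL LETTER A for the second, which is the distinction a plain character counter cannot show.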
UTF-8 is the dominant character encoding on the web (used by ~98% of websites), but its variable-length nature causes subtle bugs. UTF-8 encodes each Unicode code point using 1 to 4 bytes: ASCII characters (U+0000 to U+007F) use 1 byte with the high bit cleared (<code>0xxxxxxx</code>); Latin-extended, Greek, Cyrillic, Arabic, Hebrew (U+0080 to U+07FF) use 2 bytes with the bit pattern <code>110xxxxx 10xxxxxx</code>; CJK characters and most other scripts (U+0800 to U+FFFF) use 3 bytes with the bit pattern <code>1110xxxx 10xxxxxx 10xxxxxx</code>; and supplementary characters including emoji and rare CJK (U+10000 to U+10FFFF) use 4 bytes with the bit pattern <code>11110xxx 10xxxxxx 10xxxxxx 10xxxxxx</code>. The byte representation view shows you exactly which bytes each character encodes to, helping debug: (1) <strong>Truncation bugs</strong>: cutting a string after N bytes may split a multi-byte character, producing invalid UTF-8; the inspector shows byte boundaries. (2) <strong>Double-encoding</strong>: a common bug where UTF-8 bytes are treated as Latin-1 and re-encoded to UTF-8, producing garbled text (mojibake); the byte view makes this visible. (3) <strong>Database column sizing</strong>: MySQL's <code>VARCHAR(255)</code> is sized in characters, not bytes, so it can need up to 1,020 bytes in utf8mb4 (up to 4 bytes per character) but at most 765 bytes in legacy utf8 (utf8mb3, which stores at most 3 bytes per character and therefore cannot hold emoji or other supplementary characters); the byte count helps verify your data fits.
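The byte-level behavior described above can be reproduced in a few lines of Python. This is a rough sketch under stated assumptions: the function names utf8_bytes, truncate_utf8, and fix_double_encoding are hypothetical, and the repair function assumes the mojibake came from exactly one round of UTF-8 bytes being read as Latin-1, as in the description above. Text garbled via Windows-1252 or several rounds of re-encoding may not be recoverable this way.

```python
def utf8_bytes(text):
    """Pair each character with its UTF-8 bytes as uppercase hex."""
    return [(ch, ch.encode("utf-8").hex(" ").upper()) for ch in text]

def truncate_utf8(text, max_bytes):
    """Cut to at most max_bytes bytes without splitting a character."""
    data = text.encode("utf-8")[:max_bytes]
    # The input was valid UTF-8, so only the tail can be incomplete;
    # errors="ignore" drops that partial sequence.
    return data.decode("utf-8", errors="ignore")

def fix_double_encoding(garbled):
    """Undo one round of UTF-8-read-as-Latin-1, if that is reversible."""
    try:
        return garbled.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake; return the input unchanged.
        return garbled

if __name__ == "__main__":
    # One character from each length class: 1, 2, 3, and 4 bytes.
    for ch, hex_bytes in utf8_bytes("A\u0410\u4e2d\U0001F60A"):
        print(f"U+{ord(ch):04X} -> {hex_bytes}")

    # A naive 3-byte cut would leave a broken emoji; this keeps only "a".
    print(repr(truncate_utf8("a\U0001F60A", 3)))

    # Simulate the double-encoding bug, then reverse it.
    garbled = "caf\u00e9".encode("utf-8").decode("latin-1")
    print(repr(garbled), "->", repr(fix_double_encoding(garbled)))
```

Truncating on the encoded bytes and decoding with errors="ignore" is one simple way to respect character boundaries; it does not account for grapheme clusters such as emoji with skin-tone modifiers, which span several code points.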
Character frequency analysis, while most famously associated with breaking classical ciphers, has modern practical applications. (1) <strong>Language detection</strong>: different languages have distinct character frequency profiles. English is dominated by 'e' (~12.7%), 't' (~9.1%), 'a' (~8.2%); German has additional umlaut frequencies; CJK text shows completely different distribution patterns with thousands of possible characters. The frequency view helps identify the primary language of mixed-language text. (2) <strong>Encoding corruption detection</strong>: if the most frequent 'character' is U+FFFD (REPLACEMENT CHARACTER), your text has undergone encoding corruption. If U+0020 (space) dominates suspiciously, you may have whitespace padding issues. (3) <strong>Data quality assessment</strong>: in a 'clean names' field, finding high frequencies of digits or punctuation suggests data quality problems. (4) <strong>Steganography detection</strong>: unusually high frequencies of zero-width characters or variation selectors may indicate hidden watermarks. (5) <strong>Text authorship analysis</strong>: stylometry uses character and word frequency patterns (among other features) to identify or verify authors. The PivaBox String Inspector provides all this analysis for free, entirely in your browser.
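A sorted frequency distribution like the one described above takes only a few lines with Python's collections.Counter. The sketch below is illustrative only: char_frequencies and replacement_ratio are hypothetical names, and any threshold for deciding that a share of U+FFFD is "too high" is left to the reader rather than assumed here.

```python
from collections import Counter

def char_frequencies(text):
    """Return (char, count, share) tuples, most frequent first."""
    counts = Counter(text)
    total = len(text)
    return [(ch, n, n / total) for ch, n in counts.most_common()]

def replacement_ratio(text):
    """Share of U+FFFD (REPLACEMENT CHARACTER) in text; 0.0 if empty."""
    return text.count("\ufffd") / len(text) if text else 0.0

if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog"
    # Print the five most frequent characters with a simple text bar.
    for ch, n, share in char_frequencies(sample)[:5]:
        print(f"U+{ord(ch):04X} {ch!r:>4} {n:>3} {share:6.1%} {'#' * n}")

    # Two of the five characters here were lost to decoding errors.
    print(replacement_ratio("ab\ufffd\ufffdc"))
```

Counting by code point matches the view described above; counting user-perceived characters (grapheme clusters) would need a segmentation library and is outside this sketch.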