Unicode
- Unicode = ISO/IEC 10646 (same character set, kept in sync)
- Characters -> Bytes == U+00C0 -> 0xC3 0x80 (utf8)
- Planes, 0x00 - 0x10 (17 planes), 64k (65,536) code points per plane
- 0x00 - BMP, basic multilingual plane
- 0x01 - Rare / historical
- 0x02 - Less common Asian
- 0x03 - 0x10 - other
- combinations: U+0041 + U+030A = U+00C5 (A + combining ring above = Å)
- Form C (NFC) = decompose to most basic, then recompose - normalize to get around all the many different combos
- unicode chars come with metadata
- declare the encoding (charset) in both the HTTP Content-Type header and the HTML meta tag
- liblangid, yell, thoth, ystring, iconv, icu, ymail_transcoder
- UTF-32 0x000000C0
- UTF-16 0x00C0 (endian problems)
- UTF-8 0xC3 0x80 (no null-bytes end-of-string problems)
- web standard
- multi-byte (malformed or overlong sequences are a security problem, thoth lib can check)
- � U+FFFD "there was a char here but it's gone"
- ASCII is ASCII
- good for streaming
- char(c) support
- format
- lead = 110xxxxx, 1 trail = 10xxxxxx
- lead = 1110xxxx, 2 trail = 10xxxxxx 10xxxxxx
- lead = 11110xxx, 3 trail = 10xxxxxx 10xxxxxx 10xxxxxx
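The lead/trail byte patterns and the combining example above can be checked with a short Python sketch. This is only an illustration under stated assumptions: `utf8_encode` is a hypothetical helper written for these notes, not part of any of the libraries listed above, and the standard `unicodedata` module stands in for whatever normalizer is actually used.

```python
import unicodedata


def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand (hypothetical helper, for illustration)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        # 0xxxxxxx: ASCII is ASCII
        return bytes([cp])
    if cp < 0x800:
        # lead = 110xxxxx, 1 trail = 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # lead = 1110xxxx, 2 trail bytes
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # lead = 11110xxx, 3 trail bytes
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# U+00C0 in UTF-8 is the two bytes 0xC3 0x80
print(utf8_encode(0x00C0).hex())

# The hand-rolled encoder should agree with the built-in codec
for cp in (0x41, 0x00C0, 0x20AC, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")

# Form C composes A + combining ring above into the single code point U+00C5
composed = unicodedata.normalize("NFC", "\u0041\u030A")
print(f"U+{ord(composed):04X}")

# A lead byte with its trail byte missing decodes to U+FFFD
print(b"\xc3".decode("utf-8", errors="replace") == "\ufffd")
```

The encoder rejects surrogates and values above U+10FFFF, since neither is valid in UTF-8. It does not decode, so it says nothing about rejecting overlong input; that check belongs in a decoder or validator.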