Unicode

Page history last edited by Brev Patterson 1 yr ago

  • Unicode = ISO/IEC 10646 (same character repertoire and code points)
  • Characters -> Bytes == U+00C0 -> 0xC3 0x80 (utf8)
  • 17 planes, 0x00 - 0x10, 64k code points per plane
    • 0x00 - BMP, basic multilingual plane
    • 0x01 - Rare / historical
    • 0x02 - Less common Asian
    • 0x03 - 0x10 - other
  • combinations: U+0041 (A) + U+030A (combining ring above) = U+00C5 (Å)
    • Form C (NFC) = decompose fully, then recompose into the composed form - normalize to get around the many equivalent combos
  • unicode chars come with metadata
  • declare the encoding in both the HTTP Content-Type header and the HTML meta tag
  • liblangid, yell, thoth, ystring, iconv, icu, ymail_transcoder
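
The notes above can be checked with a few lines of code. This is a minimal sketch in Python using only the standard library, not any of the libraries listed above; the `plane` helper is a name made up for illustration.

```python
import unicodedata

def plane(ch: str) -> int:
    """Return the plane number (0x00 - 0x10) of a single character.
    Hypothetical helper: each plane holds 0x10000 (64k) code points."""
    return ord(ch) >> 16

# Character -> bytes: U+00C0 becomes 0xC3 0x80 in UTF-8.
a_grave = "\u00C0"
utf8_bytes = a_grave.encode("utf-8")

# Combining sequence: U+0041 (A) followed by U+030A (combining ring above).
decomposed = "\u0041\u030A"

# NFC folds the pair into the single code point U+00C5.
composed = unicodedata.normalize("NFC", decomposed)

# NFD goes the other way and splits U+00C5 back into the pair.
redecomposed = unicodedata.normalize("NFD", composed)

# Per-character metadata shipped with the Unicode database.
name = unicodedata.name(composed)
category = unicodedata.category(composed)           # general category
combining_class = unicodedata.combining("\u030A")   # non-zero for combining marks

print(utf8_bytes.hex(" "))
print(f"U+{ord(composed):04X}", name, category, combining_class)
print(plane("A"), plane("\U0001F600"), plane("\U00020000"))
```

Comparing strings only after normalizing both sides to the same form avoids treating the one-code-point and two-code-point spellings of the same character as different.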

 

  • UTF-32 0x000000C0
  • UTF-16 0x00C0 (endian problems)
  • UTF-8 0xC3 0x80 (no embedded null bytes, so no end-of-string problems)
    • web standard
    • multi-byte (security problem, thoth lib can check)
    • � U+FFFD "there was a char here but it's gone"
    • ASCII is ASCII
    • good for streaming
    • char(c) support
    • format
      • lead = 110xxxxx, 1 trail = 10xxxxxx
      • lead = 1110xxxx, 2 trail = 10xxxxxx 10xxxxxx
      • lead = 11110xxx, 3 trail = 10xxxxxx 10xxxxxx 10xxxxxx
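
The lead/trail layout above can be turned into a small encoder. This is a sketch for illustration only: `utf8_encode` is a made-up name, and the single-byte case `0xxxxxxx` (plain ASCII) is added because the format list leaves it implicit. Real code would normally just call `str.encode("utf-8")`.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the lead/trail bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:        # 0xxxxxxx: ASCII stays ASCII
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# The same character, U+00C0, in the three encodings from the list above.
ch = "\u00C0"
utf32 = ch.encode("utf-32-be")     # 4 bytes, fixed width
utf16_be = ch.encode("utf-16-be")  # 2 bytes, big-endian
utf16_le = ch.encode("utf-16-le")  # same 2 bytes swapped: the endian problem
utf8 = utf8_encode(ord(ch))        # 2 bytes, no byte-order issue

# A truncated multi-byte sequence decodes to U+FFFD when errors are replaced.
damaged = b"A\xc3".decode("utf-8", errors="replace")

print(utf32.hex(" "), "|", utf16_be.hex(" "), "|", utf16_le.hex(" "), "|", utf8.hex(" "))
print(repr(damaged))
```

Because every trail byte starts with `10` and every lead byte announces its own length, a reader that lands mid-stream can skip forward to the next lead byte, which is what makes the format good for streaming.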
