lingthusiasm


Bonus 60: Emoji, Mongolian, and Multiocular O ꙮ - Dispatches from the Unicode Conference

If you can copy-paste a string of letters and symbols from one text field to another, or if the message you wrote on your device displays correctly when you send it to a friend, you've benefited from the background work of the Unicode Consortium. If you've ever gone into the "insert symbol" menu in a document and poked around some of those strange and beautiful symbols, from hieroglyphs to arrows to emoji? Yup, that's Unicode too. 

In this episode, we get enthusiastic about how electronic devices know what symbols exist, aka character encoding! We talk about the massive list of symbols that your phone carries around, how that list (aka Unicode) came into existence, and why it's still growing a bit every year (it's partly about emoji but there's also so much more). Gretchen went to the annual meeting of the Unicode Consortium a few months ago and she got to show off her esoteric Unicode symbols scarf (yes, people liked it!) and learn many things, like the surprisingly complicated story of why Mongolian is still so hard to encode. Plus, our favourite obscure Unicode symbols, because there are just so many great ones to choose from. (Have one yourself? Share it with us on the Discord!) 

Announcements:

We're doing another Lingthusiasm liveshow on April 9th (Canada) slash 10th (Australia)! It will be a live Q&A for you, our wonderful patreons, all about a fan fave topic: swearing! We'll be hosting this session on our Discord server, and it will be available as an edited-for-legibility recording in your usual Patreon live feed if you prefer to listen at a later date. If you haven't joined the Discord yet, here are instructions for linking your Patreon and Discord accounts.

LingComm Grants are back in 2022! These are small grants to help kickstart new projects to communicate linguistics to broader audiences. There will be a $500 Project Grant, and ten Startup Grants of $100 each. Apply here by March 31, 2022 or forward this page to anyone you think might be interested, and if you’d like to help us offer more grants, you can support Lingthusiasm on Patreon or contribute directly. We started these grants because a small amount of seed money would have made a huge difference to us when we were starting out, and we want to help there be more interesting linguistics communication in the world.

Here are the links mentioned in this episode:

You can listen to this episode on this page, via the Patreon RSS or download the mp3. A transcription of this episode is available as a Google Doc. Lingthusiasm is also on Facebook, Tumblr, and Twitter. Email us at contact [at] lingthusiasm [dot] com or chat to us on the Patreon page. Gretchen is on Twitter as @GretchenAMcC and blogs at All Things Linguistic. Lauren is on Twitter as @superlinguo and blogs at Superlinguo.

To chat about this episode and other lingthusiastic topics with your fellow linguistics fans, join us on the Lingthusiasm Discord server.

Lingthusiasm is created by Gretchen McCulloch and Lauren Gawne. Our senior producer is Claire Gawne, our production editor is Sarah Dopierala, our production manager is Liz McCullough, and our music is ‘Ancient City’ by The Triangles.


Comments

Right - for working with Unicode the hex notation is standard, so it makes sense to use it. But to just understand it conceptually, I like to think in the normal base 10 number system. And if I'm just saving and reading text files, there's normally nothing to do with hex going on.
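To make the "nothing to do with hex" point concrete, here is a minimal Python sketch (Python is just a convenient choice here, and the file name and variable names are made up for illustration). It writes a short string to a temporary text file and reads it back, naming only the encoding; no hex appears unless you go looking at the raw bytes.

```python
import tempfile
from pathlib import Path

# Illustrative text: Latin capital A, a space, and multiocular O (U+A66E).
text = "A \uA66E"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "note.txt"  # hypothetical file name
    # Saving and loading only needs the encoding name, not any hex codes.
    path.write_text(text, encoding="utf-8")
    round_tripped = path.read_text(encoding="utf-8")
    # The raw bytes are where the encoding becomes visible.
    raw_bytes = path.read_bytes()

print(round_tripped == text)  # the text survives the round trip
print(len(text), len(raw_bytes))  # 3 characters, stored as 5 bytes in UTF-8
```

The character count and the byte count differ because UTF-8 stores A and the space in one byte each but needs three bytes for ꙮ.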

That seems to be a very common way of writing Unicode character codes. The Unicode standard itself has tables of characters indexed by hex row and column indices that together specify the number (see for example https://www.unicode.org/charts/PDF/U0250.pdf). C++ uses a similar syntax when incorporating Unicode characters into literal strings (for example \u0041 in https://en.cppreference.com/w/cpp/language/escape). HTML uses a different version with hex such as &#x41; (but can also use decimal numbers such as &#65;). Even Perl uses a hex notation such as 0x0041 (https://perldoc.perl.org/perluniintro#Hexadecimal-Notation). And its documentation states "The Unicode standard prefers using hexadecimal notation because that more clearly shows the division of Unicode into blocks of 256 characters." The Unicode FAQ for Unix/Linux talks about how UTF-8 relates to all of these numbers (https://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8).
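As a rough illustration of these notations, here is a small Python sketch (Python rather than C++ or Perl, purely for convenience; the variable names are invented for the example). It uses Python's own \u escape, which looks like the C++ one, and the standard library's html.unescape to decode the two HTML forms. All four spellings come out as the same letter A.

```python
import html

# Four ways of spelling the same code point, U+0041.
a_from_escape = "\u0041"                        # string-literal escape (hex)
a_from_hex_entity = html.unescape("&#x41;")     # HTML hex character reference
a_from_decimal_entity = html.unescape("&#65;")  # HTML decimal character reference
a_from_number = chr(0x0041)                     # plain integer; 0x0041 == 65

print(a_from_escape, a_from_hex_entity, a_from_decimal_entity, a_from_number)

# Why hex is handy: the first two hex digits of a four-digit code give the
# block of 256 it falls in, and the last two give the offset within it.
block_index, offset = divmod(0x0250, 256)
print(f"{block_index:02X}", f"{offset:02X}")  # the two halves of "0250"
```

In decimal, the same code point is 592, and the split into block 2 and offset 80 is not visible from the digits.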

Shaeeyaa

It feels like it might be unnecessarily complicating things to say that Unicode defines letter A as 'U+0041'. Why not just say that Unicode gives each letter a number and A is number 65? "U+0041" is one way to write that, but that's more a way to communicate between humans. It isn't what the computer is looking at, which would be something in binary and depends on the particular encoding used - usually UTF-8 but the episode didn't get into that.
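A quick way to see the difference between the number, the "U+" spelling, and the stored bytes is a sketch like the following, in Python (the variable names are illustrative only). It takes the letter A and the multiocular O from the episode title, and shows each one's code point in decimal, in U+ notation, and as bytes under two encodings.

```python
# One number per character; how it is written or stored is a separate matter.
for character in ["A", "\uA66E"]:  # "\uA66E" is multiocular O
    code_point = ord(character)                   # the number Unicode assigns
    u_notation = f"U+{code_point:04X}"            # the human-facing hex spelling
    utf8_bytes = character.encode("utf-8")        # what a UTF-8 file stores
    utf16_bytes = character.encode("utf-16-be")   # same character, other encoding
    print(code_point, u_notation, utf8_bytes.hex(), utf16_bytes.hex())

# For A specifically: decimal 65, hex 41, binary 01000001 are the same number.
a_decimal = ord("A")
a_binary = format(a_decimal, "08b")
```

For A, the UTF-8 form is the single byte 41 (hex), while UTF-16 uses two bytes, 00 41. For ꙮ, UTF-8 needs three bytes, so the bytes on disk no longer look like the code point at all.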

