Sadly, JavaScript's 16-bit strings are broken: Unicode has more than 16 bits' worth of code points. Especially now that emoji have finally been added, the 140 million Japanese users of them can finally take all their messages online... except JavaScript will choke on all of them.
Welcome to tons of JS Unicode String libraries to work around this problem :-(
The tutorial is somewhat incorrect with respect to strings. JavaScript strings are UTF-16, which means that strings are represented as a sequence of 16-bit half-words, but a single character could be spread over multiple half-words. For example, open up the Developer console on your browser and type
alert("π±"); // a cat face
and JavaScript will not have any problems processing the string (assuming you have the fonts to see that particular character.) Then try
alert("π±".length); // 2
and you will find that it prints 2, because that single character ("code point") requires two half-words to represent. This does mean that you cannot access this character by indexing a single position; you would have to slice out both half-words, because
alert("π±"[0]); // an unknown character
produces half of a character, which could be problematic and require libraries to work around if you're attempting to access individual characters. Other string operations still work fine, though
alert("abcπ±de".indexOf("π±")) // 3
so while a few operations will need special library support in some cases, JavaScript certainly won't "choke" on any of these strings.
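(For what it's worth, the kind of helper such a library provides is not complicated. Here's a rough sketch of counting code points rather than code units by detecting surrogate pairs; the function name is made up:)

function countCodePoints(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    // A high surrogate (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF)
    // encodes a single code point in two code units, so skip the second half.
    if (code >= 0xD800 && code <= 0xDBFF && i + 1 < str.length &&
        str.charCodeAt(i + 1) >= 0xDC00 && str.charCodeAt(i + 1) <= 0xDFFF) {
      i++;
    }
    count++;
  }
  return count;
}

countCodePoints("abc🐱de"); // 6, even though "abc🐱de".length is 7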
Interestingly, the JavaScript Scratchpad in Firefox requires two presses of delete to fully remove that character, which is consistent with its length being 2. The first delete actually changes it into some other character.
From what I understand, this is not a case of a Unicode character with a second character applied as an accent, although it's possible it behaves the same way.
In any case, it's interesting to see the delete key change a character to something else rather than fully delete it.
JavaScript's access to individual parts of a string works on code units, which are distinct from code points, which in turn are distinct from characters. A single (visible) character can be composed of multiple code points, such as a character plus an accent. However, a single code point can also be represented by multiple code units, where a code unit is a fixed number of bits. UTF-16 is a variable-width encoding, so a given code point can be either one code unit or two; in this case, the cat face is a single character, composed of a single code point, composed of two code units.
(UTF-8 is another variable-width encoding where a code point can be anywhere from one to four 8-bit code units, and UTF-32 is a fixed-width encoding where every code point is exactly one 32-bit code unit. Because code points can be accents or modifiers, UTF-32 is still variable-width with respect to characters, because there's no guarantee that a given character is composed of a single code unit.)
Because UTF-16 is a variable-width encoding, and because JavaScript exposes UTF-16 code units (instead of code points or characters), it is possible to delete half of a code point and even end up with an invalid UTF-16 string in some cases (IIRC). As another comment mentions, some languages (e.g. Python 3) expose code points instead, which still isn't the same as characters.
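To make the three levels concrete, a small sketch (assuming an environment with ES2015 string iteration and TextEncoder, which is newer than most of the engines being discussed):

var cat = "🐱";                        // one visible character, one code point

cat.length;                            // 2 - UTF-16 code units
Array.from(cat).length;                // 1 - code points (ES2015 iteration)
new TextEncoder().encode(cat).length;  // 4 - UTF-8 bytes

cat.slice(0, 1);                       // a lone high surrogate (0xD83D), which is
                                       // not valid UTF-16 on its own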
Then again, they're broken like most other "unicode" string types (Java's and C#'s, Python's default narrow builds until Python 3.3, Cocoa in some ways, etc.) in that the string "API" produces user-facing "UCS-2 plus surrogates", so it's nothing new under the sun.
Most modern languages, such as Python 3 and Ruby 1.9, have you care only about characters and never about the internal representation.
Calling the respective methods to get a string's length, for example, will always return the length in characters. There is no way to get the byte length without explicitly naming an encoding you'd like to get the byte length for.
Older languages, like PHP, JS, Python 2 and Ruby 1.8, leak their internal implementations. The methods to retrieve a string's length would return the number of bytes the internal representation of the string requires. If you need the length in characters, you need to call different methods, sometimes even from external libraries.
> Calling the respective methods to get a string's length, for example, will always return the length in characters.
Most languages return the length in _unicode characters_, which is just the number of codepoints.
However, in most cases, the programmer actually wants the number of user-perceived characters, i.e. _Unicode grapheme clusters_.
UTF-32 has to be treated as a variable-length encoding in most cases, no different from UTF-16 - otherwise, you'd miscount even characters common in western languages, like 'ä', if it happens that the user used the decomposed form.
Even normalization doesn't help with that, as not all grapheme clusters can be composed into a single codepoint.
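In JavaScript terms, roughly (normalize is an ES2015 addition; the specific characters are just illustrative):

"a\u0308".length;                   // 2 - 'ä' written as 'a' + combining diaeresis
"a\u0308".normalize("NFC").length;  // 1 - 'ä' has a precomposed form
"q\u0308".normalize("NFC").length;  // 2 - there is no precomposed 'q with diaeresis',
                                    //     so this grapheme cluster stays two code points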
Perl6 is an example of a language which does the right thing here: Its string type has no length method - you have to be explicit if you want to get the number of bytes, codepoints or grapheme clusters.
To add some confusion back in, the language also provides a method which gets the number of 'characters', where the idea of what a character is can be configured at lexical scope (it defaults to grapheme cluster).
> To add some confusion back in, the language also provides a method which gets the number of 'characters', where the idea of what a character is can be configured at lexical scope (it defaults to grapheme cluster).
That's actually pretty cool, as it lets a library configure itself for the representation which makes the most sense for it: a library which deals in storage or network stuff can configure for code point or byte lengths, whereas a UI library will use grapheme clusters for bounding-box computations and the like.
Configuring it lexically also makes sense, as it avoids leaking that choice out (which dynamically scoped configuration would).
Not true until Python 3.3's "flexible strings" implementation, unless you're using "wide" builds (which use UTF-32 internally); those are already available in Python 2 and are not the default representation.
I'm pretty sure every Linux distribution's official Python packages are wide builds. Certainly, this is the case on Ubuntu and Debian, and I think Red Hat as well.
> I'm pretty sure every Linux distribution's official Python packages are wide builds.
That's a matter of Linux distributions' packaging (again, by default, without any specific configuration, Python will set itself up using narrow builds), and if you assume wide builds your code is broken.
Furthermore, pilif asserted a difference between Python 2 and Python 3. There is no such difference prior to the yet-unreleased Python 3.3: making wide builds the default was explicitly rejected for Python 3, so Python 2 and Python 3 behave exactly the same way on that front (again, prior to Python 3.3).
Of course, pilif is also wrong in asserting that "The methods to retrieve a string's length would return the number of bytes the internal representation of the string requires." Python < 3.3 returns the number of code units making up the string, never the number of bytes, for the unicode string types (str/bytes is a different matter, as it's not Unicode data).
This was actually a thorough overview for JavaScript newbies. My only complaint was that the author glossed over apply() and call(), saying they were 'difficult to illustrate'; those two functions were new to me.
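In case it saves someone else a lookup, both invoke a function with an explicit `this`; a minimal sketch (the names here are made up, not from the tutorial):

function greet(greeting, punctuation) {
  return greeting + ", " + this.name + punctuation;
}

var person = { name: "Ada" };

greet.call(person, "Hello", "!");    // "Hello, Ada!" - arguments passed individually
greet.apply(person, ["Hello", "!"]); // "Hello, Ada!" - arguments passed as an array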
This tutorial reminds me of the Python reference from New Riders, from which I picked up the language and started writing fresh, literate Python in two days, producing code that worked and was free of bugs. I guess if you can show commonly used idioms and supply edge conditions such as "NaNs are toxic", language learning happens in a snap. Big thumbs up for the tutorial.
Wow that was good! I'm very happy to have found this. I've had very little luck finding good tutorials for experienced programmers who just want to pick up JavaScript. Most tutorials I found either assume very little prior knowledge and write dense paragraphs on what an object is, or they assume too much and explain too little. I love the focus on example-driven explanation as I somehow process code much faster than text, and I love that it is all on a single page (a surprisingly rare feature!).
According to the article, the code below causes a memory leak in IE, though I fail to see how.
function addHandler() {
  var el = document.getElementById('el');
  el.onclick = function() {
    this.style.backgroundColor = 'red';
  };
}
I'm assuming the author believes `this` is evaluated immediately and refers to the `el` variable instance, rather than being evaluated when the function is called.
When addHandler is called, the anonymous function expression is evaluated and its result is assigned to el.onclick. Thus, el contains a reference to the newly created function. However, when the function is created, it captures the environment in which it is defined, including the variable el. Thus, el points to the function, and the function (closure) points to el. This is the circular reference that IE can't handle.
Maybe I'm missing something, but the function doesn't point to `el`. `el` just exists in a lower execution context on the stack. If a variable inside the anonymous function referenced `el` directly, then a closure would be formed over `el`. In this instance `this` is just assigned when the function is called.
I'm not sure whether you're saying the code is wrong or you can't understand how that code creates the circular reference. This problem was widely known and stopped being a problem in IE8 I believe. Maybe it was IE7. A while ago anyway. That code looks like the type of circular reference that would cause it. You used to be able to download tools to watch for the memory leak.
From what I remember, the closure will keep a reference to el even though it will never use it, which means el won't be garbage collected when removed from the DOM, which in turn means the closure won't be removed from the onclick to free up el. Hence a circular reference. IE6's JavaScript garbage collector couldn't detect that.
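The workaround usually recommended at the time (sketched here from memory) was simply to drop the variable once the handler is attached, so the closure no longer holds a live reference to the DOM node:

function addHandler() {
  var el = document.getElementById('el');
  el.onclick = function() {
    this.style.backgroundColor = 'red';
  };
  el = null; // break the DOM-closure cycle so IE's reference counting can collect both
}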
The article is old. Here's an article from MS in 2005 describing the problem in more detail:
Apologies if you understood that and were just saying the code is wrong, but it looks right to me. Not that I can be bothered to fire up IE6 to test it out.
EDIT: Some more detail about the (definitely historical) leak from SO; it also turns out it was only partially fixed in IE7:
Thanks for the info. So it was the dreaded IE 6 that would not release things when you reloaded or went to another page. I was wondering why the base article here said you had to restart the browser completely, as I could think of no reason to carry over objects between pages (outside of browser/external persistence). At least IE 7 implemented the obvious "clean up between pages" operation.
The first example there allows the JS engine to prove to itself that no one uses foo, so it optimizes the reference out of the closure to save memory. That's an optimization (and the fact that the debugger doesn't disable it is actually a bug); the language semantic is that all closed-over variables are available to the closure function body.
So to answer myself, it does look like a reference is maintained despite no direct references existing within the inner function (unless eval is causing the reference to be maintained):
It looks exactly like eval is causing the reference to be maintained.
If you change the inner function call parameter 'foo' to ('alert("hi")') you'll see the reference to foo='bar' is still maintained in the inner function. So there is no -direct- reference to foo in the inner function, but there isn't even an -indirect- reference being made in the call to eval. The reference to foo is being maintained purely because eval "might" require it. Or so it appears.
If the return eval(_var) statement is replaced with alert('blah') we're back to foo being undefined in the inner function. So I've definitely got my finger pointed at eval on this one.
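For reference, a reconstruction of the kind of snippet being discussed (the actual code isn't quoted here, and the names outer/inner/foo are placeholders):

function outer() {
  var foo = 'bar';
  return function inner(_var) {
    // foo is never mentioned directly, but eval could evaluate any expression,
    // so the engine can't prove foo is unused and must keep it in the closure.
    return eval(_var);
  };
}

outer()('foo'); // "bar" - foo was still reachable through eval
// Swap eval(_var) for alert('blah') and engines are free to optimize foo away.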
Unless they can predict the future, it was updated as recently as 2009:
> The fourth edition was abandoned, due to political differences concerning language complexity. Many parts of the fourth edition formed a basis of the new ECMAScript edition 5, published in December of 2009.
EDIT: If you scroll to the bottom right it says: "Page last modified 21:08, 5 Jan 2012 by Janet Swisher"
Today I would recommend that anyone who plans to learn JS learns Lua first. It's similar enough, but simpler and you don't have to be concerned with all the browser-specific functions right away.