It appears that the intent of \w in Unicode mode (re.UNICODE in
Python 2, the default for str patterns in Python 3) is to match
every character in Unicode general categories L* and N*, plus
U+005F ('_'). However, in 2.7 the re module's view of the Unicode
database is slightly out of sync with the unicodedata module, so
four astral characters in category Nl are not matched when they
should be:
    U+012432  CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH
    U+012433  CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN
    U+012456  CUNEIFORM NUMERIC SIGN NIGIDAMIN
    U+012457  CUNEIFORM NUMERIC SIGN NIGIDAESH
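
One way to look for discrepancies like these is to compare \w against the category rule for every code point. The following is only a rough sketch, written for Python 3; the names expected_word and mismatches are made up for this illustration, and on 2.7 you would need unichr and a wide build to reach the astral planes.

```python
import re
import sys
import unicodedata

WORD = re.compile(r"\w", re.UNICODE)

def expected_word(ch):
    # The rule described above: general category L* or N*, or U+005F.
    return ch == "_" or unicodedata.category(ch)[0] in "LN"

def mismatches(start=0, stop=sys.maxunicode + 1):
    # Collect code points where the regex engine and the category
    # rule disagree about whether the character is a word character.
    found = []
    for cp in range(start, stop):
        ch = chr(cp)
        if bool(WORD.match(ch)) != expected_word(ch):
            found.append(cp)
    return found
```

Calling mismatches() with no arguments scans every code point and takes a few seconds; mismatches(0x12400, 0x12480) restricts the scan to the block containing the four characters listed above. What it reports depends on the interpreter and its Unicode database version.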
Note that neither version is consistent with UTS#18 level 1, which defines
"word characters" as general category Nd (not Nl or No), plus
everything that is "Alphabetic" (which has a complicated definition,
not exactly corresponding to any set of general categories), plus
U+200C and U+200D (ZWNJ and ZWJ). Personally I think the Python
definition is more useful.
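
To make the difference concrete, here is a small sketch (Python 3, with characters I picked for illustration) that reports the name, the general category, and whether \w matches. The helper name describe is hypothetical. U+093E is an example of a character that has the Alphabetic property but is in category Mc, so UTS#18 would count it as a word character while Python does not.

```python
import re
import unicodedata

WORD = re.compile(r"\w", re.UNICODE)

SAMPLES = [
    "\u2167",  # ROMAN NUMERAL EIGHT, category Nl
    "\u00bd",  # VULGAR FRACTION ONE HALF, category No
    "\u200c",  # ZERO WIDTH NON-JOINER, category Cf
    "\u200d",  # ZERO WIDTH JOINER, category Cf
    "\u093e",  # DEVANAGARI VOWEL SIGN AA, category Mc
]

def describe(ch):
    # Returns (name, general category, does \w match?)
    return (unicodedata.name(ch), unicodedata.category(ch), bool(WORD.match(ch)))

for ch in SAMPLES:
    print("U+%04X %s %s %s" % ((ord(ch),) + describe(ch)))
```

The Nl and No samples are matched by \w; ZWNJ, ZWJ and the Devanagari vowel sign are not.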
Note also that unicodedata itself may lag substantially behind the
current Unicode release: Python 2.7 has 5.2.0, 3.4 has 6.3.0, and
3.5 has 8.0.0.
Unicode 9.0.0 is "scheduled for release in mid-2016".
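
You can check which version your interpreter ships with via unicodedata.unidata_version. A minimal sketch follows; the helper ucd_version is just a convenience I made up for comparing versions.

```python
import unicodedata

def ucd_version():
    # unidata_version is a string such as "8.0.0"; split it into a
    # tuple of ints so that versions compare numerically.
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))

print(unicodedata.unidata_version)
```

For example, ucd_version() >= (8, 0, 0) tells you whether characters added in Unicode 8.0.0 can be known to the interpreter.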