Asked Friday, April 18, 2014

From reading various posts, it seems that JavaScript's unescape() is equivalent to Python's urllib.unquote(). However, when I test both, I get different results:



In browser console:



unescape('%u003c%u0062%u0072%u003e');


output: <br>



In Python interpreter:



import urllib
urllib.unquote('%u003c%u0062%u0072%u003e')


output: %u003c%u0062%u0072%u003e



I would expect Python to also return <br>. Any ideas as to what I'm missing here?



Thanks!



Answers

%uxxxx is a non-standard URL encoding scheme that is not supported by urllib.parse.unquote() (Python 3) or urllib.unquote() (Python 2).
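As a quick check of that behaviour, here is a minimal Python 3 sketch. The variable names are only illustrative. It compares a standard %XX escape with the %uxxxx form from the question:

```python
from urllib.parse import unquote

# Standard percent-escapes (%XX) are decoded.
standard = unquote('%3Cbr%3E')

# The %uXXXX form is not recognised, so the string passes through unchanged.
nonstandard = unquote('%u003c%u0062%u0072%u003e')

print(standard)
print(nonstandard)
```

The first call should print <br>, while the second prints the input string back untouched, matching the result reported in the question.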



It was only ever part of ECMAScript (ECMA-262, 3rd edition); the format was rejected by the W3C and was never part of an RFC.



You could use a regular expression to convert such codepoints:



import re

try:
    unichr  # only in Python 2
except NameError:
    unichr = chr  # Python 3

# quoted is the string containing the %uxxxx escapes
re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: unichr(int(m.group(1), 16)), quoted)


This decodes both the four-digit %uxxxx form and a two-digit %uxx form.



Demo:



>>> import re
>>> quoted = '%u003c%u0062%u0072%u003e'
>>> re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: chr(int(m.group(1), 16)), quoted)
'<br>'
>>> altquoted = '%u3c%u0062%u0072%u3e'
>>> re.sub(r'%u([a-fA-F0-9]{4}|[a-fA-F0-9]{2})', lambda m: chr(int(m.group(1), 16)), altquoted)
'<br>'


However, you should avoid using this encoding altogether if possible.
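If the encoding cannot be avoided, the substitution can be wrapped in a small helper. The following is only a sketch for Python 3. The name js_unescape is hypothetical, and it departs from the regular expression above in two ways: it also decodes ordinary two-digit %XX escapes as single code points (roughly what JavaScript's unescape() does), and it does not accept the two-digit %uxx form. Surrogate pairs such as %uD83D%uDE00 are not combined into a single character here.

```python
import re

# Matches the four-digit %uXXXX form or the ordinary two-digit %XX form.
_ESCAPE = re.compile(r'%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})')


def js_unescape(text):
    """Replace each %uXXXX or %XX escape with the character of that code."""
    def decode(match):
        # Exactly one of the two groups is set for any given match.
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return _ESCAPE.sub(decode, text)


print(js_unescape('%u003c%u0062%u0072%u003e'))
print(js_unescape('%3Cbr%3E and 100%'))
```

Anything that does not form a complete escape, such as the trailing % in the second call, is left as it is.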


Answered Thursday, April 17, 2014