rated 0 times [  124] [ 2]  / answers: 1 / hits: 83690  / 14 Years ago, mon, april 19, 2010, 12:00:00

I'd like to remove all invalid UTF-8 characters from a string in JavaScript. I've tried with this JavaScript:



strTest = strTest.replace(/([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g, "$1");



It seems that the UTF-8 validation regex described here (link removed) is more complete, so I adapted it in the same way:



strTest = strTest.replace(/([\x09\x0A\x0D\x20-\x7E]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}|\xF4[\x80-\x8F][\x80-\xBF]{2})|./g, "$1");



Both of these snippets let valid UTF-8 through, but they filter out hardly any of the bad UTF-8 characters in my test data: UTF-8 decoder capability and stress test. The bad characters either come through unchanged or seem to lose some of their bytes, which creates a new invalid character.



I'm not very familiar with the UTF-8 standard or with multibyte handling in JavaScript, so I'm not sure whether I'm failing to represent proper UTF-8 in the regex or applying that regex improperly in JavaScript.
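On the question of applying the regex in JavaScript: a JavaScript string holds UTF-16 code units, not UTF-8 bytes, so a byte-oriented pattern only lines up when every character of the string stands for one raw byte. Below is a minimal sketch of that difference. The helper `bytesToBinaryString` and the sample text are made up for illustration, and the pattern is the first one above with its `\x` escapes written out.

```javascript
// Hypothetical helper: map each byte to one character (code units 0-255),
// so a byte-oriented pattern sees one "character" per byte.
function bytesToBinaryString(bytes) {
  var s = "";
  for (var i = 0; i < bytes.length; i++) {
    s += String.fromCharCode(bytes[i]);
  }
  return s;
}

// First pattern from the question, with the \x escapes written out.
var utf8Pattern = /([\x00-\x7F]|[\xC0-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF7][\x80-\xBF]{3})|./g;

// Already-decoded text: "é" is the single code unit U+00E9.
var decoded = "caf\u00E9";

// The same text as raw UTF-8 bytes: 0x63 0x61 0x66 0xC3 0xA9.
var raw = bytesToBinaryString([0x63, 0x61, 0x66, 0xC3, 0xA9]);

// On decoded text the lone 0xE9 looks like a truncated sequence and is removed.
var decodedResult = decoded.replace(utf8Pattern, "$1"); // "caf"

// On the byte-per-character string the 0xC3 0xA9 pair matches and is kept.
var rawResult = raw.replace(utf8Pattern, "$1");
```

So if the test data has already been decoded by the browser before the regex sees it, the pattern is matching against something other than the original bytes.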



Edit: added global flag to my regex per Tomalak's comment - however this still isn't working for me. I'm abandoning doing this on the client side per bobince's comment.



 Answers
72

I use this simple and sturdy approach:



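function cleanString(input) {
    // Start from an empty string and append only the characters we keep.
    var output = "";
    for (var i = 0; i < input.length; i++) {
        // Keep code units 0-127 (plain ASCII); drop everything else.
        if (input.charCodeAt(i) <= 127) {
            output += input.charAt(i);
        }
    }
    return output;
}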


Basically, all you really want are the ASCII chars 0-127, so just rebuild the string char by char. If it's a good char, keep it; if not, ditch it. It's pretty robust, and if sanitization is your goal, it's fast enough (in fact, it's really fast).
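A quick usage sketch, with the function repeated so the snippet runs on its own (the sample strings are made up). It shows that this filter drops every character above 127, including valid non-ASCII text, not only malformed sequences:

```javascript
function cleanString(input) {
  var output = "";
  for (var i = 0; i < input.length; i++) {
    // Keep only code units in the ASCII range 0-127.
    if (input.charCodeAt(i) <= 127) {
      output += input.charAt(i);
    }
  }
  return output;
}

// All ASCII: comes back unchanged.
var plain = cleanString("hello, world");

// Valid accented letters (U+00EF, U+00E9) are dropped too.
var accented = cleanString("na\u00EFve caf\u00E9");

// A lone surrogate (malformed UTF-16) is dropped as well.
var lone = cleanString("ab\uD800cd");

console.log(plain, "|", accented, "|", lone);
```

If accented or non-Latin text needs to survive, this approach is too aggressive and a different filter would be needed.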


[#97031] Friday, April 16, 2010, 14 Years ago
grant
