If your website suddenly shows strings like ’, â€", or FranÃƒÂ§ais, you are probably looking at mojibake: text that was encoded one way, decoded another way, and then saved in the wrong state.
That is what happened on my own site. My editor showed clean apostrophes and dashes. Some browsers quietly cleaned things up. Facebook's in-app browser did not. The result was a nasty split: everything looked fine in the place I checked, and broken in the place some readers actually saw it.
The fix was not to replace every weird sequence by hand. The fix was to identify the exact transformation that broke the files, reverse it, and only touch the files that fit that pattern.
The quick answer: how to fix mojibake
For the common UTF-8-as-Windows-1252 pattern, reverse the mistake:
- Take the visible garbled text.
- Encode that text back into Windows-1252 bytes.
- Decode those bytes as UTF-8.
- If the text was garbled twice, repeat the same reversal one more time.
That is why a one-pass repair turns â€™ back into a curly apostrophe, while a two-pass repair turns ’ first into â€™, then into the original character.
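The reversal steps above can be sketched on a single string. This is a minimal illustration, assuming PowerShell with .NET's Windows-1252 code page available; the variable names are made up for the example, and the garbled sample is built from code points so the script file's own encoding cannot interfere.

```powershell
# Strict encodings: throw instead of silently substituting characters.
$cp1252 = [System.Text.Encoding]::GetEncoding(
    1252,
    [System.Text.EncoderExceptionFallback]::new(),
    [System.Text.DecoderExceptionFallback]::new()
)
$utf8Strict = [System.Text.UTF8Encoding]::new($false, $true)

# Step 1: the visible garbled text, a once-garbled curly apostrophe:
# a-circumflex (U+00E2), euro sign (U+20AC), trademark sign (U+2122).
$garbled = -join ([char[]](0x00E2, 0x20AC, 0x2122))

# Step 2: encode the visible text back into Windows-1252 bytes.
$bytes = $cp1252.GetBytes($garbled)

# Step 3: decode those bytes as UTF-8.
$repaired = $utf8Strict.GetString($bytes)

$hex       = ($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
$codePoint = 'U+{0:X4}' -f [int]$repaired[0]
"$hex -> $codePoint"   # E2 80 99 -> U+2019
```

Three visible characters collapse back into one because the three Windows-1252 bytes are exactly the UTF-8 encoding of the curly apostrophe.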
| What you see | What it probably was | Likely cause |
|---|---|---|
| â€™ | Curly apostrophe | UTF-8 bytes read as Windows-1252 once |
| ’ | Curly apostrophe | The same mistake saved and re-read again |
| â€" | Dash or punctuation | UTF-8 punctuation passed through a legacy code page |
| FranÃƒÂ§ais | Français | Double mojibake on an accented letter |
Important: mojibake is pattern-specific. An online repair tool can fix a pasted string, but a folder full of website files needs a safer classifier so clean files do not get rewritten just because they contain smart punctuation.
Why the files looked fine in my editor
The worst part was not the corruption. It was the false confidence. The files opened normally. The apostrophes looked like apostrophes. The dashes looked like dashes. Nothing in the editor screamed, "Hey, this text is going to embarrass you in a browser you forgot to test."
Browsers, editors, sync tools, and social in-app browsers do not always make the same encoding guesses. One renderer may recover gracefully. Another may show the raw mess. That is how a UTF-8 corruption problem becomes invisible to the person publishing and obvious to the person reading.
This is also why "it looks right when I open it" is not enough. You need to inspect what is actually served, how it is declared, and how it renders in the places your audience uses.
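As a rough sketch of the "how it is declared" part, the helper below pulls the charset out of a Content-Type header value and out of the page's meta tag so the two can be compared. Get-DeclaredCharset is a hypothetical name, the regular expressions are deliberately loose, and the URL in the commented usage is a placeholder.

```powershell
function Get-DeclaredCharset {
    param(
        [string] $ContentType,   # value of the Content-Type response header
        [string] $Html           # page source as text
    )
    # -match is case-insensitive, so "Charset=UTF-8" is found too.
    $fromHeader = if ($ContentType -match 'charset\s*=\s*"?([\w-]+)') { $Matches[1] } else { $null }
    $fromMeta   = if ($Html -match '<meta[^>]+charset\s*=\s*["'']?([\w-]+)') { $Matches[1] } else { $null }

    [pscustomobject]@{
        Header = $fromHeader   # $null when the server declares no charset
        Meta   = $fromMeta     # $null when the HTML declares no charset
    }
}

# Hypothetical usage against a live page (placeholder URL):
# $response = Invoke-WebRequest -Uri 'https://example.com/' -UseBasicParsing
# Get-DeclaredCharset -ContentType ($response.Headers['Content-Type'] -join '') -Html $response.Content
```

If either value is missing or the two disagree, different renderers have room to guess differently, which is the situation described above.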
What actually happened
Modern web text should be UTF-8. A lot of older Windows-flavored text handling still knows about Windows-1252, also called cp1252. Those two encodings overlap for plain ASCII, then diverge hard around curly quotes, dashes, symbols, and accented characters.
At some point, a sync process took UTF-8 text, interpreted it through Windows-1252, and saved the result again as UTF-8. That turned real punctuation into visible garbage. Then some files appear to have gone through the same kind of round trip more than once, which is how a simple broken apostrophe became the longer ’ pattern.
The good news: this was not random damage. It was a reversible transformation. Once I knew the path, I could test for that path.
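To make the path concrete, here is a small sketch of the forward (damaging) direction, assuming the round trip was exactly "write UTF-8, read as Windows-1252". ConvertTo-Mojibake is a made-up name for illustration, not something the sync tool exposes.

```powershell
$cp1252 = [System.Text.Encoding]::GetEncoding(1252)
$utf8   = [System.Text.UTF8Encoding]::new($false)

function ConvertTo-Mojibake {
    param([string] $Text)
    # Write the text as UTF-8 bytes, then misread those bytes as Windows-1252.
    $cp1252.GetString($utf8.GetBytes($Text))
}

$apostrophe = [string][char]0x2019              # the real curly apostrophe
$once       = ConvertTo-Mojibake -Text $apostrophe
$twice      = ConvertTo-Mojibake -Text $once

# The UTF-8 bytes that get misread on the first pass.
$bytesHex = ($utf8.GetBytes($apostrophe) | ForEach-Object { $_.ToString('X2') }) -join ' '

# Each bad pass makes the text longer: 1 character, then 3, then 8.
$lengths = $apostrophe.Length, $once.Length, $twice.Length
```

The growth in length is a handy tell: every non-ASCII character expands into two or three characters per bad pass, which is why the double-garbled apostrophe is eight characters long.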
The test that separated broken files from clean files
I had dozens of files. Some were corrupted, some were fine, and eyeballing them was a bad plan. So I used the corruption itself as the test.
A file that genuinely contains this mojibake pattern can survive the reverse operation: encode the visible text as Windows-1252, then decode it as UTF-8. A clean file with real curly quotes or accented letters usually fails strict UTF-8 decoding after that reversal, because those Windows-1252 bytes are not valid UTF-8. A plain ASCII file may pass unchanged, which means there is nothing to repair.
That gave me a practical signal:
- Repair attempt changes the text cleanly: likely mojibake, review the fixed output.
- Repair attempt throws on strict decode: likely clean non-ASCII text, leave it alone.
- Repair attempt returns the same text: probably ASCII-only or not affected.
That is how I found the corrupted files without turning the whole folder into a search-and-replace crime scene.
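The three outcomes can be sketched as a small classifier. This is an illustration under the same assumption (UTF-8 misread as Windows-1252); Get-MojibakeVerdict and its verdict labels are hypothetical names, and the sample strings are built from code points.

```powershell
$cp1252 = [System.Text.Encoding]::GetEncoding(
    1252,
    [System.Text.EncoderExceptionFallback]::new(),
    [System.Text.DecoderExceptionFallback]::new()
)
$utf8Strict = [System.Text.UTF8Encoding]::new($false, $true)

function Get-MojibakeVerdict {
    param([string] $Text)
    try {
        # Attempt the reversal: visible text -> cp1252 bytes -> strict UTF-8.
        $candidate = $utf8Strict.GetString($cp1252.GetBytes($Text))
    } catch {
        # Encode or strict decode threw: the text does not fit the pattern.
        return 'LikelyClean'
    }
    if ([string]::Equals($candidate, $Text, [System.StringComparison]::Ordinal)) {
        return 'Unchanged'
    }
    'LikelyMojibake'
}

$ascii   = 'plain ASCII text'
$clean   = 'It' + [char]0x2019 + 's fine'
$garbled = 'It' + (-join ([char[]](0x00E2, 0x20AC, 0x2122))) + 's broken'

$verdicts = $ascii, $clean, $garbled | ForEach-Object { Get-MojibakeVerdict -Text $_ }
# Unchanged, LikelyClean, LikelyMojibake
```

The verdict is a signal, not proof. A clean file could in principle contain character runs that happen to survive the round trip, so the repaired output still deserves a look.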
A PowerShell repair script for double-encoded UTF-8
This version writes repaired copies next to the originals so you can diff them first. That is slower than overwriting files, and that is the point.
# Strict Windows-1252: throw on characters or bytes that cannot be mapped.
$cp1252 = [System.Text.Encoding]::GetEncoding(
    1252,
    [System.Text.EncoderExceptionFallback]::new(),
    [System.Text.DecoderExceptionFallback]::new()
)
# Strict UTF-8: no BOM on output, throw on invalid byte sequences.
$utf8Strict = [System.Text.UTF8Encoding]::new($false, $true)

function Repair-MojibakeText {
    param(
        [Parameter(Mandatory = $true)]
        [AllowEmptyString()]   # empty files should pass through, not error
        [string] $Text,
        [int] $MaxPasses = 2
    )
    $current = $Text
    $changed = $false
    for ($i = 0; $i -lt $MaxPasses; $i++) {
        try {
            # Reverse one bad round trip: visible text -> cp1252 bytes -> UTF-8 text.
            $bytes = $cp1252.GetBytes($current)
            $next = $utf8Strict.GetString($bytes)
        } catch {
            # The text no longer fits the pattern: keep what earlier passes produced.
            return [pscustomobject]@{
                Text    = $current
                Changed = $changed
                Error   = $_.Exception.Message
            }
        }
        # Ordinal comparison, because -eq on strings ignores case.
        if ([string]::Equals($next, $current, [System.StringComparison]::Ordinal)) { break }
        $current = $next
        $changed = $true
    }
    [pscustomobject]@{
        Text    = $current
        Changed = $changed
        Error   = $null
    }
}

# Write a .fixed candidate next to every file that the repair actually changed.
Get-ChildItem -Path . -Recurse -File -Include *.md,*.html,*.astro |
    ForEach-Object {
        $original = Get-Content -LiteralPath $_.FullName -Raw -Encoding UTF8
        $result = Repair-MojibakeText -Text $original -MaxPasses 2
        if ($result.Changed) {
            $out = "$($_.FullName).fixed"
            Set-Content -LiteralPath $out -Value $result.Text -Encoding UTF8 -NoNewline
            Write-Host "Wrote candidate: $out"
        }
    }
After that, diff the .fixed files against the originals. If the output is right, copy the fixed version over the original or adjust the script to overwrite only reviewed files.
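One way to do that review without leaving the shell, as a rough sketch: compare each original with its candidate line by line and print only the lines that differ. Show-RepairDiff is a hypothetical helper name, and it assumes the .fixed naming used by the script above and that neither file is empty.

```powershell
function Show-RepairDiff {
    param([string] $OriginalPath)

    $before = Get-Content -LiteralPath $OriginalPath -Encoding UTF8
    $after  = Get-Content -LiteralPath "$OriginalPath.fixed" -Encoding UTF8

    # '<=' marks lines only in the original, '=>' lines only in the candidate.
    Compare-Object -ReferenceObject $before -DifferenceObject $after -CaseSensitive |
        ForEach-Object {
            $label = if ($_.SideIndicator -eq '<=') { 'original' } else { 'fixed' }
            '{0}: {1}' -f $label, $_.InputObject
        }
}

# Hypothetical usage over every candidate the repair script wrote:
# Get-ChildItem -Path . -Recurse -File -Filter *.fixed |
#     ForEach-Object { Show-RepairDiff -OriginalPath ($_.FullName -replace '\.fixed$', '') }
```

Compare-Object reports differing lines rather than a positional diff, which is usually enough here: the only lines that should show up are the ones that contained garbled characters.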
If you see strings even longer than ’, you may need more than two passes. Raise $MaxPasses to 3, but keep review in the loop. More passes are not automatically better; they are only correct when the file was actually mangled that many times.
Why not just search and replace?
Because the same visible symptom can come from different paths, and the same path can produce many symptoms. Replacing â€™ with an apostrophe fixes one character in one state. It does not fix accented letters. It does not tell you which files are affected. It does not catch partial corruption. And it can create new problems if a sequence appears in a code example or an article about encoding corruption. Hello, recursion-shaped headache.
The reversible transform is stronger because it treats the file as a system, not a pile of suspicious strings.
The bug was not random. It was a transformation with an inverse. Once I saw that, the fix stopped being a hunt and became arithmetic.
How to prevent UTF-8 corruption from coming back
After repair, I care about prevention more than cleverness. The checklist is boring, which is exactly what you want from encoding infrastructure:
- Make sure HTML declares UTF-8 with <meta charset="utf-8">.
- Make sure your server sends a UTF-8 charset for text responses where appropriate.
- Keep source files in UTF-8 and make the encoding visible in your editor status bar.
- Be suspicious of sync, backup, import, export, and CMS migration tools that rewrite text files.
- Check rendered pages in more than one browser context, including mobile and in-app browsers when social traffic matters.
- Add a quick pre-publish search for common mojibake strings such as â, Â, and â€™.
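The last item in the checklist can be sketched with Select-String. This is a hint-only scan, not a classifier: the pattern is an assumption that covers a few telltale pairs (Â or Ã followed by a character in the U+00A0 to U+00BF range, or â followed by the euro sign) and is written with \u escapes so the script stays ASCII-only. Find-MojibakeHint is a hypothetical name.

```powershell
# Telltale sequences left behind by UTF-8 read as Windows-1252.
$pattern = '[\u00C2\u00C3][\u00A0-\u00BF]|\u00E2\u20AC'

function Find-MojibakeHint {
    param([string] $Path = '.')

    Get-ChildItem -Path $Path -Recurse -File -Include *.md, *.html, *.astro |
        # -CaseSensitive matters: otherwise lowercase accented letters match too.
        Select-String -Pattern $pattern -CaseSensitive -Encoding UTF8 |
        ForEach-Object { '{0}:{1}' -f $_.Path, $_.LineNumber }
}

# Hypothetical usage before publishing:
# Find-MojibakeHint -Path .\src
```

A hit means "look at this line", nothing more. An article that quotes mojibake on purpose will show up here too, which is fine for a pre-publish check that a person reads.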
FAQ
What is mojibake?
Mojibake is garbled text caused when bytes written in one character encoding are read as another. A common web pattern is UTF-8 text being interpreted as Windows-1252 or Latin-1, which turns punctuation and accented letters into strange sequences.
How do you fix double-encoded UTF-8?
For the common UTF-8 read as Windows-1252 pattern, reverse the mistake: encode the visible garbled text as Windows-1252 bytes, then decode those bytes as UTF-8. If the text was garbled twice, repeat the same reversal a second time and review the output before overwriting files.
Why does ’ appear instead of an apostrophe?
It is usually a curly apostrophe whose UTF-8 bytes were interpreted through Windows-1252, saved, and then put through a similar path again. One bad pass often shows as â€™; another pass makes the longer sequence.
Can mojibake be fixed across many files automatically?
Yes, if the corruption pattern is consistent. Use a script that previews repaired copies first. Do not start with broad search-and-replace rules unless you want a very educational afternoon.