this post was submitted on 28 Oct 2023
341 points (98.0% liked)

Programmer Humor


Post funny things about programming here! (Or just rant about your favourite programming language.)

all 28 comments
[–] mvirts@lemmy.world 123 points 1 year ago

🙃 compression algorithms hate this one simple trick!!

[–] whileloop@lemmy.world 85 points 1 year ago (2 children)

This is a joke, right? This feels like a very dumb solution. I don't know much about UTF-8 encoding, but it sounds like Roman characters can be encoded shorter than most or all others because of a shorthand that assumes Roman characters. In that case, why not take that functionality and let a UTF-8 block specify which language makes up most of the text so that you can have that savings almost every time? I don't see why one would want it to be random.
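For anyone else who doesn't know much about UTF-8: the length isn't really a shorthand for Roman characters as such. It depends on how large the code point number is, and ASCII happens to sit at the low end. A quick Python sketch, where the sample characters are my own picks and not from the post:

```python
# Number of bytes UTF-8 needs for a few sample characters (arbitrary picks).
samples = ["A", "é", "д", "€", "中", "🙃"]

def utf8_len(ch: str) -> int:
    """Return how many bytes the UTF-8 encoding of ch occupies."""
    return len(ch.encode("utf-8"))

for ch in samples:
    # Print the code point next to the encoded length.
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_len(ch)} byte(s)")
```

Code points up to U+007F take one byte, up to U+07FF two, up to U+FFFF three, and everything above that four.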

[–] alvvayson@lemmy.world 127 points 1 year ago* (last edited 1 year ago) (1 children)

It's a joke.

UTF-16 already exists, which doesn't favor Roman characters as much, but UTF-8 is more popular because it is backward compatible with legacy ASCII.

UTF-32 also exists, which uses exactly four bytes for every code point.

But the thing that equalizes languages is compression.

Yes, a text written in Cyrillic with UTF-8 will take more space than one in a Roman-script language, easily double. However, this extra space is much more easily compressed by an algorithm like GZIP.

So after compression, the two compressed texts will then be similarly sized and much smaller than UTF-16 or UTF-32.
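If you want to poke at this yourself, here is a rough Python sketch that measures raw and gzip-compressed sizes per encoding. The two sample sentences are hypothetical stand-ins, and repeating one sentence flatters gzip a lot, so try it on real text before drawing conclusions:

```python
import gzip

# Hypothetical sample sentences, repeated so gzip has something to work with.
latin = "The quick brown fox jumps over the lazy dog. " * 200
cyrillic = "Съешь же ещё этих мягких французских булок. " * 200

# UTF-8 is byte-identical to ASCII for pure-ASCII text (the backward compatibility).
assert latin.encode("utf-8") == latin.encode("ascii")

def sizes(text: str) -> dict:
    """Raw and gzip-compressed byte counts of text in each encoding."""
    out = {}
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):  # the -le variants skip the BOM
        raw = text.encode(enc)
        out[enc] = len(raw)
        out[enc + "+gzip"] = len(gzip.compress(raw))
    return out

for name, text in (("latin", latin), ("cyrillic", cyrillic)):
    print(name, sizes(text))
```

Compare the `+gzip` numbers across the two scripts rather than the raw ones; the raw UTF-8 gap is where the "easily double" comes from.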

[–] jmcs@discuss.tchncs.de 19 points 1 year ago (1 children)

Besides, most text on the average computer is either in some configuration file (which tends to use Latin script) or in some SGML-derived format with a bunch of Latin characters in it. For network transmission, most things use HTML, XML or JSON with English-language property names, even in countries that don't speak English (see Yandex's and Baidu's APIs, for example).

No one is moving large amounts of .txt files around.

[–] Buckshot@programming.dev 27 points 1 year ago (2 children)

You've never worked in finance then. All our systems at work do nothing but move large amounts of txt files around.

That said, many of our clients still don't support UTF-8, so it's all ASCII and non-Latin alphabets are screwed. They can't even handle characters 128-255, so even stuff like £ is unsupported.
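To make the 128-255 point concrete, a small Python sketch of where £ lands in a few encodings. This only illustrates the character ranges; I'm not assuming anything about what those client systems actually run:

```python
pound = "£"  # U+00A3

print(ord(pound))               # 163, above the 7-bit ASCII limit of 127
print(pound.encode("latin-1"))  # b'\xa3', one byte in ISO 8859-1
print(pound.encode("utf-8"))    # b'\xc2\xa3', two bytes in UTF-8

try:
    pound.encode("ascii")
except UnicodeEncodeError as err:
    # Plain ASCII has no slot for it at all.
    print("not ASCII:", err.reason)
```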

[–] LaggyKar@programming.dev 12 points 1 year ago

That said, many of our clients still don’t support UTF-8, so it's all ASCII and non-Latin alphabets are screwed.

Ah, yes, I heard about that sort of thing. Some bank getting a GDPR complaint because they couldn't correct the spelling of someone's name, because their system uses EBCDIC.

[–] anytimesoon@lemmy.ml 7 points 1 year ago (1 children)

finance

even stuff like £ is unsupported.

Probably not an issue then...

[–] fibojoly@sh.itjust.works 12 points 1 year ago* (last edited 1 year ago)

It's not a joke. I worked for a big European bank network and the software there didn't know how to translate from EBCDIC to UTF-8 because none of the devs writing the software knew enough of the other side (mainframe vs PC) to realise this was an issue.

Their solution was "if the file has a ? in it when we receive it, it's probably a £". Which of course completely breaks down the day you have any other untranslated character.

I spent fucking weeks explaining this issue and why this was abominable, but apparently this wasn't enough of an issue for people to fix it. Go figure...
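For the curious, a minimal Python sketch of why the "?" guess falls apart. Big assumptions: code page 037 stands in for "EBCDIC" (the variant isn't named above), `mainframe_to_utf8` is a made-up helper, and the lossy step is just one way a "?" can end up in a file, not a claim about how that pipeline actually worked:

```python
# Assumption: cp037 stands in for "EBCDIC"; the real variant is not known.
EBCDIC = "cp037"

def mainframe_to_utf8(data: bytes) -> bytes:
    """Decode EBCDIC bytes to text, then re-encode the text as UTF-8."""
    return data.decode(EBCDIC).encode("utf-8")

original = "£100 or ¥100? Who knows"
ebcdic_bytes = original.encode(EBCDIC)

# Proper transcoding round-trips every character.
assert mainframe_to_utf8(ebcdic_bytes).decode("utf-8") == original

# A lossy conversion that turns anything non-ASCII into "?" ...
lossy = original.encode("ascii", errors="replace").decode("ascii")
# ... followed by the "every ? is probably a £" guess.
guessed = lossy.replace("?", "£")

print(lossy)    # ?100 or ?100? Who knows
print(guessed)  # £100 or £100£ Who knows
```

The yen sign and the genuine question mark both come back as £, which is exactly the breakage described.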

[–] simplify@lemm.ee 21 points 1 year ago (1 children)

I immediately thought of Leeroy Jenkins in the last sentence.

https://youtu.be/mLyOj_QD4a4?si=6RhZzj8LO3tr80cT

[–] Shhalahr@beehaw.org 2 points 1 year ago (1 children)

Pretty certain it's an intentional reference.

[–] simplify@lemm.ee 2 points 1 year ago (1 children)

You're right, and someone else might be a part of the lucky 10,000 today.

[–] Shhalahr@beehaw.org 1 points 1 year ago (1 children)

And now we have the obligatory xkcd reference. 😁

[–] apotheotic@beehaw.org 19 points 1 year ago (1 children)

I can't read "what a time to be alive" without hearing Two Minute Papers in my head

hold onto your papers

[–] lowleveldata@programming.dev 11 points 1 year ago

longer than necessary

It's as long as it needs to be unique

[–] dukk@programming.dev 2 points 1 year ago

And it has 333 upvotes! We must maintain this at all costs…