IBM code page to Unicode conversion
Here you'll find a code page to Unicode converter that I wrote. I needed such a utility for MS-DOS. Yes, MS-DOS doesn't support Unicode, but what do you do if you need to convert from windows-1257 to Unicode? There are some converters written for Windows, and some of them are shareware, but I didn't find any freeware that could batch convert lots of files at a time. There are Linux solutions such as enca, but it didn't work as I expected. So I wrote a little conversion utility myself.
At first I opened a Unicode document in a hex editor and realised that its first two bytes mark the document as Unicode: they are 0xFF and 0xFE. The symbols follow after that. This format (UTF-16 little-endian, where those two bytes are the byte order mark) uses two bytes for one symbol, which is enough for every character in windows-1257. So I needed a table that maps windows-1257 to Unicode. This website has what I needed: http://www.orwell.ru.
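To make the layout concrete, here is a minimal sketch (not part of the converter itself) that prints the bytes such a file would contain for the text "Hi!". The program name bomdemo and the helper HexByte are made up for this illustration. It assumes ASCII-only input, where the first byte equals the character code and the second byte is $00.

```pascal
program bomdemo;

const
  HexDigits: string[16] = '0123456789ABCDEF';
  Sample: String = 'Hi!';  { ASCII only, so each code equals its Unicode value }

{ Hypothetical helper: format one byte as two hex digits }
function HexByte(b: Byte): String;
begin
  HexByte := HexDigits[(b shr 4) + 1] + HexDigits[(b and $0F) + 1];
end;

var
  i: Integer;
begin
  { The marker at the start of the file: $FF, then $FE }
  Write(HexByte($FF), ' ', HexByte($FE));
  for i := 1 to Length(Sample) do
    { Low byte first (the character code), then the high byte ($00) }
    Write(' ', HexByte(Ord(Sample[i])), ' ', HexByte($00));
  Writeln;
end.
```

This prints FF FE 48 00 69 00 21 00, which is what a hex editor shows for a file of this kind containing just "Hi!".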
I didn't need a universal code page to Unicode converter, so I just took the cp1257 table and turned it into two arrays (one holds the first Unicode byte of each symbol, the other holds the second byte). At first I thought about somehow replacing each cp symbol with its two corresponding Unicode bytes, but then a bright idea came to mind: the array indexes are the same :) So if I take a cp1257 symbol's number in the code table, I can use it as the index into the Unicode arrays. For example, the symbol "!" (EXCLAMATION MARK) has code 0x21 in hex, which is 33 in decimal. So I can take element number 33 from the Unicode arrays and get the same exclamation mark. OK, let's see this little converter written in Turbo Pascal; you can use it or change it as much as you need:
Program cp2uni;
uses crt;
{ Characters table found here: http://www.orwell.ru/test/CP/_?cp1257 }
{ Don't forget that the bytes are written in reverse order: }
{ to write 0x0021 you write 0x21 0x00 }
{ In Turbo Pascal hex numbers are written in $XX form, }
{ e.g. 0x21 in C is $21 in TP }
{ Windows-1257 to Unicode table: first (low) byte of each character }
const unicodeArrayFb: array[0..255] of byte =
(
$00, $01, $02, $03, $04, $05, $06, $07, $08,
$09, $0A, $0B, $0C, $0D, $0E, $0F, $10, $11,
$12, $13, $14, $15, $16, $17, $18, $19, $1A,
$1B, $1C, $1D, $1E, $1F, $20, $21, $22, $23,
$24, $25, $26, $27, $28, $29, $2A, $2B, $2C,
$2D, $2E, $2F, $30, $31, $32, $33, $34, $35,
$36, $37, $38, $39, $3A, $3B, $3C, $3D, $3E,
$3F, $40, $41, $42, $43, $44, $45, $46, $47,
$48, $49, $4A, $4B, $4C, $4D, $4E, $4F, $50,
$51, $52, $53, $54, $55, $56, $57, $58, $59,
$5A, $5B, $5C, $5D, $5E, $5F, $60, $61, $62,
$63, $64, $65, $66, $67, $68, $69, $6A, $6B,
$6C, $6D, $6E, $6F, $70, $71, $72, $73, $74,
$75, $76, $77, $78, $79, $7A, $7B, $7C, $7D,
$7E, $7F, $AC, $20, $1A, $20, $1E, $26, $20,
$21, $20, $30, $20, $39, $20, $A8, $C7, $B8,
$20, $18, $19, $1C, $1D, $22, $13, $14, $20,
$22, $20, $3A, $20, $AF, $DB, $20, $A0, $20,
$A2, $A3, $A4, $20, $A6, $A7, $D8, $A9, $56,
$AB, $AC, $AD, $AE, $C6, $B0, $B1, $B2, $B3,
$B4, $B5, $B6, $B7, $F8, $B9, $57, $BB, $BC,
$BD, $BE, $E6, $04, $2E, $00, $06, $C4, $C5,
$18, $12, $0C, $C9, $79, $16, $22, $36, $2A,
$3B, $60, $43, $45, $D3, $4C, $D5, $D6, $D7,
$72, $41, $5A, $6A, $DC, $7B, $7D, $DF, $05,
$2F, $01, $07, $E4, $E5, $19, $13, $0D, $E9,
$7A, $17, $23, $37, $2B, $3C, $61, $44, $46,
$F3, $4D, $F5, $F6, $F7, $73, $42, $5B, $6B,
$FC, $7C, $7E, $D9
);
{ Windows-1257 to Unicode table: second (high) byte of each character }
const unicodeArraySb: array[0..255] of byte =
(
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $20, $00, $20, $00, $20, $20, $20,
$20, $00, $20, $00, $20, $00, $00, $02, $00,
$00, $20, $20, $20, $20, $20, $20, $20, $00,
$21, $00, $20, $00, $00, $02, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $00, $00, $01,
$00, $00, $00, $00, $00, $00, $00, $00, $00,
$00, $00, $00, $00, $00, $00, $01, $00, $00,
$00, $00, $00, $01, $01, $01, $01, $00, $00,
$01, $01, $01, $00, $01, $01, $01, $01, $01,
$01, $01, $01, $01, $00, $01, $00, $00, $00,
$01, $01, $01, $01, $00, $01, $01, $00, $01,
$01, $01, $01, $00, $00, $01, $01, $01, $00,
$01, $01, $01, $01, $01, $01, $01, $01, $01,
$00, $01, $00, $00, $00, $01, $01, $01, $01,
$00, $01, $01, $02
);
var InFile, outFile: File of byte;
inFileExists, outFileExists: boolean;
inFileName, outFileName, Path: String;
Ch,Cho: Byte;
begin
{ Just clear screen }
clrscr;
writeln('Converter windows-1257 to unicode...');
writeln('By Vaidotas Gaidelis, 2006');
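{ Get the current directory (not necessarily where the program is) }
GetDir(0, Path);
{ Avoid a doubled backslash when the current directory is a root such as C:\ }
if Path[Length(Path)] <> '\' then Path := Path + '\';
{ Build the file names from the command line parameters }
inFileName := Path+ParamStr(1);
outFileName := Path+ParamStr(2);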
writeln(inFileName+ ' -> '+outFileName);
{$I-}
Assign(inFile, inFileName);
Reset(inFile);
Close(inFile);
Assign(outFile, outFileName);
Rewrite(outFile);
Close(outFile);
{$I+}
{ inFileName always contains the path, so check the parameter count instead }
inFileExists := (IOResult = 0) And (ParamCount >= 2);
If inFileExists then
Begin
ReWrite(outFile);
Reset(inFile);
{ The first two bytes mark the file as Unicode }
Cho := $FF;
Write(outFile, Cho);
Cho := $FE;
Write(outFile, Cho);
While Not EOF(inFile) Do
Begin
{ Get symbol's number }
Read(inFile, Ch);
{ Write two unicode bytes from array }
Write(outFile, unicodeArrayFb[Ch]);
Write(outFile, unicodeArraySb[Ch]);
End;
Close(inFile);
Close(outFile);
End Else
Begin
{ whoops, something went wrong. Try again ;) }
Writeln('File not found!');
Writeln('Usage: cp2uni.exe [inFile] [outFile]');
End;
end.
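If you want to check the result without a hex editor, something like the following sketch could do it. It is not part of cp2uni, and the program name chkbom is invented here. It only looks at the first two bytes of the file named on the command line and says whether they are $FF $FE.

```pascal
program chkbom;

var
  f: file of Byte;
  b1, b2: Byte;
begin
  if ParamCount < 1 then
  begin
    Writeln('Usage: chkbom [file]');
    Halt(1);
  end;
  Assign(f, ParamStr(1));
  {$I-} Reset(f); {$I+}
  if IOResult <> 0 then
  begin
    Writeln('Cannot open ', ParamStr(1));
    Halt(1);
  end;
  { For a file of Byte, FileSize returns the length in bytes }
  if FileSize(f) < 2 then
    Writeln('Too short to hold the two marker bytes')
  else
  begin
    Read(f, b1);
    Read(f, b2);
    if (b1 = $FF) and (b2 = $FE) then
      Writeln('Starts with FF FE')
    else
      Writeln('No FF FE marker at the start');
  end;
  Close(f);
end.
```

Running it on a file written by the converter should report the marker; running it on the original cp1257 file normally should not.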
I chose Turbo Pascal simply because I needed a converter for DOS, and it was a pleasure to remember my school days :) But if you're a programmer, you can rewrite this app in whatever programming language you want or need ;)
You can also download the compiled version: cp2uni (~7 KB)
Note: This app converts from cp1257 (or windows-1257) to Unicode. It doesn't check whether the input encoding is right; it's just a blind conversion. Windows-1257 is the Baltic code page, which includes the Lithuanian letters. If you need to write a converter for other encodings, refer to http://www.orwell.ru and change the Unicode arrays to your desired table.
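When adapting the tables to another code page, one possible variation (a sketch, not how cp2uni is written) is to keep a single array of Word values and split each value into its two bytes with the standard Lo and Hi functions. The names wordtab, UniTab and HexByte are made up for this example, and the table is only an excerpt: codes $80 to $87 of windows-1257, combined from the two byte tables in the program.

```pascal
program wordtab;

const
  HexDigits: string[16] = '0123456789ABCDEF';
  FirstCode = $80;
  LastCode  = $87;
  { Excerpt only: windows-1257 codes $80..$87 as 16-bit Unicode values }
  UniTab: array[FirstCode..LastCode] of Word =
    ($20AC, $0020, $201A, $0020, $201E, $2026, $2020, $2021);

{ Hypothetical helper: format one byte as two hex digits }
function HexByte(b: Byte): String;
begin
  HexByte := HexDigits[(b shr 4) + 1] + HexDigits[(b and $0F) + 1];
end;

var
  c: Byte;
begin
  for c := FirstCode to LastCode do
    { Lo gives the byte written first, Hi the byte written second }
    Writeln('$', HexByte(c), ' -> ',
            HexByte(Lo(UniTab[c])), ' ', HexByte(Hi(UniTab[c])));
end.
```

The first line of output is $80 -> AC 20, the same pair the two byte tables hold at index 128. With a table like this, a new code page means copying one column of 16-bit values instead of maintaining two byte arrays in parallel.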
If you use this app, the code or a piece of it, be nice and mention that in the comments. I would also appreciate ideas or improvements for this converter. This converter is open source and freeware. USE IT AT YOUR OWN RISK.




