chardetect()

Description:

Get the character encoding standard used by a string or a text file.

Syntax:

chardetect(fn,cs)

Note:

When no options are used, the function identifies the character encoding of text file fn. Supported encodings include UTF-8, GBK, UTF-16LE and UTF-16BE. Text files originally encoded in GBK, GB2312 or GB18030 are all reported as GB18030.
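
Reporting GB18030 for all three encodings is consistent with GB18030 being a superset of GBK, which is in turn a superset of GB2312. The Python sketch below is for illustration only: it uses Python's built-in codecs, is not SPL, and says nothing about how chardetect() works internally. It checks that text encoded with any of the three decodes unchanged as GB18030. The helper name `decodes_as_gb18030` is hypothetical.

```python
# Illustration only: Python's built-in codecs, not chardetect().
text = "abc一二三123"

def decodes_as_gb18030(legacy_encoding: str) -> bool:
    """Encode the sample with a legacy codec, then decode the bytes as GB18030."""
    raw = text.encode(legacy_encoding)
    return raw.decode("gb18030") == text

# One entry per source encoding; True means the round trip preserved the text.
results = {name: decodes_as_gb18030(name) for name in ("gb2312", "gbk", "gb18030")}
print(results)
```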

 

When the input is a string or binary value containing (Traditional) Chinese, Japanese or Korean characters, more than one character set may match, because the character sets for these three languages overlap.
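
The overlap can be seen by trying to decode the same bytes with several codecs. The following is a minimal Python sketch, assuming that "eligible" simply means "decodes without error". That is a simplification and not necessarily the rule chardetect() applies. The function name `candidate_encodings` and the codec list are hypothetical choices for illustration.

```python
def candidate_encodings(data: bytes, encodings: list) -> list:
    """Return the codecs from `encodings` that decode `data` without error."""
    matches = []
    for name in encodings:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this codec
        matches.append(name)
    return matches

# The bytes of "你好" as stored in a GB2312-encoded source.
sample = "你好".encode("gb2312")
names = ["utf-8", "gb2312", "gbk", "gb18030", "big5", "cp949", "shift_jis"]
candidates = candidate_encodings(sample, names)
print(candidates)
```

Several double-byte codecs typically accept the same byte sequence, which is why a detector may report more than one candidate for short CJK input.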

 

By default, the function returns the first matching character set. When parameter cs is present, it returns the first matching character set in the cs list.

 

fn is interpreted as a URL if it begins with "http://" or "https://".
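
The prefix rule above can be sketched in Python as follows. This is an illustration, not the function's actual code. The name `looks_like_url` is hypothetical, and case-sensitive matching is an assumption because the text does not say how upper-case prefixes are treated.

```python
def looks_like_url(fn: str) -> bool:
    """Return True when fn starts with one of the two documented URL prefixes."""
    # Assumption: the comparison is case-sensitive.
    return fn.startswith(("http://", "https://"))

print(looks_like_url("http://www.baidu.com"))
print(looks_like_url("d:/UTF8.xml"))
```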

 

Option:

@v

Detect the character encoding of fn itself, where fn is a string or a binary value.

@a

Return a sequence of all eligible character encodings; without this option, only the first eligible one is returned.
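
How parameter cs and option @a interact can be modelled with a small Python sketch. It is a model of the documented behaviour, not the real implementation. The function `select`, the `eligible` list, returning results in cs order, and returning `None` when nothing matches are all assumptions made for illustration.

```python
from typing import List, Optional, Union

def select(eligible: List[str],
           cs: Optional[List[str]] = None,
           return_all: bool = False) -> Union[str, List[str], None]:
    """Pick from the eligible character sets the way the options describe."""
    if cs is not None:
        # Keep only the cs entries that are eligible, in cs order (assumption).
        pool = [name for name in cs if name in eligible]
    else:
        pool = list(eligible)
    if return_all:               # models option @a
        return pool
    return pool[0] if pool else None  # None for "no match" is an assumption

# Hypothetical detection result for a short CJK string.
eligible = ["GB2312", "Big5", "CP949"]

print(select(eligible))                                         # default: first eligible
print(select(eligible, cs=["Big5", "CP949"]))                   # first eligible in cs
print(select(eligible, cs=["Big5", "CP949"], return_all=True))  # all eligible in cs
```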

Parameter:

fn

A string or binary value (with @v), the name of the text file to be identified, or a file object or URL of the text file to be identified

cs

A list of candidate character encodings to choose from; can be omitted

Return value:

A charset value or a sequence of charset values

Example:

 

      A                                            
1     >www="http://www.baidu.com"
2     =chardetect(www)                             UTF-8
3     =chardetect@v("abc一二三123")                 GB2312
4     >file1="d:/UTF8.xml"                         The file is encoded in UTF-8
5     >file2="d:/UTF16LE.xml"                      The file is encoded in UTF-16LE
6     =chardetect(file1)                           UTF-8; parameter fn is a file name
7     =file(file2)
8     =chardetect(A7)                              UTF-16LE; parameter fn is a file object
9     =chardetect@v("你好")                         GB2312
10    =chardetect@av("你好")                        Return a list of all eligible character encodings
11    =chardetect@v("你好",["Big5","CP949"])        Return the first eligible character set in the cs list: Big5
12    =chardetect@va("你好",["Big5","CP949"])       Return all eligible character sets in the cs list