public final class Encodings extends Object
The Encoding Standard, which is a Candidate Recommendation as of
early November 2015, defines algorithms for the most common character
encodings used on Web pages and recommends the UTF-8 encoding for new
specifications and Web pages. Calling the GetEncoding(name)
method returns one of the character encodings with the given name
under the Encoding Standard.
Now let's define some terms.
Encoding Terms
There are several kinds of character encodings:
- Single-byte encodings, such as windows-1252 and US-ASCII. In the Encoding Standard, all single-byte encodings use the ASCII characters as the first 128 code points of their character sets.
- UTF-16LE and UTF-16BE are two encodings defined in the Unicode Standard. They use 2 bytes for the most common code points, and 4 bytes for supplementary code points.
- UTF-8 is another encoding defined in the Unicode Standard. It uses 1 byte for ASCII and 2 to 4 bytes for the other Unicode code points.
- Shift_JIS, GBK, and Big5 use 1 byte for ASCII (or a slightly modified version) and, usually, 2 or more bytes for national standard character sets. In many of these encodings, notably Shift_JIS, characters whose code points use one byte traditionally take half the space of characters whose code points use two bytes.
- ISO-2022-JP supports several escape sequences that shift into different encodings, including a Katakana, a Kanji, and an ASCII encoding (with ASCII as the default).
- hz-gb-2312.
Getting an Encoding
The Encoding Standard includes
UTF-8, UTF-16, and many legacy encodings, and gives each one of them
a name. The GetEncoding(name) method takes a name string and
returns an ICharacterEncoding object that implements that encoding,
or null if the name is unrecognized.
However, the
Encoding Standard is designed to include only encodings commonly used
on Web pages, not in other protocols such as email. For email, the
Encodings class includes an alternate function GetEncoding(name,
forEmail). Setting forEmail to true will use rules
modified from the Encoding Standard to better suit encoding and
decoding text from email messages.
Classes for Character Encodings
This Encodings class provides access to common character encodings through classes as described below:
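- Encoder classes implement the ICharacterEncoder interface.
- Decoder classes implement the ICharacterDecoder interface.
- Encoding classes implement the ICharacterEncoding interface. The encoder and decoder classes should implement the same character encoding.
Custom Encodings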
Classes that implement the ICharacterEncoding interface can provide additional character encodings not included in the Encoding Standard. Some examples of these include the following:
(Note that this library doesn't implement either encoding.)
| Modifier and Type | Field and Description |
|---|---|
| static ICharacterEncoding | UTF8: Character encoding object for the UTF-8 character encoding, which represents each code point in the universal character set using 1 to 4 bytes. |
| Modifier and Type | Method and Description |
|---|---|
| static String | DecodeToString(ICharacterEncoding enc, byte[] bytes): Reads a byte array from a data source and converts the bytes from a given encoding to a text string. |
| static String | DecodeToString(ICharacterEncoding enc, byte[] bytes, int offset, int length): Reads a portion of a byte array from a data source and converts the bytes from a given encoding to a text string. |
| static String | DecodeToString(ICharacterEncoding encoding, IByteReader input): Reads bytes from a data source and converts the bytes from a given encoding to a text string. |
| static String | DecodeToString(ICharacterEncoding enc, InputStream input): Not documented yet. |
| static byte[] | EncodeToBytes(ICharacterInput input, ICharacterEncoder encoder): Reads Unicode characters from a character input and writes them to a byte array encoded using a given character encoder. |
| static byte[] | EncodeToBytes(ICharacterInput input, ICharacterEncoding encoding): Reads Unicode characters from a character input and writes them to a byte array encoded using the given character encoding. |
| static byte[] | EncodeToBytes(String str, ICharacterEncoding enc): Reads Unicode characters from a text string and writes them to a byte array encoded in a given character encoding. |
| static void | EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, IWriter writer): Reads Unicode characters from a character input and writes them to a byte writer, encoded using a given character encoder. |
| static void | EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, OutputStream output): Reads Unicode characters from a character input and writes them to an output data stream, encoded using a given character encoder. |
| static void | EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, IWriter writer): Reads Unicode characters from a character input and writes them to a byte writer, encoded using the given character encoding. |
| static void | EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, OutputStream output): Reads Unicode characters from a character input and writes them to an output data stream, encoded using the given character encoding. |
| static void | EncodeToWriter(String str, ICharacterEncoding enc, IWriter writer): Converts a text string to bytes and writes the bytes to an output byte writer. |
| static void | EncodeToWriter(String str, ICharacterEncoding enc, OutputStream output): Converts a text string to bytes and writes the bytes to an output data stream. |
| static ICharacterInput | GetDecoderInput(ICharacterEncoding encoding, IByteReader stream): Converts a character encoding into a character input stream, given a streamable source of bytes. |
| static ICharacterInput | GetDecoderInput(ICharacterEncoding encoding, InputStream input): Not documented yet. |
| static ICharacterInput | GetDecoderInputSkipBom(ICharacterEncoding encoding, IByteReader stream): Converts a character encoding into a character input stream, given a streamable source of bytes. |
| static ICharacterInput | GetDecoderInputSkipBom(ICharacterEncoding encoding, InputStream input): Converts a character encoding into a character input stream, given a readable data stream. |
| static ICharacterEncoding | GetEncoding(String name): Returns a character encoding from the given name. |
| static ICharacterEncoding | GetEncoding(String name, boolean forEmail): Returns a character encoding from the given name. |
| static ICharacterEncoding | GetEncoding(String name, boolean forEmail, boolean allowReplacement): Returns a character encoding from the given name. |
| static String | InputToString(ICharacterInput reader): Reads Unicode characters from a character input and converts them to a text string. |
| static String | ResolveAlias(String name): Resolves a character encoding's name to a standard form. |
| static String | ResolveAliasForEmail(String name): Resolves a character encoding's name to a canonical form, using rules more suitable for email. |
| static byte[] | StringToBytes(ICharacterEncoder encoder, String str): Converts a text string to a byte array using the given character encoder. |
| static byte[] | StringToBytes(ICharacterEncoding encoding, String str): Converts a text string to a byte array encoded in a given character encoding. |
| static ICharacterInput | StringToInput(String str): Converts a text string to a character input. |
| static ICharacterInput | StringToInput(String str, int offset, int length): Converts a portion of a text string to a character input. |
public static final ICharacterEncoding UTF8
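The UTF-8 encoding represents each code point using 1 to 4 bytes. The following is a minimal sketch of that property using only the JDK's java.nio.charset.StandardCharsets, not this library's UTF8 field; the class name EncodedLengths and the helper utf8Length are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    // Returns how many bytes the JDK's UTF-8 encoder emits for one code point.
    // Assumes codePoint is a valid, non-surrogate Unicode code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // An ASCII letter, a Latin-1 letter, the euro sign, and a supplementary code point.
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X uses %d byte(s) in UTF-8%n", cp, utf8Length(cp));
        }
    }
}
```

The four samples cover each of the possible lengths, from 1 byte for ASCII to 4 bytes for a supplementary code point.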
public static String DecodeToString(ICharacterEncoding encoding, IByteReader input)
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.DecodeToString(input). If the object's class already has a DecodeToString method with the same parameters, that method takes precedence over this extension method.
encoding - An object that implements a given character encoding. Any bytes that can't be decoded are converted to the replacement character (U+FFFD).
input - An object that implements a byte stream.
NullPointerException - The parameter encoding or input is null.
public static String DecodeToString(ICharacterEncoding enc, InputStream input)
In the .NET implementation, this method is
implemented as an extension method to any object implementing
ICharacterEncoding and can be called as follows:
encoding.DecodeToString(input). If the object's class already
has a DecodeToString method with the same parameters, that method
takes precedence over this extension method.
enc - An object implementing a character encoding (gives access to an encoder and a decoder).
input - A readable byte stream.
NullPointerException - The parameter enc or input is null.
public static String DecodeToString(ICharacterEncoding enc, byte[] bytes)
In the .NET implementation, this method is implemented as an
extension method to any object implementing ICharacterEncoding and
can be called as follows: enc.DecodeToString(bytes). If the
object's class already has a DecodeToString method with the same
parameters, that method takes precedence over this extension
method.
enc - An object implementing a character encoding (gives access to an encoder and a decoder).
bytes - A byte array.
NullPointerException - The parameter enc or bytes is null.
public static String DecodeToString(ICharacterEncoding enc, byte[] bytes, int offset, int length)
In the .NET implementation, this method is implemented
as an extension method to any object implementing ICharacterEncoding
and can be called as follows: enc.DecodeToString(bytes, offset,
length). If the object's class already has a DecodeToString
method with the same parameters, that method takes precedence over
this extension method.
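The offset and length arguments of this overload are validated against the array's length, as described in the parameter notes below. The following is a sketch of that kind of range check, decoding with the JDK's String constructor rather than this library; the names SliceDecoding, checkRange, and decodeSlice are hypothetical, and the exception messages are assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SliceDecoding {
    // Rejects an (offset, length) pair that does not fit inside an array of the given size.
    static void checkRange(int total, int offset, int length) {
        if (offset < 0 || offset > total) {
            throw new IllegalArgumentException("offset out of range: " + offset);
        }
        if (length < 0 || length > total) {
            throw new IllegalArgumentException("length out of range: " + length);
        }
        if (total - offset < length) {
            throw new IllegalArgumentException("offset plus length exceeds the array's length");
        }
    }

    // Decodes only the requested portion of bytes using a JDK charset.
    static String decodeSlice(byte[] bytes, int offset, int length, Charset charset) {
        if (bytes == null || charset == null) {
            throw new NullPointerException("bytes or charset is null");
        }
        checkRange(bytes.length, offset, length);
        return new String(bytes, offset, length, charset);
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeSlice(data, 1, 3, StandardCharsets.UTF_8));
    }
}
```

Checking `total - offset < length` instead of `offset + length > total` avoids integer overflow when both values are large.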
enc - An object implementing a character encoding (gives access to an encoder and a decoder).
bytes - A byte array containing the desired portion to read.
offset - A zero-based index showing where the desired portion of bytes begins.
length - The length, in bytes, of the desired portion of bytes (but not more than the length of bytes).
NullPointerException - The parameter enc or bytes is null.
IllegalArgumentException - Either offset or length is less than 0 or greater than the length of bytes, or the length of bytes minus offset is less than length.
public static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoding encoding)
In the .NET
implementation, this method is implemented as an extension method to
any object implementing ICharacterInput and can be called as follows:
input.EncodeToBytes(encoding). If the object's class already
has an EncodeToBytes method with the same parameters, that method
takes precedence over this extension method.
input - An object that implements a stream of universal code points.
encoding - An object that implements a given character encoding.
NullPointerException - The parameter encoding is null.
public static byte[] EncodeToBytes(ICharacterInput input, ICharacterEncoder encoder)
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToBytes(encoder). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.
input - An object that implements a stream of universal code points.
encoder - An object that implements a character encoder.
NullPointerException - The parameter encoder or input is null.
public static byte[] EncodeToBytes(String str, ICharacterEncoding enc)
In the .NET implementation, this method is implemented as an extension method to any String object and can be called as follows: str.EncodeToBytes(enc). If the object's class already has an EncodeToBytes method with the same parameters, that method takes precedence over this extension method.
str - A text string to encode.
enc - An object implementing a character encoding (gives access to an encoder and a decoder).
NullPointerException - The parameter str or enc is null.
public static void EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, IWriter writer)
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoding, writer). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.
input - An object that implements a stream of universal code points.
encoding - An object that implements a character encoding.
writer - A byte writer to write the encoded bytes to.
NullPointerException - The parameter encoding is null.
public static void EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, IWriter writer)
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoder, writer). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.
input - An object that implements a stream of universal code points.
encoder - An object that implements a character encoder.
writer - A byte writer to write the encoded bytes to.
NullPointerException - The parameter encoder or input is null.
public static void EncodeToWriter(String str, ICharacterEncoding enc, IWriter writer)
In the .NET implementation, this method is implemented as an extension method to any String object and can be called as follows: str.EncodeToWriter(enc, writer). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.
str - A text string to encode.
enc - An object implementing a character encoding (gives access to an encoder and a decoder).
writer - A byte writer where the encoded bytes will be written to.
NullPointerException - The parameter str or enc is null.
public static void EncodeToWriter(ICharacterInput input, ICharacterEncoding encoding, OutputStream output) throws IOException
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoding, output). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.
input - An object that implements a stream of universal code points.
encoding - An object that implements a character encoding.
output - A writable data stream.
NullPointerException - The parameter encoding is null.
IOException
public static void EncodeToWriter(ICharacterInput input, ICharacterEncoder encoder, OutputStream output) throws IOException
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: input.EncodeToWriter(encoder, output). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.
input - An object that implements a stream of universal code points.
encoder - An object that implements a character encoder.
output - A writable data stream.
NullPointerException - The parameter encoder or input is null.
IOException
public static void EncodeToWriter(String str, ICharacterEncoding enc, OutputStream output) throws IOException
In the .NET implementation, this method is implemented as an extension method to any String object and can be called as follows: str.EncodeToWriter(enc, output). If the object's class already has an EncodeToWriter method with the same parameters, that method takes precedence over this extension method.
str - A text string to encode.
enc - An object implementing a character encoding (gives access to an encoder and a decoder).
output - A writable data stream.
NullPointerException - The parameter str or enc is null.
IOException
public static ICharacterInput GetDecoderInput(ICharacterEncoding encoding, IByteReader stream)
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.GetDecoderInput(stream). If the object's class already has a GetDecoderInput method with the same parameters, that method takes precedence over this extension method.
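The character input returned by this method substitutes a replacement character whenever the underlying decoder reports a decode error with the value -2. The following is a self-contained sketch of that substitution pattern; it does not use this library's interfaces. The CodePointDecoder interface, the toy asciiDecoder, and the use of -1 to signal the end of input are all assumptions made for illustration.

```java
class ReplacementOnError {
    // Hypothetical decoder shape: returns a code point, -1 at the end of input
    // (an assumption of this sketch), or -2 to report a decode error.
    interface CodePointDecoder {
        int readChar();
    }

    // Toy decoder that accepts only ASCII bytes and reports -2 for anything else.
    static CodePointDecoder asciiDecoder(byte[] bytes) {
        int[] pos = {0};
        return () -> {
            if (pos[0] >= bytes.length) {
                return -1;
            }
            int b = bytes[pos[0]++] & 0xFF;
            return b < 0x80 ? b : -2;
        };
    }

    // Drains the decoder, substituting U+FFFD for each reported decode error.
    static String readAll(CodePointDecoder decoder) {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = decoder.readChar()) != -1) {
            sb.appendCodePoint(c == -2 ? 0xFFFD : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = {0x41, (byte) 0xFF, 0x42};
        // The invalid middle byte becomes one replacement character.
        System.out.println(readAll(asciiDecoder(data)).length());
    }
}
```

Keeping the error signal separate from the substitution lets a caller choose a different policy, such as stopping at the first error, without changing the decoder.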
encoding - Encoding that exposes a decoder to be converted into a character input stream. If the decoder returns -2 (indicating a decode error), the character input stream handles the error by returning a replacement character in its place.
stream - Byte stream to convert into Unicode characters.
public static ICharacterInput GetDecoderInput(ICharacterEncoding encoding, InputStream input)
In the .NET implementation, this method is
implemented as an extension method to any object implementing
ICharacterEncoding and can be called as follows:
encoding.GetDecoderInput(input). If the object's class already
has a GetDecoderInput method with the same parameters, that method
takes precedence over this extension method.
encoding - Encoding object that exposes a decoder to be converted into a character input stream. If the decoder returns -2 (indicating a decode error), the character input stream handles the error by returning a replacement character in its place.
input - Byte stream to convert into Unicode characters.
public static ICharacterInput GetDecoderInputSkipBom(ICharacterEncoding encoding, IByteReader stream)
This method implements the "decode" algorithm specified in the Encoding Standard.
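In the Encoding Standard's "decode" algorithm, a byte order mark at the start of the input takes precedence over the caller's encoding and is not included in the output. The following sketch shows that idea with JDK charsets only; it is not this library's implementation, and the names BomSniffing and decodeSkippingBom are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffing {
    // Decodes bytes with the fallback charset unless the input starts with a
    // UTF-8, UTF-16BE, or UTF-16LE byte order mark, in which case that encoding
    // is used and the byte order mark is skipped.
    static String decodeSkippingBom(byte[] b, Charset fallback) {
        Charset charset = fallback;
        int skip = 0;
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            skip = 3;
        } else if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            skip = 2;
        } else if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            skip = 2;
        }
        return new String(b, skip, b.length - skip, charset);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x68, 0x69};
        // The UTF-8 byte order mark overrides the ISO-8859-1 fallback.
        System.out.println(decodeSkippingBom(withBom, StandardCharsets.ISO_8859_1));
    }
}
```

Input without a byte order mark is decoded entirely with the fallback charset.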
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterEncoding and can be called as follows: encoding.GetDecoderInputSkipBom(stream). If the object's class already has a GetDecoderInputSkipBom method with the same parameters, that method takes precedence over this extension method.
encoding - Encoding object that exposes a decoder to be converted into a character input stream. If the decoder returns -2 (indicating a decode error), the character input stream handles the error by returning a replacement character in its place.
stream - Byte stream to convert into Unicode characters.
public static ICharacterInput GetDecoderInputSkipBom(ICharacterEncoding encoding, InputStream input)
In the .NET implementation, this method is implemented as an
extension method to any object implementing ICharacterEncoding and
can be called as follows:
encoding.GetDecoderInputSkipBom(input). If the object's class
already has a GetDecoderInputSkipBom method with the same
parameters, that method takes precedence over this extension
method.
encoding - Encoding object that exposes a decoder to be converted into a character input stream. If the decoder returns -2 (indicating a decode error), the character input stream handles the error by returning a replacement character in its place.
input - Byte stream to convert into Unicode characters.
public static ICharacterEncoding GetEncoding(String name)
name - A string naming a character encoding. See the ResolveAlias method. Can be null.
public static ICharacterEncoding GetEncoding(String name, boolean forEmail)
name - A string naming a character encoding. See the ResolveAlias method. Can be null.
forEmail - If false, uses the encoding resolution rules in the Encoding Standard. If true, uses modified rules as described in the ResolveAliasForEmail method.
public static ICharacterEncoding GetEncoding(String name, boolean forEmail, boolean allowReplacement)
name - A string naming a character encoding. See the ResolveAlias method. Can be null.
forEmail - If false, uses the encoding resolution rules in the Encoding Standard. If true, uses modified rules as described in the ResolveAliasForEmail method.
allowReplacement - If true, allows the label replacement to return the replacement encoding.
public static String InputToString(ICharacterInput reader)
In the .NET implementation, this method is implemented as an extension method to any object implementing ICharacterInput and can be called as follows: reader.InputToString(). If the object's class already has an InputToString method with the same parameters, that method takes precedence over this extension method.
reader - A character input whose characters will be converted to a text string.
public static String ResolveAlias(String name)
In several Internet specifications, an encoding's name is known as a "charset" parameter. In HTML and HTTP, for example, the "charset" parameter indicates the encoding used to represent text in the HTML page, text file, etc.
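Resolving a "charset" label amounts to normalizing it (removing surrounding whitespace and lowercasing) and then looking it up in an alias table. The following is a rough sketch of that process, not this library's implementation. The class LabelLookup, its methods, and the decision to return an empty string for unknown labels are assumptions of this sketch, and the table is a tiny illustrative subset of the labels the Encoding Standard defines.

```java
import java.util.HashMap;
import java.util.Map;

class LabelLookup {
    // A small, illustrative subset of labels; the real table is much larger.
    private static final Map<String, String> LABELS = new HashMap<>();
    static {
        LABELS.put("utf-8", "UTF-8");
        LABELS.put("utf8", "UTF-8");
        LABELS.put("utf-16", "UTF-16LE");
        LABELS.put("utf-16le", "UTF-16LE");
        LABELS.put("windows-1252", "windows-1252");
        LABELS.put("us-ascii", "windows-1252");
        LABELS.put("iso-8859-1", "windows-1252");
        LABELS.put("latin1", "windows-1252");
    }

    private static boolean isAsciiWhitespace(char c) {
        return c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    // Strips leading and trailing ASCII whitespace and lowercases ASCII letters only.
    static String normalize(String name) {
        int start = 0;
        int end = name.length();
        while (start < end && isAsciiWhitespace(name.charAt(start))) {
            start++;
        }
        while (end > start && isAsciiWhitespace(name.charAt(end - 1))) {
            end--;
        }
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) {
            char c = name.charAt(i);
            sb.append(c >= 'A' && c <= 'Z' ? (char) (c + 32) : c);
        }
        return sb.toString();
    }

    // Returns the name this sketch associates with the label, or "" if it is unknown.
    static String resolve(String name) {
        if (name == null) {
            return "";
        }
        return LABELS.getOrDefault(normalize(name), "");
    }

    public static void main(String[] args) {
        System.out.println(resolve("  Latin1 "));
    }
}
```

Lowercasing only ASCII letters keeps non-ASCII look-alike characters from accidentally matching a table entry.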
name - A string that names a given character encoding. Can be null. Any leading and trailing whitespace is removed and the name converted to lowercase before resolving the encoding's name. The Encoding Standard supports only the following encodings (and defines aliases for most of them).
- UTF-8 - UTF-8 (8-bit encoding of the universal character set, the encoding recommended by the Encoding Standard for new data formats)
- UTF-16LE - UTF-16 little-endian (16-bit UCS)
- UTF-16BE - UTF-16 big-endian (16-bit UCS)
- x-user-defined
- replacement, which this function returns only if one of several aliases is passed to it, as defined in the Encoding Standard.
- windows-1252 : Western Europe (Note: The Encoding Standard aliases the names US-ASCII and ISO-8859-1 to windows-1252, which uses a different character set from either; it differs from ISO-8859-1 by assigning different characters to some bytes from 0x80 to 0x9F. The Encoding Standard does this for compatibility with existing Web pages.)
- ISO-8859-2, windows-1250 : Central Europe
- ISO-8859-10 : Northern Europe
- ISO-8859-4, windows-1257 : Baltic
- ISO-8859-13 : Estonian
- ISO-8859-14 : Celtic
- ISO-8859-16 : Romanian
- ISO-8859-5, IBM-866, KOI8-R, windows-1251, x-mac-cyrillic : Cyrillic
- KOI8-U : Ukrainian
- ISO-8859-7, windows-1253 : Greek
- ISO-8859-6, windows-1256 : Arabic
- ISO-8859-8, ISO-8859-8-I, windows-1255 : Hebrew
- ISO-8859-3 : Latin 3
- ISO-8859-15, windows-1254 : Turkish
- windows-874 : Thai
- windows-1258 : Vietnamese
- macintosh : Mac Roman
- Shift_JIS, EUC-JP, ISO-2022-JP
- GBK and gb18030
- Big5 : legacy traditional Chinese encoding
- EUC-KR : legacy Korean encoding
The UTF-8, UTF-16LE, and UTF-16BE encodings don't encode a byte-order mark at the start of the text (doing so is not recommended for UTF-8, while in UTF-16LE and UTF-16BE, the byte-order mark character U+FEFF is treated as an ordinary character, unlike in the UTF-16 encoding form). The Encoding Standard aliases UTF-16 to UTF-16LE "to deal with deployed content".
Returns the empty string if name is null or empty, or if the encoding name is unsupported.
public static String ResolveAliasForEmail(String name)
name - A string naming a character encoding. Can be null. Uses a modified version of the rules in the Encoding Standard to better conform, in some cases, to email standards like MIME. In addition to the encodings mentioned in ResolveAlias, the following additional encodings are supported:
- US-ASCII - ASCII single-byte encoding, rather than an alias to windows-1252 as specified in the Encoding Standard. The character set's code points match those in the Unicode Standard's Basic Latin block (0-127 or U+0000 to U+007F).
- ISO-8859-1 - Latin-1 single-byte encoding, rather than an alias to windows-1252 as specified in the Encoding Standard. The character set's code points match those in the Unicode Standard's Basic Latin and Latin-1 Supplement blocks (0-255 or U+0000 to U+00FF).
- UTF-7 - UTF-7 (7-bit universal character set).
Returns the empty string if name is null or empty, or if the encoding name is unsupported.
public static byte[] StringToBytes(ICharacterEncoding encoding, String str)
In the
.NET implementation, this method is implemented as an extension
method to any object implementing ICharacterEncoding and can be
called as follows: encoding.StringToBytes(str). If the
object's class already has a StringToBytes method with the same
parameters, that method takes precedence over this extension
method.
encoding - An object that implements a character encoding.
str - A string to be encoded into a byte array.
NullPointerException - The parameter encoding is null.
public static byte[] StringToBytes(ICharacterEncoder encoder, String str)
In the .NET
implementation, this method is implemented as an extension method to
any object implementing ICharacterEncoder and can be called as
follows: encoder.StringToBytes(str). If the object's class
already has a StringToBytes method with the same parameters, that
method takes precedence over this extension method.
encoder - An object that implements a character encoder.
str - A text string to encode into a byte array.
NullPointerException - The parameter encoder or str is null.
public static ICharacterInput StringToInput(String str)
In the .NET implementation, this method is implemented as an extension method to any String object and can be called as follows: str.StringToInput(). If the object's class already has a StringToInput method with the same parameters, that method takes precedence over this extension method.
str - A text string.
NullPointerException - The parameter str is null.
public static ICharacterInput StringToInput(String str, int offset, int length)
In the .NET implementation, this
method is implemented as an extension method to any String object and
can be called as follows: str.StringToInput(offset, length).
If the object's class already has a StringToInput method with the
same parameters, that method takes precedence over this extension
method.
str - A text string.
offset - A zero-based index showing where the desired portion of str begins.
length - The length, in code units, of the desired portion of str (but not more than the length of str).
NullPointerException - The parameter str is null.
IllegalArgumentException - Either offset or length is less than 0 or greater than the length of str, or the length of str minus offset is less than length.
Encoding for Java documentation, generated in 2017.