Character Encoding

DataLoad, PHP and the Oracle client are installed pre-configured to use the same character set, which supports all characters and languages and works with all Oracle databases. All PHP functions may be used and will seamlessly handle data regardless of the characters and languages it contains. Users who are unfamiliar with character set encoding, or who do not wish to change the DataLoad default configuration, do not need to do anything and do not need to read any further.

Put simply, character encoding refers to how text is represented as bytes on computers. Data (letters, symbols, etc) is represented differently by different encodings, and many encodings do not support all languages, so they may not be able to represent some data at all. Because many different encodings exist, it is very important that the correct encoding is used; otherwise unpredictable results will occur, such as data being displayed incorrectly or being corrupted. Therefore, DataLoad and the PHP scripts must exchange data using the same encoding, and that encoding must be able to support all data that is likely to be processed. Furthermore, if the script uses a database then that connection must also be set up to use the correct encoding.
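As an illustration of how the same character differs between encodings, the following minimal sketch uses the euro sign. It assumes the mbstring extension is available (as it is in the PHP supplied with DataLoad Scripting); the variable names are made up for the example.

```php
<?php
// The euro sign as UTF-8 bytes, written with escapes so that this file's
// own encoding does not affect the result.
$euroUtf8 = "\xE2\x82\xAC";

// The same character in ISO-8859-15 is a single byte.
$euroLatin9 = mb_convert_encoding($euroUtf8, 'ISO-8859-15', 'UTF-8');

echo bin2hex($euroUtf8), "\n";   // e282ac
echo bin2hex($euroLatin9), "\n"; // a4

// Wrongly treating the UTF-8 bytes as ISO-8859-1 turns one character
// into three unrelated ones, which is how corrupted text typically arises.
$garbled = mb_convert_encoding($euroUtf8, 'UTF-8', 'ISO-8859-1');
echo mb_strlen($garbled, 'UTF-8'), "\n"; // 3
```

The last step shows why both sides of an exchange must agree on the encoding: the bytes are unchanged, but reading them with the wrong encoding produces different characters.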

By default, DataLoad, PHP and Oracle database connections made using the Oracle class are all set up to use the Unicode UTF-8 encoding. This ensures each application uses the same encoding and can exchange data without the risk of data loss and without further data transformation being required. Furthermore, UTF-8 supports all languages and symbols, so this configuration will work for all users worldwide. PHP has been set up to support UTF-8 transparently, so all the standard string functions may be used without the script writer needing to consider character encoding, the actual length of strings, etc. This should be left unchanged unless the script writer has a good reason to use a different encoding approach.

The sections below describe in more detail how DataLoad Scripting is set up to use UTF-8.

DataLoad

DataLoad's communication with PHP uses UTF-8 encoding. That is, all data sent to PHP uses UTF-8, and DataLoad expects the data it receives from PHP to be encoded in UTF-8. If PHP is changed to use another encoding then the data sent to DataLoad via the PHP functions must still be encoded in UTF-8; furthermore, the script must convert the data received from DataLoad from UTF-8 to whatever encoding is being used.
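If a script does work in another encoding, the conversion in each direction could be sketched as follows. This is only an illustration: the helper names from_dataload() and to_dataload() and the choice of ISO-8859-15 are assumptions made for the example and are not part of DataLoad Scripting.

```php
<?php
// Hypothetical: the encoding this script has chosen to work in internally.
define('SCRIPT_ENCODING', 'ISO-8859-15');

// Convert a value received from DataLoad (always UTF-8) to the script encoding.
function from_dataload($value)
{
    return mb_convert_encoding($value, SCRIPT_ENCODING, 'UTF-8');
}

// Convert a value back to UTF-8 before it is passed to DataLoad.
function to_dataload($value)
{
    return mb_convert_encoding($value, 'UTF-8', SCRIPT_ENCODING);
}

$received = "caf\xC3\xA9";             // "café" as UTF-8 (5 bytes)
$internal = from_dataload($received);  // "caf\xE9" in ISO-8859-15 (4 bytes)
$sent     = to_dataload($internal);    // identical to $received again

echo bin2hex($internal), "\n"; // 636166e9
```

Note that a character with no equivalent in the script encoding cannot survive this round trip; by default mbstring replaces such characters with a question mark, which is one reason the UTF-8 default is preferable.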

PHP

PHP assumes all strings use one byte of memory per character and has no seamless support for Unicode (UTF-8 or other Unicode encodings). However, the mbstring extension adds support for multibyte strings, including Unicode encodings. The mbstring extension must be used with UTF-8 encoding because each character in a string may use more than one byte, and this is not supported by the standard PHP string functions. DataLoad Scripting provides PHP already configured with mbstring. Furthermore, mbstring is set up to overload, i.e. replace, all standard PHP string functions with the mbstring versions. This means the standard PHP string functions, e.g. substr, strpos, strlen, etc, can be used with UTF-8 strings and in the background the mbstring versions will actually be used. Thus the script writer can use PHP as normal and all strings will be correctly handled in UTF-8 encoding. This behaviour can be changed by editing the DataLoad section of the php.ini file (PHP\php.ini under the directory where DataLoad is installed) and changing the mbstring configuration accordingly. However, given that UTF-8 supports all languages and enables seamless data movement between DataLoad, PHP and Oracle, it is strongly recommended that this is not changed.
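The difference between counting bytes and counting characters can be seen in the short sketch below. It calls the mb_ functions explicitly so that it gives the same result whether or not overloading is enabled; with overloading enabled, plain strlen() and substr() would return the character-based results shown here. The mbstring.func_overload directive that is inspected at the end exists only in older PHP versions (it was removed in PHP 8.0), so the check is written to tolerate its absence.

```php
<?php
// "naïve" as UTF-8: five characters, but six bytes because "ï" needs two.
$word = "na\xC3\xAFve";

$characters = mb_strlen($word, 'UTF-8'); // counts characters
$bytes      = mb_strlen($word, '8bit');  // counts raw bytes

echo $characters, "\n"; // 5
echo $bytes, "\n";      // 6

// A character-aware substring never cuts a multibyte character in half.
$firstThree = mb_substr($word, 0, 3, 'UTF-8');
echo bin2hex($firstThree), "\n"; // 6e61c3af

// Report whether function overloading is configured in this PHP installation.
$overload = ini_get('mbstring.func_overload');
echo 'mbstring.func_overload: ',
    ($overload === false ? 'not available' : $overload), "\n";
```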

Oracle

Oracle databases store data in a particular character set. The character set is chosen by the DBA according to the data the database is expected to store and may be a local ANSI character set, e.g. WE8ISO8859P15, or a Unicode character set capable of supporting all languages, e.g. UTF-8. When client applications connect to an Oracle database they also use a character set, which specifies the encoding of the data retrieved from and sent to the database. The client character set may be the same as the database character set, but that is not required. However, the client character set must support all data to be processed by the client.

The character set may be specified as a parameter in the connect function when connecting to Oracle using the PHP OCI8 library or the Oracle library supplied with DataLoad. If DataLoad's Oracle library is used and the character set is not specified then the character set defaults to UTF-8. This matches the DataLoad and PHP character sets and thus ensures seamless data compatibility without having to specify a character set. If a character set is not specified when using OCI8 then the Oracle client will look for the NLS_LANG variable, first in the Windows environment and then in the registry, and will use whatever character set is specified there. If NLS_LANG is not set then the US7ASCII character set will be used. US7ASCII is compatible with UTF-8 but only supports English letters and numbers; other characters, such as accented European characters or Hebrew, Arabic, Japanese, etc, are not supported. Thus, if the OCI8 library is used, it is recommended that the character set is specified in the oci_connect() function call to provide certainty about which character set will be used. This should normally be set to "UTF8". If another character set is used then the script writer must take steps to ensure it is correctly handled in PHP and that data is converted to UTF-8 when sent to DataLoad.
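A minimal sketch of specifying the character set with OCI8 is shown below. The wrapper function connect_utf8() and the credentials in the usage comment are hypothetical; only oci_connect() and oci_error() are OCI8 functions, and the fourth argument of oci_connect() is the client character set.

```php
<?php
// Hypothetical helper: open an OCI8 connection with an explicit character
// set so the result does not depend on NLS_LANG.
function connect_utf8($username, $password, $connectString)
{
    // Fourth argument: the client character set for this connection.
    $connection = oci_connect($username, $password, $connectString, 'UTF8');

    if ($connection === false) {
        // With no argument, oci_error() reports the failed connection attempt.
        $error = oci_error();
        $message = $error ? $error['message'] : 'unknown error';
        throw new RuntimeException('Oracle connection failed: ' . $message);
    }

    return $connection;
}

// Example usage (placeholder credentials and connect string):
// $connection = connect_utf8('scott', 'tiger', 'localhost/XE');
```

Passing the character set explicitly means a missing or unexpected NLS_LANG setting on the machine cannot silently change how data is encoded.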