RDBMS Reader
RDBMS Reader plugin supports reading data from traditional RDBMS. This is a generic relational database reading plugin that can support more relational database reading by registering database drivers.
At the same time, RDBMS Reader is also the base class for other relational database reading plugins. The following reading plugins all depend on this plugin:
- Oracle Reader
- MySQL Reader
- PostgreSQL Reader
- ClickHouse Reader
- SQLServer Reader
- Access Reader
- Databend Reader
Note: If a dedicated database reading plugin is already provided, it is recommended to use the dedicated plugin. If the database you need to read does not have a dedicated plugin, consider using this generic plugin. Before use, you need to perform the following operations to run normally, otherwise exceptions will occur.
Configure Driver
Suppose you need to read data from IBM DB2. Since no dedicated reading plugin is provided, we can use this plugin to implement it. Before use, you need to download the corresponding JDBC driver and copy it to the plugin/reader/rdbmsreader/libs directory. If your driver class name is special, you need to find the driver item in the task configuration file and fill in the correct JDBC driver name, such as DB2's driver name com.ibm.db2.jcc.DB2Driver. If not filled, the plugin will automatically guess the driver name.
The following lists common databases and their corresponding driver names:
- Apache Impala:
com.cloudera.impala.jdbc41.Driver - Enterprise DB:
com.edb.Driver - PrestoDB:
com.facebook.presto.jdbc.PrestoDriver - IBM DB2:
com.ibm.db2.jcc.DB2Driver - MySQL:
com.mysql.cj.jdbc.Driver - Sybase Server:
com.sybase.jdbc3.jdbc.SybDriver - TDengine:
com.taosdata.jdbc.TSDBDriver - 达梦数据库:
dm.jdbc.driver.DmDriver - 星环Inceptor:
io.transwarp.jdbc.InceptorDriver - TrinoDB:
io.trino.jdbc.TrinoDriver - PrestoSQL:
io.prestosql.jdbc.PrestoDriver - Oracle DB:
oracle.jdbc.OracleDriver - PostgreSQL:
org.postgresql.Drive
Configuration
The following configuration shows how to read data from Presto database to terminal
{
"job": {
"setting": {
"speed": {
"byte": 1048576,
"channel": 1
},
"errorLimit": {
"record": 0,
"percentage": 0.02
}
},
"content": {
"reader": {
"name": "rdbmsreader",
"parameter": {
"username": "hive",
"password": "",
"column": [
"*"
],
"driver": "io.prestosql.jdbc.PrestoDriver",
"connection": {
"table": [
"default.table"
],
"jdbcUrl": "jdbc:presto://127.0.0.1:8080/hive"
},
"fetchSize": 1024,
"where": "1 = 1"
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"print": true
}
}
}
}
}Parameters
| Configuration | Required | Data Type | Default Value | Description |
|---|---|---|---|---|
| jdbcUrl | Yes | list | None | JDBC connection information of the target database, jdbcUrl follows RDBMS official specifications and can include connection attachment control information |
| driver | No | string | None | Custom driver class name to solve compatibility issues, see description below |
| username | Yes | string | None | Username of the data source |
| password | No | string | None | Password for the specified username of the data source |
| table | Yes | list | None | Selected table names to be synchronized, using JSON data format. When configured for multiple tables, users need to ensure that multiple tables have the same table structure |
| column | Yes | list | None | Collection of column names to be synchronized in the configured table, detailed description below |
| splitPk | No | string | None | Use the field represented by splitPk for data sharding, which can greatly improve data synchronization efficiency, see notes below |
| autoPk | No | boolean | false | Whether to automatically guess the sharding primary key, introduced in version 3.2.6, see description below |
| where | No | string | None | Filtering conditions for the table |
| session | No | list | None | For local connections, modify session configuration, see below |
| querySql | No | string | None | Use custom SQL instead of specified table to get data. When this item is configured, table and column configuration items are ignored |
| fetchSize | No | int | 1024 | Defines the number of batch data fetched between plugin and database server each time. Increasing this value may cause Addax OOM |
| excludeColumn | No | list | None | Column name fields to be excluded, only valid when column is configured as * |
jdbcUrl
In addition to configuring necessary information, jdbcUrl configuration can also add specific configuration properties for each specific driver. Here we particularly mention that we can use configuration properties to support proxies to access databases through proxies. For example, for PrestoSQL database JDBC driver, it supports the socksProxy parameter, so the above configured jdbcUrl can be modified to:
jdbc:presto://127.0.0.1:8080/hive?socksProxy=192.168.1.101:1081
Most relational database JDBC drivers support socksProxyHost,socksProxyPort parameters for proxy access. There are also some special cases.
The following are the proxy types and configuration methods supported by various database JDBC drivers:
| Database | Proxy Type | Proxy Configuration | Example |
|---|---|---|---|
| MySQL | socks | socksProxyHost,socksProxyPort | socksProxyHost=192.168.1.101&socksProxyPort=1081 |
| Presto | socks | socksProxy | socksProxy=192.168.1.101:1081 |
| Presto | http | httpProxy | httpProxy=192.168.1.101:3128 |
driver
In most cases, the JDBC driver for a database is fixed, but some have different recommended driver class names due to different versions, such as MySQL. The new MySQL JDBC driver type recommends using com.mysql.cj.jdbc.Driver instead of the previous com.mysql.jdbc.Driver. If you want to use the old driver name, you can configure the driver configuration item. Otherwise, the plugin will automatically guess the driver name based on the string in jdbcUrl.
column
Collection of column names to be synchronized in the configured table, using JSON array to describe field information. Users use * to represent default use of all column configurations, such as ["*"].
Supports column pruning, i.e., columns can be selected for partial export.
Supports column reordering, i.e., columns can be exported not according to table schema information.
Supports constant configuration, users need to follow JSON format:
["id", "`table`", "1", "'bazhen.csy'", "null", "to_char(a + 1)", "2.3" , "true"]
idis ordinary column name`table`is column name containing reserved words1is integer constant'bazhen.csy'is string constantnullis null pointer. Note thatnullhere must appear as a string, i.e., quoted with double quotesto_char(a + 1)is expression2.3is floating point numbertrueis boolean value. Similarly, boolean values here must also be quoted with double quotes
Column must be explicitly filled and cannot be empty!
excludeColumn
There is a situation where we need to read most fields of a table. If the table has many fields, configuring column is obviously time-consuming. In particular, when we collect business data to big data platforms, we generally add some additional fields including partition fields and collection information. When we need to write back to business data tables, we need to exclude these fields. Under this consideration, we introduced the excludeColumn configuration item. When column is configured as *, the excludeColumn configuration item takes effect to exclude some fields.
For example:
{
"column": ["*"],
"excludeColumn": ["partition_col", "etl_time"]
}