Excel Reader
Excel Reader plugin implements the ability to read data from Microsoft Excel files.
Configuration
Get Sample Files
Download the Excel compressed file for demonstration from here and extract it to the /tmp/in directory. The three folders have the same content, where:
demo.xlsxis the new Excel formatdemo.xlsis the old Excel formatdemo_gbk.xlsxis created under Windows and stored with GBK encoding
File content is shown in the following table:
| No. | Integer Type | Float Type | String Type | Date Type | Formula Calculation | Cell Formatting |
|---|---|---|---|---|---|---|
| 1 | 11 | 1102.234 | Addax 加上中文 | 2021/9/10 | 5544.17 | ¥1,102.23 |
| 2 | 12 | 1103.234 | Addax 加上中文 | 2021/9/11 | 5552.17 | ¥1,103.23 |
| 3 | 13 | 1104.234 | Addax 加上中文 | 2021/9/12 | 5560.17 | ¥1,104.23 |
| 4 | 14 | 1105.234 | Addax 加上中文 | 2021/9/13 | 5568.17 | ¥1,105.23 |
| 5 | 15 | 1106.234 | Addax 加上中文 | 2021/9/14 | 5576.17 | ¥1,106.23 |
| 6 | 16 | 1107.234 | Addax 加上中文 | 2021/9/15 | 5584.17 | ¥1,107.23 |
| 7 | 17 | 1108.234 | Addax 加上中文 | 2021/9/16 | 5592.17 | ¥1,108.23 |
| 8 | 18 | 1109.234 | Addax 加上中文 | 2021/9/17 | 5600.17 | ¥1,109.23 |
| 9 | 19 | 1110.234 | Addax 加上中文 | 2021/9/18 | 5608.17 | ¥1,110.23 |
| 10 | 20 | 1111.234 | Addax 加上中文 | 2021/9/19 | 5616.17 | ¥1,111.23 |
| 11 | 21 | 1112.234 | Addax 加上中文 | 2021/9/20 | 5624.17 | ¥1,112.23 |
| 12 | 22 | 1113.234 | Addax 加上中文 | 2021/9/21 | 5632.17 | ¥1,113.23 |
| 13 | 23 | 1114.234 | Addax 加上中文 | 2021/9/22 | 5640.17 | ¥1,114.23 |
| 14 | 24 | 1115.234 | Addax 加上中文 | 2021/9/23 | 5648.17 | ¥1,115.23 |
| 15 | 25 | 1116.234 | Addax 加上中文 | 2021/9/24 | 5656.17 | ¥1,116.23 |
| 16 | 26 | 1117.234 | Addax 加上中文 | 2021/9/25 | 5664.17 | ¥1,117.23 |
| 17 | 27 | 1118.234 | Addax 加上中文 | 2021/9/26 | 5672.17 | ¥1,118.23 |
| 18 | 28 | 1119.234 | Addax 加上中文 | 2021/9/27 | 5680.17 | ¥1,119.23 |
| 19 | 29 | 1120.234 | Addax 加上中文 | 2021/9/28 | 5688.17 | ¥1,120.23 |
| 20 | 30 | 1121.234 | Addax 加上中文 | 2021/9/29 | 5696.17 | ¥1,121.23 |
The table headers roughly describe the characteristics of cell data.
Create Job File
Create the following JSON file:
{
"job": {
"setting": {
"speed": {
"channel": 2,
"bytes": -1
}
},
"content": {
"reader": {
"name": "excelreader",
"parameter": {
"path": [
"/tmp/in"
],
"header": true,
"skipRows": 0
}
},
"writer": {
"parameter": {
"print": true
},
"name": "streamwriter"
}
}
}
}Save the output content to the job/excel2stream.json file and execute the collection command:
$ bin/addax.sh job/excel2stream.jsonIf there are no errors, you should get the following output:
Details
___ _ _
/ _ \ | | | |
/ /_\ \ __| | __| | __ ___ __
| _ |/ _` |/ _` |/ _` \ \/ /
| | | | (_| | (_| | (_| |> <
\_| |_/\__,_|\__,_|\__,_/_/\_\
:: Addax version :: (v4.0.3)
2021-09-09 14:43:42.579 [ main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2021-09-09 14:43:42.621 [ main] INFO Engine -
{
"content":
{
"reader":{
"parameter":{
"path":[
"/tmp/in"
],
"column":[
{
"name":"no",
"type":"long"
},
{
"name":"birth",
"format":"yyyy-MM-dd HH:mm:ss",
"type":"date"
},
{
"name":"kk",
"type":"string"
}
],
"header":true,
"skipHeader":true,
"encoding":"UTF-8",
"fieldDelimiter":","
},
"name":"excelreader"
},
"writer":{
"parameter":{
"print":true
},
"name":"streamwriter"
}
},
"setting":{
"speed":{
"bytes":-1,
"channel":2
}
}
}
2021-09-09 14:43:42.653 [ main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2021-09-09 14:43:42.653 [ main] INFO JobContainer - Addax jobContainer starts job.
2021-09-09 14:43:42.655 [ main] INFO JobContainer - Set jobId = 0
2021-09-09 14:43:42.669 [ job-0] INFO ExcelReader$Job - add file [/tmp/in/demo_old.xls] as a candidate to be read.
2021-09-09 14:43:42.669 [ job-0] INFO ExcelReader$Job - add file [/tmp/in/demo_gbk.xlsx] as a candidate to be read.
2021-09-09 14:43:42.670 [ job-0] INFO ExcelReader$Job - add file [/tmp/in/demo.xlsx] as a candidate to be read.
2021-09-09 14:43:42.670 [ job-0] INFO ExcelReader$Job - The number of files to read is: [3]
2021-09-09 14:43:42.677 [ job-0] INFO JobContainer - Addax Reader.Job [excelreader] do prepare work .
2021-09-09 14:43:42.678 [ job-0] INFO JobContainer - Addax Writer.Job [streamwriter] do prepare work .
2021-09-09 14:43:42.679 [ job-0] INFO JobContainer - Job set Channel-Number to 2 channels.
2021-09-09 14:43:42.681 [ job-0] INFO JobContainer - Addax Reader.Job [excelreader] splits to [3] tasks.
2021-09-09 14:43:42.682 [ job-0] INFO JobContainer - Addax Writer.Job [streamwriter] splits to [3] tasks.
2021-09-09 14:43:42.727 [ job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2021-09-09 14:43:42.736 [ taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [2] channels for [3] tasks.
2021-09-09 14:43:42.741 [ taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2021-09-09 14:43:42.742 [ taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2021-09-09 14:43:42.755 [0-0-1-reader] INFO ExcelReader$Task - The first row is skipped as a table header
2021-09-09 14:43:42.755 [0-0-1-reader] INFO ExcelReader$Task - begin read file /tmp/in/demo.xlsx
2021-09-09 14:43:42.757 [0-0-0-reader] INFO ExcelReader$Task - The first row is skipped as a table header
2021-09-09 14:43:42.758 [0-0-0-reader] INFO ExcelReader$Task - begin read file /tmp/in/demo_gbk.xlsx
1 11 1102.234 Addax加上中文 2021-09-10 00:00:00 5544.17 1102.234
1 12 1103.234 Addax加上中文 2021-09-11 00:00:00 5552.17 1103.234
1 13 1104.234 Addax加上中文 2021-09-12 00:00:00 5560.17 1104.234
1 14 1105.234 Addax加上中文 2021-09-13 00:00:00 5568.17 1105.234
1 15 1106.234 Addax加上中文 2021-09-14 00:00:00 5576.17 1106.234
1 16 1107.234 Addax加上中文 2021-09-15 00:00:00 5584.17 1107.234
1 17 1108.234 Addax加上中文 2021-09-16 00:00:00 5592.17 1108.234
1 18 1109.234 Addax加上中文 2021-09-17 00:00:00 5600.17 1109.234
1 19 1110.234 Addax加上中文 2021-09-18 00:00:00 5608.17 1110.234
1 20 1111.234 Addax加上中文 2021-09-19 00:00:00 5616.17 1111.234
1 21 1112.234 Addax加上中文 2021-09-20 00:00:00 5624.17 1112.234
1 22 1113.234 Addax加上中文 2021-09-21 00:00:00 5632.17 1113.234
1 23 1114.234 Addax加上中文 2021-09-22 00:00:00 5640.17 1114.234
1 24 1115.234 Addax加上中文 2021-09-23 00:00:00 5648.17 1115.234
1 25 1116.234 Addax加上中文 2021-09-24 00:00:00 5656.17 1116.234
1 26 1117.234 Addax加上中文 2021-09-25 00:00:00 5664.17 1117.234
1 27 1118.234 Addax加上中文 2021-09-26 00:00:00 5672.17 1118.234
1 28 1119.234 Addax加上中文 2021-09-27 00:00:00 5680.17 1119.234
1 29 1120.234 Addax加上中文 2021-09-28 00:00:00 5688.17 1120.234
1 30 1121.234 Addax加上中文 2021-09-29 00:00:00 5696.17 1121.234
1 11 1102.234 Addax加上中文 2021-09-10 00:00:00 5544.17 1102.234
2 12 1103.234 Addax加上中文 2021-09-11 00:00:00 5552.17 1103.234
3 13 1104.234 Addax加上中文 2021-09-12 00:00:00 5560.17 1104.234
4 14 1105.234 Addax加上中文 2021-09-13 00:00:00 5568.17 1105.234
5 15 1106.234 Addax加上中文 2021-09-14 00:00:00 5576.17 1106.234
6 16 1107.234 Addax加上中文 2021-09-15 00:00:00 5584.17 1107.234
7 17 1108.234 Addax加上中文 2021-09-16 00:00:00 5592.17 1108.234
8 18 1109.234 Addax加上中文 2021-09-17 00:00:00 5600.17 1109.234
9 19 1110.234 Addax加上中文 2021-09-18 00:00:00 5608.17 1110.234
10 20 1111.234 Addax加上中文 2021-09-19 00:00:00 5616.17 1111.234
11 21 1112.234 Addax加上中文 2021-09-20 00:00:00 5624.17 1112.234
12 22 1113.234 Addax加上中文 2021-09-21 00:00:00 5632.17 1113.234
13 23 1114.234 Addax加上中文 2021-09-22 00:00:00 5640.17 1114.234
14 24 1115.234 Addax加上中文 2021-09-23 00:00:00 5648.17 1115.234
15 25 1116.234 Addax加上中文 2021-09-24 00:00:00 5656.17 1116.234
16 26 1117.234 Addax加上中文 2021-09-25 00:00:00 5664.17 1117.234
17 27 1118.234 Addax加上中文 2021-09-26 00:00:00 5672.17 1118.234
18 28 1119.234 Addax加上中文 2021-09-27 00:00:00 5680.17 1119.234
19 29 1120.234 Addax加上中文 2021-09-28 00:00:00 5688.17 1120.234
20 30 1121.234 Addax加上中文 2021-09-29 00:00:00 5696.17 1121.234
2021-09-09 14:43:43.894 [0-0-2-reader] INFO ExcelReader$Task - The first row is skipped as a table header
2021-09-09 14:43:43.894 [0-0-2-reader] INFO ExcelReader$Task - begin read file /tmp/in/demo_old.xls
1 11 1102.234 Addax加上中文 2021-09-10 00:00:00 5544.17 1102.234
2 12 1103.234 Addax加上中文 2021-09-11 00:00:00 5552.17 1103.234
3 13 1104.234 Addax加上中文 2021-09-12 00:00:00 5560.17 1104.234
4 14 1105.234 Addax加上中文 2021-09-13 00:00:00 5568.17 1105.234
5 15 1106.234 Addax加上中文 2021-09-14 00:00:00 5576.17 1106.234
6 16 1107.234 Addax加上中文 2021-09-15 00:00:00 5584.17 1107.234
7 17 1108.234 Addax加上中文 2021-09-16 00:00:00 5592.17 1108.234
8 18 1109.234 Addax加上中文 2021-09-17 00:00:00 5600.17 1109.234
9 19 1110.234 Addax加上中文 2021-09-18 00:00:00 5608.17 1110.234
10 20 1111.234 Addax加上中文 2021-09-19 00:00:00 5616.17 1111.234
11 21 1112.234 Addax加上中文 2021-09-20 00:00:00 5624.17 1112.234
12 22 1113.234 Addax加上中文 2021-09-21 00:00:00 5632.17 1113.234
13 23 1114.234 Addax加上中文 2021-09-22 00:00:00 5640.17 1114.234
14 24 1115.234 Addax加上中文 2021-09-23 00:00:00 5648.17 1115.234
15 25 1116.234 Addax加上中文 2021-09-24 00:00:00 5656.17 1116.234
16 26 1117.234 Addax加上中文 2021-09-25 00:00:00 5664.17 1117.234
17 27 1118.234 Addax加上中文 2021-09-26 00:00:00 5672.17 1118.234
18 28 1119.234 Addax加上中文 2021-09-27 00:00:00 5680.17 1119.234
19 29 1120.234 Addax加上中文 2021-09-28 00:00:00 5688.17 1120.234
20 30 1121.234 Addax加上中文 2021-09-29 00:00:00 5696.17 1121.234
2021-09-09 14:43:45.753 [ job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2021-09-09 14:43:45.754 [ job-0] INFO JobContainer - Addax Writer.Job [streamwriter] do post work.
2021-09-09 14:43:45.756 [ job-0] INFO JobContainer - Addax Reader.Job [excelreader] do post work.
2021-09-09 14:43:45.761 [ job-0] INFO JobContainer - PerfTrace not enable!
2021-09-09 14:43:45.762 [ job-0] INFO StandAloneJobContainerCommunicator - Total 60 records, 3360 bytes | Speed 1.09KB/s, 20 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.993s | Percentage 100.00%
2021-09-09 14:43:45.764 [ job-0] INFO JobContainer -
任务启动时刻 : 2021-09-09 14:43:42
任务结束时刻 : 2021-09-09 14:43:45
任务总计耗时 : 3s
任务平均流量 : 1.09KB/s
记录写入速度 : 20rec/s
读出记录总数 : 60
读写失败总数 : 0Parameters
| Configuration | Required | Type | Default Value | Description |
|---|---|---|---|---|
| path | Yes | string/list | None | Specify the folder to read, multiple can be specified |
| header | No | boolean | false | Whether the file contains headers |
| skipRows | No | int | 0 | How many rows to skip at the beginning |
header
Whether the Excel file contains headers, if so, skip them.
skipRows
Specify the number of rows to skip, default is 0, meaning no skipping. Note that if header is set to true and skipRows is set to 2, it means the first three rows are all skipped. If header is false, it means skipping the first two rows.
Supported Data Types
The Excel reading functionality implementation depends on the Apache POI project, which has a very broad definition of cell data types. It only defines three types: Boolean, Double (numeric), and String. Among them, numeric type includes all integers, decimals, and dates. Currently, simple distinction is made for numeric types:
- Use library utility classes to detect if it's a date type, if so, determine it as date type
- Convert the numeric value to long integer and compare with the original value, if equal, determine as Long type
- Otherwise determine as Double type
Limitations
- Currently only reads the first Sheet of the file and ignores other Sheets
- Does not support specifying column reading
- Does not support skipping trailing rows (for example, summary tail rows may not meet requirements)
- Does not check if the number of columns in each row is equal, Excel must ensure this
- Only reads files with
xlsxorxlsfile extensions in the specified directory, other extension files will be ignored with warning messages