Broker Load | StarRocks

1. Does Broker Load support re-running load jobs that have been run successfully and are in the FINISHED state?

Broker Load does not support re-running load jobs that have completed successfully and are in the FINISHED state. In addition, to prevent data loss and duplication, Broker Load does not allow the labels of successfully completed load jobs to be reused. You can use SHOW LOAD to view the load job history and find the job that you want to re-run, then copy that job's information and use it, with a new label, to create another load job.
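For example, assuming a hypothetical label such as label_orders_20240101, you can look up the finished job and reuse its settings under a new label:

SHOW LOAD WHERE LABEL = "label_orders_20240101";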

2. When I load data from HDFS by using Broker Load, what do I do if the date and time values loaded into the destination StarRocks table are 8 hours later than the date and time values from the source data file?

Both the destination StarRocks table and the Broker Load job are configured at creation to use the China Standard Time (CST) time zone (specified by using the timezone parameter), whereas the server runs on Coordinated Universal Time (UTC). As a result, 8 extra hours are added to the date and time values from the source data file during loading. To prevent this issue, do not specify the timezone parameter when you create the destination StarRocks table.
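As a quick check (a minimal sketch; the output depends on your deployment), you can inspect the time zone that the current session uses before creating the table or the load job:

SELECT NOW();                        -- current time in the session time zone
SHOW VARIABLES LIKE '%time_zone%';   -- session and system time zone settings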

3. When I load ORC-formatted data by using Broker Load, what do I do if the ErrorMsg: type:ETL_RUN_FAIL; msg:Cannot cast '<slot 6>' from VARCHAR to ARRAY<VARCHAR(30)> error occurs?

The source data file and the destination StarRocks table have different column names. In this situation, you must use the SET clause in the load statement to specify the column mapping between the file and the table. When StarRocks executes the SET clause, it must perform type inference, but it fails when invoking the cast function to convert the source data to the destination data types. To resolve this issue, make sure that the source data file has the same column names as the destination StarRocks table. The SET clause is then not needed, so StarRocks does not need to invoke the cast function to perform data type conversions, and the Broker Load job can run successfully.
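For illustration only (the database, table, file path, and broker name below are placeholders), a load statement in which the ORC column names already match the destination table needs no SET clause and therefore no cast:

LOAD LABEL example_db.label_orc_no_cast
(
    DATA INFILE("hdfs://<hdfs_host>:<hdfs_port>/user/data/orders.orc")
    INTO TABLE orders
    FORMAT AS "orc"
    (id, name, tags)
)
WITH BROKER "my_broker";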

4. The Broker Load job does not report errors, but why am I unable to query the loaded data?

Broker Load is an asynchronous loading method. The load job may still fail even if the load statement itself does not return an error. After you submit a Broker Load job, use SHOW LOAD to view the job's result and its ErrorMsg field. You can then modify the job configuration and retry.
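For example (the label is hypothetical), check the State and ErrorMsg fields of the job; a statement that returns without errors only means the job was submitted, not that it finished:

SHOW LOAD WHERE LABEL = "label_orders_20240101"\G

A job whose State is CANCELLED has failed, and the ErrorMsg field usually points to the cause.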

5. What do I do if the "failed to send batch" or "TabletWriter add batch with unknown id" error occurs?

The amount of time taken to write the data exceeds the upper limit, causing a timeout error. To resolve this issue, modify the settings of the session variable query_timeout and the BE configuration item streaming_load_rpc_max_alive_time_sec based on your business requirements.
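A minimal sketch (the value 3600 seconds is illustrative, not a recommendation):

-- raise the timeout for the current session, in seconds
SET query_timeout = 3600;
-- or raise it for all new sessions
SET GLOBAL query_timeout = 3600;

streaming_load_rpc_max_alive_time_sec is a BE configuration item, typically set in the be.conf file of each BE.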

6. What do I do if the "LOAD-RUN-FAIL; msg:OrcScannerAdapter::init_include_columns. col name = xxx not found" error occurs?

If you are loading Parquet- or ORC-formatted data, check whether the column names in the first row of the source data file are the same as the column names of the destination StarRocks table. If they are different, use the SET clause in the load statement to specify the column mapping between the file and the table, as in the following example:

(tmp_c1,tmp_c2)
SET
(
id=tmp_c2,
name=tmp_c1
)

The preceding example maps the tmp_c1 and tmp_c2 columns of the source data file onto the name and id columns of the destination StarRocks table, respectively. If you do not specify the SET clause, the column names specified in the column_list parameter are used to declare the column mapping. For more information, see BROKER LOAD.

NOTICE

If the source data file is an ORC-formatted file generated by Apache Hive™ and the first row of the file holds (_col0, _col1, _col2, ...), the "Invalid Column Name" error may occur. If this error occurs, you need to use the SET clause to specify the column mapping.
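For example, if the Hive-generated ORC file exposes its columns as _col0 and _col1, a mapping like the following (the destination column names id and name are placeholders) avoids the error:

(_col0, _col1)
SET
(
id = _col1,
name = _col0
)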

7. How do I handle an error that causes a Broker Load job to run for an excessively long time?

View the FE log file fe.log and search for the ID of the load job based on the job label. Then, view the BE log file be.INFO and retrieve the log records of the load job based on the job ID to locate the root cause of the error.

8. How do I configure an Apache HDFS cluster that runs in HA mode?

If an HDFS cluster runs in high availability (HA) mode, configure it as follows:

  • dfs.nameservices: the name of the HDFS cluster, for example, "dfs.nameservices" = "my_ha".

  • dfs.ha.namenodes.xxx: the names of the NameNodes in the HDFS cluster. If you specify multiple NameNode names, separate them with commas (,). xxx is the HDFS cluster name that you specified in dfs.nameservices, for example, "dfs.ha.namenodes.my_ha" = "my_nn".

  • dfs.namenode.rpc-address.xxx.nn: the RPC address of the NameNode in the HDFS cluster. nn is the NameNode name that you have specified in dfs.ha.namenodes.xxx, for example, "dfs.namenode.rpc-address.my_ha.my_nn" = "host:port".

  • dfs.client.failover.proxy.provider: the provider of the NameNode to which the client will connect. Default value: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.

Example:

(
"dfs.nameservices" = "my-ha",
"dfs.ha.namenodes.my-ha" = "my-namenode1, my-namenode2",
"dfs.namenode.rpc-address.my-ha.my-namenode1" = "nn1-host:rpc_port",
"dfs.namenode.rpc-address.my-ha.my-namenode2" = "nn2-host:rpc_port",
"dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)

The HA mode can be used with simple authentication or Kerberos authentication. For example, to use simple authentication to access an HDFS cluster that runs in HA mode, you need to specify the following configurations:

(
"username"="user",
"password"="passwd",
"dfs.nameservices" = "my-ha",
"dfs.ha.namenodes.my-ha" = "my_namenode1, my_namenode2",
"dfs.namenode.rpc-address.my-ha.my-namenode1" = "nn1-host:rpc_port",
"dfs.namenode.rpc-address.my-ha.my-namenode2" = "nn2-host:rpc_port",
"dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
)
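Putting the pieces together, a complete Broker Load statement against this HA cluster might look like the following sketch (the database, table, file path, and broker name are placeholders; note that the file path references the nameservice my-ha rather than a single NameNode address):

LOAD LABEL example_db.label_ha_demo
(
    DATA INFILE("hdfs://my-ha/user/data/orders.csv")
    INTO TABLE orders
    COLUMNS TERMINATED BY ","
)
WITH BROKER "my_broker"
(
    "username" = "user",
    "password" = "passwd",
    "dfs.nameservices" = "my-ha",
    "dfs.ha.namenodes.my-ha" = "my-namenode1, my-namenode2",
    "dfs.namenode.rpc-address.my-ha.my-namenode1" = "nn1-host:rpc_port",
    "dfs.namenode.rpc-address.my-ha.my-namenode2" = "nn2-host:rpc_port",
    "dfs.client.failover.proxy.provider" = "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider"
);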

You can add the configurations of the HDFS cluster to the hdfs-site.xml file. This way, you only need to specify the file path and authentication information when you use brokers to load data from the HDFS cluster.

9. How do I configure Hadoop ViewFS Federation?

Copy the ViewFs-related configuration files core-site.xml and hdfs-site.xml to the broker/conf directory.

If you have a custom file system, you also need to copy the file system-related .jar files to the broker/lib directory.

10. When I access an HDFS cluster that requires Kerberos authentication, what do I do if the "Can't get Kerberos realm" error occurs?

Check that the /etc/krb5.conf file is configured on all hosts on which brokers are deployed.

If the error persists, add -Djava.security.krb5.conf=/etc/krb5.conf to the end of the JAVA_OPTS variable in the broker startup script.
