HBase Data Model Design and Bulk Import
-
References:
https://livebook.manning.com/book/hbase-in-action/chapter-4/1
https://www.ibm.com/support/knowledgecenter/SSCRJT_5.0.1/com.ibm.swg.im.bigsql.analyze.doc/doc/bigsql_designhints.html
Basic concepts of the HBase data model:
Table: HBase organizes data into tables. Table names are Strings and composed of characters that are safe for use in a file system path.
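Row: Within a table, data is stored according to its row. Rows are identified uniquely by their row key. Row keys do not have a data type and are always treated as a byte[] (byte array). Every row must have a row key. Unlike a relational database, where a table's primary key can be changed, the row key of an HBase table cannot be changed afterwards.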
Column Family: Data within a row is grouped by column family. Column families also impact the physical arrangement of data stored in HBase. For this reason, they must be defined up front and are not easily modified. Every row in a table has the same column families, although a row need not store data in all its families. Column families are Strings and composed of characters that are safe for use in a file system path.
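Notes on column families:
Each column family is physically stored in its own HFiles, so data that is logically used together is best placed in the same column family. Keep column family names as short as possible: the family name is stored with every cell, so it is repeated as many times as there are cells, and a long name wastes a lot of space. Column families should be designed when the table is created.
Column Qualifier: Data within a column family is addressed via its column qualifier, or simply, column. Column qualifiers need not be specified in advance. Column qualifiers need not be consistent between rows. Like row keys, column qualifiers do not have a data type and are always treated as a byte[]. The column qualifier is the actual column, and qualifiers can be added or changed dynamically.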
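The note above about short column family names can be made concrete with a rough estimate. This is only a sketch: the cell count and the family names are hypothetical, and it counts raw bytes before any compression or data block encoding, which can reduce the real cost.

```python
def family_name_overhead_bytes(family: str, num_cells: int) -> int:
    """Raw bytes spent repeating the column family name once per cell."""
    return len(family.encode("utf-8")) * num_cells


# Hypothetical table with one billion cells.
cells = 1_000_000_000
long_name = family_name_overhead_bytes("user_profile_details", cells)
short_name = family_name_overhead_bytes("p", cells)

# Extra raw bytes paid for the longer name, before compression.
print(long_name - short_name)
```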
Cell: A combination of row key, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell’s value. Values also do not have a data type and are always treated as a byte[].
Timestamp: Values within a cell are versioned. Versions are identified by their version number, which by default is the timestamp of when the cell was written. If a timestamp is not specified during a write, the current timestamp is used. If the timestamp is not specified for a read, the latest one is returned. The number of cell value versions retained by HBase is configured for each column family. The default number of cell versions is three in older releases; it was changed to one in HBase 0.96.
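The versioning rules above can be sketched with a toy in-memory model. This is not the HBase client API: the class `VersionedCell` and its methods are hypothetical names used only for illustration, and real HBase removes excess versions during compaction rather than immediately on write.

```python
import time
from typing import Dict, Optional


class VersionedCell:
    """Toy model of a single cell that keeps a bounded number of versions."""

    def __init__(self, max_versions: int) -> None:
        self.max_versions = max_versions
        self._versions: Dict[int, bytes] = {}  # timestamp -> value

    def put(self, value: bytes, timestamp: Optional[int] = None) -> None:
        # No timestamp given on write: use the current time in milliseconds.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self._versions[timestamp] = value
        # Keep only the newest max_versions entries.
        for old in sorted(self._versions, reverse=True)[self.max_versions:]:
            del self._versions[old]

    def get(self, timestamp: Optional[int] = None) -> Optional[bytes]:
        if not self._versions:
            return None
        # No timestamp given on read: return the latest version.
        if timestamp is None:
            timestamp = max(self._versions)
        return self._versions.get(timestamp)


cell = VersionedCell(max_versions=3)
for ts, value in [(1, b"a"), (2, b"b"), (3, b"c"), (4, b"d")]:
    cell.put(value, timestamp=ts)

print(cell.get())             # latest version
print(cell.get(timestamp=1))  # oldest version has been dropped
```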

An HBase table is in effect a multidimensional map:
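As a minimal sketch of that idea, the nested dictionary below models a table as row key -> column family -> column qualifier -> timestamp -> value. The row keys, the family name `info`, the qualifiers, and the timestamps are all made-up illustration data, and the helper functions are hypothetical, not part of HBase.

```python
# row key -> column family -> column qualifier -> timestamp -> value
table = {
    b"row1": {
        b"info": {
            b"name": {1700000000002: b"Alice B.", 1700000000001: b"Alice"},
            b"email": {1700000000001: b"alice@example.com"},
        },
    },
    b"row2": {
        b"info": {
            b"name": {1700000000003: b"Bob"},
        },
    },
}


def flatten(table):
    """Yield ((row, family, qualifier, timestamp), value) pairs."""
    for row, families in table.items():
        for family, qualifiers in families.items():
            for qualifier, versions in qualifiers.items():
                for ts, value in versions.items():
                    yield (row, family, qualifier, ts), value


def latest(table, row, family, qualifier):
    """Return the value with the highest timestamp for one cell."""
    versions = table[row][family][qualifier]
    return versions[max(versions)]


for key, value in flatten(table):
    print(key, value)

print(latest(table, b"row1", b"info", b"name"))
```

Flattening the nested map gives one entry per stored value, keyed by the four coordinates, which is the key-value view of the same data.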

As the example above shows, each value is uniquely identified by the key [row key, column family, column qualifier, timestamp], and the cell holds the actual value.
The org.apache.hadoop.hbase.mapreduce.ImportTsv utility and the completebulkload tool are used to bulk load data into HBase. The procedure is as follows:
- Put the data file, which is a TSV file, into HDFS
- Run the ImportTsv utility to generate multiple HFiles from the TSV file
- Run the completebulkload tool to bulk load the HFiles into HBase
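The three steps above can be sketched as command lines assembled in Python. The helper only builds and prints the commands; it does not run them. The file name, HDFS paths, table name, and column mapping are hypothetical. The `importtsv.columns` and `importtsv.bulk.output` options belong to ImportTsv, and the class behind completebulkload is assumed here to be `org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles`; in HBase 2.x it lives under `org.apache.hadoop.hbase.tool`, so check the name against your version.

```python
import shlex
from typing import List


def bulk_load_commands(
    local_tsv: str,
    hdfs_input_dir: str,
    hfile_output_dir: str,
    table: str,
    columns: List[str],
) -> List[List[str]]:
    """Build the three bulk-load commands as argument lists."""
    # Step 1: copy the TSV file into HDFS.
    put = ["hadoop", "fs", "-put", local_tsv, hdfs_input_dir]
    # Step 2: generate HFiles from the TSV file instead of writing to the table.
    import_tsv = [
        "hbase",
        "org.apache.hadoop.hbase.mapreduce.ImportTsv",
        "-Dimporttsv.columns=" + ",".join(columns),
        "-Dimporttsv.bulk.output=" + hfile_output_dir,
        table,
        hdfs_input_dir,
    ]
    # Step 3: load the generated HFiles into the table (completebulkload).
    # Assumption: HBase 1.x class name; see the note above for 2.x.
    complete = [
        "hbase",
        "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles",
        hfile_output_dir,
        table,
    ]
    return [put, import_tsv, complete]


# Hypothetical names, for illustration only.
commands = bulk_load_commands(
    local_tsv="data.tsv",
    hdfs_input_dir="/tmp/bulk/input",
    hfile_output_dir="/tmp/bulk/hfiles",
    table="my_table",
    columns=["HBASE_ROW_KEY", "d:c1", "d:c2"],
)
for cmd in commands:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

In the column mapping, `HBASE_ROW_KEY` marks which TSV field becomes the row key, and the remaining entries are `family:qualifier` names for the other fields.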