暗能星系


    HBase Data Model Design and Bulk Import

      anneng (last edited by anneng)

      References:
      https://livebook.manning.com/book/hbase-in-action/chapter-4/1
      https://www.ibm.com/support/knowledgecenter/SSCRJT_5.0.1/com.ibm.swg.im.bigsql.analyze.doc/doc/bigsql_designhints.html
      Some basic concepts of the HBase data model:
      Table: HBase organizes data into tables. Table names are Strings and composed of characters that are safe for use in a file system path.
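      Row: Within a table, data is stored according to its row. Rows are identified uniquely by their row key. Row keys do not have a data type and are always treated as a byte[] (byte array). Every row must have a key. Unlike a relational database, where a table's primary key can be altered, the row key of an HBase table cannot be changed afterwards.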
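Since row keys are plain byte arrays, HBase orders rows by comparing the key bytes, not by any numeric or string type. Here is a small sketch of what that means in practice, using Python's `bytes` ordering as a stand-in for the byte-wise comparison. The `user-N` key format and the padding width of 5 are made up for illustration.

```python
# Row keys are raw bytes, so their order is lexicographic on the bytes,
# not numeric. Python's bytes comparison behaves the same way.
raw_keys = [b"user-9", b"user-10", b"user-100"]
sorted_raw = sorted(raw_keys)


def padded_key(user_id: int) -> bytes:
    """Build a key whose byte order matches numeric order.

    The "user-" prefix and the width of 5 are illustrative assumptions.
    """
    return f"user-{user_id:05d}".encode("utf-8")


sorted_padded = sorted(padded_key(n) for n in (9, 10, 100))

print(sorted_raw)     # user-9 sorts last, after user-100
print(sorted_padded)  # zero-padded keys sort in numeric order
```

The takeaway is that whatever ordering you want for scans has to be encoded into the key bytes up front.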
      Column Family: Data within a row is grouped by column family. Column families also impact the physical arrangement of data stored in HBase. For this reason, they must be defined up front and are not easily modified. Every row in a table has the same column families, although a row need not store data in all its families. Column families are Strings and composed of characters that are safe for use in a file system path.
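      Notes on column families:
      Each column family is physically stored in its own HFiles, so data that is logically used together can be placed in the same column family. Column family names should be as short as possible: the family name is stored with every cell, so it is repeated once per stored value, and a long name takes up a lot of space. Column families should be designed when the table is created.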
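To get a feel for why short family names matter, here is a rough back-of-the-envelope estimate. It assumes the commonly described uncompressed KeyValue layout (key length, value length, row length, row, family length, family, qualifier, timestamp, key type, value) and ignores tags, block encoding, and compression, so treat the numbers as an upper bound on the overhead, not a measurement. The sample row key, qualifier, value, and family names are hypothetical.

```python
def approx_cell_bytes(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate size of one stored cell, before encoding or compression.

    Fixed part (assumed layout): key length (4) + value length (4)
    + row length (2) + family length (1) + timestamp (8) + key type (1).
    """
    fixed = 4 + 4 + 2 + 1 + 8 + 1
    return fixed + len(row) + len(family) + len(qualifier) + len(value)


# Hypothetical sample cell and table size.
row, qualifier, value = b"user-00042", b"name", b"Ann"
n_cells = 1_000_000

short_total = approx_cell_bytes(row, b"d", qualifier, value) * n_cells
long_total = approx_cell_bytes(row, b"user_profile_details", qualifier, value) * n_cells

# Extra bytes paid only because of the longer family name.
extra_bytes = long_total - short_total
print(f"extra bytes for the long family name: {extra_bytes:,}")
```

In practice block encoding and compression shrink this difference a lot, but the name is still carried with every cell, which is the point the notes make.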

      Column Qualifier: Data within a column family is addressed via its column qualifier, or simply, column. Column qualifiers need not be specified in advance. Column qualifiers need not be consistent between rows. Like row keys, column qualifiers do not have a data type and are always treated as a byte[]. The column qualifier is the actual column, and qualifiers can be added dynamically.
      Cell: A combination of row key, column family, and column qualifier uniquely identifies a cell. The data stored in a cell is referred to as that cell's value. Values also do not have a data type and are always treated as a byte[].
      Timestamp: Values within a cell are versioned. Versions are identified by their version number, which by default is the timestamp of when the cell was written. If a timestamp is not specified during a write, the current timestamp is used. If the timestamp is not specified for a read, the latest one is returned. The number of cell value versions retained by HBase is configured for each column family. The default number of cell versions was three in older releases; since HBase 0.96 the default is one.
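The versioning rules can be sketched with a tiny in-memory model. This is only an illustration, not how HBase implements it: the `VersionedCell` class is hypothetical, it uses the retention limit of three mentioned in the notes, and it prunes old versions eagerly on every write, whereas HBase removes excess versions lazily.

```python
import time
from typing import Dict, Optional


class VersionedCell:
    """Toy model of one cell that keeps at most max_versions values."""

    def __init__(self, max_versions: int = 3) -> None:
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        self.max_versions = max_versions
        self.versions: Dict[int, bytes] = {}  # timestamp -> value

    def put(self, value: bytes, timestamp: Optional[int] = None) -> None:
        # No timestamp given on write: use the current time in milliseconds.
        if timestamp is None:
            timestamp = int(time.time() * 1000)
        self.versions[timestamp] = value
        # Drop the oldest versions beyond the retention limit.
        for old in sorted(self.versions)[:-self.max_versions]:
            del self.versions[old]

    def get(self, timestamp: Optional[int] = None) -> Optional[bytes]:
        if not self.versions:
            return None
        # No timestamp given on read: return the latest version.
        if timestamp is None:
            timestamp = max(self.versions)
        return self.versions.get(timestamp)


cell = VersionedCell(max_versions=3)
for ts, val in [(1, b"a"), (2, b"b"), (3, b"c"), (4, b"d")]:
    cell.put(val, timestamp=ts)

print(cell.get())   # latest version
print(cell.get(2))  # a specific retained version
print(cell.get(1))  # the oldest version was dropped
```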
      af8e4195-9cf7-40c3-b26c-4352bcbc2fde-image.png
      An HBase table is in effect a multidimensional map:
      9d7dad41-19bd-4b8d-8341-dc3dd85ccee9-image.png
      As the example above shows, the full key of each entry is uniquely determined by [row key, column family, column qualifier, timestamp], and the cell holds the actual value.
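The same idea can be written down as nested dictionaries. This is a minimal sketch with made-up rows, a made-up `info` family, and small integer timestamps. It only illustrates the shape of the map: row key, then column family, then column qualifier, then timestamp, then value.

```python
# row key -> column family -> column qualifier -> timestamp -> value
table = {
    b"row-1": {
        b"info": {
            b"name": {2: b"Ann", 1: b"An"},       # two versions of one cell
            b"email": {1: b"ann@example.com"},
        },
    },
    b"row-2": {
        b"info": {
            b"name": {1: b"Bob"},                 # no email qualifier in this row
        },
    },
}


def flatten(nested):
    """Yield ((row, family, qualifier, timestamp), value) pairs.

    Keys come out in sorted order, with the newest timestamp first.
    """
    for row in sorted(nested):
        for family in sorted(nested[row]):
            for qualifier in sorted(nested[row][family]):
                versions = nested[row][family][qualifier]
                for ts in sorted(versions, reverse=True):
                    yield (row, family, qualifier, ts), versions[ts]


def get_latest(nested, row, family, qualifier):
    """Return the value with the highest timestamp for one cell."""
    versions = nested[row][family][qualifier]
    return versions[max(versions)]


entries = list(flatten(table))
for key, value in entries:
    print(key, value)
```

Note that `row-2` simply has no `email` entry, which matches the point that column qualifiers need not be consistent between rows.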

      Bulk loading data into HBase uses the org.apache.hadoop.hbase.mapreduce.ImportTsv utility together with the completebulkload tool. The procedure is as follows:

      1. Put the TSV data file to be loaded into HDFS.
      2. Run the ImportTsv utility to generate HFiles from the TSV file.
      3. Run the completebulkload tool to bulk load the generated HFiles into HBase.
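The three steps above can be sketched as a small script that writes a sample TSV file and prints the commands without running them. Everything specific here is an assumption for illustration: the table name `demo_table`, the family `d`, the HDFS paths, and the sample rows. The `importtsv.columns` and `importtsv.bulk.output` options belong to ImportTsv. For the last step the sketch calls the `LoadIncrementalHFiles` class behind completebulkload, whose package differs between HBase versions, so check the class name against your installation.

```python
import csv
import shlex
import tempfile
from pathlib import Path

# Hypothetical sample data: row key, name, email.
ROWS = [
    ("row-1", "Ann", "ann@example.com"),
    ("row-2", "Bob", "bob@example.com"),
]


def write_tsv(path: Path, rows) -> None:
    """Write rows as tab-separated values, one record per line."""
    with path.open("w", newline="") as handle:
        csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(rows)


def bulk_load_commands(local_tsv, hdfs_input, hfile_dir, table, columns):
    """Build (but do not run) the command for each of the three steps."""
    return [
        # 1. Copy the TSV file into HDFS.
        ["hdfs", "dfs", "-put", str(local_tsv), hdfs_input],
        # 2. Generate HFiles from the TSV instead of writing to the table.
        ["hbase", "org.apache.hadoop.hbase.mapreduce.ImportTsv",
         "-Dimporttsv.columns=" + ",".join(columns),
         "-Dimporttsv.bulk.output=" + hfile_dir,
         table, hdfs_input],
        # 3. Load the generated HFiles into the table (class name is
        #    version dependent; this is the HBase 1.x location).
        ["hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles",
         hfile_dir, table],
    ]


tsv_path = Path(tempfile.mkdtemp()) / "data.tsv"
write_tsv(tsv_path, ROWS)

commands = bulk_load_commands(
    local_tsv=tsv_path,
    hdfs_input="/user/demo/input",      # assumed HDFS location
    hfile_dir="/user/demo/hfiles",      # assumed HFile output directory
    table="demo_table",                 # assumed table name
    columns=["HBASE_ROW_KEY", "d:name", "d:email"],
)
for command in commands:
    print(shlex.join(command))
```

`HBASE_ROW_KEY` marks which TSV column becomes the row key. The other entries map the remaining columns to `family:qualifier` pairs in order.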
        anneng

        Introduction to HBase Schema Design.pdf

          anneng (last edited by anneng)

          References:
          https://www.jigsawacademy.com/hbase-a-versatile-data-store/
          https://dzone.com/articles/understanding-hbase-and-bigtab
