<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Bulk-importing data into HBase]]></title><description><![CDATA[<p dir="auto">Data in TSV form can be imported as follows.</p>
<pre><code># Copy the data into HDFS
hdfs dfs -put /ceph_disk1/gene_data/MetaDatabase/NCBI_blast_db_FASTA/nt/nt_exported_fasta/newnt1.fa /nt/

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/hadoop/hbase-2.3.4/lib/*:/opt/hadoop/hadoop-3.2.2/lib/native/*:/opt/hadoop/hbase-2.3.4/conf
time hadoop jar /opt/hadoop/hbase-2.3.4/lib/hbase-mapreduce-2.3.4.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,seq:seqt,seq:taxid,seq:seqs -Dimporttsv.bulk.output=hdfs://192.168.1.2:7000/nt/hfiles/ nt hdfs://192.168.1.2:7000/nt/newnt1.fa &amp;

</code></pre>
<p dir="auto">importtsv splits fields on tab by default, so the <code>&amp;</code> separator used in the command below is not needed:<br />
nohup time hadoop jar /opt/hadoop/hbase-2.3.4/lib/hbase-mapreduce-2.3.4.jar importtsv '-Dimporttsv.separator=&amp;' -Dimporttsv.columns=HBASE_ROW_KEY,seq:seqt,seq:taxid,seq:seqs -Dimporttsv.bulk.output=hdfs://192.168.1.2:7000/nt/hfiles/ nt hdfs://192.168.1.2:7000/nt/ &amp;</p>
<p dir="auto">The HFile generation step took a little under 4 hours for 370 GB of NT data. Job counters:<br />
Map-Reduce Framework<br />
Map input records=74547430<br />
Map output records=74547430<br />
Map output bytes=408633905926<br />
Map output materialized bytes=408933633110<br />
Input split bytes=302394<br />
Combine input records=74681364<br />
Combine output records=74681364<br />
Reduce input groups=74547430<br />
Reduce shuffle bytes=408933633110<br />
Reduce input records=74547430<br />
Reduce output records=223642290<br />
Spilled Records=223598923<br />
Shuffled Maps =320358<br />
Failed Shuffles=0<br />
Merged Map outputs=320358<br />
GC time elapsed (ms)=137615<br />
Total committed heap usage (bytes)=27134307336192<br />
ImportTsv<br />
Bad Lines=0<br />
Shuffle Errors<br />
BAD_ID=0<br />
CONNECTION=0<br />
IO_ERROR=0<br />
WRONG_LENGTH=0<br />
WRONG_MAP=0<br />
WRONG_REDUCE=0<br />
File Input Format Counters<br />
Bytes Read=421141495842<br />
File Output Format Counters<br />
Bytes Written=410423108883<br />
9057.42user 3919.59system 3:40:55elapsed 97%CPU (0avgtext+0avgdata 31694828maxresident)k</p>
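<p dir="auto">As a rough sanity check on the counters above, the sketch below derives throughput from the reported numbers. It only uses values copied from this post (Bytes Read, Map input records, Reduce output records, and the 3:40:55 elapsed time). That each row produces three cells is my reading of the three <code>seq:</code> columns, not something the job log states.</p>

```shell
#!/bin/sh
# Values copied from the job counters and the `time` output in this post.
bytes_read=421141495842                # File Input Format Counters: Bytes Read
records=74547430                       # Map input records
elapsed_s=$(( 3*3600 + 40*60 + 55 ))   # 3:40:55 elapsed

bytes_per_s=$(( bytes_read / elapsed_s ))
mib_per_s=$(( bytes_per_s / 1048576 ))
records_per_s=$(( records / elapsed_s ))

# Assumption: one cell per non-key column (seq:seqt, seq:taxid, seq:seqs).
cells=$(( records * 3 ))

echo "elapsed seconds:    $elapsed_s"
echo "input throughput:   $mib_per_s MiB/s (integer division)"
echo "records per second: $records_per_s"
echo "expected cells:     $cells (Reduce output records reported 223642290)"
```

<p dir="auto">The computed cell count matches the Reduce output records counter, which is consistent with every input line having been turned into three cells and with Bad Lines=0.</p>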
]]></description><link>http://an.forum.genostack.com/topic/272/hbase批量导入数据</link><generator>RSS for Node</generator><lastBuildDate>Sat, 13 Jun 2026 12:32:44 GMT</lastBuildDate><atom:link href="http://an.forum.genostack.com/topic/272.rss" rel="self" type="application/rss+xml"/><pubDate>Thu, 01 Apr 2021 12:19:33 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Bulk-importing data into HBase on Fri, 09 Apr 2021 10:48:29 GMT]]></title><description><![CDATA[<p dir="auto">2021-04-09 18:38:33,008 ERROR [main] tool.LoadIncrementalHFiles (LoadIncrementalHFiles.java:checkHFilesCountPerRegionPerFamily(610)) - Trying to load more than 32 hfiles to family seq of region with start key<br />
2021-04-09 18:38:33,019 INFO  [main] client.ConnectionImplementation (ConnectionImplementation.java:closeMasterService(1898)) - Closing master protocol: MasterService<br />
Exception in thread "main" java.io.IOException: Trying to load more than 32 hfiles to one family of one region<br />
at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.performBulkLoad(LoadIncrementalHFiles.java:455)<br />
at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:367)<br />
at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:1216)<br />
at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:1229)<br />
at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:1264)<br />
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)<br />
at org.apache.hadoop.hbase.tool.BulkLoadHFilesTool.main(BulkLoadHFilesTool.java:66)</p>
<p dir="auto">Raising the per-region, per-family HFile limit (32 according to the error above) lets the load go through:<br />
java -cp /opt/hadoop/hbase-2.3.4/lib/hbase-mapreduce-2.3.4.jar:/opt/hadoop/hbase-2.3.4/lib/hbase-server-2.3.4.jar:/opt/hadoop/hbase-2.3.4/lib/*:/opt/hadoop/hadoop-3.2.2/share/hadoop/common/*:/opt/hadoop/hadoop-3.2.2/share/hadoop/common/lib/* org.apache.hadoop.hbase.tool.BulkLoadHFilesTool <strong>-Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=1024</strong> hdfs://192.168.1.2:7000/nt/hfiles2/ nt</p>
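<p dir="auto">To see ahead of time whether a load might hit this limit, one option is to count the HFiles under each column family directory. The sketch below is only an illustration: the listing is a made-up sample in the shape of <code>hdfs dfs -ls -R</code> output, the file names are hypothetical, and <code>count_hfiles</code> is a helper name I introduced. The count is per family, so it is an upper bound on the per-region figure that HBase actually checks.</p>

```shell
#!/bin/sh
# Hypothetical sample shaped like `hdfs dfs -ls -R` output; names are made up.
listing='-rw-r--r--   3 hbase supergroup          0 2021-04-09 18:00 /nt/hfiles2/_SUCCESS
drwxr-xr-x   - hbase supergroup          0 2021-04-09 18:00 /nt/hfiles2/seq
-rw-r--r--   3 hbase supergroup 1073741824 2021-04-09 18:00 /nt/hfiles2/seq/hfile-0001
-rw-r--r--   3 hbase supergroup 1073741824 2021-04-09 18:00 /nt/hfiles2/seq/hfile-0002
-rw-r--r--   3 hbase supergroup  536870912 2021-04-09 18:00 /nt/hfiles2/seq/hfile-0003'

count_hfiles() {
    # $1 = listing text, $2 = bulk output directory, $3 = column family
    printf '%s\n' "$1" | awk -v prefix="$2/$3/" '
        # Regular files start with "-"; the path is the last field.
        /^-/ { if (index($NF, prefix) == 1) n++ }
        END  { print n + 0 }'
}

limit=32   # the limit named in the error message of this post
n=$(count_hfiles "$listing" /nt/hfiles2 seq)
if [ "$n" -gt "$limit" ]; then
    echo "$n HFiles in family seq: raise the limit or pre-split the table"
else
    echo "$n HFiles in family seq: within the limit of $limit"
fi
```

<p dir="auto">On a real cluster the first argument would presumably come from the actual listing of the bulk output directory rather than from a hard-coded sample.</p>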
]]></description><link>http://an.forum.genostack.com/post/577</link><guid isPermaLink="true">http://an.forum.genostack.com/post/577</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Fri, 09 Apr 2021 10:48:29 GMT</pubDate></item><item><title><![CDATA[Reply to Bulk-importing data into HBase on Fri, 09 Apr 2021 03:48:08 GMT]]></title><description><![CDATA[<pre><code>java -cp /opt/hadoop/hbase-2.3.4/lib/hbase-mapreduce-2.3.4.jar:/opt/hadoop/hbase-2.3.4/lib/hbase-server-2.3.4.jar:/opt/hadoop/hbase-2.3.4/lib/*:/opt/hadoop/hadoop-3.2.2/share/hadoop/common/*:/opt/hadoop/hadoop-3.2.2/share/hadoop/common/lib/* org.apache.hadoop.hbase.tool.BulkLoadHFilesTool hdfs://192.168.1.2:7000/nt/hfiles/ nt
</code></pre>
<p dir="auto">This loads the HFiles generated in the previous step. It finishes quickly, so its running time is negligible.</p>
]]></description><link>http://an.forum.genostack.com/post/569</link><guid isPermaLink="true">http://an.forum.genostack.com/post/569</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Fri, 09 Apr 2021 03:48:08 GMT</pubDate></item><item><title><![CDATA[Reply to Bulk-importing data into HBase on Thu, 08 Apr 2021 08:31:17 GMT]]></title><description><![CDATA[<p dir="auto">You can download the files ahead of time; krakenuniq-download will then automatically detect the files that are already present.<br />
<img src="/assets/uploads/files/1617870525702-293d9cbe-faae-48f8-a0e9-77481cf57599-image.png" alt="293d9cbe-faae-48f8-a0e9-77481cf57599-image.png" class=" img-responsive img-markdown" /></p>
]]></description><link>http://an.forum.genostack.com/post/568</link><guid isPermaLink="true">http://an.forum.genostack.com/post/568</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Thu, 08 Apr 2021 08:31:17 GMT</pubDate></item><item><title><![CDATA[Reply to Bulk-importing data into HBase on Thu, 08 Apr 2021 11:16:45 GMT]]></title><description><![CDATA[<p dir="auto">java.lang.reflect.InvocationTargetException<br />
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)<br />
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)<br />
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)<br />
at java.lang.reflect.Method.invoke(Method.java:498)<br />
at org.apache.hadoop.hbase.mapreduce.Driver.main(Driver.java:64)<br />
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)<br />
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)<br />
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)<br />
at java.lang.reflect.Method.invoke(Method.java:498)<br />
at org.apache.hadoop.util.RunJar.run(RunJar.java:323)<br />
at org.apache.hadoop.util.RunJar.main(RunJar.java:236)<br />
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.hdfs.client.HdfsDataInputStream.getReadStatistics()Lorg/apache/hadoop/hdfs/DFSInputStream$ReadStatistics;<br />
at org.apache.hadoop.hbase.io.FSDataInputStreamWrapper.updateInputStreamStatistics(FSDataInputStreamWrapper.java:253)<br />
at org.apache.hadoop.hbase.io.FSDataInputStreamWrapper.close(FSDataInputStreamWrapper.java:300)<br />
at org.apache.hadoop.hbase.io.hfile.HFile.isHFileFormat(HFile.java:590)<br />
at org.apache.hadoop.hbase.io.hfile.HFile.isHFileFormat(HFile.java:571)<br />
at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.visitBulkHFiles(LoadIncrementalHFiles.java:1072)<br />
at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.discoverLoadQueue(LoadIncrementalHFiles.java:988)<br />
at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.prepareHFileQueue(LoadIncrementalHFiles.java:249)<br />
at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.doBulkLoad(LoadIncrementalHFiles.java:356)<br />
at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:1216)<br />
at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:1229)<br />
at org.apache.hadoop.hbase.tool.LoadIncrementalHFiles.run(LoadIncrementalHFiles.java:1264)<br />
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)<br />
at org.apache.hadoop.hbase.tool.BulkLoadHFilesTool.main(BulkLoadHFilesTool.java:66)<br />
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)<br />
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)<br />
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)<br />
at java.lang.reflect.Method.invoke(Method.java:498)<br />
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)<br />
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)<br />
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)<br />
... 11 more<br />
Command exited with non-zero status 1</p>
<p dir="auto">I do not know the cause, but running the tool directly with java, as below, worked:<br />
java -cp /opt/hadoop/hbase-2.3.4/lib/hbase-mapreduce-2.3.4.jar:/opt/hadoop/hbase-2.3.4/lib/hbase-server-2.3.4.jar:/opt/hadoop/hbase-2.3.4/lib/*:/opt/hadoop/hadoop-3.2.2/share/hadoop/common/*:/opt/hadoop/hadoop-3.2.2/share/hadoop/common/lib/* org.apache.hadoop.hbase.tool.BulkLoadHFilesTool hdfs://192.168.1.2:7000/nt/hfiles/ nt</p>
]]></description><link>http://an.forum.genostack.com/post/567</link><guid isPermaLink="true">http://an.forum.genostack.com/post/567</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Thu, 08 Apr 2021 11:16:45 GMT</pubDate></item><item><title><![CDATA[Reply to Bulk-importing data into HBase on Thu, 08 Apr 2021 03:19:03 GMT]]></title><description><![CDATA[<p dir="auto">On one run I mistyped the parameter below; it should have been importtsv.bulk.output:<br />
-Dimporttst.bulk.output=hdfs://192.168.1.2:7000/nt/hfiles/</p>
<p dir="auto">With the misspelled parameter, importtsv still ran to completion without complaint, but I could not tell where it had written the results.</p>
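<p dir="auto">Since an unknown <code>-D</code> key is accepted silently, a small pre-flight check can catch this kind of typo before a multi-hour job starts. The sketch below is an assumption-laden illustration: <code>check_opts</code> is a helper I made up, and the list of known keys contains only option names I believe ImportTsv recognises, so it is not exhaustive. As far as I know, when importtsv.bulk.output is not set, ImportTsv writes directly into the table instead of producing HFiles, which would explain where the results of the mistyped run went.</p>

```shell
#!/bin/sh
# Option names I believe ImportTsv recognises; the list is not exhaustive.
known_keys="importtsv.columns importtsv.separator importtsv.bulk.output importtsv.skip.bad.lines importtsv.timestamp importtsv.mapper.class"

check_opts() {
    # Print every -Dimport* key that is not in known_keys; return 1 if any.
    bad=0
    for arg in "$@"; do
        case "$arg" in
            -Dimport*=*)
                key=${arg#-D}      # drop the leading -D
                key=${key%%=*}     # drop the value
                case " $known_keys " in
                    *" $key "*) ;;
                    *) echo "unknown option: $key"; bad=1 ;;
                esac
                ;;
        esac
    done
    return "$bad"
}

# The second argument reproduces the typo from this post.
if check_opts -Dimporttsv.columns=HBASE_ROW_KEY,seq:seqt -Dimporttst.bulk.output=hdfs://192.168.1.2:7000/nt/hfiles/; then
    echo "options look fine"
else
    echo "refusing to start the job"
fi
```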
]]></description><link>http://an.forum.genostack.com/post/565</link><guid isPermaLink="true">http://an.forum.genostack.com/post/565</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Thu, 08 Apr 2021 03:19:03 GMT</pubDate></item><item><title><![CDATA[Reply to Bulk-importing data into HBase on Thu, 08 Apr 2021 03:01:14 GMT]]></title><description><![CDATA[<p dir="auto">HBase supports two ways of invoking these tools: calling the class directly, or going through Hadoop's driver mechanism.<br />
Explicit Classname<br />
$ bin/hbase org.apache.hadoop.hbase.tool.LoadIncrementalHFiles hdfs://storefileoutput &lt;tablename&gt;<br />
Driver<br />
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-mapreduce-VERSION.jar completebulkload hdfs://storefileoutput &lt;tablename&gt;</p>
<p dir="auto"><a href="https://www.oreilly.com/library/view/learning-hbase/9781783985944/ch06s05.html" rel="nofollow ugc">https://www.oreilly.com/library/view/learning-hbase/9781783985944/ch06s05.html</a><br />
The Hadoop driver mechanism can also invoke the following HBase tools:<br />
completebulkload: This is for a bulk data load<br />
copytable: This is to export a table data from the local to peer cluster<br />
export: This is to export data from an HBase table to HDFS as a sequence file<br />
import: This is to import data written by export<br />
importtsv: This is to import data in TSV format to HBase<br />
rowcounter: This is to count rows in an HBase table using MapReduce<br />
verifyrep: This is to compare the data from tables of different clusters</p>
]]></description><link>http://an.forum.genostack.com/post/564</link><guid isPermaLink="true">http://an.forum.genostack.com/post/564</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Thu, 08 Apr 2021 03:01:14 GMT</pubDate></item><item><title><![CDATA[Reply to Bulk-importing data into HBase on Wed, 07 Apr 2021 09:40:35 GMT]]></title><description><![CDATA[<p dir="auto">2021-04-07 17:37:09,260 INFO  [htable-pool3866-t1] client.AsyncRequestFutureImpl (AsyncRequestFutureImpl.java:resubmit(763)) - id=2899, table=nt, attempt=12/16, failureCount=1ops, last exception=org.apache.hadoop.hbase.RegionTooBusyException: org.apache.hadoop.hbase.RegionTooBusyException: Over memstore limit=512.0 M, regionName=1efaf86c7220c641629a386589e1a8ef, server=anneng01,16020,1617700403892<br />
at org.apache.hadoop.hbase.regionserver.HRegion.checkResources(HRegion.java:4535)<br />
at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4101)<br />
at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4041)<br />
at org.apache.hadoop.hbase.regionserver.RSRpcServices.doBatchOp(RSRpcServices.java:1081)<br />
at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicBatchOp(RSRpcServices.java:1013)<br />
at org.apache.hadoop.hbase.regionserver.RSRpcServices.doNonAtomicRegionMutation(RSRpcServices.java:978)<br />
at org.apache.hadoop.hbase.regionserver.RSRpcServices.multi(RSRpcServices.java:2828)<br />
at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:44870)<br />
at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:393)<br />
at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:133)<br />
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:338)<br />
at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:318)<br />
on anneng01,16020,1617700403892, tracking started null, retrying after=20194ms, operationsToReplay=1</p>
<pre><code>hbase.hregion.memstore.flush.size=1024M
hbase.hregion.memstore.block.multiplier=4
</code></pre>
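<p dir="auto">The 512.0 M in the exception lines up with the two settings above: to my understanding, a region blocks writes once its memstore exceeds flush size times block multiplier, and the stock values (128 MB and 4) give exactly 512 MB. The sketch below just does that arithmetic for the default and for the values in this post. <code>blocking_limit</code> is a name I made up, and the byte value is shown because I would expect hbase-site.xml to want the flush size as a plain number of bytes rather than <code>1024M</code>.</p>

```shell
#!/bin/sh
mib=1048576

blocking_limit() {
    # $1 = hbase.hregion.memstore.flush.size in bytes
    # $2 = hbase.hregion.memstore.block.multiplier
    echo $(( $1 * $2 ))
}

# Assumed stock values: 128 MiB flush size, multiplier 4.
default_limit=$(blocking_limit $(( 128 * mib )) 4)

# Values from this post: 1024 MiB flush size, multiplier 4.
tuned_flush=$(( 1024 * mib ))
tuned_limit=$(blocking_limit "$tuned_flush" 4)

echo "default blocking limit: $(( default_limit / mib )) MiB"
echo "tuned blocking limit:   $(( tuned_limit / mib )) MiB"
echo "flush size in bytes:    $tuned_flush"
```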
]]></description><link>http://an.forum.genostack.com/post/549</link><guid isPermaLink="true">http://an.forum.genostack.com/post/549</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Wed, 07 Apr 2021 09:40:35 GMT</pubDate></item><item><title><![CDATA[Reply to Bulk-importing data into HBase on Wed, 07 Apr 2021 12:57:19 GMT]]></title><description><![CDATA[<p dir="auto"><a href="https://stackoverflow.com/questions/14326308/how-to-include-hbase-site-xml-in-the-classpath" rel="nofollow ugc">https://stackoverflow.com/questions/14326308/how-to-include-hbase-site-xml-in-the-classpath</a></p>
<p dir="auto">How to add hbase-site.xml to the classpath so that it takes effect:</p>
<p dir="auto">export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/opt/hadoop/hbase-2.3.4/lib/*:/opt/hadoop/hadoop-3.2.2/lib/native/*:/opt/hadoop/hbase-2.3.4/conf</p>
<p dir="auto">The /opt/hadoop/hbase-2.3.4/conf entry is required; without it the client does not read hbase-site.xml.</p>
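<p dir="auto">A quick way to confirm the conf directory really is on the classpath before launching a job is sketched below. <code>classpath_has</code> is a helper name I made up; the paths are the ones from this post, and the check for hbase-site.xml only warns, since the file may live elsewhere on other installations.</p>

```shell
#!/bin/sh
classpath_has() {
    # Succeeds when $2 is one of the colon-separated entries of $1.
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *) return 1 ;;
    esac
}

conf_dir=/opt/hadoop/hbase-2.3.4/conf
cp="${HADOOP_CLASSPATH:-}:/opt/hadoop/hbase-2.3.4/lib/*:$conf_dir"

if classpath_has "$cp" "$conf_dir"; then
    echo "conf directory is on the classpath"
else
    echo "conf directory is missing from the classpath"
fi

# Warn only: the file is expected here on the setup described in this thread.
if [ ! -f "$conf_dir/hbase-site.xml" ]; then
    echo "warning: $conf_dir/hbase-site.xml not found on this machine"
fi
```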
]]></description><link>http://an.forum.genostack.com/post/547</link><guid isPermaLink="true">http://an.forum.genostack.com/post/547</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Wed, 07 Apr 2021 12:57:19 GMT</pubDate></item><item><title><![CDATA[Reply to Bulk-importing data into HBase on Tue, 06 Apr 2021 03:26:01 GMT]]></title><description><![CDATA[<p dir="auto">HBase limits the maximum size of a cell, 10 MB by default; the limit has to be disabled:<br />
&lt;property&gt;<br />
&lt;name&gt;hbase.client.keyvalue.maxsize&lt;/name&gt;<br />
&lt;value&gt;0&lt;/value&gt;<br />
&lt;/property&gt;</p>
<p dir="auto">After changing the configuration, sync it to every node and then restart HBase:<br />
stop-hbase.sh<br />
start-hbase.sh</p>
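<p dir="auto">Before disabling the limit it can be useful to know which rows would actually exceed it. The sketch below scans a tab-separated file and prints the row keys whose fourth field (the sequence column in this thread's layout) is longer than a given limit. The sample rows and the helper name <code>oversize_rows</code> are made up, and the tiny limit of 8 is only there so the demo prints something; for the real check the limit would be 10485760. It measures the value only, so treat it as an approximation of the size HBase checks.</p>

```shell
#!/bin/sh
# Build a tiny tab-separated sample; the rows are made up for illustration.
sample=$(mktemp)
printf 'id1\ttitle one\t9606\tACGT\n'         >  "$sample"
printf 'id2\ttitle two\t562\tACGTACGTACGT\n'  >> "$sample"

oversize_rows() {
    # $1 = TSV file, $2 = size limit
    # Prints the row key of every line whose 4th field is longer than the limit.
    awk -F '\t' -v max="$2" 'length($4) > max { print $1 }' "$1"
}

# Demo limit of 8 characters; use 10485760 for the 10 MB default.
oversize_rows "$sample" 8
rm -f "$sample"
```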
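]]></description><link>http://an.forum.genostack.com/post/541</link><guid isPermaLink="true">http://an.forum.genostack.com/post/541</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Tue, 06 Apr 2021 03:26:01 GMT</pubDate></item><item><title><![CDATA[Reply to Bulk-importing data into HBase on Fri, 02 Apr 2021 02:24:13 GMT]]></title><description><![CDATA[<p dir="auto">The data exported by blastdbcmd contains many different characters, so to meet importtsv's format requirements the fields have to be separated by tabs. There are two ways to enter the tabs:<br />
1. Save the command below in a .sh file, where a tab is easy to type (the gaps in the format string are tab characters):<br />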
blastdbcmd -db nt -entry all  -out newnt1.fa -outfmt '%a        %t      %T      %s'</p>
<p dir="auto">2. On the command line, press Ctrl+V and then Tab to insert a literal tab character.</p>
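<p dir="auto">A third option, which the post does not mention and which I am adding as a suggestion, is to let <code>printf</code> generate the tabs so that nothing depends on how the editor or terminal handles the Tab key. The sketch builds the same four-field format string and, as a dry run, only prints the blastdbcmd command rather than executing it.</p>

```shell
#!/bin/sh
# printf expands \t to a real tab; %% yields a literal percent sign.
outfmt=$(printf '%%a\t%%t\t%%T\t%%s')

# Count the tab characters to confirm the format string is what we expect.
tabs=$(printf '%s' "$outfmt" | tr -cd '\t' | wc -c)
echo "format contains $(( tabs )) tab characters"

# Dry run: show the command instead of running it.
echo blastdbcmd -db nt -entry all -out newnt1.fa -outfmt "'$outfmt'"
```

<p dir="auto">To run the export for real, drop the <code>echo</code> and pass <code>"$outfmt"</code> in double quotes so the tabs survive word splitting.</p>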
]]></description><link>http://an.forum.genostack.com/post/539</link><guid isPermaLink="true">http://an.forum.genostack.com/post/539</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Fri, 02 Apr 2021 02:24:13 GMT</pubDate></item><item><title><![CDATA[Reply to Bulk-importing data into HBase on Thu, 01 Apr 2021 12:19:42 GMT]]></title><description><![CDATA[<p dir="auto">java.lang.Exception: java.lang.IllegalArgumentException: TsvParser only supports single-byte separators<br />
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492)<br />
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552)<br />
Caused by: java.lang.IllegalArgumentException: TsvParser only supports single-byte separators<br />
at org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)<br />
at org.apache.hadoop.hbase.mapreduce.ImportTsv$TsvParser.&lt;init&gt;(ImportTsv.java:161)<br />
at org.apache.hadoop.hbase.mapreduce.TsvImporterMapper.setup(TsvImporterMapper.java:108)<br />
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)<br />
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)<br />
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)<br />
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:271)<br />
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)<br />
at java.util.concurrent.FutureTask.run(FutureTask.java:266)<br />
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)<br />
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)<br />
at java.lang.Thread.run(Thread.java:748)</p>
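<p dir="auto">The exception says the separator must be exactly one byte. The post does not show which separator triggered it, so the example below is only a guess at the kind of input that fails: it measures the byte length of an ASCII separator and of a full-width (ideographic) space, which is three bytes in UTF-8. <code>sep_bytes</code> is a helper name I made up.</p>

```shell
#!/bin/sh
sep_bytes() {
    # Number of bytes in the separator string $1.
    echo $(( $(printf '%s' "$1" | wc -c) ))
}

ascii_sep='|'
wide_space=$(printf '\343\200\200')   # U+3000 ideographic space, UTF-8 encoded

for sep in "$ascii_sep" "$wide_space"; do
    n=$(sep_bytes "$sep")
    if [ "$n" -eq 1 ]; then
        echo "ok: separator is 1 byte"
    else
        echo "rejected: separator is $n bytes"
    fi
done
```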
]]></description><link>http://an.forum.genostack.com/post/538</link><guid isPermaLink="true">http://an.forum.genostack.com/post/538</guid><dc:creator><![CDATA[anneng]]></dc:creator><pubDate>Thu, 01 Apr 2021 12:19:42 GMT</pubDate></item></channel></rss>