Hive与HBase数据互导

1、Hive到HBase
1.1、创建hive表

create table inpatient_hv(
PATIENT_NO String COMMENT '住院号',
NAME String COMMENT '姓名',
SEX_CODE String COMMENT '性别',
BIRTHDATE TIMESTAMP COMMENT '生日',
BALANCE_COST String COMMENT '总费用')
COMMENT '住院患者基本信息'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
STORED AS TEXTFILE;

1.2、hive表导入数据

load data inpath '/usr/hadoop/inpatient.txt' into table inpatient_hv

1.3、创建hbase表

create table inpatient_hb(
PATIENT_NO String COMMENT '住院号',
NAME String COMMENT '姓名',
SEX_CODE String COMMENT '性别',
BIRTHDATE TIMESTAMP COMMENT '生日',
BALANCE_COST String COMMENT '总费用')
COMMENT '住院患者基本信息'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST") 
TBLPROPERTIES ("hbase.table.name" = "inpatient_hb");

1.4、数据从hive导入hbase

INSERT OVERWRITE TABLE inpatient_hb SELECT * FROM inpatient_hv;

2、hbase到hive
2.1、创建hbase表

create 'inpatient_hb','pinfo'

2.2、hbase表导入数据

./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COS inpatient_hb /usr/hadoop/inpatient.txt

2.3、创建hive表

#创建hbase external表
create external table inpatient_hb(
PATIENT_NO String COMMENT '住院号',
INPATIENT_NO String  COMMENT '住院流水号',
NAME String COMMENT '姓名',
SEX_CODE String COMMENT '性别',
BIRTHDATE TIMESTAMP COMMENT '生日',
BALANCE_COST String COMMENT '总费用')
COMMENT '住院患者基本信息'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST") 
TBLPROPERTIES ("hbase.table.name" = "inpatient_hb");

#创建hive表
create table inpatient_hv(
PATIENT_NO String COMMENT '住院号',
INPATIENT_NO String  COMMENT '住院流水号',
NAME String COMMENT '姓名',
SEX_CODE String COMMENT '性别',
BIRTHDATE TIMESTAMP COMMENT '生日',
BALANCE_COST String COMMENT '总费用')
COMMENT '住院患者基本信息'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
STORED AS TEXTFILE;

2.4、数据从hbase导入hive

INSERT OVERWRITE TABLE inpatient_hv SELECT * FROM inpatient_hb;

Hive环境搭建04

1、建表

create table inpatient(
PATIENT_NO String COMMENT '住院号',
NAME String COMMENT '姓名',
SEX_CODE String COMMENT '性别',
BIRTHDATE TIMESTAMP COMMENT '生日',
BALANCE_COST String COMMENT '总费用')
COMMENT '住院患者基本信息'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST") 
TBLPROPERTIES ("hbase.table.name" = "inpatient");

2、Hbase导入数据
2.1、Hbase直接导入

./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST inpatient /usr/hadoop/inpatient.txt
......
2016-12-22 10:33:36,985 INFO  [main] client.RMProxy: Connecting to ResourceManager at hadoop-master/172.16.172.13:8032
2016-12-22 10:33:37,340 INFO  [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-12-22 10:33:43,450 INFO  [main] input.FileInputFormat: Total input paths to process : 1
2016-12-22 10:33:44,640 INFO  [main] mapreduce.JobSubmitter: number of splits:1
2016-12-22 10:33:44,952 INFO  [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-12-22 10:33:47,173 INFO  [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1482371551462_0002
2016-12-22 10:33:50,830 INFO  [main] impl.YarnClientImpl: Submitted application application_1482371551462_0002
2016-12-22 10:33:51,337 INFO  [main] mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1482371551462_0002/
2016-12-22 10:33:51,338 INFO  [main] mapreduce.Job: Running job: job_1482371551462_0002
2016-12-22 10:34:39,499 INFO  [main] mapreduce.Job: Job job_1482371551462_0002 running in uber mode : false
2016-12-22 10:34:39,572 INFO  [main] mapreduce.Job:  map 0% reduce 0%
2016-12-22 10:35:48,228 INFO  [main] mapreduce.Job:  map 1% reduce 0%
2016-12-22 10:36:06,876 INFO  [main] mapreduce.Job:  map 3% reduce 0%
2016-12-22 10:36:09,981 INFO  [main] mapreduce.Job:  map 5% reduce 0%
2016-12-22 10:36:13,739 INFO  [main] mapreduce.Job:  map 7% reduce 0%
2016-12-22 10:36:17,592 INFO  [main] mapreduce.Job:  map 10% reduce 0%
2016-12-22 10:36:22,891 INFO  [main] mapreduce.Job:  map 12% reduce 0%
2016-12-22 10:36:45,217 INFO  [main] mapreduce.Job:  map 17% reduce 0%
2016-12-22 10:37:14,914 INFO  [main] mapreduce.Job:  map 20% reduce 0%
2016-12-22 10:37:35,739 INFO  [main] mapreduce.Job:  map 25% reduce 0%
2016-12-22 10:37:39,013 INFO  [main] mapreduce.Job:  map 34% reduce 0%
2016-12-22 10:38:24,289 INFO  [main] mapreduce.Job:  map 42% reduce 0%
2016-12-22 10:38:36,644 INFO  [main] mapreduce.Job:  map 49% reduce 0%
2016-12-22 10:38:57,618 INFO  [main] mapreduce.Job:  map 54% reduce 0%
2016-12-22 10:39:00,808 INFO  [main] mapreduce.Job:  map 56% reduce 0%
2016-12-22 10:39:07,879 INFO  [main] mapreduce.Job:  map 58% reduce 0%
2016-12-22 10:39:11,489 INFO  [main] mapreduce.Job:  map 60% reduce 0%
2016-12-22 10:39:24,708 INFO  [main] mapreduce.Job:  map 62% reduce 0%
2016-12-22 10:39:29,188 INFO  [main] mapreduce.Job:  map 63% reduce 0%
2016-12-22 10:39:34,165 INFO  [main] mapreduce.Job:  map 65% reduce 0%
2016-12-22 10:40:12,473 INFO  [main] mapreduce.Job:  map 66% reduce 0%
2016-12-22 10:40:39,471 INFO  [main] mapreduce.Job:  map 73% reduce 0%
2016-12-22 10:40:40,910 INFO  [main] mapreduce.Job:  map 74% reduce 0%
2016-12-22 10:40:42,936 INFO  [main] mapreduce.Job:  map 75% reduce 0%
2016-12-22 10:40:46,471 INFO  [main] mapreduce.Job:  map 77% reduce 0%
2016-12-22 10:40:50,495 INFO  [main] mapreduce.Job:  map 79% reduce 0%
2016-12-22 10:40:53,267 INFO  [main] mapreduce.Job:  map 81% reduce 0%
2016-12-22 10:41:06,843 INFO  [main] mapreduce.Job:  map 83% reduce 0%
2016-12-22 10:41:13,140 INFO  [main] mapreduce.Job:  map 92% reduce 0%
2016-12-22 10:41:22,305 INFO  [main] mapreduce.Job:  map 93% reduce 0%
2016-12-22 10:41:27,671 INFO  [main] mapreduce.Job:  map 96% reduce 0%
2016-12-22 10:41:48,688 INFO  [main] mapreduce.Job:  map 100% reduce 0%
2016-12-22 10:43:20,552 INFO  [main] mapreduce.Job: Job job_1482371551462_0002 completed successfully
2016-12-22 10:43:28,574 INFO  [main] mapreduce.Job: Counters: 31
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=127746
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=43306042
                HDFS: Number of bytes written=0
                HDFS: Number of read operations=2
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=0
        Job Counters
                Launched map tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=460404
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=460404
                Total vcore-seconds taken by all map tasks=460404
                Total megabyte-seconds taken by all map tasks=471453696
        Map-Reduce Framework
                Map input records=115411
                Map output records=115152
                Input split bytes=115
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=26590
                CPU time spent (ms)=234550
                Physical memory (bytes) snapshot=83329024
                Virtual memory (bytes) snapshot=544129024
                Total committed heap usage (bytes)=29036544
        ImportTsv
                Bad Lines=259
        File Input Format Counters
                Bytes Read=43305927
        File Output Format Counters
                Bytes Written=0

2.2、completebulkload导入

#/etc/profile中添加下面一行
#export HADOOP_CLASSPATH="$HADOOP_CLASSPATH:$HBASE_HOME/lib/*"
./hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.bulk.output=/usr/hadoop/inpatient.tmp -Dimporttsv.columns=HBASE_ROW_KEY,pinfo:INPATIENT_NO,pinfo:NAME,pinfo:SEX_CODE,pinfo:BIRTHDATE,pinfo:BALANCE_COST inpatient /usr/hadoop/inpatient.txt
......
2016-12-22 12:26:04,496 INFO  [main] client.RMProxy: Connecting to ResourceManager at hadoop-master/172.16.172.13:8032
2016-12-22 12:26:12,411 INFO  [main] input.FileInputFormat: Total input paths to process : 1
2016-12-22 12:26:12,563 INFO  [main] mapreduce.JobSubmitter: number of splits:1
2016-12-22 12:26:12,577 INFO  [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2016-12-22 12:26:13,220 INFO  [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1482371551462_0005
2016-12-22 12:26:13,764 INFO  [main] impl.YarnClientImpl: Submitted application application_1482371551462_0005
2016-12-22 12:26:13,832 INFO  [main] mapreduce.Job: The url to track the job: http://hadoop-master:8088/proxy/application_1482371551462_0005/
2016-12-22 12:26:13,833 INFO  [main] mapreduce.Job: Running job: job_1482371551462_0005
2016-12-22 12:26:35,952 INFO  [main] mapreduce.Job: Job job_1482371551462_0005 running in uber mode : false
2016-12-22 12:26:36,156 INFO  [main] mapreduce.Job:  map 0% reduce 0%
2016-12-22 12:27:15,839 INFO  [main] mapreduce.Job:  map 3% reduce 0%
2016-12-22 12:27:18,868 INFO  [main] mapreduce.Job:  map 53% reduce 0%
2016-12-22 12:27:21,981 INFO  [main] mapreduce.Job:  map 58% reduce 0%
2016-12-22 12:27:29,195 INFO  [main] mapreduce.Job:  map 67% reduce 0%
2016-12-22 12:27:41,582 INFO  [main] mapreduce.Job:  map 83% reduce 0%
2016-12-22 12:27:52,819 INFO  [main] mapreduce.Job:  map 85% reduce 0%
2016-12-22 12:27:59,189 INFO  [main] mapreduce.Job:  map 93% reduce 0%
2016-12-22 12:28:07,498 INFO  [main] mapreduce.Job:  map 100% reduce 0%
2016-12-22 12:29:11,199 INFO  [main] mapreduce.Job:  map 100% reduce 67%
2016-12-22 12:29:24,353 INFO  [main] mapreduce.Job:  map 100% reduce 70%
2016-12-22 12:29:32,324 INFO  [main] mapreduce.Job:  map 100% reduce 74%
2016-12-22 12:29:37,001 INFO  [main] mapreduce.Job:  map 100% reduce 79%
2016-12-22 12:29:38,011 INFO  [main] mapreduce.Job:  map 100% reduce 82%
2016-12-22 12:29:41,038 INFO  [main] mapreduce.Job:  map 100% reduce 84%
2016-12-22 12:29:45,082 INFO  [main] mapreduce.Job:  map 100% reduce 88%
2016-12-22 12:29:48,115 INFO  [main] mapreduce.Job:  map 100% reduce 90%
2016-12-22 12:29:51,154 INFO  [main] mapreduce.Job:  map 100% reduce 92%
2016-12-22 12:29:54,186 INFO  [main] mapreduce.Job:  map 100% reduce 94%
2016-12-22 12:29:57,205 INFO  [main] mapreduce.Job:  map 100% reduce 97%
2016-12-22 12:30:00,236 INFO  [main] mapreduce.Job:  map 100% reduce 100%
2016-12-22 12:30:06,388 INFO  [main] mapreduce.Job: Job job_1482371551462_0005 completed successfully
2016-12-22 12:30:09,203 INFO  [main] mapreduce.Job: Counters: 50
        File System Counters
                FILE: Number of bytes read=237707880
                FILE: Number of bytes written=357751428
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=43306042
                HDFS: Number of bytes written=195749237
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=3
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=99691
                Total time spent by all reduces in occupied slots (ms)=83330
                Total time spent by all map tasks (ms)=99691
                Total time spent by all reduce tasks (ms)=83330
                Total vcore-seconds taken by all map tasks=99691
                Total vcore-seconds taken by all reduce tasks=83330
                Total megabyte-seconds taken by all map tasks=102083584
                Total megabyte-seconds taken by all reduce tasks=85329920
        Map-Reduce Framework
                Map input records=115411
                Map output records=115152
                Map output bytes=118397787
                Map output materialized bytes=118853937
                Input split bytes=115
                Combine input records=115152
                Combine output records=115077
                Reduce input groups=115077
                Reduce shuffle bytes=118853937
                Reduce input records=115077
                Reduce output records=3337137
                Spilled Records=345231
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=2017
                CPU time spent (ms)=38130
                Physical memory (bytes) snapshot=383750144
                Virtual memory (bytes) snapshot=1184014336
                Total committed heap usage (bytes)=231235584
        ImportTsv
                Bad Lines=259
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=43305927
        File Output Format Counters
                Bytes Written=195749237

3、在hive中进行查询

hive> select * from inpatient limit 1;
OK
......
Time taken: 12.419 seconds, Fetched: 1 row(s)

hive> select count(*) from inpatient;
Query ID = hadoop_20161222114304_b247c745-a6ec-4e52-b76d-daefb657ac20
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1482371551462_0004, Tracking URL = http://hadoop-master:8088/proxy/application_1482371551462_0004/
Kill Command = /home/hadoop/Deploy/hadoop-2.5.2/bin/hadoop job  -kill job_1482371551462_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-12-22 11:44:22,634 Stage-1 map = 0%,  reduce = 0%
2016-12-22 11:45:08,704 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.74 sec
2016-12-22 11:45:50,754 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 8.19 sec
MapReduce Total cumulative CPU time: 8 seconds 190 msec
Ended Job = job_1482371551462_0004
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 8.19 sec   HDFS Read: 13353 HDFS Write: 7 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 190 msec
OK
115077
Time taken: 170.801 seconds, Fetched: 1 row(s)

./hadoop jar /home/hadoop/Deploy/hbase-1.1.2/lib/hbase-server-1.1.2.jar completebulkload /usr/hadoop/inpatient.tmp inpatient
......
16/12/22 12:42:04 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x4df040780x0, quorum=localhost:2181, baseZNode=/hbase
16/12/22 12:42:04 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
16/12/22 12:42:04 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
16/12/22 12:42:04 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x5924755d380005, negotiated timeout = 90000
16/12/22 12:42:06 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x7979cd9c connecting to ZooKeeper ensemble=localhost:2181
16/12/22 12:42:06 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=localhost:2181 sessionTimeout=90000 watcher=hconnection-0x7979cd9c0x0, quorum=localhost:2181, baseZNode=/hbase
16/12/22 12:42:06 INFO zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
16/12/22 12:42:06 INFO zookeeper.ClientCnxn: Socket connection established to localhost/127.0.0.1:2181, initiating session
16/12/22 12:42:07 INFO zookeeper.ClientCnxn: Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x5924755d380006, negotiated timeout = 90000
16/12/22 12:42:07 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://hadoop-master:9000/usr/hadoop/inpatient.tmp/_SUCCESS
16/12/22 12:42:08 INFO hfile.CacheConfig: CacheConfig:disabled
16/12/22 12:42:08 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://hadoop-master:9000/usr/hadoop/inpatient.tmp/pinfo/7ee330c0f66c4d36b5d614a337d3929f first=" last="B301150360"
16/12/22 12:42:08 INFO client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
16/12/22 12:42:08 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x5924755d380006
16/12/22 12:42:08 INFO zookeeper.ZooKeeper: Session: 0x5924755d380006 closed
16/12/22 12:42:08 INFO zookeeper.ClientCnxn: EventThread shut down

HBase简单通讯代码

首先,就要说一下配置问题了。HBase客户端的配置有两种方式,一种是通过配置文件,另一种是通过代码设置。

1、配置文件方式
配置文件名称为hbase-site.xml,该文件必须放置到CLASS_PATH下面才会有效,文件示例如下:
hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
	<property>
		<name>hbase.rootdir</name>
		<value>hdfs://hadoop-master:9000/hbase</value>
	</property>
	<property>
		<name>hbase.cluster.distributed</name>
		<value>true</value>
	</property>
	<property>
		<name>hbase.master</name>
		<value>hdfs://hadoop-master:60000</value>
	</property>
	<property>
		<name>hbase.zookeeper.quorum</name>
		<value>hadoop-master,hadoop-slave01,hadoop-slave02</value>
	</property>
</configuration>

2、通过代码配置方式

        Configuration hbaseConfig = HBaseConfiguration.create();
        hbaseConfig.setInt("timeout", 120000);
        hbaseConfig.set("hbase.master", "hdfs://hadoop-master:60000");
        hbaseConfig.set("hbase.zookeeper.quorum", "hadoop-master,hadoop-slave01,hadoop-slave02");
        hbaseConfig.setInt("hbase.zookeeper.property.clientPort", 2181);
        hbaseConfig.setInt("hbase.client.retries.number", 1);

Continue reading HBase简单通讯代码

HBase基本操作03

接上一篇,HBase与传统数据库一个很大的不同之处:HBase可以保存多个版本的值,而不仅仅是保存最新的值。版本之间,通过timestamp属性来区分。

我们来具体看一下,首先新增一行visit100,我们多次设置列personinfo:name,每次设置后都查询一次最新的值:

hbase(main):011:0> put 'patientvisit','visit100','personinfo:name','A'
0 row(s) in 0.0890 seconds

hbase(main):012:0> get 'patientvisit','visit100','personinfo:name'
COLUMN                                        CELL                                                                                                                               
 personinfo:name                              timestamp=1454342758182, value=A                                                                                                   
1 row(s) in 0.0310 seconds

hbase(main):013:0> put 'patientvisit','visit100','personinfo:name','B'
0 row(s) in 0.0100 seconds

hbase(main):014:0> get 'patientvisit','visit100','personinfo:name'
COLUMN                                        CELL                                                                                                                               
 personinfo:name                              timestamp=1454342796166, value=B                                                                                                   
1 row(s) in 0.0230 seconds

hbase(main):015:0> put 'patientvisit','visit100','personinfo:name','B'
0 row(s) in 0.0140 seconds

hbase(main):016:0> get 'patientvisit','visit100','personinfo:name'
COLUMN                                        CELL                                                                                                                               
 personinfo:name                              timestamp=1454342802829, value=B                                                                                                   
1 row(s) in 0.0120 seconds

我们可以看到,每次设置该列的数值timestamp都有相应的变化:

timestamp value
1454342758182 A
1454342796166 B
1454342802829 B

Continue reading HBase基本操作03

HBase基本操作02

接上一篇,patientvisit表中的数据,最终是什么样子的呢?
为了便于查看,我把数据拆分成了两个Table:

Row Key personinfo personinfoex
empi name sex birthday address patid
visit001 empi001 zhangsan male 1999-12-31 shanghai xxx road pat001
visit002 empi002 lisi male 2000-01-01 beijing pat002
visit003 empi003 wangwu female 1999-12-30 guangzhou pat002
Row Key visitinfo visitinfoex
visitid visittime visitdocid visitdocname
visit001 visit001 2015-07-25 10:10:00 doc001 Dr. Yang
visit002 visit002 2015-07-26 11:11:00 doc001 Dr. Yang
visit003 visit003 2015-07-27 13:13:00 doc002 Dr. Li

Continue reading HBase基本操作02

HBase基本操作01

首先说一下HBase与传统的关系型数据库在逻辑层次上的不同:
1、HBase的表结构定义中是不需要定义列的,只需要定义列族(可以暂时把列族当成多个列的集合)。所以在建表的时候,只需要指定列族即可。在列族中新增列,是不需要任何事先声明的,直接使用就好了。
2、HBase中,行是通过key来定位的,扫描更是通过key来进行的。所以行的key值选择,就显得十分重要。表结构的定义及key值的选择,实际上决定了数据是否可以高效利用。
3、HBase中,行Key+列族名+列名,可以定义到唯一的一个Cell

其实,从这里大家可以看出:
1、HBase通过对行进行了一定的限制,实现了列的灵活操作,解决了列扩展的问题
2、我们实际应用中,往往将主表及多个关联表不计重复的一起记录到列中,通过对空间的浪费,来实现了时间的节省,解决了查询效率的问题。
换句话说,数据的增加速度,超出了硬件进步的水平,硬件处理速度已经无法满足如此大量数据的处理,只能通过分布式技术,通过并发处理,将任务分配到多个节点,才能满足速度的需要。
3、分布式处理也要付出代价,节点间的通信,再小也会是个技术瓶颈。从这个角度来说,如果数据量没有这么大的话,采用分布式处理,反而不如单节点优化的效果好。
4、分布式处理的优点是,可以通过一堆性能一般的电脑,达到一台高性能计算机的处理速度,同时自带了数据冗余机制,降低了维护量。但其维护量,总体上还是上升了。所以是否采用,就要权衡数据量及维护量之间的关系了。

哦,扯远了。。。咱们继续。

第一步当然是看一下帮助:

hadoop@hadoop-master:~/Deploy/hbase-1.1.2$ bin/hbase shell

hbase(main):061:0> help
HBase Shell, version 1.1.2, rcc2b70cf03e3378800661ec5cab11eb43fafe0fc, Wed Aug 26 20:11:27 PDT 2015
Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.
Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.

COMMAND GROUPS:
  Group name: general
  Commands: status, table_help, version, whoami

  Group name: ddl
  Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, show_filters

  Group name: namespace
  Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables

  Group name: dml
  Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve

  Group name: tools
  Commands: assign, balance_switch, balancer, balancer_enabled, catalogjanitor_enabled, catalogjanitor_run, catalogjanitor_switch, close_region, compact, compact_rs, flush, major_compact, merge_region, move, split, trace, unassign, wal_roll, zk_dump

  Group name: replication
  Commands: add_peer, append_peer_tableCFs, disable_peer, disable_table_replication, enable_peer, enable_table_replication, list_peers, list_replicated_tables, remove_peer, remove_peer_tableCFs, set_peer_tableCFs, show_peer_tableCFs

  Group name: snapshots
  Commands: clone_snapshot, delete_all_snapshot, delete_snapshot, list_snapshots, restore_snapshot, snapshot

  Group name: configuration
  Commands: update_all_config, update_config

  Group name: quotas
  Commands: list_quotas, set_quota

  Group name: security
  Commands: grant, revoke, user_permission

  Group name: visibility labels
  Commands: add_labels, clear_auths, get_auths, list_labels, set_auths, set_visibility

SHELL USAGE:
Quote all names in HBase Shell such as table and column names.  Commas delimit
command parameters.  Type <RETURN> after entering a command to run it.
Dictionaries of configuration used in the creation and alteration of tables are
Ruby Hashes. They look like this:

  {'key1' => 'value1', 'key2' => 'value2', ...}

and are opened and closed with curley-braces.  Key/values are delimited by the
'=>' character combination.  Usually keys are predefined constants such as
NAME, VERSIONS, COMPRESSION, etc.  Constants do not need to be quoted.  Type
'Object.constants' to see a (messy) list of all constants in the environment.

If you are using binary keys or values and need to enter them in the shell, use
double-quote'd hexadecimal representation. For example:

  hbase> get 't1', "key\x03\x3f\xcd"
  hbase> get 't1', "key\003\023\011"
  hbase> put 't1', "test\xef\xff", 'f1:', "\x01\x33\x40"

The HBase shell is the (J)Ruby IRB with the above HBase-specific commands added.
For more on the HBase Shell, see http://hbase.apache.org/book.html
hbase(main):062:0> 

Continue reading HBase基本操作01

搭建HBase集群环境

1、正常运行Hadoop集群

2、检查HBase与Hadoop的兼容性,下载对应的正确版本(*如果要看后续文章,建议使用hadoop-2.5.2 hbase-1.1.2 hive-1.2.1 spark-2.0.0)
S:支持
X:不支持
NT:未测试

HBase-0.94.x HBase-0.98.x (Support for Hadoop 1.1+ is deprecated.) HBase-1.0.x (Hadoop 1.x is NOT supported) HBase-1.1.x HBase-1.2.x
Hadoop-1.0.x X X X X X
Hadoop-1.1.x S NT X X X
Hadoop-0.23.x S X X X X
Hadoop-2.0.x-alpha NT X X X X
Hadoop-2.1.0-beta NT X X X X
Hadoop-2.2.0 NT S NT NT NT
Hadoop-2.3.x NT S NT NT NT
Hadoop-2.4.x NT S S S S
Hadoop-2.5.x NT S S S S
Hadoop-2.6.0 X X X X X
Hadoop-2.6.1+ NT NT NT NT S
Hadoop-2.7.0 X X X X X
Hadoop-2.7.1+ NT NT NT NT S

Continue reading 搭建HBase集群环境