Internals of Sqoop Import
Let us first see how data is imported into Hadoop. We import structured data from a database into Hadoop by having Sqoop read the table row by row and write it into HDFS. As we have seen, the import process runs in parallel, so it generates multiple output files rather than one. These files can be text files, sequence files, Avro files, etc.
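For instance, a parallel import run with four mappers (the connect string, table, and target directory below are illustrative) typically produces four output files, part-m-00000 through part-m-00003:
$ sqoop import --connect jdbc:mysql://localhost/db --username root -P \
    --table student --target-dir /user/hdadmin/student \
    --num-mappers 4 --as-textfile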
When we import any data, a Java class is generated in the current local directory. This class is used by Sqoop itself: Sqoop uses the generated code to deserialize table-specific data before writing it into HDFS. The generated class is capable of holding a single record retrieved from the imported table.
We can also use a different Sqoop tool to generate this code without performing an import:
% sqoop codegen --connect jdbc:mysql://localhost/mysql \
    --table student --class-name student1
This tool can be used to regenerate the code if we accidentally remove the source file, or to generate the code with different settings.
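As a sketch of the latter, codegen's --outdir and --bindir options control where the generated source and compiled classes are placed (the class name and paths below are illustrative):
% sqoop codegen --connect jdbc:mysql://localhost/mysql \
    --table student --class-name Student \
    --outdir /tmp/sqoop-src --bindir /tmp/sqoop-bin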
Like Hadoop, Sqoop is written in Java. So let us see how Sqoop works under the hood.
First, download the JDBC driver for the type of database we want to import from and install its .jar file in the $SQOOP_HOME/lib directory on our Sqoop client. The .jar file contains a specific driver class which marks the entry point to the driver. For example, the MySQL Connector/J library has the driver class com.mysql.jdbc.Driver. This class name is what gets passed to the --driver argument when the driver must be specified explicitly.
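For MySQL, for example, installing the driver amounts to copying the Connector/J jar into that directory (the jar version below is purely illustrative):
$ cp mysql-connector-java-5.1.38-bin.jar $SQOOP_HOME/lib/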
To access data stored in an RDBMS, Sqoop uses JDBC, a Java API. JDBC is used to examine the table we are importing and to map each of its column types to a Java data type (for example, VARCHAR to String and INTEGER to Integer). The JDBC driver contains the code necessary to connect to its database server. Sqoop can infer the JDBC driver to use from the URL in the connect string.
If Sqoop cannot work out which JDBC driver is appropriate, we can also specify the driver explicitly using the --driver argument. Then, as discussed above for the code generator, Sqoop uses this information to create a table class that holds the records extracted from the table.
To get a consistent snapshot of the source data, we should disable any processes that update rows during the import. Once the data has been processed in Hadoop, we can send the results back to the RDBMS using the export process: it reads a set of text files, parses them into records, and inserts them as new rows in the target database.
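A sketch of such an export (the table name, HDFS directory, and field delimiter are illustrative):
$ sqoop export --connect jdbc:mysql://localhost/db --username root \
    --table student_results --export-dir /user/hdadmin/student-output \
    --input-fields-terminated-by ','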
To connect to a SQL Server database, download the driver from microsoft.com, install it in our Sqoop lib path, and then run the import as:
$ sqoop import --driver com.microsoft.jdbc.sqlserver.SQLServerDriver \
    --connect <connect-string> ...
To see the Sqoop version we can use:
hdadmin@ubuntu:~/sqoop-1.4.5-cdh5.3.2$ sqoop version
17/12/23 00:15:56 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.2
Sqoop 1.4.5-cdh5.3.2
git commit id
Compiled by jenkins on Tue Feb 24 12:45:57 PST 2015
Usage of Generic and Specific Arguments
Let us take an example that shows the generic and specific arguments of Sqoop.
$ sqoop help import
usage: sqoop import [GENERIC-ARGS] [TOOL-ARGS]

Common arguments:
   --connect <jdbc-uri>              Specify JDBC connect string
   --connect-manager <class-name>    Specify connection manager class to use
   --driver <class-name>             Manually specify JDBC driver class to use
   --hadoop-mapred-home <dir>        Override $HADOOP_MAPRED_HOME
   --help                            Print usage instructions
   --password-file                   Set path for file containing authentication password
   -P                                Read password from console
   --password <password>             Set authentication password
   --username <username>             Set authentication username
   --verbose                         Print more information while working
   --hadoop-home <dir>               Deprecated. Override $HADOOP_HOME

[...]

Generic Hadoop command-line arguments:
(must preceed any tool-specific arguments)
Generic options supported are
-conf <configuration file>                     specify an application configuration file
-D <property=value>                            use value for given property
-fs <local|namenode:port>                      specify a namenode
-jt <local|jobtracker:port>                    specify a job tracker
-files <comma separated list of files>         specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>        specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>   specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
So, as you can see in the above syntax, you must supply the generic arguments before any tool-specific arguments. The generic arguments specify the Hadoop server settings and configuration information. They can also be specified by adding properties to the conf/sqoop-site.xml file, like:
<property>
  <name>property.name</name>
  <value>property.value</value>
</property>
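Equivalently, a single Hadoop property can be passed on the command line through the generic -D option listed above; for example (the property and value here are illustrative):
$ sqoop import -D mapreduce.job.queuename=default \
    --connect jdbc:mysql://localhost/db --username root --table TEST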
We can give the complete syntax on the Sqoop command line, or we can use an options file. An options file is a text file in which each line holds one option, in the same order as it would appear on the command line. Comments are also supported in options files.
For example, the following Sqoop import invocation can alternatively be specified as shown below:
$ sqoop import --connect jdbc:mysql://localhost/db --username root --table TEST
$ sqoop --options-file /users/homer/work/import.txt --table TEST
where the options file /users/homer/work/import.txt contains the following:
import
--connect
jdbc:mysql://localhost/db
--username
root
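Since comment lines beginning with # and blank lines are allowed, the same file could be annotated, for instance:
#
# Options file for Sqoop import
#

# The tool being invoked
import

# Connection details
--connect
jdbc:mysql://localhost/db
--username
root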
Sqoop Connectors:
To import and export data, Sqoop needs a connector for the database in question; the connector is what enables Sqoop's import and export against that system. Sqoop ships with connectors for many databases such as Oracle, MySQL, SQL Server, etc. In addition to these built-in connectors, third-party connectors are also available.
