Colm O hEigeartaigh


Securing Apache Solr - part I

Mon, 06/26/2017 - 11:46
This is the first post in a series of articles on securing Apache Solr. In this post we will look at deploying an example SolrCloud instance and securing access to it via basic authentication.

1) Install and deploy a SolrCloud example

Download and extract Apache Solr (6.6.0 was used for the purpose of this tutorial). Now start SolrCloud via:
  • bin/solr -e cloud
Accept all of the default options. This creates a cluster of two nodes, with a collection "gettingstarted" split into two shards and two replicas per-shard. A web interface is available after startup at: http://localhost:8983/solr/.

Once the cluster is up and running we can post some data to the collection we have created via the REST interface:
  • curl http://localhost:8983/solr/gettingstarted/update -d '[ {"id" : "book1", "title_t" : "The Merchant of Venice", "author_s" : "William Shakespeare"}]'
  • curl http://localhost:8983/solr/gettingstarted/update -d '[ {"id" : "book2", "title_t" : "Macbeth", "author_s" : "William Shakespeare"}]'
  • curl http://localhost:8983/solr/gettingstarted/update -d '[ {"id" : "book3", "title_t" : "Death of a Salesman", "author_s" : "Arthur Miller"}]'
We can then query the REST interface, for example to return all entries by William Shakespeare:
  • curl http://localhost:8983/solr/gettingstarted/query?q=author_s:William+Shakespeare
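The update and query requests above can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper names `build_update_body` and `build_query_url` are hypothetical, and wrapping the value in double quotes (so that a multi-word name is searched as one phrase) is an assumption about the desired matching, not something the curl command above does.

```python
import json
from urllib.parse import urlencode

# Base URL of the "gettingstarted" collection created by the cloud example.
SOLR_COLLECTION_URL = "http://localhost:8983/solr/gettingstarted"


def build_update_body(docs):
    """Serialize a list of documents into the JSON array posted to /update."""
    return json.dumps(docs)


def build_query_url(field, value, base_url=SOLR_COLLECTION_URL):
    """Build a /query URL that searches one field for a quoted phrase."""
    # Assumption: quote the value so "William Shakespeare" is one phrase.
    query = f'{field}:"{value}"'
    # urlencode percent-encodes the colon and quotes and turns spaces into "+".
    return f"{base_url}/query?{urlencode({'q': query})}"


if __name__ == "__main__":
    book = {"id": "book1", "title_t": "The Merchant of Venice",
            "author_s": "William Shakespeare"}
    print(build_update_body([book]))
    print(build_query_url("author_s", "William Shakespeare"))
```

The resulting URL can be passed to curl in single quotes, which also avoids the shell interpreting the "?" character.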
2) Authenticating users to our SolrCloud instance

Now that our SolrCloud instance is up and running, let's look at how we can secure access to it, by using HTTP Basic Authentication to authenticate our REST requests. Download the following security configuration which enables Basic Authentication in Solr:
Two users are defined - "alice" and "bob" - both with password "SolrRocks". Now upload this configuration to the Apache Zookeeper instance that is running with Solr:
  • server/scripts/cloud-scripts/zkcli.sh -zkhost localhost:9983 -cmd putfile /security.json security.json
Now try to run the search query above again using curl. A 401 error will be returned. Once we specify the correct credentials, the request works as expected, e.g.:
  • curl -u alice:SolrRocks http://localhost:8983/solr/gettingstarted/query?q=author_s:Arthur+Miller
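For readers calling Solr from code rather than curl, it may help to see what "-u alice:SolrRocks" sends on the wire. With HTTP Basic Authentication the client adds an "Authorization" header containing the Base64 encoding of "username:password". The sketch below shows this with the Python standard library; the function names are hypothetical. Note that Base64 is only an encoding, so the credentials are readable by anyone who can see the request unless TLS is used.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header for HTTP Basic Authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def parse_basic_auth_header(value):
    """Recover (username, password) from a Basic Authorization header value."""
    scheme, _, encoded = value.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic Authorization header")
    decoded = base64.b64decode(encoded).decode("utf-8")
    # Only the first colon separates the username from the password.
    username, _, password = decoded.partition(":")
    return username, password


if __name__ == "__main__":
    header = basic_auth_header("alice", "SolrRocks")
    print(header)
    print(parse_basic_auth_header(header["Authorization"]))
```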
Categories: Colm O hEigeartaigh

SSO support for Apache Syncope REST services

Thu, 06/22/2017 - 18:05
Apache Syncope has recently added SSO support for its REST services in the 2.0.3 release. Previously, access to the REST services of Syncope was via HTTP Basic Authentication. From the 2.0.3 release, SSO support is available using JSON Web Tokens (JWT). In this post, we will look at how this works and how it can be configured.

1) Obtaining an SSO token from Apache Syncope

As stated above, in the past it was necessary to supply HTTP Basic Authentication credentials when invoking on the REST API. Let's look at an example using curl. Assume we have a running Apache Syncope instance with a user "alice" with password "ecila". We can make a GET request to the user self service via:
  • curl -u alice:ecila http://localhost:8080/syncope/rest/users/self
Supplying user credentials on each request can be inconvenient, and the authentication process might not scale very well if the password has to be authenticated against a backend resource. From Apache Syncope 2.0.3, we can instead get an SSO token by sending a POST request to "accessTokens/login" as follows:
  • curl -I -u alice:ecila -X POST http://localhost:8080/syncope/rest/accessTokens/login
The response contains two headers:
  • X-Syncope-Token: A JWT token signed according to the JSON Web Signature (JWS) spec.
  • X-Syncope-Token-Expire: The expiry date of the token
The token is signed using the (symmetric) "HS512" algorithm. It contains the subject ("alice"), the issuer of the token ("ApacheSyncope"), a random token identifier, and timestamps that indicate when the token was issued, when it expires, and the time before which it must not be accepted.

The signing key and the issuer name can be changed by editing 'security.properties' and specifying new values for 'jwsKey' and 'jwtIssuer'. Please note that it is critical to change the signing key from the default value! It is also possible to change the signature algorithm from the next 2.0.4 release via a custom 'securityContext.xml' (see here). The default lifetime of the token (120 minutes) can be changed via the "jwt.lifetime.minutes" configuration property for the domain.

2) Using the SSO token to invoke on a REST service

Now that we have an SSO token, we can use it to invoke on a REST service instead of specifying our username and password as before, e.g.:
  • curl -H "X-Syncope-Token: eyJ0e..." http://localhost:8080/syncope/rest/users/self
The signature is first checked on the token, then the issuer is verified so that it matches what is configured, and then the expiry and not-before dates are checked. If the identifier matches that of a saved access token then authentication is successful.
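
The order of checks described above can be illustrated with a small, self-contained sketch. This is not Syncope's implementation: it is a hand-rolled HS512 JWT signer and verifier built on the Python standard library, using the standard JWT claim names ("sub", "iss", "jti", "iat", "nbf", "exp"). The key, the token identifier and the `known_token_ids` set (standing in for the saved access tokens) are all hypothetical values chosen for the example.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    # Restore the padding that the compact JWT serialization strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims, key):
    """Create a compact JWT signed with HMAC-SHA512 (HS512)."""
    header = {"alg": "HS512", "typ": "JWT"}
    signing_input = ".".join(
        _b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(key, signing_input.encode("ascii"), hashlib.sha512).digest()
    return f"{signing_input}.{_b64url_encode(signature)}"


def verify_token(token, key, expected_issuer, known_token_ids, now=None):
    """Check signature, issuer, expiry, not-before and token id, in that order."""
    now = time.time() if now is None else now
    header_b64, payload_b64, signature_b64 = token.split(".")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS512":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(key, f"{header_b64}.{payload_b64}".encode("ascii"),
                        hashlib.sha512).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise ValueError("invalid signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("iss") != expected_issuer:
        raise ValueError("unexpected issuer")
    if now >= claims["exp"]:
        raise ValueError("token expired")
    if now < claims["nbf"]:
        raise ValueError("token not yet valid")
    if claims.get("jti") not in known_token_ids:
        raise ValueError("unknown or revoked token")
    return claims


if __name__ == "__main__":
    issued = int(time.time())
    claims = {"sub": "alice", "iss": "ApacheSyncope", "jti": "example-id",
              "iat": issued, "nbf": issued, "exp": issued + 120 * 60}
    token = sign_token(claims, b"hypothetical-signing-key")
    print(verify_token(token, b"hypothetical-signing-key", "ApacheSyncope",
                       {"example-id"}))
```

Removing an identifier from `known_token_ids` models revocation: the signature and dates are still valid, but the final check fails.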

Finally, SSO tokens can be seen in the admin console under "Dashboard/Access Token", where they can be manually revoked by the admin user.



Querying Apache HBase using Talend Open Studio for Big Data

Mon, 06/19/2017 - 18:23
Recent blog posts have described how to set up authorization for Apache HBase using Apache Ranger. However, those posts only covered inputting and reading data using the HBase Shell. In this post, we will show how Talend Open Studio for Big Data can be used to read data stored in Apache HBase. This post is along the same lines as other recent tutorials on reading data from Kafka and HDFS.

1) HBase setup

Follow this tutorial on setting up Apache HBase in standalone mode, and creating a 'data' table with some sample values using the HBase Shell.

2) Download Talend Open Studio for Big Data and create a job

Now we will download Talend Open Studio for Big Data (6.4.0 was used for the purposes of this tutorial). Unzip the file when it is downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Click on "Create a new job" and call it "HBaseRead". In the palette search bar on the right-hand side, enter "hbase" and hit enter. Drag "tHBaseConnection" and "tHBaseInput" from the palette to the middle of the screen, and do the same for "tLogRow".

"tHBaseConnection" is used to set up the connection to "HBase", "tHBaseInput" uses the connection to read data from HBase, and "tLogRow" will log the data that was read so that we can see that the job ran successfully. Right-click on "tHBaseConnection" and select "Trigger/On Subjob Ok" and drag the resulting arrow to the "tHBaseInput" component. Now right click on "tHBaseInput" and select "Row/Main" and drag the arrow to "tLogRow".
3) Configure the components

Now let's configure the individual components. Double click on "tHBaseConnection" and select the distribution "Hortonworks" and Version "HDP V2.5.0" (from an earlier tutorial we are using HBase 1.2.6). We are not using Kerberos here so we can skip the rest of the security configuration. Now double click on "tHBaseInput". Select the "Use an existing connection" checkbox. Now hit "Edit Schema" and add two entries to map the column we created in two different column families: "c1" which matches DB "col1" of type String, and "c2" which matches DB "col1" of type String.


Select "data" for the table name back in tHBaseInput and add a mapping for "c1" to "colfam1", and "c2" to "colfam2".


Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. You should see "val1" and "val2" appear in the console window.

Securing Apache HBase - part II

Wed, 06/14/2017 - 18:42
This is the second (and final for now) post in a short series of blog posts on securing Apache HBase. The first post looked at how to set up a standalone instance of HBase and how to authorize access to a table using Apache Ranger. In this post, we will look at how Apache Ranger can create "tag" based authorization policies for Apache HBase using Apache Atlas.

1) Start Apache Atlas and create entities/tags for HBase

First let's look at setting up Apache Atlas. Download the latest released version (0.8-incubating) and extract it. Build the distribution that contains an embedded HBase and Solr instance via:
  • mvn clean package -Pdist,embedded-hbase-solr -DskipTests
The distribution will then be available in 'distro/target/apache-atlas-0.8-incubating-bin'. To launch Atlas, we need to set some variables to tell it to use the local HBase and Solr instances:
  • export MANAGE_LOCAL_HBASE=true
  • export MANAGE_LOCAL_SOLR=true
Now let's start Apache Atlas with 'bin/atlas_start.py'. Open a browser and go to 'http://localhost:21000/', logging on with credentials 'admin/admin'. Click on "TAGS" and create a new tag called "customer_data". Now click on "Search" and then follow the "Create new entity" link of type "hbase_table" with the following parameters:
  • Name: data
  • QualifiedName: data@cl1
  • Uri: data
Now add the 'customer_data' tag to the entity that we have created.

2) Use the Apache Ranger TagSync service to import tags from Atlas into Ranger

To create tag-based policies in Apache Ranger, we have to import the entity and tag we have created in Apache Atlas into Ranger via the Ranger TagSync service. After building Apache Ranger, extract the file called "target/ranger-<version>-tagsync.tar.gz". Edit 'install.properties' as follows:
  • Set TAG_SOURCE_ATLAS_ENABLED to "false"
  • Set TAG_SOURCE_ATLASREST_ENABLED to  "true" 
  • Set TAG_SOURCE_ATLASREST_DOWNLOAD_INTERVAL_IN_MILLIS to "60000" (just for testing purposes)
  • Specify "admin" for both TAG_SOURCE_ATLASREST_USERNAME and TAG_SOURCE_ATLASREST_PASSWORD
Save 'install.properties' and install the tagsync service via "sudo ./setup.sh". Start the Apache Ranger admin service via "sudo ranger-admin start" and then the tagsync service via "sudo ranger-tagsync-services.sh start".
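
If you rebuild the tagsync distribution often, the edits to 'install.properties' listed above can be scripted. The sketch below is one way to do that in Python; it assumes the file consists of "KEY = VALUE" lines with "#" comments, and the function name `apply_settings` is hypothetical. The property names and values are the ones given in this post.

```python
import re

# The tagsync settings described in this post.
TAGSYNC_SETTINGS = {
    "TAG_SOURCE_ATLAS_ENABLED": "false",
    "TAG_SOURCE_ATLASREST_ENABLED": "true",
    "TAG_SOURCE_ATLASREST_DOWNLOAD_INTERVAL_IN_MILLIS": "60000",
    "TAG_SOURCE_ATLASREST_USERNAME": "admin",
    "TAG_SOURCE_ATLASREST_PASSWORD": "admin",
}

# Matches an uncommented "KEY =" at the start of a line.
_KEY_PATTERN = re.compile(r"^\s*([A-Za-z0-9_.]+)\s*=")


def apply_settings(text, settings):
    """Return the properties text with the given keys set to the given values."""
    remaining = dict(settings)
    lines = []
    for line in text.splitlines():
        match = _KEY_PATTERN.match(line)
        if match and match.group(1) in remaining:
            key = match.group(1)
            lines.append(f"{key} = {remaining.pop(key)}")
        else:
            # Comments, blank lines and unrelated properties are kept as-is.
            lines.append(line)
    # Append any keys that were not already present in the file.
    lines.extend(f"{key} = {value}" for key, value in remaining.items())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "# tagsync settings\nTAG_SOURCE_ATLAS_ENABLED = true\nOTHER = 1\n"
    print(apply_settings(sample, TAGSYNC_SETTINGS))
```

To apply it to a real file, read 'install.properties', pass its contents through `apply_settings`, and write the result back before running "sudo ./setup.sh".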

3) Create Tag-based authorization policies in Apache Ranger

Now let's create a tag-based authorization policy in the Apache Ranger admin UI. Click on "Access Manager" and then "Tag based policies". Create a new Tag service called "HBaseTagService". Create a new policy for this service called "CustomerDataPolicy". In the "TAG" field enter a "c" and the "customer_data" tag should pop up, meaning that it was successfully synced in from Apache Atlas. Create an "Allow" condition for the user "bob" with the "Read" permission for the "HBase" component.

We also need to go back to the Resource based policies, edit "cl1_hbase" and select the tag service we have created above. Now we are ready to test the authorization policy we have created with HBase. Start the shell as "bob" and we should be able to read the table we created in the first tutorial:
  • sudo -E -u bob bin/hbase shell
  • scan 'data'
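The idea behind tag-based authorization is that the policy refers to a tag rather than to a resource, and the resource picks up the policy through the tags attached to it in Atlas. The following toy model in Python illustrates that indirection only; it is not how Ranger is implemented, the names `TagPolicy` and `tag_policy_allows` are hypothetical, and it ignores the resource-based policies that Ranger evaluates as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagPolicy:
    """A simplified 'Allow' condition attached to a tag."""
    tag: str
    user: str
    permissions: frozenset


# The "data" entity carries the "customer_data" tag (as set up in Atlas).
ENTITY_TAGS = {"data": {"customer_data"}}

# "CustomerDataPolicy": bob may read anything tagged "customer_data".
POLICIES = [TagPolicy("customer_data", "bob", frozenset({"read"}))]


def tag_policy_allows(user, permission, table,
                      entity_tags=ENTITY_TAGS, policies=POLICIES):
    """Return True if some tag on the table has a policy allowing the access."""
    tags = entity_tags.get(table, set())
    return any(
        policy.tag in tags
        and policy.user == user
        and permission in policy.permissions
        for policy in policies
    )


if __name__ == "__main__":
    print(tag_policy_allows("bob", "read", "data"))
    print(tag_policy_allows("bob", "write", "data"))
```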

Securing Apache HBase - part I

Tue, 06/13/2017 - 14:18
This is the first in a short series of blog posts on securing Apache HBase. HBase is a column-based database that facilitates random read/write access to data stored in the Hadoop FileSystem (HDFS). In this post we will focus on setting up a standalone instance of Apache HBase, and then demonstrate how to use Apache Ranger to authorize access to a HBase table.

1) Install Apache HBase

Download Apache HBase (version 1.2.6 was used for the purposes of this tutorial) and extract it. As stated above, we will set up a standalone version of HBase, which means that HBase itself and Apache Zookeeper run in a single JVM, and data is stored in the local filesystem instead of HDFS. Normally we would authenticate users via Kerberos, but as we are just running HBase in standalone mode, we will focus solely on authorization in this series of tutorials. Start HBase via:
  • bin/start-hbase.sh
Then start the shell and create a sample table called "data", with two column families, and add some rows to the table:
  • bin/hbase shell
  • create 'data', 'colfam1', 'colfam2'
  • put 'data', 'row1', 'colfam1:col1', 'val1'
  • put 'data', 'row1', 'colfam2:col1', 'val2'
  • scan 'data'
The latter command will print out the values stored in the table. Next we will look at using Apache Ranger to restrict access to the 'data' table to authorized users only.
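
To make the data model behind these commands a little more concrete, here is a toy in-memory model in Python of how a cell is addressed by row key, column family and column qualifier. It is only an illustration: the functions `put` and `scan` are hypothetical stand-ins for the shell commands, and the model deliberately ignores timestamps and cell versions.

```python
def put(table, row, column, value):
    """Store a value under row -> column family -> qualifier."""
    family, qualifier = column.split(":", 1)
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value


def scan(table):
    """Return all cells as (row, 'family:qualifier', value), sorted by key."""
    cells = []
    for row in sorted(table):
        for family in sorted(table[row]):
            for qualifier in sorted(table[row][family]):
                cells.append((row, f"{family}:{qualifier}",
                              table[row][family][qualifier]))
    return cells


if __name__ == "__main__":
    data = {}
    put(data, "row1", "colfam1:col1", "val1")
    put(data, "row1", "colfam2:col1", "val2")
    for cell in scan(data):
        print(cell)
```

Both values live in the same row and use the same qualifier "col1"; they are distinct cells only because they belong to different column families.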

2) Install the Apache Ranger HBase plugin 

Download Apache Ranger and verify that the signature is valid and that the message digests match. Extract and build the source, and copy the resulting plugin to a location where you will configure and install it, e.g.:
  • mvn clean package assembly:assembly -DskipTests
  • tar zxvf target/ranger-1.0.0-SNAPSHOT-hbase-plugin.tar.gz
  • mv ranger-1.0.0-SNAPSHOT-hbase-plugin ${ranger.hbase.home}
Now go to ${ranger.hbase.home} and edit "install.properties". You need to specify the following properties:
  • POLICY_MGR_URL: Set this to "http://localhost:6080"
  • REPOSITORY_NAME: Set this to "cl1_hbase".
  • COMPONENT_INSTALL_DIR_NAME: The location of your Apache HBase installation
Save "install.properties" and install the plugin as root via "sudo ./enable-hbase-plugin.sh". The Apache Ranger HBase plugin should now be successfully installed. The ranger plugin will try to store policies by default in "/etc/ranger/cl1_hbase/policycache". As we installed the plugin as "root" make sure that this directory is accessible to the user that is running HBase.

3) Configure authorization policies in the Apache Ranger Admin UI 

The next step is to create some authorization policies for Apache HBase in the Apache Ranger admin service. Please refer to this blog post for information on how to install the Apache Ranger admin service. Assuming the admin service is already installed, start it via "sudo ranger-admin start". Open a browser and log on to "localhost:6080" with the credentials "admin/admin".

Create a new HBase service, adding the following configuration items to the default values:
  • Service Name: cl1_hbase
  • Username/Password: admin
  • hbase.zookeeper.quorum: localhost
Click on "Test Connection" (if HBase is running) to verify that the connection is successful (note: only works from 1.0.0 onwards - see RANGER-1640) and then save the service. Click on "cl1_hbase" and edit the default policy which has been created, and add the user running HBase to the "Allow Condition" permission.

Now we will add a new authorization policy to test access to HBase. Under "Settings + Users/Groups" add two new users called "alice" and "bob", and also create these new users in your local system. Now we can create a new authorization policy to grant "alice" the "Read" permission for the "data" table (all column families and columns).



4) Testing authorization in HBase

The policy we have created above will get downloaded and enforced by the Ranger HBase plugin we installed into HBase. Restart HBase before proceeding further (if it was running with the Ranger plugin before downloading the policy which granted the user running HBase "admin" privileges, then HBase might not be working properly). Now start the shell as "alice" and try to read the table we created earlier:
  • sudo -E -u alice bin/hbase shell
  • scan 'data'
This should work due to the authorization policy we created. However, "alice" should not be allowed to write to "data", e.g. the following should result in an "AccessDeniedException":
  • put 'data', 'row1', 'colfam1:col1', 'val3'
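The behaviour tested above, where a read succeeds and a write is rejected, follows from access being denied unless some policy explicitly allows it. The sketch below models that deny-by-default check in Python. It is a simplification for illustration and not Ranger's evaluation logic; the names `ResourcePolicy`, `AccessDenied` and `check_access` are hypothetical.

```python
from dataclasses import dataclass


class AccessDenied(Exception):
    """Raised when no policy allows the requested access."""


@dataclass(frozen=True)
class ResourcePolicy:
    """A simplified 'Allow' condition for one table ('*' matches any table)."""
    table: str
    users: frozenset
    permissions: frozenset


def check_access(policies, user, permission, table):
    """Return normally if a policy allows the access, else raise AccessDenied."""
    for policy in policies:
        if (policy.table in ("*", table)
                and user in policy.users
                and permission in policy.permissions):
            return
    raise AccessDenied(f"{user} may not {permission} table {table!r}")


if __name__ == "__main__":
    # The policy created in this post: alice may read the "data" table.
    policies = [ResourcePolicy("data", frozenset({"alice"}), frozenset({"read"}))]
    check_access(policies, "alice", "read", "data")
    try:
        check_access(policies, "alice", "write", "data")
    except AccessDenied as exc:
        print(exc)
```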

Securing Apache Storm - part IV

Tue, 06/06/2017 - 16:20
This is the fourth and final post in a series of blog posts on securing Apache Storm. The first post looked at setting up a simple Storm cluster that authenticates users via Kerberos, and deploying a topology. The second post looked at deploying the Storm UI using Kerberos, and accessing it via a REST client. The third post looked at how to use Apache Ranger to authorize access to Apache Storm.  In this post, we will look at how Apache Ranger can create "tag" based authorization policies for Apache Storm using Apache Atlas.

1) Start Apache Atlas and create entities/tags for Storm

First let's look at setting up Apache Atlas. Download the latest released version (0.8-incubating) and extract it. Build the distribution that contains an embedded HBase and Solr instance via:
  • mvn clean package -Pdist,embedded-hbase-solr -DskipTests
    The distribution will then be available in 'distro/target/apache-atlas-0.8-incubating-bin'. To launch Atlas, we need to set some variables to tell it to use the local HBase and Solr instances:
    • export MANAGE_LOCAL_HBASE=true
    • export MANAGE_LOCAL_SOLR=true
    Now let's start Apache Atlas with 'bin/atlas_start.py'. Open a browser and go to 'http://localhost:21000/', logging on with credentials 'admin/admin'. Click on "TAGS" and create a new tag called "user_topologies".  Unlike for HDFS or Kafka, Atlas doesn't provide an easy way to create a Storm Entity in the UI. Instead we can use the following json file to create a Storm Entity for "*" topologies:

    You can upload it to Atlas via:
    • curl -v -H 'Accept: application/json, text/plain, */*' -H 'Content-Type: application/json;  charset=UTF-8' -u admin:admin -d @storm-create.json http://localhost:21000/api/atlas/entities
    Once the new entity has been uploaded, then you can search for it in the Atlas UI, then click on "+" beside "Tags" and associate the new entity with the "user_topologies" tag.

    2) Use the Apache Ranger TagSync service to import tags from Atlas into Ranger

    To create tag-based policies in Apache Ranger, we have to import the entity and tag we have created in Apache Atlas into Ranger via the Ranger TagSync service. After building Apache Ranger, extract the file called "target/ranger-<version>-tagsync.tar.gz". Edit 'install.properties' as follows:
    • Set TAG_SOURCE_ATLAS_ENABLED to "false"
    • Set TAG_SOURCE_ATLASREST_ENABLED to  "true" 
    • Set TAG_SOURCE_ATLASREST_DOWNLOAD_INTERVAL_IN_MILLIS to "60000" (just for testing purposes)
    • Specify "admin" for both TAG_SOURCE_ATLASREST_USERNAME and TAG_SOURCE_ATLASREST_PASSWORD
    Save 'install.properties' and install the tagsync service via "sudo ./setup.sh". Start the Apache Ranger admin service via "sudo ranger-admin start" and then the tagsync service via "sudo ranger-tagsync-services.sh start".

    3) Create Tag-based authorization policies in Apache Ranger

    Now let's create a tag-based authorization policy in the Apache Ranger admin UI. Click on "Access Manager" and then "Tag based policies". Create a new Tag service called "StormTagService". Create a new policy for this service called "UserTopologiesPolicy". In the "TAG" field enter a "u" and the "user_topologies" tag should pop up, meaning that it was successfully synced in from Apache Atlas. Create an "Allow" condition for the user "alice" with all of the component permissions for "Storm":


    We also need to go back to the Resource based policies, edit "cl1_storm" and select the tag service we have created above. Finally, edit the existing "cl1_storm" policy created as part of the previous tutorials, and remove the permissions for "alice" there, so that we can be sure that authorization is working correctly. Then follow the first tutorial and verify that "alice" is authorized to deploy a topology as per the tag-based authorization policy we have created in Ranger.

    Securing Apache Storm - part III

    Fri, 06/02/2017 - 18:41
    This is the third in a series of blog posts on securing Apache Storm. The first post looked at setting up a simple Storm cluster that authenticates users via Kerberos, and deploying a topology. The second post looked at deploying the Storm UI using Kerberos, and accessing it via a REST client. Thus far we have only looked at how to authenticate users to Storm using Kerberos. In this post we will look at how to use Apache Ranger to authorize access to Apache Storm.

    1) Install the Apache Ranger Storm plugin
     
    Follow the steps in the first tutorial (parts 1 - 3) to set up the Apache Kerby testcase, Apache Zookeeper instance, and the Apache Storm distribution, if you have not done this already. Now we will install the Apache Ranger Storm plugin. If you want to be able to download the topologies from Storm to Ranger when creating policies, then follow the second tutorial to start the Storm UI.

    Download Apache Ranger and verify that the signature is valid and that the message digests match. Due to some bugs that were fixed for the installation process, I am using version 1.0.0-SNAPSHOT in this post. Now extract and build the source, and copy the resulting plugin to a location where you will configure and install it:
    • mvn clean package assembly:assembly -DskipTests
    • tar zxvf target/ranger-1.0.0-SNAPSHOT-storm-plugin.tar.gz
    • mv ranger-1.0.0-SNAPSHOT-storm-plugin ${ranger.storm.home}
    Now go to ${ranger.storm.home} and edit "install.properties". You need to specify the following properties:
    • POLICY_MGR_URL: Set this to "http://localhost:6080"
    • REPOSITORY_NAME: Set this to "cl1_storm".
    • COMPONENT_INSTALL_DIR_NAME: The location of your Apache Storm installation
    Save "install.properties" and install the plugin as root via "sudo ./enable-storm-plugin.sh". The Apache Ranger Storm plugin should now be successfully installed. Now start Kerby, Zookeeper and Storm as covered in the first tutorial.

    2) Create authorization policies in the Apache Ranger Admin console

    Next we will use the Apache Ranger admin console to create authorization policies for Apache Storm. Follow the steps in this tutorial to install the Apache Ranger admin service. To retrieve the running topologies from Apache Storm, you need to configure Kerberos appropriately for Apache Ranger. You can first point to the Kerby krb5.conf via:
    • export JAVA_OPTS="-Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf"
    Start the Apache Ranger admin service with "sudo -E ranger-admin start" and open a browser and go to "http://localhost:6080/" and log on with "admin/admin". Add a new Storm service with the following configuration values:
    • Service Name: cl1_storm
    • Username: storm-client
    • Password: storm-client
    • Nimbus URL: http://localhost:8080
    Click on "Test Connection" to verify that we can connect successfully to Storm and then save the new service. Now click on the "cl1_storm" service that we have created. Edit the existing policy for the "*" Storm topology, adding the user "alice" (create this user if you have not done so already under "Settings, Users/Groups") to all of the available permissions.

    3) Testing authorization in Storm

    Now let's test the Ranger authorization policy we created above in action. The Ranger authorization plugin will pull policies from the Admin service every 30 seconds by default. For the "cl1_storm" example above, they are stored in "/etc/ranger/cl1_storm/policycache/" by default. Make sure that the user you are running Storm as can access this directory. To test authorization follow step 4 in the first tutorial, but use the user "storm-client" instead (and "storm_client.keytab"). You should see an authorization exception. Now try again with user "alice" (and "alice.keytab") and authorization should succeed.

    Securing Apache Storm - part II

    Wed, 05/31/2017 - 13:16
    This is the second in a series of tutorials on securing Apache Storm. The first post looked at setting up a simple Storm cluster that authenticates users via Kerberos, and deploying a topology. Apache Storm also ships with a UI (and REST API) that can be used to download configuration, start/stop topologies, etc. This post looks at deploying the Storm UI using Kerberos, and accessing it via a REST client.

    1) Configure the Apache Storm UI

    The first step is to follow the previous tutorial to deploy the Apache Kerby KDC, to configure Apache Zookeeper, and to download and deploy Apache Storm (sections 1-3). Note that there is a bug in Kerby that is not yet fixed in the 1.0.0 release that you might run into when using curl (see below), depending on whether the MIT libraries are installed or not. In addition to the principals listed in the last post, the Kerby deployment test for Storm also contains a principal for the Storm UI (HTTP/localhost@storm.apache.org).

    Now edit 'conf/storm.yaml' and add the following properties:
    • ui.filter: "org.apache.hadoop.security.authentication.server.AuthenticationFilter"
    •  ui.filter.params:
      • "type": "kerberos"
      • "kerberos.principal": "HTTP/localhost@storm.apache.org"
      • "kerberos.keytab": "/path.to.kerby.project/target/http.keytab"
      • "kerberos.name.rules": "RULE:[2:$1@$0]([jt]t@.*EXAMPLE.COM)s/.*/$MAPRED_USER/ RULE:[2:$1@$0]([nd]n@.*EXAMPLE.COM)s/.*/$HDFS_USER/DEFAULT"
    Start the UI with:
    • bin/storm ui
    2) Invoke on the Storm UI REST API

    We will invoke on the Storm UI REST API using "curl" on the command line. This can be done as follows:
    • export KRB5_CONFIG=/path.to.kerby.project/target/krb5.conf
    • kinit -k -t /path.to.kerby.project/target/alice.keytab alice
    • curl --negotiate -u : -b ~/cookiejar.txt -c ~/cookiejar.txt http://localhost:8080/api/v1/cluster/configuration
    You should see the cluster configuration in JSON format if the call is successful.

    Securing Apache Storm - part I

    Fri, 05/26/2017 - 18:01
    This is the first tutorial in a planned three part series on securing Apache Storm. In this post we will look at setting up a simple Storm cluster that authenticates users via Kerberos, and how to run a simple topology on it. Future posts will cover authorization using Apache Ranger. For more information on how to set up Kerberos for Apache Storm, please see the following documentation.

    1) Set up a KDC using Apache Kerby

    As for other kerberos-related tutorials that I have written on this blog, we will use a github project I wrote that uses Apache Kerby to start up a KDC:
    • bigdata-kerberos-deployment: This project contains some tests which can be used to test kerberos with various big data deployments, such as Apache Hadoop etc.
    The KDC is a simple junit test that is available here. To run it just comment out the "org.junit.Ignore" annotation on the test method. It uses Apache Kerby to define the following principals:
    • zookeeper/localhost@storm.apache.org
    • zookeeper-client@storm.apache.org
    • storm/localhost@storm.apache.org
    • storm-client@storm.apache.org
    • alice@storm.apache.org
    Keytabs are created in the "target" folder. Kerby is configured to use a random port to launch the KDC each time, and it will create a "krb5.conf" file containing the random port number in the target directory.

    2) Download and configure Apache Zookeeper

    Apache Storm uses Apache Zookeeper to help coordinate the cluster. Download Apache Zookeeper (this tutorial used 3.4.10) and extract it to a local directory. Configure Zookeeper to use Kerberos by adding a new file 'conf/zoo.cfg' with the following properties:
    • dataDir=/tmp/zookeeper
    • clientPort=2181
    • authProvider.1 = org.apache.zookeeper.server.auth.SASLAuthenticationProvider
    • requireClientAuthScheme=sasl 
    • jaasLoginRenew=3600000 
    Now create 'conf/zookeeper.jaas' with the following content:

    Server {
            com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/zookeeper.keytab" storeKey=true principal="zookeeper/localhost";
    };

    Before launching Zookeeper, we need to point to the JAAS configuration file above and also to the krb5.conf file generated in the Kerby test-case above. Add a new file 'conf/java.env' that sets the SERVER_JVMFLAGS property to:
    • -Djava.security.auth.login.config=/path.to.zookeeper/conf/zookeeper.jaas
    • -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf
    Start Zookeeper via:
    • bin/zkServer.sh start
    3) Download and configure Apache Storm

    Now download and extract the Apache Storm distribution (1.1.0 was used in this tutorial). Edit 'conf/storm.yaml' and edit the following properties:
    • For "storm.zookeeper.servers" add "- localhost"
    • nimbus.seeds: ["localhost"]
    • storm.thrift.transport: "org.apache.storm.security.auth.kerberos.KerberosSaslTransportPlugin"
    • java.security.auth.login.config: "/path.to.storm/conf/storm.jaas"
    • storm.zookeeper.superACL: "sasl:storm"
    • nimbus.childopts: "-Djava.security.auth.login.config=/path.to.storm/conf/storm.jaas -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf" 
    • ui.childopts: "-Djava.security.auth.login.config=/path.to.storm/conf/storm.jaas -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf" 
    • supervisor.childopts: "-Djava.security.auth.login.config=/path.to.storm/conf/storm.jaas -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf"
    Create a file called 'conf/storm.jaas' with the content:

    Client {
        com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/zookeeper_client.keytab" storeKey=true principal="zookeeper-client";
    };

    StormClient {  
        com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/storm_client.keytab" storeKey=true principal="storm-client" serviceName="storm";
    };

    StormServer {
        com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/storm.keytab" storeKey=true principal="storm/localhost@storm.apache.org";
    };

    'Client' is used to communicate with Zookeeper, 'StormClient' is used by the supervisor nodes and 'StormServer' is used by nimbus. Now start Nimbus and a supervisor node via:
    • bin/storm nimbus
    • bin/storm supervisor
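
The JAAS entries above all share the same shape: an entry name, a login module marked "required", and a list of key=value options ending in a semicolon. Since a missing quote or slash in one of these files is easy to overlook, one option is to generate them. The following Python sketch renders a single entry; the function name `render_jaas_entry` is hypothetical, and the keytab path is the same placeholder used in this post.

```python
KRB5_LOGIN_MODULE = "com.sun.security.auth.module.Krb5LoginModule"


def _format_option(value):
    """Render booleans as true/false and everything else as a quoted string."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return f'"{value}"'


def render_jaas_entry(name, options, module=KRB5_LOGIN_MODULE):
    """Render one JAAS login entry with a single 'required' login module."""
    rendered = " ".join(f"{key}={_format_option(value)}"
                        for key, value in options.items())
    return f"{name} {{\n    {module} required {rendered};\n}};\n"


if __name__ == "__main__":
    # The 'Client' entry used to communicate with Zookeeper.
    print(render_jaas_entry("Client", {
        "refreshKrb5Config": True,
        "useKeyTab": True,
        "keyTab": "/path.to.kerby.project/target/zookeeper_client.keytab",
        "storeKey": True,
        "principal": "zookeeper-client",
    }))
```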
    4) Deploy a Topology

    As we have the Storm cluster up and running, the next task is to deploy a Topology to it. For this we will need to use another Storm distribution, so extract Storm again to another directory. Edit 'conf/storm.yaml' and edit the following properties:
    • For "storm.zookeeper.servers" add "- localhost"
    • nimbus.seeds: ["localhost"]
    • storm.thrift.transport: "org.apache.storm.security.auth.kerberos.KerberosSaslTransportPlugin"
    • java.security.auth.login.config: "/path.to.storm.client/conf/storm.jaas"
    Create a file called 'conf/storm.jaas' with the content:

    StormClient {
                com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useTicketCache=true serviceName="storm";
    };

    Note that we are not using keytabs here, but instead a ticket cache. Now edit 'conf/storm_env.ini' and add:
    • STORM_JAR_JVM_OPTS:-Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf
    Now that we have everything set up, it's time to deploy a topology to our cluster. I have a simple Storm topology, available in github here, that wires a WordSpout and a WordCounterBolt together and that can be used for this. Check this project out from github and build it via "mvn assembly:assembly". We will need a Kerberos ticket stored in our ticket cache to deploy the job:
    • export KRB5_CONFIG=/path.to.kerby.project/target/krb5.conf
    • kinit -k -t /path.to.kerby.project/target/alice.keytab alice
    Finally we can submit our topology:
    • bin/storm jar /path.to.storm.project/target/bigdata-storm-demo-1.0-jar-with-dependencies.jar  org.apache.coheigea.bigdata.storm.StormMain /path.to.storm.project/target/test-classes/words.txt
    If you take a look at the logs in the nimbus distribution you should see that the topology has run correctly, e.g. 'logs/workers-artifacts/mytopology-1-1495813912/6700/worker.log'.

    Categories: Colm O hEigeartaigh

    Configuring Kerberos for Kafka in Talend Open Studio for Big Data

    Tue, 05/23/2017 - 17:23
    A recent blog post showed how to use Talend Open Studio for Big Data to access data stored in HDFS, where HDFS had been configured to authenticate users using Kerberos. In this post we will follow a similar setup, to see how to create a job in Talend Open Studio for Big Data to read data from an Apache Kafka topic using kerberos.

    1) Kafka setup

    Follow a recent tutorial to set up an Apache Kerby based KDC testcase and to configure Apache Kafka to require kerberos for authentication. Create a "test" topic and write some data to it, and verify with the command-line consumer that the data can be read correctly.

    2) Download Talend Open Studio for Big Data and create a job

    Now we will download Talend Open Studio for Big Data (6.4.0 was used for the purposes of this tutorial). Unzip the file when it is downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Click on "Create a new job" called "KafkaKerberosRead". 
    In the search bar under "Palette" on the right hand side enter "kafka" and hit enter. Drag "tKafkaConnection" and "tKafkaInput" to the middle of the screen. Do the same for "tLogRow":
    We now have all the components we need to read data from the Kafka topic. "tKafkaConnection" will be used to configure the connection to Kafka. "tKafkaInput" will be used to read the data from the "test" topic, and finally "tLogRow" will just log the data so that we can be sure that it was read correctly. The next step is to join the components up. Right click on "tKafkaConnection" and select "Trigger/On Subjob Ok" and drag the resulting line to "tKafkaInput". Right click on "tKafkaInput" and select "Row/Main" and drag the resulting line to "tLogRow":

    3) Configure the components

    Now let's configure the individual components. Double click on "tKafkaConnection". If a message appears that informs you that you need to install additional jars, then click on "Install". Select the version of Kafka that corresponds to the version you are using (if it doesn't match then select the most recent version). For the "Zookeeper quorum list" property enter "localhost:2181". For the "broker list" property enter "localhost:9092".

    Now we will configure the kerberos related properties of "tKafkaConnection". Select the "Use kerberos authentication" checkbox and some additional configuration properties will appear. For "JAAS configuration path" you need to enter the path of the "client.jaas" file as described in the tutorial to set up the Kafka test-case. You can leave "Kafka brokers principal name" property as the default value ("kafka"). Finally, select the "Set kerberos configuration path" property and enter the path of the "krb5.conf" file supplied in the target directory of the Apache Kerby test-case.



    Now click on "tKafkaInput". Select the checkbox for "Use an existing connection" + select the "tKafkaConnection" component in the resulting component list. For "topic name" specify "test". The "Consumer group id" can stay as the default "mygroup".

    Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. Send some data via the producer to the "test" topic and you should see the data appear in the Run Window in the Studio.

    Security advisories issued for Apache CXF Fediz

    Mon, 05/22/2017 - 18:23
    Two security advisories were recently issued for Apache CXF Fediz. In addition to fixing these issues, the recent releases of Fediz impose tighter security constraints in some areas by default compared to older releases. In this post I will document the advisories and the other security-related changes in the recent Fediz releases.

    1) Security Advisories

    The first security advisory is CVE-2017-7661: "The Apache CXF Fediz Jetty and Spring plugins are vulnerable to CSRF attacks.". Essentially, both the Jetty 8/9 and Spring Security 2/3 plugins are subject to a CSRF-style vulnerability when the user doesn't complete the authentication process. In addition, the Jetty plugins are vulnerable even if the user does first complete the authentication process, but only the root context is available as part of this attack.

    The second advisory is CVE-2017-7662: "The Apache CXF Fediz OIDC Client Registration Service is vulnerable to CSRF attacks". The OIDC client registration service is a simple web application that allows the creation of clients for OpenId Connect, as well as a number of other administrative tasks. It is vulnerable to CSRF attacks, where a malicious application could take advantage of an existing session to make changes to the OpenId Connect clients that are stored in the IdP.

    2) Fediz IdP security constraints

    This section only concerns the WS-Federation (and SAML-SSO) IdP in Fediz. The WS-Federation RP application sends its address via the 'wreply' parameter to the IdP. For SAML SSO, the address to reply to is taken from the consumer service URL of the SAML SSO Request. Previously, the Apache CXF Fediz IdP contained an optional 'passiveRequestorEndpointConstraint' configuration value in the 'ApplicationEntity', which allows the admin to specify a regular expression constraint on the 'wreply' URL.

    From Fediz 1.4.0, 1.3.2 and 1.2.4, a new configuration option is available in the 'ApplicationEntity' called 'passiveRequestorEndpoint'. If specified, this is directly matched against the 'wreply' parameter. In a change that breaks backwards compatibility, but that is necessary for security reasons, one of 'passiveRequestorEndpointConstraint' or 'passiveRequestorEndpoint' must be specified in the 'ApplicationEntity' configuration. This ensures that the user cannot be redirected to a malicious client. Similarly, new configuration options called 'logoutEndpoint' and 'logoutEndpointConstraint' are available, one of which must be specified. They validate the 'wreply' parameter when redirecting the user after logging out.

    3) Fediz RP security constraints

    This section only concerns the WS-Federation RP plugins available in Fediz. When the user tries to log out of the Fediz RP application, a 'wreply' parameter can be specified to give the address that the Fediz IdP can redirect to after logout is complete. The old functionality was that if 'wreply' was not specified, then the RP plugin instead used the value from the 'logoutRedirectTo' configuration parameter.

    From Fediz 1.4.0, 1.3.2 and 1.2.4, a new configuration option is available called 'logoutRedirectToConstraint'. If a 'wreply' parameter is presented, then it must match the regular expression that is specified for 'logoutRedirectToConstraint', otherwise the 'wreply' value is ignored and it falls back to 'logoutRedirectTo'. 
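The fallback rule described above can be illustrated with a small shell sketch. This is not the Fediz implementation: the two configuration values are hypothetical, the function name is made up for this example, and it assumes that the whole 'wreply' value has to match the expression (grep's extended regular expressions also differ in places from Java's).

```shell
# Hypothetical values standing in for the RP plugin configuration.
logout_redirect_to="https://localhost:8443/app/index.html"
logout_redirect_to_constraint='https://localhost:8443/app/.*'

# Print the address to redirect to after logout for a given 'wreply' value.
resolve_logout_redirect() {
    wreply="$1"
    # grep -x only succeeds if the whole value matches the expression.
    if [ -n "$wreply" ] && printf '%s\n' "$wreply" | grep -Eqx -- "$logout_redirect_to_constraint"; then
        printf '%s\n' "$wreply"
    else
        # 'wreply' is missing or does not match: fall back to 'logoutRedirectTo'.
        printf '%s\n' "$logout_redirect_to"
    fi
}

resolve_logout_redirect "https://localhost:8443/app/bye.html"   # matches, used as-is
resolve_logout_redirect "https://evil.example.com/"             # ignored, falls back
resolve_logout_redirect ""                                      # absent, falls back
```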

    Configuring Kerberos for HDFS in Talend Open Studio for Big Data

    Thu, 05/18/2017 - 16:33
    A recent series of blog posts showed how to install and configure Apache Hadoop as a single node cluster, and how to authenticate users via Kerberos and authorize them via Apache Ranger. Interacting with HDFS via the command line tools as shown in the article is convenient but limited. Talend offers a freely-available product called Talend Open Studio for Big Data which you can use to interact with HDFS instead (and many other components as well). In this article we will show how to access data stored in HDFS that is secured with Kerberos as per the previous tutorials.

    1) HDFS setup

    To begin with please follow the first tutorial to install Hadoop and to store the LICENSE.txt in a '/data' folder. Then follow the fifth tutorial to set up an Apache Kerby based KDC testcase and configure HDFS to authenticate users via Kerberos. To test everything is working correctly on the command line do:
    • export KRB5_CONFIG=/pathtokerby/target/krb5.conf
    • kinit -k -t /pathtokerby/target/alice.keytab alice
    • bin/hadoop fs -cat /data/LICENSE.txt
    2) Download Talend Open Studio for Big Data and create a job

    Now we will download Talend Open Studio for Big Data (6.4.0 was used for the purposes of this tutorial). Unzip the file when it is downloaded and then start the Studio using one of the platform-specific scripts. It will prompt you to download some additional dependencies and to accept the licenses. Click on "Create a new job" called "HDFSKerberosRead". In the search bar under "Palette" on the right hand side enter "tHDFS" and hit enter. Drag "tHDFSConnection" and "tHDFSInput" to the middle of the screen. Do the same for "tLogRow":
    We now have all the components we need to read data from HDFS. "tHDFSConnection" will be used to configure the connection to Hadoop. "tHDFSInput" will be used to read the data from "/data" and finally "tLogRow" will just log the data so that we can be sure that it was read correctly. The next step is to join the components up. Right click on "tHDFSConnection" and select "Trigger/On Subjob Ok" and drag the resulting line to "tHDFSInput". Right click on "tHDFSInput" and select "Row/Main" and drag the resulting line to "tLogRow":
    3) Configure the components

    Now let's configure the individual components. Double click on "tHDFSConnection". For the "version", select the "Hortonworks" Distribution with version HDP V2.5.0 (we are using the original Apache distribution as part of this tutorial, but it suffices to select Hortonworks here). Under "Authentication" tick the checkbox called "Use kerberos authentication". For the Namenode principal specify "hdfs/localhost@hadoop.apache.org". Select the checkbox marked "Use a keytab to authenticate". Select "alice" as the principal and "<path.to.kerby.project>/target/alice.keytab" as the "Keytab":
    Now click on "tHDFSInput". Select the checkbox for "Use an existing connection" + select the "tHDFSConnection" component in the resulting component list. For "File Name" specify the file we want to read: "/data/LICENSE.txt":
    Now click on "Edit schema" and hit the "+" button. This will create a "newColumn" column of type "String". We can leave this as it is, because we are not doing anything with the data other than logging it. Save the job. Now the only thing that remains is to point to the krb5.conf file that is generated by the Kerby project. Click on "Window/Preferences" at the top of the screen. Select "Talend" and "Run/Debug". Add a new JVM argument: "-Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf":

    Now we are ready to run the job. Click on the "Run" tab and then hit the "Run" button. If everything is working correctly, you should see the contents of "/data/LICENSE.txt" displayed in the Run window.

    Securing Apache Kafka with Kerberos

    Mon, 05/15/2017 - 16:45
    Last year, I wrote a series of blog articles on securing Apache Kafka. The articles covered how to secure access to the Apache Kafka broker using TLS client authentication, and how to implement authorization policies using Apache Ranger and Apache Sentry. Recently I wrote another article giving a practical demonstration of how to secure HDFS using Kerberos. In this post I will look at how to secure Apache Kafka using Kerberos, using a test-case based on Apache Kerby. For more information on securing Kafka with kerberos, see the Kafka security documentation.

    1) Set up a KDC using Apache Kerby

    A github project that uses Apache Kerby to start up a KDC is available here:
    • bigdata-kerberos-deployment: This project contains some tests which can be used to test kerberos with various big data deployments, such as Apache Hadoop etc.
    The KDC is a simple junit test that is available here. To run it just comment out the "org.junit.Ignore" annotation on the test method. It uses Apache Kerby to define the following principals:
    • zookeeper/localhost@kafka.apache.org
    • kafka/localhost@kafka.apache.org
    • client@kafka.apache.org
    Keytabs are created in the "target" folder. Kerby is configured to use a random port to launch the KDC each time, and it will create a "krb5.conf" file containing the random port number in the target directory.
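If you want to check which port the KDC was given, one option is to read it back out of the generated file. The sketch below works on a hypothetical sample file, since the exact layout of the krb5.conf written by the Kerby test is an assumption here; it only relies on the standard "kdc = host:port" form of a realm entry. The port number 12345 is made up.

```shell
# Hypothetical sample standing in for target/krb5.conf; the real file is
# generated by the Kerby test and its layout may differ.
KRB5_SAMPLE="$(mktemp)"
cat > "$KRB5_SAMPLE" <<'EOF'
[libdefaults]
    default_realm = kafka.apache.org

[realms]
    kafka.apache.org = {
        kdc = localhost:12345
    }
EOF

# Extract the port from the first "kdc = host:port" line.
kdc_port="$(sed -n 's/^[[:space:]]*kdc[[:space:]]*=[[:space:]]*[^:]*:\([0-9][0-9]*\).*$/\1/p' "$KRB5_SAMPLE" | head -n 1)"
echo "KDC port: $kdc_port"
```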

    2) Configure Apache Zookeeper

    Download Apache Kafka and extract it (0.10.2.1 was used for the purposes of this tutorial). Edit 'config/zookeeper.properties' and add the following properties:
    • authProvider.1=org.apache.zookeeper.server.auth.SASLAuthenticationProvider
    • requireClientAuthScheme=sasl 
    • jaasLoginRenew=3600000
    Now create 'config/zookeeper.jaas' with the following content:

    Server {
            com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/zookeeper.keytab" storeKey=true principal="zookeeper/localhost";
    };

    Before launching Zookeeper, we need to point to the JAAS configuration file above and also to the krb5.conf file generated in the Kerby test-case above. This can be done by setting the "KAFKA_OPTS" environment variable with the JVM arguments:
    • -Djava.security.auth.login.config=/path.to.zookeeper/config/zookeeper.jaas 
    • -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf
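In practice both arguments go into a single KAFKA_OPTS value, set in the terminal that will launch Zookeeper. A minimal sketch, using the same placeholder paths as above:

```shell
# KAFKA_OPTS is picked up by the Kafka launch scripts and passed to the JVM.
# Both paths are the placeholders used in this post; substitute your own.
export KAFKA_OPTS="-Djava.security.auth.login.config=/path.to.zookeeper/config/zookeeper.jaas -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf"

# Show what will be handed to the JVM.
echo "$KAFKA_OPTS"
```

The broker and the clients need a different JAAS file, so it is simplest to use a separate terminal with its own KAFKA_OPTS value for each of them.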
    Now start Zookeeper via:
    • bin/zookeeper-server-start.sh config/zookeeper.properties 
    3) Configure Apache Kafka broker

    Create 'config/kafka.jaas' with the content:

    KafkaServer {
                com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/kafka.keytab" storeKey=true principal="kafka/localhost";
    };

    Client {
            com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/kafka.keytab" storeKey=true principal="kafka/localhost";
    };

    The "Client" section is used to talk to Zookeeper. Now edit  'config/server.properties' and add the following properties:
    • listeners=SASL_PLAINTEXT://localhost:9092
    • security.inter.broker.protocol=SASL_PLAINTEXT 
    • sasl.mechanism.inter.broker.protocol=GSSAPI 
    • sasl.enabled.mechanisms=GSSAPI 
    • sasl.kerberos.service.name=kafka 
    We will just concentrate on using SASL for authentication, and hence we are using "SASL_PLAINTEXT" as the protocol. For "SASL_SSL" please follow the keystore generation as outlined in the following article. Again, we need to set the "KAFKA_OPTS" environment variable with the JVM arguments:
    • -Djava.security.auth.login.config=/path.to.kafka/config/kafka.jaas 
    • -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf
    Now we can start the server and create a topic as follows:
    • bin/kafka-server-start.sh config/server.properties
    • bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
    4) Configure Apache Kafka producers/consumers

    To make the test-case simpler we added a single principal "client" in the KDC for both the producer and consumer. Create a file called "config/client.jaas" with the content:

    KafkaClient {
            com.sun.security.auth.module.Krb5LoginModule required refreshKrb5Config=true useKeyTab=true keyTab="/path.to.kerby.project/target/client.keytab" storeKey=true principal="client";
    };

    Edit *both* 'config/producer.properties' and 'config/consumer.properties' and add:
    • security.protocol=SASL_PLAINTEXT
    • sasl.mechanism=GSSAPI 
    • sasl.kerberos.service.name=kafka
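Since the same three settings go into both files, a short loop avoids editing them by hand. This is only a sketch: KAFKA_HOME is a hypothetical variable introduced here for the Kafka install directory, and it defaults to a scratch directory so the snippet can be tried without touching a real installation.

```shell
# Hypothetical Kafka install directory; defaults to a scratch directory.
KAFKA_HOME="${KAFKA_HOME:-$(mktemp -d)}"
mkdir -p "$KAFKA_HOME/config"

# Append the SASL/Kerberos client settings to both properties files.
for f in producer.properties consumer.properties; do
    cat >> "$KAFKA_HOME/config/$f" <<'EOF'
security.protocol=SASL_PLAINTEXT
sasl.mechanism=GSSAPI
sasl.kerberos.service.name=kafka
EOF
done
```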
    Now set the "KAFKA_OPTS" environment variable with the JVM arguments:
    • -Djava.security.auth.login.config=/path.to.kafka/config/client.jaas 
    • -Djava.security.krb5.conf=/path.to.kerby.project/target/krb5.conf
    We should now be all set. Start the producer and consumer via:
    • bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test --producer.config config/producer.properties
    • bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning --consumer.config config/consumer.properties --new-consumer

    Securing Apache Hadoop Distributed File System (HDFS) - part VI

    Tue, 05/09/2017 - 14:21
    This is the sixth and final article in a series of posts on securing HDFS. In the second and third posts we looked at how to use Apache Ranger to authorize access to data stored in HDFS. In the fifth post, we looked at how to configure HDFS to authenticate users via Kerberos. In this post we will combine both scenarios, that is we will use Apache Ranger to authorize access to HDFS, which is secured using Kerberos.

    1) Authenticating to Apache Ranger

    Follow the fifth tutorial to set up HDFS using Kerberos for authentication. Then follow the second tutorial to install the Apache Ranger HDFS plugin. The Ranger HDFS plugin will not be able to download new policies from Apache Ranger, as we have not configured Ranger to be able to authenticate clients via Kerberos. Edit 'conf/ranger-admin-site.xml' in the Apache Ranger Admin service and edit the following properties:
    • ranger.spnego.kerberos.principal: HTTP/localhost@hadoop.apache.org
    • ranger.spnego.kerberos.keytab: Path to Kerby ranger.keytab
    • hadoop.security.authentication: kerberos
    Now we need to configure Kerberos to use the krb5.conf file generated by Apache Kerby:
    • export JAVA_OPTS="-Djava.security.krb5.conf=<path to Kerby target/krb5.conf>"
    Start the Apache Ranger admin service ('sudo -E ranger-admin start' to pass the JAVA_OPTS variable through) and edit the "cl1_hadoop" service that was created in the second tutorial. Under "Add New Configurations" add the following:
    • policy.download.auth.users: hdfs
    The Ranger HDFS plugin should now be able to download the policies from the Ranger Admin service and apply authorization accordingly.

    2) Authenticating to HDFS

    As we have configured HDFS to require Kerberos, we won't be able to see the HDFS directories in the Ranger Admin service when creating policies any more, without making some changes to enable the Ranger Admin service to authenticate to HDFS. Edit 'conf/ranger-admin-site.xml' in the Apache Ranger Admin service and edit the following properties:
    • ranger.lookup.kerberos.principal: ranger/localhost@hadoop.apache.org
    • ranger.lookup.kerberos.keytab: Path to Kerby ranger.keytab
    Edit the 'cl1_hadoop' service that we created in the second tutorial and click on 'Test Connection'. This should fail as Ranger is not configured to authenticate to HDFS. Add the following properties:
    • Authentication Type: Kerberos
    • dfs.datanode.kerberos.principal: hdfs/localhost
    • dfs.namenode.kerberos.principal: hdfs/localhost
    • dfs.secondary.namenode.kerberos.principal: hdfs/localhost
    Now 'Test Connection' should be successful.

    Using SASL to secure the data transfer protocol in Apache Hadoop

    Fri, 05/05/2017 - 17:57
    The previous blog article showed how to set up a pseudo-distributed Apache Hadoop cluster such that clients are authenticated using Kerberos. The DataNode that we configured authenticates itself by using privileged ports configured in the properties "dfs.datanode.address" and "dfs.datanode.http.address". This requires building and configuring JSVC as well as making sure that we can ssh to localhost without a password as root. An alternative solution (as noted in the article) is to use SASL to secure the data transfer protocol. Here we will briefly show how to do this, building on the configuration given in the previous post.

    1) Configuring Hadoop to use SASL for the data transfer protocol

    Follow section (2) of the previous post to configure Hadoop to authenticate users via Kerberos. We need to make the following changes to 'etc/hadoop/hdfs-site.xml':
    • dfs.datanode.address: Change the port number here to be a non-privileged port.
    • dfs.datanode.http.address: Change the port number here to be a non-privileged port.
    We also need to add the following properties to 'etc/hadoop/hdfs-site.xml':
    • dfs.data.transfer.protection: integrity.
    • dfs.http.policy: HTTPS_ONLY.
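The bullets above give each setting as "name: value", whereas the Hadoop configuration files store each one as a <property> element inside the top-level <configuration> element. As a sketch, a small helper (the function name is made up for this example) can print the two new properties in that form, ready to paste into 'etc/hadoop/hdfs-site.xml':

```shell
# Print one Hadoop <property> element for the given name and value.
hadoop_property() {
    printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
}

# The two properties added in this section.
hadoop_property dfs.data.transfer.protection integrity
hadoop_property dfs.http.policy HTTPS_ONLY
```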
    Edit 'etc/hadoop/hadoop-env.sh' and comment out the values we added for:
    • HADOOP_SECURE_DN_USER
    • JSVC_HOME
    2) Configure SSL keys in ssl-server.xml

    The next step is to configure some SSL keys in 'etc/hadoop/ssl-server.xml'. For the purposes of this demo, we'll use some sample keys that Apache CXF uses to run its systests. Download cxfca.jks and bob.jks into 'etc/hadoop'. Now edit 'etc/hadoop/ssl-server.xml' and define the following properties:
    • ssl.server.truststore.location: etc/hadoop/cxfca.jks
    • ssl.server.truststore.password: password
    • ssl.server.keystore.location: etc/hadoop/bob.jks
    • ssl.server.keystore.password: password
    • ssl.server.keystore.keypassword: password
    3) Launch Kerby and HDFS and test authorization

    Now that we have hopefully configured everything correctly it's time to launch the Kerby based KDC and HDFS. Start Kerby by running the JUnit test as described in the first section of the previous article. Now start HDFS via:
    • sbin/start-dfs.sh
    Note that 'sudo sbin/start-secure-dns.sh' is not required as we are now using SASL for the data transfer protocol. Now we can read the file we added to "/data" in the previous article as "alice":
    • export KRB5_CONFIG=/pathtokerby/target/krb5.conf
    • kinit -k -t /pathtokerby/target/alice.keytab alice
    • bin/hadoop fs -cat /data/LICENSE.txt

    Securing Apache Hadoop Distributed File System (HDFS) - part V

    Thu, 05/04/2017 - 14:54
    This is the fifth in a series of blog posts on securing HDFS. The first post described how to install Apache Hadoop, and how to use POSIX permissions and ACLs to restrict access to data stored in HDFS. The second post looked at how to use Apache Ranger to authorize access to data stored in HDFS. The third post looked at how Apache Ranger can create "tag" based authorization policies for HDFS using Apache Atlas. The fourth post looked at how to implement transparent encryption for HDFS using Apache Ranger. Up to now, we have not shown how to authenticate users, concentrating only on authorizing local access to HDFS. In this post we will show how to configure HDFS to authenticate users via Kerberos.

    1) Set up a KDC using Apache Kerby

    If we are going to configure Apache Hadoop to use Kerberos to authenticate users, then we need a Kerberos Key Distribution Center (KDC). Typically most documentation revolves around installing the MIT Kerberos server, adding principals, and creating keytabs etc. However, in this post we will show a simpler way of getting started by using a pre-configured maven project that uses Apache Kerby. Apache Kerby is a subproject of the Apache Directory project, and is a complete open-source KDC written entirely in Java.

    A github project that uses Apache Kerby to start up a KDC is available here:
    • bigdata-kerberos-deployment: This project contains some tests which can be used to test kerberos with various big data deployments, such as Apache Hadoop etc.
    The KDC is a simple junit test that is available here. To run it just comment out the "org.junit.Ignore" annotation on the test method. It uses Apache Kerby to define the following principals:
    • alice@hadoop.apache.org
    • bob@hadoop.apache.org
    • hdfs/localhost@hadoop.apache.org
    • HTTP/localhost@hadoop.apache.org
    Keytabs are created in the "target" folder for "alice", "bob" and "hdfs" (where the latter has both the hdfs/localhost + HTTP/localhost principals included). Kerby is configured to use a random port to launch the KDC each time, and it will create a "krb5.conf" file containing the random port number in the target directory. So all we need to do is to point Hadoop to the keytabs that were generated and the krb5.conf, and it should be able to communicate correctly with the Kerby-based KDC.

    2) Configure Hadoop to authenticate users via Kerberos

    Download and configure Apache Hadoop as per the first tutorial. For now, we will not enable the Ranger authorization plugin, but rather secure access to the "/data" directory using ACLs, as described in section (3) of the first tutorial, such that "alice" has permission to read the file stored in "/data" but "bob" does not. The next step is to configure Hadoop to authenticate users via Kerberos.

    Edit 'etc/hadoop/core-site.xml' and add the following property name/values:
    • hadoop.security.authentication: kerberos
    • dfs.block.access.token.enable: true 
    Next edit 'etc/hadoop/hdfs-site.xml' and add the following property name/values to configure Kerberos for the namenode:
    • dfs.namenode.keytab.file: Path to Kerby hdfs.keytab (see above).
    • dfs.namenode.kerberos.principal: hdfs/localhost@hadoop.apache.org
    • dfs.namenode.kerberos.internal.spnego.principal: HTTP/localhost@hadoop.apache.org
    Add the exact same property name/values for the secondary namenode, except using the property name "secondary.namenode" instead of "namenode". We also need to configure Kerberos for the datanode:
    • dfs.datanode.data.dir.perm: 700
    • dfs.datanode.address: 0.0.0.0:1004
    • dfs.datanode.http.address: 0.0.0.0:1006
    • dfs.web.authentication.kerberos.principal: HTTP/localhost@hadoop.apache.org
    • dfs.datanode.keytab.file: Path to Kerby hdfs.keytab (see above).
    • dfs.datanode.kerberos.principal: hdfs/localhost@hadoop.apache.org
    As we are not using SASL to secure the data transfer protocol (see here), we need to download and configure JSVC into JSVC_HOME. Then edit 'etc/hadoop/hadoop-env.sh' and add the following properties:
    • export HADOOP_SECURE_DN_USER=(the user you are running HDFS as)
    • export JSVC_HOME=(path to JSVC as above)
    • export HADOOP_OPTS="-Djava.security.krb5.conf=<path to Kerby target/krb5.conf>"
    You also need to make sure that you can ssh to localhost as "root" without specifying a password.

    3) Launch Kerby and HDFS and test authorization

    Now that we have hopefully configured everything correctly it's time to launch the Kerby based KDC and HDFS. Start Kerby by running the JUnit test as described in the first section. Now start HDFS via:
    • sbin/start-dfs.sh
    • sudo sbin/start-secure-dns.sh
    Now let's try to read the file in "/data" using "bin/hadoop fs -cat /data/LICENSE.txt". You should see an exception as we have no credentials. Let's try to read as "alice" now:
    • export KRB5_CONFIG=/pathtokerby/target/krb5.conf
    • kinit -k -t /pathtokerby/target/alice.keytab alice
    • bin/hadoop fs -cat /data/LICENSE.txt
    This should be successful. However the following should result in a "Permission denied" message:
    • kdestroy
    • kinit -k -t /pathtokerby/target/bob.keytab bob
    • bin/hadoop fs -cat /data/LICENSE.txt

    Securing Apache Hadoop Distributed File System (HDFS) - part IV

    Wed, 04/26/2017 - 18:39
    This is the fourth in a series of blog posts on securing HDFS. The first post described how to install Apache Hadoop, and how to use POSIX permissions and ACLs to restrict access to data stored in HDFS. The second post looked at how to use Apache Ranger to authorize access to data stored in HDFS. The third post looked at how Apache Ranger can create "tag" based authorization policies for HDFS using Apache Atlas. In this post I will look at how you can implement transparent encryption in HDFS using the Apache Ranger Key Management Service (KMS).

    1) Install and Configure the Apache Ranger KMS

    If you have not done so already, then follow the instructions in this tutorial to install the Apache Ranger admin service, and then start it via "sudo ranger-admin start". Open a browser and go to "http://localhost:6080/". Log on with "admin/admin" and click on "Settings". Create a new user corresponding to the name of the user which starts HDFS.

    The next step is to install the Apache Ranger KMS. Please follow step (2) in a blog post I wrote last year about this. When installation is complete, then start the KMS service with "sudo ranger-kms start". Log out of the Admin UI and then log back in again with the credentials "keyadmin/keyadmin". Click on the "+" button on the "KMS" tab to create a new KMS Service. Specify the following values:
    • Service Name: kmsdev
    • KMS URL: kms://http@localhost:9292/kms
    • Username: keyadmin
    • Password: keyadmin
    When the "kmsdev" service has been created then click on it and edit the default policy that has been created. Edit the existing "allow condition" for "hdfs" adding in the user that will be starting HDFS (if not the "hdfs" user itself). Also grant the "CREATE" permission to that user so that we can create keys from the command line, and the "DECRYPT EEK" permission, so that the user can decrypt the data encryption key:


    2) Create an encryption zone in HDFS

    In your Hadoop distribution (after first following the steps in the first post), edit 'etc/hadoop/core-site.xml' and add the following property:
    • hadoop.security.key.provider.path - kms://http@localhost:9292/kms
    Similarly, edit 'etc/hadoop/hdfs-site.xml' and add the following property:
    • dfs.encryption.key.provider.uri - kms://http@localhost:9292/kms
    Start HDFS via 'sbin/start-dfs.sh'. Let's create a new encryption key called "enckey" as follows:
    • bin/hadoop key create enckey
    If you go back to the Ranger Admin UI and click on "Encryption / Key Manager" and select the "kmsdev" service, you should be able to see the new key that was created. Now let's create a new encryption zone in HDFS as follows:
    • bin/hadoop fs -mkdir /zone
    • bin/hdfs crypto -createZone -keyName enckey -path /zone
    • bin/hdfs crypto -listZones
    That's it! We can put data into the '/zone' directory and it will be encrypted by a key which in turn is encrypted by the key we have created and stored in the Ranger KMS.

    Securing Apache Hadoop Distributed File System (HDFS) - part III

    Fri, 04/21/2017 - 11:54
    This is the third in a series of posts on securing HDFS. The first post described how to install Apache Hadoop, and how to use POSIX permissions and ACLs to restrict access to data stored in HDFS. The second post looked at how to use Apache Ranger to authorize access to data stored in HDFS. In this post we will look at how Apache Ranger can create "tag" based authorization policies for HDFS using Apache Atlas. For information on how to create tag-based authorization policies for Apache Kafka, see a post I wrote earlier this year.

    The Apache Ranger admin console allows you to create security policies for HDFS by associating a user/group with some permissions (read/write/execute) and a resource, such as a directory or file. This is called a "Resource based policy" in Apache Ranger. An alternative is to use a "Tag based policy", which instead associates the user/group + permissions with a "tag". You can create and manage tags in Apache Atlas, and Apache Ranger can import tags from Apache Atlas via a tagsync service, something we will cover in this post.

    1) Start Apache Atlas and create entities/tags for HDFS

    First let's look at setting up Apache Atlas. Download the latest released version (0.8-incubating) and extract it. Build the distribution that contains an embedded HBase and Solr instance via:
    • mvn clean package -Pdist,embedded-hbase-solr -DskipTests
    The distribution will then be available in 'distro/target/apache-atlas-0.8-incubating-bin'. To launch Atlas, we need to set some variables to tell it to use the local HBase and Solr instances:
    • export MANAGE_LOCAL_HBASE=true
    • export MANAGE_LOCAL_SOLR=true
    Now let's start Apache Atlas with 'bin/atlas_start.py'. Open a browser and go to 'http://localhost:21000/', logging on with credentials 'admin/admin'. Click on "TAGS" and create a new tag called "Data". Click on "Search" and then on the "Create new entity" link. Select an entity type of "hdfs_path" with the following values:
    • QualifiedName: data@cl1
    • Name: Data
    • Path: /data
    Once the new entity has been created, click on "+" beside "Tags" and associate the new entity with the "Data" tag.

    2) Use the Apache Ranger TagSync service to import tags from Atlas into Ranger

    To create tag based policies in Apache Ranger, we have to import the entity + tag we have created in Apache Atlas into Ranger via the Ranger TagSync service. First, start the Apache Ranger admin service and rename the HDFS service we created in the previous tutorial from "HDFSTest" to "cl1_hadoop". This is because the TagSync service syncs tags into the Ranger service whose name is the cluster suffix of the entity's qualified name ("cl1" in "data@cl1") followed by "_hadoop". Also edit 'etc/hadoop/ranger-hdfs-security.xml' in your Hadoop distribution and change "ranger.plugin.hdfs.service.name" to "cl1_hadoop", and change "ranger.plugin.hdfs.policy.cache.dir" along the same lines. Finally, make sure the directory '/etc/ranger/cl1_hadoop/policycache' exists and that the user you are running Hadoop as can read from and write to it.
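
    To make the naming rule concrete, here is a small sketch that derives the Ranger service name from an Atlas qualified name. The function name is hypothetical, and the rule it encodes (text after the last "@" plus "_hadoop") is just the default behaviour described above.

```shell
#!/usr/bin/env bash
# Derive the Ranger HDFS service name that TagSync targets for an Atlas entity.
# Assumption: the name is "<cluster>_hadoop", where <cluster> is the text after
# the last "@" in the entity's qualified name.
ranger_hdfs_service_for() {
  local qualified_name="$1"
  local cluster="${qualified_name##*@}"
  printf '%s_hadoop\n' "$cluster"
}

ranger_hdfs_service_for "data@cl1"   # prints: cl1_hadoop
```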

    After building Apache Ranger, extract the file called "target/ranger-<version>-tagsync.tar.gz". Edit 'install.properties' as follows:
    • Set TAG_SOURCE_ATLAS_ENABLED to "false"
    • Set TAG_SOURCE_ATLASREST_ENABLED to "true"
    • Set TAG_SOURCE_ATLASREST_DOWNLOAD_INTERVAL_IN_MILLIS to "60000" (just for testing purposes)
    • Specify "admin" for both TAG_SOURCE_ATLASREST_USERNAME and TAG_SOURCE_ATLASREST_PASSWORD
    Save 'install.properties' and install the tagsync service via "sudo ./setup.sh". It can now be started via "sudo ranger-tagsync-services.sh start".
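
    If you prefer to script the 'install.properties' changes rather than edit the file by hand, something like the following could work. It is only a sketch: "set_prop" and "configure_tagsync" are helper names I made up, and it assumes the file consists of simple "KEY = VALUE" lines and that the values contain no characters that are special to sed.

```shell
#!/usr/bin/env bash
# set_prop FILE KEY VALUE: rewrite an existing "KEY = ..." line, or append one.
# Hypothetical helper; assumes plain KEY = VALUE lines and sed-safe values.
set_prop() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  if grep -q "^[[:space:]]*${key}[[:space:]]*=" "$file"; then
    sed "s|^[[:space:]]*${key}[[:space:]]*=.*|${key} = ${value}|" "$file" > "$tmp"
  else
    cat "$file" > "$tmp"
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
  fi
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Apply the four settings listed above to the given install.properties.
configure_tagsync() {
  local props="$1"
  set_prop "$props" TAG_SOURCE_ATLAS_ENABLED false
  set_prop "$props" TAG_SOURCE_ATLASREST_ENABLED true
  set_prop "$props" TAG_SOURCE_ATLASREST_DOWNLOAD_INTERVAL_IN_MILLIS 60000
  set_prop "$props" TAG_SOURCE_ATLASREST_USERNAME admin
  set_prop "$props" TAG_SOURCE_ATLASREST_PASSWORD admin
}
```

    Run "configure_tagsync install.properties" from the extracted tagsync directory before calling "sudo ./setup.sh".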

    3) Create Tag-based authorization policies in Apache Ranger

    Now let's create a tag-based authorization policy in the Apache Ranger admin UI. Click on "Access Manager" and then "Tag based policies". Create a new Tag service called "HDFSTagService". Create a new policy for this service called "DataPolicy". In the "TAG" field enter a capital "D" and the "Data" tag should pop up, meaning that it was successfully synced in from Apache Atlas. Create an "Allow" condition for the user "bob" with component permission of "HDFS" and "read" and "execute".

    The last thing we need to do is to go back to the Resource based policies and edit "cl1_hadoop" and select the tag service we have created above.

    4) Testing authorization in HDFS using our tag based policy

    Wait until the Ranger authorization plugin syncs the new authorization policies from the Ranger Admin service and then we can test authorization. In the previous tutorial we showed that the file owner and user "alice" could read the data stored in '/data', but "bob" could not. Now we should be able to successfully read the data as "bob" due to the tag based authorization policy we have created:
    • sudo -u bob bin/hadoop fs -cat /data/LICENSE.txt

    Securing Apache Hadoop Distributed File System (HDFS) - part II

    Thu, 04/20/2017 - 16:23
    This is the second in a series of posts on securing HDFS. The first post described how to install Apache Hadoop, and how to use POSIX permissions and ACLs to restrict access to data stored in HDFS. In this post we will look at how to use Apache Ranger to authorize access to data stored in HDFS. The Apache Ranger Admin console allows you to create policies which are retrieved and enforced by an HDFS authorization plugin. Apache Ranger allows us to create centralized authorization policies for HDFS, as well as an authorization audit trail stored in Solr or HDFS.

    1) Install the Apache Ranger HDFS plugin

    First we will install the Apache Ranger HDFS plugin. Follow the steps in the previous tutorial to set up Apache Hadoop, if you have not done this already. Then download Apache Ranger and verify that the signature is valid and that the message digests match. Because some bugs in the installation process have since been fixed, I am using version 1.0.0-SNAPSHOT in this post. Now extract and build the source, and copy the resulting plugin to a location where you will configure and install it:
    • mvn clean package assembly:assembly -DskipTests
    • tar zxvf target/ranger-1.0.0-SNAPSHOT-hdfs-plugin.tar.gz
    • mv ranger-1.0.0-SNAPSHOT-hdfs-plugin ${ranger.hdfs.home}
    Now go to ${ranger.hdfs.home} and edit "install.properties". You need to specify the following properties:
    • POLICY_MGR_URL: Set this to "http://localhost:6080"
    • REPOSITORY_NAME: Set this to "HDFSTest".
    • COMPONENT_INSTALL_DIR_NAME: The location of your Apache Hadoop installation
    Save "install.properties" and install the plugin as root via "sudo ./enable-hdfs-plugin.sh". The Apache Ranger HDFS plugin should now be successfully installed. Start HDFS with:
    • sbin/start-dfs.sh
    2) Create authorization policies in the Apache Ranger Admin console

    Next we will use the Apache Ranger admin console to create authorization policies for our data in HDFS. Follow the steps in this tutorial to install the Apache Ranger admin service. Start the Apache Ranger admin service with "sudo ranger-admin start" and open a browser and go to "http://localhost:6080/" and log on with "admin/admin". Add a new HDFS service with the following configuration values:
    • Service Name: HDFSTest
    • Username: admin
    • Password: admin
    • Namenode URL: hdfs://localhost:9000
    Click on "Test Connection" to verify that we can connect successfully to HDFS, and then save the new service. Now click on the "HDFSTest" service that we have created. Add a new policy for the "/data" resource path for the user "alice" (create this user if you have not done so already under "Settings, Users/Groups"), with permissions of "read" and "execute".

    3) Testing authorization in HDFS

    Now let's see the Ranger authorization policy we created above in action. Note that by default the HDFS authorization plugin checks for a Ranger authorization policy that grants access first, and if this fails it falls back to the default POSIX permissions. The Ranger authorization plugin will pull policies from the Admin service every 30 seconds by default. For the "HDFSTest" example above, they are stored in "/etc/ranger/HDFSTest/policycache/" by default. Make sure that the user you are running Hadoop as can access this directory.
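
    A quick way to check the policy cache directory is sketched below. Both function names are hypothetical, and the path layout is simply the default mentioned above. Run the check as the user that starts HDFS, since the test is made against the current user's permissions.

```shell
#!/usr/bin/env bash
# Default policy cache location for a given Ranger service name
# (layout taken from the default path mentioned in this post).
policy_cache_dir() {
  printf '/etc/ranger/%s/policycache\n' "$1"
}

# Report whether the current user can read, write and enter the directory.
check_policy_cache() {
  local dir="$1"
  if [ ! -d "$dir" ]; then
    echo "missing: $dir"
    return 1
  fi
  if [ -r "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]; then
    echo "ok: $dir"
  else
    echo "no access: $dir"
    return 1
  fi
}
```

    For example: check_policy_cache "$(policy_cache_dir HDFSTest)".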

    Now let's check who can read the data file:
    • bin/hadoop fs -cat /data/LICENSE* (this should work via the underlying POSIX permissions)
    • sudo -u alice bin/hadoop fs -cat /data/LICENSE* (this should work via the Ranger authorization policy)
    • sudo -u bob bin/hadoop fs -cat /data/LICENSE* (this should fail as we don't have an authorization policy for "bob")


    Securing Apache Hadoop Distributed File System (HDFS) - part I

    Wed, 04/19/2017 - 17:49
    Last year, I wrote a series of articles on securing Apache Kafka using Apache Ranger and Apache Sentry. In this series of posts I will look at how to secure the Apache Hadoop Distributed File System (HDFS) using Ranger and Sentry, such that only authorized users can access data stored in it. In this post we will look at a very basic way of installing Apache Hadoop and accessing some data stored in HDFS. Then we will look at how to authorize access to the data stored in HDFS using POSIX permissions and ACLs.

    1) Installing Apache Hadoop

    The first step is to download and extract Apache Hadoop. This tutorial uses version 2.7.3. The next step is to configure Apache Hadoop as a single node cluster so that we can easily get it up and running on a local machine. You will need to follow the steps outlined in the single node cluster guide to install ssh and pdsh. If you can't log in to localhost without a password ("ssh localhost") then you need to follow the instructions given in that guide about setting up passphraseless ssh.

    In addition, we want to run Apache Hadoop in pseudo-distributed mode, where each Hadoop daemon runs as a separate Java process. Edit 'etc/hadoop/core-site.xml' and add:
    Next edit 'etc/hadoop/hdfs-site.xml' and add:
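
    A minimal sketch of both files follows, assuming the standard pseudo-distributed settings from the Hadoop single node setup guide (the exact values used for this post may differ). The "hdfs://localhost:9000" address matches the Namenode URL used later in this series. The function name and the idea of writing into a directory of your choice are my own.

```shell
#!/usr/bin/env bash
# Write a minimal pseudo-distributed configuration into CONF_DIR.
# Assumption: these are the stock values from the Hadoop single node guide,
# not necessarily the exact snippets used in this post.
write_pseudo_distributed_conf() {
  local conf_dir="$1"
  mkdir -p "$conf_dir"
  # core-site.xml: where clients find the NameNode.
  cat > "$conf_dir/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
  # hdfs-site.xml: a single DataNode can only hold one replica of each block.
  cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}
```

    Calling "write_pseudo_distributed_conf etc/hadoop" from the Hadoop directory overwrites the two files there, so try it on a scratch directory first.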

    Make sure that the JAVA_HOME variable in 'etc/hadoop/hadoop-env.sh' is correct, and then format the filesystem and start Hadoop via:
    • bin/hdfs namenode -format
    • sbin/start-dfs.sh
    To confirm that everything is working correctly, you can open the NameNode web UI at "http://localhost:50070" and check on the status of the cluster there. Once Hadoop has started, upload some data to HDFS and then access it:
    • bin/hadoop fs -mkdir /data
    • bin/hadoop fs -put LICENSE.txt /data
    • bin/hadoop fs -ls /data
    • bin/hadoop fs -cat /data/*
    2) Securing HDFS using POSIX Permissions

    We've seen how to access some data stored in HDFS via the command line. Now how can we create some authorization policies to restrict access to this data? The simplest way is to use the standard POSIX permissions. If we list the /data directory we see that the LICENSE file stored there has the permissions "-rw-r--r--", which means other users can read it. Remove read access to the directory for users other than the owner via:
    • bin/hadoop fs -chmod og-r /data
    Now create a test user called "alice" on your system and try to access the LICENSE we uploaded above via:
    • sudo -u alice bin/hadoop fs -cat /data/*
    You will see an error that says "cat: Permission denied: user=alice, access=READ_EXECUTE".
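
    "hadoop fs -chmod" accepts the same symbolic modes as the ordinary shell "chmod", so we can see what "og-r" does by trying it on a local scratch directory. This is only an illustration on the local filesystem, assuming the directory starts out with the usual 755 permissions; it does not touch HDFS.

```shell
#!/usr/bin/env bash
# Local-filesystem illustration of the symbolic mode "og-r".
demo_dir="$(mktemp -d)"
chmod 755 "$demo_dir"                          # typical default for a new directory
before="$(ls -ld "$demo_dir" | cut -c1-10)"
chmod og-r "$demo_dir"                         # drop read for group and others
after="$(ls -ld "$demo_dir" | cut -c1-10)"
echo "before: $before"   # before: drwxr-xr-x
echo "after:  $after"    # after:  drwx--x--x
rmdir "$demo_dir"
```

    Without the read bit on the directory, other users can no longer list it, which is what expanding "/data/*" requires, hence the READ_EXECUTE in the error above.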

    3) Securing HDFS using ACLs

    Securing access to data stored in HDFS via POSIX permissions works fine, but it does not allow you, for example, to specify fine-grained permissions for users other than the file owner. What if we want to allow "alice" from the previous section to read the file but not "bob"? We can achieve this via Hadoop ACLs. To enable ACLs, we will need to add a property called "dfs.namenode.acls.enabled" with value "true" to 'etc/hadoop/hdfs-site.xml' and restart HDFS.
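
    For reference, the property takes the usual Hadoop XML form. The small function below (a name I made up) just prints the element so it can be pasted inside the <configuration> element of 'etc/hadoop/hdfs-site.xml'.

```shell
#!/usr/bin/env bash
# Print the <property> element that switches on ACL support in the NameNode.
# Paste the output inside <configuration> in etc/hadoop/hdfs-site.xml.
acl_property_xml() {
  cat <<'EOF'
<property>
  <name>dfs.namenode.acls.enabled</name>
  <value>true</value>
</property>
EOF
}

acl_property_xml
```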

    We can grant read access to 'alice' via:
    • bin/hadoop fs -setfacl -m user:alice:r-- /data/*
    • bin/hadoop fs -setfacl -m user:alice:r-x /data
    To see the new ACLs associated with LICENSE.txt, run:
    • bin/hadoop fs -getfacl /data/LICENSE.txt
    In addition to the owner, we now have the ACL "user:alice:r--". Now we can read the data as "alice". However, another user "bob" cannot read the data. To avoid confusion with future blog posts on securing HDFS, we will now remove the ACLs we added via:
    • bin/hadoop fs -setfacl -b /data
    • bin/hadoop fs -setfacl -b /data/LICENSE.txt
