Toan Hoang

TAB103 / Data Sources and Connectivity

An understanding of Data Sources is key to maximising your Tableau experience; in this part of our 14-day introduction series, we will cover the what Data Sources you can connect to, the elements of the Edit Data Source window and finally how we can customise and optimise our Data Sources.

Data Sources

Tableau provides connectivity to flat files, databases, and additional connectivity via Java Database Connectivity (JDBC), Open Database Connectivity (ODBC) and Web Data Connectors (WDC). Feel free skip this part, but we will go through what is available, and hey, you might hear of a new technology, or find something that fit your fancy.

Data SourceDescription
Actian MatrixParAccel, Inc. was a California-based software company. It provided a database management system designed for advanced analytics for business intelligence. ParAccel was acquired by Actian in April 2013.
Actian VectorwiseActian Vector (formerly known as VectorWise) is an SQL relational database management system designed for high performance in analytical database applications. It published record-breaking results on the Transaction Processing Performance Council’s TPC-H benchmark for database sizes of 100 GB, 300 GB, 1 TB and 3 TB on non-clustered hardware
Amazon AthenaAmazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL. With a few actions in the AWS Management Console, you can point Athena at your data stored in Amazon S3 and begin using standard SQL to run ad-hoc queries and get results in seconds.
Amazon AuroraAmazon Aurora is a hosted relational database service developed and offered by Amazon since October 2014. Aurora is available as part of the Amazon Relational Database Service (RDS). Although it is a proprietary technology, it offers MySQL compatible service since its release and PostgreSQL compatible since October 2017 and it is also possible to stop and start Aurora Clusters since September 2018. Since August 2018 Amazon also offers a serverless version of AWS Aurora.
Amazon EMR Hadoop Hive An Amazon EMR release is a set of open-source applications from the big-data ecosystem. Hive is an open-source, data warehouse, and analytics package that runs on top of a Hadoop cluster. Hive scripts use a SQL-like language called Hive QL (query language) that abstracts programming models and supports typical data warehouse interactions. 

 

Note: Starting in version 2018.2, Tableau supports Amazon EMR Hadoop Hive only, not Impala. Amazon no longer provides Impala drivers.

Amazon RedshiftAmazon Redshift is an Internet hosting service and data warehouse product which forms part of the larger cloud-computing platform Amazon Web Services.
AnaplanAnaplan is a platform, powered by proprietary Hyperblock technology, purpose-built for Connected Planning, which enables dynamic, collaborative, and intelligent planning. Large global enterprises use the solution to connect people, data and plans to enable real-time planning and decision-making in rapidly changing business environments to give customers a competitive advantage.
Apache DrillApache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Drill is the open source version of Google’s Dremel system which is available as an infrastructure service called Google BigQuery. One explicitly stated design goal is that Drill is able to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds. Drill is an Apache top-level project.
Aster DatabaseAster Data Systems was a data management and analysis software company headquartered in San Carlos, California. It was founded in 2005 and acquired by Teradata in 2011.
Box.comBox, Inc. (formerly Box.net), based in Redwood City, California, is a cloud content management and file sharing service for businesses. Official clients and apps are available for Windows, macOS, and several mobile platforms. Box was founded in 2005.
Cloudera HadoopCloudera, Inc. is a US-based software company that provides a software platform for data engineering, data warehousing, machine learning and analytics that runs in the cloud or on premises. Cloudera started as a hybrid open-source Apache Hadoop distribution, CDH (Cloudera Distribution Including Apache Hadoop), that targeted enterprise-class deployments of that technology.
DenodoWhat started to evolve as the most agile and real-time enterprise data fabric, data virtualization is proving to go beyond its initial promise and is becoming one of the most important enterprise big data fabrics.
DropboxDropbox is a file hosting service operated by American company Dropbox, Inc., headquartered in San Francisco, California, that offers cloud storage, file synchronization, personal cloud, and client software. Dropbox was founded in 2007 by MIT students Drew Houston and Arash Ferdowsi as a startup company, with initial funding from seed accelerator Y Combinator.
ExasolExasol is an analytic database management software company. Its product is called Exasol, an in-memory, column-oriented, relational database management system. Since 2008 Exasol led the Transaction Processing Performance Council’s TPC-H benchmark for analytical scenarios, in all data volume-based categories 100 GB, 300 GB, 1 TB, 3 TB, 10 TB, 30 TB and 100 TB.
FirebirdFirebird is an open source SQL relational database management system that runs on Linux, Microsoft Windows, MacOSX and several Unix platforms. The database forked from Borland’s open source edition of InterBase in 2000, but since Firebird 1.5 the code has been largely rewritten.
Google AnalyticsGoogle Analytics is a freemium web analytics service offered by Google that tracks and reports website traffic, currently as a platform inside the Google Marketing Platform brand. Google launched the service in November 2005 after acquiring Urchin. Google Analytics is now the most widely used web analytics service on the web. Google Analytics also provides an SDK that allows gathering usage data from iOS and Android Apps, known as Google Analytics for Mobile Apps.
Google BigQueryBigQuery is a RESTful web service that enables interactive analysis of massively large datasets working in conjunction with Google Storage. It is an Infrastructure as a Service (IaaS) that may be used complementarily with MapReduce.
Google Cloud SQLGoogle Cloud SQL is a Database as a Service (DBaaS) based on MySQL and PostgreSQL.
Google SheetsGoogle Sheets is a spreadsheet program included as part of a free, web-based software office suite offered by Google within its Google Drive service.
Hortonworks Hadoop HiveHortonworks is a data software company based in Santa Clara, California that develops, supports, and provides expertise on a set of open-source software designed to manage data and processing for things such as IOT (connected cars, for example), single view of X (such as customer, risk, patient), and advanced analytics and machine learning (such as next best action and realtime cybersecurity).
IBM BigInsightsBigInsights® is a software platform for discovering, analyzing, and visualizing data from disparate sources. You use this software to help process and analyze the volume, variety, and velocity of data that continually enters your organization every day. BigInsights helps your organization to understand and analyze massive volumes of unstructured information as easily as smaller volumes of information. The flexible platform is built on an Apache Hadoop open source framework that runs in parallel on commonly available, low-cost hardware.
IBM DB2The name DB2 refer to database-server products developed by IBM. These products all support the relational model, but in recent years, some products have been extended to support object-relational features and non-relational structures like JSON and XML.
IBM PDA (Netezza)IBM Netezza (pronounced ne-teez-a) is a subsidiary of American technology company IBM that designs and markets high-performance data warehouse appliances and advanced analytics applications for uses including enterprise data warehousing, business intelligence, predictive analytics and business continuity planning. Netezza was acquired by IBM on September 20, 2010.
Intuit QuickBooks OnlineQuickBooks is an accounting software package developed and marketed by Intuit. QuickBooks products are geared mainly toward small and medium-sized businesses and offer on-premises accounting applications as well as cloud-based versions that accept business payments, manage and pay bills, and payroll functions.
KognitioKognitio is an in-memory analytical software platform that supports BI, OLAP and analytical applications on large and complex data.
MapR Hadoop HiveMapR is a business software company headquartered in Santa Clara, California. MapR provides access to a variety of data sources from a single computer cluster, including big data workloads such as Apache Hadoop and Apache Spark, a distributed file system, a multi-model database management system, and event stream processing, combining analytics in real-time with operational applications. Its technology runs on both commodity hardware and public cloud computing services.
MarketoMarketo, Inc. is a software company focused on account-based marketing, including email, mobile, social, digital ads, web management, and analytics. It sells products for industries including healthcare, technology, financial services, manufacturing, media, and higher education.
MarkLogicMarkLogic Corporation is an American software business that develops and provides an enterprise NoSQL database, also named MarkLogic. The company was founded in 2001 and is based in San Carlos, California.
MemSQLMemSQL is a distributed, in-memory, SQL database management system. It is a relational database management system (RDBMS). It compiles Structured Query Language (SQL) into machine code, via termed code generation. On April 23, 2013, MemSQL launched its first generally available version of the database to the public.
Microsoft Analysis Services (SSAS)Microsoft SQL Server Analysis Services, SSAS, is an online analytical processing (OLAP) and data mining tool in Microsoft SQL Server. SSAS is used as a tool by organizations to analyze and make sense of information possibly spread out across multiple databases, or in disparate tables or files.
Microsoft Power PivotPower Pivot is a feature of Microsoft Excel. It is available as an add-in in Excel 2010, 2013 in separate downloads, and as an add-in included with the Excel 2016 program. Power Pivot extends a local instance of Microsoft Analysis Services Tabular that is embedded directly into an Excel Workbook.
Microsoft SQL ServerMicrosoft SQL Server is a relational database management system developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications—which may run either on the same computer or on another computer across a network (including the Internet).
MonetDBMonetDB is an open source column-oriented database management system developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It was designed to provide high performance on complex queries against large databases, such as combining tables with hundreds of columns and millions of rows.
MongoDBMongoDB is a free and open-source cross-platform document-oriented database program. Classified as a NoSQL database program, MongoDB uses JSON-like documents with schemata. MongoDB is developed by MongoDB Inc., and is published under a combination of the Server Side Public License and the Apache License.
MySQLMySQL is an open source relational database management system (RDBMS). The MySQL development project has made its source code available under the terms of the GNU General Public License, as well as under a variety of proprietary agreements. MySQL was owned and sponsored by a single for-profit firm, the Swedish company MySQL AB, now owned by Oracle Corporation.
ODataIn computing, Open Data Protocol (OData) is an open protocol which allows the creation and consumption of queryable and interoperable RESTful APIs in a simple and standard way. Microsoft initiated OData in 2007. Versions 1.0, 2.0, and 3.0 are released under the Microsoft Open Specification Promise. Version 4.0 was standardized at OASIS, with a release in March 2014.
OneDriveOneDrive (formerly known as SkyDrive, Windows Live SkyDrive, and Windows Live Folders) is a file hosting service and synchronization service operated by Microsoft as part of its suite of Office Online services.
OracleOracle Database (commonly referred to as Oracle RDBMS or simply as Oracle) is a multi-model database management system produced and marketed by Oracle Corporation. It is a database commonly used for running online transaction processing (OLTP), data warehousing (DW) and mixed (OLTP & DW) database workloads. The latest generation, Oracle Database 18c, is available on-prem, on-Cloud, or in a hybrid-Cloud environment. 18c may also be deployed on Oracle Engineered Systems (e.g. Exadata) on-prem, on Oracle (public) Cloud or (private) Cloud at Customer. At Openworld 2017 in San Francisco, Executive Chairman of the Board and CTO, Larry Ellison announced the next database generation, Oracle Autonomous Database.
Oracle EloquaEloqua is a software as a service (SaaS) platform for marketing automation offered by Oracle that aims to help B2B marketers and organizations manage marketing campaigns and sales lead generation.
Oracle EssbaseEssbase is a multidimensional database management system (MDBMS) that provides a multidimensional database platform upon which to build analytic applications. Essbase, whose name derives from “extended spreadsheet database”, began as a product of Arbor Software, which merged with Hyperion Software in 1998. Oracle Corporation acquired Hyperion Solutions Corporation in 2007.
Pivotal Greenplum DatabaseGreenplum was a big data analytics company headquartered in San Mateo, California. Greenplum was acquired by EMC Corporation in July 2010. Starting in 2012 its database management system software became known as the Pivotal Greenplum Database sold through Pivotal Software and is currently actively developed by the Greenplum Database open source community and Pivotal.
PostgreSQL DatabasePostgreSQL, often simply Postgres, is an object-relational database management system (ORDBMS) with an emphasis on extensibility and standards compliance. It can handle workloads ranging from small single-machine applications to large Internet-facing applications (or for data warehousing) with many concurrent users; on macOS Server, PostgreSQL is the default database; and it is also available for Microsoft Windows and Linux (supplied in most distributions). 

 

Note: Tableau uses a PostgreSQL Database within Tableau Server.

Presto DatabaseFacebook commenced development efforts on Presto in 2012, and announced its release as open source for Apache Hadoop in 2013. In 2014, Netflix disclosed they used Presto on 10 petabytes of data stored in the Amazon Simple Storage Service (S3). Airbnb released the source to web interface software called Airpal for Presto in March, 2015.
Progress OpenEdgeOpenEdge Advanced Business Language, or OpenEdge ABL for short, is a business application development language created and maintained by Progress Software Corporation (PSC). The language, typically classified as a fourth-generation programming language, uses an English-like syntax to simplify software developmen
SalesforceSalesforce.com, Inc. (styled in its logo as salesƒorce; abbreviated usually as SF or SFDC) is an American cloud-based software company headquartered in San Francisco, California. Though its revenue comes from a customer relationship management (CRM) product, Salesforce also sells commercial applications of social networking through acquisition and internal development. Salesforce was ranked first in Fortune’s 100 Best Companies to Work For in 2018.
SplunkSplunk Inc. is an American multinational corporation based in San Francisco, California, that produces software for searching, monitoring, and analyzing machine-generated big data, via a Web-style interface. Splunk (the product) captures, indexes, and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards, and visualizations.
SAP HANASAP HANA is an in-memory, column-oriented, relational database management system developed and marketed by SAP SE. Its primary function as a database server is to store and retrieve data as requested by the applications. In addition, it performs advanced analytics (predictive analytics, spatial data processing, text analytics, text search, streaming analytics, graph data processing) and includes ETL capabilities as well as an application server.
SAP NetWeaver Business WarehouseSAP Business Warehouse (SAP BW) is SAP’s Enterprise Data Warehouse product. It can transform and consolidate business information from virtually any source system.[citation needed] It ran on industry standard RDBMS until version 7.3 at which point it began to transition onto SAP’s HANA in-memory DBMS, particularly with the release of version 7.4.
SAP Sybase ASESAP ASE (Adaptive Server Enterprise), originally known as Sybase SQL Server, and also commonly known as Sybase DB or Sybase ASE, is a relational model database server product for businesses developed by Sybase Corporation which became part of SAP AG. ASE is predominantly used on the Unix platform, but is also available for Microsoft Windows.
SAP Sybase IQSAP IQ (formerly known as SAP Sybase IQ or Sybase IQ) is a column-based, petabyte scale, relational database software system used for business intelligence, data warehousing, and data marts. Produced by Sybase Inc., now an SAP company, its primary function is to analyze large amounts of data in a low-cost, highly available environment. SAP IQ is often credited with pioneering the commercialization of column-store technology.
ServiceNowServiceNow, Inc. (Service-now in 2011) is an American cloud computing company with its headquarters in Santa Clara, California. It was founded in 2004 by Fred Luddy, the previous CTO of software companies Peregrine Systems and Remedy Corporation. ServiceNow is listed on the New York Stock Exchange and is a constituent of the Russell 1000 Index.
SharePoint ListsSharePoint is a web-based collaborative platform that integrates with Microsoft Office. Launched in 2001, SharePoint is primarily sold as a document management and storage system, but the product is highly configurable and usage varies substantially among organizations.
SnowflakeSnowflake is an analytic data warehouse provided as Software-as-a-Service (SaaS). Snowflake provides a data warehouse that is faster, easier to use, and far more flexible than traditional data warehouse offerings.
Spark SQLApache Spark is an open-source distributed general-purpose cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Teradata and Teradata OLAPTeradata Corporation is a provider of database and analytics-related products and services. The company was formed in 1979 in Brentwood, California, as a collaboration between researchers at Caltech and Citibank’s advanced technology group. The company was acquired by NCR Corporation in 1991, and subsequently spun-off again as an independent public company on October 1, 2007.
TIBCO Data VirtualizationTIBCO Data Virtualization. TIBCO Data Virtualization is an enterprise data virtualization solution that curates access to multiple and varied data sources and creates standard, federated views ― the data foundation for nearly any analytics solution.
VerticaVertica Systems is an analytic database management software company. Vertica was founded in 2005 by database researcher Michael Stonebraker and Andrew Palmer. Palmer was the founding CEO; later, Ralph Breslauer and Christopher P. Lynch served as CEOs.
ODBCIn computing, Open Database Connectivity (ODBC) is a standard application programming interface (API) for accessing database management systems (DBMS). The designers of ODBC aimed to make it independent of database systems and operating systems. An application written using ODBC can be ported to other platforms, both on the client and server side, with few changes to the data access code.
JDBCJava Database Connectivity (JDBC) is an application programming interface (API) for the programming language Java, which defines how a client may access a database. It is a Java-based data access technology used for Java database connectivity. It is part of the Java Standard Edition platform, from Oracle Corporation. It provides methods to query and update data in a database and is oriented towards relational databases. A JDBC-to-ODBC bridge enables connections to any ODBC-accessible data source in the Java virtual machine (JVM) host environment.

Ok, I have to admit, I did know most of the technologies on the list, but did also find a few hidden gems. But we are not quite finished yet, Tableau also supports the following Data Sources:

Note: For the drivers listed in the table above you will need to download and install the required driver before trying to connect. If you are a Tableau Server administrator, you will also need to install these drivers onto the server. You can find all the available drivers here: https://www.tableau.com/support/drivers

Connecting to a Data Source

Note: for the rest of this article I will be using the Microsoft SQL Server RDBMS and the AdventureWorks sample database. If you have not done so, check out my tutorial SQL101 / Relational Database Management Systems 

Let us get started by:

Note: you can check that the Microsoft SQL Server is running by typing Services.msc in the search bar. In the Services.msc window, scroll downwards to find SQL Server (Express) and check that it is running. If it is not running, right-click and choose Start.

You will now see the following:

Let us take some time to explore this window and various options:

  1. Right-click here to allow you to edit your connection details.
  2. This drop-down allows you to choose different databases within your connection.
  3. Dragging this component onto the query panel allows you to enter in Custom SQL, this is used if you do not want to use the visual editor.
  4. This allows you to performance union queries.
  5. All Stored Procedures available here.
  6. You can click on this button when editing your Worksheet to return to the Edit Data Source window.
  7. Click on this button to bring back a sample of data from the database; this is useful for previewing your data.
  8. You can edit each individual column here, or create additional Calculated Fields.
  9. This toolbar allows you to customise your preview window.
  10. This is the visual query editor where you can create queries by dragging tables and defining the joins.
    • Once you drag on additional tables you can define how the tables are joined.
    • If you have not done so already, check out my article about database joins: SQL104 / Table Joins
    • At #TC18 we saw a preview of automatic join detection.
  11. You can choose between the two Connection Types:
    • Live – means that Tableau will always connect directly to the database and return data each time the query is refreshed.
    • Extract – means that Tableau will run the SQL statement in the visual editor and cache the data within the Tableau Workbook. This gives you a performance boost when building your dashboards, and when publishing to a Tableau Server, can be scheduled to refresh periodically.
  12. You can apply a Data Source filter here which will apply to the SQL Statement sent to the database.

Now that we have our query, please try it for yourself by:

And boom, that is a wrap for this article, I hope you have found this as interest read, but why not follow through and create a visualisation with the AdventureWorks database. 

Summary

In this tutorial, we have installed Tableau Desktop and activated our 14-day trial license. We looked at the different elements of the Tableau Desktop Workspace as well as the different types of Tableau files; in later tutorials, we will dig deeper into these. Last, but not least, we built our first Tableau worksheet with data from the World Indicators Data Source.

I hope you enjoyed this tutorial, and look forward to the next one. I am always looking for ways to help share knowledge around the community, so if you want to know more about something, and wish there was a tutorial out there, do get in touch on Twitter at @Tableau_Magic.

Exit mobile version