Wednesday, April 29, 2009

Teradata Interview Questions and Answers

1. What is the difference between FastLoad and MultiLoad?

FastLoad uses multiple sessions to quickly load a large amount of data into an empty table. MultiLoad is used for high-volume maintenance on tables and views; it also works with non-empty tables. A maximum of five target tables can be used in one MultiLoad job.
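For reference, a minimal FastLoad script sketch; the database, table, file, and field names here are invented for illustration:

LOGON tdpid/username,password;
SET RECORD VARTEXT ","; /* comma-delimited input file */
DEFINE in_cust_id (VARCHAR(10)),
in_cust_name (VARCHAR(30))
FILE = /data/customer.csv;
BEGIN LOADING sales_db.Customer ERRORFILES sales_db.Cust_ET, sales_db.Cust_UV; /* target table must be empty */
INSERT INTO sales_db.Customer VALUES (:in_cust_id, :in_cust_name);
END LOADING;
LOGOFF;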


2. Which is faster?

FastLoad.

3. What is the difference between an inner join and an outer join?

An inner join gets data from both tables only where the specified join data exists in both tables.

An outer join gets data from the source table at all times, and returns data from the outer-joined table only where it matches the join criteria; non-matching rows are extended with NULLs.
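A quick illustration, assuming hypothetical Employee and Department tables:

-- Inner join: only employees whose dept_no matches a department
SELECT e.emp_name, d.dept_name
FROM Employee e
INNER JOIN Department d ON e.dept_no = d.dept_no;

-- Left outer join: every employee, with NULL dept_name where there is no match
SELECT e.emp_name, d.dept_name
FROM Employee e
LEFT OUTER JOIN Department d ON e.dept_no = d.dept_no;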

4. What is multi-insert?

Inserting data records into a table using multiple INSERT statements in one request. It is achieved by putting the semicolon in front of the keyword INSERT in the next statement rather than terminating the first statement with a semicolon:

INSERT INTO Sales SELECT * FROM Customer

;INSERT INTO Loan SELECT * FROM Customer;

5. Is multi-insert ANSI standard?

No.

6. How do you create a table with the existing structure of another table, with data or with no data?

CREATE TABLE CustomerDummy AS Customer WITH DATA;

CREATE TABLE CustomerDummy AS Customer WITH NO DATA;

7. What is the opening step in a Basic Teradata Query (BTEQ) script?

.LOGON tdpid/username,password
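For example, a minimal BTEQ script skeleton (tdpid, username, and password are placeholders):

.LOGON tdpid/username,password
SELECT DATE; /* any SQL statements go here */
.QUIT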

8. You are calling a BTEQ script which drops a table and creates a table. It will throw an error if the table does not exist. How can you do it without throwing the error?

You can do it by setting the severity of error 3807 (object does not exist) to zero before dropping the table and resetting it to 8 after dropping.

You can do it like this:

.SET ERRORLEVEL 3807 SEVERITY 0;

DROP TABLE EMPLOYEE;

.SET ERRORLEVEL 3807 SEVERITY 8;

10. Can you FastExport a field which is a primary key by putting equality on that key?

No.

11. Did you write stored procedures in Teradata?

No, because they become a single-AMP operation, and my company didn't encourage that.

12. What is the use of having indexes on a table?

For faster record searches.

13. Did you use Queryman or SQL Assistant?

SQL Assistant 6.1.

14. I am updating a table in BTEQ. It has to update a large number of rows, so it's really slow. What do you suggest?

In Teradata it is not recommended to update more than about a million rows in one transaction, due to journal space problems. If the update is smaller than that and still slow in BTEQ, you might want to add a COLLECT STATISTICS statement before the UPDATE statement.
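For example, a hedged sketch (Sales, status, and cust_id are invented names):

COLLECT STATISTICS ON Sales COLUMN (cust_id); -- give the optimizer fresh demographics
UPDATE Sales SET status = 'CLOSED' WHERE cust_id = 1001;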

15. Is it necessary to add a QUIT statement after a BTEQ query when I am calling it in a Unix environment?

Not necessary, but it is good practice to add a .QUIT statement after the query.

16. There is a column with a date in it. If I want to get just the month, how can it be done? Can I use SUBSTRING?

SUBSTRING works on character fields, so it cannot be used directly on a DATE column. To extract the month from a date column, use EXTRACT. The same works for YEAR or DAY, or for HOUR or MINUTE if it is a timestamp.
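For example, assuming a hypothetical Orders table with a DATE column order_date and a TIMESTAMP column order_ts:

SELECT EXTRACT(MONTH FROM order_date) FROM Orders; -- month as a number from 1 to 12
SELECT EXTRACT(YEAR FROM order_date) FROM Orders;
SELECT EXTRACT(MINUTE FROM order_ts) FROM Orders;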

17. What’s the syntax of sub string?

SUBSTRING (string_expression FROM n1 [FOR n2]); Teradata also supports the shorter form SUBSTR (string_expression, n1 [, n2]).
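For instance:

SELECT SUBSTRING('Teradata' FROM 1 FOR 4); -- returns 'Tera'
SELECT SUBSTR('Teradata', 5, 4); -- returns 'data'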

18. Did you use the CASE WHEN statement? Can you tell us a little about it?

Yes. It is used when a result has to be selected depending upon the value of an expression.

19. While creating a table, my DBA put FALLBACK or NO FALLBACK in his DDL. What is that?

FALLBACK requests that a second copy of each row inserted into a table be stored on another AMP in the same cluster. The fallback copy is used when an AMP goes down or a disk fails.
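For example, a table defined with FALLBACK (names are illustrative):

CREATE TABLE Employee, FALLBACK -- keep a second copy of each row on another AMP
( emp_no INTEGER,
emp_name VARCHAR(30)
)
UNIQUE PRIMARY INDEX (emp_no);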

20. My table got locked during MLOAD due to a failed job. What do I do to perform other operations on it?

Use RELEASE MLOAD. It removes access locks from the target tables in Teradata. It must be entered from BTEQ, not from MultiLoad:

RELEASE MLOAD tablename;

21. How to find duplicates in a table?

GROUP BY those fields and filter with HAVING: SELECT id, COUNT(*) FROM table_name GROUP BY id HAVING COUNT(*) > 1;

22. How do you verify a complicated SQL statement?

I use the EXPLAIN statement to check whether the query is doing what I want it to do.

23. How many tables can you join in V2R5?

Up to 64 tables.

24. Did you ever use the UPPER function?

The UPPER function is used to convert all characters in a column to uppercase.

25. What does a LOWER Function do?

The LOWER function is used to convert all characters in a column to lowercase.

26. How do you see a DDL for an existing table?

By using the SHOW TABLE command: SHOW TABLE tablename;

27. Which is more efficient GROUP BY or DISTINCT to find duplicates?

With many duplicates, GROUP BY is more efficient; if only a few duplicates exist, DISTINCT is more efficient.

28. Syntax for CASE WHEN statement?

CASE value_expression WHEN value_expression_1 THEN scalar_expression_1 [WHEN value_expression_n THEN scalar_expression_n] [ELSE scalar_expression] END;
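For example, against a hypothetical Employee table, both the valued and the searched forms:

-- Valued CASE: compare one expression against a list of values
SELECT CASE dept_no
WHEN 100 THEN 'Sales'
WHEN 200 THEN 'Finance'
ELSE 'Other'
END
FROM Employee;

-- Searched CASE: an independent condition per branch
SELECT CASE
WHEN salary >= 100000 THEN 'High'
WHEN salary >= 50000 THEN 'Medium'
ELSE 'Low'
END
FROM Employee;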

29. What’s the difference between TIMESTAMP (0) and TIMESTAMP (6)?

TIMESTAMP (0) is CHAR (19) and TIMESTAMP (6) is CHAR (26)

Everything else is the same, except that TIMESTAMP (6) also stores microseconds.

30. How do you determine the number of sessions?

· Teradata performance and workload

· Client platform type, performance, and workload

· Channel performance for channel-attached systems

· Network topology and performance for network-attached systems

· Volume of data to be processed by the application

31. What is a node? How many nodes and AMPs were used in your previous project?

A node is a server in the Teradata system; it runs its own copy of the operating system and the Teradata database software. We used 318 nodes, and each node had 2 to 4 AMPs.

32. What is a clique?

A clique is a group of disk arrays physically cabled to a group of nodes.

33. The interviewer explained their project (environment, nature of work)

Listen to them carefully so that at the end of the interview you can ask questions about the project when you are given a chance to ask questions.

34. Tell us something about yourself?

Describe your project experience and technical skill sets, and mention that you are hardworking, a good team player, a self-learner, and self-motivated.

35. What is the best project you have ever worked on, and why is it the best project?

All the projects I have worked on so far are the best projects. I treat every project equally and work hard for the success of the project.

36. What makes a project successful, and how have you contributed to the success of the project?

Good team members, the technical knowledge of team members, hard work, sharing knowledge within the team, and each individual's contribution to the project. Explain that you possess all the skills mentioned above.

37. Have you worked under stress and how did you handle it?

Yes. Many times, to deliver the project on schedule, we were under a lot of pressure. During those times we worked extra hours and helped each other in the team to deliver the project on schedule. Team effort is a key factor in the success of a project.

38. What is the difference between FastLoad and MultiLoad?

FastLoad uses multiple sessions to quickly load a large amount of data into an empty table.

MultiLoad is used for high-volume maintenance on tables and views; it also works with non-empty tables. A maximum of five target tables can be used in one MultiLoad job.

39. Have you used procedures?

No, I have not used procedures, but I have a good working knowledge of writing them. My company did not encourage writing procedures because a procedure becomes a single-AMP operation, and as such uses a lot of resources; it is expensive in terms of both resources and time.

40. What is the purpose of indexes?

An index is a mechanism that the SQL query optimizer can use to make table access more efficient. Indexes enhance data access by providing a more or less direct path to stored data, avoiding the need for full-table scans to locate the small number of rows you typically want to retrieve or update.

41. What is primary index and secondary index?

The primary index is the mechanism for assigning a data row to an AMP and a location on the AMP's disks. Indexes are also used to access rows from a table without having to search the entire table.

Secondary indexes enhance set selection by specifying access paths less frequently used than the primary index path. Secondary indexes are also used to facilitate aggregate operations. If a secondary index covers a query, the Optimizer determines that it would be less costly to access its rows directly rather than using it to access the base table rows it points to. Sometimes multiple secondary indexes with low individual selectivity can be overlapped and bit-mapped to provide enhanced performance.

42. Why are primary and secondary indexes used?

Refer to the answer for the previous question.

43. What are the things to be considered while creating a secondary index?

Creating a secondary index causes Teradata to build a sub-table to contain its index rows, thus adding another set of rows that requires updating each time a table row is inserted, deleted, or updated. Secondary index sub-tables are also duplicated whenever a table is defined with FALLBACK, so the maintenance overhead is effectively doubled.

44. What is collect statistics?

COLLECT STATISTICS collects demographic data for one or more columns of a table, hash index, or join index; computes a statistical profile of the collected data; and stores the synopsis in the data dictionary. The Optimizer uses the synopsis data when it generates its table access and join plans.

45. Can we collect statistics on multiple columns?

Yes, we can collect statistics on multiple columns.

46. Can we collect statistics on table level?

Yes, we can collect statistics at the table level. The syntax is COLLECT STAT ON TAB_A; which recollects all statistics previously defined on the table.
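For illustration (Employee, dept_no, and job_code are invented names):

COLLECT STATISTICS ON Employee COLUMN (dept_no); -- single column
COLLECT STATISTICS ON Employee COLUMN (dept_no, job_code); -- multicolumn statistics
COLLECT STATISTICS ON Employee; -- recollect all statistics already defined on the table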

47. What is an inner join and an outer join?

An inner join gets data from both tables only where the specified join data exists in both tables.

An outer join gets data from the source table at all times, and returns data from the outer-joined table only where it matches the join criteria; non-matching rows are extended with NULLs.

48. When is TPump used instead of MultiLoad?

TPump provides an alternative to MultiLoad for the low-volume batch maintenance of large databases under control of a Teradata system. Instead of updating Teradata databases overnight, or in batches throughout the day, TPump updates information in real time, acquiring data from the client system with low processor utilization. It does this through a continuous feed of data into the data warehouse, rather than the traditional batch updates. Continuous updates result in more accurate, timely data. And, unlike most load utilities, TPump uses row-hash locks rather than table-level locks. This allows you to run queries while TPump is running. It also means that TPump can be stopped instantaneously. As a result, businesses can make better decisions that are based on the most current data.

49. What is spool space and when running a job if it reaches the maximum spool space how you solve the problem?

Spool space is used to hold intermediate rows during processing and to hold the rows in the answer set of a transaction. Spool space reaches its maximum when the query is not properly optimized; use appropriate conditions in the WHERE clause of the query to limit the answer set.

50. What is your level of expertise in using MS office suite?

Expert level. I have been using it for the last 8 years for documentation.

51. Have you used NetMeeting?

Yes. I used NetMeeting for team meetings when members of the team were geographically in different locations.

52. Do you have any questions?

What is the team size going to be? What is the current status of the project? What is the project schedule?

53. What is your available date?

Immediate. (Or state your actual availability date for the project.)

54. How much experience do you have with MVS?

Intermediate. In my previous two projects I used MVS to submit JCL jobs.

55. Have you created JCL script from scratch?

Yes. I have created JCL scripts from scratch while creating jobs in the development environment.

56. Have you modified and used any JCL scripts?

Yes I have modified JCL scripts. In my previous projects many applications were re-engineered so the existing JCL scripts were modified according to the company coding standards.

57. Rate yourself on using Teradata tools like BTEQ, Queryman, FastLoad, MultiLoad, and TPump!

Intermediate to expert level. I have been using them extensively for the last 4 years. I am also certified in Teradata.

58. Which is your favorite area in the project?

I enjoy working on every part of the project. I volunteer my time for my peers so that I can also learn and contribute more towards the project's success.

59. What is data mart?

A data mart is a special-purpose subset of enterprise data used by a particular department, function, or application. Data marts may hold both summary and detail data; however, the data has usually been pre-aggregated or transformed in some way to better handle the particular types of requests of a specific user community. Data marts are categorized as independent, logical, and dependent data marts.

60. Difference between star and snowflake schemas?

A star schema is de-normalized, and a snowflake schema is normalized.

61. Why should you put your data warehouse on a system other than the OLTP system?

Relational data modeling (OLTP design) vs. dimensional data modeling (OLAP design):

· OLTP: data is stored in an RDBMS. OLAP: data is stored in an RDBMS or in multidimensional databases.

· OLTP: tables are the units of storage. OLAP: cubes are the units of storage.

· OLTP: data is normalized and optimized for OLTP processing. OLAP: data is de-normalized, used in data warehouses and data marts, and optimized for OLAP.

· OLTP: several tables and chains of relationships among them. OLAP: few tables; fact tables are connected to dimension tables.

· OLTP: volatile (frequent updates) and time variant. OLAP: non-volatile and time invariant.

· OLTP: SQL is used to manipulate data. OLAP: MDX is used to manipulate data.

· OLTP: detailed level of transactional data. OLAP: summaries of bulky transactional data (aggregates and measures) used in business decisions.

· OLTP: normal reports. OLAP: user-friendly, interactive, drag-and-drop multidimensional OLAP reports.

62. Why are OLTP database designs not generally a good idea for a Data Warehouse?

OLTP designs are for real-time transactional data; they are highly normalized and not pre-aggregated, so they are not good for decision support systems.

63. What type of Indexing mechanism do we need to use for a typical data warehouse?

The primary index mechanism is the ideal type of index for a data warehouse.

64. What is VLDB?

Very Large Database: a database containing an extremely large volume of data, typically several terabytes or more.

65. What is the difference between OLTP and OLAP?

Refer to the answer for question 61.

66. What is real time data warehousing?

Real-time data warehousing is a combination of two things: 1) real-time activity and 2) data warehousing. Real-time activity is activity that is happening right now. The activity could be anything such as the sale of widgets. Once the activity is complete, there is data about it. Data warehousing captures business activity data. Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.

67. What is ODS?

An operational data store (ODS) is primarily a “dump” of relevant information from a very small number of systems (often just one) usually with little or no transformation. The benefits are an ad hoc query database, which does not affect the operation of systems required to run the business. ODS’s usually deal with data “raw” and “current” and can answer a limited set of queries as a result.

68. What is real time and near real time data warehousing?

The difference between real time and near real time can be summed up in one word: latency. Latency is the time lag between an activity completing and the completed activity data being available in the data warehouse. In real time, the latency is negligible, whereas in near real time the latency is a tangible time frame, such as two hours.

69. What are Normalization, First Normal Form, Second Normal Form and Third Normal Form?

Normalization is the process of efficiently organizing data in a database. The two goals of the normalization process are to eliminate redundant data (storing the same data in more than one table) and to ensure data dependencies make sense (only storing related data in a table).

First normal form:

· Eliminate duplicate columns from the same table.

· Create separate tables for each group of related data and identify each row with a unique column or set of columns (primary key)

Second normal form:

· Remove subsets of data that apply to multiple rows of a table and place them in separate tables.

· Create relationships between these new tables and their predecessors through the use of foreign keys.

Third normal form:

· Remove columns that are not dependent upon the primary key (see the sketch below).
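A minimal DDL sketch of a third-normal-form design, with all names invented: the city name depends on city_id rather than on the customer key, so it is moved to its own table.

CREATE TABLE Customer
( cust_id INTEGER NOT NULL,
cust_name VARCHAR(30),
city_id INTEGER -- foreign key to City; the city name is not stored here
) UNIQUE PRIMARY INDEX (cust_id);

CREATE TABLE City
( city_id INTEGER NOT NULL,
city_name VARCHAR(30)
) UNIQUE PRIMARY INDEX (city_id);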

70. What is a fact table?

The centralized table in a star schema is called the fact table, i.e., a table that contains facts and is connected to dimensions. A fact table typically has two types of columns: those that contain facts and those that are foreign keys to dimension tables. The primary key of a fact table is usually a composite key made up of all of its foreign keys. A fact table might contain either detail-level facts or facts that have been aggregated (fact tables that contain aggregated facts are often called summary tables instead). In the real world, it is possible to have a fact table that contains no measures or facts; these tables are called factless fact tables.

71. What is ETL?

Extract, transformation, and loading. ETL refers to the methods involved in accessing and manipulating source data and loading it into a target database. The first step in the ETL process is mapping the data between source systems and the target database (data warehouse or data mart). The second step is cleansing the source data in a staging area. The third step is transforming the cleansed source data and then loading it into the target system. Note that ETT (extract, transformation, transportation) and ETM (extraction, transformation, move) are sometimes used instead of ETL.

72. What is ER diagram?

An entity-relationship diagram describes the relationships among the entities in the database model.

73. What is data mining?

Analyzing large volumes of relatively simple data to extract important trends and new, higher-level information. For example, a data-mining program might analyze millions of product orders to determine trends among top-spending customers, such as their likelihood to purchase again or their likelihood to switch to a different vendor.

74. What is Star schema?

A star schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema and contains one or more dimension and fact tables. It is called a star schema because the entity-relationship diagram between the dimension and fact tables resembles a star, with one fact table connected to multiple dimensions. The center of the star schema consists of a large fact table, which points towards the dimension tables. The advantages of a star schema are slicing down, increased performance, and easy understanding of the data.
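A toy star schema sketch (all names invented): one fact table holding the measure, pointing at two dimension tables.

CREATE TABLE Date_Dim
( date_id INTEGER NOT NULL,
cal_date DATE
) UNIQUE PRIMARY INDEX (date_id);

CREATE TABLE Product_Dim
( product_id INTEGER NOT NULL,
product_name VARCHAR(30)
) UNIQUE PRIMARY INDEX (product_id);

CREATE TABLE Sales_Fact
( date_id INTEGER NOT NULL, -- foreign key to Date_Dim
product_id INTEGER NOT NULL, -- foreign key to Product_Dim
sales_amt DECIMAL(12,2) -- the fact (measure)
) PRIMARY INDEX (date_id, product_id);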

75. What is a lookup table?

Refer to the answer for question 77: dimension tables are sometimes called lookup or reference tables.

76. What is a level of Granularity of a fact table?

The components that make up the granularity of the fact table correspond directly with the dimensions of the data model. Thus, when you define the granularity of the fact table, you identify the dimensions of the data model. The granularity of the fact table also determines how much storage space the database requires. For example, consider the following possible granularities for a fact table:

· Product by day by region

· Product by month by region

The size of a database that has a granularity of product by day by region would be much greater than that of a database with a granularity of product by month by region, because the former contains records for every transaction made each day as opposed to a monthly summation of transactions. You must carefully determine the granularity of your fact table: too fine a granularity could result in an astronomically large database, while too coarse a granularity could mean the data is not detailed enough for users to perform meaningful queries against the database.

77. What is a dimension table?

A dimension table is one that describes the business entities of an enterprise, represented as hierarchical, categorical information such as time, departments, locations, and products. Dimension tables are sometimes called lookup or reference tables. In relational data modeling, for normalization purposes, country lookup, state lookup, county lookup, and city lookup tables are not merged into a single table. In dimensional data modeling (star schema), these tables would be merged into a single table called LOCATION DIMENSION for performance and data-slicing requirements. This location dimension helps to compare the sales in one region with another region. We may see a good sales profit in one region and a loss in another region. If it is a loss, the reasons may be a new competitor in that area, a failure of our marketing strategy, etc.

78. What are the various Reporting tools in the Market?

Crystal Reports, BusinessObjects, MicroStrategy, etc.

79. What are the various ETL tools in the Market?

Ab Initio, Informatica, etc.

80. What are the different methods of loading dimension tables?

81. What are semi-additive and factless facts, and in which scenarios would you use such fact tables?

82. What is a three-tier data warehouse?

The three-tier architecture differs from the two-tier architecture by strictly enforcing a logical separation of the graphical user interface, business logic, and data. The three-tier architecture is widely used for data warehousing today. For organizations that require greater performance and scalability, the three-tier architecture may be more appropriate. In this architecture, data extracted from legacy systems is cleansed, transformed, and stored in high-speed database servers, which are used as the target database for front-end data access.

83. What are the various transformations available?

84. What is the importance of a surrogate key in data warehousing?

A surrogate key is a primary key for a dimension table. Its main advantage is that it is independent of the underlying database, i.e., a surrogate key is not affected by changes going on in the database.

85. Differentiate Primary Key and Partition Key?

A primary key is a combination of unique and not null. It can be a collection of key values, called a composite primary key. A partition key is just a part of the primary key. There are several methods of partitioning, such as hash, DB2, and random; when using hash partitioning, we specify the partition key.

86. Differentiate Database data and Data warehouse data?

Data in a database is detailed or transactional, both readable and writable, and current.

Data in a data warehouse is detailed or summarized; the warehouse is a storage place for historical data.

87. What are OLAP, MOLAP, ROLAP, DOLAP and HOLAP? Examples?

OLAP:

OLAP stands for Online Analytical Processing. It uses database tables (fact and dimension tables) to enable multidimensional viewing, analysis and querying of large amounts of data. E.g. OLAP technology could provide management with fast answers to complex queries on their operational data or enable them to analyze their company’s historical data for trends and patterns.

MOLAP:

Stands for Multidimensional OLAP. In MOLAP cubes the data aggregations and a copy of the fact data are stored in a multidimensional structure on the Analysis Server computer. It is best when extra storage space is available on the Analysis Server computer and the best query performance is desired. MOLAP local cubes contain all the necessary data for calculating aggregates and can be used offline. MOLAP cubes provide the fastest query response time and performance but require additional storage space for the extra copy of data from the fact table.

ROLAP:

Stands for Relational OLAP. In ROLAP cubes a copy of data from the fact table is not made and the data aggregates are stored in tables in the source relational database. A ROLAP cube is best when there is limited space on the Analysis Server and query performance is not very important. ROLAP local cubes contain the dimensions and cube definitions but aggregates are calculated when they are needed. A ROLAP cube requires less storage space than MOLAP and HOLAP cubes.

HOLAP:

Stands for Hybrid OLAP. A HOLAP cube has a combination of ROLAP and MOLAP cube characteristics. It does not create a copy of the source data; however, data aggregations are stored in a multidimensional structure on the Analysis Server computer. HOLAP cubes are best when storage space is limited but faster query responses are needed.

DOLAP:

Stands for Desktop OLAP. In DOLAP, a small cube or subset of the data is typically downloaded to the client machine and analyzed locally, which suits mobile or disconnected users.

88. What is OLTP?

OLTP stands for Online Transaction Processing. OLTP uses normalized tables to quickly record large volumes of transactions while making sure that these updates of data occur in as few places as possible. Consequently, OLTP databases are designed for recording the daily operations and transactions of a business. E.g., a timecard system that supports a large production environment must successfully record a large number of updates during critical periods like lunch hour, breaks, startup, and close of work.

89. Hierarchy of DWH?

90. What is aggregate awareness?

91. Explain reference cursor?

92. What are parallel queries and query hints?

93. DWH architecture?

94. What are cursors?

95. Advantages of de-normalized data?

96. What is metadata and the system catalog?

97. What is a conformed dimension?

98. What is the capacity of power cube?

99. What are the differences between macros and prompts?

100. What is hash partition?

101. What is DTM session?

102. What is staging area?

The data staging area is a system that stands between the legacy systems and the analytics system, usually a data warehouse and sometimes an ODS. The data staging area is considered the “back room” portion of the data warehouse environment. The data staging area is where the extract, transform and load (ETL) takes place and is out of bounds for end users. Some of the functions of the data staging area include:

· Extracting data from multiple legacy systems

· Cleansing the data, usually with a specialized tool

· Integrating data from multiple legacy systems into a single data warehouse

· Transforming legacy system keys into data warehouse keys, usually surrogate keys

· Transforming disparate codes for gender, marital status, etc., into the data warehouse standard

· Transforming the heterogeneous legacy data structures to the data warehouse data structures

· Loading the various data warehouse tables via automated jobs in a particular sequence through the bulk loader provided with the data warehouse database or a third-party bulk loader

103. What are data merging, data cleansing and sampling?

104. OLAP architecture?

105. What is subject area?

Subject area means the fundamental entities that make up the major components of the business, e.g., customer, product, and employee.

106. Why do we use a DSS database for OLAP tools?

Refer to the answer for question 61.

107. What is tenacity?

The number of hours a Teradata utility will try to establish a connection to the system. The default is 4 hours.

108. What is a checkpoint?

Checkpoints are entries posted to a restart log table at regular intervals during the data transfer operation. If processing stops while a job is running, you can restart the job at the most recent checkpoint.

109. What is slowly changing dimension?

In a slowly changing dimension, the attributes of a record vary over time. There are three ways to handle this (a Type 2 sketch follows the list):

· Type 1 – Replace an old record with a new record. No historical data available.

· Type 2 – Keep the old record and insert a new record. Historical data is available, but this is resource intensive.

· Type 3 – In the existing record, maintain extra columns for the new values.
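A minimal Type 2 sketch, assuming an invented Customer_Dim table with start_date, end_date, and current_flag housekeeping columns:

-- Expire the current version of the changed row
UPDATE Customer_Dim
SET end_date = CURRENT_DATE, current_flag = 'N'
WHERE cust_id = 1001 AND current_flag = 'Y';

-- Insert the new version of the row
INSERT INTO Customer_Dim (cust_id, cust_city, start_date, end_date, current_flag)
VALUES (1001, 'Chicago', CURRENT_DATE, NULL, 'Y');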

110. What is sleep?

The number of minutes a Teradata utility will wait between logon attempts. The default is 6 minutes.

111. Difference between MultiLoad and TPump?

TPump provides an alternative to MultiLoad for low-volume batch maintenance of large databases under control of a Teradata system. TPump updates information in real time, acquiring data from the client system with low processor utilization. It does this through a continuous feed of data into the data warehouse, rather than the traditional batch updates. Continuous updates result in more accurate, timely data. TPump uses row-hash locks rather than table-level locks, which allows you to run queries while TPump is running.

112. Different phases of MultiLoad?

· Preliminary phase

· DML phase

· Acquisition phase

· Application phase

· End phase
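For context, a minimal MultiLoad script sketch that drives these phases; every name here (log table, database, file, and fields) is invented for illustration:

.LOGTABLE sales_db.cust_log;
.LOGON tdpid/username,password;
.BEGIN MLOAD TABLES sales_db.Customer;
.LAYOUT cust_layout;
.FIELD in_cust_id * VARCHAR(10);
.FIELD in_cust_name * VARCHAR(30);
.DML LABEL ins_cust;
INSERT INTO sales_db.Customer (cust_id, cust_name)
VALUES (:in_cust_id, :in_cust_name);
.IMPORT INFILE /data/customer.csv
FORMAT VARTEXT ','
LAYOUT cust_layout
APPLY ins_cust;
.END MLOAD;
.LOGOFF;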

113. Explain modifier!

The EXPLAIN modifier generates an English translation of the parser's plan. The request is fully parsed and optimized but not executed. EXPLAIN returns:

· Text showing how a statement will be processed.

· An estimate of how many rows will be involved.

· A relative cost of the request in units of time.

This information is useful for predicting row counts, predicting performance, testing queries before production, and analyzing various approaches to a problem.
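Usage is simply a matter of prefixing the request with EXPLAIN; for example (Employee and Department are hypothetical tables):

EXPLAIN
SELECT e.emp_name, d.dept_name
FROM Employee e
JOIN Department d ON e.dept_no = d.dept_no;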

114. Explain how hash distribution is done!

115. Difference between Oracle and a Teradata warehouse!

Teradata can handle multiple terabytes of data. Teradata is linearly expandable, uses a mature optimizer, has a shared-nothing architecture, and uses data parallelism.

Teradata DBAs never have to reorganize data or index space, pre-allocate table/index space, format partitions, tune buffer space, ensure that queries run in parallel, pre-process data for loading, or write and run programs to split the input data into partitions for loading.

116. What is dimensional modeling?

Dimensional data modeling comprises one or more dimension tables and fact tables. Good examples of dimensions are location, product, time, promotion, and organization. Dimension tables store records related to a particular dimension, and no facts (measures) are stored in these tables.

117. How will you solve the problem that occurs during update?

When there is an error during the update process, an entry is posted in the error log table. Query the log table, fix the error, and restart the job.

118. How is data distributed in the Teradata system?

119. Can you connect to MultiLoad from Ab Initio?

Yes, we can connect.

120. What interface is used to connect to windows based applications?

WinCLI interface.

121. What is data warehousing?

A data warehouse is a subject oriented, integrated, time variant, non-volatile collection of data in support of management’s decision-making process.

122. What is data modeling?

A Data model is a conceptual representation of data structures (tables) required for a database and is very powerful in expressing and communicating the business requirements.

123. What is logical data model?

A Logical data model is the version of a data model that represents the business requirements (entire or part) of an organization and is developed before the physical data model. A sound logical design should streamline the physical design process by clearly defining data structures and the relationships between them. A good data model is created by clearly thinking about the current and future business requirements. Logical data model includes all required entities, attributes, key groups, and relationships that represent business information and define business rules.

124. Tell us something about data modeling tools?

Data modeling tools are used to transform business requirements into a logical data model, and the logical data model into a physical data model. From the physical data model, these tools can be instructed to generate SQL code for creating database entities.

125. Steps to create a data model?

· Get business requirements.

· Create High Level Conceptual Data Model.

· Create Logical Data Model.

· Select target DBMS where data-modeling tool creates the physical schema.

· Create a standard abbreviation document according to business standards.

126. What is the maximum number of DML can be coded in a MultiLoad script?

A maximum of 5 DML statements can be coded in a MultiLoad script.