You design an Azure Data Factory data flow activity to move large amounts of data from text files to an Azure Synapse Analytics database. You add a data flow script to your data flow. The data flow in the designer has the following tasks:

  distinctRows1: Aggregate data by using myCols that produce columns.

  source1: Import data from DelimitedText1.

  derivedColumn1: Create and update the C1 columns.

  select1: Rename derivedColumn1 as select1 with columns C1.

  sink1: Add a sink dataset.

You need to ensure that all the rows in source1 are deduplicated. What should you do?

Category: Microsoft Azure Data Engineering | Sub Category: Practice Assessment for Exam DP-203 - Data Engineering on Microsoft Azure | By Prasad Bonam | Last updated: 2023-09-10 06:02:36

Ans: Change the incoming stream for distinctRows1 to source1.

Changing the incoming stream for distinctRows1 to source1 places the dedupe script directly after source1, so every source row passes through it and only distinct rows continue downstream.

Creating a new aggregate task after source1 and copying the script to the aggregate task will not work and will cause errors in the flow.

Changing the incoming stream for derivedColumn1 to distinctRows1 will break the flow, because distinctRows1 itself would still have no data coming into it.

Creating a new flowlet task after source1 only adds a subflow to the data flow; on its own it does not deduplicate any rows.
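The effect of the incoming-stream choice can be sketched in plain Python. This is only an illustration of the wiring, not Azure Data Factory code: the sample rows, the C1 expression, and the helper names (`run`, `distinct_rows1`, `derived_column1`) are assumptions made up for the example.

```python
def source1():
    # Hypothetical rows standing in for DelimitedText1; the third row is a duplicate.
    return [("x", 1), ("y", 2), ("x", 1)]

def distinct_rows1(rows):
    # dict.fromkeys keeps the first occurrence of each row and preserves order.
    return list(dict.fromkeys(rows))

def derived_column1(rows):
    # Hypothetical C1 expression: join the two fields into one column.
    return [{"C1": f"{key}-{value}"} for key, value in rows]

def run(stages, rows):
    # Feed the output of each stage into the next one, like chained streams.
    for stage in stages:
        rows = stage(rows)
    return rows

# distinctRows1 reads from source1, derivedColumn1 reads from distinctRows1.
wired = run([distinct_rows1, derived_column1], source1())

# derivedColumn1 reads straight from source1, so the dedupe step is bypassed.
bypassed = run([derived_column1], source1())

print(len(wired), len(bypassed))
```

With the dedupe stage wired directly behind the source, the duplicate row never reaches the later stages; when it is bypassed, all three rows flow through.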

Dedupe rows and find nulls by using data flow snippets - Azure Data Factory | Microsoft Learn

Orchestrating data movement and transformation in Azure Data Factory - Training | Microsoft Learn

To ensure that all the rows in source1 are deduplicated in an Azure Data Factory data flow, route source1 directly into the aggregate that performs the dedupe. Mapping data flows do not offer a separate "distinct" transformation in the designer; distinctRows1 is an Aggregate transformation that groups on myCols. Here are the steps to achieve this:

  1. Open your Azure Data Factory data flow.

  2. In the data flow designer, select the "distinctRows1" transformation and set its incoming stream to "source1," which represents your source data from "DelimitedText1."

  3. Confirm that "distinctRows1" groups on myCols, so that only one row per distinct group is passed downstream.

  4. Make sure "derivedColumn1" takes "distinctRows1" as its incoming stream so that it continues processing the deduplicated data.

  5. Save your data flow.

  6. Publish and execute your data flow as part of your Azure Data Factory pipeline.

Here is a sample data flow script with "distinctRows1" placed directly after "source1". It follows the distinct-rows snippet from the Microsoft Learn article listed above; the source and sink options are illustrative, and myExpression is a placeholder for your own C1 expression:

source(allowSchemaDrift: true,
    validateSchema: false) ~> source1
source1 aggregate(groupBy(myCols = sha2(256,columns())),
    each(match(true()), $$ = first($$))) ~> distinctRows1
distinctRows1 derive(C1 = myExpression) ~> derivedColumn1
derivedColumn1 select(mapColumn(C1),
    skipDuplicateMapInputs: true,
    skipDuplicateMapOutputs: true) ~> select1
select1 sink(allowSchemaDrift: true,
    validateSchema: false) ~> sink1

In this script, "distinctRows1" is an Aggregate transformation, not a dedicated "Distinct" type. It groups on myCols, a SHA-256 hash of all incoming columns, and uses first($$) to keep the first value of every column in each group, so each distinct row appears once. The name before each transformation (for example, source1 before aggregate) is its incoming stream. Adjust the script as needed for your specific column names and expressions.

By routing source1 through the "distinctRows1" aggregate, you ensure that only unique rows from source1 are processed in the subsequent transformations and sent to the sink.
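The hash-and-keep-first idea behind the Microsoft Learn dedupe snippet (group on a SHA-256 hash of all columns, then keep the first row of each group) can be approximated in Python as a rough sketch. The sample rows, the field separator, and the function names `row_hash` and `distinct_rows` are assumptions for illustration; this sketch does not claim to reproduce exactly how the data flow engine hashes values or treats nulls.

```python
import hashlib

def row_hash(row):
    # Join all column values with a separator and hash the result,
    # similar in spirit to grouping on sha2(256, columns()).
    joined = "\x1f".join("" if value is None else str(value) for value in row.values())
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def distinct_rows(rows):
    # Keep the first row seen for each hash, like taking first($$) per group.
    first_by_hash = {}
    for row in rows:
        first_by_hash.setdefault(row_hash(row), row)
    return list(first_by_hash.values())

# Hypothetical input: the third row duplicates the first.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 1, "name": "a"},
]
deduped = distinct_rows(rows)
print(deduped)
```

Because the hash covers every column, two rows are treated as duplicates only when all of their values match. In this sketch a null and an empty string hash the same way, which is a simplification you may want to change.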
