Category: Microsoft Azure Data Engineering | Sub Category: Practice Assessment for Exam DP-203 - Data Engineering on Microsoft Azure | By Prasad Bonam | Last updated: 2023-09-10
Ans: Change the incoming stream for distinctRows1 to source1.
Changing the incoming stream for distinctRows1 to source1 moves the dedupe script so that it runs directly after source1, and only distinct rows are passed to the rest of the flow (see the script sketch below).
Creating a new aggregate task after source1 and copying the script into that task will not work and will cause errors in the flow.
Changing the incoming stream for derivedColumn1 to distinctRows1 will break the flow, because no data would be coming into distinctRows1.
Creating a new flowlet task after source1 only adds a subflow to the data flow; it does not deduplicate the rows.
Dedupe rows and find nulls by using data flow snippets - Azure Data Factory | Microsoft Learn
Orchestrating data movement and transformation in Azure Data Factory - Training | Microsoft Learn
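For reference, the dedupe snippet described in the linked article is built on an Aggregate transformation. A minimal data flow script sketch of that pattern, using the stream names from this question (source1 feeding distinctRows1), looks like this:

    source1 aggregate(groupBy(mycols = sha2(256, columns())),
        each(match(true()), $$ = first($$))) ~> distinctRows1

Here sha2(256, columns()) hashes all columns to form the grouping key, and first($$) keeps the first value of every column in each group, so each distinct row is emitted exactly once. Changing the incoming stream of distinctRows1 to source1 simply makes source1 the stream name on the left of this line.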
To ensure that all the rows in source1 are deduplicated in an Azure Data Factory data flow, you can add a deduplication (distinct-style) step to the data flow; in mapping data flows this is typically built with an Aggregate transformation. Here are the steps to achieve this:
1. Open your Azure Data Factory data flow.
2. In the data flow designer, select the "source1" transformation, which represents your source data from "DelimitedText1."
3. Add a deduplication transformation immediately after source1 so that only unique rows are passed downstream (a script sketch follows these steps).
4. Connect the deduplication transformation to "derivedColumn1" to continue processing the deduplicated data.
5. Save your data flow.
6. Publish and run the data flow as part of your Azure Data Factory pipeline.
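Putting these steps together, the underlying data flow script for the corrected flow might look roughly like the sketch below. The allowSchemaDrift/validateSchema settings and the upper(C1) expression are illustrative placeholders, not values taken from the question scenario:

    source(allowSchemaDrift: true,
        validateSchema: false) ~> source1
    source1 aggregate(groupBy(mycols = sha2(256, columns())),
        each(match(true()), $$ = first($$))) ~> distinctRows1
    distinctRows1 derivedColumn(C1 = upper(C1)) ~> derivedColumn1
    derivedColumn1 select(mapColumn(C1),
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> select1
    select1 sink(allowSchemaDrift: true,
        validateSchema: false) ~> sink1

The important line for this question is the second one: distinctRows1 takes source1 as its incoming stream, so deduplication happens before derivedColumn1, select1, and sink1 are reached.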
Here is a simplified, illustrative data flow definition (not the exact JSON that Data Factory generates) showing where the distinctRows1 deduplication transformation sits:
{
  "name": "MyDataFlow",
  "type": "DataFlow",
  "typeProperties": {
    "sources": [
      {
        "name": "source1",
        "outputs": [
          {
            "name": "output"
          }
        ]
      }
    ],
    "transformations": [
      {
        "name": "distinctRows1",
        "description": "Deduplicate rows",
        "type": "Distinct",
        "typeProperties": {
          "columns": [
            "myCols"
          ]
        }
      },
      {
        "name": "derivedColumn1",
        "description": "Create and update C1 columns",
        "type": "DerivedColumn",
        "typeProperties": {
          "columns": [
            {
              "name": "C1",
              "type": "Expression",
              "expression": "myExpression"
            }
          ]
        }
      },
      {
        "name": "select1",
        "description": "Rename derivedColumn1 as select1 with columns C1",
        "type": "Select",
        "typeProperties": {
          "columns": [
            "C1"
          ]
        }
      }
    ],
    "sinks": [
      {
        "name": "sink1",
        "inputs": [
          {
            "name": "output"
          }
        ],
        "writeBatchSize": 10000
      }
    ],
    "scriptActions": [],
    "integrationRuntime": {
      "type": "Managed"
    }
  }
}
In this script, the "distinctRows1" transformation uses the "Distinct" type with the "columns" property set to "myCols" to deduplicate rows based on those columns. Adjust the script as needed for your specific column names and expressions.
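As an illustrative variation (not part of the original question), if you want to deduplicate on specific key columns instead of hashing every column, the same aggregate pattern can group by those keys and keep the first value of the remaining columns, for example:

    source1 aggregate(groupBy(C1),
        each(match(name != 'C1'), $$ = first($$))) ~> distinctRows1

With this variant, rows that share the same C1 value collapse into a single output row, and the first value encountered is kept for every other column.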
By adding the deduplication transformation, you ensure that only unique rows from source1 are processed in the subsequent transformations and sent to the sink.