Finding duplicated data across one or more columns in a database table
A few months ago I posted a little query about [finding duplicate rows in a database table]({{<relref “/blog/2008-10-17-finding-duplicate-rows-in-the-database”>}} “Finding duplicate rows in a database table”).
I'm revisiting this because I helped out Doogie with a similar query last night but with some complications.
Let's start with the original simple scenario of checking duplicates in a single column.
Some example data, a Users table:
Id | |
---|---|
1 |
joe@bloggs.com |
2 |
joe@bloggs.com |
3 |
joe@bloggs.com |
4 |
jane@doe.com |
5 |
jane@doe.com |
6 |
john@doe.com |
You can see that joe@bloggs.com
and jane@doe.com
have been duplicated. This could have been prevented by putting a unique index on the Email column.
So to find what emails have duplicates in our table:
Microsoft.Playwright.PlaywrightException: Error: The language "mysql" has no grammar.
at Object.highlight (http://127.0.0.1:41899/main.js:3978:11)
at eval (eval at evaluate (:226:30), :2:18)
at UtilityScript.evaluate (:233:19)
at UtilityScript. (:1:44)
at Microsoft.Playwright.Transport.Connection.InnerSendMessageToServerAsync[T](ChannelOwnerBase object, String method, Dictionary`2 dictionary, Boolean keepNulls) in /_/src/Playwright/Transport/Connection.cs:line 209
at Microsoft.Playwright.Transport.Connection.WrapApiCallAsync[T](Func`1 action, Boolean isInternal) in /_/src/Playwright/Transport/Connection.cs:line 535
at Microsoft.Playwright.Core.Frame.EvaluateAsync[T](String script, Object arg) in /_/src/Playwright/Core/Frame.cs:line 548
at SiteGen.Extensions.Markdown.Prism.PrismHost.Highlight(String source, String language)
at SiteGen.Extensions.Markdown.Prism.PrismCodeBlockRenderer.Write(HtmlRenderer renderer, PrismCodeBlock obj)
Results:
Duplicates | |
---|---|
jane@doe.com |
2 |
joe@bloggs.com |
3 |
So, to help us manually correct our data, what are the Ids of the duplicates?
In MySQL (4.1+), we can use GROUP_CONCAT
(after casting the numerical Id to a character string):
Microsoft.Playwright.PlaywrightException: Error: The language "mysql" has no grammar.
at Object.highlight (http://127.0.0.1:41899/main.js:3978:11)
at eval (eval at evaluate (:226:30), :2:18)
at UtilityScript.evaluate (:233:19)
at UtilityScript. (:1:44)
at Microsoft.Playwright.Transport.Connection.InnerSendMessageToServerAsync[T](ChannelOwnerBase object, String method, Dictionary`2 dictionary, Boolean keepNulls) in /_/src/Playwright/Transport/Connection.cs:line 209
at Microsoft.Playwright.Transport.Connection.WrapApiCallAsync[T](Func`1 action, Boolean isInternal) in /_/src/Playwright/Transport/Connection.cs:line 535
at Microsoft.Playwright.Core.Frame.EvaluateAsync[T](String script, Object arg) in /_/src/Playwright/Core/Frame.cs:line 548
at SiteGen.Extensions.Markdown.Prism.PrismHost.Highlight(String source, String language)
at SiteGen.Extensions.Markdown.Prism.PrismCodeBlockRenderer.Write(HtmlRenderer renderer, PrismCodeBlock obj)
Our results:
Duplicates | Culprits | |
---|---|---|
jane@doe.com |
2 | 4,5 |
joe@bloggs.com |
3 | 1,2,3 |
That's quite handy, but what about just a list of the duplicates we can go through, instead of these rows of comma-separated Ids? This fugly query will do that for us: (I'm sure I could do this a better way but I'm tired and this works!)
Microsoft.Playwright.PlaywrightException: Error: The language "mysql" has no grammar.
at Object.highlight (http://127.0.0.1:41899/main.js:3978:11)
at eval (eval at evaluate (:226:30), :2:18)
at UtilityScript.evaluate (:233:19)
at UtilityScript. (:1:44)
at Microsoft.Playwright.Transport.Connection.InnerSendMessageToServerAsync[T](ChannelOwnerBase object, String method, Dictionary`2 dictionary, Boolean keepNulls) in /_/src/Playwright/Transport/Connection.cs:line 209
at Microsoft.Playwright.Transport.Connection.WrapApiCallAsync[T](Func`1 action, Boolean isInternal) in /_/src/Playwright/Transport/Connection.cs:line 535
at Microsoft.Playwright.Core.Frame.EvaluateAsync[T](String script, Object arg) in /_/src/Playwright/Core/Frame.cs:line 548
at SiteGen.Extensions.Markdown.Prism.PrismHost.Highlight(String source, String language)
at SiteGen.Extensions.Markdown.Prism.PrismCodeBlockRenderer.Write(HtmlRenderer renderer, PrismCodeBlock obj)
Id | |
---|---|
4 | jane@doe.com |
5 | jane@doe.com |
1 | joe@bloggs.com |
2 | joe@bloggs.com |
3 | joe@bloggs.com |
Now you can edit / delete the rows you want to get rid of if you ran the query in something like phpMyAdmin.
And don't forget, after the clean-up job, add that index to prevent duplicates re-appearing:
Microsoft.Playwright.PlaywrightException: Error: The language "mysql" has no grammar.
at Object.highlight (http://127.0.0.1:41899/main.js:3978:11)
at eval (eval at evaluate (:226:30), :2:18)
at UtilityScript.evaluate (:233:19)
at UtilityScript. (:1:44)
at Microsoft.Playwright.Transport.Connection.InnerSendMessageToServerAsync[T](ChannelOwnerBase object, String method, Dictionary`2 dictionary, Boolean keepNulls) in /_/src/Playwright/Transport/Connection.cs:line 209
at Microsoft.Playwright.Transport.Connection.WrapApiCallAsync[T](Func`1 action, Boolean isInternal) in /_/src/Playwright/Transport/Connection.cs:line 535
at Microsoft.Playwright.Core.Frame.EvaluateAsync[T](String script, Object arg) in /_/src/Playwright/Core/Frame.cs:line 548
at SiteGen.Extensions.Markdown.Prism.PrismHost.Highlight(String source, String language)
at SiteGen.Extensions.Markdown.Prism.PrismCodeBlockRenderer.Write(HtmlRenderer renderer, PrismCodeBlock obj)
Now, the new scenario. What about duplicates across multiple columns? For example, our Locations table:
Id | CountryCode | AreaCode | Prefix |
---|---|---|---|
1 | 64 | 9 | 489 |
2 | 64 | 9 | 489 |
3 | 64 | 9 | 489 |
4 | 64 | 3 | 942 |
5 | 64 | 3 | 942 |
6 | 64 | 9 | 536 |
Here, we want to find duplicates that have the same values in the 3 columns.
For example, you can see that 64-9-489 is duplicated three times, and 64-3-942 two times.
We can do this without much alteration to our original queries:
Microsoft.Playwright.PlaywrightException: Error: The language "mysql" has no grammar.
at Object.highlight (http://127.0.0.1:41899/main.js:3978:11)
at eval (eval at evaluate (:226:30), :2:18)
at UtilityScript.evaluate (:233:19)
at UtilityScript. (:1:44)
at Microsoft.Playwright.Transport.Connection.InnerSendMessageToServerAsync[T](ChannelOwnerBase object, String method, Dictionary`2 dictionary, Boolean keepNulls) in /_/src/Playwright/Transport/Connection.cs:line 209
at Microsoft.Playwright.Transport.Connection.WrapApiCallAsync[T](Func`1 action, Boolean isInternal) in /_/src/Playwright/Transport/Connection.cs:line 535
at Microsoft.Playwright.Core.Frame.EvaluateAsync[T](String script, Object arg) in /_/src/Playwright/Core/Frame.cs:line 548
at SiteGen.Extensions.Markdown.Prism.PrismHost.Highlight(String source, String language)
at SiteGen.Extensions.Markdown.Prism.PrismCodeBlockRenderer.Write(HtmlRenderer renderer, PrismCodeBlock obj)
CountryCode | AreaCode | Prefix | Duplicates |
---|---|---|---|
64 | 3 | 942 | 2 |
64 | 9 | 489 | 3 |
Then to get the Ids:
Microsoft.Playwright.PlaywrightException: Error: The language "mysql" has no grammar.
at Object.highlight (http://127.0.0.1:41899/main.js:3978:11)
at eval (eval at evaluate (:226:30), :2:18)
at UtilityScript.evaluate (:233:19)
at UtilityScript. (:1:44)
at Microsoft.Playwright.Transport.Connection.InnerSendMessageToServerAsync[T](ChannelOwnerBase object, String method, Dictionary`2 dictionary, Boolean keepNulls) in /_/src/Playwright/Transport/Connection.cs:line 209
at Microsoft.Playwright.Transport.Connection.WrapApiCallAsync[T](Func`1 action, Boolean isInternal) in /_/src/Playwright/Transport/Connection.cs:line 535
at Microsoft.Playwright.Core.Frame.EvaluateAsync[T](String script, Object arg) in /_/src/Playwright/Core/Frame.cs:line 548
at SiteGen.Extensions.Markdown.Prism.PrismHost.Highlight(String source, String language)
at SiteGen.Extensions.Markdown.Prism.PrismCodeBlockRenderer.Write(HtmlRenderer renderer, PrismCodeBlock obj)
CountryCode | AreaCode | Prefix | Duplicates | Culprits |
---|---|---|---|---|
64 | 3 | 942 | 2 | 4,5 |
64 | 9 | 489 | 3 | 1,2,3 |
I think you're getting the point.
Here's to get the rows for the culprits:
Microsoft.Playwright.PlaywrightException: Error: The language "mysql" has no grammar.
at Object.highlight (http://127.0.0.1:41899/main.js:3978:11)
at eval (eval at evaluate (:226:30), :2:18)
at UtilityScript.evaluate (:233:19)
at UtilityScript. (:1:44)
at Microsoft.Playwright.Transport.Connection.InnerSendMessageToServerAsync[T](ChannelOwnerBase object, String method, Dictionary`2 dictionary, Boolean keepNulls) in /_/src/Playwright/Transport/Connection.cs:line 209
at Microsoft.Playwright.Transport.Connection.WrapApiCallAsync[T](Func`1 action, Boolean isInternal) in /_/src/Playwright/Transport/Connection.cs:line 535
at Microsoft.Playwright.Core.Frame.EvaluateAsync[T](String script, Object arg) in /_/src/Playwright/Core/Frame.cs:line 548
at SiteGen.Extensions.Markdown.Prism.PrismHost.Highlight(String source, String language)
at SiteGen.Extensions.Markdown.Prism.PrismCodeBlockRenderer.Write(HtmlRenderer renderer, PrismCodeBlock obj)
Id | CountryCode | AreaCode | Prefix |
---|---|---|---|
4 | 64 | 3 | 942 |
5 | 64 | 3 | 942 |
1 | 64 | 9 | 489 |
2 | 64 | 9 | 489 |
3 | 64 | 9 | 489 |
Again I'm sure there'd be an easier way to do that, but hey, it works, and for something that should be a one-off.
So how to prevent the duplicated data in our second scenario? Add a composite unique key on those columns:
Microsoft.Playwright.PlaywrightException: Error: The language "mysql" has no grammar.
at Object.highlight (http://127.0.0.1:41899/main.js:3978:11)
at eval (eval at evaluate (:226:30), :2:18)
at UtilityScript.evaluate (:233:19)
at UtilityScript. (:1:44)
at Microsoft.Playwright.Transport.Connection.InnerSendMessageToServerAsync[T](ChannelOwnerBase object, String method, Dictionary`2 dictionary, Boolean keepNulls) in /_/src/Playwright/Transport/Connection.cs:line 209
at Microsoft.Playwright.Transport.Connection.WrapApiCallAsync[T](Func`1 action, Boolean isInternal) in /_/src/Playwright/Transport/Connection.cs:line 535
at Microsoft.Playwright.Core.Frame.EvaluateAsync[T](String script, Object arg) in /_/src/Playwright/Core/Frame.cs:line 548
at SiteGen.Extensions.Markdown.Prism.PrismHost.Highlight(String source, String language)
at SiteGen.Extensions.Markdown.Prism.PrismCodeBlockRenderer.Write(HtmlRenderer renderer, PrismCodeBlock obj)