Using AWS Cognito in production… a retrospective.

Jesse Davda
11 min read · Apr 18, 2021
Photo by İsmail Enes Ayhan on Unsplash

Back in July 2020, Clutch was using Auth0 for all of our user account management and authorization when a new developer who had joined our team showed us how powerful cloud computing could be. At the time we were using a select few AWS services to power our application (EC2, Route53, and a few lambdas and queues for certain asynchronous workloads), all managed through the AWS console. Around the same time we were running into limitations with Auth0, mainly around how it handled hooks and the storage of user metadata. Its pricing model was also very expensive for a small startup.

Once we had made the decision to migrate from Auth0 to Cognito, the next step was designing our user account architecture.

Application Structure

We ultimately decided to use DynamoDB to store our user information and Cognito’s lambda triggers to populate the table with data. We integrated with Facebook and Google as identity providers and had two types of users: guest and registered. Guests are non-authenticated users generated through leads such as our site’s “contact us” form, whereas registered users are authenticated and have signed up through our site, either with email/password or with one of our identity providers. On the front end, we used Amplify to trigger all of these flows.

We used the ‘pre-signup’ and ‘post-confirmation’ lambda triggers to run a findOrCreate on our users table. This made sure there was a user profile row for each sign-up.
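To make the findOrCreate idea concrete, here is a minimal sketch. It uses an in-memory Map as a stand-in for the DynamoDB users table (a real call would be asynchronous), and the UserProfile shape and the choice of lower-cased email as the key are assumptions for illustration, not our actual schema.

```typescript
// Hypothetical profile shape; the real table schema is not shown in this post.
interface UserProfile {
  email: string;
  type: "guest" | "registered";
  createdAt: string;
}

// In-memory stand-in for the DynamoDB users table, keyed by lower-cased email.
class UsersTable {
  private rows = new Map<string, UserProfile>();

  // Returns the existing row for this email, or creates one if none exists.
  findOrCreate(
    email: string,
    type: UserProfile["type"]
  ): { profile: UserProfile; created: boolean } {
    const key = email.trim().toLowerCase();
    const existing = this.rows.get(key);
    if (existing !== undefined) {
      return { profile: existing, created: false };
    }
    const profile: UserProfile = {
      email: key,
      type,
      createdAt: new Date().toISOString(),
    };
    this.rows.set(key, profile);
    return { profile, created: true };
  }
}
```

Calling findOrCreate from both triggers is safe because the second call simply finds the row the first call created.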

With our original design, each registration method created a unique user in the Cognito user pool and a corresponding unique profile row in the DynamoDB users table, i.e. a Facebook user, a Google user, and a native user with the same email were all separate accounts. Recently, due to reported user confusion around the separated accounts, we decided to merge providers based on email.

Implementing The New Structure

A quick Google search along the lines of “aws cognito merging user accounts” led us to the adminLinkProviderForUser method. After reading a bit about what the method does, we thought it was what we needed to achieve our new structure. It was, but not exactly how we thought…

On the face of it, adminLinkProviderForUser links multiple providers to a single native account, allowing multiple methods of authorization to be connected to the same root Cognito user. The documentation states that you pass the method two users: a destinationUser and a sourceUser. The destinationUser must be a native Cognito user existing in the user pool at the time of linking, and the sourceUser must be a social user that DOES NOT exist as a user in the Cognito user pool. After reading this we decided that the most logical place in our registration flow to link the users would be the ‘pre-signup’ lambda trigger, since technically the social user signing up doesn’t yet exist in the pool, but you still have the required user information to complete the link operation.
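As a rough sketch, the request passed to adminLinkProviderForUser can be assembled like this. The buildLinkRequest helper is hypothetical, and it assumes the federated username arrives as the provider name, an underscore, and the provider’s user id (for example "google_1234567890"); check the value your trigger actually receives before relying on that format.

```typescript
// Request shape for Cognito's AdminLinkProviderForUser API.
interface ProviderUser {
  ProviderName: string;
  ProviderAttributeName?: string;
  ProviderAttributeValue: string;
}

interface LinkProviderRequest {
  UserPoolId: string;
  DestinationUser: ProviderUser;
  SourceUser: ProviderUser;
}

// Hypothetical helper. Assumption: socialUsername looks like
// "<provider>_<providerUserId>", e.g. "Facebook_123" or "google_123".
function buildLinkRequest(
  userPoolId: string,
  nativeUsername: string,
  socialUsername: string
): LinkProviderRequest {
  const separator = socialUsername.indexOf("_");
  if (separator <= 0 || separator === socialUsername.length - 1) {
    throw new Error(`Unexpected federated username: ${socialUsername}`);
  }
  const rawProvider = socialUsername.slice(0, separator);
  const providerUserId = socialUsername.slice(separator + 1);
  // Provider names are case sensitive, so "google" becomes "Google".
  const providerName = rawProvider.charAt(0).toUpperCase() + rawProvider.slice(1);

  return {
    UserPoolId: userPoolId,
    // The native user that already exists in the pool.
    DestinationUser: {
      ProviderName: "Cognito",
      ProviderAttributeValue: nativeUsername,
    },
    // The social identity that must not yet exist as a pool user.
    SourceUser: {
      ProviderName: providerName,
      ProviderAttributeName: "Cognito_Subject",
      ProviderAttributeValue: providerUserId,
    },
  };
}
```

The returned object would then be handed to the SDK’s adminLinkProviderForUser call inside the trigger.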

This is when we came across a known bug in Cognito relating to the adminLinkProviderForUser method. If you link the incoming social user to the root user in the ‘pre-signup’ trigger, Cognito will still try to create the social user as a standalone user in the pool. This operation fails with the error: An entry for the user: USER_ID already exist. The reason is that although the user doesn’t exist as its own entity, it does exist as an identity linked to an existing user. A quick remedy was to detect the error on the front end (the error message is returned from the OAuth flow in the redirect URL) and run the sign-in flow. The error only happens on the first registration attempt, because registration is the only time Cognito tries to create a user. On sign-in, Cognito finds the root user that is linked to the social user that has just been authenticated and returns the correct tokens.
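The front-end detection can be sketched roughly as follows. This assumes the message arrives in the error_description query parameter of the redirect URL, and because the exact wording may vary, the match is deliberately loose; isAlreadyLinkedError is a hypothetical helper name.

```typescript
// Returns true when the OAuth redirect carries the "entry ... already exist"
// error produced by linking in the pre-signup trigger.
// Assumption: the message is in the `error_description` query parameter.
function isAlreadyLinkedError(redirectUrl: string): boolean {
  const params = new URL(redirectUrl).searchParams;
  const description = (params.get("error_description") ?? "").toLowerCase();
  // Loose match so small differences in wording do not break detection.
  return description.includes("already") && description.includes("entry");
}
```

When this returns true, the app can immediately start the sign-in flow with the same provider instead of showing an error.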

This approach works fine for Facebook, since its authentication flow remembers the user and won’t require another challenge to be completed; instead it redirects you back to the application successfully. Google, on the other hand, requires you to select the account you want to continue with on every initiation of the OAuth flow if you have more than one account signed in, which is an unacceptable user experience.

Moving the link operation to ‘post-confirmation’

The next iteration was moving the linking step to ‘post-confirmation’ and keeping the ‘findOrCreate’ in ‘pre-signup’. Because the sourceUser needs to be a non-existing user, and this trigger runs after the user has been created, the newly made social user has to be deleted before we can link it to the root native user. Theoretically this would work: the last trigger in the registration process is ‘post-confirmation’, so nothing can break the flow once the trigger has completed.

Once we had moved the code over and pushed it to our testing environment, we encountered some weird behavior. Signing up with Facebook or Google worked: the Cognito console showed one user with linked providers and merged user data, and our users DynamoDB table was being populated correctly. But for some reason, we were getting a 403 invalid_grant error from the oauth2/token endpoint on our user pool when sending the authorization code.

After doing some research into the token endpoint and the error, we learned that invalid_grant is only returned when either the refresh token has been revoked or, as in our case, the authorization code does not exist. This is because the code that is generated in the social provider’s OAuth flow is only returned once the ‘post-confirmation’ trigger has run, but by then that user had been deleted in order to be linked to the root user with the same email, hence the invalid_grant error. A quick fix to allow the user to sign in is to delete the cookie that Cognito sets, which I assume is the authorization code or some reference to it. This can only be done by calling the sign-out endpoint, since the cookie has the httpOnly flag set. Once the cookie is deleted, the next sign-in is successful and the user is authorized with no issue, although this isn’t ideal as the page will flash when the user is redirected twice.
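For reference, calling the sign-out endpoint means redirecting the browser to the user pool domain’s /logout path. A minimal sketch of building that URL is below; buildLogoutUrl is a hypothetical helper, and the domain, client id, and redirect values in any usage are placeholders.

```typescript
// Builds the URL of the user pool's hosted logout endpoint.
// `logoutUri` must be one of the sign-out URLs registered on the app client.
function buildLogoutUrl(domain: string, clientId: string, logoutUri: string): string {
  const query =
    `client_id=${encodeURIComponent(clientId)}` +
    `&logout_uri=${encodeURIComponent(logoutUri)}`;
  return `https://${domain}/logout?${query}`;
}
```

In the browser, assigning the result to window.location.href sends the user through the endpoint, which clears the httpOnly cookie and then redirects back to logoutUri.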

Utilizing the custom authentication flow to link the users and return the correct tokens

Because the previous two solutions left the user with a somewhat bad experience, we had to come up with something different. Through a mixture of finding this GitHub issue and a response from AWS support, we came to a solution. Cognito allows you to build a custom authentication flow in which you create and verify your own authentication challenge. In our implementation we kept the define auth challenge trigger and the create auth challenge trigger mostly default. The verify auth challenge response lambda took in the JWT returned for the created social user, verified it, and then performed the linking step, returning the correct tokens.

Custom authentication flow using AWS Lambda

These are the three triggers that control the custom authentication flow:

Define auth challenge

This lambda is like the controller for the entire flow. It has access to the session array, which is an array of the past challenges that a user has completed. This trigger also decides whether to issue tokens (authentication success) or fail authentication.
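A minimal sketch of that controller logic for a single custom challenge might look like the following. The event type is trimmed to the fields used here, and the function is written as a plain helper rather than a deployed handler, so treat it as an illustration rather than our production code.

```typescript
interface ChallengeResult {
  challengeName: string;
  challengeResult: boolean;
}

// Trimmed-down view of the define auth challenge trigger event.
interface DefineAuthChallengeEvent {
  request: { session: ChallengeResult[] };
  response: {
    challengeName?: string;
    issueTokens: boolean;
    failAuthentication: boolean;
  };
}

function defineAuthChallenge(event: DefineAuthChallengeEvent): DefineAuthChallengeEvent {
  const session = event.request.session;
  const last = session.length > 0 ? session[session.length - 1] : undefined;

  if (last === undefined) {
    // No challenge attempted yet: present the custom challenge.
    event.response.challengeName = "CUSTOM_CHALLENGE";
    event.response.issueTokens = false;
    event.response.failAuthentication = false;
  } else if (last.challengeName === "CUSTOM_CHALLENGE" && last.challengeResult) {
    // The custom challenge was answered correctly: issue tokens.
    event.response.issueTokens = true;
    event.response.failAuthentication = false;
  } else {
    // Anything else ends the flow without tokens.
    event.response.issueTokens = false;
    event.response.failAuthentication = true;
  }
  return event;
}
```

A Lambda handler would call this with the incoming event and return the result to Cognito.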

Create auth challenge

This lambda creates the authentication challenge that will be completed by the user. For example, if the flow were a one-time password emailed to the user, this trigger would be responsible for generating the one-time code and emailing it to the user.

Verify auth challenge response

This lambda is responsible for verifying what the user enters as a response to the challenge set by the previous lambda. The create auth challenge lambda passes the expected response to this lambda using the privateChallengeParameters. These have an expiry time of 3 minutes, meaning that if this lambda isn’t invoked within that time limit the flow must be restarted.
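To show how the expected answer travels between the two lambdas, here is a sketch of the one-time-password example mentioned above. The event types are trimmed to the fields used, the expectedCode key and the injected generateCode function are assumptions for illustration, and actually emailing the code is left out.

```typescript
// Trimmed-down views of the two trigger events.
interface CreateAuthChallengeEvent {
  request: { userAttributes: Record<string, string> };
  response: {
    publicChallengeParameters: Record<string, string>;
    privateChallengeParameters: Record<string, string>;
  };
}

interface VerifyAuthChallengeEvent {
  request: {
    privateChallengeParameters: Record<string, string>;
    challengeAnswer: string;
  };
  response: { answerCorrect: boolean };
}

// `generateCode` is injected so the sketch stays deterministic in tests.
function createAuthChallenge(
  event: CreateAuthChallengeEvent,
  generateCode: () => string
): CreateAuthChallengeEvent {
  const code = generateCode();
  // Sending the code to the user (e.g. by email) would happen here.
  // Public parameters are visible to the client; private ones are not.
  event.response.publicChallengeParameters = {
    email: event.request.userAttributes["email"] ?? "",
  };
  event.response.privateChallengeParameters = { expectedCode: code };
  return event;
}

function verifyAuthChallengeResponse(event: VerifyAuthChallengeEvent): VerifyAuthChallengeEvent {
  const expected = event.request.privateChallengeParameters["expectedCode"];
  event.response.answerCorrect =
    expected !== undefined && event.request.challengeAnswer === expected;
  return event;
}
```

Cognito carries privateChallengeParameters from the create step into the verify step, so the expected value never reaches the client.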

In our implementation of this custom flow, the define auth challenge lambda sets the challengeName to CUSTOM_CHALLENGE, which triggers the next part of the flow, the create auth challenge lambda. This lambda doesn’t do much in our flow, since the custom challenge we use doesn’t require a code or password to be generated. Once these two lambdas have run, the user sends the JWT that was generated for the social user to the verify auth challenge response trigger. There we verify that the JWT was signed by the correct user pool, that its audience is the app client id, and that the email in the verified JWT matches the email the authentication flow was started with. As long as these conditions are met, we delete the social user and link them to the user found or created in the ‘pre-signup’ trigger. Once that is complete, the ‘define auth challenge’ trigger runs again and decides whether or not the authentication flow is complete (in our case that was the end, but you could use a traditional password authentication flow and subsequently a one-time password challenge using this flow). The trigger then issues new tokens to the user so that they are now logged in as the native user with the social user linked to them.
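The claim checks in that verification step can be sketched as below. This covers only the comparison of already-decoded claims; verifying the token’s signature against the user pool’s published keys has to happen first and is not shown. The claimsMatch helper and the ExpectedToken shape are hypothetical names for illustration.

```typescript
// Claims we care about from an already signature-verified Cognito ID token.
interface IdTokenClaims {
  iss: string;   // issuer: the user pool that signed the token
  aud: string;   // audience: the app client id
  email: string;
  exp: number;   // expiry, in seconds since the epoch
}

// Hypothetical container for the values the token is checked against.
interface ExpectedToken {
  region: string;
  userPoolId: string;
  clientId: string;
  email: string;      // email the authentication flow was started with
  nowSeconds: number; // current time, injected to keep the check testable
}

// Assumption: `claims` came from a token whose signature was already verified.
function claimsMatch(claims: IdTokenClaims, expected: ExpectedToken): boolean {
  const issuer = `https://cognito-idp.${expected.region}.amazonaws.com/${expected.userPoolId}`;
  return (
    claims.iss === issuer &&
    claims.aud === expected.clientId &&
    claims.email.toLowerCase() === expected.email.toLowerCase() &&
    claims.exp > expected.nowSeconds
  );
}
```

Only when this returns true would the lambda go on to delete the social user and perform the link.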

This only needs to happen the first time a user signs up with a social account. Once the user has completed that flow, they can sign in with the usual Cognito-handled methods.

Challenges with this solution

Although this solution is the best of the three that we tried, it wasn’t without its challenges. The main challenge we faced was that for social users the email_verified flag is set to false by default unless you set it in your user pool’s social provider attribute mapping:

AWS Cognito Google provider attribute mapping

This is easy with Google, since they supply the email_verified data point with their user API, whereas Facebook HAD a verified flag available in their docs but it is now deprecated.

This is an issue because we create a root account for new social users, so if the user later wants to sign up with a native account they have to go through the password reset flow (Nintendo takes a similar approach). The problem arises when the user goes to send a password reset email: Cognito won’t send one if the email isn’t verified. We came up with a few different solutions to this problem, including automatically verifying the user’s email on sign-in, provided they were in the aforementioned state, or verifying the user’s email just before they tried to reset their password. The solution we decided on came after a bit of a Hail Mary to check whether the ‘deprecated’ verified flag is still returned in the Facebook payload, which it is. We aren’t too worried if this returns false positives, since we make the assumption that in order to sign up with Facebook the user would have verified their email address.

For now we are taking the risk that Facebook is returning the verified flag, although we have added alerts to check for an instance where that flag isn’t being returned properly.

The login issue

We thought that the email verification problem was the last issue we would face in our journey to implementing the new user structure, but it wasn’t. One more problem came up during testing: once we logged in and out with a social account, the subsequent sign-in with that same social account failed, with the oauth2/token endpoint returning a 403 invalid_grant error. We had faced this before. Previously it was caused by the flow using an OAuth code that no longer existed because we had deleted the user, which makes sense. This time, though, we had a successful login after deleting the user.

After trying a few different things, namely retrying the login flow and deleting cookies, we found the culprit: the same ‘cognito’ cookie that was left on the user’s machine the last time we had an issue. Deleting it allowed the user to log in again. This didn’t make much sense to us, since in order to reproduce the issue we had to sign out, and in theory the signOut call should purge any Cognito/Amplify related tokens from localStorage and cookies, but it wasn’t doing so.

This took a bit of detective work to figure out. We wanted to see what Amplify was doing when you called Auth.signOut(); this is the code we found in the Amplify Auth module:

Since we weren’t calling Auth.signOut with the global option, we only needed to focus on these parts:

This code checks whether the last sign-in was through a hosted UI, such as Facebook or Google. The hosted UIs use an OAuth flow that stores tokens on the user’s computer, so if the last login was through a hosted UI, Amplify needs to remove those tokens from the machine. As you can see in the code, Amplify checks this through the combination of the oAuthHandler resolving to true and the ‘amplify-signin-with-hostedUI’ item in localStorage being true.

Observant readers may have figured out the solution to this problem already. For those who have not: in the flow we settled on, the first step is to log in through a hosted UI (Facebook or Google) and then call the custom auth flow we implemented. As it turns out, calling the custom flow sets that localStorage item to ‘false’, since the custom flow does not use a hosted UI (bear in mind that Cognito does offer a hosted UI flow that uses the user’s email/password).

This means that once a user logs out, Amplify performs that check and, because the ‘hostedUI’ item is set to false, just clears localStorage without making a call to the user pool’s logout endpoint. The previous session is therefore never cleared, hence the 403 invalid_grant error. To fix this problem we set the ‘hostedUI’ localStorage item to “true” once the custom auth flow has completed. That way, once the user logs out, the previous session is properly cleared and causes no further issues upon signing in again.
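The fix itself is tiny. A sketch is below; markSignedInWithHostedUI is a hypothetical helper name, and the storage is passed in behind a small interface so the helper can be exercised outside a browser.

```typescript
// Minimal subset of the Web Storage API used by this helper.
interface KeyValueStorage {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// The localStorage key Amplify checks on sign-out, as named in this post.
const HOSTED_UI_KEY = "amplify-signin-with-hostedUI";

// Call this once the custom auth flow has completed, so that a later
// sign-out also hits the user pool's logout endpoint.
function markSignedInWithHostedUI(storage: KeyValueStorage): void {
  storage.setItem(HOSTED_UI_KEY, "true");
}
```

In the browser this would be called as markSignedInWithHostedUI(window.localStorage) right after the custom flow resolves.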

Conclusion

Looking back on the challenges we faced and the hours spent researching and trying different solutions, there were many times when we contemplated throwing in the towel and either building a custom solution or integrating with one of Cognito’s competitors (Firebase, or back to Auth0). Both of those options have disadvantages compared to Cognito.

The main disadvantage of building a custom solution is the security aspect. Sure, we could have built a user account service that integrates into our app and takes care of storing user information, sign-up, sign-in, etc., and used something like bcrypt or argon2 to hash user passwords and authenticate. But since none of our team members are security experts, there would be many attack vectors we never would have thought of. At our current team/org size, building a custom solution introduces too much risk, as well as potentially not being fully compliant with HIPAA, PCI, etc., whereas Cognito is out of the box. This is without mentioning the extra dev time it would take to build the custom solution out.

Overall I’m happy with the decision to use AWS Cognito to handle our user accounts. As the organization grows, Cognito may no longer fit all of our criteria, and at that point we will probably migrate to a more custom solution.
