
Recap: NLP Meetup #2

On Wednesday, 9th of December 2020, we held the second NLP Meetup with speakers from Trust You, Stanford and Fraunhofer IIS. This blog post recaps the event.

The NLP Meetup is a part of our efforts to push the conversation and boundaries of Natural Language Processing. We at KNOWRON deeply believe that the ability to interact with computers using natural language will be the great enabler of technology for all age groups and people of all abilities. Despite the tough conditions due to the Corona Pandemic, we held the second edition of the NLP meetup, completely online. The event took place on the 9th of December 2020 with over 60 attendees and a good mix of speakers and topics.  

If you missed the event, don’t worry! We’ve briefly summarized the three talks below. If you’d like to attend the next meetup, the third edition of the NLP Meetup Series is due to be held on the 10th of February 2021! Be sure to sign up on our Meetup Page.


Speaker: Ivan Bilan

"In Search of Best Practices for NLP Projects"

Ivan is the Team Lead for Data Science at Trust You and holds an M.Sc. in Linguistics from the Ludwig Maximilians University in Munich. He also maintains the popular GitHub repository The NLP Pandect.

Ivan’s pandect is a compilation of NLP-related resources. He is also very active on LinkedIn.

The questions Ivan gets asked most frequently are about best practices for developing and deploying NLP systems. Unlike Software Engineering, production-grade NLP does not have decades of history behind it, so information on best practices and pitfalls to avoid is scarce. Ivan answered some burning questions on topics such as collecting quality data, building a scalable pipeline and collecting feedback.

Generally, the NLP development pipeline consists of the following steps:

  • Get the data: This is the most crucial step in building an effective NLP system. According to Ivan, investing time early in a quality dataset can yield up to five times better model results.
  • Build a baseline: Building an early baseline has several benefits. For one, it shows that the problem is solvable. It also indicates the direction to take toward the final model.
  • Build the prototype: Before developing a full-blown model, you must first validate that your product is actually useful to users. Not only does a prototype validate the need, but it also generates user interest and helps develop a better understanding of customers.
  • Build the NLP pipeline: Once the product is validated, start building the development pipeline. Choose the right framework (from many possible options) and develop your model. It is important to build rigorous testing and verification modules to ensure system upkeep in production. 
  • Release to the clients: This is the step that most companies struggle with. Don’t let your Data Scientists do the deployments. Data Scientists are good at building statistical models but usually develop subpar deployment pipelines. Instead, hire dedicated Data Engineers to build deployment pipelines and do the ops work. Keep all your models under version control to make releases transparent and to allow rollbacks.
  • Collect feedback: No amount of testing can cover all possible bugs in your models. Problems still arise in production, but the good news is that your users will usually react faster to wrong data than you. Allow users to report issues through the application UI but be prepared for people with malicious intent who would try to feed false data to your system.
  • Monitor data and NLP pipeline: Set up end-to-end tracking on your model. Everything from language drift to classification accuracy should be measured so that you can respond quickly to changes in the data.
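
One of the monitoring signals mentioned in the last step, language drift, can be approximated very roughly by tracking how many production tokens the model never saw during training. The snippet below is a minimal sketch of that idea and not something shown in the talk: the whitespace tokenizer, the function names, the example texts and the 0.3 alert threshold are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return text.lower().split()


def build_vocabulary(texts, min_count=1):
    """Collect the tokens seen at least min_count times in the training texts."""
    counts = Counter(token for text in texts for token in tokenize(text))
    return {token for token, count in counts.items() if count >= min_count}


def oov_rate(texts, vocabulary):
    """Fraction of tokens in texts that are missing from the training vocabulary."""
    tokens = [token for text in texts for token in tokenize(text)]
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in vocabulary)
    return unseen / len(tokens)


def drift_alert(texts, vocabulary, threshold=0.3):
    """Return True when the out-of-vocabulary rate exceeds the (assumed) threshold."""
    return oov_rate(texts, vocabulary) > threshold


# Hypothetical example data, for illustration only.
training_texts = ["the room was clean", "the staff was friendly"]
production_texts = ["the room was clean", "wifi kept dropping constantly"]

vocabulary = build_vocabulary(training_texts)
rate = oov_rate(production_texts, vocabulary)
print(f"OOV rate: {rate:.2f}, alert: {drift_alert(production_texts, vocabulary)}")
```

Here four of the eight production tokens are unknown, so the rate is 0.5 and the alert fires. In a real system the same measurement would run on a schedule next to accuracy metrics, and the threshold would be tuned on historical data.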

Speaker: Giovanni Campagna

"Genie, An Open-Source Toolkit for High Quality, Affordable Virtual Assistants"

Giovanni is a Researcher at the Open Virtual Assistant Lab (OVAL) at Stanford University. OVAL is responsible for running the Open Virtual Assistant Initiative, a project funded by the Alfred A. Sloan Foundation to democratize virtual assistant technology, protect privacy, and keep access to knowledge open. Genie was a part of Giovanni’s Ph.D. research.

Why did the team decide to build Genie? They believe that natural language will be a first-class user interface in the future, along with the keyboard and touch input. Currently, the most popular approach to developing Digital Assistants is with Intents and Deep Learning. There are a number of problems with this state of the industry:

  • Cost: As with any other Deep Learning solution, training intents requires a large number of example dialogues. This adds significant cost to the development of any digital assistant.
  • Life cycle: Until the necessary data is obtained to train the assistant, there is a cold-start problem to overcome.
  • Usability: With so many possible utterances for an intent, collecting enough examples is a significant challenge, and annotating them introduces human labeling errors. The result can be an ineffective personal assistant despite a large dataset.
  • Effectiveness: Managing the conversation flow and responding to a change of topics is a hard problem to solve.
  • Scalability: There are billions of web pages, in thousands of natural languages. The existing approaches simply do not work at that scale. 

While all of these problems are valid for any company aiming to develop digital assistants, the big three (Google, Amazon and Apple) have a huge advantage over any other player. Not only do they have large teams to work on these projects, but they also have the silver bullet for this approach: Data. With these companies collecting data from millions of actual users, no startup can hope to compete with them.

Genie can transform user input into executable actions in an interpretable manner.

The idea behind Genie is twofold. The first idea is to skip human conversation understanding entirely and aim for end-to-end translation from natural language to code. It is not necessary to understand human conversation in order to have an effective assistant: Assistants don’t need to digest the entirety of Wikipedia to post pictures of cats. Since what the computer requires with each interaction is code to execute, the team developed a language called “ThingTalk” as the target into which human utterances are translated. The second idea is to avoid the collection of Big Data. Genie provides a toolkit to engineer and amplify existing small data into a dataset that can train a Neural Network. In a benchmark against Siri, Alexa and Google Home on local restaurant queries, Genie outperformed them all, supporting the thesis that an effective assistant can be built with a different approach.
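
To make the second idea more concrete, here is a deliberately simplified sketch of how a handful of hand-written templates can be expanded into many training pairs of utterance and program. This is not Genie's actual implementation: the templates, the slot values, the `synthesize` function and the program syntax are all invented for illustration, and the program strings are not real ThingTalk.

```python
from itertools import product

# Hypothetical templates: each pairs an utterance pattern with a program pattern.
# The program syntax below is invented for illustration and is NOT real ThingTalk.
TEMPLATES = [
    ("find {cuisine} restaurants in {city}",
     'search(restaurant, cuisine="{cuisine}", city="{city}")'),
    ("show me {cuisine} food near {city}",
     'search(restaurant, cuisine="{cuisine}", city="{city}")'),
]

# Hypothetical slot values used to fill the templates.
SLOT_VALUES = {
    "cuisine": ["italian", "thai"],
    "city": ["munich", "palo alto"],
}


def synthesize(templates, slot_values):
    """Expand every template with every combination of slot values."""
    slot_names = sorted(slot_values)
    pairs = []
    for utterance_pattern, program_pattern in templates:
        for combination in product(*(slot_values[name] for name in slot_names)):
            filling = dict(zip(slot_names, combination))
            pairs.append((utterance_pattern.format(**filling),
                          program_pattern.format(**filling)))
    return pairs


dataset = synthesize(TEMPLATES, SLOT_VALUES)
for utterance, program in dataset:
    print(f"{utterance!r} -> {program}")
```

Two templates with two values per slot already produce eight pairs, and each new template or value multiplies the size of the dataset. A real toolkit would add paraphrasing and other augmentation on top to make the synthetic utterances sound natural.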

A link to the corresponding paper can be found here.

Speaker: Birgit Brüggemeier

"Conversational Privacy: How can chatbots and speech assistants communicate privacy?"

Birgit currently works at the Fraunhofer Institute for Integrated Circuits in Erlangen, where her team is building privacy into personal assistants. She holds a Ph.D. in Neuroscience from Oxford University.

The Fraunhofer SPEAKER project aims to provide a German-made voice assistant solution that implements European standards of data security.

Why build privacy-focused personal assistants? Currently there are three big players in the Digital Assistant market, and all of them have had their run-ins with Data Privacy issues. Amazon in particular was accused by whistleblowers of listening to users’ conversations for labeling purposes. Another reason is that over the past two years, interest in personal assistants has grown at a tremendous pace and is expected to continue to do so. Combined with the fact that speech will become an additional way of interacting with machines, this makes personal assistants an important data privacy concern. As the development of assistants gains more and more traction, the question of how to build privacy-driven assistants becomes more important.

The GDPR requires seamless communication of privacy to the end user. The word “seamless” is important here. According to the legislation, when a user interacts with a speech device, they should never have to stop the interaction and look at their phone to see how their data is being handled. Doing so is unnatural, and only highly motivated users would bother. Existing assistants do not handle this very well. As an experiment, the team asked Siri to stop processing all personal data. Siri responded “OK”, but it didn't do anything, and the data was still being processed.

The team created an experiment to compare the trustworthiness of a regular personal assistant with that of a privacy-focused assistant. They gave users a banking use case and a chatbot to assist them with it. The users were split into two groups. The assistant worked the same for both groups throughout the use case until the final prompt. For the control group using a regular assistant, the assistant asked, "Is there something else I can help you with?". The user could explicitly ask this assistant to delete personal data, and it would. In contrast, the other group, using a privacy-focused assistant, got a privacy-specific question: "Do you want me to delete your data from this interaction?". The questions under investigation were:

  • Does conversational privacy affect user perceptions? 
  • Does conversational privacy affect user choices? 
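
The two conditions described above can be sketched in a few lines. This is only an illustration of the experimental design, not the team's code: the function names, the session dictionary and the naive keyword matching are assumptions, while the two prompts are quoted from the talk.

```python
def closing_prompt(condition):
    """Return the assistant's final prompt for the given experimental condition."""
    if condition == "privacy":
        return "Do you want me to delete your data from this interaction?"
    return "Is there something else I can help you with?"


def handle_reply(condition, reply, session_data):
    """Delete the session data when the user asks for, or agrees to, deletion."""
    text = reply.lower().strip()
    # Naive keyword matching, for illustration only; a real assistant would
    # need proper intent detection (e.g. "don't delete" also matches here).
    explicit_request = "delete" in text
    accepted_offer = condition == "privacy" and text in {"yes", "yes please"}
    if explicit_request or accepted_offer:
        session_data.clear()  # the deletion actually happens, unlike a bare "OK"
        return "Your data from this interaction has been deleted."
    return "Goodbye."


# Hypothetical session data for a participant in the privacy condition.
session = {"account": "hypothetical-123"}
print(closing_prompt("privacy"))
print(handle_reply("privacy", "Yes", session))
```

Both conditions support deletion. The only difference is whether the assistant brings it up itself, which is exactly the variable the experiment isolates.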

For the control group, the majority of users said “I don’t want any more help”, the thought of asking to delete personal data seemingly never crossing their minds. On the other hand, the majority of the users of the privacy-focused assistant accepted the offer and asked for their data from the interaction to be deleted. Furthermore, both assistants scored similarly on all performance metrics except privacy, on which the control group gave the chatbot a low trustworthiness rating.

The results of the experiment showed that privacy must be baked into the assistants and should always be a transparent option in order for the assistant to be viewed by the users as privacy-friendly.

Future Events:

We have already announced the third edition of the NLP Meetup Series, due to be held on the 10th of February 2021!

Sign up on the Meetup Page. We look forward to having you there.
