Generative AI in Data Science: Gemini vs Manual Preparation

TL;DR: A data scientist asked Gemini to reproduce a preprocessing task that had taken an hour by hand. Gemini produced working code in seconds, but the solution was suboptimal. The lesson: AI accelerates the work, but fundamentals remain essential.

What happened?

An article published in Towards Data Science recounts the experience of a data scientist who spent an hour on a preprocessing task with Pandas. Out of curiosity, they asked Gemini (Google's language model) to generate the necessary code. The result: Gemini produced a functional solution in seconds. However, the author warns that the solution, though fast, was not optimal and required human review. This case is not isolated: since the launch of ChatGPT in November 2022, AI code generation has become popular, but this specific example shows current limitations. According to the article, the task involved cleaning and transforming a dataset with multiple columns and missing values; Gemini's solution omitted some integrity checks that the data scientist considered essential.

Why is it important?

Data preparation is one of the most tedious and time-consuming stages in data science. Survey estimates vary: the often-quoted figure of up to 80% of a project's time traces back to CrowdFlower's 2016 report, while Anaconda's 2020 survey put it closer to 45%. Either way, if generative AI can automate this phase, the impact on productivity would be enormous. But the case also shows that AI does not replace human judgment: Gemini's solution ran but was suboptimal, reinforcing the need for data scientists to understand fundamentals. Additionally, the article notes that Gemini's code did not properly handle mixed data types in a column, something an experienced practitioner would detect immediately. This underscores that while AI accelerates repetitive tasks, human oversight remains critical to keep errors from propagating to downstream models.
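To make the mixed-type problem concrete, here is a minimal sketch of how such a column can be detected with Pandas. The article does not show the dataset or Gemini's code, so the column names, the sample values, and the helper mixed_type_columns are all hypothetical.

```python
import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Map each object-dtype column to its Python types when it holds more than one.

    Null values are ignored. This is an illustrative helper, not a Pandas API.
    """
    report = {}
    for col in df.columns:
        if not pd.api.types.is_object_dtype(df[col]):
            continue  # numeric, datetime, etc. columns are homogeneous by dtype
        type_names = set(df[col].dropna().map(lambda v: type(v).__name__))
        if len(type_names) > 1:
            report[col] = type_names
    return report


# Hypothetical data: "amount" mixes integers, floats, and strings.
df = pd.DataFrame({
    "amount": [10, "12.5", 7.0, None, "n/a"],
    "city": ["Lyon", "Oslo", None, "Lima", "Kyiv"],
})

print(mixed_type_columns(df))

# One possible repair: coerce to numeric, turning unparseable entries into NaN
# so they show up as missing values instead of silently staying strings.
df["amount_num"] = pd.to_numeric(df["amount"], errors="coerce")
```

Coercion is only one option. Whether "n/a" should become a missing value or trigger an error depends on the dataset, which is exactly the kind of judgment call the article argues still belongs to the analyst.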

Consequences for the sector

Automation of routine tasks: Tools like Gemini will allow analysts to focus on higher-value tasks such as interpreting results or designing experiments. A GitHub study of Copilot reported that developers completed a benchmark coding task about 55% faster with AI assistance, though speed alone says nothing about whether code quality improves.
Risk of uncritical dependence: If professionals blindly trust generated code, they may overlook errors or inefficiencies. In Gemini's case, the author detected that the solution used loops instead of vectorized operations, which on large datasets (millions of rows) could increase execution time from seconds to minutes.
Evolution of the data scientist profile: The ability to evaluate and refine AI-generated solutions will be valued more than writing code from scratch. Companies like Dataiku and Alteryx already integrate AI assistants, and by 2025, 60% of preprocessing tasks are expected to be AI-assisted (Gartner).
Impact on education: Data science programs will need to balance teaching fundamentals with critical use of AI tools. Universities like Stanford already include modules on interacting with generative models.
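The loop-versus-vectorization point above can be illustrated with a small comparison. This is a sketch under assumptions: the article does not reproduce Gemini's code, so the price and qty columns and both functions are hypothetical stand-ins for "a row-by-row loop" and "the equivalent vectorized expression".

```python
import time

import numpy as np
import pandas as pd

# Hypothetical dataset, kept small so the loop version finishes quickly.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(1, 100, 5_000),
    "qty": rng.integers(1, 10, 5_000),
})


def total_with_loop(frame: pd.DataFrame) -> pd.Series:
    """Row-by-row computation, the style the author found in the generated code."""
    totals = []
    for _, row in frame.iterrows():
        totals.append(row["price"] * row["qty"])
    return pd.Series(totals, index=frame.index)


def total_vectorized(frame: pd.DataFrame) -> pd.Series:
    """Same result computed on whole columns at once."""
    return frame["price"] * frame["qty"]


start = time.perf_counter()
loop_result = total_with_loop(df)
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
vec_result = total_vectorized(df)
vec_seconds = time.perf_counter() - start

print(f"loop: {loop_seconds:.4f}s, vectorized: {vec_seconds:.4f}s")
print("same values:", np.allclose(loop_result, vec_result))
```

Exact timings depend on the machine and the Pandas version, so none are quoted here. The gap generally widens as the row count grows, which is why the difference matters on datasets with millions of rows.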

What should readers know?

Gemini and other models like GPT-4 are powerful tools for accelerating preprocessing, but they do not replace expertise. The article's author highlights that although Gemini saved time, the solution they had developed themselves was more efficient. The key is to use AI as an assistant, not a substitute for technical judgment. Additionally, it is important to consider that generative models can have biases: for example, they tend to generate code that works in typical cases but fails on atypical or noisy data. In this case, Gemini did not correctly validate the presence of null values in certain columns, which could have led to incorrect results in subsequent analyses.
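A null-value check of the kind the author found missing might look like the sketch below. The function name check_required_not_null, the column names, and the choice to raise an exception rather than log a warning are assumptions made for illustration. The article does not say which columns were affected.

```python
import pandas as pd


def check_required_not_null(df: pd.DataFrame, required: list) -> pd.DataFrame:
    """Fail fast if any required column is absent or contains nulls.

    Returns the frame unchanged so the check can sit inside a pipeline.
    """
    missing_cols = [c for c in required if c not in df.columns]
    if missing_cols:
        raise KeyError(f"missing columns: {missing_cols}")

    null_counts = df[required].isna().sum()
    bad = {col: int(n) for col, n in null_counts.items() if n > 0}
    if bad:
        raise ValueError(f"null values found: {bad}")
    return df


# Hypothetical frames: one clean, one with a null in "age".
clean = pd.DataFrame({"id": [1, 2, 3], "age": [30, 25, 41]})
dirty = pd.DataFrame({"id": [1, 2, 3], "age": [30, None, 41]})

check_required_not_null(clean, ["id", "age"])  # passes silently

try:
    check_required_not_null(dirty, ["id", "age"])
except ValueError as err:
    print("validation failed:", err)
```

Running a check like this before any aggregation or model training turns a silent data problem into an explicit failure, which is easier to catch during review.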

Generative AI can write code, but it does not understand business context or domain subtleties. The data scientist remains indispensable to ensure analysis quality and relevance.

For professionals, the recommendation is to adopt a hybrid workflow: use AI to generate quick drafts, but always review, test, and optimize the code. Tools like Pandas Profiling (since renamed ydata-profiling) or D-Tale can support that review by surfacing data problems automatically. In the future, we will likely see more models specialized for data science, in the vein of CodeGemini or Codex, that incorporate preprocessing best practices. However, the ultimate responsibility for the analysis rests with the human.
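One way to put the "review and test" step into practice is to run the generated draft against a trusted hand-written version on a small sample and compare the outputs. The sketch below assumes two hypothetical functions, clean_reference and clean_generated, and an invented three-row sample. Neither function comes from the article.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def clean_reference(df: pd.DataFrame) -> pd.DataFrame:
    """Hand-written version, treated here as the trusted baseline."""
    out = df.dropna(subset=["id"]).copy()
    out["name"] = out["name"].str.strip().str.lower()
    return out.reset_index(drop=True)


def clean_generated(df: pd.DataFrame) -> pd.DataFrame:
    """Stand-in for an AI-generated draft under review."""
    out = df[df["id"].notna()].copy()
    out["name"] = out["name"].str.lower().str.strip()
    return out.reset_index(drop=True)


# Small sample with the edge cases that matter: a missing id, stray whitespace.
sample = pd.DataFrame({
    "id": [1, None, 3],
    "name": [" Ana", "Bo ", "CY "],
})

# Raises AssertionError with a readable diff if the two outputs differ.
assert_frame_equal(clean_generated(sample), clean_reference(sample))
print("generated draft matches the reference on the sample")
```

A comparison like this only covers the cases in the sample, so the sample should include the atypical or noisy rows where generated code is most likely to fail.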
