Adding Image-to-Text Capabilities with Gemini

Learn how to process images with Gemini in our Gradio chatbot.

Gemini is a popular multimodal chatbot built by Google. It accepts input in several modalities, such as text, images, charts, PDFs, videos, and audio. For our use case, we are particularly interested in Gemini’s image-processing capabilities. For example, it can generate HTML code from an image of a web page. This will greatly enhance our educational chatbot’s capabilities. Let’s begin!

Google AI Studio is a web-based tool for prototyping and experimenting with the Gemini models. It is a good place to get started with Gemini, and, most importantly, it lets us generate an API key that we can use to access Gemini from code.

Creating a Gemini API key

Let’s quickly walk through the API key creation process. Head over to the AI Studio and log in. Then, follow the slides below:

Slide 1 of 6: Choose “Get API key” on the welcome page
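
Once the key exists, a common way to make it available to code is an environment variable. The sketch below assumes the key has been exported as GEMINI_API_KEY; the helper names get_gemini_api_key and mask_key are hypothetical and only illustrate the idea of failing early with a readable message instead of a bare KeyError.

```python
import os


def get_gemini_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Return the API key stored in env_var, or raise a readable error."""
    # Assumption: the key was exported in the shell before the script started.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it in the shell before running the script."
        )
    return key


def mask_key(key: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the key can be logged safely."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```

The value returned by get_gemini_api_key() can be passed wherever the key is needed, and mask_key() helps avoid printing the full key in logs or notebooks.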

Now that the API key is created, we can start using Gemini. For Python, we also need to install the google-generativeai library. This can be done with the command below:

pip install google-generativeai

Once again, the library has already been set up for the widgets in this course, so no installation is needed here.
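
When working outside the course widgets, it can help to confirm that the package is present before running any Gemini code. The following is a minimal sketch that uses only the standard library; the function name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package: str = "google-generativeai"):
    """Return the installed version string of package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("google-generativeai is not installed; run: pip install google-generativeai")
else:
    print(f"google-generativeai {version} is installed")
```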

The AI Studio also provides a “Get code” button that can be used to get the Python code to send a request to the model. We have copied the code from the AI Studio into the widget below.

import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Create the model
generation_config = {
  "temperature": 1,
  "top_p": 0.95,
  "top_k": 64,
  "max_output_tokens": 8192,
  "response_mime_type": "text/plain",
}

model = genai.GenerativeModel(
  model_name="gemini-1.5-pro",
  generation_config=generation_config,
)

chat_session = model.start_chat(
  history=[
  ]
)

response = chat_session.send_message("Hello!")

print(response.text)
Accessing Gemini using Python
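
To build some intuition for the temperature, top_p, and top_k settings in generation_config, here is a toy illustration in plain Python. It is not Gemini’s implementation: the token scores are made up, the function names are hypothetical, and the order in which the filters are applied is an assumption of this sketch.

```python
import math
import random


def filter_candidates(logits, temperature=1.0, top_k=64, top_p=0.95):
    """Return {token: probability} for the tokens that survive top-k and top-p."""
    # Temperature rescales the scores (must be > 0 in this sketch):
    # values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = {tok: score / temperature for tok, score in logits.items()}

    # Softmax turns the scaled scores into probabilities.
    peak = max(scaled.values())
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


def sample_token(logits, rng=random, **settings):
    """Draw one token from the filtered distribution."""
    candidates = filter_candidates(logits, **settings)
    tokens = list(candidates)
    return rng.choices(tokens, weights=[candidates[t] for t in tokens], k=1)[0]


# Made-up scores for four possible next tokens.
toy_logits = {"<div>": 3.0, "<p>": 2.0, "<span>": 1.0, "<table>": -1.0}
print(filter_candidates(toy_logits, temperature=1.0, top_k=3, top_p=0.95))
print(sample_token(toy_logits, temperature=0.5, top_k=2, top_p=0.9))
```

Lowering temperature, top_k, or top_p shrinks the pool of likely candidates, so the output becomes more predictable; raising them has the opposite effect.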

Let’s review the code:

  • Lines 1–2: We import the os module to read environment variables and the google.generativeai library to interact with Google’s Generative AI API.

  • Line 4: We configure the generative AI client using an API key stored in the environment variable GEMINI_API_KEY. This grants access to the generative AI models.

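  • Lines 7–13: We define a dictionary named generation_config that specifies optional parameters for generating the response. These parameters control aspects like:

    • Temperature: Controls randomness. Lower values make the output more deterministic, and higher values make it more varied (1 is a balanced setting).

    • Top P: Sets the nucleus sampling threshold. The model samples only from the smallest set of most likely tokens whose combined probability reaches this value (0.95 keeps most of the distribution and drops only the unlikely tail).

    • Top K: Restricts sampling to the K most likely next tokens (64 leaves room for some diversity).

    • Max output tokens: Limits the length of the generated text (8192 sets a maximum of 8192 tokens, that is, words or sub-word pieces).

    • Response MIME type: Sets the output format (text/plain indicates plain text).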

  • Lines 15–18: We create a GenerativeModel object named model by specifying the model name gemini-1.5-pro and the generation configuration we defined earlier.

  • Lines 20–23: We initiate a chat session with the model using the ...