Ryan's Manuals

Databricks

~9m skim, 1,914 words, updated Jun 2, 2026


A modern data engineering platform.





Generative AI Tools in Databricks

Notes from the class “Generative AI Engineering with Databricks” were recorded on June 1st and 2nd 2026. The platform may have changed since these notes were taken.

Prompting, Context, RAG

Prompting Styles:

  1. Zero-shot: Just the instructions
  2. Few-shot: Provide a set of sample results
  3. Chain-of-Thought: Prompt the model to answer a problem as a series of steps; not required for thinking models, which do this automatically
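
The three styles differ only in what text is placed around the question. A minimal sketch, assuming plain string prompts; the helper names and the "Input:/Output:" layout are hypothetical and not part of any Databricks API:

```python
# Hypothetical helpers for illustration only.

def zero_shot(instruction: str, question: str) -> str:
    """Instructions only, no examples."""
    return f"{instruction}\n\nInput: {question}\nOutput:"


def few_shot(instruction: str, examples: list[tuple[str, str]], question: str) -> str:
    """Instructions plus a handful of sample input/output pairs."""
    shots = "\n\n".join(f"Input: {q}\nOutput: {a}" for q, a in examples)
    return f"{instruction}\n\n{shots}\n\nInput: {question}\nOutput:"


def chain_of_thought(instruction: str, question: str) -> str:
    """Ask the model to reason through steps before giving the answer."""
    return (
        f"{instruction}\n\nInput: {question}\n"
        "Work through the problem step by step, then give the final answer "
        "on a line starting with 'Answer:'."
    )


# Made-up sample data
examples = [
    ("The robot fell over twice.", "negative"),
    ("Setup took five minutes.", "positive"),
]
prompt = few_shot("Classify the sentiment.", examples, "The manual is clear.")
print(prompt)
```

The few-shot prompt ends with an open "Output:" so the model completes the pattern set by the samples.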

Retrieval Augmented Generation:

Limitations & Issues for Prompts:

  1. Knowledge cut-offs: facts that were not included in the training data must be provided as searchable context
  2. Hallucinations are plausible but fabricated data points that “prioritize plausibility over truth” [1]
  3. Ambiguity is caused by a lack of specificity or context in a prompt, leading to a generic interpretation of the question by the model
  4. Context poisoning is the inclusion of irrelevant data in the prompt which confuses the model and may cause it to use the wrong data to come up with an answer
  5. Lost in the middle: LLMs prioritize facts at the beginning and end of the context window and can ignore data in the middle

Managing the Context Window:

A growing context window degrades reasoning, increases latency, and eats memory.

  1. Summarization to keep only the key details
  2. Moving window to keep only the recently added context
  3. Selective persistence to choose particular details to retain

As a rule of thumb, 75 English words are currently equivalent to ~100 tokens. [2]
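
The moving window strategy and the token rule of thumb can be combined. A rough sketch, assuming the 75 words to 100 tokens ratio above stands in for a real tokenizer; the function names are hypothetical:

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: 75 words ~ 100 tokens, rounded up (not a real tokenizer)."""
    words = len(text.split())
    return (words * 100 + 74) // 75  # integer ceiling of words * 100 / 75


def moving_window(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose estimated tokens fit the budget."""
    kept: list[str] = []
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > budget:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))  # restore chronological order


history = [
    "one two three",
    "four five six seven eight nine",
    "ten eleven twelve",
]
print(estimate_tokens(history[0]))
print(moving_window(history, budget=12))
```

Summarization or selective persistence would replace the simple cut-off with a step that condenses or picks the older messages instead of dropping them.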

Context “Engineering”

The strategic design of the entire prompt provided to an LLM, including the system prompt, history, tool output, retrieved data, grounding, and user constraints.

Key Principles:

  1. Define the context environment
  2. Design the system prompts
  3. Ground the response using the most useful context chunks
  4. Reduce the input size by managing multi-turn state
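
These principles can be sketched as a single function that assembles the prompt. This is an illustration under assumptions: the section labels, the (score, text) chunk format, and the default limits are all hypothetical choices rather than anything prescribed by the class:

```python
def build_context(system_prompt, history, chunks, question, max_chunks=2, max_turns=2):
    """Assemble one prompt string from a system prompt, grounding, and recent history."""
    # Ground the response: keep only the highest-scoring (score, text) chunks
    top = sorted(chunks, key=lambda c: c[0], reverse=True)[:max_chunks]
    grounding = "\n".join(f"- {text}" for _, text in top)

    # Manage multi-turn state: keep only the most recent turns
    recent = "\n".join(history[-max_turns:]) if max_turns > 0 else ""

    return (
        f"{system_prompt}\n\n"
        f"Context:\n{grounding}\n\n"
        f"Conversation so far:\n{recent}\n\n"
        f"Question: {question}"
    )


# Made-up inputs for illustration
prompt = build_context(
    system_prompt="Answer using only the context provided.",
    history=["user: hi", "assistant: hello", "user: tell me about PID tuning"],
    chunks=[(0.2, "chunk-A"), (0.9, "chunk-B"), (0.5, "chunk-C")],
    question="What does the PID controller do?",
)
print(prompt)
```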

Document Parsing

A model's ability to retrieve information from a document is limited by the quality of the initial data extraction, first from the document into text and then from text into embeddings. An embedding is a numerical vector that represents meaning.

The general process to prepare a document for retrieval is:

  1. Read, OCR, and process the data into text
  2. Chunk the text into groups of tokens/context
  3. Create embeddings for each chunk

Chunking Strategies:

py
from pyspark.sql.functions import expr

# Read all files from the documents volume into a dataframe
# => Including "path" and "content" columns, where "content" holds the binary file data
docs_df = spark.read.format("binaryFile").load(user_docs_path)

# The pyspark.sql.functions.expr function evaluates a SQL expression string as a column
# => Parse each document using ai_parse_document (use expr for SQL function)
parsed_df = docs_df.withColumn("parsed_content",
                               expr(f"""ai_parse_document(content, map(
                                    "version", "2.0",
                                    "imageOutputPath", "{user_docs_path}/parsed_images/"
                                   ))""")
                              )

# Drop binary content column
parsed_df = parsed_df.drop("content")

# Display a sample of the parsed results
# Each row will have "path", "modificationTime", "length", "parsed_content"
display(parsed_df)

# Parsed images are stored at:
spark.sql(f"LIST '{user_docs_path}/parsed_images'").display()

This could also be run as a single SQL query:

sql
SELECT
  path,
  ai_parse_document(
    content,
    map(
      'version', '2.0'
    )
  ) as parsed_doc
FROM read_files('/docs-path', format => 'binaryFile');

The ai_parse_document Databricks function produces a JSON document like the one below. Note the type of each element (text/figure) and the coordinates showing its position in the document.

json
{"elements": [
  {
    "bbox": [{"coord": [181, 1044, 807, 1501], "page_id": 0}],
    "confidence": 0.9992,
    "content": "As illustrated on the right, the firmware follows a closed-
                loop flow: sensors feed a Kalman-based state estimator,
                which refines motion data before forwarding it to the PID
                controller. Commands are transmitted to motor drivers, while
                encoders provide feedback for precise error correction.
                This structure allows Orion A1 to walk fluidly, react
                to collisions, and recover from instability with minimal
                delay.",
    "description": null,
    "id": 4,
    "type": "text"  // <== Extracted from OCR
  },
  {
    "bbox": [{"coord": [301, 459, 1353, 1010], "page_id": 1}],
    "confidence": 0.9467,
    "content": "Response Curve: Tuned vs Untuned PID\nTuned PID
                \nUntuned PID\nOutput (%)\nTime (ms)",
    "description": "Two lines, one solid and one dashed, illustrate
                    response curves labeled \"Tuned PID\" and
                    \"Untuned PID\" against a time scale.",
    "id": 9,
    "type": "figure"  // <== An OCR'ed and visually interpreted image
  }
]}

The result with bounding boxes can be displayed with DocumentRenderer:

python
# Import the DocumentRenderer helper class
import sys, os
sys.path.append(os.path.abspath('..'))
from Includes.document_renderer import render_ai_parse_output, render_ai_parse_output_interactive

# Select a sample document and render its parsed content using render_ai_parse_output
sample = parsed_df.select("parsed_content").limit(1).collect()
doc = sample[0]["parsed_content"]

render_ai_parse_output(doc)

Document Chunking

First, pull the content into an additional column.

python
from pyspark.sql.functions import expr
from pyspark.sql import functions as F

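# Prepare the input for the content-extraction UDF
# (extract_contents_udf is not defined in these notes; assumed to come from the course helper code)
# => Convert VARIANT/struct/map to a JSON string first (avoids VariantVal issues)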
safe_json_col = F.coalesce(
    F.to_json(F.col("parsed_content")),
    F.col("parsed_content").cast("string")
)

# Apply the UDF
plain_text_df = parsed_df.withColumn(
    "plain_text",
    extract_contents_udf()(safe_json_col)
)

# AI-QUERY Function
# Alternatively, we can give the JSON to an LLM to summarize to markdown
ENDPOINT = "databricks-gpt-oss-20b"
prompt_prefix = ''' You are a helpful assistant. Given a JSON object
representing a parsed document (with pages, elements, and metadata),
convert the content into clean, readable markdown. Use "== page ==" to
separate each page. Preserve important structure such as headers,
tables, and captions. Do not include any JSON or code blocks in the
output—just the clean markdown text.

JSON:
'''

# Apply ai_query to batch process the parsed JSON text
transformed_df = (
    parsed_df.withColumn(
        "clean_markdown_text",
        expr(f"""
          ai_query(
            '{ENDPOINT}',
            CONCAT('{prompt_prefix}', CAST(parsed_content AS STRING)),
            responseFormat => '{{"type":"text"}}'
          )""")))
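
The extract_contents_udf helper called above is not defined in these notes. As a rough idea of what such a helper might do, here is a plain-Python sketch that joins the "content" field of every element in the parsed JSON. The function name, the fallback to a "document" key, and the separator are assumptions, not the course's implementation:

```python
import json


def extract_contents(parsed_json):
    """Join the "content" of every element in a parsed document (hypothetical helper)."""
    if not parsed_json:
        return ""
    try:
        doc = json.loads(parsed_json)
    except (TypeError, ValueError):
        return ""
    if not isinstance(doc, dict):
        return ""

    # Assumption: elements sit at the top level (as in the sample above)
    # or one level down under a "document" key.
    container = doc if "elements" in doc else doc.get("document", {})
    elements = container.get("elements", []) if isinstance(container, dict) else []

    parts = [e.get("content") for e in elements if isinstance(e, dict)]
    return "\n\n".join(p for p in parts if p)


# In a notebook this could be wrapped for Spark, for example:
#   from pyspark.sql.types import StringType
#   extract_udf = F.udf(extract_contents, StringType())

sample = json.dumps({"elements": [
    {"id": 4, "type": "text", "content": "Sensors feed a state estimator."},
    {"id": 9, "type": "figure", "content": "Response Curve"},
    {"id": 10, "type": "text", "content": None},
]})
print(extract_contents(sample))
```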

Use a LangChain object like RecursiveCharacterTextSplitter to split each block of text into chunks.

python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from pyspark.sql.types import StructType, StructField, StringType
import pandas as pd

# Build the text splitter with preferred separators
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,
    chunk_overlap=200,
    separators=["\n== page ==\n", "== page ==", "\n\n", "\n", " ", ""]
)

def split_rows(iterator):
    for pdf in iterator:
        out = []
        for _, row in pdf.iterrows():
            path = row["path"]  # column name must match the "path" column selected below
            text = row["plain_text"]
            if isinstance(text, str) and text.strip():
                for c in splitter.split_text(text):
                    if c and c.strip():
                        out.append((path, c))

        # Provide the full path to the document and the text chunk
        yield pd.DataFrame(out, columns=["path", "chunk"])

# Apply the splitter to the plain text DataFrame:
schema = StructType([StructField("path", StringType(), True), StructField("chunk", StringType(), True)])
df_chunks = (
    plain_text_df.select("path", "plain_text")
    .mapInPandas(split_rows, schema=schema)
)
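
To make chunk_size and chunk_overlap concrete, here is a minimal character-based sketch. It is a simplification: RecursiveCharacterTextSplitter also tries to break on the listed separators, which this hypothetical function does not do.

```python
def chunk_text(text: str, chunk_size: int = 2000, chunk_overlap: int = 200) -> list[str]:
    """Fixed-size character chunks; each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the end of the text is covered
        start += step
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.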

Document Embedding

Applying an embedding algorithm to chunks of text enables us to compare and find that text according to its semantic meaning - that is, the concepts and ideas instead of a 1:1 string comparison.

Comparing Embedding Vectors:
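
One common way to compare two embedding vectors is cosine similarity, which measures the angle between them and ignores their length. A minimal sketch in plain Python; the two-dimensional vectors are toy values, not real embeddings:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [0, 1]))   # perpendicular vectors
print(cosine_similarity([1, 0], [-1, 0]))  # opposite directions
```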

Vector Search Strategies:

  1. K-Nearest Neighbors (KNN) calculates the distance against every vector in the database, which is accurate but expensive
  2. Approximate Nearest Neighbors (ANN) uses indexing algorithms like HNSW to reduce the check to a group of close vectors
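
The cost of exact KNN is visible in a brute-force sketch: every stored vector is scored against the query. This example assumes Euclidean distance and a small in-memory dictionary of toy vectors; the function name is hypothetical:

```python
import heapq
import math


def knn(query, vectors, k=3):
    """Exact search: measure the distance from the query to every stored vector."""
    scored = ((math.dist(query, vec), key) for key, vec in vectors.items())
    return [key for _, key in heapq.nsmallest(k, scored)]


# Toy vectors keyed by chunk id
vectors = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [5.0, 5.0]}
print(knn([0.9, 0.0], vectors, k=2))
```

An ANN index such as HNSW avoids this full scan by only visiting a neighborhood of candidate vectors, trading a small amount of accuracy for speed.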

Reranking can occur after a vector search to rescore and reorder the most similar chunks and/or documents by their relevance to the query.

Mosaic AI Vector Search:

Databricks includes the Mosaic AI Vector Search database, which provides:

Change data feed must be enabled on the source table in Databricks so that the vector search index can sync changes incrementally instead of re-embedding unchanged rows.

After using the GUI to create a vector search index on your chunks table, we can create a client to perform vector searches.

python
from databricks.vector_search.client import VectorSearchClient
from databricks.vector_search.reranker import DatabricksReranker

# Initialize the Vector Search client for later use
vsc = VectorSearchClient(disable_notice=True)
index = vsc.get_index(index_name="catalog.yourschema.docs_chunked_lab_index")
print(index.describe())

query_text = "How does the motion controller maintain balance during rapid movement?"

# Perform similarity search:
reranked_results = index.similarity_search(
    query_text=query_text,
    columns=["path", "chunk"],
    num_results=3,

    # Optional: provide a re-ranker
    reranker=DatabricksReranker(columns_to_rerank=["chunk"]),

    # Optional: filter by particular document/path
    filters={"path LIKE": "05_Orion_Maintenance_and_Servicing_Guide_v3.pdf"},

    # Optional: choose a query type (pass at most one)
    # query_type="FULL_TEXT",  # just run full-text search
    query_type="HYBRID",       # hybrid search that uses keywords as well
)

# Print search results
display(reranked_results)

AI Agents with MLflow & LangChain

An AI agent is an intermediary between a human (or another LLM) and a data system, combining one or more LLMs, MCP servers, callable functions/tools, and connections to databases and file servers. Agents have the agency to make decisions and take actions. Frequently they also have memory systems to save conversation data and learned information.

Categorizations for AI Agents:

  1. Reflex: Make decisions based on present data
  2. Model-Based Reflex: Use statistical and mathematical models
  3. Goal-Based: Plan multi-tool strategies to meet goals
  4. Utility-Based: Use risk-reward models and optimization criteria combined with multiple execution options to meet goals
  5. Learning: Self-improves and adapts

MLflow [3] and LangChain [4] can be used on Databricks to build auditable AI agents.

“MLflow is the largest open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.” [3]

  1. MLflow Tracking for checking code versions and high level metrics
  2. MLflow Tracing for capturing execution flows and tool calls
  3. MLflow Models for packaging ‘models’ to serve
  4. MLflow Model Registry for storing and version control of ‘models’

Anything from LLM inference to ML models and plain Python code can be stored and used as a model in Unity Catalog.

Deploying Models & Agents

MLflow Models are folders that contain key files and an MLmodel file, which lists flavors: interfaces that define ways to use the model.

Deployment Types balance throughput and latency:

  1. Batch deployments have the highest throughput - overnight reports
  2. Stream deployments - personalized marketing messages
  3. Real-time deployments - chatbots, image generators
  4. Embedded/Edge deployments have the lowest latency - car environmental controls

Lifecycle Management:

Batch deployments are ideal for cases when the volume of new records to process is very large, immediate replies are not necessary, and a high latency is OK. These can be implemented easily and run when compute is cheapest - though the data may be stale.

Batch Deployment Inference Options:

  1. Python functions (pyfunc.predict)
  2. Pandas or Spark UDF (from the model registry)
  3. The ai_query() function with a prompt and incoming text data

Evaluating Agents

Monitoring Agents


  1. Databricks Training Material: Lecture 1.1

  2. Anecdotal; check with tiktokenizer.vercel.app or another tool

  3. MLflow Documentation: mlflow.org/docs/latest/ml

  4. LangChain Documentation: docs.langchain.com






Page Information

Title: Databricks
Word Count: 1914 words
Reading Time: 9 minutes
Permalink:
https://manuals.ryanfleck.ca/databricks/