If you prefer to handle your filtering locally, you can also download all spans as a dataframe using the get_spans_dataframe() method:
```python
from phoenix.client import Client

client = Client()

# Download all spans from your default project
client.spans.get_spans_dataframe()

# Download all spans from a specific project
client.spans.get_spans_dataframe(project_name='your project name')
```
You can query for data using our query DSL (domain specific language).
This Query DSL is the same as what is used by the filter bar in the dashboard. It can be helpful to form your query string in the Phoenix dashboard for more immediate feedback, before moving it to code.
Below is an example of how to pull all retriever spans and select the input value. The output of this query is a DataFrame that contains the input values for all retriever spans.
```python
from phoenix.client import Client
from phoenix.trace.dsl import SpanQuery

client = Client()

query = SpanQuery().where(
    # Filter for the `RETRIEVER` span kind.
    # The filter condition is a valid Python boolean expression.
    "span_kind == 'RETRIEVER'",
).select(
    # Extract the span attribute `input.value`, which contains the query for the
    # retriever. Rename it as the `input` column in the output dataframe.
    input="input.value",
)

# The Phoenix Client can take this query and return the dataframe.
client.spans.get_spans_dataframe(query=query)
```
**DataFrame Index.** By default, the result DataFrame is indexed by `span_id`, and if `.explode()` is used, the index from the exploded list is added to create a multi-index on the result DataFrame. For the special `retrieval.documents` span attribute, the added index is renamed as `document_position`.
By default, all queries collect every span in your Phoenix instance. If you'd like to focus on the most recent spans, you can pull spans within a time window using the `start_time` and `end_time` parameters.
```python
from datetime import datetime, timedelta

from phoenix.client import Client

client = Client()

# Get spans from the last 7 days only
start = datetime.now() - timedelta(days=7)
# Get spans up to (but excluding) the last 24 hours
end = datetime.now() - timedelta(days=1)

phoenix_df = client.spans.get_spans_dataframe(start_time=start, end_time=end)
```
By default, all queries are executed against the default project or the project set via the `PHOENIX_PROJECT_NAME` environment variable. If you want to pull from a different project, every method on the Client accepts an optional `project_name` parameter.
```python
from phoenix.client import Client
from phoenix.trace.dsl import SpanQuery

client = Client()

# Get spans from a specific project
client.spans.get_spans_dataframe(project_name="<my-project>")

# Using the query DSL
query = SpanQuery().where("span_kind == 'CHAIN'").select(input="input.value")
client.spans.get_spans_dataframe(query=query, project_name="<my-project>")
```
Let’s say we want to extract the retrieved documents into a DataFrame that looks something like the table below, where `input` denotes the query for the retriever, `reference` denotes the content of each document, and `document_position` denotes the (zero-based) index of each document in the span’s list of retrieved documents.

Note that this DataFrame can be used directly as input for Retrieval (RAG) Relevance evaluations.
| context.span_id | document_position | input | reference |
| --- | --- | --- | --- |
| 5B8EF798A381 | 0 | What was the author’s motivation for writing … | In fact, I decided to write a book about … |
| 5B8EF798A381 | 1 | What was the author’s motivation for writing … | I started writing essays again, and wrote a bunch of … |
| … | … | … | … |
| E19B7EC3GG02 | 0 | What did the author learn about … | The good part was that I got paid huge amounts of … |
We can accomplish this with a simple query as follows:
```python
from phoenix.client import Client
from phoenix.trace.dsl import SpanQuery

client = Client()

query = SpanQuery().where(
    # Filter for the `RETRIEVER` span kind.
    # The filter condition is a valid Python boolean expression.
    "span_kind == 'RETRIEVER'",
).select(
    # Extract the span attribute `input.value`, which contains the query for the
    # retriever. Rename it as the `input` column in the output dataframe.
    input="input.value",
).explode(
    # Specify the span attribute `retrieval.documents`, which contains a list of
    # objects, and explode the list. Extract the `document.content` attribute from
    # each object and rename it as the `reference` column in the output dataframe.
    "retrieval.documents",
    reference="document.content",
)

# The Phoenix Client can take this query and return the dataframe.
client.spans.get_spans_dataframe(query=query)
```
In addition to the document content, if we also want to explode the document score, we can simply add the document.score attribute to the .explode() method alongside document.content as follows. Keyword arguments are necessary to name the output columns, and in this example we name the output columns as reference and score. (Python’s double-asterisk unpacking idiom can be used to specify arbitrary output names containing spaces or symbols. See here for an example.)
The `.where()` method accepts a valid Python boolean expression as a string. The expression can be arbitrarily complex, but restrictions apply; for example, function calls are generally disallowed. Below is a conjunction that also filters on whether the input value contains the string `'programming'`.
```python
query = SpanQuery().where(
    "span_kind == 'RETRIEVER' and 'programming' in input.value"
)
```
Filtering spans by evaluation results, e.g. score or label, can be done via a special syntax. The name of the evaluation is specified as an indexer on the special keyword evals. The example below filters for spans with the incorrect label on their correctness evaluations. (See here for how to compute evaluations for traces, and here for how to ingest those results back to Phoenix.)
You can also use Python boolean expressions to filter spans in the Phoenix UI. These expressions can be entered directly into the search bar above your experiment runs, allowing you to apply complex conditions involving span attributes. Any expressions that work with the .where() method above can also be used in the UI.
Keyword-argument style can be used to rename the columns in the dataframe. The example below returns two columns named input and output instead of the original names of the attributes.
If arbitrary output names are desired, e.g. names with spaces and symbols, we can leverage Python’s double-asterisk idiom for unpacking a dictionary, as shown below.
The document contents can also be concatenated together. By default, the list of `document.content` values is joined with `\n\n` (double newlines) as the separator. Keyword arguments are necessary to name the output columns, and in this example we name the output column `reference`. (Python's double-asterisk unpacking idiom can be used to specify arbitrary output names containing spaces or symbols. See here for an example.)
This is useful for joining a span to its parent span. To do that we would first index the child span by selecting its parent ID and renaming it as span_id. This works because span_id is a special column name: whichever column having that name will become the index of the output DataFrame.
To do this, provide two queries to Phoenix and join the resulting dataframes with pandas.
```python
import pandas as pd

from phoenix.client import Client
from phoenix.trace.dsl import SpanQuery

client = Client()

parent_df = client.spans.get_spans_dataframe(query=query_for_parent_spans)
child_df = client.spans.get_spans_dataframe(query=query_for_child_spans)

pd.concat(
    [parent_df, child_df],
    axis=1,        # join column-wise
    join="inner",  # inner-join on the row indices of the dataframes
)
```
```python
from phoenix.client import Client
from phoenix.trace.dsl import SpanQuery

client = Client()

query = SpanQuery().where(
    "span_kind == 'LLM'",
).select(
    input="input.value",
    output="output.value",
)

# The Phoenix Client can take this query and return a dataframe.
client.spans.get_spans_dataframe(query=query)
```
```python
import pandas as pd

from phoenix.client import Client
from phoenix.trace.dsl import SpanQuery

client = Client()

query_for_root_span = SpanQuery().where(
    "parent_id is None",  # Filter for root spans
).select(
    input="input.value",    # Input contains the user's question
    output="output.value",  # Output contains the LLM's answer
)

query_for_retrieved_documents = SpanQuery().where(
    "span_kind == 'RETRIEVER'",  # Filter for RETRIEVER spans
).select(
    # Rename parent_id as span_id. This turns the parent_id
    # values into the index of the output dataframe.
    span_id="parent_id",
).concat(
    "retrieval.documents",
    reference="document.content",
)

root_df = client.spans.get_spans_dataframe(query=query_for_root_span)
docs_df = client.spans.get_spans_dataframe(query=query_for_retrieved_documents)

# Perform an inner join on the two sets of spans.
pd.concat(
    [root_df, docs_df],
    axis=1,
    join="inner",
)
```
The output DataFrame would look something like the one below. The `input` column contains the question, the `output` column contains the answer, and the `reference` column contains a concatenation of all the retrieved documents.
| context.span_id | input | output | reference |
| --- | --- | --- | --- |
| CDBC4CE34 | What was the author’s trick for … | The author’s trick for … | Even then it took me several years to understand … |